DBMS can help us manage data conveniently and significantly improve the efficiency of data retrieval.
PostgreSQL is a popular open-source RDBMS known for its robustness, advanced features, and strong compliance with SQL standards. openGauss is an enterprise-grade open-source RDBMS developed by Huawei, designed for high performance, security, and scalability in demanding business environments.
System: Alibaba Cloud Linux 3.2104 LTS 64位
IP: 47.115.128.238
Version: openGauss-Docker-6.0.2-x86_64
Port: 15432:5432
User: omm (Operation & Maintenance Manager)
Password: opengauss
Version: 17.6
Port: 5432
User: postgres
Password: postgres
The Clickstream dataset aggregates counts of
(referrer, resource)pairs from Wikipedia request logs, showing how readers arrive at an article and what they click next.
Run DownloadDatasets.sh to download all the datasets.
To import datasets to PostgreSQL, run:
xxxxxxxxxx11psql "postgresql://postgres:postgres@127.0.0.1:5432/project1" -f './ImportDatasets.sql'Similar when importing to openGauss.
Results:
xxxxxxxxxx31 rows_loaded 2-------------3 35512282xxxxxxxxxx21EXPLAIN (ANALYZE)2SELECT * FROM clickstream.events WHERE curr ILIKE '%main%';xxxxxxxxxx111QUERY PLAN2---------------------------------------------------------------------------------------------------------------------------------3Gather (cost=1000.00..547907.69 rows=124279 width=47) (actual time=0.437..24740.854 rows=162828 loops=1)4Workers Planned: 25Workers Launched: 26-> Parallel Seq Scan on events (cost=0.00..534479.79 rows=51783 width=47) (actual time=2.401..24640.354 rows=54276 loops=3)7Filter: (curr ~~* '%main%'::text)8Rows Removed by Filter: 117831519Planning Time: 1.428 ms10Execution Time: 24750.391 ms11(8 rows)
During 5 tests, the average result is 24662.1ms.
xxxxxxxxxx11g++ ./SelectAll.cpp -o SelectAll -std=c++17 -O2 && ./SelectAllxxxxxxxxxx21Find 162828 results in total221234.4 msDuring 5 tests, the average result is 20116.2ms.
x
1EXPLAIN (ANALYZE)2UPDATE clickstream.events SET curr = REPLACE(curr, '_', '^') WHERE curr LIKE '%_%';xxxxxxxxxx81QUERY PLAN2------------------------------------------------------------------------------------------------------------------------------3Update on events (cost=0.00..882401.93 rows=0 width=0) (actual time=225011.949..225014.086 rows=0 loops=1)4-> Seq Scan on events (cost=0.00..882401.93 rows=35522510 width=38) (actual time=0.024..41868.876 rows=35512282 loops=1)5Filter: (curr ~~ '%_%'::text)6Planning Time: 10.490 ms7Execution Time: 225019.108 ms8(5 rows)
The result is 225019.1ms.
xxxxxxxxxx11g++ ./UpdateAll.cpp -o UpdateAll -std=c++17 -O2 && ./UpdateAllxxxxxxxxxx21Update 30866250 results in total261515.7 msThe result is 61515.7ms.
xxxxxxxxxx71SET search_path = clickstream, public;2EXPLAIN (ANALYZE)3SELECT curr, SUM(n) AS clicks4 FROM events5 GROUP BY curr6 ORDER BY clicks DESC7 LIMIT 20;xxxxxxxxxx31...2 Planning Time: 9.009 ms3 Execution Time: 101260.848 msThe result is 101260.8ms.
xxxxxxxxxx11g++ ./SelectTopK.cpp -o SelectTopK -std=c++17 -O2 && ./SelectTopKxxxxxxxxxx1169683.3 msThe result is 69683.3ms.
Overall, C++ streaming program beat the DBMS(PostgreSQL) on all tests.
And there're multiple reasons why DBMS seems to be slower than C++:
All tasks are one-shot full-scan or rewriting, thus C++ can lightly and easily streams files, avoiding DBMS overheads.
We are using basic DBMS without optimization like pg_trgm, B-Tree. (The efficiency won't increase too much even with pg_trgm.)
Thus, the results doesn't negate the DBMS strengths.
At some circumstances include reusable queries, concurrency, strong consistency, complex joins/transactions, e.t.c., DBMS will perform significantly better than data operations in files.
xxxxxxxxxx21SET max_parallel_workers_per_gather = 4;2SET work_mem = '256MB';xxxxxxxxxx21EXPLAIN (ANALYZE)2SELECT * FROM clickstream.events WHERE curr ILIKE '%main%';Results:
PostgreSQL: 24662.1ms.
openGauss: 27012.6ms.
x
1EXPLAIN (ANALYZE)2UPDATE clickstream.events SET curr = REPLACE(curr, '_', '^') WHERE curr LIKE '%_%';Results:
PostgreSQL: 225019.1ms.
openGauss: 244217.9ms.
xxxxxxxxxx71EXPLAIN (ANALYZE)2SELECT a.prev AS src, b.curr AS dst, COUNT(*) AS paths3 FROM clickstream.events a4 JOIN clickstream.events b ON a.curr = b.prev5 GROUP BY a.prev, b.curr6 ORDER BY paths DESC7 LIMIT 20;Results:
PostgreSQL: 180567.2ms.
openGauss: 191124.3ms.
xxxxxxxxxx71SET search_path = clickstream, public;2EXPLAIN (ANALYZE)3SELECT curr, SUM(n) AS clicks4 FROM events5 GROUP BY curr6 ORDER BY clicks DESC7 LIMIT 20;Results:
PostgreSQL: 97628.7ms.
openGauss: 115431.5ms.
Overall, we can conclude that PostgreSQL and openGauss have almost the same efficiency. They are very similar and openGauss is usually a little bit slower than PostgreSQL. Therefore, at some specific circumstances, like looser settings, openGauss will perhaps performs better, but still a little.
As results showcased, DBMS performs slower than C++ in select operation, but if we use prefix search (ILIKE 'main%'), DBMS will perform much better.
DBMS will perform better when joining large tables, no matter PostgreSQL, openGauss or other DBMS.
With pg_trgm, or some other extensions, we can obtain better operation efficiency, even still slower than C++ at some circumstances.
Overall, C++ streaming performs better than DBMS baselines for it's more basic and trivial. Nevertheless, in reusable, concurrent or consistency-critical circumstances with proper design, DBMS can significantly outperform than C++ streaming (Not mentioned too much in this report due to the requirements). And PostgreSQL and openGauss performed almost the same and PostgreSQL is a little bit faster in most circumstances. While openGauss still has its own advantages.
xxxxxxxxxx11sudo -u postgres psql -p 5432 -d project1xxxxxxxxxx11psql -h 127.0.0.1 -p 5432 -U postgres -d project1xxxxxxxxxx11psql "postgresql://postgres:postgres@127.0.0.1:5432/project1"xxxxxxxxxx1docker exec -e PGPASSWORD='opengauss' -u omm opengauss15432 \2 bash -lc "gsql -h 127.0.0.1 -p 5432 -U omm -d project1"xxxxxxxxxx1gsql -h 127.0.0.1 -p 15432 -d project1 -U omm