DuckDB as an alternative to Postgres for analytics?

Has any developer looked at DuckDB and Parquet files as an alternative to the current Postgres implementation?

Example:

I’ve extracted the analytics tables as-is to Parquet:

-- Load DuckDB's postgres extension and attach the source DHIS2 database
INSTALL postgres;
LOAD postgres;

ATTACH 'postgresql://user:password@dbhost:5432/dhis' AS dhis_source (TYPE postgres);


COPY (SELECT * FROM dhis_source.analytics_2016) TO 'analytics_2016.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2017) TO 'analytics_2017.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2018) TO 'analytics_2018.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2019) TO 'analytics_2019.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2020) TO 'analytics_2020.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2021) TO 'analytics_2021.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2022) TO 'analytics_2022.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2023) TO 'analytics_2023.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2024) TO 'analytics_2024.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2025) TO 'analytics_2025.parquet' (FORMAT PARQUET);


COPY (SELECT * FROM dhis_source._periodstructure) TO 'periodstructure.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source._orgunitstructure) TO 'orgunitstructure.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source._dataelementstructure) TO 'dataelementstructure.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source._categorystructure) TO 'categorystructure.parquet' (FORMAT PARQUET);

Then create a view over the yearly Parquet files:

CREATE VIEW analytics AS SELECT * FROM 'analytics_*.parquet';
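For reference, DuckDB expands the 'analytics_*.parquet' glob at query time, so a newly exported year file is picked up automatically. The same view can also be written explicitly with read_parquet; the filename option is standard DuckDB and adds a column recording which yearly file each row came from:

```sql
-- Equivalent view via read_parquet; filename = true adds a column
-- with the source Parquet file of each row (handy for per-year checks).
CREATE OR REPLACE VIEW analytics AS
SELECT * FROM read_parquet('analytics_*.parquet', filename = true);
```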

Then run count/sum/… queries on it.
The performance is really impressive, and much more consistent than Postgres (you don’t pay the price of transaction handling, and only what is needed gets fetched).
There’s also a JDBC driver for it.
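To make the comparison concrete, here is the kind of aggregate query meant above. The column names (monthly, uidlevel2, value) are assumptions based on the usual DHIS2 analytics table layout, not confirmed from this thread:

```sql
-- Hypothetical aggregate over the Parquet-backed view:
-- row count and summed values per month and level-2 org unit.
SELECT monthly, uidlevel2, count(*) AS n, sum(value) AS total
FROM analytics
GROUP BY monthly, uidlevel2
ORDER BY monthly, uidlevel2;
```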


Hi!

Thank you for testing and sharing with the community. I feel like this could be a conversation in the weekly server-admin group calls with @bobj, @tkipkurgat, and the other server admins? :slight_smile:

Thanks!

Well, it’s more about the core analytics, and so a question for the development team.

Apparently there is some development around ClickHouse or Doris.
Is this an upcoming replacement, or just for the built-in analytics usage?


Hi @Stephan_Mestach

So is ClickHouse the same suggestion, or is DuckDB better? I see that the feature was created in November 2024: Jira

In the related ticket you can read, “To allow for other database platforms than PostgreSQL, a new abstraction layer for managing tables, columns and indexes should be introduced in the analytics backend. This abstraction layer should include a data model for database entities like tables, columns and indexes, and a SQL builder for creating compatible SQL statements for each supported database platform.” by @Lars . :slight_smile:

ClickHouse and DuckDB are fairly similar: both make queries like “count/sum/avg/group by/…” much faster.

So from the Jira ticket, it’s really a new “implementation” for the analytics generation/queries.

Do you know if it’s already partially released or used in production?
My guess is not yet, since support for events/trackers isn’t complete.

For version 42, support includes data elements, aggregate indicators and reporting rates.
Support for event data elements and program indicators is targeted for version 43.
Jira

I’m trying to get ClickHouse working, but the documentation is a bit sparse for DHIS2 2.42.4.
I found a Doris docker-compose, but no ClickHouse equivalent in the dhis2 repository.

I’m struggling with the initial setup:

  • database creation
  • access rights

so I guess some initialization should happen before DHIS2 connects to it, for org.hisp.dhis.db.init.AnalyticsDatabaseInit to work:

[drop named collection if exists "pg_dhis";]; Code: 497. DB::Exception: dhis: Not enough privileges. To execute this query, it's necessary to have the grant DROP NAMED COLLECTION ON pg_dhis. (ACCESS_DENIED) (version 25.12.1.649 (official build))
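For anyone hitting the same error: the missing grant named in the message can be given up front by a ClickHouse admin before DHIS2 starts. A minimal sketch (the user, database, and named-collection names are placeholders matching the error above, and the broad GRANT ALL is just to get unblocked, not a hardening recommendation):

```sql
-- Run as the ClickHouse admin/default user before starting DHIS2.
CREATE DATABASE IF NOT EXISTS dhis;
CREATE USER IF NOT EXISTS dhis IDENTIFIED BY 'change-me';
-- DHIS2 creates and drops tables and named collections, so grant broadly:
GRANT ALL ON dhis.* TO dhis;
GRANT NAMED COLLECTION ADMIN ON pg_dhis TO dhis;
```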

So I managed to make it work.
For those interested, a docker-compose is here: Test clickhouse by mestachs · Pull Request #10 · BLSQ/ops-dhis2-images · GitHub
(it doesn’t use the standard DHIS2 images, but a home-made one that is “more environment-variable driven”).
I’m not sure it follows best practices for ClickHouse deployment.

It’s a bit sad that DHIS2 requires ClickHouse right at startup:
the availability of DHIS2 now depends on the two databases,
while I would be happy for DHIS2 to work in a degraded mode where people can still enter data but not see dashboards/analytics, especially since ClickHouse is a new beast to learn how to host.

Is there a recommended version of ClickHouse to use?

I’m now trying this on a much bigger DHIS2 instance/dataset.

The good news is that I saw a few things working,
but in the end the analytics generation didn’t complete.

The tables got populated (per type), and the analytics_temp table reached 143,242,618 rows,
but when DHIS2 tried to issue the query

update analytics_temp set "uidlevel1" = null where oulevel > 1 and dx in ( 'jyTUvieOQyr' ) 

ClickHouse returned this error:

Caused by: com.clickhouse.client.api.ServerException: Code: 48. DB::Exception: Lightweight updates are not supported. 
Lightweight updates are supported only for tables with materialized _block_number column. 
Run 'MODIFY SETTING enable_block_number_column = 1' command to enable it. 
(NOT_IMPLEMENTED) (version 26.3.9.8 (official build))
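In case it helps others: the error message itself points at a workaround, though whether DHIS2 should instead avoid lightweight updates here is an open question. Applying the setting manually would look like this (table name taken from the failing query above):

```sql
-- Materialize the _block_number column so lightweight UPDATEs
-- are accepted on this table (per the ClickHouse error message).
ALTER TABLE analytics_temp MODIFY SETTING enable_block_number_column = 1;
```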