Duckdb as an alternative to postgres for analytics?

Does some developer looked at duckdb and parquet files as an alternative to the current postgres implementation ?

exemple

I’ve extracted as is the analytics tables to parquet

INSTALL postgres;
LOAD postgres;

ATTACH 'postgresql://user:password@dbhost:5432/dhis' AS dhis_source (TYPE postgres);


COPY (SELECT * FROM dhis_source.analytics_2016) TO 'analytics_2016.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2017) TO 'analytics_2017.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2018) TO 'analytics_2018.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2019) TO 'analytics_2019.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2020) TO 'analytics_2020.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2021) TO 'analytics_2021.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2022) TO 'analytics_2022.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2023) TO 'analytics_2023.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2024) TO 'analytics_2024.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source.analytics_2025) TO 'analytics_2025.parquet' (FORMAT PARQUET);


COPY (SELECT * FROM dhis_source._periodstructure) TO 'periodstructure.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source._orgunitstructure) TO 'orgunitstructure.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source._dataelementstructure) TO 'dataelementstructure.parquet' (FORMAT PARQUET);
COPY (SELECT * FROM dhis_source._categorystructure) TO 'categorystructure.parquet' (FORMAT PARQUET);

then create

CREATE VIEW analytics AS SELECT * FROM 'analytics_*.parquet';

then run count/sum/… queries on it
the performance are really impressive and much more constant then postgres (not paying price of transaction handling, fetching only what is needed)
there’s a jdbc driver for it.

Hi!

Thank you for testing and sharing with the community. I feel like this could be a conversation in the weekly server-admin group calls with @bobj, @tkipkurgat, and the other server admins? :slight_smile:

Thanks!

well it’s more about the core analytics and so development team.

Apparently there are some development around clickhouse or doris.
Is this an upcoming replacement or just for the built in analytics usage ?

Hi @Stephan_Mestach

So is ClickHouse the same suggestion or is Duckdb better? I see that the feature was created in November 2024: Jira

In the related ticket you can read, “To allow for other database platforms than PostgreSQL, a new abstraction layer for managing tables, columns and indexes should be introduced in the analytics backend. This abstraction layer should include a data model for database entities like tables, columns and indexes, and a SQL builder for creating compatible SQL statements for each supported database platform.” by @Lars . :slight_smile:

Clickhouse and duckdb are a bit similar to make queries “count/sum/avg/group by/…” much faster.

So from the jira it’s really a new “implementation” for the analytics generation/queries.

Do you know if it’s already partially released/used in production ?
My guess it’s not yet since support is not full for events/trackers.

For version 42, support includes data elements, aggregate indicators and reporting rates.
Support for event data elements and program indicators is targeted for version 43.
Jira

I’m trying to get clickhouse working but the documentation is a bit sparse with dhis 2.42.4
I found doris docker compose but no clickhouse equivalent in the dhis2 repository.

I’m struggling with the initial setup

  • database creation
  • access rights

so I guess some initialization should happen before dhis2 connects to it for org.hisp.dhis.db.init.AnalyticsDatabaseInit to work

[drop named collection if exists "pg_dhis";]; Code: 497. DB::Exception: dhis: Not enough privileges. To execute this query, it's necessary to have the grant DROP NAMED COLLECTION ON pg_dhis. (ACCESS_DENIED) (version 25.12.1.649 (official build))

so I managed to make it work
for the one interested for a docker compose Test clickhouse by mestachs · Pull Request #10 · BLSQ/ops-dhis2-images · GitHub
(it doesn’t use standard dhis2 images, home made one “more env variable driven”)
not sure it follows best practices according to clickhouse deployment

it’s a bit sad that dhis2 requires clickhouse really at startup
the availability of dhis2 now depends on the 2 db…
while I would be happy that dhis2 to work in a degraded mode where people can still encode but not see dashboard/analytics. Especially when clickhouse is a new beast to learn how to host

Is there a recommended version of clickhouse to use ?

I’m now trying this on a much bigger dhis2/dataset

the good news I saw a few things working
but finally the analytics generation didn’t completed.

the tables managed to get populated (per types)
the analytics_temp table 143.242.618 rows
but when dhis2 tried to issue the query

update analytics_temp set "uidlevel1" = null where oulevel > 1 and dx in ( 'jyTUvieOQyr' ) 

this error was returned by clickhouse

Caused by: com.clickhouse.client.api.ServerException: Code: 48. DB::Exception: Lightweight updates are not supported. 
Lightweight updates are supported only for tables with materialized _block_number column. 
Run 'MODIFY SETTING enable_block_number_column = 1' command to enable it. 
(NOT_IMPLEMENTED) (version 26.3.9.8 (official build))

Managed to get something working. Apparently when using docker and xml for clickhouse it’s not super deterministic.

The good news it’s much faster in dashboards.

The bad news the /apps/data-administration#/analytics seem to disconnect at some point.
nothing appears in the progress only “…” or nothing at all.

note the stdout seem to suggest everything is still running

2026-04-20 05:51:35 INFO  DefaultAnalyticsTableService:102 - Performed pre-create table work: 'DATA_VALUE': 00:03:14.369
2026-04-20 05:51:35 INFO  DefaultAnalyticsTableService:102 - Dropped staging tables: 00:03:14.374
2026-04-20 05:51:35 INFO  AbstractJdbcTableManager:199 - Creating table: 'analytics_temp', columns: '61'
2026-04-20 05:51:35 INFO  DefaultAnalyticsTableService:102 - Created analytics tables: 00:03:14.394
2026-04-20 05:51:35 INFO  DefaultAnalyticsTableService:239 - Populate table task number: 1
2026-04-20 06:09:34 INFO  AbstractJdbcTableManager:459 - Populating table: 'analytics_temp' [PERCENTAGE, NUMBER, INTEGER_ZERO_OR_POSITIVE, INTEGER, INTEGER_POSITIVE, INTEGER_NEGATIVE, UNIT_INTERVAL] in: 1079.340821 sec.
2026-04-20 06:12:48 INFO  AbstractJdbcTableManager:459 - Populating table: 'analytics_temp' [TRUE_ONLY, BOOLEAN] in: 194.152000 sec.

For the one interested to give clickhouse a try some queries to run to see the progress of analytics generation

queries running against postgres (viewed from postgres point of view)

SELECT *
FROM postgresql(pg_dhis, schema='pg_catalog', table='pg_stat_activity');

queries running against clickhouse

SELECT 
    query_id, 
    user, 
    elapsed, 
    read_rows, 
    formatReadableSize(memory_usage) AS memory, 
    query 
FROM system.processes 
ORDER BY elapsed DESC;

Hi @Stephan_Mestach

Thank you for sharing all these here in the CoP! Hopefully they’re going to be of good user by the community. Yes, sometimes the docs is very sparse like you mentioned and this is why it’s greatly appreciate when knowledge is shared with the community.

@jason shared this link with me, ClickHouse - Analytics Platform does it help?

Thanks!

Frankly for large instance this is a game changer, the instances were often slow, unstable, nearly unusable. In my test setup I reduced the hosting prices by 2 for much better performance and stability (even with the cache disabled).

I understand that it’s hard to promote something that is probably too fresh, just out of the oven. Especially when it’s not implemented yet for tracker related data. But I think getting feedback sooner would be more valuable for you.

Note I went for clickhouse because it’s much simpler to host then doris. There’s a risk that the db won’t remain opensource, but we will see.

My fear is that not yet fully mainstream in your developer community. And I don’t really want to get in a situation I’m the only one having issue because I jumped on the boat too soon.

Can you tell me what is the level of confidence for this feature to used in production ?
Is this already rolled out in your infra or other hosting companies ?

Dear @Stephan_Mestach ,

The DHIS2 team also consider these technologies to be almost a necessity for large instances, due to the performance advantages you outline. Even for smaller instances, separating the analytics from the “reporting” database has advantages for stability and availability.

In DHIS2 v42 the features were, to some extent, experimental and limited; but they have allowed a few implementations to test out the potential. With v43, to be released in early May, we will bring support for Tracker related data too; making this a much more complete solution.

In terms of our infrastructure, we have been including Apache Doris environments systematically in our core test infrastructure for the last six months.

In terms of production use, I think it’s probably too early to say, and you are right to be a bit cautious at the moment. However, another advantage of this being purely for analytics is that the data in the “Analytics database” (Apache Doris or Clickhouse) is ephemeral; that is to say it is not the important raw data. So, if there are issues with the analytics database, it should be simple to reconfigure to remove that connection and regenerate analytics on the base PostgrSQL DB (you may need to take that into account in the provisioning of the infrastructure, of course).

@Lars has been championing these developments and may be able to give more of a detailed outlook.

Kind regards,
Phil