Questions about the Potential Duplicates feature

chase.freeman · 30 July 2021 13:50

I have a few questions about the “Potential Duplicates” feature that are not answered in the documentation.

From the docs:
The payload of a potential duplicate looks like this:

{
  "teiA": "<id>",
  "teiB": "<id>",
  "status": "OPEN|INVALID|MERGED"
}

My first question concerns the following from the Developer Docs:

The payload you provide needs at least teiA to be a valid tracked entity instance; teiB is optional. If teiB is set, it also needs to point to an existing tracked entity instance.

Given that the payload of a potential duplicate at the /api/potentialDuplicate endpoint can return both a teiA and a teiB, what is the meaning and purpose of a potential duplicate that only has a teiA listed? Is that a way of saying “maybe this is a duplicate, but I don’t know of what”?

My second question is about the meaning of the statuses - only “invalid” is explicitly defined:

Open - seems to mean that the potential duplicate is currently flagged as such and no action has been taken on it.

Invalid - From Developer Docs:
You can mark a potential duplicate as invalid to tell the system that the potential duplicate has been investigated and deemed to be not a duplicate. To do so you can use the following endpoint:

Merged - Doesn’t seem to yet be a functional feature

But what about “Linked”? Is that not a relevant status, given that the documentation seems to highlight linking as a way of managing potential duplicates?

Lastly, when “deleting” a potential duplicate, i.e. DELETE /api/potentialDuplicates/<id>, does it only remove the “flag” on the TEI and the row in the potentialduplicate table, or does it also delete the TEI?

Can anyone shed more light on this feature and what certain aspects of it mean?

Stian · 2 August 2021 07:08

Hi @chase.freeman ,

Great questions! We are still developing this feature (as we speak), and some things are subject to slight change. I will try to answer your questions, though.

What is the meaning and purpose of a potential duplicate that only has a teiA listed?

The initial idea here was to flag tracked entities that users knew or suspected were duplicates. We added this primarily to let implementations start manually flagging potential duplicates. Currently, there are no implications of flagging a tracked entity like this, other than creating a new row in the potential duplicate table.

However, we have revisited this specific code recently and made a small change. We decided that being flagged as a potential duplicate should be an inherent property of the tracked entity itself, and that a potential duplicate is always a pair. In other words, we created a new property on the tracked entity that represents the “potential duplicate” flag, while a potential duplicate now requires both a teiA AND a teiB.
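
As an illustration of the revised rule, here is a minimal sketch of a client-side check that both teiA and teiB are present and exist. The function name `validate_potential_duplicate` and the `existing_tei_ids` set are hypothetical and not part of DHIS2; the real validation happens on the server and may differ.

```python
# Hypothetical client-side check; not the DHIS2 implementation.
VALID_STATUSES = {"OPEN", "INVALID", "MERGED"}


def validate_potential_duplicate(payload: dict, existing_tei_ids: set) -> list:
    """Return a list of problems; an empty list means the payload looks valid."""
    errors = []
    # Under the revised model, a potential duplicate is always a pair.
    for key in ("teiA", "teiB"):
        tei_id = payload.get(key)
        if not tei_id:
            errors.append(f"{key} is required")
        elif tei_id not in existing_tei_ids:
            errors.append(f"{key} does not point to an existing tracked entity")
    # OPEN is the default status when none is given.
    status = payload.get("status", "OPEN")
    if status not in VALID_STATUSES:
        errors.append(f"unknown status: {status}")
    return errors
```

For example, `validate_potential_duplicate({"teiA": "a"}, {"a"})` reports that teiB is required, which mirrors the change from “teiB is optional” to “always a pair”.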

The idea behind flagging tracked entities like this has two parts. First, as mentioned, we want users to be able to flag tracked entities as potential duplicates, so they can be retrieved and reviewed later. However, we haven’t finalized a design for how this review process will work. Second, we are planning to add a “potential duplicate” search. We can’t make any promises, since we have yet to implement this search and don’t know its potential performance cost, but we want to use flagged tracked entities as a pool that we can automatically prioritise for these searches. The idea is that the search runs periodically, picks up any flagged tracked entities, looks for potential duplicates of each one, and creates new potential duplicate records.
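
To make the planned search easier to picture, here is a rough sketch of one pass over the flagged pool. Everything in it is an assumption: the search is not implemented yet, the `potentialDuplicate` property name is a guess at how the flag might be exposed, and the caller-supplied `is_similar` function stands in for whatever matching logic is eventually chosen.

```python
# Hypothetical sketch of one search pass; the real design is not finalized.
def find_potential_duplicates(entities: dict, is_similar) -> list:
    """entities maps a tracked entity id to a dict of its data.

    Returns new potential duplicate records in the OPEN state.
    """
    records = []
    seen_pairs = set()
    # Only flagged tracked entities are used as the starting pool.
    flagged_ids = [
        tei_id for tei_id, tei in entities.items()
        if tei.get("potentialDuplicate")  # assumed name of the flag
    ]
    for flagged_id in flagged_ids:
        for other_id, other in entities.items():
            if other_id == flagged_id:
                continue
            pair = frozenset((flagged_id, other_id))
            if pair in seen_pairs:
                continue  # avoid recording (a, b) and (b, a) separately
            if is_similar(entities[flagged_id], other):
                seen_pairs.add(pair)
                records.append(
                    {"teiA": flagged_id, "teiB": other_id, "status": "OPEN"}
                )
    return records
```

This naive version compares every flagged entity against every other entity, which is exactly the kind of performance cost that is still unknown.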

Statuses

Open is the default status. When creating a new potential duplicate, we only know that it might be a duplicate, and nothing has been done to verify or invalidate that assumption.

Invalid applies to potential duplicates that have been reviewed by a user and deemed not to be duplicates. Ideally, we can piggyback on this information in our searches to avoid false positives when looking for duplicates.

Merged should ideally only be reachable after a merge has been performed. A merge is between teiA and teiB, where data from teiB is moved into teiA where applicable. After that merge, we mark the potential duplicate as merged.

Both Merged and Invalid are “completed” states, meaning they are no longer relevant for any action. In theory, when a user is reviewing potential duplicates, they will only look at potential duplicates in the “open” state, and move them to either Merged or Invalid.
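
The status rules above can be sketched as a small state machine. This is an illustration only; the `Status` enum and the `transition` and `needs_review` helpers are hypothetical names, not DHIS2 code.

```python
from enum import Enum


class Status(Enum):
    OPEN = "OPEN"
    INVALID = "INVALID"
    MERGED = "MERGED"


# "Completed" states are no longer relevant for any action.
COMPLETED = {Status.INVALID, Status.MERGED}


def transition(current: Status, target: Status) -> Status:
    """Allow only OPEN -> INVALID and OPEN -> MERGED."""
    if current in COMPLETED:
        raise ValueError(f"{current.value} is a completed state")
    if target not in COMPLETED:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def needs_review(potential_duplicates: list) -> list:
    """Reviewers only look at potential duplicates that are still OPEN."""
    return [pd for pd in potential_duplicates if pd["status"] == Status.OPEN.value]
```

Treating Merged and Invalid as terminal is an interpretation of “completed”; whether a record could ever be reopened is not stated here.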

Currently, we have no concept of Linked. If this is something you think would be useful, I would be happy to discuss it further with you; let me know.

When deleting a potential duplicate, does it only remove the flag, or also delete the tracked entity?

In the currently released version, deleting the potential duplicate will just mark it as invalid. We have moved away from that, and if we do support deletion, it would be for removing the potential duplicate record, and not the tracked entity itself.

Bonus answer:
For merging, we do intend to delete teiB, while keeping teiA, so the merging operation would result in deleting a tracked entity.
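
A minimal in-memory sketch of that intended merge flow follows. The data shapes and the `merge` function are hypothetical, and the rule used for “where applicable” (only copy attributes that teiA lacks) is an assumption made purely for illustration.

```python
# Hypothetical in-memory sketch of the intended merge; not DHIS2 code.
def merge(potential_duplicate: dict, entities: dict) -> None:
    """Move data from teiB into teiA, delete teiB, and mark the record MERGED."""
    if potential_duplicate["status"] != "OPEN":
        raise ValueError("only OPEN potential duplicates can be merged")
    tei_a = entities[potential_duplicate["teiA"]]
    tei_b = entities[potential_duplicate["teiB"]]
    # Assumption: "where applicable" means teiA keeps its own values
    # and only receives attributes it does not already have.
    for key, value in tei_b["attributes"].items():
        tei_a["attributes"].setdefault(key, value)
    # teiB is deleted; teiA is kept.
    del entities[potential_duplicate["teiB"]]
    potential_duplicate["status"] = "MERGED"
```

Note that only the merge removes a tracked entity; deleting the potential duplicate record itself leaves both tracked entities in place.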

Hope this clears it up!

chase.freeman · 2 August 2021 11:50

Spot on! Thank you for your detailed answer - I’ll keep an eye out for new versions.

Are the new features likely to be back-ported to supported versions?

Stian · 2 August 2021 12:14

Great!

New features are not backported in DHIS2 by default. If we make any substantial changes to the existing feature, we might backport them, though. However, features like merging and searching will be added exclusively to new versions.