Matching strategies and algorithms

Knut_Staring · 5 May 2010 18:47

This ongoing (long) discussion on the OMRS list on name matching
strategies is also relevant for us - it is not only important for
patient, but also for matching facility names and villages/chiefdoms
etc with shapefiles for mapping.

Knut

···

---------- Forwarded message ----------
From: Jeff Rafter <jeffrafter@gmail.com>
Date: Wed, May 5, 2010 at 5:34 PM
Subject: Re: [OPENMRS-DEV] soundex architecture question
To: openmrs-devel-l@listserv.iupui.edu

i think using using minimal edit distance from dictionary words or names would be a better implementation if correcting word spelling is the goal. soundex needs to be fine tuned to bantu languages which no one has done successively yet.

Check out the implementation from earlier in this thread... the
Mateme/Ruby bantu soundex is working great in Malawi and it looks like
Dave (and others?) have already ported this to Java for work in
OpenMRS.

On Tue, May 4, 2010 at 11:10 PM, Paul Biondich <pbiondich@regenstrief.org> wrote:

Also, just another gentle reminder... there is another whole "approach" to matching mis-spelled or misrepresented names: string sequence matching or comparator algorithms.
Some examples of this are: Levenshtein Edit Distance (and edit distances in general), and Longest Common Substring (LCS), and the Jaro-Winkler methodology.
Wikipedia is your friend on these.

I think that Levenshtein distance would be tough for suggesting. Lots
of things would have the same distance but would be wildly different
right? I could be wrong on that though. I have used LCS with account
names in a previous life and it worked well, but mainly for things
like "Bank of America" and "Bank America". These aren't similar
sounding but are clearly related. I could see this having an impact on
names like the we discussed in the other thread: "Mike von der ohe Mc
Kay" versus "Mike McKay".
Secondarily, the person_name_code approach would be more difficult,
you would have to compare the typed in string to all of the names in
the database to determine the distance. We could probably use some
kind of Ur-name and mean distance but that would take some tweaking.
All the best
Jeff

-Paul
On May 1, 2010, at 1:12 AM, David Thomas wrote:

This is excellent. I'm adding this to my notes. As we move forward with kinyarwanda, it seems like an eventual logical step to try to abstract the lessons learned for other implementers working in non-european languages. Its on my 'things not to forget about' list.

d

On 4/28/2010 6:27 AM, Jeff Rafter wrote:

Awesome work Dave. I really like that you can add algorithms.
In thinking about this more it would be cool if *anyone* could add a new algorithm. The process is pretty simple and could be accomplished with a couple tables:
soundex_algorithm
soundex_algorithm_id:int
name:string
uppercase_all_letters:boolean, default:true
strip_punctuation:boolean, default:true
keep_first_digit:boolean, default:true
remove_digit_pairs, default:true
remove_leading_zero, default:true
length:int, default:4
soundex_algorithm_replacement
soundex_algorithm_step_id:int
soundex_algorithm_id:int # refers to the main algorithm
sort_weight:int
match:string
value:string

This could obviously be massaged some more. And I am more or less imagining there would be regex in the match value which might be over the heads of non-devs (although the very simple regex in this case might be okay).
The other potential problem would be that you may need to specify at what point you keep the first char (in the Bantu one we have to replace leading phonemes first). I could imagine in a language where there were spaces / ' in names that you might consider the next char a leading char as well. So this could be a crazy road to go down. Just an idea.

Great work,
Jeff
On Wed, Apr 28, 2010 at 7:18 AM, David Thomas <pihdave@gmail.com> wrote:

Hi. The namephonetic module that i checked in a couple of days ago basically mirrors everything that's said here. The one small data model change is that each row only stores one encoded name component, which is then labeled by a java enum (GIVEN_NAME, MIDDLE_NAME, FAMILY_NAME, FAMILY_NAME2). The module AOP's around savePatient and calculates the code for fields that you specify using a global property, meaning that you can control what PersonName fields you want to encode, and you can even attach them to different soundex algorithms (folks in rwanda get a local name and a christian name, which are quite different).

There's a little JSP page that gives you the ability to re-calculate all soundexes for all patients in the db, for migration purposes, or if you want to alter your soundex.

And, i ported jeff's chichewa soundex directly into java, which is included in the module -- we needed a starting point for kinyarwanda.

Finally, the module allows you to add your own algorithm. All algorithms only have to be an extension of the StringEncoder class that lives in the apache commons codec jar file. The module also includes all the algorithms that live in this jar file by default -- Soundex, RefinedSoundex, Metaphone, Double Metaphone, etc...

The module however, doesn't affect the default patient lookup in openmrs. I didn't connect them. But the module does have a bunch of methods for finding a patient, which i'm using from a touchscreen registration module i'm working on.

dave thomas

On 4/27/2010 8:19 PM, Jeff Rafter wrote:

For completeness, here is the code (obviously very simple implementation):
mateme/lib/bantu_soundex.rb at master · jeffrafter/mateme · GitHub
Also for completeness "we" here meant Rodney Tayamika and myself (with lots of help later from Evan). This was one of those
situations that absolutely required someone with local knowledge.
I also realized that in the final solution actually moved the soundexes out of person_name into a nearly identical table called
person_name_code. This was done so that the name codes would not show up in other modules and in places that didn't
make use of the preferred flag:
desc openmrs_1_5.person_name_code
-> ;
+-------------------------+-------------+------+-----+---------+----------------+
> Field | Type | Null | Key | Default | Extra |
+-------------------------+-------------+------+-----+---------+----------------+
> person_name_code_id | int(11) | NO | PRI | NULL | auto_increment |
> person_name_id | int(11) | YES | MUL | NULL | |
> given_name_code | varchar(50) | YES | MUL | NULL | |
> middle_name_code | varchar(50) | YES | MUL | NULL | |
> family_name_code | varchar(50) | YES | MUL | NULL | |
> family_name2_code | varchar(50) | YES | | NULL | |
> family_name_suffix_code | varchar(50) | YES | | NULL | |
+-------------------------+-------------+------+-----+---------+----------------+
7 rows in set (1.16 sec)
I am happy to fill in any more holes here, and am excited to see this possibly being modularized.
Cheers,
Jeff

On Tue, Apr 27, 2010 at 6:02 PM, Jeff Rafter <jeffrafter@gmail.com> wrote:

Hey All,
I am trying to catch up on this thread, somehow I missed it. We (PIH) have implemented Soundex in Mateme (the PIH/Baobab touchscreen system). When we did this re worked the phoneme set of traditional soundex to use Bantu phonemes instead (based on a very old missionary text in Malawi). This had the effect of grouping "l" and "r" for instance, a common cross-phoneme in Malawi. It works *great*. The code is very simple as well. Ultimately Evan Waters asked for a couple of adjustments based on real world clinic use after the first couple months.
Secondarily, the question of when to do this.
The best time to calculate the soundex is when you change the name. For us we had control over this in the touchscreen application, but embedding it into a module would be even better because then edits in OpenMRS proper would cause soundex refiguring. To save the soundex we simply add a second PersonName record with soundexes instead of names. This acts a giant soundex cache. When we search, we do a quick calculation on the soundex of the search string and search for the result in our names. Ultimately this makes name searching just as fast as before.
We had a back log of names so we had a quick one-time process that calculated all of the current names... from then on the system kept up with itself as it was just one additional thing to save whenever a name changed.
Jeff
On Wed, Apr 21, 2010 at 4:40 PM, Paul Biondich <pbiondich@regenstrief.org> wrote:

Dave:
The traditional criticism of Soundex for our work is that it's algorithm was inferred from English names, so it's problematic when bringing it to a Swahili or Rwandan French name space.
I know that Shaun Grannis is very up to speed on the various pros and cons of the phonetic "compression" algorithms, and I'll flag him to make sure he sees this and responds.
Least Common Substring might be better given the circumstances (it tends to "hide" misspellings and typing transpositions, which are what you are likely trying to deal with).
RE: running these algorithms on-the-fly vs. as a scheduled task... I think you need to determine first what mechanism(s) you want to use, and then measure the performance implications appropriately. In practice, Shaun's work always pregenerates these...
-Paul
On Apr 21, 2010, at 11:04 PM, David Thomas wrote:

Hi everyone. We've started a discussion of whether or not to launch into development of a soundex module, or if other strategies might be more appropriate. Here's the dialogue thus far:

Me (dave):

Hey, quick question -- the soundex algorithm looks fairly processor intensive, to the point where i don't think this should be done in real-time for existing names in the database. Do you pre-calculate soundex values for existing patient names, so that the algorithm is just applied to the search string, and then compared with the pre-calculated values?

I feel like maybe this should be an openmrs module on its own, with a little data model of pre-calculated values for givenName, middleName, familyName, familyName2 and should support multiple implementations of SoundexAlgorithm.

In case you haven't read it (mike, darius), here's the theory:
SoundEx - Phonetic Search Algorithms - How To - Source Code: C, JavaScript, Perl, VB - Creativyst(R) Software - Explored,Designed,Delivered.(SM)

Based on people's responses, i'll start (or not) the discussion of the data model.

Darius:

Interesting. I'd never read the algorithm before. More research:

Another competing algorithm is metaphone, which has apparently been replaced by double metaphone. I don't know the specific algorithm, so I don't know if it's easy to translate for Bantu pronounciations.
Apache's commons-codec has implementations of various phonetic algorithms (see API) including Soundex and RefinedSoundex (better for spellcheckers apparently) that allow you to specify the mapping. (It also has metaphone but no ability to change that algorithm.)

I did a quick test of running the commons-codec soundex algorithm on all the (few) names in the PersonServiceTest-extranames.xml file in OpenMRS and got the following result:
Took 136 ms to do 6 person names with 16 components
Presumably the implication is that if you do 6000 names it will take 2+ minutes on a laptop, which is obviously not fast enough for real-time.
So, the real question is whether we should be using a phonetic matching algorithm (in which case the module you describe is the only way to go) or if we should be considering leveraging the existing Patient Matching Module, which uses a much more sophisticated algorithm, probably has considered a real-time-check use case, but may not be complete. (That sounds like a question for the OpenMRS developers list. Anyone feel like sending that?)
-Darius

________________________________
Click here to unsubscribe from OpenMRS Developers' mailing list

________________________________
Click here to unsubscribe from OpenMRS Developers' mailing list

________________________________
Click here to unsubscribe from OpenMRS Developers' mailing list

________________________________
Click here to unsubscribe from OpenMRS Developers' mailing list

________________________________
Click here to unsubscribe from OpenMRS Developers' mailing list

________________________________
Click here to unsubscribe from OpenMRS Developers' mailing list

________________________________
Click here to unsubscribe from OpenMRS Developers' mailing list

--

Mugisha Moses
Lead Developer
Appfrica Labs
P.O. Box 1420 Kampala, Uganda
http://appfrica.org
skype name : mossplix
twitter: @mugisha

________________________________
Click here to unsubscribe from OpenMRS Developers' mailing list

________________________________
Click here to unsubscribe from OpenMRS Developers' mailing list

--
Cheers,
Knut Staring