Strategies for Systematic Entity Matching


One of the recurring problems with connecting data from diverse sources is ironing out inevitable inconsistencies in text strings. For example, connecting “Joe Germuska” in one database with “Joseph Germuska” in another, or recognizing that “435 N. Michigan Ave.” is the same as “435 North Michigan Avenue”.

Does anyone have good systematic methods for doing this? Here are a couple of things we’ve looked at at the Tribune News Apps team.

Related, but possibly deserving a separate question: do you have database modeling strategies for knitting together things which were originally loaded with different names but have been determined to be the same? It seems like one doesn’t want to completely redo the data each time one of these connections is made, but any design I’ve thought of seems likely to buckle under the weight of the data. It has always looked to me like LibraryThing does this really well for both authors and works, but I don’t know the mechanics.
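
To make the modeling question concrete, here is a minimal sketch of one possible shape, not a description of how LibraryThing or anyone else does it. It assumes the originally loaded rows are never rewritten; a separate link table (the hypothetical `entity_link`) records which entity each raw row belongs to, so declaring two names "the same" is a single UPDATE on the link table. All table and function names are made up for illustration.

```python
import sqlite3

def open_db():
    # In-memory database for illustration; table names are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE raw_name (
            id INTEGER PRIMARY KEY,
            source TEXT NOT NULL,
            name TEXT NOT NULL
        );
        CREATE TABLE entity_link (
            raw_id INTEGER PRIMARY KEY REFERENCES raw_name(id),
            entity_id INTEGER NOT NULL
        );
        CREATE INDEX entity_link_entity ON entity_link(entity_id);
    """)
    return conn

def add_name(conn, source, name):
    # Load a raw row as-is; it starts out as its own entity.
    cur = conn.execute(
        "INSERT INTO raw_name (source, name) VALUES (?, ?)", (source, name))
    raw_id = cur.lastrowid
    conn.execute(
        "INSERT INTO entity_link (raw_id, entity_id) VALUES (?, ?)",
        (raw_id, raw_id))
    return raw_id

def entity_of(conn, raw_id):
    row = conn.execute(
        "SELECT entity_id FROM entity_link WHERE raw_id = ?",
        (raw_id,)).fetchone()
    return row[0]

def merge(conn, raw_a, raw_b):
    # Record that two raw rows are the same entity. Only link rows change;
    # the lower entity id is kept so repeated merges stay consistent.
    keep, drop = sorted((entity_of(conn, raw_a), entity_of(conn, raw_b)))
    if keep != drop:
        conn.execute(
            "UPDATE entity_link SET entity_id = ? WHERE entity_id = ?",
            (keep, drop))
    return keep

def names_for(conn, raw_id):
    # Every raw name currently linked to the same entity as raw_id.
    rows = conn.execute(
        "SELECT r.name FROM raw_name r "
        "JOIN entity_link l ON l.raw_id = r.id "
        "WHERE l.entity_id = ? ORDER BY r.id",
        (entity_of(conn, raw_id),))
    return [name for (name,) in rows]
```

After `merge(conn, a, b)`, `names_for(conn, a)` lists both spellings while the `raw_name` rows stay exactly as loaded. Whether this holds up under a large data set is a separate question that the sketch does not answer.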

Asked April 22, 2010


2 Answers


CorpWatch has taken a valiant stab at entity resolution/name disambiguation with corporate data from the Securities and Exchange Commission. Their project is open source so you might learn quite a bit from them about structuring a database for entity resolution.

You might also try the folks at Sunlight Labs, who I believe are working on these issues with various nonprofits such as the National Institute on Money In State Politics and OpenSecrets. I believe both of the latter groups have applied these techniques to campaign finance data at the state and federal levels, and I wouldn't be surprised if they used them to help develop the recently announced TransparencyData.com.

On a technical level, there are numerous approaches to this problem depending on the nature of the data. Most likely you'll need to cobble together several of them to develop an accuracy ranking, which you can then use to separate "good" data from data that will need human review. Below are a few resources that might help along the way:

Finally, you might find some useful nuggets in Programming Collective Intelligence, which uses Python as the language of choice for source examples.
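
To illustrate what "cobbling together several approaches into an accuracy ranking" might look like, here is a rough Python sketch using only the standard library. It is not taken from CorpWatch, Sunlight, or the PCI book. The two signals (character similarity from `difflib` and token overlap), the equal weighting, and the `accept`/`review` thresholds are all assumptions you would need to tune against your own data.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    # Lowercase, drop punctuation, collapse whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())

def char_similarity(a, b):
    # Character-level similarity in [0, 1] from difflib.
    return SequenceMatcher(None, a, b).ratio()

def token_overlap(a, b):
    # Jaccard overlap of whitespace-separated tokens, in [0, 1].
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def score(a, b):
    # Equal-weight average of the two signals (an arbitrary choice).
    na, nb = normalize(a), normalize(b)
    return (char_similarity(na, nb) + token_overlap(na, nb)) / 2

def classify(a, b, accept=0.85, review=0.6):
    # Thresholds are hypothetical: high scores are accepted, the middle
    # band goes to a human, and the rest is treated as a non-match.
    s = score(a, b)
    if s >= accept:
        return "match"
    if s >= review:
        return "review"
    return "no match"
```

The point is the three-way split rather than the particular signals: anything scoring in the middle band gets queued for a person instead of being silently merged or dropped.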

EDIT: On the address standardization front, Google's geocoding is your best bet if you've already shelled out the $10K or so for a license (which lets you brush aside the daily limits on the service). But if you need a stand-alone library, you might want to check out the Ruby port of the Perl module you mentioned:

http://github.com/geocommons/geocoder

It appears the original developer behind the Perl module migrated over to Ruby. It ain't Python, but if it gets the job done....

  1. Hi Joe!
    I can’t recall precisely, but I *think* there was one important difference between the Perl and Ruby versions of the full geocoders (not just the street parsers). I think the Perl version only worked with TIGER/Line files from 2004 or earlier? Again, can’t recall the precise limitation, but it might be something to keep on the radar.

    Re PCI, it’s probably not directly applicable, but perhaps Chpt. 6 on Document Filtering can provide some guidance on developing an accuracy ranking system.

  2. Lots of good stuff here. Re: Schuyler porting, I figure using Perl from (or instead of) Python is as good as using Ruby. Although I’ll be curious to see if he was really able to get Ruby to do the regexes as well as Perl did.

    I’m a big fan of the PCI book, but don’t recall really seeing anything which answered the question.

    Sunlight has a Python open source project in this area. I had trouble getting Sphinx to compile and haven’t made time to dig deeper. http://github.com/sunlightlabs/datacommons-matchbox


An answer to the address part of your first question:

A hack we considered is (ab)using the Google geocoder. The API returns a normalized address (in JSON, it appears), so you could feed it "435 N. Michigan Ave." and "435 North Michigan Avenue" and they would both come back as:

{"lat": 41.890422000000001, "lng": -87.623701999999994, "place": "435 N Michigan Ave, Chicago, IL 60611, USA"}

... so then you know you've got one & the same.
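
In Python, the comparison could be sketched roughly like this. The `geocode` argument is a hypothetical callable standing in for whatever geocoding client you use; it is assumed to return a dict with a `place` key shaped like the response above, or `None` when nothing is found. The `fake_geocode` stub just replays that one response so the example runs offline. It is not the Google API.

```python
# Canned response copied from the example above, for offline illustration.
_CANNED = {
    "lat": 41.890422000000001,
    "lng": -87.623701999999994,
    "place": "435 N Michigan Ave, Chicago, IL 60611, USA",
}

def fake_geocode(address):
    # Hypothetical stand-in for a real geocoder call.
    known = {"435 N. Michigan Ave.", "435 North Michigan Avenue"}
    return dict(_CANNED) if address in known else None

def same_address(a, b, geocode):
    # Two inputs count as the same address only if both geocode
    # successfully and their normalized "place" strings are identical.
    result_a, result_b = geocode(a), geocode(b)
    if result_a is None or result_b is None:
        return False
    return result_a["place"] == result_b["place"]
```

With the stub, `same_address("435 N. Michigan Ave.", "435 North Michigan Avenue", fake_geocode)` is true, and any address the stub does not know compares as false. With a real service you would probably want to cache results, since each comparison costs two lookups.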

As to your second question, it's the comparison of your new data against what's already in the data set that raises fears of buckledom?

  1. The Google geocoder (part of the Google Maps API) has been a great tool for normalizing addresses, but it has a limitation on the number of geocodes you can run on a given day (50k/day?), so be sure to not abuse the tool.
