How would you convert a huge amount of street addresses to coordinates(lat, long)?

4

This was a good question on NICAR-L that seemed worth preserving.

How would you convert a huge amount of street addresses to coordinates (lat, long)? I have ArcView and open source tools.

Asked June 12, 2010
  1. One significant question: Did a human type in the addresses? (If you can’t trust the input, then you’ll need more troubleshooting and error-catching on the output…)

    And then some minor questions:

    1. How many is a huge amount? 100s? 1,000s? 10,000s?
    2. How accurate do you need the results to be?


6 Answers

5

geopy offers a pretty straightforward Python approach:

>>> from geopy import geocoders
>>> g = geocoders.Google('YOUR_API_KEY_HERE')  
>>> place, (lat, lng) = g.geocode("10900 Euclid Ave in Cleveland")  
>>> print "%s: %.5f, %.5f" % (place, lat, lng)  
10900 Euclid Ave, Cleveland, OH 44106, USA: 41.50489, -81.61027

... and if/when you bump up against Google's 2,500 requests per day limit, geopy also includes classes for geocoding against MediaWiki (with the GIS extension), Semantic MediaWiki, the Yahoo! geocoder, geocoder.us, Virtual Earth, and GeoNames.

Also, note that the new v3 Google geocoder doesn't require signing up for a key (but it still has a limit on free geocoding requests) and has some added tricks in the response that you might want to check out. For instance, using the address above returns the following (much more elaborate) JSON response:

{
  "status": "OK",
  "results": [ {
    "types": [ "street_address" ],
    "formatted_address": "10900 Euclid Ave, Cleveland, OH 44106, USA",
    "address_components": [ {
      "long_name": "10900",
      "short_name": "10900",
      "types": [ "street_number" ]
    }, {
      "long_name": "Euclid Ave",
      "short_name": "Euclid Ave",
      "types": [ "route" ]
    }, {
      "long_name": "Cleveland",
      "short_name": "Cleveland",
      "types": [ "locality", "political" ]
    }, {
      "long_name": "Cleveland",
      "short_name": "Cleveland",
      "types": [ "administrative_area_level_3", "political" ]
    }, {
      "long_name": "Cuyahoga",
      "short_name": "Cuyahoga",
      "types": [ "administrative_area_level_2", "political" ]
    }, {
      "long_name": "Ohio",
      "short_name": "OH",
      "types": [ "administrative_area_level_1", "political" ]
    }, {
      "long_name": "United States",
      "short_name": "US",
      "types": [ "country", "political" ]
    }, {
      "long_name": "44106",
      "short_name": "44106",
      "types": [ "postal_code" ]
    } ],
    "geometry": {
      "location": {
        "lat": 41.5051404,
        "lng": -81.6097778
      },
      "location_type": "ROOFTOP",
      "viewport": {
        "southwest": {
          "lat": 41.4992554,
          "lng": -81.6106166
        },
        "northeast": {
          "lat": 41.5055506,
          "lng": -81.6043214
        }
      }
    },
    "partial_match": true
  } ]
}
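
Pulling the useful fields out of a response shaped like this one might look like the sketch below. The keys come from the JSON above; the function name `summarize_result` is made up for illustration, and the sample is a trimmed copy of the same response.

```python
import json


def summarize_result(response):
    """Return the first result's key fields, or None if nothing usable came back."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    top = response["results"][0]
    location = top["geometry"]["location"]
    return {
        "address": top["formatted_address"],
        "lat": location["lat"],
        "lng": location["lng"],
        # "ROOFTOP" is the value in the sample; worth storing so weaker matches can be reviewed
        "location_type": top["geometry"]["location_type"],
        # treated as False when the key is absent
        "partial_match": top.get("partial_match", False),
    }


# Trimmed copy of the response shown above.
sample = json.loads("""
{
  "status": "OK",
  "results": [ {
    "formatted_address": "10900 Euclid Ave, Cleveland, OH 44106, USA",
    "geometry": {
      "location": { "lat": 41.5051404, "lng": -81.6097778 },
      "location_type": "ROOFTOP"
    },
    "partial_match": true
  } ]
}
""")

summary = summarize_result(sample)
print(summary["address"], summary["lat"], summary["lng"])
```

Keeping `location_type` and `partial_match` next to the coordinates makes it easier to pull out the shaky matches for a second look later.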

HTH, John

4

Really big geocoding jobs (think millions of addresses) are an area where ArcView can be pretty good. Here's why.

Google is great about making sense of addresses, especially when the street names aren't exactly right (I think this is true of other geocoding services, though I'm most familiar with Google). The fact that they can regularize the addresses they return to you is awesome (and you might wanna record their response on your end of things). In many cases they return results that are more precise than street segment interpolation. Add in viewport biasing, and it's an incredible service (also check out Ben Welsh's fork of python-geopy, which has viewport biasing built in: http://github.com/palewire/python-geopy ). That said, the ever-dropping rate limit is now 2,500 addresses a day ( http://code.google.com/apis/maps/documentation/geocoding/ ). You can get around this some with multiple keys/multiple services, but...
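
Recording responses on your end could be as simple as a lookup table keyed by the address string, so the same address never counts against a rate limit twice. This is only a sketch: `CachedGeocoder` is a hypothetical name, and `lookup` stands in for whatever geocoding call you actually use (assumed here to return a `(place, lat, lng)` tuple).

```python
import sqlite3


class CachedGeocoder:
    """Wrap a geocoding callable and remember every answer in SQLite."""

    def __init__(self, lookup, path=":memory:"):
        self.lookup = lookup  # hypothetical callable: query -> (place, lat, lng)
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS geocache "
            "(query TEXT PRIMARY KEY, place TEXT, lat REAL, lng REAL)"
        )

    def geocode(self, query):
        row = self.db.execute(
            "SELECT place, lat, lng FROM geocache WHERE query = ?", (query,)
        ).fetchone()
        if row is not None:
            return row  # cache hit: no request made
        place, lat, lng = self.lookup(query)
        self.db.execute(
            "INSERT INTO geocache VALUES (?, ?, ?, ?)", (query, place, lat, lng)
        )
        self.db.commit()
        return (place, lat, lng)


# Stand-in for a real service; it only records how often it was called.
calls = []


def fake_lookup(query):
    calls.append(query)
    return ("10900 Euclid Ave, Cleveland, OH 44106, USA", 41.50489, -81.61027)


geocoder = CachedGeocoder(fake_lookup)
first = geocoder.geocode("10900 Euclid Ave in Cleveland")
second = geocoder.geocode("10900 Euclid Ave in Cleveland")
```

Passing a file path instead of `":memory:"` keeps the cache between runs, which matters when a job is spread over several days of quota.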

Aidian's approach is really cool, and it's awesome he's sharing the code. The one problem is when street names aren't quite right. Depending on what your data is, this can be a bigger problem than you think. Is it W. 7th St. or West 7th Street or W. 7 St? You could get around that with some alias table work, postgres regexes, and maybe an initial street name lookup, but that gets to be a lot of work.
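
The alias-table idea can be sketched in a few lines of Python. The tables below are a tiny made-up sample and `normalize_street` is a hypothetical helper; a real one would need many more entries and would still miss cases.

```python
import re

# Tiny illustrative alias tables; real data needs far more entries.
DIRECTIONS = {"w": "west", "e": "east", "n": "north", "s": "south"}
SUFFIXES = {"st": "street", "ave": "avenue", "blvd": "boulevard", "rd": "road"}
ORDINAL = re.compile(r"^(\d+)(st|nd|rd|th)$")


def normalize_street(name):
    """Reduce common spelling variants of a street name to one form."""
    tokens = re.sub(r"[.,]", "", name.lower()).split()
    out = []
    for i, tok in enumerate(tokens):
        match = ORDINAL.match(tok)
        if match:
            out.append(match.group(1))  # "7th" -> "7"
        elif i == 0 and tok in DIRECTIONS:
            out.append(DIRECTIONS[tok])  # leading "w" -> "west"
        elif i == len(tokens) - 1 and tok in SUFFIXES:
            out.append(SUFFIXES[tok])  # trailing "st" -> "street"
        else:
            out.append(tok)
    return " ".join(out)


for variant in ("W. 7th St.", "West 7th Street", "W. 7 St"):
    print(variant, "->", normalize_street(variant))
```

Running the same function over both the input addresses and the street names in the road table gives the exact-match lookup something consistent to compare.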

In my experience, sometimes you'll get data from an agency that has used the TIGER/Line files (or their vendor has), so you'll get scarily good match rates with the TIGER/Line files. If that's the case, the roll-your-own approach would be great.

But geocoding is a big problem, and it's one that ArcView has been working on for a while. Their stuff is pretty good. I wish I were better at configuring the geocoding options, but they can handle a fair amount of address formatting weirdness, and you can tweak how well a result has to match to be accepted (you can also build alias tables there, though I've never had the patience). Again, the quality of the results depends on the quality of your road segment data, and whether the agency has used the same data source. (In my experience, you can sometimes get street files from the same source the agency used. Good county GIS agencies sometimes build these files, and if the data you have is from a county agency that's used the locally built road files, you're in good shape.)

Finally, the quality of the TIGER line files really varies. My experience is that they can be surprisingly good in areas where A. new development hasn't occurred recently, and B. the streets are straight. Which is true in a lot of cities. If you're looking at new developments with curvy roads, it's a whole other story.

2

One of the first things I ever did in Python was to build my own geocoder against a PostGIS database built from the TIGER road grid for my county.

It's the alternative I used for a similar bind: geocoding a whole bunch of addresses, significantly more than Google would do for me for free.

It was a huge pain for me to figure out, but now seems nearly trivial. The code I used is below in case you need to go that route. The indexes refer to columns in the database. It only really works if all those addresses are in a single geographic area; the database would quickly get unusably large if you were trying to import the whole U.S. or even a whole state.

```python
import psycopg2

# Connection settings for the local PostGIS database (replace with your own).
DSN = "dbname='gisdb' user='postgres' host='localhost' password='mypassword'"


def getseg(address, street):
    """Return the road segment row whose address range contains `address`."""
    # Let psycopg2 raise if the connection fails; nothing below works without it.
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    # Pass the street name as a query parameter rather than formatting it into the SQL.
    cur.execute("SELECT * FROM sacroads WHERE streetname = %s", (street,))
    rows = cur.fetchall()
    conn.close()
    seg = None  # stays None when no segment matches
    for row in rows:
        # Columns 23-24 and 25-26 are treated as two low/high address ranges.
        if int(row[23]) < address < int(row[24]):
            seg = row
        elif int(row[25]) < address < int(row[26]):
            seg = row
    return seg


def location_along_segment(seg, address):
    """Return the fraction along the segment where the address falls."""
    idnumber = seg[0]
    # Compare as integers; the columns may come back as strings.
    start = min(int(seg[23]), int(seg[25]))
    finish = max(int(seg[24]), int(seg[26]))
    base = finish - start
    actual = finish - address
    length = float(actual) / float(base)
    return {"along": length, "section": idnumber}


def coords(segdict):
    """Interpolate the point at segdict["along"] on the segment's geometry."""
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    # Newer PostGIS releases name these functions ST_X, ST_Y,
    # ST_LineInterpolatePoint and ST_GeometryN.
    cur.execute(
        """SELECT x(line_interpolate_point(GeometryN(the_geom, 1), %(along)s)),
                  y(line_interpolate_point(GeometryN(the_geom, 1), %(along)s))
           FROM sacroads WHERE gid = %(section)s""",
        segdict,
    )
    cords = cur.fetchall()
    conn.close()
    return cords
```
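
The interpolation idea behind the code above can be shown without a database. This sketch is a simplified variant, not the code above: it assumes a perfectly straight segment with known endpoint coordinates, and it measures the fraction from the low end of the address range. `interpolate_address` and the sample numbers are invented for illustration; PostGIS does the equivalent work along the real line geometry.

```python
def interpolate_address(address, low, high, start_xy, end_xy):
    """Estimate coordinates for a house number on a straight segment.

    low and high are the house numbers at start_xy and end_xy respectively.
    """
    if not low <= address <= high:
        raise ValueError("address is outside the segment's range")
    if high == low:
        fraction = 0.0  # single-address segment: use the start point
    else:
        fraction = (address - low) / float(high - low)
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return (x, y)


# House number 150 on a block numbered 100-200 lands at the midpoint.
print(interpolate_address(150, 100, 200, (0.0, 0.0), (10.0, 20.0)))
```

This is also why straight streets geocode better than curvy ones: the house-number fraction maps cleanly onto distance along the line.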

1

I actually had a bit of luck using Geo::Coder::US (the code behind geocoder.us) a few years back: http://search.cpan.org/~sderle/Geo-Coder-US/US.pm

If I remember correctly, however, you had to use older (2000) versions of the TIGER/Line shapefiles to get it to work. I found that the trick was to geocode with that first and then switch to one of the others (Yahoo, Google, etc.) when it couldn't find an address. That way you could stay well within the limits.
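
That local-first, remote-fallback pattern could be sketched in Python like this. The geocoder functions are stand-ins (hypothetical callables that return a `(lat, lng)` tuple or `None`), not real client libraries, and the remote one returns dummy coordinates.

```python
def geocode_with_fallback(address, geocoders):
    """Try each (name, lookup) pair in order and return the first hit."""
    for name, lookup in geocoders:
        result = lookup(address)
        if result is not None:
            return name, result
    return None, None


# Stand-in for a local geocoder with no rate limit.
def local_lookup(address):
    known = {"10900 Euclid Ave, Cleveland, OH": (41.50489, -81.61027)}
    return known.get(address)


# Stand-in for a rate-limited remote service; counts the requests it receives.
remote_requests = []


def remote_lookup(address):
    remote_requests.append(address)
    return (0.0, 0.0)  # dummy coordinates, for illustration only


chain = [("local", local_lookup), ("remote", remote_lookup)]
print(geocode_with_fallback("10900 Euclid Ave, Cleveland, OH", chain))
print(geocode_with_fallback("1 Unknown Rd", chain))
```

Only the addresses the local geocoder misses reach the remote service, so the daily quota is spent on the hard cases.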

0

My erstwhile colleague Darnell Little told me he sometimes used this service:

https://webgis.usc.edu/Services/Geocode/Default.aspx

See https://webgis.usc.edu/About/UsageCosts.aspx for details of high volume pricing.

0

Ok, so this is a total asshat answer, but you could do your geocoding requests through a bunch of proxies. For example, when I need to geocode more than the daily rate limit using Google's API, I set up a bunch of proxies like this:

http://gist.github.com/549716

and then route my requests to localhost:8080, localhost:8081, localhost:8082 randomly. It's a poor man's load balancing solution. For those interested in the Ruby side of things, you can patch graticule to do this load balancing by doing this:

http://gist.github.com/549722

Like I said, it's not really above board, but it works. Also, there's this thread on the Google Maps API Message Board from 2006 that says if you limit your requests to one every 1.72 seconds, you don't have to deal with the limit:

http://groups.google.com/group/Google-Maps-API/browse_thread/thread/906e871bcb8c15fd

Which seemed to work well for me.
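
Spacing requests out like that takes only a small wrapper. In this Python sketch the 1.72-second interval comes from the thread linked above, while `Throttle`, `geocode_all`, and the injectable `clock` and `sleep` parameters are hypothetical names, chosen so the logic can be tested without actually waiting.

```python
import time


class Throttle:
    """Block so that consecutive calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval=1.72, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None  # time of the previous request, if any

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


def geocode_all(addresses, lookup, throttle):
    """Geocode each address in turn, pausing between requests.

    `lookup` is whatever geocoding call you use (an assumption of this sketch).
    """
    results = {}
    for address in addresses:
        throttle.wait()
        results[address] = lookup(address)
    return results
```

At one request every 1.72 seconds, a day holds roughly 50,000 requests, so this helps with medium-sized jobs but not with millions of addresses.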
