What’s the fastest way to compare two large CSVs against each other?

So here’s the nut. Imagine you’ve got two huge CSV files, snapshots from the same database taken on different days.

They share a common identifier, and always have the same set of fields. But there are amendments, omissions and additions made between snapshots that can only be detected by comparing records against each other. What’s the fastest way to loop through a fresh snapshot and compare it against the previous snapshot for changes, additions and omissions?

Below is a roughed-out Python routine I’ve written with a fake data set. Basically, it uses the unique ID as the key to a dictionary holding what amounts to a csv.DictReader pull from two imaginary CSV files: one with the older “existing” snapshot, and another with the newer “fresh” snapshot.

It seems to work okay in testing, but when you run it over a large data set, it goes pretty darn slow. I’m curious whether anyone here knows a way I can make it run quicker.

existing_dict = {
    'A1': {'name': 'John', 'gender': 'M', 'value': 10},
    'A2': {'name': 'Jane', 'gender': 'F', 'value': 10},
    'A3': {'name': 'Josh', 'gender': 'M', 'value': 20},
    'A4': {'name': 'John', 'gender': 'M', 'value': 20},
    'A5': {'name': 'Janet', 'gender': 'F', 'value': 15},
    'A6': {'name': 'January', 'gender': 'F', 'value': 10},
}

fresh_dict = {
    'A2': {'name': 'Jane', 'gender': 'F', 'value': 10},
    'A3': {'name': 'Josh', 'gender': 'M', 'value': 20},
    'A4': {'name': 'John', 'gender': 'M', 'value': 20},
    'A5': {'name': 'Janet', 'gender': 'F', 'value': 15},
    'A6': {'name': 'January', 'gender': 'F', 'value': 5},
    'A7': {'name': 'Jessica', 'gender': 'F', 'value': 10},
}

def compare():
    """
    Compares two data dicts against each other.
    """
    # Set some counters to report outcome
    nochanges, amendments, omissions = 0,0,0

    # Loop through the existing data...
    for id_, existing_data in existing_dict.items():
        # Try to find the corresponding record in the fresh data
        fresh_data = fresh_dict.get(id_)
        # If it's there...
        if fresh_data is not None:
            # Determine if there are any changes
            if is_diff(existing_data, fresh_data):
                amendments += 1
            else:
                nochanges += 1
            del fresh_dict[id_]
        else:
            omissions += 1
    additions = len(fresh_dict)
    return nochanges, amendments, omissions, additions

def is_diff(existing_row, fresh_row):
    change_list = [field for field in existing_row.keys()
        if existing_row.get(field) != fresh_row.get(field)]
    return bool(change_list)

if __name__ == '__main__':
    print "No change:%s; Amendments:%s; Omissions:%s; Additions:%s;" % compare()
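One shortcut worth noting before going further: since each row is a plain dict with the same set of fields, Python’s dict equality already does the field-by-field comparison at C speed, so the whole list comprehension in is_diff can be collapsed to a single expression. A sketch, equivalent to the version above:

```python
def is_diff(existing_row, fresh_row):
    # Dicts compare by value, field by field, in C code,
    # so no Python-level loop over the keys is needed.
    return existing_row != fresh_row
```

Whether that actually moves the needle is something only profiling against the real data can settle.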

That’s pretty much it. Here’s what the imaginary CSV files might look like, if it helps.

existing.csv

id,name,gender,value
A1,John,M,10
A2,Jane,F,10
A3,Josh,M,20
A4,John,M,20
A5,Janet,F,15
A6,January,F,10

fresh.csv

id,name,gender,value
A2,Jane,F,10
A3,Josh,M,20
A4,John,M,20
A5,Janet,F,15
A6,January,F,5
A7,Jessica,F,10

UPDATE: I’ve taken Jeremy’s advice below and tried to experiment with Python’s difflib. One method I stumbled into was replacing my list-comprehension based “is_diff” function above with the following.

def is_diff(existing_row, fresh_row):
    diff = difflib.SequenceMatcher(
        None,
        map(str, existing_row.values()),
        map(str, fresh_row.values()),
    )
    return diff.ratio() < 1

I put that into its own file and compared its running time against the list-comprehension version.

import timeit
runs = 5

listcomp = timeit.Timer("grind.compare()", "import grind")
print "List comprehension: %s" % (sum(listcomp.repeat(runs)) / float(runs))

difflib = timeit.Timer("diff.compare()", "import diff")
print "Difflib: %s" % (sum(difflib.repeat(runs)) / float(runs))

Here are the results.

$ python test.py
List comprehension: 3.34202065468
Difflib: 3.66195659637
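One hedged tweak that might claw some of that time back: SequenceMatcher.quick_ratio() computes an upper bound on ratio() much more cheaply, so it can serve as a fast pre-check, reserving the full ratio() for rows the bound can’t rule out. A sketch:

```python
import difflib

def is_diff(existing_row, fresh_row):
    a = [str(v) for v in existing_row.values()]
    b = [str(v) for v in fresh_row.values()]
    sm = difflib.SequenceMatcher(None, a, b)
    # quick_ratio() is an upper bound on ratio(): if even the
    # bound falls below 1 the rows certainly differ, and the
    # expensive full comparison can be skipped.
    if sm.quick_ratio() < 1:
        return True
    return sm.ratio() < 1
```

I haven’t benchmarked this against the versions above; it’s only worth keeping if the timeit harness says so.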

UPDATE II: Taking the off-site advice of another wise beard, I tried to rewrite the code using Python’s sets instead of dicts and lists. This is what I hacked out.

existing_set = set([
    ('A1','John','M',10),
    ('A2','Jane','F',10),
    ('A3','Josh','M',20),
    ('A4','John','M',20),
    ('A5','Janet','F',15),
    ('A6','January','F', 10),
])

fresh_set = set([
    ('A2','Jane','F',10),
    ('A3','Josh','M',20),
    ('A4','John','M',20),
    ('A5','Janet','F',15),
    ('A6','January','F', 5),
    ('A7','Jessica','F', 10),
])

def compare():
    sym_diff = existing_set.symmetric_difference(fresh_set)
    seen, dupes = set(), set()
    for elt in sym_diff:
        if elt[0] in seen:
            dupes.add(elt[0])
        else:
            seen.add(elt[0])

    omissions, additions, amendments  = set(), set(), set()
    for row in sym_diff.intersection(existing_set):
        if row[0] not in dupes:
            omissions.add(row)
    for row in sym_diff.intersection(fresh_set):
        if row[0] in dupes:
            amendments.add(row)
        else:
            additions.add(row)

    return len(amendments), len(omissions), len(additions)

Somehow, it actually runs slower. Would that hold with a much larger “real” dataset?

$ python test.py 
List comprehension: 3.28988599777
Difflib: 3.47232122421
Sets: 6.56343317032
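For what it’s worth, a middle road between the dict and set versions is to do set arithmetic only on the keys and leave the row contents in dicts, so the field-level comparison happens just for the shared IDs. A sketch (`compare_keysets` is a made-up name, and it returns counts in the same order as the original compare()):

```python
def compare_keysets(existing, fresh):
    # Set arithmetic on the ID sets finds additions and
    # omissions without touching row contents at all.
    existing_ids = set(existing)
    fresh_ids = set(fresh)
    additions = len(fresh_ids - existing_ids)
    omissions = len(existing_ids - fresh_ids)
    shared = existing_ids & fresh_ids
    # Only the shared IDs need a field-by-field check, which
    # dict equality does at C speed.
    amendments = sum(1 for k in shared if existing[k] != fresh[k])
    nochanges = len(shared) - amendments
    return nochanges, amendments, omissions, additions
```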
Asked June 29, 2010 by Ben Welsh

11 Answers

What if instead of doing the comparison yourself, you outsourced it to git?

Sunlight dumps its API data into a git repo here, which has the side effect of giving every file (which are huge CSVs) an instant history.

Here's the history of Congress: http://github.com/sunlightlabs/apidata/commits/master/legislators/legislators.csv

It's not a perfect comparison, but it would give you a quick look at where the changes are.

The way that you're doing it -- really parsing the CSV and iterating through each value -- seems like the most robust way to go. I'd be wary about changing it unless you really can't afford to take the time. But if you want to try something a little more quick and dirty, take a peek at this:

http://docs.python.org/library/difflib.html

Which is apparently part of the Python stdlib. (I'm not a Pythonista, so take all of this with a grain of salt). Since it looks like the CSVs are nicely line delimited, you might get great results from using a diff algorithm to hand you the lines that changed, and then you only have to parse and analyze those. difflib.Differ.compare looks like a good place to start.

  1. Thanks for the advice. I took a stab at using it above, though I’m hardly a master at using the library. Let me know what you think.

To elaborate on Jeremy Ashkenas' difflib.Differ() answer:

import difflib

a = '''
id,name,gender,value
A1,John,M,10
A2,Jane,F,10
A3,Josh,M,20
A4,John,M,20
A5,Janet,F,15
A6,January,F,10
'''
b = '''
id,name,gender,value
A2,Jane,F,10
A3,Josh,M,20
A4,John,M,20
A5,Janet,F,15
A6,January,F,5
A7,Jessica,F,10
'''

a = a.strip().splitlines()
>>> ['id,name,gender,value', 'A1,John,M,10', 'A2,Jane,F,10', 'A3,Josh,M,20', 'A4,John,M,20', 'A5,Janet,F,15', 'A6,January,F,10']

b = b.strip().splitlines()
>>> ['id,name,gender,value', 'A2,Jane,F,10', 'A3,Josh,M,20', 'A4,John,M,20', 'A5,Janet,F,15', 'A6,January,F,5', 'A7,Jessica,F,10']

d = difflib.Differ()
diff = d.compare(a, b)
print '\n'.join(diff)

>>>   id,name,gender,value
>>>  - A1,John,M,10
>>>    A2,Jane,F,10
>>>    A3,Josh,M,20
>>>    A4,John,M,20
>>>    A5,Janet,F,15
>>>  - A6,January,F,10
>>>  ?              ^^
>>>  
>>>  + A6,January,F,5
>>>  ?              ^
>>>  
>>>  + A7,Jessica,F,10

See http://docs.python.org/library/difflib.html for all the details.

— John

  1. Basically, the goal is to fly by the no-changes and then run a series of operations on the amendments, omissions and additions, with different operations depending on which of the three it is.

  2. palewire —
    1.) Oops. Forgot the a.sort(), b.sort() to do an in-place sort on them both prior to finding the diffs.

    I’m not sure of what your overall goal is, but the difflib stuff generates a string in a format that when applied to the other item, makes the two match. That concept may or may not be exactly what you’re after.

    2.) Are the IDs unique and the name, gender, values of a given ID immutable? Again, not really knowing what exactly it is you’re after, but if the above is true, you can just check sorted ID lists against each other. Additions are prepended with a “+”

  3. John, that’s really interesting. Thanks for sharing your code.

    A couple questions:

    1) What if the records come in a different sequence each snapshot, with the possibility of new records being inserted randomly throughout the file? If the ordering is staggered between the files, could it effectively compare rows with identical IDs against each other? Or would it be necessary to remove all additions and omissions prior to the diff, and then sort what’s left by ID for the code you posted above?

    2) What would be the most effective way of filtering the results above to get only the amendments?
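On question 2, one hedged sketch for filtering the Differ output (`classify` is a made-up helper name): lines beginning with `- ` and `+ ` that share a leading ID are amendments, unmatched `- ` lines are omissions, unmatched `+ ` lines are additions, and the `? ` guide lines fall through untouched. This assumes the ID is the first comma-separated field:

```python
def classify(diff_lines):
    # Bucket Differ output by its two-character prefix, keyed
    # on the leading ID field; '? ' guide lines match neither
    # prefix and are ignored.
    removed = dict((l[2:].split(',')[0], l[2:])
                   for l in diff_lines if l.startswith('- '))
    added = dict((l[2:].split(',')[0], l[2:])
                 for l in diff_lines if l.startswith('+ '))
    # An ID that was both removed and added was amended.
    amendments = dict((k, (removed[k], added[k]))
                      for k in removed if k in added)
    omissions = [v for k, v in removed.items() if k not in added]
    additions = [v for k, v in added.items() if k not in removed]
    return amendments, omissions, additions
```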

Just for shits and giggles, here's a version of your original algorithm done in CoffeeScript, running on Node.js. It may or may not be a great deal faster. I'd be glad to benchmark it for you if you can provide a link to the CSV that you're using for testing.

# The counters for our results.
nochanges = amendments = omissions = additions = 0

# Compares two data objects against each other.
compare = (existing, fresh) ->
  for id, current of existing
    if next = fresh[id]
      delete fresh[id]
      if changed current, next
        amendments += 1
      else
        nochanges += 1
    else
      omissions += 1
  for id of fresh
    additions += 1

# Are the contents of two different objects different?
changed = (first, second) ->
  for key, value of first
    return yes if second[key] isnt value
  return no

If your computer is anything like mine it'll have extra processors lying around begging to be used. The first thing I'd do in this situation is take a look at Python's multiprocessing module, so I could do a distributed map and then a single-process reduce of the data.

My python's really rusty, but here's what I came up with:

from multiprocessing import Pool, Manager


existing_dict = {
    'A1': {'name': 'John', 'gender': 'M', 'value': 10},
    'A2': {'name': 'Jane', 'gender': 'F', 'value': 10},
    'A3': {'name': 'Josh', 'gender': 'M', 'value': 20},
    'A4': {'name': 'John', 'gender': 'M', 'value': 20},
    'A5': {'name': 'Janet', 'gender': 'F', 'value': 15},
    'A6': {'name': 'January', 'gender': 'F', 'value': 10},
}
# we need to make sure we place the fresh_dict in a shared memory location
# so compare removes unchanged data.
m = Manager()
fresh_dict = {
    'A2': {'name': 'Jane', 'gender': 'F', 'value': 10},
    'A3': {'name': 'Josh', 'gender': 'M', 'value': 20},
    'A4': {'name': 'John', 'gender': 'M', 'value': 20},
    'A5': {'name': 'Janet', 'gender': 'F', 'value': 15},
    'A6': {'name': 'January', 'gender': 'F', 'value': 5},
    'A7': {'name': 'Jessica', 'gender': 'F', 'value': 10},
}
fresh_dict_shared = m.dict(fresh_dict)


def compare(id_):
    """
    Compares two rows from similar dicts against each other.
    """
    nochanges, amendments, omissions = 0,0,0
    existing_data = existing_dict[id_]
    # Try to find the corresponding record in the fresh data
    fresh_data = fresh_dict.get(id_, None)
    # If it's there...
    if fresh_data:
        # Determine if there are any changes
        if is_diff(existing_data, fresh_data):
            amendments = 1
        else:
            nochanges = 1
            del fresh_dict_shared[id_]
    else:
        omissions = 1    
    return [nochanges, amendments, omissions]

def is_diff(existing_row, fresh_row):
    change_list = [field for field in existing_row.keys()
        if existing_row.get(field) != fresh_row.get(field)]
    if change_list:
        return True
    return False


def reducer(memo, item):
    return [memo[i] + item[i] for i in range(len(memo))]

if __name__ == '__main__':
    pool = Pool(processes=4) # fiddle with the number of processes for what works for you.
    reduced = reduce(reducer, pool.map(compare, existing_dict.keys()), [0, 0, 0]) # where the magic happens
    # fresh_dict_shared still holds the amendments and the additions,
    # so subtract the amendment count to get the additions alone.
    additions = len(fresh_dict_shared) - reduced[1]
    print "No change:%s; Amendments:%s; Omissions:%s; Additions:%s;" % (reduced[0], reduced[1], reduced[2], additions)
    print fresh_dict_shared

The trick here is the map and reduce at the bottom. map calls compare for each of existing_dict's keys and returns a list marking whether each row was unchanged, amended or omitted. reduce then sums those lists index by index. The tricky bit is that each process gets its own copy of the data, and in map we're changing the fresh data as we go to remove unchanged rows. Python's shared-memory Manager class keeps track of those changes and ensures that fresh_dict_shared is truly global across all processes.

I have no idea if this is faster at all, but I'd give it a try at least. And be warned there's a bit of overhead in starting the processes at the outset.

  1. Oh also, I indented `del fresh_dict_shared[id_]` b/c I thought you’d care about amendments as well.

Hi, I also wrote an article about comparing two data sets using Excel, a text editor or a database engine. If you'd like to check it out, here is the link: http://efficient-work.blogspot.com/2012/10/compare-two-sets-of-data.html Feedback is welcome!
