Extracting tables from a PDF

5

The mayor of Detroit just published his proposed budget as a series of PDFs. They look like they contain tables, but I think the text is only aligned using spaces — in any case, it doesn’t copy neatly into Excel.

Is there a system that can create order from poorly formatted PDFs? It seems like a long shot, but spaces are so frequently used for tabular alignment that I’m hoping something exists.

Asked April 13, 2010


7 Answers

5

You want xPDF.

It gives you a command-line tool to scrape everything out of a PDF, preserving the layout. It's probably still not going to be pretty, but it's a step in the right direction.

The other thing to do, of course, is tell the city that PDFs are bad (because they lock up data and make analysis more difficult), and to release spreadsheets in the future.


785
2

I haven't used it, but Kyle Cronan has been working on a Python library that deals with tables specifically.


351
2

Hey Matt, it appears that pdftotext, part of the xPDF suite of tools, should get you part of the way. If you're on a recent version of Ubuntu Linux, the tool should come packaged with the OS.

To find out, go to the command line and type "which pdftotext". You can do the same on other flavors of Linux. If you're on Windows, you'll have to download the software.
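If you'd rather check from a script, here is a minimal Python sketch of the same test. The helper name `have_tool` is my own invention for illustration; `shutil.which` is the standard-library equivalent of the shell's `which`.

```python
import shutil


def have_tool(name):
    """Return True if an executable called `name` is found on the PATH."""
    # shutil.which returns the full path to the executable, or None.
    return shutil.which(name) is not None


if not have_tool("pdftotext"):
    print("pdftotext not found; install the xPDF tools first")
```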

I tested one of the budget docs using the command below, and it appears to retain the formatting of the original.

$ pdftotext -layout EB10-11CityClerk_stamped.pdf

That said, it appears that the data is wrapped by a lot of narrative (at least in this document), so you're going to have a lot of work ahead of you in terms of extracting the budget data.

But like Chris Amico said, pdftotext will at least get you started in the right direction.
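As a possible next step, here is a rough Python sketch (not tested against the Detroit files) that runs pdftotext with the -layout flag and keeps only the lines that look like table rows. It assumes columns are separated by runs of two or more spaces and that narrative lines use single spaces; both are guesses about these documents, and the function names are my own. Passing "-" as the output file tells pdftotext to write to standard output.

```python
import csv
import re
import subprocess

# Assumption: two or more whitespace characters mark a column boundary.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_row(line):
    """Split one layout-preserved line into cells on runs of 2+ spaces."""
    return COLUMN_GAP.split(line.strip())


def layout_text(pdf_path):
    """Run `pdftotext -layout` and return the extracted text."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def rows_from_text(text, min_cells=2):
    """Keep lines that split into at least `min_cells` cells.

    Narrative sentences normally come out as a single cell and are dropped.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cells = split_row(line)
        if len(cells) >= min_cells:
            rows.append(cells)
    return rows


def write_csv(rows, csv_path):
    """Write the rows to a CSV file that Excel can open."""
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

Usage would be something like write_csv(rows_from_text(layout_text("EB10-11CityClerk_stamped.pdf")), "clerk.csv"), where the output name is arbitrary. Expect to clean up by hand afterwards: headings, wrapped cells, and justified text can all fool the two-space rule.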


175
1

A plugin I deploy on almost every site is My Page Order. Many WP-as-CMS sites use lots of Pages, as well as Posts. And the menu system is derived from the nested hierarchy of pages. Position in menu is derived from the page "order" option in the editor. My Page Order lets you set these page weights visually, via drag and drop, so site editors have the ability to control the order of menu items without going to the programmer.

However, I understand that WordPress 3 will adopt the kick-ass Woo Themes menu building system, so My Page Order may become moot pretty soon.



90
1

There are actually lots of ways PDFs handle tabular data: sometimes the underlying structure is HTML (for example, if you print a web page to PDF), sometimes it's spaces, tabs, etc. I think it depends on the application that created the PDF.

I believe the cool tool for the budgetarily privileged is Monarch Pro -- but I only know it by its legend.


60
1

I use Abbyy Reader OCR. It recognises tables and exports them into Excel.


10
1

I'm a big fan of Able2Extract, which has handled a wider variety of tables more gracefully than any other PDF-to-Excel program I've run across.

And they give you a 7-day trial, so you can see if it works for you before plunking down the cash.

http://www.investintech.com/able2extract.html


10
1

Another option for the "budgetarily privileged" is Adobe Acrobat Professional (the whole app, not just the free Reader). How much of your time does ~ US$150 buy?

It has the following options and sub-options (what the heck're PDF/A, PDF/X anyhow?):

File > Export > Word Document
                Rich Text Format
                XML 1.0
                ----------------
                HTML           >
                ----------------
                Image          >
                ----------------
                Text           >
                ----------------
                Postscript     >
                ----------------
                PDF/A
                PDF/X
