Nice! I didn’t know about pdftotext, but it works pretty well. The other tool that is liable to come in handy here is pdftk (http://accesspdf.com/pdftk) with which you can break out just the pages you want.
Extracting tables from a PDF
The mayor of Detroit just published his proposed budget as a series of PDFs. They look like they contain tables, but I think the text is only aligned using spaces — in any case, it doesn’t copy neatly into Excel.
Is there a system that can create order from poorly formatted PDFs? It seems like a long shot, but spaces are so frequently used for tabular alignment that I’m hoping something exists.
7 Answers
You want Xpdf.
It gives you a command-line tool to scrape everything out of a PDF, preserving the layout. It's probably still not going to be pretty, but it's a step in the right direction.
The other thing to do, of course, is tell the city that PDFs are bad (because they lock up data and make analysis more difficult), and to release spreadsheets in the future.
I haven't used it, but Kyle Cronan has been working on a Python library that deals with tables specifically.
Hey Matt, it appears that pdftotext, part of the Xpdf suite of tools, should get you part of the way. If you're on a recent version of Ubuntu Linux, the tool should come packaged with the OS.
To find out, go to the command line and type "which pdftotext". You can do the same on other flavors of Linux. If you're on Windows, you'll have to download the software.
I tested one of the budget docs using the command below, and it appears to retain the formatting of the original.
$ pdftotext -layout EB10-11CityClerk_stamped.pdf
That said, it appears that the data is wrapped by a lot of narrative (at least in this document), so you're going to have a lot of work ahead of you in terms of extracting the budget data.
But like Chris Amico said, pdftotext will at least get you started in the right direction.
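Once pdftotext -layout has produced a text file, the space-aligned rows still have to be pulled apart from the narrative around them. Here is a rough Python sketch of one way to do that. It assumes that columns are separated by runs of two or more spaces and that narrative lines use single spaces, which will not hold for every document. The function names and the sample text are made up for illustration; the sample is not taken from the Detroit budget.

```python
import csv
import io
import re

# Assumption: a gap of two or more spaces marks a column boundary.
COLUMN_GAP = re.compile(r" {2,}")


def split_layout_line(line):
    """Split one layout-preserved line into cells on runs of 2+ spaces."""
    return COLUMN_GAP.split(line.strip())


def layout_text_to_rows(text, min_columns=2):
    """Return only the lines that split into at least min_columns cells.

    Lines of ordinary narrative (single-spaced) collapse to one cell
    and are dropped by the min_columns filter.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cells = split_layout_line(line)
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Invented sample resembling pdftotext -layout output (not real budget data).
sample = """The department provides services to the public.

Appropriation        2009-10      2010-11
Salaries             1,000        1,200
Supplies               300          250
"""

rows = layout_text_to_rows(sample)
print(rows_to_csv(rows))
```

To try it on a real file, read the text that pdftotext wrote (by default a .txt file with the same base name as the PDF) and pass it to layout_text_to_rows. Expect false positives, for example sentences typed with two spaces after a period, and misaligned rows where a cell is empty, so the result still needs a manual check.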
A plugin I deploy on almost every site is My Page Order. Many WP-as-CMS sites use lots of Pages, as well as Posts. And the menu system is derived from the nested hierarchy of pages. Position in menu is derived from the page "order" option in the editor. My Page Order lets you set these page weights visually, via drag and drop, so site editors have the ability to control the order of menu items without going to the programmer.
However, I understand that WordPress 3 will adopt the kick-ass Woo Themes menu building system, so My Page Order may become moot pretty soon.
There are actually lots of ways PDFs handle tabular data -- sometimes the underlying structure is html (like, if you print a web page to PDF), sometimes it's spaces, tabs, etc. I think it depends on the application that created the PDF.
I believe the cool tool for the budgetarily privileged is Monarch Pro -- but I only know it by its legend.
I'm a big fan of Able2Extract, which has handled a wider variety of tables more gracefully than any other PDF-to-Excel program I've run across.
And they give you a 7-day trial, so you can see if it works for you before plunking down the cash.
http://www.investintech.com/able2extract.html
Another option for the "budgetarily privileged" is Adobe Acrobat Professional (the whole app, not just the free Reader). How much of your time does ~ US$150 buy?
It has the following options and sub-options (what the heck're PDF/A, PDF/X anyhow?):
File > Export >
    Word Document
    Rich Text Format
    XML 1.0
    HTML >
    Image >
    Text >
    Postscript >
    PDF/A
    PDF/X

hey wow look at that! My day: made.
There actually is a way to build a permalink from Thomas, but it’s pretty goofy http://thomas.loc.gov/home/handles/help.html
but yeah you can’t link to ‘em, sigh, try this: http://thomas.loc.gov/cgi-bin/query/z?c111:H.R.3590: and click on a bill.
What I don’t understand about that sunlight post is that thomas posts xml on these internets, here’s the healthcare bill: http://thomas.loc.gov/cgi-bin/query/D?c111:1:./temp/~c111vjgOPh::
Regarding Chris’s ‘tell the city comment,’ here’s Clay Johnson of Sunlight expounding, should you want to steal his words: http://sunlightlabs.com/blog/2009/adobe-bad-open-government/