Tagging and annotating a structured (XML) document

1

In the Medill newsroom today, I was talking to a classmate who wanted to read/research/blog about the recently passed federal health care legislation. She wanted to tag and make notes about the legislation so she could later identify which pieces of the legislation would be relevant to different health care users.

We were able to find XML versions of legislation on thomas.loc.gov (e.g. http://thomas.loc.gov/home/gpoxmlc111/h3590_eh.xml). Each section of the legislation even has a unique ID.

Does anyone know of any tools that would let someone associate tags/notes with a particular DOM element in an arbitrary XML doc? What about other kinds of structured documents?

Asked April 12, 2010

3 Answers

3

This is an interesting question. I think what you are imagining is some sort of tool that would put your annotations side-by-side with the XML--I don't know of anything like that and I think there is a particular reason why such a thing is unlikely to exist.

One general selling point of XML is its remarkable (some say obsessive) flexibility. You can leverage that to solve this problem by adding your own elements or attributes to the existing XML structure. To use an example from the document you referenced:

<paragraph id="H21CBDF6F55764600BD20073B7FFE5D8D"><enum>(1)</enum><text>in subparagraph (1) by striking <quote>this subsection) to offset the adverse effects on housing values as a result of a military base realignment or closure</quote> and inserting <quote>the American Recovery and Reinvestment Tax Act of 2009)</quote>, and</text> </paragraph>

Could become (using elements):

<paragraph id="H21CBDF6F55764600BD20073B7FFE5D8D"><enum>(1)</enum><text>in subparagraph (1) by striking <quote>this subsection) to offset the adverse effects on housing values as a result of a military base realignment or closure</quote> and inserting <quote>the American Recovery and Reinvestment Tax Act of 2009)</quote>, and</text> <notes>My notes go here.</notes><tags>military-base arrta</tags></paragraph>

Or (using attributes):

<paragraph id="H21CBDF6F55764600BD20073B7FFE5D8D" notes="My notes go here." tags="military-base arrta"><enum>(1)</enum><text>in subparagraph (1) by striking <quote>this subsection) to offset the adverse effects on housing values as a result of a military base realignment or closure</quote> and inserting <quote>the American Recovery and Reinvestment Tax Act of 2009)</quote>, and</text> </paragraph>

(I tend to prefer the latter, but YMMV.)

Perhaps the greatest benefit of doing this is that all the best XML tools will continue to function with your new annotations: XSLT for presentation, XPath for queries, etc. XPath, for example, would allow you to easily search the document for all passages with a certain tag. Of course you will have to be careful to keep the source document intact as you edit the XML, but that should always be a consideration when working with structured data.
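
To make the "search by tag" idea concrete, here is a minimal sketch in Python using the standard library's ElementTree, which supports only a subset of XPath. It assumes the attribute form of annotation, with tags stored as a space-separated list in a tags attribute. The sample document, the ids p2 and p3, and the helper name find_by_tag are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample; only the first id comes from the bill.
SAMPLE = """<section>
  <paragraph id="H21CBDF6F55764600BD20073B7FFE5D8D"
             notes="My notes go here." tags="military-base arrta">
    <text>first passage</text>
  </paragraph>
  <paragraph id="p2" tags="insurance"><text>second passage</text></paragraph>
  <paragraph id="p3"><text>untagged passage</text></paragraph>
</section>"""


def find_by_tag(root, tag):
    """Return the ids of elements whose space-separated tags include `tag`."""
    matches = []
    # ".//*[@tags]" selects every descendant that carries a tags attribute.
    for elem in root.iterfind(".//*[@tags]"):
        # Split on whitespace so "military" does not match "military-base".
        if tag in elem.get("tags", "").split():
            matches.append(elem.get("id"))
    return matches


root = ET.fromstring(SAMPLE)
print(find_by_tag(root, "arrta"))
```

The whole-word comparison is done in Python rather than in the path expression because ElementTree's XPath subset has no contains() function; a fuller XPath engine could do the match in the query itself.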

Doing this effectively will require a great XML editor, and unfortunately most XML editors are anything but great. If you are going to do a huge amount of annotating, a small custom tool would probably be ideal, but that requires someone with the programming expertise to build it. Without that person, finding an editor that suits the task is your best bet. (Check this Stack Overflow question as a jumping-off point for selecting an editor.)
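
For a sense of how small such a custom tool could start out, here is a rough sketch in Python with the standard library's ElementTree. The function name annotate and the trimmed one-paragraph document are hypothetical; the notes and tags attributes follow the attribute example above.

```python
import xml.etree.ElementTree as ET


def annotate(root, element_id, notes=None, tags=()):
    """Attach notes and tags to the element whose id attribute matches."""
    for elem in root.iter():
        if elem.get("id") == element_id:
            if notes is not None:
                elem.set("notes", notes)
            if tags:
                # Merge with any existing tags, keeping order and skipping repeats.
                merged = elem.get("tags", "").split()
                merged += [t for t in tags if t not in merged]
                elem.set("tags", " ".join(merged))
            return elem
    raise KeyError(f"no element with id {element_id!r}")


# Hypothetical trimmed document; a real run would parse the downloaded bill.
doc = ET.fromstring(
    '<section><paragraph id="H21CBDF6F55764600BD20073B7FFE5D8D">'
    "<enum>(1)</enum></paragraph></section>"
)
annotate(
    doc,
    "H21CBDF6F55764600BD20073B7FFE5D8D",
    notes="My notes go here.",
    tags=["military-base", "arrta"],
)
print(ET.tostring(doc, encoding="unicode"))
```

One caveat: ElementTree's default parser discards comments and the DOCTYPE when a file is read and written back, so a real tool should write to a copy and leave the downloaded source untouched.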

To summarize, XML is a language for annotated information and therefore, ironically, both your problem and your solution. Rather than looking to add another layer of abstraction, extend the existing document structure to add the information you need.

2

If it were me, I'd go with a database. What it sounds like you're doing is actually somewhat similar to what we did with the State of the Union.

Basically, since each section has a unique ID, you just need to store a reference to it in a database, and whatever notes you want can point to that section. Here's how to do it with Python.
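
A minimal sketch of that idea (not the State of the Union code itself), assuming SQLite through Python's built-in sqlite3 module. The table layout and the function names open_store, add_note, and sections_with_tag are made up for illustration; the only thing taken from the bill is the section ID used as the key.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small annotation store keyed by section id."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        " id INTEGER PRIMARY KEY,"
        " section_id TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        " section_id TEXT NOT NULL,"
        " tag TEXT NOT NULL,"
        " UNIQUE (section_id, tag))"
    )
    return conn


def add_note(conn, section_id, body, tags=()):
    """Store one note and any tags for a section of the XML document."""
    conn.execute(
        "INSERT INTO notes (section_id, body) VALUES (?, ?)", (section_id, body)
    )
    # INSERT OR IGNORE skips tags already recorded for this section.
    conn.executemany(
        "INSERT OR IGNORE INTO tags (section_id, tag) VALUES (?, ?)",
        [(section_id, t) for t in tags],
    )
    conn.commit()


def sections_with_tag(conn, tag):
    """Return the ids of all sections carrying the given tag."""
    rows = conn.execute(
        "SELECT section_id FROM tags WHERE tag = ? ORDER BY section_id", (tag,)
    )
    return [row[0] for row in rows]


conn = open_store()
add_note(
    conn,
    "H21CBDF6F55764600BD20073B7FFE5D8D",
    "My notes go here.",
    ["military-base", "arrta"],
)
print(sections_with_tag(conn, "arrta"))
```

The appeal of this approach is that the source XML is never modified; the trade-off is that the notes live apart from the text, so you need the section ID to join them back together.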

1

This is definitely something that was in mind when XML was developed, but I think it's an example of one of those problems which is solvable in theory but which has never really gotten much traction in practice.

For the specific case of annotating bills in the US Congress, note that OpenCongress provides annotation features, as in this page for H.R.3962.

Also, for general advocacy on this topic, check out the Citability.org project, which advocates for governments to ensure that their documents are readily citable.

You may also want to sniff around work in the scholarly text analysis community. See, for example, http://literaryinformatics.northwestern.edu/
