Trying to get Kenyan Parliament to export XML. Need to clarify what we want.


My friend Ory (who you may know as founder of Ushahidi) is trying to talk the Kenyan Parliament web developer into exporting XML instead of PDFs for their official transcripts (a document called the Hansard).

The good news is, they’re receptive. The bad news is they speak heavy geek, and we can’t communicate well. She got this answer, which is pretty opaque to her and me both:


Hi ory,

Sorry for responding late. Well on the PROTOCOL REXML, When two processes located at different nodes communicate with each other,
the interface of the communication can be implemented in two ways:

1) as a protocol; the data are defined as program structures and sent
finally as binaries.
2) as exchange of XML files; the data are defined in text files by means of
XML syntax and sent as strings.

How could you compare the performance of these two approaches?
I would go for XML files but I’m afraid that this can slow down the
communication. Thus the time out if not increased counters the reXML time out.

Read http://code.google.com/apis/protocolbuffers/docs/faq.html


Huh? Can someone translate this, and suggest the best-practice answer (or some kind of documentation) to respond with?

Cheers,
Jonathan Eyler-Werve

Asked May 28, 2010


3 Answers


As I read it, the developers are wondering whether XML is too inefficient compared to other approaches, such as Google's "protocol buffers."

This is a good question for them to be asking, but XML is the right choice. The goal is presumably to make transcripts broadly available to people, rather than enable super-fast machine-to-machine communication. So processing efficiency is not critical (and in any case, though XML can be overkill for some tasks, it's a good fit for marking up texts). The key issue is accessibility, and here XML has a huge advantage: it's widely understood by developers, and even a novice can read an XML document and make a guess at what it means.

Conveniently, the overview page on the site they mention has a clear, geek-friendly explanation. A quote:

However, protocol buffers are not always a better solution than XML – for instance, protocol buffers would not be a good way to model a text-based document with markup (e.g. HTML), since you cannot easily interleave structure with text. In addition, XML is human-readable and human-editable; protocol buffers, at least in their native format, are not. XML is also – to some extent – self-describing. A protocol buffer is only meaningful if you have the message definition (the .proto file).

So one way to respond to the developers would be to outline these reasons and point to this geek-friendly explanation. To make a "best practice" argument you could also show examples of how other governments are doing this. I don't know whether a U.S. example would be helpful in this situation, but if so then http://xml.house.gov/ might be a good reference.
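To make the "interleave structure with text" point from the quote concrete, here is a minimal Python sketch. The element names (`speech`, `p`, `bill`, `time`), the attributes, and the sample sentence are all invented for illustration; they are not taken from any real Hansard schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical Hansard fragment: tags and attributes are made up for this sketch.
SAMPLE = """<speech speaker="Hon. Member A" date="2010-05-27">
  <p>I rise to support the <bill>Finance Bill</bill>, which was tabled
  <time>last week</time>.</p>
</speech>"""

speech = ET.fromstring(SAMPLE)
paragraph = speech.find("p")

# itertext() yields the text of the paragraph and of its nested elements in
# document order, so the sentence survives the inline markup intact.
plain_text = " ".join("".join(paragraph.itertext()).split())

# The same markup lets a program pull out just the tagged pieces.
bills = [bill.text for bill in paragraph.iter("bill")]

print(plain_text)
print(bills)
```

The sentence stays readable as plain text, while a program can still pick out the bill that was mentioned. That mix of running text and inline structure is what the quoted passage says protocol buffers handle poorly.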



What Martin said. XML, despite its bad rep, is exactly the right format for text documents (as opposed to data, where it is justly criticized).

A simple XML schema should be defined: one which requires the minimum of pre-processing before the documents are made public digitally. Schema design can become an awful drag as people seek the "perfect" structural abstraction.
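As a rough sketch of how small such a structure could be, the Python below builds one sitting from a list of (speaker, text) pairs. The `sitting` and `speech` element names, the `date` and `speaker` attributes, and the sample speeches are assumptions made up for this example, not a proposed standard.

```python
import xml.etree.ElementTree as ET


def build_sitting(date, speeches):
    """Build a minimal <sitting> element from (speaker, text) pairs."""
    sitting = ET.Element("sitting", {"date": date})
    for speaker, text in speeches:
        speech = ET.SubElement(sitting, "speech", {"speaker": speaker})
        speech.text = text
    return sitting


# Hypothetical sample data, standing in for a transcript already split by speaker.
sitting = build_sitting(
    "2010-05-27",
    [
        ("Mr. Speaker", "Order! Order!"),
        ("Hon. Member A", "I beg to move."),
    ],
)

# Serialize to a string; special characters in the text are escaped automatically.
xml_text = ET.tostring(sitting, encoding="unicode")
print(xml_text)
```

Two element types and two attributes are enough to record who said what and when. Anything richer (bills, divisions, procedural notes) can be added later without breaking consumers that only read `speech`.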

I'd also ask "why not just plain text?" It's worth knowing the answers to that question instead of assuming that they're out there somewhere: sometimes plain text is just fine.

Perhaps the only fundamental benefit XML carries is a 100% explicit definition of how characters are encoded in the file. XML could also make it clearer how the document breaks down into sections, although a well-defined directory and filename structure can sometimes provide that.
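The encoding point can be illustrated with a short Python sketch. The `speech` element and the sample sentence are invented for the example, and the `xml_declaration` argument to `ET.tostring` assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical element containing non-ASCII characters.
root = ET.Element("speech", {"speaker": "Hon. Member A"})
root.text = "The Member spoke in Gĩkũyũ."

# Serialize to bytes with an explicit XML declaration naming the encoding.
data = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# The first line of the file states how the following bytes are to be decoded.
declaration = data.split(b"\n", 1)[0]
print(declaration)

# A parser reads the declaration and recovers the original characters.
parsed_text = ET.fromstring(data).text

# The same bytes read as a bare text file under the wrong codec come out garbled.
misread = data.decode("latin-1")
```

A plain-text file carries no such declaration, so every consumer has to guess or be told the encoding out of band.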

There are more things which would be lovely to have, but if they would require a big budget for more staff or system upgrades, or if they'd delay things with infinite debates, they should be logged and deferred to a future cycle.

My main concern: keep it simple. That's good advice in any case, and if the government respondent's first language is English, then you weren't getting "heavy geek," you were getting gibberish.



+1 on opting for the XML.

(I'd bet Ory's next question is one that's partially addressed in the "What are the best tools for “scraping” data off a Web page for analysis in Excel or other software?" thread.)

