Building Data

March 18, 2011

[This is a post that first appeared in my DayOfDH blog.]

There are few instances when the data you have are exactly what you need, particularly for feeding into analytic tools of various kinds. This is one of the reasons I think it’s so important for a significant subset of digital humanists to have some basic programming skills; you don’t need to be building tools, sometimes you’re just building data. Otherwise, not only are you dependent on the tools built by others, but you’re also dependent on the data provided by others.

Recently I’ve been working on better integration of Voyeur Tools with the Old Bailey archives (via Zotero) for the Digging into Data funded With Criminal Intent project. Currently we have a cool prototype that allows you to do an Old Bailey query and to save the results as a Zotero entry and/or send the document directly to Voyeur Tools. The URL that is sent to Voyeur contains parameters that cause Voyeur to go fetch the individual trial accounts from Old Bailey via its experimental API. That’s great except that going to fetch all of those documents adds considerable latency to the system, which for larger collections can cause network timeouts. It would be preferable to have a local document store in Voyeur Tools, which is some of what I’ve been working on today.

The Old Bailey archive (which I don’t believe is available for general download) comes in the form of 2,164 XML documents that weigh 1.7GB. Each XML document actually contains multiple trial accounts, so the first task is to break apart the documents into separate files. Truth be told, I could have probably asked our colleagues at Old Bailey for the data in that format, but where’s the fun in that? :) I’m not aware of a good tool for doing exactly this task: going through a set of XML documents and building another set of XML documents based essentially on an XPath expression. So I had to build one.
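To make the task concrete, here is a minimal Python sketch of that kind of splitter. It is not the tool I built, and it uses the standard library’s ElementTree, which supports only a subset of XPath and may not reproduce the source byte for byte (comments, for instance, are dropped). The function name, the path expression, and the id attribute are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_by_path(source_dir, target_dir, path, id_attr="id"):
    """Write every element matching `path` in each XML file to its own file.

    `path` uses ElementTree's limited XPath subset. `id_attr` is an assumed
    attribute used to name the output files; a counter is the fallback.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    count = 0
    for xml_file in sorted(Path(source_dir).glob("*.xml")):
        root = ET.parse(str(xml_file)).getroot()
        for node in root.iterfind(path):
            count += 1
            name = node.get(id_attr) or f"{xml_file.stem}-{count}"
            node.tail = None  # drop text that followed the element in the source
            ET.ElementTree(node).write(
                str(target / f"{name}.xml"), encoding="utf-8", xml_declaration=True
            )
    return count
```

A call might look like split_by_path("oldbailey", "trials", ".//div1[@type='trialAccount']"), where the directory names and the element and attribute names are guesses at the markup, not something to rely on.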

Inspired in part by a recent thread on DHAnswers, I decided to write an XSL stylesheet that would read all of the original XML documents in the directory, find relevant trial account nodes, and concatenate them into a new document. I considered doing this with a scripting language like PHP or Ruby but I thought maybe the XML libraries would be a little too intrusive (they would essentially read the XML into native data structures but I wasn’t sure that those native structures could be serialized back faithfully to XML). I thought maybe I’d do it in Java using less intrusive libraries, but that seemed like overkill, especially since the XSL solution was so clean and simple. The whole thing came down to an XML file to transform and an XSL file that did the transformation. Easy enough, except that we’re working with 1.7GB of data, which of course requires a significant amount of memory. I tend to use Oxygen in Eclipse, so I needed to increase the memory settings of Eclipse – I probably could have gotten away with less, but in the end I used -Xmx6146m, giving about 6GB of RAM to the JVM (I have a total of 8GB on my machine – using settings this high inevitably means a lot of page swapping, which can actually slow things down a lot, but you got to do what you got to do).

So now I have a huge XML document that I need to break up into smaller documents based on the delimiters I’ve set. I created a quick PHP script to do just that – it reads through the file one line at a time (to not overload memory) and outputs the buffer once it thinks it has read a full trial account. Now that I’m writing this up (here and now in this blog post), it’s not quite clear to me why I didn’t write the PHP script to just read through the original files to do the same operation as it’s doing on the enormous file, but it’s too late now. This is why documentation and write-ups are important though, because as you’re explaining something you’ve done you sometimes discover better ways of doing it.
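For the curious, the line-at-a-time approach looks roughly like this in Python (my script was in PHP, and this is only a sketch of the idea). The delimiter string, the output file naming, and the function name are all hypothetical.

```python
from pathlib import Path


def split_on_delimiter(big_file, target_dir, delimiter="<!-- trial-break -->"):
    """Split a large file into numbered files at every delimiter line.

    Only the current chunk is held in memory. The delimiter is a made-up
    marker; use whatever marker was written between the trial accounts.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    buffer = []
    count = 0

    def flush():
        nonlocal buffer, count
        if any(line.strip() for line in buffer):  # skip empty chunks
            count += 1
            out = target / f"trial-{count:06d}.xml"
            out.write_text("".join(buffer), encoding="utf-8")
        buffer = []

    with open(big_file, encoding="utf-8") as handle:
        for line in handle:  # reads one line at a time
            if line.strip() == delimiter:
                flush()
            else:
                buffer.append(line)
    flush()  # write whatever follows the last delimiter
    return count
```

Because the file is never loaded whole, memory use depends on the size of the largest trial account, not on the size of the input.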

So now we have a directory with a whopping 197,753 individual trial accounts (that’s right, nearly two hundred thousand files). As a side note, I’m actually pleasantly surprised how smoothly the OS X Finder displays and scrolls through this large collection of files. Let’s not forget that the standard unix ls command with a query parameter usually croaks on such a request (though of course there are other ways of making it talk).
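The trouble with ls is typically that the shell expands a wildcard into an argument list longer than the system allows, so one way around it is to read the directory entries directly. Here is a small Python sketch of that idea; the function name and the default pattern are just for illustration.

```python
import os
from fnmatch import fnmatch


def count_matching(directory, pattern="*.xml"):
    """Count files whose names match `pattern`.

    os.scandir yields entries lazily, so nothing is expanded into one
    giant argument list the way a shell wildcard is.
    """
    with os.scandir(directory) as entries:
        return sum(
            1 for entry in entries
            if entry.is_file() and fnmatch(entry.name, pattern)
        )
```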

Anyway, it’s probably best to avoid so many files in a single directory, and it may actually be more convenient to deal with the files organized by year or by decade. Another quick script anyone?
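A quick script for this could look something like the following Python sketch. It assumes each file name starts with "t" followed by an eight-digit date (for example t19000115-12.xml), which is my guess at a naming scheme, so adjust the pattern to whatever your files are actually called.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme: "t" + YYYYMMDD + anything + ".xml"
TRIAL_NAME = re.compile(r"^t(\d{4})\d{4}.*\.xml$")


def organize_by_year(source_dir, by_decade=False):
    """Move trial files into subdirectories named by year (or decade)."""
    source = Path(source_dir)
    moved = 0
    # Take a snapshot of the listing first, since we add folders as we go.
    for path in list(source.iterdir()):
        match = TRIAL_NAME.match(path.name)
        if not (match and path.is_file()):
            continue
        year = int(match.group(1))
        folder = f"{year // 10 * 10}s" if by_decade else str(year)
        destination = source / folder
        destination.mkdir(exist_ok=True)
        shutil.move(str(path), str(destination / path.name))
        moved += 1
    return moved
```

Files that don’t match the pattern are left where they are, which makes it easy to spot anything the naming assumption missed.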

With this new organization of content, it’s very fast and convenient to create custom documents based on years or decades – in fact, we can just use a single unix command to combine all trials from the year 1900 into one document (the result isn’t really HTML, but since it holds multiple snippets of XML without a root tag, it’s not well-formed XML either).
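The same concatenation can be sketched in Python, if only to show how little is involved. The function name and the directory and file names in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def combine_year(year_dir, output_file):
    """Concatenate every .xml file in `year_dir` into `output_file`.

    The pieces are copied as raw bytes, in file-name order, with a newline
    after each one. No root element is added, so the result is a run of XML
    snippets and not a well-formed XML document.
    """
    files = sorted(Path(year_dir).glob("*.xml"))
    with open(output_file, "wb") as out:
        for path in files:
            with open(path, "rb") as trial:
                shutil.copyfileobj(trial, out)
            out.write(b"\n")
    return len(files)
```

Something like combine_year("trials/1900", "1900.html") would do it; just keep the output file outside the directory being read so it doesn’t get picked up on a later run.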

This post, which is getting much longer than I’d anticipated, was really about pre-processing of data, especially in service of creating a local store of the Old Bailey archive for Voyeur Tools. What can we do with this collection? Stay tuned for the next post…

