Analyzing Cablegate

November 30, 2010

This may be a bit warped given the greater political and philosophical implications, but one of the things I find most fascinating (and exciting) about Cablegate is how quickly several news providers scrambled to provide text analysis and visualization interfaces for readers to explore the corpus themselves (so far only about 300 documents have been released). The Guardian created a funky (faux) spy-aesthetic interface (see also their nice infographic), Germany's Der Spiegel has a nice map-oriented visualization for browsing the corpus, the Wikileaks site itself has some simple graphs, and even the CBC has a simple search interface (I don't recall the CBC doing anything like this before); the New York Times has yet to create a visualization, but I'd bet good money something's being cooked up as we speak. Text analysis matters to people (even if they call it by some other name or don't know what to call it at all), and I'd argue that it matters particularly the way that digital humanists do it (but that's a whole other post).

I thought it might be interesting to throw the corpus at Voyeur Tools. As is often the case, that's easier said than done. First, it's not entirely obvious how to get hold of the corpus. There's a link on the Wikileaks Cablegate site (near the bottom) to download the entire site. The link is provided as a torrent, which makes perfect sense in terms of distribution and decentralization, but it may introduce a significant barrier to a casual user who needs to figure out what a torrent is and which of the dizzying array of clients to use (personally I use µTorrent). Next, the actual file is in the 7-Zip file format, which is fine as a modern, open-source compression container format, but that will again introduce an additional barrier for some people who will need to figure out what application to use to open the file (unlike the more universal zip format that's built into major operating systems). Ok, so now we have the corpus downloaded, we're good to go, right?
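For what it's worth, even the unpacking step can be scripted. Here's a minimal sketch in Python using the third-party py7zr library (my choice of library, not something the Wikileaks site suggests, and the archive filename is a placeholder); a desktop 7-Zip client works just as well:

```python
# Minimal sketch: unpack the downloaded archive with the py7zr library
# (pip install py7zr). The filename here is a placeholder.
import py7zr

with py7zr.SevenZipFile("cablegate.7z", mode="r") as archive:
    archive.extractall(path="cablegate")
```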

Well, not quite. What we've downloaded is the entire site, with navigation pages organized by date, origin, tag and classification. Fine, we rummage around a bit and discover that all the documents are really in subdirectories under a directory called cables. The trouble is, each cable is actually embedded in an HTML page that contains a lot of other informational and navigational crud, which would obviously introduce a lot of noise into the system. No problem, Voyeur Tools has a handy feature that allows you to extract text from XML documents using XPath. Oops, a good number of the documents aren't valid XML, so we can't do that. It looks like our hard-won corpus requires us to write a custom script to extract the relevant text from each document. That's not such a big deal, let's pound out a simple script that does just that.
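Here's a rough sketch of what such a script might look like (not the exact one I used). It assumes the cable text sits inside the <pre> elements of each HTML page, which is worth verifying against the actual markup, and the paths are placeholders:

```python
# Rough sketch of an extraction script: walk the directory of HTML pages,
# strip the surrounding navigation and layout, and keep only the cable text.
import os
from bs4 import BeautifulSoup  # pip install beautifulsoup4

SRC = "cablegate/cables"   # directory of HTML pages (adjust to the real layout)
DST = "cablegate-text"     # where the plain-text versions go

os.makedirs(DST, exist_ok=True)

for root, _dirs, files in os.walk(SRC):
    for name in files:
        if not name.endswith(".html"):
            continue
        with open(os.path.join(root, name), encoding="utf-8", errors="replace") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        # keep only the cable text, assumed to live in <pre> elements
        text = "\n\n".join(pre.get_text() for pre in soup.find_all("pre"))
        if text.strip():
            out_name = os.path.splitext(name)[0] + ".txt"
            with open(os.path.join(DST, out_name), "w", encoding="utf-8") as out:
                out.write(text)
```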

Now that I have a directory of text files, I can create a compressed archive (zip) of the directory and submit it to Voyeur Tools; here's the result. I won't go into analysis here, but one thing that's fairly clear is that Voyeur isn't particularly optimized at the moment for corpora with hundreds of documents (it doesn't have a problem with larger documents, but the interface doesn't adjust well to having to show data on a lot of different documents at once). Always stuff to improve…
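Creating the archive itself is a one-liner with Python's standard library (the directory name matches the sketch above and is otherwise a placeholder):

```python
# Bundle the extracted text files into cablegate-text.zip for upload to Voyeur Tools.
import shutil

shutil.make_archive("cablegate-text", "zip", "cablegate-text")
```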

In any case, I think all of this should be much easier, from retrieving the data to cleaning it to analyzing it. But I think we should assume that corpora in the wild often won't be any easier; I don't think Wikileaks will provide us with nice TEI documents :). The crucial question then becomes: how do we ensure that digital humanists working with texts have the knowledge required to solve the thorny problems of working with real text collections? Writing a script to extract text is not difficult, unless you've never done it and don't know where to start. Writing regular expressions, for instance, is arguably a fundamental arrow in the digital humanist's quiver (at least for text-archers), but how many digital humanists know how (a small example follows below)? These skills are actually fairly hard to learn at first and they tend toward the practical (whereas most universities pride themselves on focusing on the theoretical), so how do we reconcile that tension? I think there are a lot of answers to that question, but one thing that emerged clearly from this summer's panel on Understanding the Capacity of Digital Humanities at DH2010 in London is that project-based mentorship can be a powerful tool for training digital humanists. By working on real-world DH projects, students will encounter challenges like the ones described here, and peers and mentors can provide just-in-time guidance to help solve them. Personally, I'd like to spend less time in a classroom teaching and more time in a lab working with students on various projects, and I think we'd all get more out of it. This, to me, is an example of how digital humanities teaching needs to consider how and why it might differ from humanities teaching (if you'll pardon me the gross over-generalization).
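Coming back to the regular-expressions point for a moment, here's the kind of two-minute snippet I have in mind: pulling cable-style reference identifiers out of raw text. The pattern is purely illustrative, not a validated parser for the corpus:

```python
# Illustrative only: find cable-style identifiers such as "09BERLIN1234" in a string.
import re

sample = "Reference cables 09BERLIN1234 and 10STATE12345 discuss the meeting."
cable_ids = re.findall(r"\b\d{2}[A-Z]+\d+\b", sample)
print(cable_ids)  # ['09BERLIN1234', '10STATE12345']
```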

Ok, so I haven’t blogged in about 6 months and now I think I’ve stitched at least three posts together. And I didn’t even get around to doing any analysis…
