Rapid Analysis of Three Years of DayOfDH

April 5, 2011

(This is a lightly edited version of a post from my DayOfDH blog; next year there are rumblings that we may be able to aggregate content from our existing blogs instead of self-plagiarizing.)

We now have three years of DayOfDH blogging archives – that’s a pretty rich record of how digital humanists describe their activities in a given day. It also constitutes an interesting corpus for practising what Geoffrey Rockwell and I have been calling rapid analysis – trying to see what one might usefully glean from a relatively quick look at digital texts using specialized tools. Our interest in this is to develop techniques that might be useful to a wide range of people in digital society – for instance, students doing preliminary research or journalists compiling materials for an article.

Building the corpus was relatively easy, made even easier by a few tweaks kindly done by the DayOfDH team. I downloaded a full RSS archive of each year like this (I did this on the command line, but you can just open each quoted URL in the browser and save it if that seems more convenient):

I tried uploading the documents into Voyeur Tools, but I encountered problems with the XML from the 2011 archive – it contains illegal control characters that invalidate the XML (which is produced programmatically by the PHP in WordPress, so these kinds of problems are fairly common). One of the easiest ways to resolve control character issues is to use the free TextWrangler editor for Mac – there’s a function charmingly called “Zap Gremlins” that does just that. Once the problems were fixed, I could proceed to upload the files to Voyeur Tools. (This can be done by going to the main page, selecting the options icon (the gear above and to the right of the large text box), choosing RSS2 as the Input Format, clicking OK, clicking the upload button, adding each of the downloaded files, and then clicking “Upload”.) Voyeur Tools has a special input format called RSS2 that essentially concatenates all of the text in the description tags – you can also ask it to create separate documents for each blog entry, but here I wanted all blog entries in each file to be combined into one document.
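For readers without TextWrangler, the same clean-up can be approximated in a few lines of Python. This is only a sketch under stated assumptions: it strips the C0 control characters that XML 1.0 forbids (it is not a reimplementation of "Zap Gremlins"), and it then joins the description text of each item, which is a guess at what the RSS2 input format does rather than Voyeur Tools' actual code. The function names and the sample feed are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 does not allow
# (tab, line feed and carriage return are legal and are kept).
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def zap_gremlins(raw):
    """Remove control characters that would make the XML unparseable."""
    return ILLEGAL_XML_CHARS.sub("", raw)


def combine_descriptions(rss_text):
    """Join the <description> text of every <item> into one document.

    Assumption: only item-level descriptions are wanted, not the
    channel-level description.
    """
    root = ET.fromstring(zap_gremlins(rss_text))
    parts = []
    for item in root.iter("item"):
        text = item.findtext("description", default="").strip()
        if text:
            parts.append(text)
    return "\n".join(parts)


# Hypothetical miniature feed with a stray vertical-tab character.
sample = (
    "<rss version='2.0'><channel><title>Demo</title>"
    "<item><description>First\x0b post</description></item>"
    "<item><description>Second post</description></item>"
    "</channel></rss>"
)

if __name__ == "__main__":
    print(combine_descriptions(sample))
```

Parsing the sample without the clean-up step fails, because the parser rejects the vertical tab; with it, the two entries come out as one combined text.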

My first thought was to compare all three documents using a Wordle-like visualization called Cirrus. Like Wordle, the current algorithm of Cirrus lays items out with some random variation, which means that each time the Cirrus is loaded it can look different, even with the same text. However, unlike Wordle, Cirrus provides more information when hovering over words and also allows users to click on words to further study them with other tools. I will embed static images here, but clicking on the images will open up the actual Cirrus visualization:

[Images: Cirrus visualizations for 2009, 2010 and 2011]

Having compared this kind of visualization between documents several times before, I was struck first by how similar the results are across the three years – not all that surprising given the coherence of the three documents. Still, there are some interesting phenomena one might want to explore further:

  • it’s partly the layout, but 2011 does seem to have the clearest prominence of the words day, digital and humanities
  • 2011 also seems to have more occurrences of “dh”, perhaps signalling increased comfort with and recognition of that abbreviation (influence of Twitter?)
  • 2009 has an interesting anomaly with the word “replace”, probably in large part because Martin Holmes was doing some XSL
  • the prevalence of the word “new” seems to increase over time

The Cirrus is useful for helping to perceive some things very quickly, but is too blunt an instrument in other ways. For instance, one thing that’s not clear is that the word “library” is much more common in 2010 than in the other years (but since it’s relatively small even in 2010, it doesn’t draw attention to itself). Similarly, the word “world” spikes in 2011 – possibly because of world events – but still remains understated in the visualization. A better way of capturing some of those variations among less frequent words is to display sparklines of the relative frequencies of some terms. Consider, for instance, the following terms (the mini-graphs show relative frequencies for the three documents in the corpus in chronological order):

  • teaching [sparkline]
  • students [sparkline]
  • working [sparkline]
  • conference [sparkline]

A quick read of this might suggest that discussion in blog posts of teaching and of students has decreased since 2009, but that more people are talking about working in general and about conferences (DayOfDH coincided with several conferences such as the Society for Textual Scholarship).
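The numbers behind such sparklines are simply relative frequencies: each term's count divided by the total number of words in that year's document, so that years of different length can be compared. A minimal sketch follows, assuming a simple word tokenizer and a per-10,000-words scale (both are choices made here, not necessarily what Voyeur Tools does); the three short texts are invented placeholders, not the DayOfDH archives.

```python
import re
from collections import Counter


def relative_frequencies(text, terms, per=10_000):
    """Return each term's frequency per `per` words of `text`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {term: 0.0 for term in terms}
    return {term: per * counts[term] / total for term in terms}


def trend(corpus, term, per=10_000):
    """Relative frequency of one term in each document, in key order."""
    return [
        relative_frequencies(corpus[year], [term], per)[term]
        for year in sorted(corpus)
    ]


# Invented stand-ins for the three yearly documents.
toy_corpus = {
    "2009": "teaching students today and teaching again tomorrow",
    "2010": "working with students at the library",
    "2011": "working on a conference paper and working late",
}

if __name__ == "__main__":
    for term in ["teaching", "students", "working", "conference"]:
        values = trend(toy_corpus, term)
        print(term, [round(v, 1) for v in values])
```

Each printed list has one value per year in chronological order, which is the series a sparkline would draw.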

As some of you may know from recent blog posts, my new pet tool is the Correspondence Analysis Scatterplot visualization (thanks to Geoffrey for nagging me for several months to build it). It helps with this kind of inquiry because it can show affinities between terms and documents by forcing both to be plotted in the same Cartesian space. In other words, though a term like “digital” appears in all documents, it needs to be plotted in a specific location, in this case closest to 2011. I regard this kind of determinism with the utmost suspicion, but at the same time it can be useful to see which terms the computer tells us cluster statistically around certain documents (given the list of high frequency terms that are specified).

[Image: ScatterPlot]
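For the curious, the textbook form of correspondence analysis can be sketched with NumPy: build a term-by-document table of counts, subtract what would be expected if terms were spread evenly across documents, and use a singular value decomposition to find the two axes that best separate what remains. This is the standard formulation, offered as an assumption about the general technique rather than as the code behind the Scatterplot tool. The counts below are invented purely for illustration.

```python
import numpy as np


def correspondence_analysis(counts):
    """Return 2-D principal coordinates for rows and columns.

    `counts` is a terms-by-documents table with no all-zero row or column.
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                 # correspondence matrix
    r = P.sum(axis=1)               # row (term) masses
    c = P.sum(axis=0)               # column (document) masses
    expected = np.outer(r, c)       # independence model
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sigma) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
    inertia = sigma ** 2            # variance captured by each axis
    return row_coords[:, :2], col_coords[:, :2], inertia


# Invented counts: rows are terms, columns are 2009, 2010, 2011.
terms = ["digital", "teaching", "library", "world"]
years = ["2009", "2010", "2011"]
toy_counts = [
    [30, 35, 60],
    [25, 15, 10],
    [5, 20, 6],
    [4, 5, 18],
]

if __name__ == "__main__":
    term_xy, year_xy, inertia = correspondence_analysis(toy_counts)
    for name, xy in zip(terms, term_xy):
        print(name, np.round(xy, 3))
    for name, xy in zip(years, year_xy):
        print(name, np.round(xy, 3))
```

Plotting the term and year coordinates on the same axes gives the kind of picture shown in the scatterplot. With only three documents there are at most two meaningful axes, so nothing is lost by keeping two dimensions.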

There’s clearly a lot to explore here. As always, I feel compelled to emphasize that I don’t think these tools produce significant results on their own; they’re really a means of exploring the text and interpreting data to search for insights that need to be further explored and studied. What do you notice about three years of DayOfDH using the ScatterPlot skin or the new Simple skin?
