In early May I had the pleasure of participating in a hackfest focused on literary text mining and visualization. The group was composed of a very loosely defined intersection of people from the CWRC, INKE and Voyant Tools projects. It was a joyous event (certainly for me, and for others as well I think), in large part because it was as productive as it was enjoyable. It was intense work, but we also interacted a lot as a team, ate extremely well, basked in the sunshine, played frisbee and foosball, and built some great prototypes. I find hackfests to be a superb mix of work and play, a moment of balanced life, albeit a surreal one that’s deliberately designed by our hackfest protocols to be ephemeral. For instance, we all abandon our normal lives (and families) to congregate at a neutral location, one where the entire team stays together (no one goes home during hackfests). Stan Ruecker, Milena Radzikowska and I have written and presented about hackfests in the digital humanities before, and we’ve tried to define some of the principles that seem important to us:
- limit meeting time to an absolute minimum (resist the temptation to endlessly discuss things)
- find a pleasant location away from normal residential distractions and temptations
- encourage a flat organizational structure for participants with a good mix of competencies working in smaller groups
- pay attention to non-work aspects (meals, snacks, activities)
- document the event (the work accomplished but also including photos of the location and people)
The distribution of expertise in academia and the forces exerted by granting agencies sometimes make it desirable to collaborate across geographically dispersed locations, even if all common sense would suggest that doing so is a less efficient way for a team to work. I like filling in endless Doodle polls, cringing at Skype feedback, and skimming through misunderstood emails as much as the next person, but sometimes an intense in-person hackfest is precisely what’s needed to provide a momentum boost for distributed collaborations.
I was mostly involved with the cryptically named non-NER non-RDF visualization team. Other small groups were working on machine learning with WEKA and named entity extraction, as well as generating RDF and visualizing triples from the Orlando Project textbase. Our group, including Mark Turcato, Andrew Macdonald, Ryan Chartier and me (with Susan Brown weaving in and out), focused on visualizing the very complex and deeply encoded Orlando textbase. The version we were working with “only” has about 1,300 biographical entries, but the XML file is 129MB and contains a whopping 1.8 million tags! This tagset is primarily semantic (not structural) and is lovingly and painstakingly applied to prose descriptions (in other words, it’s more like an encyclopedia than a database). The challenge with such fluid, semi-structured data is that it can be difficult to really know what’s there and what paths one might follow to get there, which is why we wanted to try to produce a prototype interface for exploring the tagset. Our thought was that if we could produce a representation of the actual tag structures (and not just the schema), it might provide an alternative way of navigating the textbase. To accomplish this, we first generated a list of the unique XPath expressions for the elements only (i.e. no attributes – of which there are many in Orlando – and we didn’t take into account the position of elements). We stripped out all non-element nodes and then generated a hierarchy of element-based XPath expressions (in a relatively inefficient and messy way). There were 43,332 unique XPath expressions – that’s one gnarly XML document! Once we had a nice tree hierarchy we could easily adapt several of the D3 visualizations to explore the tagset (these are static screenshots of what are mostly interactive interfaces):
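The two steps described above – collecting the set of unique element-only paths, then folding them into a tree – can be sketched roughly like this in Python. This is a hypothetical reconstruction (not our actual hackfest script, which was messier), using a toy document in place of the 129MB Orlando file; the `name`/`children` shape at the end is the nested JSON that D3’s hierarchical layouts expect.

```python
# Sketch (hypothetical): gather unique element-only XPaths from an XML
# document, ignoring attributes and element positions, then fold them into
# a {"name": ..., "children": [...]} hierarchy for D3's tree layouts.
import xml.etree.ElementTree as ET

# Tiny stand-in for the real textbase.
SAMPLE = """<entries>
  <entry><name>A</name><life><birth/><death/></life></entry>
  <entry><name>B</name><life><birth/></life></entry>
</entries>"""

def unique_element_paths(root):
    """Depth-first walk yielding every distinct /a/b/c element path,
    with positions and attributes ignored."""
    paths = set()
    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        paths.add(path)
        for child in elem:          # iterating an Element visits child elements only
            walk(child, path)
    walk(root, "")
    return sorted(paths)

def paths_to_hierarchy(paths):
    """Fold the flat path list into nested name/children dicts."""
    root = {"name": "", "children": {}}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            node = node["children"].setdefault(
                part, {"name": part, "children": {}})
    def listify(node):              # convert child dicts to the lists D3 wants
        return {"name": node["name"],
                "children": [listify(c) for c in node["children"].values()]}
    return listify(root)

paths = unique_element_paths(ET.fromstring(SAMPLE))
hierarchy = paths_to_hierarchy(paths)
print(len(paths))                    # 6 unique element paths in the toy sample
print(hierarchy["children"][0]["name"])   # entries
```

On the real file one would stream with `ET.iterparse` rather than load the whole tree, but the path-set idea is the same: duplicates collapse in the set, which is how 1.8 million tags reduce to 43,332 unique expressions.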
Personally, I find the partition treemap to be the most useful though certainly not the most attractive. I also think the radial tree is effective at conveying the complexity of the tagging, even if it’s much more difficult to read. Our hope is to hook one of these up – maybe the collapsible tree – to an interface that could show text results from that particular XPath expression.