Digital Humanities Now

Real-time Voyeur wordcloud of this DH Twitter list

The good folks at the Center for History and New Media have initiated (yet another) fantastic resource in the form of Digital Humanities Now, a real-time, crowdsourced publication. It takes the pulse of the digital humanities community and tries to discern what articles, blog posts, projects, tools, collections, and announcements are worthy of greater attention. I’m especially happy to hear that part of the motivation for DHNow is to explore how, as Dan Cohen puts it, “some version of this idea could serve as a rather decent new form of publication that focuses the attention of those in a particular field on important new developments and scholarly products”. Though perhaps not itself a quantifiable object in terms of hiring, tenure and promotion, it can certainly function effectively in that ecology to promote noteworthy content and projects. I see this as akin to participating at conferences: there’s a kind of intangible value ascribed to it by committees wanting to judge the scholarly activity of an individual (that can enhance the perception of other work). We can’t just wait for administrators to “get it”; we need to be more proactive by providing them with other tools by which to assess the value of digital humanities scholarship. It’s partly for this reason that I think it’s so important to think, as a community, about how we can get DHNow (and similar initiatives) right.

One of my first reactions was to ask how the algorithms could get a good diversity of languages if – as I’d falsely assumed – there was linguistic analysis happening in the filtering and grouping process. It turns out that DHNow uses a much simpler and more elegant mechanism for gathering content: it (via Twittertim.es) analyzes common URLs, which really makes it language and context independent (though still likely very relevant if several DHers mention the same URL).
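
To make the URL-counting idea concrete, here is a minimal Python sketch. It is not the code behind DHNow or Twittertim.es (I haven't seen it); the function name, the (user, text) input shape and the two-user threshold are all assumptions made for illustration. It simply counts how many distinct people mention each URL, without looking at the language of the tweet at all.

```python
import re
from collections import defaultdict

# Deliberately loose pattern: anything that starts with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")


def shared_urls(tweets, min_users=2):
    """Rank URLs by the number of distinct users who mentioned them.

    `tweets` is an iterable of (user, text) pairs (a hypothetical input shape).
    Returns a list of (url, user_count) pairs, most widely shared first,
    keeping only URLs mentioned by at least `min_users` different users.
    """
    users_by_url = defaultdict(set)
    for user, text in tweets:
        for url in URL_PATTERN.findall(text):
            # Drop punctuation that often trails a URL at the end of a sentence.
            users_by_url[url.rstrip(".,;:!?)")].add(user)
    ranked = sorted(users_by_url.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(url, len(users)) for url, users in ranked if len(users) >= min_users]
```

One caveat with a sketch like this: the same resource reached through two different URL shorteners would count as two different URLs, so a real system would probably need to resolve short links first.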

The URL-centric approach is useful for converging on a unique resource, but there’s a lot of discussion on Twitter that’s not oriented towards URLs (and of course there’s also a lot of DH discussion that’s not on Twitter). One huge advantage of the URL approach is that it usually produces a nice, coherent title to represent a set of tweets – it may be difficult to generate an expressive title from a tweet in the absence of a URL. In any case, what could be some possible (relatively simple) strategies for capturing a broader range of topics and discussions?

  • identify tweets that exhibit interrelatedness (within DHers) without necessarily containing URLs – replies to and retweets of DHers can suggest something of greater interest
  • identify tweets that contain common terms that are distinctive from a larger corpus of DH tweets
  • use trends
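
As a rough sketch of the second strategy (terms that are distinctive compared with a larger corpus of DH tweets), something like the following Python could be a starting point. The scoring is an assumption on my part: a simple ratio of smoothed relative frequencies, chosen for brevity rather than statistical rigour, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def distinctive_terms(recent, background, top_n=5):
    """Return terms that are proportionally more frequent in `recent`
    than in `background` (both are lists of tweet texts).

    Scores are ratios of add-one smoothed relative frequencies.
    """
    recent_counts = Counter(t for text in recent for t in tokenize(text))
    background_counts = Counter(t for text in background for t in tokenize(text))
    recent_total = sum(recent_counts.values())
    background_total = sum(background_counts.values())
    vocabulary_size = len(set(recent_counts) | set(background_counts))

    scores = {}
    for term, count in recent_counts.items():
        recent_rate = (count + 1) / (recent_total + vocabulary_size)
        background_rate = (background_counts[term] + 1) / (background_total + vocabulary_size)
        scores[term] = recent_rate / background_rate
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_n]
```

Here `recent` might be the last few hours of tweets from the list and `background` a longer archive; a log-likelihood or TF-IDF score could be swapped in for the ratio.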

Other ideas? Tweet me @sgsinclair (with this URL http://tr.im/FnOI ;-)

I have another motivation for wanting DHNow to work optimally: I’m already overwhelmed by digital information from email, blogs, Twitter, and so on. I’m not especially keen to add yet another source of information, unless I have confidence that it will allow me to drop something else. Any filtering and aggregation obviously means compromise and loss – but if anyone should be up to the challenge of a technical and social problem, it’s the digital humanities community.

Tool APIs

In preparation for the upcoming API workshop, organized by Bill Turkel, I thought I’d try to assemble a few thoughts on APIs. This is the fruit of work on several text analysis projects, including TAPoR, HyperPo, Voyeur, BonPatron and MONK (I hesitate to associate ideas with specific people without their consent, but of course this is also the fruit of working with several talented people in digital humanities).

  1. Use REST and keep it simple. The universal KISS principle is certainly valid for APIs: the simpler things are, the more likely they’ll be properly understood and adopted. The TAPoR Portal supports both SOAP tools and REST tools, but REST tools have been far less of a headache (some of the problems related specifically to Ruby’s SOAP library, but even beyond that, for our purposes REST tools provide everything we need with less hassle). Part of keeping the syntax of the API simple is to plan for a wide range of calls; this doesn’t mean that all the calls should be implemented and documented, but listing them at the beginning helps to define the purpose and scope of the tool and helps prevent overly complex syntax that’s usually the product of afterthought.
  2. Document the APIs (preferably automatically). Documentation goes without saying (sometimes it even goes without doing). When tools get compared and evaluated, one of the main criteria is always the extent and quality of the documentation. Besides, good documentation usually avoids more support questions. Of course there may be cases where you want to keep some aspects of the API undocumented if they’re too much in flux: a documented API should be respected by both developer and user, even as the tool evolves. One of the best ways to ensure up-to-date documentation is to find a way of having tools document themselves (like JavaDocs). This is one reason why HyperPo used Cocoon and XForms in order to have self-documenting tools.
  3. Provide XML and JSON output. Providing two forms of output is a bit contradictory to the KISS principle, but there are good reasons for providing both: 1) XML because it’s still a powerful interchange language and can be infinitely transformed with XSL; 2) JSON because results are usually easier and faster to work with for client-side Javascript libraries (not to mention less bandwidth because results are more compact). Part of a well-documented API is of course explaining the results format.
  4. Provide paging functionality. It’s a pain when you really want 5 results but the tool gives you 5,000: it’s an unnecessary performance burden in terms of bandwidth, memory, and computation. There are rare exceptions, but most tools should provide paging functionality to ensure they’re scalable (even if the paging doesn’t seem immediately useful). Things get trickier when you need to combine paging and sorting or grouping, but that’s where clear API documentation helps.
  5. Create a proxy to channel traffic. For many client-side web applications, having a proxy channel requests to other tools can help avoid some constraints imposed by cross-domain Javascript security. But even beyond that, proxies can serve a useful purpose as a centralized broker of communication with other tools – there are good chances that parts of proxy code can be reusable for different types of tool requests, even when direct requests to the tools are possible (for instance, caching results or handling connection errors). One of the main benefits that we’ve found from having a proxy layer goes beyond APIs: it decouples development schedules of the interface (client-side) group and the backend (server-side) group. For instance, it’s possible for the proxy to provide fake data to the interface until the backend is ready to provide real data – but the interface code is oblivious to the difference.
  6. For rich client-side tools, create embeddable objects. We usually think of APIs as providing data-centric content that is transformed and presented to the user in a different format. However, there are some tools where the server-side and client-side components work together and it’s actually the bundled combination that’s desired. These are often called widgets or badges, and they provide stand-alone functionality (like an embedded YouTube video or a Twitter timeline). A text-analysis example of this is Voyeur panels, like on the Day of Digital Humanities. Again, because of cross-domain security constraints, it can be easiest to embed these panels in an IFRAME (though of course they won’t be allowed to interact with the rest of the page).
  7. Coordinated redundancy of services would be nice. I’m talking here primarily about academic projects, not commercial services: our servers and services go down for a variety of reasons and there’s rarely staff available 24/7 to make sure things are restored immediately. Furthermore, we’re more likely in an academic context to deploy an experimental version of something that could inadvertently break functionality required elsewhere. The problem is that if Project 1 depends on services from Project 2 but Project 2 is unavailable for some time, Project 1 may be partly or completely compromised. Projects that want to do the right thing and integrate existing remote services instead of re-inventing every wheel or having local installations of every service (that individually need to be maintained) face a network challenge. One possibility (again that’s fairly specific to the academic context) is to have a mechanism for coordinating fail-over sites for certain services. This isn’t quite as easy as it sounds since you need to maintain and distribute (presumably again through an API) a list of current installations with versioning information included. One benefit, if really there’s collaboration between sites, is that you get a form of mirroring that can provide load-balancing as well as improve network latency by calling services that are closer to you. I don’t think we have any good examples of tools that are widely used by several digital humanities projects, but that’s not entirely the fault of the existing tools, it’s that we haven’t focused enough on APIs and distributed services…
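
To illustrate points 3 and 4 together, here is a minimal Python sketch of a result set that can be paged and then serialized as either JSON or XML. None of this is taken from TAPoR, HyperPo or Voyeur: the parameter names (`start`, `limit`), the element names and the term/count result shape are assumptions chosen for the example.

```python
import json
import xml.etree.ElementTree as ET


def page_results(results, start=0, limit=5):
    """Return one page of `results` plus the metadata a client needs to ask for the next page."""
    if start < 0 or limit < 0:
        raise ValueError("start and limit must be non-negative")
    return {
        "total": len(results),
        "start": start,
        "limit": limit,
        "items": results[start:start + limit],
    }


def to_json(page):
    """Compact output, convenient for client-side JavaScript."""
    return json.dumps(page)


def to_xml(page):
    """Equivalent XML output, convenient for XSL transformations."""
    root = ET.Element(
        "results",
        total=str(page["total"]),
        start=str(page["start"]),
        limit=str(page["limit"]),
    )
    for item in page["items"]:
        ET.SubElement(root, "term", count=str(item["count"])).text = item["term"]
    return ET.tostring(root, encoding="unicode")
```

Both serializers work from the same page structure, so the cost of offering two formats stays small, and returning the total alongside each page lets a client decide whether to ask for more.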

Although HyperPo has many faults (not very scalable, not to mention the fact that its development has been superseded by Voyeur), it does provide a decent API. To see it in action, you can view the list of modular tools in the HyperPoets Gallery, click on one of the tools, scroll down to near the bottom of the page and click the API link, and submit some values (please don’t be a bully – use shorter texts:-). Some tools provide alternate output formats – you’ll find those in the options section if applicable. For instance:

Some similar calls are currently possible with Voyeur (http://voyeur.hermeneuti.ca/?input=http://www.un.org/Overview/rights.html), but there’s a long way to go yet…

Postdoctoral Fellowship in Digital Humanities and High Performance Computing (HPC)

Applications are invited for a one-year Postdoctoral Fellowship in Digital Humanities and High Performance Computing (HPC), under the supervision of Dr. Stéfan Sinclair from Communication Studies and Multimedia at McMaster University. The focus of the research will be large-scale, on-demand text analysis, and especially the development of HPC modules that can operate in a web-based context. McMaster University is internationally recognized as a leader in digital humanities scholarship and tool development.

This position is made possible in large part by Sharcnet, an HPC consortium in Ontario, as well as McMaster Libraries. The postdoctoral fellow will work closely with the supervisor (Sinclair), Sharcnet, and the Libraries.

Successful candidates will have experience working on textually oriented projects as well as strong Java and system administration skills. We are seeking an individual who can bring strong interest and enthusiasm to an area of research ripe for innovation, and someone who will be able to integrate well into a larger team.

Salary: $45,000 plus benefits

By July 31, 2009, applicants should send a full Curriculum Vitae, letters from two referees, and a cover letter highlighting their prior achievements and briefly summarizing their interest and experience in this area. Electronic submissions will be accepted. Applicants are strongly encouraged to contact Sinclair as early as possible to express interest and to ask any questions.

McMaster is committed to Employment Equity and welcomes applications from all qualified applicants, including women, members of visible minorities, Aboriginal persons, members of sexual minorities, and persons with disabilities.

Dr. Stéfan Sinclair (sgs [at] mcmaster.ca)
Communication Studies & Multimedia
McMaster University
1280 Main Street West
Hamilton, ON, L8S 4M2, Canada

Twitter

I’ve finally taken the plunge into Twitter. I have to confess that I do so more out of academic curiosity than real interest, but I have a sneaking suspicion that I’ll enjoy it, at least for a while. I’m not sure I’ll ever get into the groove of divulging details of my personal life, but I think it might be a good medium for exchanging interesting nuggets about research and teaching activities. My first instinct was certainly to look up colleagues whose work interests me, rather than looking up friends and family.

Soon after creating my account I found a very simple Quicksilver ActionScript for posting tweets. I also found an updated script for Growl notifications, but what I really wanted was to be warned when tweets were too long (over 140 characters). After trying a few variants with more or less success, I settled on this script (though I made the failed Growl message a bit more noticeable).
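
The script itself isn’t reproduced here, but the length check at its heart is simple. As a stand-in, here is a minimal sketch in Python rather than in the Quicksilver script I actually use; the function name and messages are made up for illustration, and the Growl notification and the posting step are left out.

```python
MAX_TWEET_LENGTH = 140  # Twitter's limit at the time of writing


def check_tweet(text, limit=MAX_TWEET_LENGTH):
    """Return (ok, message): whether `text` fits the limit and a note to show the user."""
    overflow = len(text) - limit
    if overflow > 0:
        return False, f"Too long by {overflow} character(s)"
    return True, f"{-overflow} character(s) remaining"
```

A posting script would call `check_tweet` first and only send the tweet when the first value is true, showing the message either way.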

Text Analysis in the News

A neighbour and friend said he thought of me when he read an article about researchers doing text analysis to study the possible effects of Alzheimer’s on the vocabulary richness of authors. I asked to see the article and was very pleasantly surprised to see our TAPoR colleague Ian Lancashire prominently featured in a recent Maclean’s article (Ian has been a wonderful pioneer and leader for the text analysis community in Canada and beyond, earning him an Outstanding Achievement Award for Computing in the Arts and Humanities). The study was looking at longitudinal trends in the writings of Agatha Christie. Among other notable findings, the study identified a 30 per cent drop in vocabulary leading into Christie’s penultimate novel Elephants Can Remember. The Maclean’s article is a wonderful example of the potential for text analysis to be accessible and broadly relevant.

Day of Digital Humanities

Along with almost 100 other colleagues, I participated in the Day of Digital Humanities, a community publication project to bring together digital humanists from around the world to document what they did today. I think this was a super initiative, in part because it offers such an unusual glimpse at what so many of our colleagues do (beyond what they might present in a more polished form in conference presentations and scholarly articles).

I spent a good part of my day working on adapting Voyeur for use with RSS feeds (like the ones being produced by the Day of Digital Humanities). Here are some glimpses (this highlights Voyeur’s ability to be embedded in remote sites, like this blog – this should be considered a modest preview release of Voyeur):

  • a summary of all posts (currently submitted – I’ll update this tomorrow to catch the last ones):
  • the top types (words) grouped in documents by author

Among the countless things to do on Voyeur, I need to better display results when there are hundreds of documents (like when each post is a separate document), but the full Voyeur interface is fairly usable for the second arrangement of documents (one document per author).

Citing Software

Geoffrey Rockwell and I have been giving considerable thought recently to how we might facilitate the integration of text analysis tools and results into (mostly scholarly) writing. Scholars feel compelled to cite ideas and texts that come from other authors, but they are much less likely to recognize tools that have contributed to their work (and we would probably not want every scholar to cite search engines such as Google that have been used during research). We feel strongly that text analysis tools can represent a significant contributor to digital research, whether they were used to help confirm hunches or to lead the researcher into completely unanticipated realms. Whether or not scholars do make it more of a habit to cite tools is beyond our control, but we want to design our upcoming tools to make it easier for them to do so. At the very least this includes:

  • providing a preferred general citation for the tool suite
  • providing preferred citations for specific results including references to the tool and the source text(s)
  • making it easier for users to extract static or dynamic results and include them elsewhere (a web-based blog editor, an HTML editor, a word processor article, etc.), with a reference

An important component of academic knowledge is reproducibility, and providing scholars with more information on the processes followed during research – including the text analysis tools and digital texts used – is sure to be important.
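
To give a sense of what a preferred citation for specific results might carry, here is a small Python sketch. The fields and the output format are purely hypothetical (we haven’t settled on a citation style); the point is only that a citation for a result should name the tool, the source text and the date, so that the analysis can be repeated.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ResultCitation:
    """The minimum a reader would need to re-run an analysis (hypothetical fields)."""
    tool: str
    tool_url: str
    source_title: str
    source_url: str
    accessed: date

    def as_text(self):
        """Format the citation as a single line (an illustrative style, not a standard)."""
        return (
            f'{self.tool}. Results for "{self.source_title}" <{self.source_url}>. '
            f"{self.tool_url} (accessed {self.accessed.isoformat()})."
        )
```

A tool could attach a string like this to every exported result, whether the result ends up in a blog post, an HTML page or a word processor document.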

I was prompted to write this post by a recent notice in a Globe and Mail article that provided several statistics:

These figures have been compiled by Patrick Brethour, the Globe and Mail’s British Columbia editor, drawing from the 2006 census with the help of special software from Tetrad Computer Applications Inc.

The figures referred to are mostly present in the text of the article as well, but I wonder if the editor would have been as likely to include this notice if there hadn’t been the inset with the concentrated statistics. The distinction is important because it’s about recognizing what contributed to the research regardless of how the results are presented (though ironically, journalism tends to have very different standards of citation than academic writing, and yet it’s in a newspaper article that we find a software tool cited). Will standards for citing digital tools in the humanities shift in the coming years?

TREX 2008 Winners Announced

TADA (the Text Analysis Developers’ Alliance, of which I’m the unofficial future former director) has announced winners of the 2008 T-REX Competition (for text analysis tools development and usage). The panel of judges reviewed the many submissions received and has recognized winners in five categories:

  • Best New Tool
    • Degrees of Connection by Susan Brown, Jeffery Antoniuk, Sharon Balazs, Patricia Clements, Isobel Grundy, Stan Ruecker
    • Ripper Browser by Alejandro Giacometti, Stan Ruecker, Ian Craig, Gerry Derksen
  • Best Idea for a New Tool
    • Magic Circle by Carlos Fiorentino, Stan Ruecker, Milena Radzikowska, Piotr Michura
  • Best Idea for Improving a Current Tool
    • Collocate Cloud by Dave Beavan
    • Throwing Bones by Kirsten C. Uszkalo
  • Best Idea for Improving the Interface of the TAPoR Portal
    • Bookmarklet for Immediate Text Analysis by Peter Organisciak
  • Best Experiment of Text Analysis Using High Performance Computing
    • Back-of-the-Book Index Generation by Patrick Juola

Congratulations to all winners and thanks to all participants! Watch this space for upcoming TADA events, including the next TREX Competition.

Johnny Rodgers on Digital Texts 2.0

Johnny Rodgers, lead developer of Digital Texts 2.0, is getting some media love from the School of Interactive Arts & Technology where he’s just started an MA this fall. Johnny will be presenting our work on Digital Texts 2.0 in a couple of weeks at CaSTA 2008.

Digital Texts 2.0 (Preview Release)

We’ve made available a preview release of Digital Texts 2.0, an attempt to experiment with social networking practices in the context of interacting with electronic texts. Although we have a fairly detailed scholarly agenda for this project, one of the things I’m most curious about is whether or not students would be interested in using a Facebook application to interact with texts, whether it be for pleasure or for course work. Similarly, can instructors find innovative ways to incorporate such tools into the classroom?

Some key features currently available:

  • Add and browse Texts (via Amazon lookup, or manually)
  • Organize your texts into Collections
  • Join Groups of like-minded Readers
  • Comment on and add Tags to Authors, Collections, Texts, and Groups
  • Share your findings with Friends

Some upcoming features (probably by the end of the summer):

  • Import/Export feature set
  • Citation Generation tools
  • Hybrid Searches combining Authors and Readers of Texts
  • Text Recommendation system

Are you planning on using Digital Texts 2.0? Please let me know!

Thanks to the Digital Texts 2.0 team and especially to the heroic efforts of Johnny Rodgers, the programmer and designer, and Shawn Day, who has provided outstanding feedback.
