Custom Word Lists in Cirrus

November 24, 2013


Cirrus with a custom terms list that shows my co-authors.

Cirrus is a Wordle-like word cloud visualization tool in Voyant. As with most tools in Voyant, Cirrus is primarily intended for use with a user-defined collection of texts – you specify some URLs or upload some files, and Cirrus shows you the highest-frequency terms (with or without a user-defined list of stopwords).

However, it is possible to define a custom list of terms to visualize without having to define a corpus.

Cirrus supports two formats of inline data that can be specified as part of a URL:

If it were just a matter of “normal” terms like with the examples above, there would be no real benefit to the inline data format, since one could just specify the full text (and then have access to all the other functionality in Voyant); something like: http://voyant-tools.org/tool/Cirrus/?input=the%20cat%20in%20the%20hat. But Voyant makes some assumptions about what you want to count (lower-case single words) that may not always be appropriate. What if I want to differentiate between upper- and lower-case words? What if I want to show multi-word terms as a single unit? In the image above, for instance, I’m showing a list of co-authors from my publications and conference presentations – currently this would be impossible to do with a normal corpus in Voyant Tools.
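For anyone building such links from a script, here is a minimal Python sketch of the full-text URL shown above. The base URL and the input parameter come from the example in this post; the function name cirrus_input_url is just a hypothetical helper for illustration, not part of Voyant.

```python
from urllib.parse import quote, urlencode

# Base URL as it appears in the example above.
CIRRUS_URL = "http://voyant-tools.org/tool/Cirrus/"


def cirrus_input_url(text: str) -> str:
    """Build a Cirrus URL that passes `text` through the `input` parameter.

    Hypothetical helper: only the URL shape is taken from the post.
    """
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    query = urlencode({"input": text}, quote_via=quote)
    return f"{CIRRUS_URL}?{query}"


print(cirrus_input_url("the cat in the hat"))
```

Note that this only reproduces the plain full-text case, so Voyant's usual assumptions (lower-case single words) still apply to whatever text is passed in.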

I consider a custom terms list in Cirrus to be of somewhat limited use since the tool is detached from a corpus and the rest of the Voyant Tools functionality. It’s now a fairly static word cloud, and I’ll be the first to admit that Wordle can produce more attractive representations (especially with a few tricks like keeping words together). Then again, Wordle is a Java applet that is increasingly difficult to run in browsers, and besides, we like seeing Voyant being used lots, right? :)

P.S. Did you know that you can force Cirrus to use an HTML5 implementation in order to better display some Unicode characters (the default Flash version is faster and arguably more attractive, but more limited for character sets)? Just add forceHtml5=true to the main page or any other URL (like this one in Japanese – word tokenization is a different issue).
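If you want to add that parameter to existing links programmatically, something like the following Python sketch should work. The forceHtml5=true parameter is the one described above; the helper name force_html5 is hypothetical, and the sketch assumes the parameter can simply be appended to the URL's query string.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def force_html5(url: str) -> str:
    """Return `url` with forceHtml5=true in its query string.

    Hypothetical helper; only the parameter name and value come from the post.
    """
    parts = urlsplit(url)
    # Keep the existing parameters, dropping any earlier forceHtml5 value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "forceHtml5"
    ]
    params.append(("forceHtml5", "true"))
    # quote_via=quote keeps spaces encoded as %20 rather than "+".
    new_query = urlencode(params, quote_via=quote)
    return urlunsplit(parts._replace(query=new_query))


print(force_html5("http://voyant-tools.org/"))
print(force_html5("http://voyant-tools.org/tool/Cirrus/?input=the%20cat%20in%20the%20hat"))
```

Existing parameters are decoded and re-encoded along the way, so a URL that already carries an input value keeps it unchanged.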
