There are some digital humanists who are competent mathematicians, but most of us experience some anxiety about the more advanced mathematics involved in the text analysis methodologies we use. Dammit Jim, I’m a humanist, not a mathematician! The problem, of course, is that some statistical and graphical techniques are clearly very powerful for humanities research (if you’re unconvinced by this claim, please read on anyway). So one faces a choice:
- not using these techniques
- using these techniques naïvely and trusting that they’re working properly and that one is interpreting the results properly
- investing a ton of time learning the mathematics involved, sometimes to the detriment of the original research agenda
- collaborating with someone who does understand the mathematics
Much of my digital humanities work is oriented around building tools for humanists who are willing to venture out of the immediate comfort zone of their academic training in the conventional humanities and to try new tools. For these users, tools must be as user-friendly as possible and relatively reliable. Note that I’m not saying that tools must always be simple or intuitive, since that would exclude some potentially valuable techniques.
Correspondence Analysis is a good example of a technique that can appear very intimidating but that can also be a very powerful tool in the arsenal of a digital humanist. My intent here is not to explain the mathematics behind correspondence analysis – that can be found elsewhere – but rather, to give a sense of how and why it might be useful in a humanities context. I will base this on the ScatterPlot tool that I have been developing with Geoffrey Rockwell in Voyeur Tools.
Geoffrey and I have been looking at the Humanist Discussion Group listserv archive – a vibrant mailing list that has been active since 1987. The listserv archive is of interest to us because it provides one way of studying the digital humanities (humanities computing) community over time. I’ve explained elsewhere how the listserv archive has been compiled and extracted, but essentially we are working with messages from 1987 to 2008, including well over 8 million words. It’s worth emphasizing that the corpus has only been lightly cleaned up – there’s much more that could be done.
The Humanist archive can be studied using the default Voyeur Tools interface. With a longitudinal corpus like this (documents organized chronologically), one of the natural things to want to examine is the variation of terms over time: When does a term first appear? Is a term distributed evenly across the entire corpus? How does the frequency of two terms compare over time? etc. The default Voyeur Tools interface is well-suited to this kind of inquiry, especially if you know in advance what terms you want to examine more closely.
The Words in the Entire Corpus tool shows (by default) the top frequency words, as well as sparklines that indicate the distribution of each term in the corpus. The sparkline for http, for instance, indicates that there are no occurrences of the term early in the corpus, that occurrences rise precipitously, and then stabilize. This phenomenon is not surprising given that the corpus spans 1987 to 2008 – the rise in occurrences of http corresponds to the rise of the web in the early 1990s.
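As a rough sketch of the data behind such a sparkline, consider a hypothetical miniature corpus (the documents and counts below are invented for illustration) where occurrences of a term are counted per year-document:

```python
# Hypothetical miniature corpus: one tiny document per year, invented text.
docs_by_year = {
    1987: "humanist computing text analysis",
    1993: "text http web computing",
    2000: "http http web http html",
}

# The raw series a sparkline would draw: occurrences of "http" per document.
sparkline = [text.split().count("http") for text in docs_by_year.values()]
# Zero early on, then rising – the shape described for http above.
```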
While it might be very useful to peruse this frequency list, it’s a bit difficult to compare the frequency of multiple terms – the numbers are there, but there’s no visual representation of the relative frequencies. The sparklines are a bit misleading since each one is calibrated to its own word – the graph tries to make use of all the available (limited) space to show the variation between the minimum and maximum frequencies of each word, but the words are at different scales. It would probably be more revealing to show results for several words together and at the same scale, which is what the Word Trends tool allows:
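The calibration issue can be seen with a few invented numbers: per-word min-max scaling (what each sparkline effectively does) makes two series of very different magnitudes fill the same visual range, whereas relative frequencies put them on one shared, comparable scale. A minimal sketch, with made-up counts:

```python
# Invented per-year counts for two terms, plus per-year corpus sizes.
http_counts = [0, 2, 150, 300, 320]
the_counts  = [900, 950, 980, 1000, 990]
year_totals = [10000, 11000, 12000, 13000, 12500]

def minmax(xs):
    """Scale a series into 0..1, as each per-word sparkline does."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Per-word scaling: both series now span 0..1, hiding the fact that
# "the" is vastly more frequent than "http".
http_spark = minmax(http_counts)
the_spark = minmax(the_counts)

# Shared scale: relative frequency per year, directly comparable.
http_rel = [n / t for n, t in zip(http_counts, year_totals)]
the_rel = [n / t for n, t in zip(the_counts, year_totals)]
```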
Although this is useful, and it reveals some potentially interesting patterns, it requires the user to select individual terms to be added to the graph. One could add a large number of terms, but the graph would become increasingly illegible (the forest starts to hide the trees). What is needed is a way of showing more terms while collapsing some of the information displayed, so that what’s left is more manageable and more easily perceived (as long as we are cognizant of the potential pitfalls of collapsing or reducing information).
Correspondence analysis is a technique for doing just that: taking a larger matrix of data and collapsing it into a more compact form. Needless to say, the compacting doesn’t happen arbitrarily, but rather by organizing items spatially so that their position carries meaning that does not have to be explicitly expressed. At a schematic level, it might be equivalent to the following transformation (where some information is lost, but some of the key concepts are conveyed more compactly simply by spatial organization):
My cat is an animal.
That chair is a piece of furniture.
Joe’s dog is an animal.
This table is a piece of furniture.
Similarly, we might imagine a large matrix of data that expresses the frequency of each term in each document, as in the table below:
This table provides a lot of explicit data, but the relationships between data points in the rows and columns are difficult to perceive. By plotting the data spatially, we can reduce the amount of information that needs to be expressed while better expressing the associations that are already present in the data. We take a large multi-dimensional space (the matrix) and we reduce it to a 2D or 3D representation. Geoffrey Rockwell and John Bradley describe it as follows:
Correspondence Analysis helps by transforming the dimensions of the data so that the effect of throwing away of dimensions to allow us to see patterns has as little negative effect as is mathematically possible. CA transforms the data into an “equivalent” space where the largest amount of variability in the data points is captured in the first dimension, the next largest amount of variability in the second dimension, and so on. [...] In addition to maximizing the variability of data in a few dimensions, Correspondence Analysis also helps us see associations in the word profiles in quite a different way. This is because the procedure not only transforms the word data (the rows, or the distribution of the words between the parts) into a new space, but also the “part data” (the columns, or the distribution of the parts between the words) into the same set of dimensions. It is possible, therefore, to map the two types of data – the words and the parts onto the same space, and sometimes possible to see associations not only between different words, but also between the dimensions and the parts.
Rockwell and Bradley describe “parts” (of Hume’s Dialogues), but the same principle is valid for our per-year documents in the Humanist archives. In the results below, the years are indicated with orange triangles and the 100 top frequency words are indicated with blue circles.
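The whole procedure can be sketched in a few lines of standard correspondence analysis (the singular value decomposition of standardized residuals). The terms, years, and counts below are all invented for illustration; the point is that the terms (rows) and the year-documents (columns) end up with coordinates in the same space, and that each successive dimension captures a decreasing share of the variability:

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are per-year documents.
# All counts are invented for illustration.
terms = ["computer", "software", "web", "html"]
years = ["1988", "1994", "2000", "2006"]
N = np.array([
    [40, 30, 10,  5],   # computer
    [35, 25,  8,  4],   # software
    [ 1,  5, 30, 40],   # web
    [ 0,  2, 25, 35],   # html
], dtype=float)

# Correspondence analysis via SVD of the standardized residuals.
P = N / N.sum()                       # correspondence matrix
r = P.sum(axis=1)                     # row (term) masses
c = P.sum(axis=0)                     # column (document) masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates place terms and documents in the same space,
# which is what allows both to be drawn on one scatter plot.
term_coords = (U / np.sqrt(r)[:, None]) * sv
year_coords = (Vt.T / np.sqrt(c)[:, None]) * sv

# Share of total variability (inertia) captured by each dimension;
# the first dimension captures the most, the second the next most, etc.
inertia = sv**2 / (sv**2).sum()
```

With data this strongly structured, the first dimension captures nearly all the inertia, and it separates the “computer/software” terms (and early years) from the “web/html” terms (and later years), on opposite sides of the origin.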
This ScatterPlot graph (points in a Cartesian space) represents 50 of the top frequency words in Humanist. It starts from the same data as the table above: each document in that table is a dimension, which means that we would have a 21-dimensional space. Instead, the correspondence analysis tries to find associations between the term frequencies and the documents in a way that can be expressed in two dimensions (each term has x and y coordinates). This graph actually expresses a third dimension as well, using the opacity of the circles (how dark the blue is). This is important because two terms may appear close to one another in two dimensions but actually be far apart in the third dimension, much like how stars can appear next to one another in the sky but actually be very far apart in terms of distance from the viewer.
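The star analogy can be made concrete with two hypothetical coordinate triples: nearly identical in the plotted x-y plane, but very different in the third dimension (the one shown as opacity):

```python
import math

# Two hypothetical term positions from a three-dimensional solution:
# close in the plotted x-y plane, far apart in the third dimension.
a = (0.10, 0.20, 0.05)
b = (0.12, 0.21, 0.90)

dist_2d = math.dist(a[:2], b[:2])   # what the eye sees on the scatter plot
dist_3d = math.dist(a, b)           # the actual separation
```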
As mentioned before, my intent here is not to explain the mathematics behind correspondence analysis, but to suggest how it might be useful. The image below highlights two examples:
- The green line traces a series of documents between 1987 and 1994; given that correspondence analysis is working exclusively from term frequency data, it is remarkable that the documents appear in order and in a regular pattern. It’s also notable that terms like computer and software are plotted in this region, which suggests an association between the frequency of those terms and the documents in that space – perhaps participants of Humanist in the late 1980s and early 1990s were more concerned with software than in later years.
- The blue line encompasses a cluster of documents in the early 2000s as well as a series of terms that are easily associated with the web: web, com, html, etc. This cluster of words might be identified by the user’s intuition or discovered using other mechanisms in Voyeur Tools, but correspondence analysis has managed to produce the cluster automatically. This is an important distinction: the user is not required to hunt and peck for potential phenomena to study more closely; the correspondence analysis suggests the cluster on its own.
It would probably be useful to combine the ScatterPlot with tools to examine word trends and concordances, which is what the scatter skin does. I am only scratching the surface of what might be possible to examine in the Humanist archive with correspondence analysis (Geoffrey and I are studying the archive and will publish some of our interpretations separately).
What really excites me about correspondence analysis is how it can serve both beginner and more advanced users. Beginner text analysis users tend not to know where to begin, and any tool that can suggest possible phenomena to study can be worthwhile – rather than start with a blank slate or, say, a raw list of frequencies, the ScatterPlot visualization helps to suggest clusters and patterns that are worthy of study. The interface is not self-evident, but the visual language of proximity is powerful and accessible. Advanced users can experiment with various settings to more fully exploit the underlying data.
Most importantly, the Correspondence Analysis tool is presented in the same user-friendly envelope as the rest of Voyeur Tools – it should be relatively easy to upload a corpus of documents in a variety of formats and start experimenting.
Some useful links:
- Geoffrey Rockwell and John Bradley, A Correspondence Analysis of the Dialogues (see links to McKinnon’s work)
- Ben Schmidt’s Principal Component Analysis experiments
- Xiaoguang Wang & Mitsuyuki Inaba, Analyzing Structures and Evolution of Digital Humanities Based on Correspondence Analysis and Co-word Analysis
- Sarah Allison, Ryan Heuser, Matthew Jockers, Franco Moretti & Michael Witmore, Quantitative Formalism: an Experiment
- Wikipedia, Correspondence Analysis