A Gentle Introduction to Correspondence Analysis

February 18, 2011

There are some digital humanists who are competent mathematicians, but most of us experience some anxiety about the more advanced mathematics involved in the text analysis methodologies that we use. Dammit Jim, I’m a humanist, not a mathematician! The problem of course is that there are clearly some statistical and graphical techniques that can be very powerful for humanities research (if you’re unconvinced by this claim, please read on anyway). So one faces a choice:

  1. not using these techniques
  2. using these techniques naïvely and trusting that they’re working properly and that one is interpreting the results properly
  3. investing a ton of time learning the mathematics involved, sometimes to the detriment of the original research agenda
  4. collaborating with someone who does understand the mathematics

Much of my digital humanities work is oriented around building tools for humanists who are willing to venture out of their immediate comfort zone (based on their academic training in the conventional humanities) and to try new tools. For these users, tools must be as user-friendly as possible and relatively reliable. Note that I’m not saying that tools must always be simple or intuitive, since this would exclude some potentially valuable techniques.

Correspondence Analysis is a good example of a technique that can appear very intimidating but that can also be a very powerful tool in the arsenal of a digital humanist. My intent here is not to explain the mathematics behind correspondence analysis – that can be found elsewhere – but rather, to give a sense of how and why it might be useful in a humanities context. I will base this on the ScatterPlot tool that I have been developing with Geoffrey Rockwell in Voyeur Tools.

Geoffrey and I have been looking at the Humanist Discussion Group listserv archive – a vibrant mailing list that has been active since 1987. The listserv archive is of interest to us because it provides one way of studying the digital humanities (humanities computing) community over time. I’ve explained elsewhere how the listserv archive has been compiled and extracted, but essentially we are working with messages from 1987 to 2008, including well over 8 million words. It’s worth emphasizing that the corpus has only been lightly cleaned up – there’s much more that could be done.

The Humanist archive can be studied using the default Voyeur Tools interface. With a longitudinal corpus like this (documents organized chronologically), one of the natural things to want to examine is the variation of terms over time: When does a term first appear? Is a term distributed evenly across the entire corpus? How does the frequency of two terms compare over time? etc. The default Voyeur Tools interface is well-suited to this kind of inquiry, especially if you know in advance what terms you want to examine more closely.

The Words in the Entire Corpus tool shows (by default) the top frequency words, as well as sparklines that indicate the distribution of each term in the corpus. The sparkline for http, for instance, indicates that there are no occurrences of the term early in the corpus, that occurrences rise precipitously, and then stabilize. This phenomenon is not surprising given that the corpus spans 1987 to 2008 – the rise in occurrences of http corresponds to the rise of the web in the early 1990s.

While it might be very useful to peruse this frequency list, it’s a bit difficult to compare the frequencies of multiple terms – the numbers are there, but there’s no visual representation of the relative frequencies. The sparklines are a bit misleading since they are calibrated for each word – each graph tries to make use of all the available (limited) space to show the variation between the minimum and maximum frequencies of that word, so the sparklines are at different scales. It would probably be more revealing to show the results for several words together and at the same scale, which is what the Word Trends tool allows:
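One way to put terms on the same scale is to normalize raw counts by each segment’s size. Here is a minimal sketch of that idea; the counts and segment totals below are invented for illustration, not taken from the Humanist corpus:

```python
# Hypothetical raw counts per yearly segment for two terms, plus each
# segment's total word count. Relative frequencies put both terms on
# one shared scale, unlike per-word min-max sparklines.
http_counts = [0, 260, 5641, 3625]
computer_counts = [900, 700, 650, 300]
segment_totals = [150_000, 200_000, 600_000, 450_000]

def per_million(counts, totals):
    # Frequency per million words, comparable across terms and segments.
    return [round(1_000_000 * c / t, 1) for c, t in zip(counts, totals)]

print(per_million(http_counts, segment_totals))
print(per_million(computer_counts, segment_totals))
```

Plotting these normalized series on one chart is essentially what a shared-scale trends view does.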

Although this is useful, and it reveals some potentially interesting patterns, it requires the user to select individual terms to be added to the graph. One could add a large number of terms, but the graph would become increasingly illegible (the trees start to hide the forest). What is needed is a way of showing more terms but collapsing some of the information displayed so that what’s left is more manageable and more easily perceived (as long as we are cognizant of the potential pitfalls of collapsing or reducing information).

Correspondence analysis is a technique for doing just that: taking a larger matrix of data and collapsing it into a more compact form. Needless to say, the compacting doesn’t happen arbitrarily, but rather by organizing items spatially so that their position carries meaning that does not have to be explicitly expressed. At a schematic level, it might be equivalent to the following transformation (where some information is lost, but some of the key concepts are conveyed more compactly simply by spatial organization):

My cat is an animal.
That chair is a piece of furniture.
Joe’s dog is an animal.
This table is a piece of furniture.

animal: cat, dog
furniture: chair, table
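The collapse sketched above can be expressed in a few lines of code: the repeated “X is a Y” statements become a grouping where position (membership in a category) replaces explicit statements:

```python
# Each fact is an (item, category) pair, mirroring the sentences above.
facts = [
    ("cat", "animal"),
    ("chair", "furniture"),
    ("dog", "animal"),
    ("table", "furniture"),
]

# Group items under their category: the grouping itself now carries
# the "is a" information that each sentence had to state explicitly.
groups = {}
for item, category in facts:
    groups.setdefault(category, []).append(item)

print(groups)  # {'animal': ['cat', 'dog'], 'furniture': ['chair', 'table']}
```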

Similarly, we might imagine a large matrix of data that expresses the frequency of each term in each document, as in the table below:

| Document | http | www | university | edu | subject | date | href | vol | num | x-humanist | uk | information | humanities |
|----------|------|-----|------------|-----|---------|------|------|-----|-----|------------|----|-------------|------------|
| 1987-88 | 0 | 0 | 1110 | 189 | 1235 | 1258 | 0 | 1112 | 1088 | 1087 | 567 | 606 | 831 |
| 1988-89 | 0 | 0 | 701 | 209 | 920 | 869 | 0 | 851 | 835 | 835 | 418 | 493 | 393 |
| 1989-90 | 0 | 0 | 2020 | 877 | 3317 | 3126 | 4 | 3068 | 2988 | 2986 | 735 | 1294 | 879 |
| 1990-91 | 0 | 0 | 2160 | 1373 | 3598 | 3464 | 0 | 3360 | 3313 | 3313 | 606 | 1340 | 413 |
| 1991-92 | 0 | 0 | 1868 | 1364 | 2294 | 2210 | 2 | 2155 | 2106 | 2102 | 499 | 968 | 538 |
| 1992-93 | 0 | 7 | 2302 | 1307 | 1772 | 1722 | 0 | 1634 | 1608 | 1608 | 411 | 1092 | 567 |
| 1993-94 | 64 | 67 | 2422 | 1320 | 1348 | 1250 | 35 | 1170 | 1143 | 1140 | 402 | 1156 | 810 |
| 1994-95 | 260 | 246 | 1482 | 876 | 873 | 831 | 160 | 758 | 744 | 744 | 428 | 820 | 433 |
| 1995-96 | 1454 | 1209 | 1599 | 1678 | 1355 | 1306 | 750 | 1211 | 1195 | 1194 | 416 | 858 | 803 |
| 1996-97 | 1803 | 1346 | 1297 | 1271 | 1028 | 992 | 935 | 936 | 915 | 913 | 767 | 806 | 593 |
| 1997-98 | 5641 | 4780 | 2174 | 3193 | 1405 | 1362 | 2837 | 1287 | 1243 | 1243 | 2707 | 1997 | 2290 |
| 1998-99 | 6572 | 5520 | 1352 | 3303 | 1261 | 1257 | 3267 | 1212 | 1165 | 1159 | 2577 | 1706 | 2071 |
| 1999-00 | 5857 | 4814 | 1268 | 2483 | 1066 | 1012 | 2798 | 984 | 942 | 942 | 2001 | 1259 | 1551 |
| 2000-01 | 5552 | 4277 | 1495 | 2506 | 1412 | 1305 | 2788 | 1258 | 1236 | 1236 | 1366 | 1302 | 995 |
| 2001-02 | 4362 | 3353 | 1280 | 1460 | 1158 | 1097 | 2185 | 1032 | 1015 | 1014 | 1040 | 1106 | 1001 |
| 2002-03 | 4118 | 2886 | 1116 | 1288 | 1056 | 1054 | 2054 | 973 | 951 | 951 | 1125 | 916 | 916 |
| 2003-04 | 4776 | 3667 | 1280 | 1585 | 1301 | 1356 | 2401 | 1226 | 1175 | 1175 | 1645 | 989 | 1253 |
| 2004-05 | 4648 | 3515 | 1372 | 1587 | 1234 | 1206 | 2345 | 1150 | 1129 | 1129 | 1236 | 1135 | 1118 |
| 2005-06 | 4300 | 3446 | 1377 | 1459 | 1201 | 1161 | 2160 | 1184 | 1095 | 1095 | 1240 | 1086 | 1255 |
| 2006-07 | 3874 | 2636 | 1440 | 1031 | 921 | 924 | 1953 | 924 | 837 | 833 | 1254 | 924 | 1512 |
| 2007-08 | 3625 | 2319 | 1052 | 983 | 800 | 798 | 1834 | 772 | 736 | 736 | 1292 | 686 | 1101 |
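A term-document matrix like this can be assembled with nothing more than word counts. Here is a minimal sketch using an invented toy corpus (the documents, terms, and counts below are illustrative, not the actual Humanist data):

```python
from collections import Counter

# Hypothetical mini-corpus: one string per yearly archive segment.
docs = {
    "1987-88": "university subject date university information",
    "1995-96": "http www http university subject",
    "1997-98": "http http www href http information",
}
terms = ["http", "www", "university", "subject", "information"]

# Count each term in each document to build the frequency matrix:
# one row per document, one column per term.
matrix = []
for year, text in docs.items():
    counts = Counter(text.split())
    matrix.append([counts[t] for t in terms])

for year, row in zip(docs, matrix):
    print(year, row)
```

In practice the tokenization would be more careful (punctuation, case, stop words), but the resulting structure is the same: a documents-by-terms grid of frequencies.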

This table provides a lot of explicit data, but the relationships between data points in the rows and columns are difficult to perceive. By plotting the data spatially, we can reduce the amount of information that needs to be expressed while better expressing the associations that are already present in the data. We take a large multi-dimensional space (the matrix) and we reduce it to a 2D or 3D representation. Geoffrey Rockwell and John Bradley describe it as follows:

Correspondence Analysis helps by transforming the dimensions of the data so that the effect of throwing away of dimensions to allow us to see patterns has as little negative effect as is mathematically possible. CA transforms the data into an “equivalent” space where the largest amount of variability in the data points is captured in the first dimension, the next largest amount of variability in the second dimension, and so on. [...] In addition to maximizing the variability of data in a few dimensions, Correspondence Analysis also helps us see associations in the word profiles in quite a different way. This is because the procedure not only transforms the word data (the rows, or the distribution of the words between the parts) into a new space, but also the “part data” (the columns, or the distribution of the parts between the words) into the same set of dimensions. It is possible, therefore, to map the two types of data – the words and the parts onto the same space, and sometimes possible to see associations not only between different words, but also between the dimensions and the parts.
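The transformation Rockwell and Bradley describe can be sketched with the standard correspondence analysis computation: scale the frequency table by its row and column masses, take a singular value decomposition of the standardized residuals, and derive principal coordinates for rows (documents) and columns (terms) in the same space. This is a minimal sketch with an invented toy matrix, not the actual Voyeur Tools implementation:

```python
import numpy as np

# Hypothetical toy frequency matrix: rows are yearly documents,
# columns are terms (the same shape of data as the table above).
N = np.array([
    [0,    1110, 1235],
    [260,  1482,  873],
    [5641, 2174, 1405],
], dtype=float)

P = N / N.sum()      # correspondence matrix (proportions)
r = P.sum(axis=1)    # row masses
c = P.sum(axis=0)    # column masses

# Standardized residuals: departure of each cell from independence,
# scaled by the row and column masses.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# SVD orders the axes so the first captures the most inertia
# (variability), the second the next most, and so on.
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates: documents and terms mapped into the SAME
# space, so proximity between a term and a document is meaningful.
rows = (U * sigma) / np.sqrt(r)[:, None]
cols = (Vt.T * sigma) / np.sqrt(c)[:, None]

print("document coordinates:\n", rows[:, :2])
print("term coordinates:\n", cols[:, :2])
```

Keeping only the first two columns of each coordinate matrix is exactly the “throwing away of dimensions” the quotation mentions: the discarded axes carry the least inertia, so the loss is as small as mathematically possible.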

Rockwell and Bradley describe “parts” (of Hume’s Dialogues), but the same principle is valid for our per-year documents in the Humanist archives. In the results below, the years are indicated with orange triangles and the 100 top frequency words are indicated with blue circles.

This ScatterPlot graph (points in a cartesian space) represents 50 of the top frequency words in Humanist. It starts from the same data as the table above: each document in that table is a dimension, which means that we would have a 21-dimensional space. Instead, the correspondence analysis tries to find associations between the term frequencies and the documents in a way that can be expressed in two dimensions (each term has x and y coordinates). This graph actually also expresses a third dimension using the opacity of the circles (how dark the blue is). This is important because two terms may appear close to one another in two dimensions, but actually be far apart in the third dimension, much like how stars can appear next to one another in the sky, but actually be very far apart in terms of distance from the viewer.
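Encoding a third dimension as opacity is a simple rescaling. In this sketch the coordinates are made up for illustration: two terms sit close together in (x, y) but far apart on the third axis, and the mapping makes that difference visible as alpha:

```python
# Invented (x, y, z) coordinates for two terms that are near one
# another in two dimensions but distant on the third axis.
points = {
    "web":  (1.2, 0.4,  0.9),
    "html": (1.1, 0.5, -0.8),
}

def alpha(z, z_min=-1.0, z_max=1.0):
    # Rescale the third coordinate into an opacity between 0.2 and 1.0,
    # keeping even the most distant points faintly visible.
    t = (z - z_min) / (z_max - z_min)
    return 0.2 + 0.8 * t

for term, (x, y, z) in points.items():
    print(term, round(alpha(z), 2))
```

A plotting library would then pass these alpha values to each circle, so a pair of apparently overlapping points with very different opacities signals a separation the 2D layout cannot show.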

As mentioned before, my intent here is not to explain the mathematics behind correspondence analysis, but to suggest how it might be useful. The image below highlights two examples:

Humanist Correspondence Analysis
  1. The green line traces a series of documents between 1987 and 1994; given that correspondence analysis is working exclusively from term frequency data, it is remarkable that the documents appear in order and in a regular pattern. It’s also notable that terms like computer and software are plotted in this region, which suggests an association between the frequency of those terms and the documents in that space – perhaps participants of Humanist in the late 1980s and early 1990s were more concerned with software than in later years.
  2. The blue line encompasses a cluster of documents in the early 2000s as well as a series of terms that are easily associated with the web: web, com, html, etc. This cluster of words might be identified by the user’s intuition or discovered using other mechanisms in Voyeur Tools, but correspondence analysis has managed to produce the cluster automatically. This is an important distinction: the user is not required to hunt and peck for potential phenomena to study more closely; the correspondence analysis suggests the cluster on its own.

It would probably be useful to combine the ScatterPlot with tools to examine word trends and concordances, which is what the scatter skin does. I am only scratching the surface of what might be possible to examine in the Humanist archive with correspondence analysis (Geoffrey and I are studying the archive and will publish some of our interpretations separately).

What really excites me about correspondence analysis is how it can serve both beginner and more advanced users. Beginner text analysis users tend not to know where to begin, and any tool that can suggest possible phenomena to study can be worthwhile – rather than start with a blank slate or, say, a raw list of frequencies, the ScatterPlot visualization helps to suggest clusters and patterns that are worthy of study. The interface is not self-evident, but the visual language of proximity is powerful and accessible. Advanced users can experiment with various settings to more fully exploit the underlying data.

Most importantly, the Correspondence Analysis tool is presented in the same user-friendly envelope as the rest of Voyeur Tools – it should be relatively easy to upload a corpus of documents in a variety of formats and start experimenting.

