I've been wanting to try some multidimensional scaling since 1998, and now I've tried it.
Multidimensional scaling (MDS) is this fantastic concept whereby "objects" that live in a possibly high-dimensional space (i.e., each object has a finite, perhaps large, set of values associated with it) are approximately placed into a lower-dimensional space, both for better visualization and perhaps to get a better understanding of the really important features that distinguish the objects. The crux is to generate a set of distances between your objects. Once a dimension is chosen (say, 2), the objects are placed into a space of that dimension as well as possible, in the sense that the distances between the new coordinates match the original distances as accurately as possible.
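To make the "distances in, coordinates out" idea concrete, here is a minimal sketch of one common variant, classical (Torgerson) MDS, in plain Python. It is an illustration under assumptions, not necessarily what any particular package does: the function name `classical_mds`, its parameters, and the use of simple power iteration to find the leading eigenvectors are all choices made for this example, and it assumes the distance matrix is symmetric.

```python
import math
import random

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Sketch of classical MDS: n x n symmetric distance matrix -> n points in `dims` dimensions."""
    n = len(dist)
    # Square the distances, then "double center" them to get a matrix of
    # inner products between the (unknown) centered coordinates.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration: repeatedly multiply by b and renormalize to
        # approach the eigenvector with the largest-magnitude eigenvalue.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))
        # Coordinate along this axis is sqrt(eigenvalue) * eigenvector
        # (clamped at zero if the eigenvalue is not positive).
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords

# Toy check: four corners of a 3-by-4 rectangle, described only by distances.
corners = [(0, 0), (3, 0), (0, 4), (3, 4)]
toy_dist = [[math.dist(p, q) for q in corners] for p in corners]
layout = classical_mds(toy_dist)
```

The recovered `layout` is only determined up to rotation, reflection, and translation, which is why the axes of an MDS plot have no built-in meaning and have to be interpreted afterwards.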
I first encountered this idea in Larry Polansky's seminar on timbre at Dartmouth, in 1997 or 1998, while I was wallowing in unemployment before getting my first post-PhD job. The example I most remember is a study in which people were played pairs of sounds and asked to rate how "similar" the sounds were (i.e., to specify a distance between each pair of sounds). When the ratings were run through the MDS process, a two-dimensional model was found to be reasonably accurate. The fun part is then to figure out what, if anything, those two dimensions represent. In the case of the sounds, one of the dimensions seemed to be attack time, and I forget what the other was (I'll have to fill this in later).
Since then, I've been meaning to try MDS out. I recently found out that the horribly named statistics software R does MDS, so I thought I'd give it a whirl. But what to try it on? I decided I would measure the letter frequencies in a bunch of Dickens novels. I then defined the distance between two Dickens novels to be the Euclidean distance based on these frequencies (i.e., the square root of the sum of the squares of the differences in frequency for each letter). Here is the result!
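For anyone who wants to reproduce the distance computation described above, here is a rough Python sketch (the analysis itself was done in R, so this is a stand-in, not the original script). The helper names are made up for illustration, and a few details are assumptions the description does not settle: frequencies are taken as proportions of all letters, upper and lower case are merged, and anything outside a-z is ignored.

```python
import math
import string
from collections import Counter

def letter_frequencies(text):
    """Return 26 relative frequencies (a..z); case is folded, non-letters are skipped."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 26
    return [counts[ch] / total for ch in string.ascii_lowercase]

def euclidean_distance(f, g):
    """Square root of the sum of squared per-letter frequency differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))

def distance_matrix(texts):
    """All pairwise distances between the given texts, as a list of lists."""
    freqs = [letter_frequencies(t) for t in texts]
    return [[euclidean_distance(f, g) for g in freqs] for f in freqs]

# Tiny placeholder strings standing in for full novel texts.
demo = distance_matrix(["aab", "abb"])
```

The resulting matrix is symmetric with zeros on the diagonal, which is the form an MDS routine expects as input.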
Not too bad! A Christmas Carol and Pickwick Papers are definitely the outliers. I don't exactly know why, but here is a table with some data:
| letter | max frequency | novel (max) | min frequency | novel (min) |
I created this table to show the maximum and minimum frequency of occurrence of each letter in Dickens novels.
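A table like this can be built from per-novel letter frequencies in a few lines. The sketch below is one way it might be done in Python; the function name `letter_extremes` and the input layout (a dictionary from title to a letter-frequency dictionary) are assumptions for the example, and the numbers in the toy input are invented, not measurements from any novel.

```python
def letter_extremes(freqs_by_title):
    """For each letter, find which title has the highest and the lowest frequency.

    freqs_by_title maps title -> {letter: frequency}.
    Returns rows of (letter, max_freq, max_title, min_freq, min_title).
    """
    letters = sorted({letter for f in freqs_by_title.values() for letter in f})
    rows = []
    for letter in letters:
        # A letter missing from a title counts as frequency 0.
        by_title = {t: f.get(letter, 0.0) for t, f in freqs_by_title.items()}
        hi = max(by_title, key=by_title.get)
        lo = min(by_title, key=by_title.get)
        rows.append((letter, by_title[hi], hi, by_title[lo], lo))
    return rows

# Invented two-letter toy data with placeholder titles.
toy = {
    "Book A": {"a": 0.6, "b": 0.4},
    "Book B": {"a": 0.5, "b": 0.5},
}
extremes = letter_extremes(toy)
```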
You can see that A Christmas Carol has a number of extreme letter frequencies. This could just be due to its shortness: I imagine longer works tend to smooth out their frequencies. Pickwick Papers also has a number of extremes; this is Dickens's first novel, so perhaps he was making odd choices of letters. Or maybe it is just that the word "Pickwick" appears a lot, which might explain the high occurrence of the letters c, k, and p, though why Great Expectations is more dense in 'w', I have no idea.