R

multidimensional scaling, baby!

Tue, 2011-06-07 21:58

I've been wanting to try some multidimensional scaling since 1998, and now I've tried it.

Multidimensional scaling is this fantastic concept whereby "objects" that live in a (possibly) high-dimensional space (i.e., each object comes with a perhaps large, finite set of values) can be approximately placed into a lower-dimensional space, for better visualization and perhaps a better understanding of the really important features that distinguish the objects. The crux is to generate a set of distances between your objects. From this set of distances, after a dimension is chosen (say, 2), the objects are placed into a two-dimensional space as well as possible, in the sense that the distances between the two-dimensional coordinates represent the original distances as accurately as possible.
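To make the "distances in, coordinates out" idea concrete, here is a rough sketch of one standard variant, classical (Torgerson) MDS, written in plain Python rather than R. It is only an illustration: the function name classical_mds, its parameters, and the use of power iteration to find eigenvectors are my own choices, and it assumes the input distances are close to Euclidean so that the leading eigenvalues are positive. A real statistics package will do this more carefully.

```python
import math
import random


def classical_mds(dist, dims=2, iterations=500, seed=0):
    """Place n objects in `dims` dimensions, given an n x n symmetric
    matrix of pairwise distances (classical / Torgerson MDS).

    Hypothetical helper for illustration; assumes roughly Euclidean input.
    """
    n = len(dist)
    # Step 1: square every distance.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    # Step 2: double-center the squared distances. For Euclidean input,
    # b[i][j] is then the inner product of centered points i and j.
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Step 3: power iteration for the dominant eigenvector of b.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the matching eigenvalue (v has unit length).
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))
        # Step 4: coordinate k is the eigenvector scaled by sqrt(eigenvalue).
        scale = math.sqrt(eigenvalue) if eigenvalue > 0 else 0.0
        for i in range(n):
            coords[i][k] = v[i] * scale
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


if __name__ == "__main__":
    # Toy input: the corners of a 3-by-4 rectangle, described only by distances.
    corners = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
    d = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in corners] for p in corners]
    for point in classical_mds(d):
        print(point)
```

The recovered coordinates will generally be a rotated, reflected, and shifted copy of the original layout, since distances alone cannot pin down orientation. That is also why the axes of an MDS plot have no built-in meaning and have to be interpreted after the fact.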

I first encountered this idea in Larry Polansky's seminar on timbre at Dartmouth, in 1997 or 1998, while I was wallowing in unemployment before getting my first post-PhD job. The example I most remember is a study in which people were played pairs of sounds and asked to rate how "similar" the sounds were (i.e., to specify a distance between pairs of sounds). When the ratings were run through the MDS process, a two-dimensional model turned out to be reasonably accurate. The fun part is to then figure out what, if anything, those two dimensions represent. In the case of the sounds, one of the dimensions seemed to be attack time, and I forget what the other was (I'll have to fill this in later).

Since then, I've been meaning to try MDS out. I recently found out that the horribly named statistics software R does MDS, so I thought I'd give it a whirl. But what to try it on? I decided I would measure the letter frequencies in a bunch of Dickens novels. I then defined the distance between two Dickens novels to be the Euclidean distance between their frequency vectors (i.e., the square root of the sum of the squares of the differences in frequency for each letter). Here is the result!
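For the curious, the distance just described can be sketched in a few lines of Python. This is not the code used for the post: the function names are made up for illustration, and details such as folding upper case into lower case and ignoring everything except the letters a to z are assumptions on my part. The sample texts are placeholders, not actual Dickens.

```python
import math
from collections import Counter
from string import ascii_lowercase


def letter_frequencies(text):
    """Return 26 relative frequencies, one per letter a-z.

    Assumption: case is ignored and non-letters are dropped.
    """
    counts = Counter(ch for ch in text.lower() if ch in ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no letters a-z")
    return [counts[ch] / total for ch in ascii_lowercase]


def euclidean_distance(p, q):
    """Square root of the sum of squared per-letter differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(texts):
    """Pairwise distances for a dict mapping title -> full text.

    Returns the sorted titles and a square matrix in that order.
    """
    names = sorted(texts)
    freqs = {name: letter_frequencies(texts[name]) for name in names}
    matrix = [[euclidean_distance(freqs[a], freqs[b]) for b in names]
              for a in names]
    return names, matrix


if __name__ == "__main__":
    # Placeholder snippets standing in for whole novels.
    samples = {
        "TextOne": "It was the best of times, it was the worst of times.",
        "TextTwo": "Marley was dead: to begin with.",
    }
    names, matrix = distance_matrix(samples)
    print(names)
    print(matrix)
```

The resulting matrix is symmetric with zeros on the diagonal, which is the shape of input an MDS routine expects.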

Not too bad! A Christmas Carol and Pickwick Papers are definitely the outliers. I don't exactly know why, but here is a table with some data:

letter  max frequency  novel (max)  min frequency  novel (min)
a 0.0821834598387217 GreatExpectations 0.0768776118409004 ChristmasCarol
b 0.0170449592650954 HardTimes 0.0139671649105611 TaleOfTwoCities
c 0.0267413297836624 PickwickPapers 0.0209940169505509 OurMutualFriend
d 0.0478662886811785 BarnabyRudge 0.0432080565861063 MartinChuzzlewit
e 0.120758409929298 OliverTwist 0.11407513874898 MartinChuzzlewit
f 0.0216483554951017 TaleOfTwoCities 0.0189912249888765 BleakHouse
g 0.0240497477648469 ChristmasCarol 0.0197754664184909 NicholasNickleby
h 0.0685280455521702 ChristmasCarol 0.058317954787444 PickwickPapers
i 0.072300095070962 DavidCopperfield 0.0676315054556165 BarnabyRudge
j 0.00226371290097575 GreatExpectations 0.00094900354627641 ChristmasCarol
k 0.0115787353264322 PickwickPapers 0.00794740229910164 LittleDorrit
l 0.0357287315726408 NicholasNickleby 0.031509884032025 TaleOfTwoCities
m 0.0307190937415742 DavidCopperfield 0.0230840950335481 ChristmasCarol
n 0.070569519176433 MartinChuzzlewit 0.0650650150675124 ChristmasCarol
o 0.0749343436036701 HardTimes 0.0685508060178138 PickwickPapers
p 0.0193699580075308 PickwickPapers 0.0153119427801028 BleakHouse
q 0.0017914839862351 NicholasNickleby 0.000757537918518889 ChristmasCarol
r 0.0603013629290098 PickwickPapers 0.0530640557395868 GreatExpectations
s 0.0611274828097165 ChristmasCarol 0.0547473624334335 GreatExpectations
t 0.0848010789635567 GreatExpectations 0.0808519210208059 PickwickPapers
u 0.0302455269833819 HardTimes 0.026839467165667 PickwickPapers
v 0.00954165482444703 OliverTwist 0.00847443517639812 ChristmasCarol
w 0.026098975344962 GreatExpectations 0.0225608707402767 MartinChuzzlewit
x 0.00162849024123349 PickwickPapers 0.00107387243394436 ChristmasCarol
y 0.0242261229705905 BleakHouse 0.0189217987779498 ChristmasCarol
z 0.000507800143182991 ChristmasCarol 0.000180203917114389 LittleDorrit

I created this table to show the maximum and minimum frequency of occurrence of each letter in Dickens novels.
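A table like this one can be produced with a short helper along the following lines. Again, this is a sketch in Python under my own assumptions: letter_extremes and extreme_counts are hypothetical names, the input is assumed to be a dict mapping each novel to its per-letter frequencies, and the numbers in the example are invented toy values, not the real Dickens data.

```python
from collections import Counter
from string import ascii_lowercase


def letter_extremes(freqs_by_novel, letters=ascii_lowercase):
    """For each letter, find which novel has the highest and lowest frequency.

    freqs_by_novel: dict of novel name -> dict of letter -> frequency.
    Returns dict of letter -> (max_freq, novel_with_max, min_freq, novel_with_min).
    """
    rows = {}
    for letter in letters:
        by_novel = {novel: freqs.get(letter, 0.0)
                    for novel, freqs in freqs_by_novel.items()}
        top = max(by_novel, key=by_novel.get)
        bottom = min(by_novel, key=by_novel.get)
        rows[letter] = (by_novel[top], top, by_novel[bottom], bottom)
    return rows


def extreme_counts(rows):
    """Count how many max/min slots each novel occupies in the table."""
    counts = Counter()
    for _max_freq, top, _min_freq, bottom in rows.values():
        counts[top] += 1
        counts[bottom] += 1
    return counts


if __name__ == "__main__":
    # Invented toy frequencies for two letters and three made-up titles.
    toy = {
        "NovelA": {"a": 0.090, "b": 0.010},
        "NovelB": {"a": 0.080, "b": 0.020},
        "NovelC": {"a": 0.085, "b": 0.015},
    }
    rows = letter_extremes(toy, letters="ab")
    for letter, row in rows.items():
        print(letter, *row)
    print(extreme_counts(rows).most_common())
```

Counting how often each title shows up is a quick way to spot the works with unusual letter usage.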

You can see that A Christmas Carol has a lot of extreme letter frequencies (12 of the 52 maxima and minima in the table). This could just be due to its shortness: I imagine longer works tend to smooth out their frequencies. Pickwick Papers also has a number of extremes (9 of them); this is Dickens's first novel, so perhaps he was using odd choices of letters. Or maybe it is just that the word "Pickwick" appears a lot, which might explain the high occurrence of the letters c, k, and p, though why Great Expectations is more dense in 'w', I have no idea.