Tour de France and multidimensional scaling

Thu, 2015-07-30 19:09

I've had the following idea for a long time and finally took the time to try it out.

In the Tour de France, every rider receives a rank (i.e., placing) on each stage. In the 2015 tour, there were 20 stages (excluding the team time trial, which works differently). Each rider who finishes the tour (i.e., finishes all stages) then has a 20-dimensional vector of their placings. For example, Rohan Dennis had the vector


We can thus embed the set of finishing riders into a 20-dimensional space. What is this embedded set "like"?
Twenty dimensions is too many to hope for an easy way to, say, visualize the set.
Fortunately, there is an amusing technique called multidimensional scaling (MDS). The idea of MDS is to take a high-dimensional data set and model it in a space of lower dimension. The key idea is to maintain, as much as possible, some measure of proximity: if two points are "close" in the original space, we want them to be "close" in the lower-dimensional space. How we measure "close" is up to us, of course.
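
To make the "keep close points close" idea concrete, here is a minimal sketch in R on made-up data (the points and variable names are purely illustrative and have nothing to do with the rider data). The five points live in 3-d but actually lie in a plane, so a 2-d model should be able to reproduce all pairwise distances essentially exactly:

```r
# Toy data: five points in 3-d that all lie in the plane z = x + y.
pts <- cbind(x = c(0, 1, 0, 1, 2), y = c(0, 0, 1, 1, 3))
pts <- cbind(pts, z = pts[, "x"] + pts[, "y"])

# Pairwise Euclidean distances in the original 3-d space.
d_high <- dist(pts)

# Classical MDS: find a 2-d configuration matching those distances.
emb <- cmdscale(d_high, k = 2)

# Pairwise distances in the 2-d model, and the worst mismatch.
d_low <- dist(emb)
max_err <- max(abs(as.vector(d_high) - as.vector(d_low)))
print(max_err)  # expected to be tiny (numerical noise only)
```

With real data that does not sit in a plane, the mismatch will not vanish; MDS just tries to keep it small.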

I applied this technique to the rider data and made a 2-d model for easy viewing. The goodness of fit (GOF) is not great, about 0.548, but the visualization is nice (click on it to go to Flickr for a somewhat larger version, or see this pdf):

2015 Tour de France MDS

Some things we can see in this plot:

  • The top GC riders and climbers all cluster in the upper right, with Froome (the winner) in the most extreme position.
  • In the bottom left we find all the sprinters (in particular the cluster of Greipel, Kristoff, Cavendish, Degenkolb, Coquard, and Boasson Hagen).
  • I don't know what to say about the cluster of nine riders above the main sprinter cluster.
  • Most noticeably, we see Peter Sagan on his own at the bottom, right of center. Sagan achieved high placings on many stages that pure sprinters could not, and fared very well on the sprint stages, too. On the other hand, he didn't get high places on some mountain stages.

I pulled the stage data off the web and processed the files into a csv file with names and placings with a bit of Perl. I then processed the csv file with R like this:

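# Two csv files: one with ranks only, one with rider names and a header row.
ranks <- read.csv(file.choose(new=FALSE), header=FALSE, sep=",")
ranksplusnames <- read.csv(file.choose(new=FALSE), header=TRUE, sep=",")
# "Concave down" adjustment applied to every placing.
myfun <- function(x) {return(x^0.3)}
# Euclidean distances between riders, then classical MDS down to 2 dimensions.
ll <- cmdscale(dist(apply(ranks, MARGIN=c(1,2), myfun), method="minkowski", p=2), k=2, eig=FALSE)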

I actually used two different csv files, one with nothing but ranks, and the other with names and headers. I'm sure you could get by with one.
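
As a rough sketch of how a single csv could do the job: read the file with its header, split off the name column, and use the remaining columns as the rank matrix. The column layout (a `name` column followed by one column per stage) and the tiny inline data set are assumptions for illustration, not the actual file used here.

```r
# Hypothetical csv layout: rider name first, then one placing per stage.
csv_text <- "name,stage1,stage2,stage3
Rider A,1,50,3
Rider B,120,2,140
Rider C,60,61,59"

riders <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Everything except the first column is numeric placings.
rank_matrix <- as.matrix(riders[, -1])
rownames(rank_matrix) <- riders$name

# Same pipeline as before: adjust the placings, compute distances, run MDS.
coords <- cmdscale(dist(rank_matrix^0.3), k = 2)
print(coords)
```

Because the row names carry through `dist` and `cmdscale`, the rows of `coords` stay labelled by rider, so no second file is needed for the names.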

I defined myfun to "adjust" the rankings. This is based on the idea that nobody really cares whether they finish 120th versus 140th, but they care a lot whether they finish 3rd versus 23rd. I thought applying a "concave down" function to the rankings should make comparisons more "realistic" (other functions would give similar results, including, say, 1/x, as it groups larger rankings closer together). However, I was surprised to find that this did not seem to make a great difference. Hmmm.
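
A quick way to see what the adjustment does is to compare the two gaps from the example directly. This is only a small sketch; the names `gap_back` and `gap_front` are made up for illustration.

```r
# The adjustment: raise each placing to the power 0.3.
myfun <- function(x) x^0.3

# Both gaps are 20 places wide before the adjustment.
gap_back  <- myfun(140) - myfun(120)  # 120th versus 140th
gap_front <- myfun(23)  - myfun(3)    # 3rd versus 23rd

print(c(back = gap_back, front = gap_front))
print(gap_front / gap_back)  # how much more the front gap counts
```

After the adjustment, the 3rd-versus-23rd gap comes out several times larger than the 120th-versus-140th gap, which is the intended effect.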

The cmdscale command does the real work. The first step is to calculate a matrix of distances between each pair of riders (using the dist function). There are lots of options for calculating the distance; here, I used Minkowski distance with p=2 (i.e., plain old Euclidean distance). This distance matrix gets handed to cmdscale, which finds a set of points in a space of the specified dimension (k=2) that best captures this distance information. With eig=FALSE, you just get a set of points; setting it to TRUE returns more information, including the eigenvalues and the goodness of fit.
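
The two points above (Minkowski with p=2 is Euclidean, and eig=TRUE returns extra information) can be checked with a short sketch. The rank data here is randomly generated, not the real 2015 placings, and `fake_ranks`, `n_riders`, and `n_stages` are illustrative names.

```r
set.seed(1)
n_riders <- 30
n_stages <- 20

# Made-up placings: each stage (column) is a random ordering of the riders.
fake_ranks <- sapply(seq_len(n_stages), function(s) sample(n_riders))

# Minkowski distance with p = 2 should agree with Euclidean distance.
d_mink <- dist(fake_ranks^0.3, method = "minkowski", p = 2)
d_eucl <- dist(fake_ranks^0.3, method = "euclidean")
same <- isTRUE(all.equal(as.vector(d_mink), as.vector(d_eucl)))
print(same)

# With eig = TRUE, cmdscale returns a list rather than a bare matrix.
fit <- cmdscale(d_mink, k = 2, eig = TRUE)
print(head(fit$points))  # the 2-d coordinates
print(fit$GOF)           # two goodness-of-fit values, based on the eigenvalues
```

On random placings like these there is little structure to capture, so the goodness of fit should not be expected to resemble the value for the real data.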

This set of points is then given to textplot (part of the wordcloud package), which plots the points using the names of the riders. textplot makes sure that the names do not overlap, which is useful here, since the upper left is so crowded. There is a bit of grep in there to extract just the last name of each rider, which also helps with the crowding.
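
The plotting step is not shown above, so here is a hedged sketch of what it could look like rather than the exact code used. It assumes names are stored as "First Last" and uses made-up coordinates; the regular expression keeps only the final word, so a two-word surname such as Boasson Hagen would need special handling. If the wordcloud package is not installed, it falls back to base graphics, where labels may overlap.

```r
# A few riders from the post; the coordinates are invented for illustration.
riders <- c("Chris Froome", "Peter Sagan", "Andre Greipel", "Mark Cavendish")
coords <- cbind(c(1.0, 0.2, -0.8, -0.9),
                c(0.9, -1.0, -0.6, -0.7))

# Keep only the last whitespace-separated word of each name.
last_names <- sub("^.* ", "", riders)

if (requireNamespace("wordcloud", quietly = TRUE)) {
  # textplot nudges labels apart so that they do not overlap.
  wordcloud::textplot(coords[, 1], coords[, 2], last_names, cex = 0.8)
} else {
  # Fallback without overlap avoidance.
  plot(coords, type = "n", xlab = "", ylab = "")
  text(coords, labels = last_names, cex = 0.8)
}
```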