# mds

## Tour de France and multidimensional scaling

Thu, 2015-07-30 19:09

I've had the following idea for a long time and finally took the time to try it out.

In the Tour de France, every rider receives a rank (i.e., placing) on each stage. In the 2015 tour, there were 20 stages (exclusing the team time trial which works differently). Each rider who finishes the tour (i.e., finish all stages) has then a 20-dimensional vector of their placings. For example, Rohan Dennis had the vector

<1,88,122,88,107,156,183,142,139,101,77,106,101,108,106,98,85,65,80,151>.

We can thus embed the set of finishing riders into a 20-dimensional space. What is this embedded set "like"?
Twenty dimensions is too many to hope to have an easy way to, say, visualize, the set.
Fortunately, there is an amusing technique called multidimensional scaling (mds). The idea of mds is to take a high dimensional data set and model it in a space of lower dimension. The key idea is to maintain, as much as possible, some measure of proximity: if two points are "close" in the original space, we want them to be "close" in the lower dimensional space. How we measure close is up to us, of course.

I applied this technique to the rider data and made a 2-d model for easy viewing. The goodness of fit (GOF) is not great, about 0.548, but the visualization is nice (click on it to go to Flickr for a somewhat larger version, or see this pdf):

Some things we can see in this plot:

• The top GC riders and climbers all cluster in the upper right, with Froome (the winner) in the most extreme position.
• In the bottom left we find all the sprinters (in particular the cluster of Greipel, Kristoff, Cavendish, Degenkolb, Coquard, and Boasson Hagen).
• I don't know what to say about the cluster of nine riders above the main sprinter cluster.
• Most noticeably, we see Peter Sagan on his own at the bottom, right of center. Sagan achieved high placings on many stages that pure sprinters could not, and fared very well on the sprint stages, too. On the other hand, he didn't get high places on some mountain stages.

I pulled the stage data off the web and processed the files into a csv file with names and placings with a bit of Perl. I then processed the csv file with R like this: