Sidepodcast // All for F1 and F1 for all

Innovation inspired: Confessions of a data junkie // F1's role in data science and data visualisation

Published by Tony Hirst

As the teams go away to start crunching through the data collected from pre-season testing, the data junkies amongst us can only wonder at what riches they have to play with, and at how we can get started playing with some of it ourselves.

As well as innovating on the vehicles themselves, F1 teams also help drive innovation in high performance computing and data analysis, with McLaren, for example, recently tying up a deal to provide predictive analytics expertise to KPMG. This side of the sport, however, and its potential for helping promote engagement in the STEM subject areas (Science, Technology, Engineering, and Mathematics) is often under-reported.

While there have been some related initiatives - a race strategy worksheet from McLaren dating back at least to 2010, and the first challenge in the recent Tata Communications F1 Connectivity Innovation Prize to find novel uses for live timing data - I think there is great potential for F1 to help innovate, educate and engage far more widely in the currently hot-topic areas of data science and data visualisation.

Tools R Us

Whilst the data available to F1 stats fans may be limited, there is data out there, and we can play with it using tools that are perhaps no less sophisticated than some of the tools that the teams work with on a daily basis.

For several years, I have been playing with data from the openly licensed Ergast motorsport results database using R, an open source statistical programming language with a huge range of community-supported libraries and toolkits, as well as industrial reach (for example, Microsoft recently acquired Revolution Analytics, a company that provides enterprise-level support for R-based data analysis). Long-time Sidepodcast readers might even remember some examples from posts such as Statistics and analysis from the F1 Data Junkie, where I produced a chart that tried to summarise key elements of specific races to help Mr C's failing memory when doing race updates!

Hungary race position summary
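
As a rough illustration of the kind of per-driver summary that could sit behind a race position chart like this one, here is a minimal sketch. It is written in Python rather than R purely for brevity, it is not the code used to produce the original chart, and the driver codes and lap positions are invented for the example rather than taken from Ergast. It also assumes that "lap 0" stands for the grid position.

```python
from collections import defaultdict

# Hypothetical lap-by-lap data as (driver, lap, position) tuples.
# Lap 0 is assumed to represent the starting grid. All values are invented.
LAP_POSITIONS = [
    ("AAA", 0, 1), ("BBB", 0, 2), ("CCC", 0, 3),
    ("AAA", 1, 2), ("BBB", 1, 1), ("CCC", 1, 3),
    ("AAA", 2, 3), ("BBB", 2, 1), ("CCC", 2, 2),
    ("AAA", 3, 2), ("BBB", 3, 1), ("CCC", 3, 3),
]


def summarise_positions(rows):
    """Reduce lap-by-lap positions to one summary record per driver."""
    # Group positions by driver, keyed on lap number.
    by_driver = defaultdict(dict)
    for driver, lap, position in rows:
        by_driver[driver][lap] = position

    summary = {}
    for driver, laps in by_driver.items():
        first_lap, last_lap = min(laps), max(laps)
        positions = list(laps.values())
        summary[driver] = {
            "grid": laps[first_lap],      # position on the earliest lap recorded
            "finish": laps[last_lap],     # position on the final lap recorded
            "best": min(positions),       # numerically lowest = highest placed
            "worst": max(positions),
            # Positive means places gained between grid and finish.
            "gained": laps[first_lap] - laps[last_lap],
        }
    return summary


summary = summarise_positions(LAP_POSITIONS)

# Print drivers in finishing order.
for driver, stats in sorted(summary.items(), key=lambda kv: kv[1]["finish"]):
    print(driver, stats)
```

Each record holds the handful of values (grid, finish, best, worst, places gained) that a position summary chart might plot for a driver. With real data, the tuples would come from a lap times or results table instead of a hard-coded list.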

The lifeblood of F1

To try to pull together the various experiments and data sketches I'd been dabbling with, I started working on a book part way through last year using the Leanpub publication platform. From the consumer side, Leanpub provides a storefront for ebooks in a range of formats (pdf, epub, mobi) and a novel payments system: publishers define a minimum price (which could be as low as $0) and a recommended price, but it's up to the buyer how much they actually pay (as long as it's at least the minimum price). When a consumer buys a book, they actually gain the ability to download the current state of the published book for as long as the author continues to publish updates.

The platform also provides an opportunity for the technical author to explore new workflows: the manuscript is provided as one or more Markdown documents or HTML documents.

In my case, I use an environment called RStudio to write "R markdown" documents that let me blend prose with R programme code, which can automatically produce output code and charts that can be passed straight to Leanpub. Finally, in addition to automatically generating and publishing the final e-book, you can also bundle in code downloads (I'm trying to put together a prebuilt virtual machine, for example, containing the RStudio environment, sample code and example data files so readers can play along too).

"What has all that got to do with Formula 1?" you might well ask.

The answer is simple: It's an innovative, efficient and effective workflow that I have been prompted to explore through my engagement with Formula One and the compulsion to innovate that it is associated with, an ethos also reflected by Mr C's Sidepodcast website developments and Christine's publishing adventures.

Wrangling F1 Data With R

Innovation is the lifeblood and ethos of F1. Every area of the business is up for grabs if it can be improved. Even task planning and inventory management, if Microsoft Dynamics' sponsorship of Lotus and the provision of their Enterprise Resource Planning suite are anything to go by!

Aside from deepening my engagement with the sport, trying to pull together the Wrangling F1 Data With R book has also provided me with the opportunity to develop my skills and understanding in several areas outside the sport:

  • It has given me a recreational context for developing my data wrangling skills in general (if you like Sudoku or Killer, you'll love data wrangling)
  • It has improved my SQL query writing and patchy skills in writing R
  • It has forced me to try to explain the stories I think I can see in, and the insights I think I can learn from, the charts I've been playing with
  • It has led me to various academic papers whose findings I have tried, and will continue trying, to replicate, getting the research out of the academic ghetto and into the real world (or at least, into the outside/inside world of F1)
  • It has given me an opportunity to explore custom chart design and new ways of representing information graphically
  • It has given me a basis for exploring how to support a programming-related book with pre-built virtual machines that contain the data and applications you need to get started on working with the data directly
  • It has got me thinking about the economics of the web, how I can try to (re)cover the costs associated with my F1 data wrangling activities in a reasonable way and financially support the typically free services I draw on whilst doing it

I also hope that, somewhere along the line, other F1 data junkies and stats fans may find something of use in it (or be prompted to send me corrections!), and that it may encourage other F1 fans in general to start exploring new data-related technologies, using either F1 data or data of their own.

Grab a copy of Wrangling F1 Data With R

A free preview of several chapters of the book is also available from Leanpub.