Analyzing a MySQL database with R

In this article, we analyze a MySQL database which contains soccer transfer data, using the R environment, on OS X. We show in a few simple steps how you can link the two together.

  Prerequisites

Make sure you have a recent version of MySQL, and the latest version of R installed. As our graphical user interface, we use RStudio. From RStudio you can install the MySQL package: install.packages("RMySQL"). If this doesn’t work for you, please refer to the installation instructions. For plotting, install the ggplot2 library by issueing a install.packages("ggplot2") in RStudio. Alternatively, you can use the graphical interface: in the bottom right panel, choose ‘Packages’ and click ‘Install Packages’. This presents you with a dialog to choose a CRAN mirror, and a prompt where you can enter the package name.

Analysis

Our MySQL database contains a table transactions, which contains transaction data. The transactions table has a column transfer_value of type float, which is what we are interested in. First, we generate a big list of all the transactions: SELECT transfer_value FROM transactions, which generates a MySQL table with a single column that contains the transaction data.

First, let’s make sure the MySQL library is imported:

> library(RMySQL)

Next, we connect to our database, which is named soccer:

> con <- dbConnect(MySQL(), dbname="soccer")

Now we can start to issue queries:

> transfer_values <- dbGetQuery(con, "select transfer_value from transactions")
> summary(transfer_values)
transfer_value    
Min.   :    5900  
1st Qu.:  500000  
Median : 1700000  
Mean   : 3688301  
3rd Qu.: 4500000  
Max.   :94000000

This shows us a summary of the transfer values. The maximum is 94 million, which was Cristiano Ronaldo’s transfer to Real Madrid. To look at how the numbers are distributed, we can plot them in a graph, using the qplot function from the ggplot2 library, which is a convenience function that quickly generates ggplot plot objects.

> library(ggplot2)
> qplot(transfer_values$transfer_value)

This gives us a nice picture, which looks like a Power Law:

If we zoom in on the transfers up to 10 million, we get a more detailed picture, showing peaks at regular intervals: apparently, the soccer negotiators like nice round numbers too.

> qplot(transfer_values$transfer_value, xlim=c(0,1e07), binwidth=1e05, ylim=c(0,500))

For more info on the qplot command, you can type help(qplot), and RStudio will show you the relevant help page on the bottom right.

Conclusion

Even though we did very basic analysis of the data, we have seen how to connect R to MySQL and how to use the ggplot2 library to visualize the data.

Thanks to Peter Tegelaar for reading a draft of this post and helping with the code.

Update: see Hacker News for the discussion.