In this article, we analyze a MySQL database which contains soccer transfer data, using the R environment, on OS X. We show in a few simple steps how you can link the two together.
Make sure you have a recent version of MySQL, and the latest version of R installed. As our graphical user interface, we use RStudio. From RStudio you can install the MySQL package:
install.packages("RMySQL"). If this doesn’t work for you, please refer to the installation instructions. For plotting, install the ggplot2 library by issueing a
install.packages("ggplot2") in RStudio. Alternatively, you can use the graphical interface: in the bottom right panel, choose ‘Packages’ and click ‘Install Packages’. This presents you with a dialog to choose a CRAN mirror, and a prompt where you can enter the package name.
Our MySQL database contains a table
transactions, which contains transaction data. The
transactions table has a column
transfer_value of type
float, which is what we are interested in. First, we generate a big list of all the transactions:
SELECT transfer_value FROM transactions, which generates a MySQL table with a single column that contains the transaction data.
First, let’s make sure the MySQL library is imported:
Next, we connect to our database, which is named
> con <- dbConnect(MySQL(), dbname="soccer")
Now we can start to issue queries:
> transfer_values <- dbGetQuery(con, "select transfer_value from transactions") > summary(transfer_values) transfer_value Min. : 5900 1st Qu.: 500000 Median : 1700000 Mean : 3688301 3rd Qu.: 4500000 Max. :94000000
This shows us a summary of the transfer values. The maximum is 94 million, which was Cristiano Ronaldo’s transfer to Real Madrid. To look at how the numbers are distributed, we can plot them in a graph, using the
qplot function from the
ggplot2 library, which is a convenience function that quickly generates
ggplot plot objects.
> library(ggplot2) > qplot(transfer_values$transfer_value)
This gives us a nice picture, which looks like a Power Law:
If we zoom in on the transfers up to 10 million, we get a more detailed picture, showing peaks at regular intervals: apparently, the soccer negotiators like nice round numbers too.
> qplot(transfer_values$transfer_value, xlim=c(0,1e07), binwidth=1e05, ylim=c(0,500))
For more info on the
qplot command, you can type
help(qplot), and RStudio will show you the relevant help page on the bottom right.
Even though we did very basic analysis of the data, we have seen how to connect R to MySQL and how to use the
ggplot2 library to visualize the data.
Thanks to Peter Tegelaar for reading a draft of this post and helping with the code.
Update: see Hacker News for the discussion.