Dianne Cook and her colleagues recently published an article in the Annual Review (http://www.annualreviews.org/doi/abs/10.1146/annurev-statistics-041715-033420?src=recsys&) on the different ways to implement statistical graphics in Big Data. Since the 1970s and Tukey’s work, exploratory data analysis (EDA) has become common among social scientists. Contrary to classical statistical inference, EDA does not necessarily require a specific hypothesis, largely because of the enormous amount of data involved, though this type of analysis can later lead to more formal hypothesis testing. I must confess this preliminary step seduces me much more than generating hypotheses from a random hypothesis generator (which can, however, be really efficient for showing off in society). Let me add that you can sometimes end up learning really different things depending on whether you choose EDA or modeling and prediction-based experiments, and I think the Piketty (EDA) vs. Tirole (analytical modeling) fight over the new French labor law gives a good example of how different schools can lead to opposite conclusions.
During a data challenge for the useR! 2014 convention, the OECD released data about maths and reading scores for 500,000 students from 65 different countries (usually used for the PISA ranking). As you can see in figure 1, when sample weights were used to calculate the average results, there was no universal maths gap between boys and girls (roughly in similar proportions). That said, looking more closely at maths educational programs and economic investments in countries with quite opposite results (Latin American countries vs. Jordan, Qatar, Thailand, for example) could help us see what’s going on and build evidence-based policy on it. In contrast, all countries show a gender gap in favor of girls for reading scores. This is hard to account for with developmental explanations, since the well-known sex differences in verbal ability amount to a small female advantage in verbal production (d=-0.33) and essentially no difference in vocabulary (d=-0.02) or reading comprehension (d=0.03) (see Hines, 2004; Hyde and Linn, 1988 for more details and this blog post for those who missed the introduction on meta-analysis).
Figure 1 © 2016 Annual Reviews
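To make the role of sample weights concrete, here is a minimal sketch of a weighted gender gap per country. It is not the code behind figure 1: the function name, the record layout and all the numbers are made up for illustration, and the real PISA files use their own variable names and a more elaborate weighting scheme.

```python
from collections import defaultdict

def weighted_mean_gap(records):
    """Return {country: weighted mean score of boys minus that of girls}.

    Each record is (country, gender, score, weight); this layout is a
    hypothetical one chosen for the example, not the PISA file format.
    """
    # (country, gender) -> [sum of weight * score, sum of weights]
    sums = defaultdict(lambda: [0.0, 0.0])
    for country, gender, score, weight in records:
        acc = sums[(country, gender)]
        acc[0] += weight * score
        acc[1] += weight
    means = {key: ws / w for key, (ws, w) in sums.items()}
    countries = sorted({country for country, _ in means})
    return {
        c: means[(c, "boy")] - means[(c, "girl")]
        for c in countries
        if (c, "boy") in means and (c, "girl") in means
    }

# Made-up scores and weights, for illustration only.
records = [
    ("A", "boy", 500.0, 1.0),
    ("A", "boy", 520.0, 3.0),
    ("A", "girl", 510.0, 2.0),
    ("A", "girl", 490.0, 2.0),
    ("B", "boy", 450.0, 1.0),
    ("B", "girl", 470.0, 1.0),
]
gaps = weighted_mean_gap(records)
print(gaps)
```

A positive value means boys score higher on average in that country, a negative value means girls do. Students with a larger weight stand for more students in the population, so they pull the average further.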
Another example comes from the 2008 US presidential election. Nate Silver, our little genius, had the great idea of combining and weighting numerous polls. Figure 2 shows the huge variation between pollsters (% difference for Obama plotted against the day of release). While DailyKos was clearly releasing pro-Obama polls, Rasmussen went the other way (Dr. Cook noticed that the venerable Gallup showed surprisingly disappointing poll results… and, contrary to popular belief, FOX showed pretty consistent polls). However, let’s remember that the US election is indirect: you vote for electors (whose number is roughly based on the size of each state’s population).
Figure 2 © 2016 Annual Reviews
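The idea of combining and weighting polls can be sketched in a few lines. This is only an illustration and not Nate Silver’s actual method: weighting by sample size is an assumption made here for simplicity, and the pollster names and numbers are invented.

```python
def weighted_poll_average(polls):
    """Average the margins, weighting each poll by its sample size.

    Weighting by sample size is an assumption made for this sketch.
    """
    total_weight = sum(p["n"] for p in polls)
    return sum(p["margin"] * p["n"] for p in polls) / total_weight

def house_effects(polls, combined):
    """Mean deviation of each pollster's margins from the combined estimate."""
    by_pollster = {}
    for p in polls:
        by_pollster.setdefault(p["pollster"], []).append(p["margin"] - combined)
    return {name: sum(devs) / len(devs) for name, devs in by_pollster.items()}

# Invented polls: margin is the % difference in favor of one candidate.
polls = [
    {"pollster": "P1", "margin": 6.0, "n": 1000},
    {"pollster": "P2", "margin": 2.0, "n": 500},
    {"pollster": "P3", "margin": 4.0, "n": 1500},
]
combined = weighted_poll_average(polls)
effects = house_effects(polls, combined)
print(round(combined, 2), effects)
```

A pollster with a consistently positive or negative deviation is leaning one way relative to the others, which is the kind of pattern figure 2 makes visible.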
Statisticians love challenges. The American Statistical Association regularly organizes the Data Expo and, in 2009, the challenge was to work on flight arrivals and departures within the US over 20 years (120 million records, 12 GB). While big airports like Dallas (DFW in figure 3) or Detroit (DTW) were really efficient, longer delays were observed in Newark (EWR), SF, NYC or Chicago (which has the highest volume, though). A really elegant video can be found here: http://www.annualreviews.org/doi/story/10.1146/multimedia.2016.04.05.418.
Figure 3 © 2016 Annual Reviews
When it comes to evaluating the nutritional value of chocolates, you can use what’s called a “parallel coordinate” plot (mainly for high-dimensional data, with multiple measurements for each sample). In the Eulerian adaptation, you can compare all pairs of variables, while the histograms indicate the variations between the 2 groups (here milk chocolate, in orange, vs. dark chocolate, in red). Panel a, at the top, seems to show that milk chocolate has more carbs, less fiber and more sugars than dark chocolate, but it’s not that easy to read. The Eulerian plot below shows (along with the video here: http://www.annualreviews.org/doi/story/10.1146/multimedia.2016.04.05.420) that higher values occur when fiber is one of the two variables.
Figure 4 © 2016 Annual Reviews
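For readers who have never met parallel coordinates, the preparation step behind such a plot can be sketched as follows: each variable is rescaled to the range 0 to 1 so that every sample becomes one line crossing all the axes. The min-max scaling used here is one common choice and an assumption on my part, not necessarily what the article used, and the chocolate values are invented.

```python
def min_max_scale(rows, variables):
    """Rescale each variable to [0, 1]; one list of values per sample."""
    lo = {v: min(r[v] for r in rows) for v in variables}
    hi = {v: max(r[v] for r in rows) for v in variables}
    scaled = []
    for r in rows:
        scaled.append([
            # A constant variable has no spread, so place it mid-axis.
            (r[v] - lo[v]) / (hi[v] - lo[v]) if hi[v] > lo[v] else 0.5
            for v in variables
        ])
    return scaled

# Invented nutritional values, for illustration only.
variables = ["carbs", "fiber", "sugars"]
rows = [
    {"type": "milk", "carbs": 60.0, "fiber": 2.0, "sugars": 55.0},
    {"type": "dark", "carbs": 40.0, "fiber": 10.0, "sugars": 25.0},
    {"type": "milk", "carbs": 50.0, "fiber": 6.0, "sugars": 40.0},
]
lines = min_max_scale(rows, variables)
for row, heights in zip(rows, lines):
    # Each sample is a polyline: one height per axis, in axis order.
    print(row["type"], heights)
```

Drawing each list of heights as a line across evenly spaced vertical axes, colored by chocolate type, gives the basic parallel coordinate plot.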
I would like to thank Dr. Cook for sending me a version of this article. I won’t go through all the R packages available to play with data, but some are important to remember (notably ggplot2, ggpairs from GGally, tabplot, ggparallel). As Tukey mentioned a few decades ago, “the combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data”. Agreed.