"Summary Statistics Obscure Important Details"
Boston Data Festival was this week, packed with excellent talks on topics from “Quantifying Culture” to “Evaluating Trading Algorithms Using Probabilistic Programming.”
And while the focus in data science today seems to be on big data and novel methods like “deep learning,” David Weisman, a 30-year veteran of the field, decided to focus on the basics during his presentation on Friday night.
Here are David’s fundamental principles to keep in mind:
summary statistics (mean, median, correlation) obscure important details
small samples have high variance
when looking at your (observed) data, also think about unobserved data
correlation is a summary statistic itself, and only measures ONE type of relationship
big data can increase spurious correlations
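The point that correlation captures only one type of relationship is easy to check by hand. Here is a minimal sketch using made-up data (not from David's slides): `y` is completely determined by `x`, yet the Pearson correlation is zero because the relationship is not linear. The helper `pearson_r` is written out for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative data: x is symmetric around zero.
xs = list(range(-5, 6))

# A perfect but non-linear dependence, and a perfect linear one.
r_quadratic = pearson_r(xs, [x ** 2 for x in xs])
r_linear = pearson_r(xs, [2 * x + 1 for x in xs])

print(f"r for y = x^2:    {r_quadratic:.3f}")
print(f"r for y = 2x + 1: {r_linear:.3f}")
```

A correlation near zero therefore says "no linear relationship," not "no relationship."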
Regarding the first point, researchers often summarize data to uncover patterns (even a regression is a summary, David pointed out). But quantifying the variance (spread) of the data is equally important - and that is why graphical summaries are so critical (boxplot with overlaid jitter plot, anyone?).
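As a small illustration of how much a summary can hide, consider two invented samples (the numbers are chosen purely for this example). They share the same mean and median, but their spreads are very different, which a boxplot or jitter plot would show at a glance.

```python
import statistics

# Two made-up samples, chosen only for illustration.
tight = [4, 5, 5, 5, 6]
wide = [1, 2, 5, 8, 9]

for name, sample in (("tight", tight), ("wide", wide)):
    print(
        f"{name}: mean={statistics.mean(sample)}, "
        f"median={statistics.median(sample)}, "
        f"stdev={statistics.stdev(sample):.2f}, "
        f"range=({min(sample)}, {max(sample)})"
    )
```

Reporting only the mean or median would make these two samples look identical.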
Further, large effect sizes must be treated with caution, especially when they come from small samples. Estimates from small samples tend to have high variance, which increases the chance that an apparently large effect is misleading.
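A quick simulation makes this concrete. The sketch below assumes a standard normal population whose true mean is exactly 0, so any nonzero sample mean is pure noise. The sample sizes, number of simulated studies, and the function name `spread_of_sample_means` are all arbitrary choices for illustration.

```python
import random
import statistics

def spread_of_sample_means(sample_size, n_studies, rng):
    """Standard deviation of the sample mean across simulated studies.

    Every study draws from the same standard normal population, so the
    true mean (the "effect") is exactly 0 in all of them.
    """
    means = []
    for _ in range(n_studies):
        draws = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(sum(draws) / sample_size)
    return statistics.stdev(means)

rng = random.Random(42)  # fixed seed so the run is repeatable
spread_small = spread_of_sample_means(sample_size=5, n_studies=2000, rng=rng)
spread_large = spread_of_sample_means(sample_size=500, n_studies=2000, rng=rng)

print(f"n=5:   sample means vary with sd {spread_small:.3f}")
print(f"n=500: sample means vary with sd {spread_large:.3f}")
```

In theory the spread shrinks with the square root of the sample size, so the small studies should scatter roughly ten times more widely around the true value than the large ones.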
Last, to illustrate how big data can show spurious correlation (and the need for careful thinking and analysis), David recommended this amusing collection:
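Separately from that collection, the effect is easy to reproduce in a simulation. This is a rough sketch under made-up settings: 50 columns of pure, independent noise with only 10 observations each. None of the variables is related to any other, yet with over a thousand pairs to search, some pair will look strongly correlated by chance.

```python
import itertools
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_obs = 10    # observations per variable (arbitrary choice)
n_vars = 50   # number of unrelated variables (arbitrary choice)

# Each column is independent noise: there is no real relationship anywhere.
columns = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]

n_pairs = n_vars * (n_vars - 1) // 2
strongest = max(
    abs(pearson_r(a, b)) for a, b in itertools.combinations(columns, 2)
)

print(f"pairs examined: {n_pairs}")
print(f"strongest |r| among pure-noise pairs: {strongest:.2f}")
```

The more variables a dataset has, the more pairs there are to search, and the more impressive the best accidental correlation becomes.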
David’s slides include an excellent set of exercises to help you “think about your data.” They are available from the Boston Data Festival website, under “Friday”.
Also check out this article from Science Magazine, in which the authors emphasize balancing “the new” with the tried and tested.
Finally, Andrew Gelman from Columbia on large effects:
For instance it is not uncommon in an underpowered study for a researcher to state that although his estimate is not statistically significantly different from 0, that could simply be a function of the overly large standard error. Ironically, however, large estimates are actually a byproduct of large standard errors. [Emphasis added]
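One way to see why this happens is a small simulation. This is only a sketch, not Gelman's own analysis: it assumes a normally distributed estimate with a known standard error, and the true effect (0.1) and standard error (0.5) are invented numbers standing in for an underpowered study. An estimate only reaches significance if it lands far from zero, so the significant ones are necessarily large.

```python
import random

true_effect = 0.1      # assumed small true effect (hypothetical units)
standard_error = 0.5   # assumed large standard error: an underpowered study
n_studies = 20000      # number of simulated replications

rng = random.Random(1)  # fixed seed so the run is repeatable
estimates = [rng.gauss(true_effect, standard_error) for _ in range(n_studies)]

# Keep only estimates that clear the usual two-sided 5% threshold.
significant = [e for e in estimates if abs(e) > 1.96 * standard_error]

share_significant = len(significant) / n_studies
mean_magnitude = sum(abs(e) for e in significant) / len(significant)
exaggeration = mean_magnitude / true_effect

print(f"share of studies reaching significance: {share_significant:.1%}")
print(f"average |estimate| among those: {mean_magnitude:.2f}")
print(f"exaggeration relative to the true effect: {exaggeration:.1f}x")
```

With these settings an estimate must exceed 0.98 in magnitude to be significant, so every "significant" result overstates the true effect of 0.1 by nearly a factor of ten or more.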