I’ve been known to kick an occasional hornet’s nest on social media, at least when it comes to data. A recent cri de coeur has been to establish the primacy of statistical thinking in everything we do with data — including topics considered less technical than hardcore data science, like Excel and business intelligence:
Now, it sounds like there is a lot of sympathy for this attitude.
Most people acknowledge there are larger mathematical forces at work than what meets the eye on a table or chart. Some aren’t sure how to really sink their teeth into statistics… or whether it’s really worthwhile. And indeed, there seems to be some debate over how much this matters for the typical data analyst.
Here’s one response alluding to that distinction:
This distinction between data visualization and statistics is worth drilling into. Here goes…
Analytics versus statistics
In one camp, there’s a group who mostly associates data visualization with data exploration, business intelligence and analytics. This is to be distinguished from the more scientific pursuit of statistics:
Cassie Kozyrkov (from the previous video) has even called for separating these pursuits more or less completely. So should data visualization and related work (dashboards, infographics, etc.) be treated as independent of statistics? Here’s what I think.
Data is data
First of all, data is a slippery eel. That’s why we have so many ways to try to grab hold of it. We can try to make a clean break between data visualization and statistics, but even within those two fields there’s a near-infinite number of ways to work with the data. Wouldn’t we want to keep our options open and reach for another toolkit if it helps?
After all, whether we are visualizing it or modeling it, the data is the data. All data is subject to higher mathematical and statistical principles: linearity, expected values, inference and so forth. Without some “spidey sense” of the forces behind the data, it’s too easy to take a chart or table at face value.
If you’re looking for a brilliant overview of how the mathematical way of thinking can sharpen your decision making, check out Jordan Ellenberg’s How Not to Be Wrong. In one of the book’s many real-world examples, Ellenberg points out that because small samples are more variable, it’s the smallest states that have both the lowest and the highest incidences of brain cancer.
North and South Dakota in particular sit at opposite ends of the spectrum, despite being relatively similar states. Imagine sinking large marketing budgets into decisions based on conversion rates that are particularly low or high among smaller groups of customers. It would be easy to do when gazing at a dashboard without the proper statistical safeguards in place.
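To make the small-sample effect concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the group sizes, the 5% true conversion rate, the seed and the function name are all made up and do not come from Ellenberg's book. Every group shares exactly the same underlying rate, so any differences in the observed rates are pure noise.

```python
import random

def simulate_observed_rates(group_sizes, true_rate, seed=0):
    """Return (size, observed_rate) per group; every group shares the same true rate."""
    rng = random.Random(seed)
    results = []
    for size in group_sizes:
        conversions = sum(1 for _ in range(size) if rng.random() < true_rate)
        results.append((size, conversions / size))
    return results

# Hypothetical setup: 50 small customer segments and 50 large ones, all with a 5% true rate.
SMALL, LARGE, TRUE_RATE = 50, 5000, 0.05
rates = simulate_observed_rates([SMALL] * 50 + [LARGE] * 50, TRUE_RATE)

lowest = min(rates, key=lambda r: r[1])
highest = max(rates, key=lambda r: r[1])
small_rates = [rate for size, rate in rates if size == SMALL]
large_rates = [rate for size, rate in rates if size == LARGE]

print("lowest observed rate: ", lowest)
print("highest observed rate:", highest)
print("range among small groups:", max(small_rates) - min(small_rates))
print("range among large groups:", max(large_rates) - min(large_rates))
```

In a setup like this, both the best-looking and the worst-looking segments should come from the small groups, even though nothing distinguishes them except sample size. A dashboard ranking segments by raw conversion rate would show the same pattern.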
Exploring and confirming are both analytics
Now, it’s clear that most data analysts aren’t building neural networks, and most data scientists aren’t having to explain a bar chart to the budget manager they’re paired with. That said, this doesn’t mean that one of these is statistics and the other isn’t. We have John Tukey to thank for the increased prominence of so-called exploratory data analysis (EDA), which breaks this distinction down quite nicely.
Tukey believed that statistics put too much emphasis on confirmatory data analysis (think hypothesis testing, model evaluation, etc.) and not enough on exploring the data for interesting observations and questions. He was a vocal advocate for (you guessed it!) data visualization, even going so far as to invent the boxplot.
(Extra fun fact: John Tukey’s ideas about exploratory analysis also influenced the S language for statistical programming, which John Chambers and colleagues developed at Bell Laboratories, where Tukey worked as well. S was designed to accommodate off-the-cuff data exploration, which was difficult to do in the build-and-compile workflow of many programming languages of the day. The R language is an offshoot of S.)
So I would say that the distinction between analytics and statistics à la Kozyrkov is really the difference between exploratory and confirmatory data analysis. While the two focus on different tasks, they are working on the same data.
Data visualization may be particularly well suited for EDA, but there are indeed many places where data visualization is also used to guide more advanced statistical methods (to plot residuals, model error, clusters, you name it).
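As a small illustration of visualization guiding a model, here is a rough sketch that fits a straight line to deliberately curved data and then draws the residuals as a crude text chart. The data (y equal to x squared) and the helper name `fit_line` are made up for the example. It uses plain Python to stay self-contained, where a real analysis would more likely reach for a plotting library.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data with a curved relationship (y = x squared), for illustration only.
xs = list(range(11))
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Crude text "plot": one row per point, bar length proportional to the residual size.
for x, r in zip(xs, residuals):
    bar = ("+" if r > 0 else "-") * round(abs(r))
    print(f"x={x:>2}  residual={r:+6.1f}  {bar}")
```

The residuals are positive at both ends and negative in the middle. That U-shape is the kind of pattern a residual plot makes obvious at a glance, and it suggests a straight line is the wrong model for this data.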
What’s the answer?
So is data visualization a part of statistics? Given that EDA is a valuable part of statistics, there’s certainly a home for data visualization there. But are there cases where data visualization is used outside of statistics? Possibly so, but that data is subject to the same mathematical rules no matter its use case, just as statistics is.
What do you think? Care to define the relationship between analytics and statistics? Let me know in the comments.