It’s one of those questions where you could ask 10 data professionals and get 11 answers: what’s the difference between statistics, data science and data analytics? (How’s that for sample size?)
Well, here’s my answer.
This is an excerpt from my book Advancing into Analytics: From Excel to Python and R. For a book with a subtitle all about tools, I also try to give attention to big-picture topics such as the relationships between statistics, data analytics and data science. In fact, this passage is from an entire chapter about what I call the “data analytics stack,” where Excel, Python and R are put into the broader context of data analytics and other data pursuits.
Here it is:
Statistics
Statistics is foremost concerned with the methods for collecting, analyzing, and presenting data. We’ve borrowed a lot from the field: for example, we made inferences about a population given a sample, and we depicted distributions and relationships in the data using charts like histograms and scatterplots.
Most of the tests and techniques we’ve used so far come from statistics, such as linear regression and the independent samples t-test. What distinguishes data analytics from statistics is not necessarily the means, but the ends.
Data Analytics
With data analytics, we are less concerned about the methods of analyzing data, and more about using the outcomes to meet some external objective. What the methods report and what the objective requires can diverge: for example, you’ve seen that while some relationships can be statistically significant, they might not be substantively meaningful for the business.
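To make the gap between statistical and substantive significance concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the order values are simulated, the helper name two_sample_z_test is hypothetical, and the p-value comes from a large-sample normal approximation rather than an exact independent samples t-test.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def two_sample_z_test(a, b):
    """Return (difference in means, two-sided p-value).

    Uses a large-sample normal approximation, so it is only
    reasonable when both samples are big.
    """
    diff = mean(a) - mean(b)
    standard_error = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = diff / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, p_value


random.seed(0)
# Simulated (made-up) order values in dollars: the new page design
# raises the true average order by 50 cents on a base of $50.
control = [random.gauss(50.00, 10) for _ in range(50_000)]
variant = [random.gauss(50.50, 10) for _ in range(50_000)]

diff, p_value = two_sample_z_test(variant, control)
print(f"difference in means: ${diff:.2f}")
print(f"p-value: {p_value:.2g}")
```

With samples this large, the p-value falls far below the usual 0.05 cutoff even though the difference is about 1% of an average order. Whether that is worth acting on is a business question, not a statistical one.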
Data analytics is also concerned with the technology needed to implement these insights. For example, we may need to clean datasets, design dashboards, and disseminate these assets quickly and efficiently. While the focus of this book has been on the statistical foundations of analytics, there are other computational and technological foundations to be aware of, which will be discussed later in this chapter.
Business Analytics
In particular, data analytics is used to guide and meet business objectives and assist business stakeholders; analytics professionals often have one foot in the business operations world and another in the information technology one. The term business analytics is often used to describe this combination of duties.
An example of a data or business analytics project might be to analyze movie rental data. Based on exploratory data analysis, the analyst may hypothesize that comedies sell particularly well on holiday weekends. Working with product managers or other business stakeholders, they may run small experiments to collect data and further test this hypothesis. Elements of this workflow should sound familiar from earlier chapters of this book.
Data Science
Finally, there is data science: another field that has inseparable ties to statistics, but that is focused on unique outcomes.
Data scientists also commonly approach their work with business objectives in mind, but the scope of that work is quite different from data analytics. Going back to our movie rental example, a data scientist might build an algorithmically powered system to recommend movies to individuals based on what customers similar to them rented. Building and deploying such a system requires considerable engineering skills. While it’s unfair to say that data scientists don’t have real ties to the business, they are often more aligned with engineering or information technology than their data analytics counterparts.
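To give a feel for what "recommend movies based on what similar customers rented" could mean in code, here is a toy Python sketch. It is not how any production recommender works: the rental histories are made up, the function names are hypothetical, and Jaccard similarity is just one simple way to measure how alike two customers are.

```python
def jaccard(a, b):
    """Share of titles two customers have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, rentals, top_n=2):
    """Rank titles the target has not rented, weighting each title
    by the similarity of the customers who did rent it."""
    seen = rentals[target]
    scores = {}
    for customer, titles in rentals.items():
        if customer == target:
            continue
        similarity = jaccard(seen, titles)
        for title in titles - seen:
            scores[title] = scores.get(title, 0.0) + similarity
    # Highest score first; break ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, score in ranked[:top_n] if score > 0]


# Made-up rental histories for four hypothetical customers.
rentals = {
    "ana": {"Airplane!", "Groundhog Day", "Elf"},
    "ben": {"Airplane!", "Groundhog Day", "Ghostbusters"},
    "cam": {"Alien", "Heat"},
    "dee": {"Groundhog Day", "Elf", "Ghostbusters", "Home Alone"},
}

print(recommend("ana", rentals))  # ['Ghostbusters', 'Home Alone']
```

Even this toy version hints at the engineering questions a real system raises: where the rental histories live, how often scores are recomputed, and what to show a customer like "cam" who has nothing in common with anyone yet.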
Machine Learning
To summarize this distinction, we can say that while data analytics is concerned with describing and explaining data relationships, data science is concerned with building predictive systems and products, often using machine learning techniques.
Machine learning is the practice of building algorithms that improve with more data without being explicitly programmed to do so. For example, a bank might deploy machine learning to predict whether a customer will default on a loan. As more data is fed in, the algorithm may find patterns and relationships in the data and use them to better predict the likelihood of a default. Machine learning models can offer incredible predictive accuracy and can be used in a variety of scenarios. That said, it’s tempting to build a complex machine learning algorithm when a simple one will suffice, and this can lead to difficulty with interpreting and relying on the model.
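As a rough illustration of "a simple one will suffice," the Python sketch below learns a single cutoff from past loans instead of having someone hard-code it. The debt-to-income ratios, the outcomes and the function names are all invented for this example, and a one-rule model like this is a teaching device rather than something a bank would deploy as is.

```python
def fit_threshold(ratios, defaulted):
    """Pick the debt-to-income cutoff that misclassifies the fewest
    past loans. Loans at or above the cutoff are predicted to default."""
    best_cutoff, best_errors = None, len(ratios) + 1
    for cutoff in sorted(set(ratios)):
        errors = sum((r >= cutoff) != d for r, d in zip(ratios, defaulted))
        if errors < best_errors:  # on ties, keep the lower cutoff
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff


def predict(cutoff, ratio):
    """Apply the learned rule to a new applicant."""
    return ratio >= cutoff


# Made-up history: debt-to-income ratio and whether the loan defaulted.
ratios = [0.10, 0.15, 0.22, 0.30, 0.35, 0.41, 0.48, 0.55]
defaulted = [False, False, False, False, True, False, True, True]

cutoff = fit_threshold(ratios, defaulted)
print(cutoff)                 # 0.35
print(predict(cutoff, 0.50))  # True
print(predict(cutoff, 0.20))  # False
```

The rule came from the data, so refitting on a longer history could move the cutoff. It is also easy to explain to a loan officer, which is exactly what gets harder as models grow more complex.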
Machine learning is beyond the scope of this book; for a fantastic overview, check out Aurélien Géron’s Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd edition (O’Reilly). That book relies heavily on Python, so it’s best to have completed Part III of this one first.
Distinct, but Not Exclusive
While distinctions among statistics, data analytics, and data science are meaningful, we shouldn’t let them create unnecessary borders. In any of these disciplines, the difference between a categorical and continuous dependent variable is meaningful. All use hypothesis testing to frame problems. We have statistics to thank for this common parlance of working with data.
Data analytics and data science roles are often intermingled as well. In fact, you’ve learned the basics of a core data science technique in this book: linear regression. In short, there is more that unites these fields than divides them. Though this book is focused on data analytics, you are prepared to explore them all; this will be especially so once you’ve learned R and Python.
Now that we’ve contextualized data analytics with statistics and data science, let’s do the same for Excel, R, Python, and other tools you may learn in analytics.
To sum up: while there are very real differences in theory and practice between these fields, they are all fundamentally operating on the same object: data. What do you think? Care to add an nth opinion on statistics vs. data analytics vs. data science? Please do so in the comments.
And if you’d like to learn more about the data analytics stack and where Excel, Python and R fit in, be sure to check out Advancing into Analytics.