I recently read, per the tweeted recommendation of Secret Life of Pronouns author James Pennebaker, Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are, by Seth Stephens-Davidowitz.
See Seth Stephens-Davidowitz’s book Everybody Lies. Best #bigdata/#language book in years. Must read for #socialpsychology #economics
— James W Pennebaker (@jwpennebaker) June 2, 2017
High accolades from Pennebaker!
I’m not as keen on the book as Pennebaker is. But it’s a good introduction to both the ethics and the methodology of data science. Many of the themes are adult in nature and the author has a strong political bias. Still, it’s well worth the read to get a sense of how Big Data affects social science.
New data, new rules
Classical research often relies on self-reported surveys, frequently of 18-to-21-year-old psychology undergraduates. While significant limitations (such as bias, discussed below) exist in this data, researchers have come up with some sophisticated design and statistical techniques for getting solid research from these datasets.
Newer research has expanded the depth and breadth of research datasets. From blog posts to Google searches to YouTube browsing patterns, these new data sources offer unparalleled research possibilities.
Stephens-Davidowitz decries how most established researchers have not moved past the self-report surveys they are used to. Much of the newer Big Data research lacks the rigor of classical research, but it is still early days. Great contributions are possible in carrying the well-honed techniques of survey development over to less structured datasets.
Bias and the truth
The safer and less observed people feel in their natural environment, the closer their data will come to the truth. As the author points out, most of us show far less discretion in our online queries than in our cocktail party conversation starters. Researchers call a participant’s desire to answer questions in a manner pleasing to others social desirability bias.
So maybe we let loose on Google searches. But social desirability bias is sometimes even more pervasive online than in more traditional data sources. Take social media, which is nearly built on pleasing and impressing others.
The author brings up a great example comparing pieces from the National Enquirer and The Atlantic. While the print and online circulation of the two publications is about equal, The Atlantic gets about 30 times as much social media engagement. Maybe their social media strategy is just that much savvier. But there is probably an element of social desirability bias (wanting to appear sophisticated to Facebook friends) involved.
I didn’t need Google to tell me that Everybody Lies. While Big Data offers unprecedented research opportunities, further integration with classical statistics is needed.
Where does theory fit in, anyway?
A key takeaway for me from this book is Big Data’s pioneering spirit toward extracting meaning from newer (and not just bigger) datasets.
Facebook post and Google traffic data certainly don’t suffer from small sample sizes or a lack of data points across time. This is a huge benefit over survey-based research.
But classical research is more purposeful about experimental design, both because researchers have more control over it and because it is crucial to the validity of the research. It is much more accurate to adjust for bias before the data is collected than after the fact.
Of course, unlike traditional surveys, you don’t necessarily intend your tweeting or posting to become scientific research. This brings up loads of ethical issues, some of which are addressed in the book.
Everybody has always lied
So, survey responses, Google searches, and sales transactions aren’t always in sync. Somewhere, someone is lying. And Big Data offers a new, powerful technique for uncovering human behavior.
But what’s new here? Everyone has always lied. Researchers always knew that. Sure, Big Data is powerful. But the insistence in classical statistics on validity, reliability and replicability is still important. Maybe old-school academics need to update their toolbox of data sources. But data scientists should not dismiss the tenets of old-school statistics, either. I hope for a fusion of the two schools of thought.