
One topic that interests me as a liberal arts grad is how similar good writing and good data work are.
A line often attributed to Mark Twain (it actually traces back to Blaise Pascal) goes, “I didn’t have time to write you a short letter, so I wrote a long one instead.” Tight writing takes longer. A long letter may seem more comprehensive, but the fluff buries the point, and the reader is left with a less coherent message.
Same idea with data. Some people load as many variables as possible into a model, hoping it gives the most realistic view possible.
This presents the same problem as a long letter. Much of that extra data is noise rather than signal. The more of it you add, the greater the risk of overfitting: the model captures spurious relationships instead of meaningful ones.
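
To make the overfitting risk concrete, here is a minimal sketch in Python with NumPy. Everything in it is made up for illustration: the data is simulated, one variable truly drives the outcome, and 30 others are pure random noise. The sample sizes, the names `signal` and `noise_vars`, and the helper `fit_and_score` are my own assumptions, not taken from any real project. It fits an ordinary least-squares model twice, once with only the meaningful variable and once with everything thrown in, and compares how well each fits the data it was trained on versus fresh data.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit least squares with an intercept; return (train R^2, test R^2)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return r_squared(y_train, A_train @ coef), r_squared(y_test, A_test @ coef)


# Simulated data (illustrative assumption): one real driver, 30 junk columns.
rng = np.random.default_rng(0)
n_train, n_test, n_noise = 40, 1000, 30
n = n_train + n_test

signal = rng.normal(size=n)                 # the variable that matters
noise_vars = rng.normal(size=(n, n_noise))  # variables unrelated to y
y = 2.0 * signal + rng.normal(size=n)

X_few = signal.reshape(-1, 1)
X_many = np.column_stack([signal, noise_vars])

train, test = slice(0, n_train), slice(n_train, n)
few_train_r2, few_test_r2 = fit_and_score(
    X_few[train], y[train], X_few[test], y[test]
)
many_train_r2, many_test_r2 = fit_and_score(
    X_many[train], y[train], X_many[test], y[test]
)

print(f"1 variable:   train R^2 = {few_train_r2:.2f}, test R^2 = {few_test_r2:.2f}")
print(f"31 variables: train R^2 = {many_train_r2:.2f}, test R^2 = {many_test_r2:.2f}")
```

The bigger model always looks at least as good on the training data, because least squares can only improve its in-sample fit when given more columns. On the held-out rows it should do noticeably worse, since the extra coefficients were fitted to noise. That gap between training and test performance is the long letter of data work.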
