A tabular dataset is arranged in two dimensions — what in Excel we would call rows and columns, but a fancier statistical way of saying this is observations and variables.
“Observations” are so called because those are the things being observed (makes sense, right?). Our dataset is measuring housing prices and other features. We can tell from the index column that each of these rows is a unique observation, probably an individual house.
Then we have the “variables” along the columns. These are what we measure from each observation. And the whole idea with variables is that they are free to vary among observations…
This seems pretty obvious, but what would happen if variables didn’t vary? To find out, head to your downloaded file. We’re going to try to answer a couple of questions about the data.
“Are homes in Ames more likely to have a driveway than those in Davenport?”
First off, let’s say we want to determine whether homes in Ames are more likely than homes in Davenport to have a driveway. Maybe you’re doing an ecological study of asphalt surface cover or planning a public transit campaign.
These are two binary variables, so the appropriate statistical test is a chi-square test of independence, and to do that we need to have our data summarized as a two-way frequency table. We’ll do that with the help of PivotTables. I will summarize the Count of id with driveway along the columns and neighborhood along the rows.
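If you’d rather script this step than point and click, here is a minimal sketch in Python with pandas and SciPy. The file name housing.csv and the column names driveway and neighborhood are assumptions standing in for whatever your downloaded file actually uses.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical file and column names -- adjust to match your download
homes = pd.read_csv("housing.csv")

# Two-way frequency table: neighborhoods down the rows, driveway yes/no across the columns
table = pd.crosstab(homes["neighborhood"], homes["driveway"])
print(table)

# Chi-square test of independence on that contingency table
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.3f}")
```

Of course, with the sample file in this post the driveway column only ever takes one value, so the two-way table collapses to a single column and there is nothing for the test to compare, which is exactly the dead end we’re about to hit in the PivotTable.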
Can I check which neighborhood is more likely to have homes with a driveway when every single home has one? I don’t think so. Every home displayed here has a driveway. The variable isn’t varying, so we really don’t have much to analyze.
And even if we did have some homes without driveways, did you notice that there are only four homes in Ames? Any conclusion drawn from such a small amount of data is unlikely to be reliable.
“Is there an influence of lot size on price?”
Let’s try another example. Maybe we are analysts at a bank and want to evaluate home sale prices. We could check for the influence of lot size on price. These are two continuous variables, so deriving a correlation coefficient is the right move. However, correlation is only a valid measure when there is a linear relationship between the variables.
So let’s plot the x-y relationship of these variables with a scatterplot to see whether it appears a line could summarize the general trend. Select the range of data in columns B and C to get started:
Well, it looks like there is a line summarizing the relationship… but it’s a perfectly vertical line. That means every house has the same lot size, so the plot tells us nothing about how sale prices relate to it.
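If you’d rather reproduce that chart in code, here is a quick sketch with pandas and matplotlib. Again, the file name and the lotsize and price column names are assumptions; swap in whatever your dataset uses.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names -- adjust to match your download
homes = pd.read_csv("housing.csv")

# Scatterplot of lot size (x) against sale price (y)
plt.scatter(homes["lotsize"], homes["price"])
plt.xlabel("Lot size (square feet)")
plt.ylabel("Sale price")
plt.title("Sale price versus lot size")
plt.show()
```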
In fact, if you try to derive the correlation here using Excel’s CORREL() function, you’ll get an error. That’s because the formula for the coefficient divides by the standard deviation of each variable, and the standard deviation of lot size here is zero.
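The same thing happens outside of Excel. Here is a tiny sketch with pandas, using made-up prices and a constant 6,000-square-foot lot size: the correlation comes back as NaN because the denominator of the Pearson formula, the product of the two standard deviations, is zero.

```python
import pandas as pd

# Toy data for illustration: lot size never varies, prices do
homes = pd.DataFrame({
    "lotsize": [6000, 6000, 6000, 6000, 6000],
    "price":   [155000, 162500, 148000, 171000, 158900],
})

# Pearson correlation divides the covariance by the product of the two
# standard deviations; lotsize's standard deviation is zero, so the
# result is NaN rather than a usable coefficient.
print(homes["lotsize"].corr(homes["price"]))  # nan
```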
How do we get around this?
So as you’re seeing, data analysis tends to break down when variables don’t actually vary. As a rule of thumb, we’d say a variable doesn’t really vary when more than 90% of observations take on the same value.
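If you want a quick way to screen a dataset for this, here is one possible helper in Python with pandas. It flags any column where the single most common value accounts for more than 90% of observations; the housing.csv file name is, as before, just a placeholder.

```python
import pandas as pd

def barely_varies(series: pd.Series, threshold: float = 0.90) -> bool:
    """Return True if the most common value covers more than `threshold` of rows."""
    top_share = series.value_counts(normalize=True).iloc[0]
    return top_share > threshold

# Hypothetical file name -- swap in your own data
homes = pd.read_csv("housing.csv")
for column in homes.columns:
    if barely_varies(homes[column]):
        print(f"{column} barely varies -- analyze with caution")
```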
Are there ways around this impasse? Here are a few steps to take. Implementing these rigorously can be rather advanced, so we won’t go beyond quick sketches here, but they are good to understand in the meantime.
Collect more data
As our sample stands, we only have four homes recorded in Ames, no homes without a driveway, every home with a lot size of 6,000 square feet, and so forth. There are likely more than four homes in Ames, so perhaps you could beef up the count there with more sampling.
But what if you find out that, by law, every home you sample will have a driveway and a lot size of 6,000 square feet? No amount of additional sampling is going to change the fact that these variables do not vary. (Silly example, but there are real-life cases where, de facto or de jure, there is really no variance in a variable.)
Weight the uncommon data
Another approach: we know that some of our data is disproportionately uncommon, such as homes in Ames. There are methods to weight very small groups so that they have a more robust showing in the overall analysis. This is commonly done in polls to ensure that underrepresented groups are represented as part of the bigger picture.
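To make that idea a bit more concrete, here is a minimal sketch of one weighting scheme in Python with pandas. The population shares are made-up numbers purely for illustration, and the file and column names are assumptions.

```python
import pandas as pd

# Hypothetical file and column names -- adjust to match your data
homes = pd.read_csv("housing.csv")

# Made-up "true" shares of homes in each city, for illustration only
population_share = {"Ames": 0.40, "Davenport": 0.60}

# Shares actually observed in the sample
sample_share = homes["neighborhood"].value_counts(normalize=True)

# Weight each row by (population share / sample share) so that the
# underrepresented group counts for more in aggregate statistics
homes["weight"] = homes["neighborhood"].map(
    lambda city: population_share[city] / sample_share[city]
)

# Compare a weighted average price to the naive unweighted one
weighted_mean = (homes["price"] * homes["weight"]).sum() / homes["weight"].sum()
print(weighted_mean, homes["price"].mean())
```

Note that weighting only re-balances the groups you do have; it can’t conjure variation that was never measured, which is why collecting more data is still the first resort.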
More computing power
Finally, it’s possible to use raw computing power to overcome the problem of variables not really varying. For example, think about how often someone defaults on a credit card. It’s not super common, right? Indeed, it probably happens less than 10% of the time, which means that by our 90/10 rule of thumb this variable barely varies at all.
However, through machine learning methods we are able to overcome what in this field is called “imbalanced data.” To learn more, check out these ideas from Google. They know a thing or two about machine learning.
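If you’re curious what tackling imbalanced data can look like in practice, here is one common approach sketched in Python with scikit-learn: class weighting, which tells the model to pay proportionally more attention to the rare class. The credit.csv file, the defaulted column, and the assumption that the other columns are numeric features are all hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: `defaulted` is 1 for the rare default cases,
# and the remaining columns are numeric features
credit = pd.read_csv("credit.csv")
X = credit.drop(columns="defaulted")
y = credit["defaulted"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" upweights the rare class so the model
# isn't rewarded for simply predicting "no default" every time
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Resampling techniques, such as oversampling the rare class or undersampling the common one, are another popular route to the same goal.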
This exercise may seem silly and untethered to reality. Of course variables vary! But actually, there are many times in real life where variables may not vary, and you as the analyst need to understand what that means and what to do about it.
Have you encountered this in your data before? What do you make of the exercises? Did you enjoy the tautology? Let me know in the comments.