Data analysts should be proactive about missing values in their data, as these can significantly impact analysis outcomes. First, many algorithms, including linear and logistic regression and most decision tree implementations, require complete datasets and cannot inherently process missing data.
But beyond statistical modeling and machine learning, missing values might also indicate issues in the data collection process, potentially introducing biases or flaws that skew results and lead to incorrect conclusions.
Excel lacks advanced handling of missing values, with no built-in null value system like SQL's. Power Query in Excel offers improved management by recognizing missing values as null and providing tools to profile and identify their prevalence. However, it provides limited support for visually inspecting and correcting these missing values.
Integrating Python, especially with the Pandas library, into Excel enhances missing data management by offering advanced imputation methods and visualization tools, thereby improving Excel’s capability to handle incomplete datasets.
To see this in action, follow along in the included exercise file, which uses a penguin dataset with missing values to demonstrate how Pandas can improve the analysis of missing data within Excel.
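If you're working outside the exercise file and just want to experiment, seaborn ships a copy of the penguins dataset, missing values included; here's a minimal sketch of loading it as a stand-in (in the workbook itself, `penguins_df` would come from the Excel table instead):

```python
import seaborn as sns

# Stand-in for the exercise file: seaborn bundles the penguins
# dataset, complete with missing values.
penguins_df = sns.load_dataset("penguins")
penguins_df.head()
```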
Once the `penguins_df` DataFrame is set up, a straightforward next step is to tally the missing values in each column. If you've dabbled in computer science, you might know that `True` and `False` values are actually evaluated as 1s and 0s. With this so-called coercion, we can sum up all instances where `na`, or missing, is set to `True` (or 1). This provides a count of missing observations for each column:
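One common way to do this in Pandas is to chain `isna()` with `sum()`; a minimal sketch:

```python
# isna() returns a True/False DataFrame; sum() coerces True to 1
# and False to 0, giving a missing-value count per column.
penguins_df.isna().sum()
```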
Knowing the raw counts of missing values is useful, but understanding the relative percentage of missing values in each column can provide more context, highlighting how prevalent missing values are across different columns. By dividing the number of missing values in each column by the total number of rows in the DataFrame, found with `len()`, you can calculate these percentages. This analysis reveals that all columns have fewer than 5% missing values, with the `sex` column being the only one that exceeds 1%.
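A minimal sketch of that calculation, multiplying by 100 to express the result as a percentage:

```python
# Per-column missing counts divided by total rows, as percentages.
(penguins_df.isna().sum() / len(penguins_df)) * 100
```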
Let’s explore another method to evaluate missing values by setting up a visualization in Python.
I’ll create a bar chart that displays all columns in the DataFrame where the number of missing values is greater than zero. This visual representation provides a quick and clear comparison of the relative impact of missing values across columns.
For a clearer view, I’ve placed the code in a separate Gist below along with the resulting chart. You can access the final results in Excel with the exercise file.
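If you'd rather see the shape of the code inline, here's a minimal sketch of one way to build that chart with Pandas and Matplotlib (the variable name `missing_counts` is illustrative, not from the Gist):

```python
import matplotlib.pyplot as plt

# Keep only the columns that actually have missing values.
missing_counts = penguins_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0]

# Bar chart comparing missing-value counts across those columns.
missing_counts.plot.bar()
plt.ylabel("Number of missing values")
plt.title("Missing values by column")
plt.show()
```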
Another interesting method to visualize missing values is through a heatmap. This approach is particularly useful if you’re looking for correlations or patterns among missing values in your data. For instance, if one variable tends to be missing alongside another, it could indicate a deeper issue in the data collection process. I’ll use Seaborn for this purpose.
With this plot, we can get a clear visual representation of where missing values are located within the overall grid of our data:
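A minimal sketch of that heatmap, assuming the `penguins_df` from earlier; passing the boolean output of `isna()` straight to Seaborn shades each missing cell:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Each True cell marks a missing value; the heatmap shades those
# cells so patterns across rows and columns stand out.
sns.heatmap(penguins_df.isna(), cbar=False)
plt.title("Missing value locations")
plt.show()
```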
Now that we’ve explored some methods to summarize and visualize the data, let’s consider our next steps. Fortunately, none of the columns have significant amounts of missing values. I typically use 3-5% of the total as a guideline. Beyond this threshold, missing values across columns can lead to a substantial reduction in your dataset and could significantly skew your results if you need to drop or impute them. The best solution is always prevention. If possible, revisit your data collection source to correct and prevent future issues. However, I recognize that we often operate in the real world of data, where changes can be challenging and time constraints are common. So, let’s explore some quick fixes.
First, we’ll look at imputing the data. This involves using a summary statistic to fill in missing values. For a quantitative variable like `bill_length_mm`, I’ll use the median to fill the blanks. For a categorical variable like `sex`, I’ll use the mode, which is the most frequently occurring value, if it exists.
It’s important to note that imputing missing values is a delicate matter. There are many differing opinions on how best to handle it, and it can become quite complex. To that end, I’ll create new imputed columns so we can compare them with the original ones and determine if the adjustments are acceptable.
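Here's a minimal sketch of that approach using `fillna()`; the `_imputed` column names are illustrative:

```python
# New columns leave the originals intact for comparison.
penguins_df["bill_length_mm_imputed"] = penguins_df["bill_length_mm"].fillna(
    penguins_df["bill_length_mm"].median()
)

# mode() returns a Series (there can be ties); take the first value.
penguins_df["sex_imputed"] = penguins_df["sex"].fillna(
    penguins_df["sex"].mode().iloc[0]
)
```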
Another option is simply to drop rows with missing values. This is particularly viable when you have few missing values but still need to exclude them for statistical or presentation purposes.
It’s important to check how many rows you actually lose by doing this. Keep in mind that if you drop a row based on one missing value in a column, you’re also discarding all other data in that row, which could be valuable. So, exercise caution with this approach.
You can accomplish this using the `dropna()` method:
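A minimal sketch, including the row-count check mentioned above (`penguins_dropped` is an illustrative name):

```python
# Compare row counts before and after to see how much data is lost.
before = len(penguins_df)
penguins_dropped = penguins_df.dropna()
after = len(penguins_dropped)
print(f"Dropped {before - after} of {before} rows")
```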
What questions do you have about analyzing missing values, or about Python in Excel more broadly? I hope you’re discovering just how easy, flexible, and enjoyable it is to analyze, visualize, manipulate, and perform various operations on your data with Python, even if missing values aren’t a concern in your work.
If you’re just getting started with Python in Excel, it’s a good idea to understand how the Python language functions outside of the Excel environment. For that, you can check out my book, Advancing into Analytics.