Upsampling time series data involves increasing the frequency of your data points by filling in gaps to transform a lower-frequency dataset into a higher-frequency one. This process is often necessary when aligning datasets with different frequencies or when a more granular view of the data is required.
For instance, in financial analysis, you might have daily stock prices but need minute-by-minute data for detailed analysis or simulation. Similarly, in IoT and sensor data applications, aligning data from devices reporting at different intervals necessitates upsampling to ensure all data points have consistent timestamps.
Despite its importance, upsampling can be complicated because it creates time points for which there may not be any data. This challenge requires careful handling to avoid introducing inaccuracies.
Using Pandas within Excel allows for seamless integration of powerful time series manipulation capabilities, enhancing Excel’s analytical potential while leveraging Python’s robust data handling to make upsampling both efficient and straightforward.
To follow along with the examples and exercises, download the exercise file below. In this example, we aim to convert monthly consumer confidence index (CCI) figures into weekly data.
To get started, I’ll read the dataset into Pandas and ensure the date column is correctly formatted as a date data type. The next important step is to set the index of the cci_df
DataFrame to the date
column. From here, I can resample the data from monthly to weekly, taking the mean:
cci_df = xl("cci[#All]", headers=True)
cci_df['date'] = pd.to_datetime(cci_df['date'])
cci_df = cci_df.set_index('date')
cci_df.resample('W').mean()
These results are not going to be very satisfactory. There aren’t any time periods to calculate the mean, as we are moving from more to fewer data points. Consequently, we end up with a numeric error in Excel.
Typical aggregation methods aren’t going to work here, so we’ll need to make some adjustments. This can be done in various ways. The simplest method is to fill in the data either forward or backward using ffill()
and bfill()
respectively. This data is arguably the most accurate since it comes from the actual dataset, but it doesn’t do much to smooth out variances between the time periods:
cci_df.resample('W').ffill()
cci_df.resample('W').bfill()
Next, let’s try interpolating these missing values. While ffill()
(forward fill) fills missing values with the last observed value and bfill()
(backward fill) fills them with the next observed value, interpolate()
estimates missing values by computing intermediate values based on surrounding data points. This results in a smoother transition between existing values, making it useful for maintaining a more natural data trend compared to the constant values filled by ffill()
and bfill()
. The syntax is very similar:
cci_df.resample('W').interpolate()
The interpolate()
method offers different interpolation methods like linear, cubic, and quadratic, each providing a unique way of estimating missing values.
Linear interpolation connects missing data points with straight lines, creating a direct, simple transition between known values. Cubic interpolation, on the other hand, fits a cubic polynomial to the data, resulting in smoother, more curved transitions that better capture the underlying trend of the data. Quadratic interpolation fits a second-degree polynomial, offering a middle ground between linear and cubic methods, providing smoother transitions than linear but not as complex as cubic.
These methods allow users to choose the most appropriate way to fill gaps based on the nature of their data and the desired level of smoothness.
Let’s put the basic forward fill to the test with these interpolation methods! I am going to create a blank DataFrame, apply all of them to the DataFrame, and plot the results:
cci_df_interpolate = pd.DataFrame()
cci_df_interpolate['ffill'] = cci_df.resample('W').ffill()
cci_df_interpolate['linear'] = cci_df.resample('W').interpolate(method='linear')
cci_df_interpolate['quadratic'] = cci_df.resample('W').interpolate(method='quadratic')
cci_df_interpolate['cubic'] = cci_df.resample('W').interpolate(method='cubic')
cci_df_interpolate.plot()
plt.title('Comparing upsample missing value fill methods')
Let’s briefly interpret this plot before closing:
- Forward fill maintains the last known value until a new data point is encountered, resulting in a step-like pattern with sharp horizontal segments and sudden vertical jumps.
- Linear interpolation fills missing values by drawing straight lines between known data points, creating piecewise linear segments that assume changes between data points are linear, leading to abrupt changes in slope at each data point.
- Quadratic interpolation results in smoother curves that better capture the underlying curvature and non-linear trends of the data. This method offers a more nuanced estimation of missing values, creating smoother transitions between points.
- Cubic interpolation provides the smoothest and most flexible line, closely fitting the data points and adapting to changes with high accuracy. This method is ideal for data with complex trends, as it captures detailed patterns more effectively than linear or quadratic interpolation.
The choice of method to fill missing values depends on the data’s characteristics and the desired level of accuracy. For data with constant or slowly changing values, forward or backward fill is simple and effective. Linear interpolation works well for linearly changing data. For non-linear trends, quadratic or cubic interpolation may be more suitable. Quadratic interpolation captures basic curvature with smooth transitions, while cubic interpolation provides high accuracy for complex, fluctuating data by fitting detailed patterns. Consider your data’s trends and complexity to ensure the filled values accurately reflect its behavior.
For this dataset, which exhibits both stable and volatile periods, using a mix of methods to fill in missing values might be most effective. This approach captures the data’s underlying curvature and provides more accurate estimates during volatile periods, such as the fluctuations observed during the COVID-19 pandemic.
This hybrid approach can be applied to other datasets with varying trends, enhancing the accuracy and reliability of the analysis.
Have questions about upsampling and filling in missing values? Let me know in the comments. If you’re interested in getting your team started with Python for financial data analysis, check out my corporate training services.
Leave a Reply