Feature engineering involves transforming raw data into meaningful variables to enhance machine learning models. It includes creating new features, encoding categorical data, and scaling or normalizing values—tasks that significantly boost model accuracy and insights. This practice naturally overlaps with data cleaning, as both involve handling missing values, inconsistencies, and outliers. Feature engineering also aligns with exploratory data analysis (EDA), since insights from EDA often guide effective feature creation.
For Excel users, mastering feature engineering expands the potential for more sophisticated analysis and predictive modeling. While Excel’s Power Query provides powerful data transformation tools, certain advanced tasks, especially those involving complex statistical rules or extensive group-wise transformations, are much easier and more efficient using Python and Copilot, enabling streamlined and scalable operations beyond standard Excel workflows.
In this post, you’ll explore advanced feature engineering techniques with the Palmer Penguins dataset. Download the exercise file below to follow along:
 
Dummy-coding categorical variables
Categorical variables like species or island are common in datasets but can’t be directly interpreted by most predictive models. Typically, these categories need conversion into numeric dummy variables, representing each distinct category as its own binary column. We began our analysis using the following prompt:
“Convert the categorical variables species, island, and sex to numerical dummy variables suitable for modeling.”

Copilot quickly generated dummy variables, clearly converting each categorical variable into binary columns for each category (e.g., species_Adelie, island_Biscoe, sex_Male). This numerical format allows our data to be effectively used by predictive models and machine learning algorithms.
Binning quantitative variables
Sometimes numeric values like body mass provide more insight when grouped into meaningful categories rather than considered individually. To efficiently address this, we gave Copilot the following prompt:
“Bin body_mass_g into three categories (light, medium, heavy) based on quantiles, and add this as a categorical feature.”

Copilot quickly responded by categorizing penguin body mass into three intuitive groups—light, medium, and heavy—based on the underlying quantiles of the data. The output clearly indicates the category each penguin falls into, turning numeric complexity into easy-to-understand categorical labels.
By creating quantile-based categories, analysts can quickly identify patterns, make clearer comparisons across groups, and feed simplified, meaningful variables into predictive models—all achieved effortlessly through Python and Copilot integration directly within Excel.
Creating group-wise statistical measures
Numeric summaries grouped by categories, such as median values by species, are frequently essential features for deeper data insights. To efficiently create these features, we provided Copilot with this prompt:
“Calculate the median body mass by species and add this as a new numeric feature to each row of the dataset.“

Copilot quickly calculated the median body mass for each penguin species and added these values directly into each row as a new numeric feature. This new column allows us to easily identify how an individual penguin compares against the typical body mass of its species, highlighting significant deviations or confirming typical measurements.
Standardizing variables with Z-score scaling
Numeric features in datasets often vary greatly in scale, making them challenging to compare directly. For instance, a penguin’s body mass (measured in grams) naturally has a larger numeric scale than its bill length (measured in millimeters). This discrepancy can distort analyses, especially when performing predictive modeling or clustering.
To address this issue clearly and efficiently, we gave Copilot the following prompt:
“Apply standardization (Z-score scaling) to the numeric columns bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g using Python.”

Copilot quickly applied Z-score scaling to our selected numeric features. This transformation converts each numeric value into a standardized score representing how many standard deviations it is from the feature’s average. A standardized value near 0 indicates the measurement is close to the average, while positive or negative values reflect deviations above or below the average, respectively.
Standardization makes these numeric features directly comparable, allowing each to contribute equally and meaningfully to subsequent analyses.
Advanced missing value imputation
Dealing with missing values is a common challenge in data analysis. Excel and Power Query offer basic options for handling missing data, but these tools fall short when it comes to advanced predictive imputation, especially when we want to use similar, complete observations to estimate missing values.
To effectively overcome this limitation, we provided Copilot with the following prompt:
“Perform advanced imputation by predicting missing values in each row based on the most similar complete observations in the dataset.”

Copilot quickly applied the K-Nearest Neighbors (KNN) imputation method, filling in missing numeric values by finding the most similar penguins in the dataset and using their known measurements to predict missing data. The resulting dataset now has complete observations for each numeric feature, providing a robust foundation for further modeling and analysis.
Check out this post for a little more on k-nearest neighbors with Copilot, Python and Excel:
Conclusion
Feature engineering is foundational for achieving robust, predictive analytics. By embracing Python and Copilot alongside Excel, you’re positioned to execute sophisticated transformations that previously required significant manual effort or were simply unattainable with basic Excel or Power Query techniques alone.
Yet, while Python-powered feature engineering significantly expands your analytical capabilities, it’s essential to remain mindful of potential limitations. Advanced methods require clear understanding and interpretation to avoid unintended biases or misrepresentations in your data. Additionally, ensuring seamless integration between Python-driven transformations and traditional Excel workflows will require thoughtful structuring and documentation.
It is crucial to deeply engage with your data through exploratory and visual methods to verify that the engineered features truly address your analytical needs. Feature engineering inherently involves substantial trial and error. Although Copilot significantly accelerates experimentation and automates repetitive tasks, its effectiveness hinges upon clear and accurate guidance. Without properly defined parameters, Copilot might inadvertently speed up the creation of irrelevant or misleading features.
Moving forward, consider experimenting with these techniques on your own datasets to identify unique opportunities for improvement. Challenge yourself to combine the intuitive visualizations of Excel with the scalable computational power of Python, leveraging Copilot’s AI assistance to bridge gaps and streamline your processes. As your comfort grows, you’ll uncover new strategies to elevate your analyses, enabling more impactful and data-driven decision-making.

Leave a Reply