In a previous blog post I showed how to get over limitations of creating dummy datasets directly in Microsoft 365 Copilot by asking for a Python script instead, particularly the faker package.
Technology moves quickly, and since publishing that post, the Faker package has become available directly within the Python in Excel environment! In this post, I’d like to show you how to generate datasets directly in Excel using this package.
To follow along with this tutorial, download the exercise file below.
When to use Faker
Faker is especially useful when generating categorical data, such as locations, email addresses, or names. To get started with any Faker session, we’ll begin with a couple of lines of code:
from faker import Faker
imports theFaker
class from the Faker library.fake = Faker()
creates a new instance of theFaker
class and assigns it to the variablefake
.
This Faker instance can then be used to generate various types of fake data. If you’re not familiar with this vocabulary of classes and instances, they relate to Python’s use of object-oriented programming. However, you don’t need to worry too much about the theory for now—let’s dive into creating some fake data!
To do so, let’s get started by creating a fake name! The code fake_name = fake.name()
generates a fake name using the Faker
instance and assigns it to the variable fake_name
. We’ll then view the results in our Excel workbook.
We just created a single fake name, which can be useful, but you’ll likely need to generate more data than that for most tasks. You can generate multiple names at once using a Python feature called list comprehension, like so:
# Generate a list of 10 fake names | |
fake_names = [fake.name() for _ in range(10)] |
This code uses the same fake.name()
method as before to generate a new fake name for each iteration over the range(10)
, resulting in a list of 10 unique names:
So far, very cool! You might notice that the names I’m getting are completely different from yours and that your names keep changing every time you run the code. This can be a problem if you want to share your results or reuse the data later. To make the data reusable, we can stabilize the random number generation by setting a random seed. Once you set the seed, everything else in your Faker instance will consistently produce the same random numbers.
In the example below, my 10 names will remain the same after setting the random seed, but the one generated before setting the seed will continue to change. I usually set my random seed to 1234, but you can choose any integer—just be consistent.
Awesome work so far! Now, let’s get a bit more sophisticated. Faker can generate much more than just names. You can explore the thorough documentation to see all the possibilities. There are numerous categories, and it can even set the locale of the random data to make it more location-appropriate. For example, you can generate phone numbers formatted for the United States versus those you’d find in India.
Generating a DataFrame of fake data
Let’s take our data generation a step further. This time, I want to generate a list of both names and emails. Now, we’re entering the realm of data that makes sense to organize into rows and columns. This sounds like a job for Pandas!
What I’m going to do is create a dictionary in Python, which will be a great way to hold the data before we shape it into a Pandas DataFrame. This will give us a typical two-dimensional dataset that Excel users will find familiar:
# Generate fake data for the DataFrame | |
data = { | |
'Name': [fake.name() for _ in range(10)], | |
'Email': [fake.email() for _ in range(10)] | |
} | |
df = pd.DataFrame(data) |
This approach is pretty cool, but we might want to ensure the data is more internally consistent by aligning the email addresses with the corresponding names. To achieve this, we can start by generating the person’s name and then convert that name into an email address in lowercase, using @example.com. Once these elements are created, we can directly insert both lists into a DataFrame.
# Generate names | |
names = [fake.name() for _ in range(10)] | |
# Create email addresses based on names | |
emails = [f"{name.replace(' ', '.').lower()}@example.com" for name in names] | |
# Create the DataFrame directly from the lists | |
df = pd.DataFrame({ | |
'Name': names, | |
'Email': emails | |
}) | |
df |
You’ll see that the results in our workbook are much more consistent but perhaps overly simplified.
For example, you might want these email addresses to use more realistic and varied domains rather than just example.com. You’ll also notice that the first entry results in a bogus email address—since the “Mrs.” in the name wasn’t stripped, it created an odd-looking email address. These issues can definitely be fixed with more advanced techniques, but we’ll leave it as is for now. Again, the point of this exercise is to generate data that’s close enough to realistic. After all, this is just fake data.
Awesome work! If you plan to continue generating this specific set of fake data, it’s a great idea to package it into a function. This approach allows you to easily call the function later in your workbook to generate exactly the number of rows you need. To make this function self-contained, I’ve ensured that everything necessary for it to work is defined internally. You’ll notice that I’m setting up a standalone instance and seed for Faker directly within the function:
This code is reproduced below:
def generate_fake_names(num_entries): | |
fake = Faker() | |
Faker.seed(1234) | |
# Generate names | |
names = [fake.name() for _ in range(num_entries)] | |
# Create email addresses based on names | |
emails = [f"{name.replace(' ', '.').lower()}@example.com" for name in names] | |
# Create the DataFrame directly from the lists | |
df = pd.DataFrame({ | |
'Name': names, | |
'Email': emails | |
}) | |
return df | |
# Example usage | |
df = generate_fake_names(5) | |
df |
Generating quantitative data with Faker
The Faker package can generate a wide range of categories and handle tasks involving quantities. For example, if I wanted to create a function to randomly generate a DataFrame with sales categories and a sales quantity variable, with numbers evenly distributed between 1 and 100, my code would look like this. Notice that I’ll create a custom list of sales categories to randomize from, using Python’s random
package instead of Faker. This means I need to set the random seed for both Faker and random
since the randomizations are coming from two different sources.
from faker import Faker | |
import pandas as pd | |
import random | |
# Function to generate data | |
def generate_data(n, seed=1234): | |
# Set seeds for reproducibility | |
random.seed(seed) | |
Faker.seed(seed) | |
# Create Faker instance | |
fake = Faker() | |
# List of realistic business categories | |
business_categories = [ | |
'Electronics', 'Furniture', 'Groceries', 'Clothing', 'Beauty Products', | |
'Books', 'Toys', 'Automotive', 'Home Improvement', 'Sports Equipment' | |
] | |
data = {'Category': [random.choice(business_categories) for _ in range(n)], | |
'Quantity': [fake.random_int(min=1, max=100) for _ in range(n)]} | |
return pd.DataFrame(data) | |
# Generate data for 2 entries | |
df = generate_data(2) | |
df |
However, if you’re looking to generate more complex statistical distributions or need greater control over the statistical properties of the data (such as the mean, variance, and so forth), you might consider using other packages specifically designed for statistical simulations. In particular, the numpy package is excellent for this. Keep in mind that if you use a different package than Faker, you’ll want to set the random seed for it as well, as the random generators used will be different.
For example, let’s generate the same dataset of categories and quantities, but this time we’ll use numpy to create a variable that is normally distributed with a mean of 50 and a standard deviation of 10. At this point, we’re not even using Faker anymore because we’ve created our own custom list of business categories tailored to our specific use cases. Additionally, we’re working with more specific variable distributions, which Faker isn’t really designed to handle.
This demonstrates that while Faker is excellent for generating categorical data, it’s not the ultimate tool for all data generation needs in Python.
import random | |
# Function to generate data | |
def generate_data(n, seed=1234): | |
# Set seeds for reproducibility | |
random.seed(seed) | |
np.random.seed(seed) | |
# List of realistic business categories | |
business_categories = [ | |
'Electronics', 'Furniture', 'Groceries', 'Clothing', 'Beauty Products', | |
'Books', 'Toys', 'Automotive', 'Home Improvement', 'Sports Equipment' | |
] | |
data = {'Category': [random.choice(business_categories) for _ in range(n)], | |
'Quantity': np.random.normal(loc=50, scale=10, size=n).astype(int)} | |
return pd.DataFrame(data) | |
# Generate data for 2 entries | |
df = generate_data(2) | |
df |
Getting help with Copilot
Because there are so many types of variables you might want to create, and Faker is just one tool in the toolkit, a great way to expand your range of possibilities is by using Copilot to have generative AI write the script for you. For example, here’s a prompt you can give to Copilot in Excel to generate a dataset. Unfortunately, this feature can be a bit hit or miss right now in Copilot for Excel. If it doesn’t work, I’d suggest trying it in Copilot for 365, as shown in the post shared at the beginning of this post. But here’s a prompt you can try in Copilot in Excel:
Create a Python script to give me a dataset for an organization’s employee performance review for 500 employees with the following details:
- Columns: Employee ID, Department, Performance Score, Salary, Years of Experience Performance
- Score: Normally distributed with a mean of 70 and a standard deviation of 10
- Salary: Log-normally distributed with a mean of $50,000 and a standard deviation of $15,000
- Years of Experience: Exponentially distributed with a lambda of 0.1
- Department: Randomly chosen from ‘Sales’, ‘HR’, ‘IT’, ‘Marketing’, ‘Finance’
- Random seed: Set to 1234
- Faker seed instance: Set to 1234
Keep in mind that Copilot requires data to be stored in a table format to be used effectively. To work around this, I created a fake table as a workaround to run the prompt and achieve the desired result.
Copilot does provide some results, which are displayed in full in the Gist below.
# Set the random seed for reproducibility | |
np.random.seed(1234) | |
fake = Faker() | |
Faker.seed(1234) | |
# Define the number of employees | |
num_employees = 500 | |
# Generate Employee IDs | |
employee_ids = [fake.unique.random_int(min=1000, max=9999) for _ in range(num_employees)] | |
# Generate Departments | |
departments = np.random.choice(['Sales', 'HR', 'IT', 'Marketing', 'Finance'], size=num_employees) | |
# Generate Performance Scores | |
performance_scores = np.random.normal(loc=70, scale=10, size=num_employees) | |
# Generate Salaries | |
log_mean_salary = np.log(50000) | |
log_std_dev_salary = np.log(15000) | |
salaries = np.random.lognormal(mean=log_mean_salary, sigma=log_std_dev_salary, size=num_employees) | |
# Generate Years of Experience | |
years_of_experience = np.random.exponential(scale=1/0.1, size=num_employees) | |
# Create the DataFrame | |
df = pd.DataFrame({ | |
'Employee ID': employee_ids, | |
'Department': departments, | |
'Performance Score': performance_scores, | |
'Salary': salaries, | |
'Years of Experience': years_of_experience | |
}) |
Conclusion
In conclusion, the Faker package in Python is a powerful and flexible tool for generating fake data that can be seamlessly integrated into Excel. It’s an ideal solution for Excel analysts and data enthusiasts who need to test and validate their spreadsheets. Whether you’re preparing data for demonstrations, training sessions, or simply ensuring the robustness of your Excel applications, Faker offers a straightforward way to create realistic datasets on the fly. However, it’s definitely not the only tool you need to know for generating realistic datasets. Be aware of NumPy, particularly for quantitative data, and don’t hesitate to use Copilot for creating more complex generation scripts.
What questions do you have about the Faker package specifically, or Python in Excel more generally? Let me know in the comments.
Leave a Reply