Top 6 Tips for Simulating Realistic Data with ChatGPT and Python

I recently set out to create a mock dataset for a BI Sprint Series project I’m working on. I figured the process would be quick. After all, I was using a combination of ChatGPT and Python.

Boy, was I wrong!

It was incredibly tedious and long, especially since this was my first time attempting data simulation. The good news is that I learned some valuable lessons through trial, error, and a lot of prompting.

Here are some things I wish I knew before getting started:

1. Be Painfully Specific in Your Prompt

The quality of your AI-generated data depends heavily on how specific and structured your prompt is. The more data or information you provide, the more accurate and realistic the output will be.

Instead of:

“Generate some subscription data with random values.”

Try:

Simulate 100 subscription records for a SaaS platform. Each record should include a unique SubscriptionID, a realistic company name, a StartDate within the past two years, and a SubscriptionPlan (Basic, Pro, or Enterprise). Monthly fees should align with the plan tier, and Enterprise plans should be more common among business users.

Why it works:
This level of detail gives AI a framework to follow. You’re not just asking for data – you’re defining structure, setting constraints, and introducing probabilities. That’s what turns randomness into realism.
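A specific prompt also gives you something concrete to check the output against. The sketch below is one way to validate generated records against the constraints in the prompt above. The column names CompanyName and MonthlyFee, the fee amounts, and the helper name validate_subscriptions are illustrative assumptions, not part of the prompt itself.

```python
from datetime import date, timedelta

# Assumed fee per plan tier; swap in whatever your prompt specified.
EXPECTED_FEES = {"Basic": 49, "Pro": 99, "Enterprise": 249}


def validate_subscriptions(records, today):
    """Return a list of human-readable problems found in the records."""
    problems = []
    seen_ids = set()
    earliest = today - timedelta(days=730)  # "within the past two years"
    for row in records:
        sub_id = row["SubscriptionID"]
        if sub_id in seen_ids:
            problems.append(f"{sub_id}: duplicate SubscriptionID")
        seen_ids.add(sub_id)
        if not earliest <= row["StartDate"] <= today:
            problems.append(f"{sub_id}: StartDate outside the two-year window")
        plan = row["SubscriptionPlan"]
        if plan not in EXPECTED_FEES:
            problems.append(f"{sub_id}: unknown plan {plan!r}")
        elif row["MonthlyFee"] != EXPECTED_FEES[plan]:
            problems.append(f"{sub_id}: fee does not match plan {plan}")
    return problems


# Two hand-written rows: the first is valid, the second has a mismatched fee.
sample = [
    {"SubscriptionID": "S001", "CompanyName": "Harbor Hotels",
     "StartDate": date(2024, 3, 15), "SubscriptionPlan": "Pro", "MonthlyFee": 99},
    {"SubscriptionID": "S002", "CompanyName": "Cedar Inns",
     "StartDate": date(2024, 6, 1), "SubscriptionPlan": "Basic", "MonthlyFee": 99},
]
problems = validate_subscriptions(sample, today=date(2025, 1, 1))
print(problems)
```

Passing today in explicitly, rather than calling date.today() inside the function, keeps the check reproducible when you rerun it later.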

2. Leverage ChatGPT to Refine Your Schema

Once you’ve outlined your key metrics and objectives, use ChatGPT to help you fill in the gaps in your table design. You likely know the core business questions you’re answering, but ChatGPT can help ensure your schema supports them effectively and realistically.

Prompt Example:

Design a normalized database schema for a company that provides HR SaaS to the hospitality industry. I need a sales transactions fact table and several dimension tables (e.g., customers, locations, dates, subscription plans, etc.). This will be used to simulate realistic sales data for analysis. Please suggest the tables, columns, data types, and relationships between them.

Why it works:
Instead of guessing every field, you’re collaborating with the model to stress-test your schema, ensuring it aligns with your analytical goals while maintaining realism. It’s like having a second brain to validate and enhance your structure.

3. Build Dimension Tables First

Always start with dimension tables when simulating relational data. They hold the descriptive attributes of your entities (like customers or plans) and provide the context and structure that the fact tables rely on.

If you simulate the fact table before the dimension tables, you risk ending up with mismatched or meaningless values. The data will look disjointed and fail basic integrity checks (e.g., a transaction referencing a non-existent plan).

✅ Workflow Example:

Define the dimension tables. In my case, they included customers, subscription plans, locations, and dates.
Simulate them with unique IDs and a consistent structure.
Then generate the fact tables (sales and subscription). Remember to reference the IDs from your dimension tables and use consistent rules to assign things like plan and customer combinations.

🧪 Illustration in Python:
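A minimal sketch of this workflow, using only the standard library, might look like the following. The table names, column names, fees, and row counts are illustrative assumptions, not the actual project schema.

```python
import random
from datetime import date, timedelta

rng = random.Random(42)  # fixed seed so the simulation is reproducible

# Step 1: dimension tables, each row keyed by a unique ID.
# Plan names come from the tips above; the fees are placeholder values.
dim_plans = [
    {"plan_id": 1, "plan_name": "Basic", "monthly_fee": 49},
    {"plan_id": 2, "plan_name": "Pro", "monthly_fee": 99},
    {"plan_id": 3, "plan_name": "Enterprise", "monthly_fee": 249},
]
dim_customers = [
    {"customer_id": i, "customer_name": f"Customer {i:03d}"}
    for i in range(1, 21)
]

# Lookups built from the dimensions, so the fact table can only
# ever reference IDs that actually exist.
plan_ids = [p["plan_id"] for p in dim_plans]
fee_by_plan = {p["plan_id"]: p["monthly_fee"] for p in dim_plans}
customer_ids = [c["customer_id"] for c in dim_customers]

# Step 2: fact table, with foreign keys drawn from the dimension tables.
fact_sales = []
for sale_id in range(1, 101):
    plan_id = rng.choice(plan_ids)
    fact_sales.append({
        "sale_id": sale_id,
        "customer_id": rng.choice(customer_ids),
        "plan_id": plan_id,
        "amount": fee_by_plan[plan_id],  # fee follows the plan, not chance
        "sale_date": date(2024, 1, 1) + timedelta(days=rng.randrange(366)),
    })

# Step 3: basic integrity check for references to non-existent rows.
valid_customers = set(customer_ids)
orphans = [
    row for row in fact_sales
    if row["plan_id"] not in fee_by_plan
    or row["customer_id"] not in valid_customers
]
print(f"{len(fact_sales)} fact rows, {len(orphans)} orphaned references")
```

The same pattern carries over to pandas if you prefer DataFrames: build the dimension frames first, then sample their ID columns when constructing the fact frame.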

Pro tip:

Once your dimension tables are set, use their IDs in the fact table to ensure your data is relationally consistent.

4. Model Real-World Logic

Random numbers alone won’t cut it. To make synthetic data truly useful, you need to embed real-world behaviors and business rules that reflect how your product or market actually works.

This is where your domain understanding shines. Since you already know your KPIs, customer lifecycle, and operational quirks (e.g., trial periods, billing cycles, seasonal spikes), you can turn them into logic in your simulation.

Example in Python:
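As a rough sketch, business rules can be encoded as weights and lookup tables rather than left to uniform randomness. The customer segments, plan weights, fees, and annual discount below are all made-up assumptions for illustration. Substitute the rules that match your own product.

```python
import random

rng = random.Random(7)  # fixed seed for reproducibility

# Assumed business rules (hypothetical values).
PLAN_FEES = {"Basic": 49, "Pro": 99, "Enterprise": 249}
PLAN_WEIGHTS = {
    # Larger customers skew toward Enterprise, smaller ones toward Basic.
    "small": {"Basic": 0.6, "Pro": 0.3, "Enterprise": 0.1},
    "large": {"Basic": 0.1, "Pro": 0.3, "Enterprise": 0.6},
}
ANNUAL_DISCOUNT = 0.15  # annual billing gets 15% off the monthly fee


def pick_plan(segment):
    """Choose a plan with probabilities that depend on customer segment."""
    weights = PLAN_WEIGHTS[segment]
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


def monthly_fee(plan, billing_cycle):
    """Fee is determined by the plan tier and billing cycle, never random."""
    fee = PLAN_FEES[plan]
    if billing_cycle == "annual":
        fee = round(fee * (1 - ANNUAL_DISCOUNT), 2)
    return fee


subscriptions = []
for sub_id in range(1, 1001):
    segment = rng.choice(["small", "large"])
    plan = pick_plan(segment)
    cycle = "annual" if rng.random() < 0.3 else "monthly"
    subscriptions.append({
        "subscription_id": sub_id,
        "segment": segment,
        "plan": plan,
        "billing_cycle": cycle,
        "monthly_fee": monthly_fee(plan, cycle),
    })


def enterprise_share(segment):
    """Fraction of a segment's subscriptions that are on Enterprise."""
    rows = [s for s in subscriptions if s["segment"] == segment]
    return sum(s["plan"] == "Enterprise" for s in rows) / len(rows)


print(f"Enterprise share, small: {enterprise_share('small'):.2f}")
print(f"Enterprise share, large: {enterprise_share('large'):.2f}")
```

Keeping the rules in named constants at the top makes it easy to tweak an assumption and regenerate the data.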

Why it Matters:

Analysts, data scientists, and decision-makers rely on patterns in your data. If those patterns aren’t grounded in business reality, your insights and model tests won’t translate into the real world.

5. Make Dates Reflect Real-World Timing

Speaking of real-world logic, the concept also applies to dates. Generating random dates alone won’t make the data feel real. Whether it’s signup dates, transaction timestamps, or subscription start dates, your dataset should reflect plausible business timelines – steady growth, seasonality, or campaign-driven spikes.

For example, a basic signup date might look like:
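A minimal sketch, assuming a uniform draw over the past two years (about 730 days). The function name random_signup_date and the fixed reference date are illustrative choices.

```python
import random
from datetime import date, timedelta


def random_signup_date(rng, end=None, days_back=730):
    """Pick a date uniformly from the `days_back` days ending at `end`."""
    end = end or date.today()
    return end - timedelta(days=rng.randrange(days_back))


rng = random.Random(0)  # fixed seed for reproducibility
reference_date = date(2025, 1, 1)  # pinned so reruns give the same window
signup_dates = [random_signup_date(rng, end=reference_date) for _ in range(5)]
print(signup_dates)
```

Every day in the window is equally likely here, which is exactly why the result looks flat and artificial.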

But to simulate sales activity with seasonal behavior (e.g., Q1 dips, Q4 peaks), you need to add logic. For instance:
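One possible approach is to weight each month before drawing a date. The month weights below are invented to produce a Q1 dip and a Q4 peak. They are assumptions, not figures from a real business.

```python
import calendar
import random
from collections import Counter
from datetime import date

# Relative sales volume per month (hypothetical): low in Q1, high in Q4.
MONTH_WEIGHTS = {
    1: 0.6, 2: 0.6, 3: 0.7,
    4: 1.0, 5: 1.0, 6: 1.0,
    7: 1.0, 8: 1.0, 9: 1.0,
    10: 1.4, 11: 1.5, 12: 1.6,
}


def seasonal_sale_date(rng, year):
    """Pick a month by seasonal weight, then a uniform day in that month."""
    months = list(MONTH_WEIGHTS)
    weights = [MONTH_WEIGHTS[m] for m in months]
    month = rng.choices(months, weights=weights, k=1)[0]
    days_in_month = calendar.monthrange(year, month)[1]
    return date(year, month, rng.randint(1, days_in_month))


rng = random.Random(1)  # fixed seed for reproducibility
sale_dates = [seasonal_sale_date(rng, 2024) for _ in range(5000)]

# Count sales per quarter to confirm the seasonal shape came through.
sales_per_quarter = Counter((d.month - 1) // 3 + 1 for d in sale_dates)
for quarter in sorted(sales_per_quarter):
    print(f"Q{quarter}: {sales_per_quarter[quarter]} sales")
```

A growth trend can be layered on in the same way, for example by weighting later years more heavily than earlier ones.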

Pro Tip:

Injecting patterns like seasonality, growth over time, or churn post-trial can take your mock dataset from generic to genuinely insightful.

6. Simulate Change Over Time, Not Just Snapshots

Most business processes aren’t static. Customers come and go, payments fail and recover, and that’s just the tip of the iceberg. Yet, many mock datasets only reflect a fixed moment in time.

To make your data more reflective of real-world behavior, simulate status changes over time. For instance, a subscription might start on a trial, convert to a paid plan, and eventually churn. Capturing these transitions makes your analysis far more realistic.

Here’s how that might look in code:
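The following is a simplified sketch that records each status change as a dated row, so a subscription's full history can be reconstructed. The trial length, conversion rate, churn rate, and 30-day billing month are assumed values chosen for illustration.

```python
import random
from datetime import date, timedelta

# Assumed lifecycle parameters (hypothetical).
TRIAL_DAYS = 14
TRIAL_CONVERSION_RATE = 0.4  # share of trials that convert to paid
MONTHLY_CHURN_RATE = 0.05    # chance a paying customer churns each month


def simulate_status_history(rng, subscription_id, start, months=12):
    """Return dated status events: trial -> active -> (maybe) churned."""

    def event(status, effective_date):
        return {
            "subscription_id": subscription_id,
            "status": status,
            "effective_date": effective_date,
        }

    events = [event("trial", start)]
    trial_end = start + timedelta(days=TRIAL_DAYS)

    # Trials that do not convert churn when the trial ends.
    if rng.random() >= TRIAL_CONVERSION_RATE:
        events.append(event("churned", trial_end))
        return events

    events.append(event("active", trial_end))

    # Each billing month, a paying customer may churn.
    for month in range(1, months + 1):
        if rng.random() < MONTHLY_CHURN_RATE:
            events.append(event("churned", trial_end + timedelta(days=30 * month)))
            break
    return events


rng = random.Random(3)  # fixed seed for reproducibility
histories = [
    simulate_status_history(rng, sub_id, date(2024, 1, 1))
    for sub_id in range(1, 201)
]
for row in histories[0]:
    print(row)
```

Storing one row per status change, instead of a single current-status column, is what lets you later answer questions like how many subscriptions were active on a given date.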

Progress Over Perfection

Simulating realistic data for a BI project is no small feat. While these tips gave structure to the chaos, the process was still full of trial, error, and unexpected learning curves. If you’ve made it this far, hopefully you’re walking away with a clearer sense of what to expect, and how to approach your own mock dataset with more confidence.

That said, my final dataset isn’t perfect and that’s okay. I didn’t get everything right, but I’m genuinely glad I took on the challenge of building a dataset from scratch. It pushed me to think more deeply about structure, logic, and how the pieces of a system come together to tell a story.

I’m proud of what I built, and even more proud of what I learned along the way. If you’re on a similar path, I hope these lessons make the process feel a little more approachable and remind you that learning is often the most valuable outcome of all.