Simulating Realistic Subscription Data for Business Intelligence

This entry is part 3 of 6 in the series Investigating Subscription Churn at StaffWise

One of the biggest hurdles in building a business intelligence (BI) pipeline is… well, getting the data. But what happens when real data isn’t available or accessible? That was the situation I found myself in while building a data pipeline and dashboard for a subscription-based business.

The solution? Simulate a realistic, meaningful dataset from scratch – one that spans five years of transactions and reflects seasonality, billing cycles, customer behavior, churn, and revenue.

Here’s a walkthrough of the process – how I built that dataset, the challenges I faced, and what I learned along the way.

Step 1: Defining the Business Model and Scope

Before writing a single line of code, I defined the business logic I wanted to simulate. The dataset needed to mimic a real subscription service with:

  • Multiple billing cycles (monthly, quarterly, annual)
  • Tiered subscription plans and pricing
  • Realistic customer behavior: churn, renewals, and payment failures
  • Seasonality and holiday influence on signups and payments
  • A multi-year timeline (2019 – 2024) for trend analysis

This upfront clarity made it easier to shape the dataset and keep it focused on business intelligence needs.
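The scope above can be captured as a small configuration block before any generation code runs. This is a minimal sketch; the plan names, prices, and outcome weights are illustrative assumptions, not the exact values used in the notebook:

```python
from datetime import date

# Five-plus-year simulation window for trend analysis
SIM_START, SIM_END = date(2019, 1, 1), date(2024, 12, 31)

# Billing cycles, expressed as months per cycle
BILLING_CYCLES = {"Monthly": 1, "Quarterly": 3, "Annual": 12}

# Tiered plans with example price points (assumed values)
PLANS = {
    "Basic": 9.99,
    "Standard": 19.99,
    "Premium": 39.99,
}

# Weighted payment outcomes; probabilities sum to 1
PAYMENT_OUTCOMES = {"Success": 0.90, "Failed": 0.07, "Refunded": 0.03}
```

Pinning these decisions down in one place makes the later generation code easier to tune and review.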

Step 2: Building Robust Dimension Tables

To ensure structure and consistency, I created the dimension tables first. These became the backbone of the dataset.

  • Date Dimension: Daily records with year, month, day of week, season, and flags for weekends and holidays
  • Location Dimension: A curated list of city-state-country combinations for geographic insights
  • Billing Cycles, Subscription Plans, and Customers: Each with defined attributes, keys, and realistic variation

Starting with dimensions ensured the fact tables could link cleanly and allowed me to enforce relationships from the beginning.
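A date dimension like the one described can be built in a few lines of pandas. This is a minimal sketch: the holiday list is a placeholder (fixed-date holidays only) and the season mapping assumes meteorological seasons, which may differ from the notebook's definitions:

```python
import pandas as pd

def build_date_dim(start="2019-01-01", end="2024-12-31"):
    # One row per calendar day across the simulation window
    dates = pd.date_range(start, end, freq="D")
    dim = pd.DataFrame({"date": dates})
    dim["date_key"] = dim["date"].dt.strftime("%Y%m%d").astype(int)
    dim["year"] = dim["date"].dt.year
    dim["month"] = dim["date"].dt.month
    dim["day_of_week"] = dim["date"].dt.day_name()
    dim["is_weekend"] = dim["date"].dt.dayofweek >= 5  # Saturday/Sunday
    # Meteorological seasons keyed off the month
    dim["season"] = dim["month"].map(
        {12: "Winter", 1: "Winter", 2: "Winter",
         3: "Spring", 4: "Spring", 5: "Spring",
         6: "Summer", 7: "Summer", 8: "Summer",
         9: "Fall", 10: "Fall", 11: "Fall"}
    )
    # Placeholder holiday flag; the real table would use a fuller calendar
    dim["is_holiday"] = dim["date"].dt.strftime("%m-%d").isin(
        ["01-01", "07-04", "12-25"]
    )
    return dim

date_dim = build_date_dim()
```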

Step 3: Simulating Subscriptions and Transactions

Next came the fact tables — specifically, subscriptions and sales. Here’s how I approached the simulation:

  • Subscriptions were assigned start dates randomly distributed over five years.
  • Transactions were generated based on billing cycles, with built-in seasonality — increased activity in summer and holidays.
  • Churn was modeled by simulating payment failures (e.g., monthly plans failing after 90 days).
  • Payment outcomes like “Success,” “Failed,” and “Refunded” were assigned using weighted probabilities.
  • Revenue was calculated based on plan pricing and payment outcome, with occasional refunds and discounts baked in.

The result was a dataset that doesn’t just look realistic — it behaves like a business with rhythms, trends, and anomalies.
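The transaction loop above can be sketched roughly as follows: expand each subscription into billing-cycle payments, draw an outcome with weighted probabilities, and derive revenue from the outcome. The outcome weights, refund handling, and 30-day cycle approximation are illustrative assumptions, not the notebook's exact logic:

```python
import random
from datetime import date, timedelta

random.seed(42)  # reproducible draws for the sketch

OUTCOMES = ["Success", "Failed", "Refunded"]
WEIGHTS = [0.90, 0.07, 0.03]  # assumed weighted probabilities

def simulate_transactions(start, months, cycle_months, price,
                          end=date(2024, 12, 31)):
    """Generate one payment row per billing cycle for a subscription."""
    rows = []
    due = start
    for _ in range(months // cycle_months):
        if due > end:
            break  # never bill past the simulation window
        outcome = random.choices(OUTCOMES, weights=WEIGHTS, k=1)[0]
        revenue = price * cycle_months if outcome == "Success" else 0.0
        if outcome == "Refunded":
            revenue = -price * cycle_months  # a refund reverses the charge
        rows.append({"date": due, "outcome": outcome, "revenue": revenue})
        due += timedelta(days=30 * cycle_months)  # approximate cycle length
    return rows

# Example: a two-year monthly subscription on the Standard plan
txns = simulate_transactions(date(2021, 3, 1), months=24,
                             cycle_months=1, price=19.99)
```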

Challenges and Solutions

  • Modeling seasonality and churn without real data

I used domain logic to assign higher transaction probabilities to peak periods (e.g., summer months, Easter, Thanksgiving, Christmas, spring break) and modeled churn using decay rules for monthly billing.
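That weighting rule can be expressed as a simple helper that boosts activity in peak windows. The base rate, peak months, and multiplier below are illustrative assumptions, not the values used in the notebook:

```python
from datetime import date

# Assumed peak windows: summer plus the year-end holiday season
PEAK_MONTHS = {6, 7, 8, 11, 12}

def daily_signup_probability(d: date, base=0.02, peak_multiplier=1.5):
    """Probability that a given prospect signs up on day d."""
    return base * peak_multiplier if d.month in PEAK_MONTHS else base
```

The same pattern extends to churn: a monthly subscription's failure probability can ramp up once it passes a decay threshold, rather than staying flat.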

  • Ensuring consistency across linked tables

I applied strict foreign key logic in the simulation and regularly validated that IDs matched across tables.
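One way to automate that validation is an orphan-key check after each table is generated: every foreign key in a fact table must exist in its dimension. The table and column names here are hypothetical:

```python
import pandas as pd

# Toy tables standing in for the real dimension and fact tables
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
subscriptions = pd.DataFrame({"subscription_id": [10, 11],
                              "customer_id": [1, 3]})

def assert_fk(fact, dim, key):
    """Fail loudly if the fact table references IDs missing from the dimension."""
    orphans = set(fact[key]) - set(dim[key])
    assert not orphans, f"Orphaned {key} values: {orphans}"

assert_fk(subscriptions, customers, "customer_id")  # passes silently
```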

  • Data mismatches caused by vague logic

I learned the hard way that vague prompts lead to chaotic outputs. For example, randomly assigning cities, states, and countries independently led to wild combinations like “Nairobi, Canada.” Fixing this meant associating locations as structured triples and randomizing by row, not column.
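In code, the fix amounts to sampling intact triples rather than independent columns. The short location list is a placeholder for the curated list in the dimension table:

```python
import random

random.seed(0)

# Each entry is a complete, internally consistent (city, state, country) triple
LOCATIONS = [
    ("Nairobi", "Nairobi County", "Kenya"),
    ("Toronto", "Ontario", "Canada"),
    ("Austin", "Texas", "United States"),
]

def random_location():
    # Pick one whole row so city, state, and country always agree
    return random.choice(LOCATIONS)

city, state, country = random_location()
```

Because the unit of randomization is the row, impossible combinations like "Nairobi, Canada" can never be generated.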

Lessons Learned

  • Clear business context is everything – it guides your simulation rules and ensures the data has analytical value.
  • Start with dimensions – they reduce chaos in downstream logic.
  • Randomness isn’t enough – simulate behavior, not just data.
  • Details matter – naming conventions, schema alignment, and column logic can make or break your tables.
  • Test early and often – especially when your dataset is the foundation of everything that follows.

Next Steps

With this simulated data in place, I’ll be moving on to exploration, cleaning, and modeling. For anyone trying to build a realistic BI pipeline without access to internal systems, simulation isn’t just a workaround. It’s a great way to test your logic, sharpen your storytelling, and build something you fully understand.

Access the Simulation Code

To dive into the full code behind this simulation, you can explore the Google Colab notebook here. The notebook includes all implementation details, from data generation and dimension table setup to fact table creation and transaction simulation.
