Personification: Combining Disconnected Datasets To Construct Realistic Synthetic Audiences

Ever found yourself staring at multiple datasets, knowing they collectively hold powerful insights, but with no way of connecting them?

For example, imagine you run a car dealership. You have sales data on your customers, demographic data downloaded from the US Census, and smartphone usage data from a separate survey. You'd like to mash all that together to create personas for virtual audience testing that follow realistic patterns.

You could do it manually, but this could be hundreds of thousands of rows, and how would you get the distributions right across the whole dataset? If you join them at random, you'd get oddities in the data, like children with $120,000 incomes, or rural areas where more than 80% of people work as software developers. If you can't believe the audience you constructed, you can't trust the results of your synthetic market research.
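To make the random-join problem concrete, here is a minimal sketch in Python. The population, the age cutoff, and the income values are all made up for illustration; the point is only that shuffling one column keeps its overall distribution intact while breaking its link to the other column.

```python
import random

rng = random.Random(42)

# Hypothetical "true" population where age and income are linked:
# anyone under 16 has no income.
truth = []
for _ in range(10_000):
    age = rng.randint(5, 80)
    income = 0 if age < 16 else rng.choice([30_000, 60_000, 120_000])
    truth.append((age, income))

# Two disconnected datasets: each holds one column, with no shared row order.
ages = [age for age, _ in truth]
incomes = [income for _, income in truth]
rng.shuffle(incomes)

# Joining at random keeps each column's distribution but loses the relationship.
random_join = list(zip(ages, incomes))


def child_high_earners(rows):
    """Count rows where a child is paired with a $120,000 income."""
    return sum(1 for age, income in rows if age < 16 and income >= 120_000)


print("truth:", child_high_earners(truth))  # 0 by construction
print("random join:", child_high_earners(random_join))
```

Both versions contain exactly the same set of incomes, so every aggregate total still tallies. Only the row-level combinations go wrong, which is why checking totals alone won't catch the problem.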

That's where personification comes in: a technique for creating audiences that stand up to scrutiny. The numbers all tally in the aggregate, and no individual persona has an unrealistic combination of attributes. It is commonly used in synthetic data generation, and was recently used to create personas that predicted Trump's election.

How Personification Works

The process is called "SYNC", or Synthetic Data Generation via Gaussian Copula... yeah, that's a mouthful. As you can imagine, it's somewhat complicated to run, but here is the general flow:

Statistical Synthesis - First, we create synthetic individuals whose characteristics match the statistical distributions of our aggregate datasets. If census data says 30% of people in an area earn over $80K, then 30% of our synthetic people will too.
Maintaining Statistical Relationships - The system preserves correlations between variables. If education and income are linked in real data (people with a degree tend to earn more), our synthetic people reflect that relationship.
Probabilistic Matching - When we have individual-level data (like customer records), we match those real individuals to their most likely synthetic counterparts.
Data Enrichment - Finally, we can enrich our understanding of real customers by layering in the additional attributes from their synthetic matches.
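
For the first two steps, the Gaussian copula idea can be sketched in a few lines of Python. This is a toy illustration, not the SYNC pipeline itself: the category shares, the RHO value, and the function names are assumptions made up for the example, and it skips the matching and enrichment steps entirely. It draws two correlated normal values per person, converts each to a uniform value with the normal CDF, then maps that onto the aggregate shares.

```python
import random
from bisect import bisect_right
from itertools import accumulate
from statistics import NormalDist

# Hypothetical aggregate shares, ordered low to high (not real census figures).
INCOME = {"under_40k": 0.40, "40k_to_80k": 0.30, "over_80k": 0.30}
EDUCATION = {"no_degree": 0.60, "degree": 0.40}
RHO = 0.6  # assumed latent correlation between education and income


def to_category(u, shares):
    """Map a uniform draw u in [0, 1] to a category via cumulative shares."""
    labels = list(shares)
    cuts = list(accumulate(shares.values()))
    return labels[min(bisect_right(cuts, u), len(labels) - 1)]


def synthesize(n, seed=0):
    """Create n synthetic people whose attributes follow the shares above."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    people = []
    for _ in range(n):
        # Two standard normal draws with correlation RHO (the copula step).
        z_edu = rng.gauss(0, 1)
        z_inc = RHO * z_edu + (1 - RHO ** 2) ** 0.5 * rng.gauss(0, 1)
        # The normal CDF gives a uniform value; the shares turn it into a label.
        people.append({
            "education": to_category(phi(z_edu), EDUCATION),
            "income": to_category(phi(z_inc), INCOME),
        })
    return people


def rate_over_80k(group):
    """Share of a group in the top income band."""
    return sum(p["income"] == "over_80k" for p in group) / len(group)


people = synthesize(20_000)
grads = [p for p in people if p["education"] == "degree"]
non_grads = [p for p in people if p["education"] == "no_degree"]
print(f"over $80K overall: {rate_over_80k(people):.2f}")
print(f"over $80K with degree: {rate_over_80k(grads):.2f}")
print(f"over $80K without degree: {rate_over_80k(non_grads):.2f}")
```

The overall share earning over $80K should land close to the 30% target, while the degree holders end up over-represented in that band. In a real project the correlation would be estimated from data rather than picked by hand.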

Once you go through this process, you should be able to tally the attributes in your original data and see that the synthetic audience matches them in the aggregate too. Attributes that you would expect to be correlated should also be correlated in your final audience.
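
A sanity check along those lines might look like the sketch below. The target shares and the tiny hand-built audience are hypothetical, as are the helper names; in practice you would load your own targets and your generated personas instead.

```python
from collections import Counter

# Hypothetical target shares from the source data.
target_income = {"under_40k": 0.40, "40k_to_80k": 0.30, "over_80k": 0.30}

# A tiny hand-built synthetic audience of 100 personas (illustrative only).
audience = (
    [{"education": "no_degree", "income": "under_40k"}] * 33
    + [{"education": "no_degree", "income": "40k_to_80k"}] * 18
    + [{"education": "no_degree", "income": "over_80k"}] * 9
    + [{"education": "degree", "income": "under_40k"}] * 7
    + [{"education": "degree", "income": "40k_to_80k"}] * 12
    + [{"education": "degree", "income": "over_80k"}] * 21
)


def shares(rows, key):
    """Share of rows falling in each category of `key`."""
    counts = Counter(row[key] for row in rows)
    return {label: n / len(rows) for label, n in counts.items()}


def max_gap(actual, target):
    """Largest absolute difference between actual and target shares."""
    return max(abs(actual.get(label, 0.0) - share) for label, share in target.items())


def crosstab(rows, row_key, col_key):
    """Within each `row_key` group, the share of each `col_key` category."""
    groups = {}
    for row in rows:
        groups.setdefault(row[row_key], []).append(row)
    return {label: shares(group, col_key) for label, group in groups.items()}


gap = max_gap(shares(audience, "income"), target_income)
by_education = crosstab(audience, "education", "income")
print(f"largest income gap vs target: {gap:.3f}")
for label, dist in by_education.items():
    print(label, {k: round(v, 2) for k, v in dist.items()})
```

The first check covers the aggregate tally, and the crosstab covers the correlation: if the income split looked the same for both education groups, the relationship would have been lost somewhere along the way.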

Why This Matters for Synthetic Market Research

The applications are genuinely transformative:

Hidden Segment Discovery - Uncover customer segments that were invisible when looking at datasets in isolation.
More Realistic Personas - Move beyond simplistic demographic profiles to multi-dimensional personas grounded in statistical reality.
Privacy-Preserving Insights - Get individual-level insights without compromising privacy, since we're working with synthetic people.
Cross-Dataset Learning - Understand how behaviors in one domain (smartphone usage) might predict preferences in another (vehicle purchases).
Realistic AI Personas - When you don't have individual-level data that you can use to clone your customers, this is the next best way to get realistic individual personas that match what you know about your audience.

My Recent Experience

I am working on a project where we are applying personification to combine census demographic data, credit data, and social trends. What jumped out immediately was how certain high-value customers, previously categorized by just age and income, actually represented a distinct behavioral segment with specific digital habits we would have missed otherwise.

The synthesis process revealed patterns that wouldn't have been visible from any single dataset, like how education level correlates with certain product preferences only within specific income bands. Importantly, it gives us confidence in what our virtual audience is saying, because the personas' demographics line up with our expectations (and the client's).

The Limitations Worth Noting

To be fair, personification isn't perfect:

The quality depends heavily on your input data sources
Synthetic people are statistical approximations, not literal predictions
Correlation doesn't equal causation (though it often suggests where to look deeper)

When trying to recreate individual-level data from aggregated sources, accuracy varies dramatically with the size of the aggregation unit. According to researchers, for small units (fewer than 10 individuals) reconstruction accuracy hovers around 40-47%, but it drops to just 22-24% for larger units (100+ individuals). Even for binary variables like gender, accuracy tops out at 80%, and for variables with many categories (like income bands), accuracy plummets to 24-26%. That is far from perfect, but still much better than random guessing.

Getting Started with Your Own Data

If you're intrigued by personification for your market research:

Start by identifying datasets that would be valuable to connect
Look for common variables that could serve as matching points
Consider which aggregate patterns are most important to preserve
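
As a starting point for the second step, a quick way to spot candidate matching points is to compare column names across your datasets. The sketch below uses made-up column lists for the dealership example from earlier; the dataset and column names are assumptions, and real data would usually need some renaming and re-banding before the columns genuinely line up.

```python
from itertools import combinations

# Hypothetical column lists for the three datasets in the dealership example.
datasets = {
    "sales": ["customer_id", "zip_code", "age_band", "vehicle_model"],
    "census": ["zip_code", "age_band", "income_band", "education"],
    "smartphone_survey": ["age_band", "income_band", "daily_screen_hours"],
}


def shared_columns(datasets):
    """For every pair of datasets, list the column names they have in common."""
    return {
        (a, b): sorted(set(datasets[a]) & set(datasets[b]))
        for a, b in combinations(datasets, 2)
    }


pairs = shared_columns(datasets)
for (a, b), columns in pairs.items():
    print(f"{a} <-> {b}: {', '.join(columns) or 'no shared columns'}")
```

Here every pair shares age_band, so that would be the natural place to start connecting them. A pair with no shared columns can only be linked indirectly, through a third dataset that overlaps with both.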

The technical lift to implement this approach is still high. For us it's a custom project rather than something you can do through Rally today. However, if you want to chat about doing a similar project, please get in touch.
