Is Synthetic Data Good Enough to Train User Simulators?

By Shirley Wu & Arpandeep Khatua

We spent a month trying to make synthetic data work for user simulation. Here's what we found, and why we ultimately went real.

Many people now train models on synthetic data, and with good reason: it is cheap, controllable, and scalable. Like many others, we started with synthetic data to train our user simulators. Spoiler alert if you haven't read our paper: we eventually decided to collect real user data instead. To our surprise, the synthetic-data attempt took about a month of our time.

The gap between synthetic and real responses

To give you a sense of how different synthetic and real user data really are, consider a Reddit post asking whether social media and politics are ruining your life.

Reddit post about politics and social media
Figure 1. A real Reddit post, which is also used as a prompt for synthetic data generation.

If you just ask a model to comment on this post (Figure 2), its answer is rather neutral: "if you use it consciously, it's fine. But if you… it can ruin your peace of mind." You can tell the model knows two directions to go, agree or disagree. Like a helpful assistant, it blends them into one balanced response.

If you further engineer a better prompt with instructions on how to behave like a human, the model knows to be more decisive and picks a side.

Synthetic model responses to the Reddit post
Figure 2. With basic prompting, the model hedges. With a more carefully engineered prompt, it commits to a position, but still within a narrow range of responses.

This seems promising, until you realize that across many generations for the same prompt, the model agrees or disagrees for very similar reasons. And this goes beyond mere "intra-model repetition," where a single model consistently generates similar responses, a problem noted by Jian et al. (2025). The real problem is what you find when you look at the actual human responses.

In reality, beyond just agreeing or disagreeing, many people mock the poster:

Real user response: mockery
Figure 3. A real user response that mocks the framing of the post entirely.

Others express entirely different emotional registers:

Real user response: emotional
Figure 4. A real user response conveying a distinct emotional perspective absent from synthetic outputs.

Or provide structural analysis of why Reddit is becoming political:

Real user response: analytical
Figure 5. A real user offering a meta-level analysis of platform dynamics, a response type that synthetic data rarely, if ever, generates.

And even among those who simply agree or disagree, real comments offer far more interesting perspectives than anything synthetic data tends to produce. Consider these two examples you would hardly ever sample from a language model:

Real user disagrees: ignoring politics is a privilege
Figure 6. A user disagrees because ignoring politics is a privilege unavailable to many people.

Real user agrees: social media exposure bias
Figure 7. A user agrees because social media's downvoting mechanics create an exposure bias that distorts perception.
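
One rough way to put a number on the narrowness described in this section is to measure how much a set of responses to the same post overlap with each other. The sketch below uses mean pairwise Jaccard similarity over word sets. This metric, the function names, and the example strings are assumptions made for illustration; they are not the analysis from the post or the paper, and the strings are invented, not drawn from any dataset.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two responses, in [0, 1]."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty responses are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def mean_pairwise_similarity(responses: list[str]) -> float:
    """Average Jaccard similarity over all unordered pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented examples: three near-paraphrases vs. three unrelated reactions.
narrow = [
    "social media is fine in moderation",
    "social media is fine if used in moderation",
    "in moderation social media is fine",
]
varied = [
    "ignoring politics is a privilege",
    "lol touch grass",
    "downvotes hide what you disagree with",
]

narrow_score = mean_pairwise_similarity(narrow)
varied_score = mean_pairwise_similarity(varied)
```

A higher score means the responses repeat each other more. Word overlap is a crude proxy: it misses paraphrases with different wording, so an embedding-based similarity would likely be a better fit for a real comparison.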

When the reward goes up but nothing improves

Even with this observation, we didn't give up on synthetic data right away. One can argue that even if a gap exists, the model can still learn something useful from it. Then we saw this:

Training reward curves on synthetic vs real data
Figure 8. Training reward curves for GRPO on 10k synthetic vs. 10k real user responses. The reward climbs steadily on synthetic data, but barely moves on real data, using the same algorithm, data size, and compute.

For context: we use BERTScore as the reward here. It computes embedding similarity between ground-truth responses and generated ones. We train two models, one on 10k synthetic responses and one on ~10k real user responses, using the same algorithm (standard GRPO), the same amount of data, and the same training pipeline and compute.
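
To make the reward concrete, here is a minimal sketch of the greedy-matching computation at the core of BERTScore. It assumes the F1 variant (the post does not say whether precision, recall, or F1 is used) and takes token embeddings as plain lists of nonzero vectors. The function names and the toy 2-d embeddings are made up for illustration; real BERTScore uses contextual embeddings from a pretrained encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def bertscore_f1(ref_emb, cand_emb):
    """Greedy-matching F1 over token embeddings."""
    # Recall: match each reference token to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    # Precision: match each candidate token to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy embeddings (hypothetical): one vector per token.
reference = [[1.0, 0.0], [0.0, 1.0]]
matching_candidate = [[1.0, 0.0], [0.0, 1.0]]  # covers both reference tokens
partial_candidate = [[1.0, 0.0], [1.0, 0.0]]   # covers only the first one

full_reward = bertscore_f1(reference, matching_candidate)
partial_reward = bertscore_f1(reference, partial_candidate)
```

The partial candidate keeps perfect precision but loses half its recall, so its F1 drops. In a training loop, this score against the ground-truth response would serve as the scalar reward for each generated response.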

The results diverge sharply. We see clear improvements on synthetic data, but no meaningful improvement on real data. An algorithm designer looking only at synthetic results could easily declare success. But that same algorithm fails to move the needle on real data, either during training or, likely, at test time when the model needs to generalize.
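
For readers less familiar with GRPO: its core step samples a group of responses to the same prompt, scores each one, and normalizes every reward against the rest of the group. Below is a minimal sketch of that normalization only, not the training code behind Figure 8. The function name, the epsilon, the use of the population standard deviation, and the example rewards are assumptions for illustration; implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some codebases use sample std
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
spread_rewards = [0.2, 0.4, 0.6, 0.8]
flat_rewards = [0.5, 0.5, 0.5, 0.5]

spread_adv = group_relative_advantages(spread_rewards)
flat_adv = group_relative_advantages(flat_rewards)
```

Responses scoring above the group mean get a positive advantage and those below get a negative one. When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.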

Why this happens: the Platonic convergence problem

To understand the root cause, we took Qwen3-8B, an open-source LLM with no exposure to either dataset, and measured how "surprised" it is by each type of response. We computed the mean token-level log probability of real and synthetic responses on Reddit, using Qwen3-8B's own outputs and GPT-4o-mini's synthetic outputs as reference points.

Log probability distribution of real vs synthetic responses
Figure 9. Distribution of mean token log-probabilities. Synthetic outputs from GPT-4o-mini (blue) land in the same high-likelihood region as the model's own outputs. Real human responses (red) are spread across a much wider, lower-probability range.
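
The quantity plotted in Figure 9 can be sketched as follows: for each response, take the scoring model's log-probability of every token that actually appears, then average over the response. This is a minimal illustration, not the evaluation code from the post. It assumes you already have per-position next-token logits from a scoring model (Qwen3-8B in the post); the function names and the toy three-token vocabulary are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_token_logprob(step_logits, token_ids):
    """Average log-probability of the observed tokens.

    step_logits[t] holds the scorer's next-token logits at position t;
    token_ids[t] is the token that actually appears at that position.
    """
    if len(step_logits) != len(token_ids) or not token_ids:
        raise ValueError("need one logit vector per observed token")
    total = sum(log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids))
    return total / len(token_ids)


# Toy scorer that strongly prefers token 0 at every position (made-up logits).
toy_logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]

predictable = mean_token_logprob(toy_logits, [0, 0])  # the tokens the scorer expects
surprising = mean_token_logprob(toy_logits, [1, 2])   # tokens the scorer finds unlikely
```

Values closer to zero mean the scorer finds the response more predictable. Averaging per token rather than summing keeps long and short responses comparable.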

What's striking is that GPT-4o-mini's outputs are also quite predictable to Qwen3-8B, even though they come from a completely different model. This is consistent with the Platonic Representation Hypothesis (Huh et al., 2024), which argues that neural networks are converging toward a shared statistical model of reality. Different LLMs, despite different architectures and training data, end up with similar representations, so synthetic text from any LLM lands in roughly the same narrow, high-likelihood region.

Real human responses, by contrast, spread across a much wider range, into regions these converging representations would basically never produce on their own.

This is the core problem. When we train on synthetic data and see the reward go up, we're not teaching the model to simulate humans. We're teaching it to stay within the shared representational space that all LLMs already occupy. The reward signal from real data pushes the model outside of it, which is harder to optimize, but that's where the actual humans are.


Takeaway

In our case, this is a sobering result for synthetic data: the improvements you observe on synthetic benchmarks may simply not transfer to the real users you actually want to simulate. For this study, we decided to go real all the way, starting from the data.

That said, the "in our case" qualifier matters. Synthetic data might still be a useful reference if you engineer a diverse enough set of system prompts to improve variety and coverage. A promising direction might be using a small amount of real user data as a seed to generate more targeted synthetic data. But for now, when it comes to simulating the full breadth of how real people actually respond, there's no substitute for the real thing.