Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
Gati Aher, Rosa I. Arriaga, Adam Tauman Kalai
Published: 2023-07-09

🔥 Key Takeaway:
The most realistic human behavior didn’t come from making the AI smarter—it came from letting it be inconsistent: when the model varied its responses by name, gender, or offer size in ways that weren’t perfectly rational, it accidentally got closer to how real people actually act.
🔮 TLDR
This paper introduces "Turing Experiments" (TEs) as a method for testing how well large language models (like GPT-4) can simulate human behavior across a population, not just mimic an individual. The authors recreated four classic human subject studies with LMs: the Ultimatum Game (economic fairness), Garden Path Sentences (linguistic parsing), the Milgram Shock Experiment (obedience to authority), and Wisdom of Crowds (aggregated estimates). The models replicated human responses well in the first three, but in the fourth they showed a “hyper-accuracy distortion”: unrealistically precise answers that appear to stem from training or alignment rather than from human-like behavior. The study found that model outputs vary consistently with demographic cues like names and gender, enabling fine-grained simulation of population differences. It also highlighted the importance of prompt design and of avoiding p-hacking by validating prompt consistency before running tests. For synthetic market research, this suggests using diverse, demographically varied personas, validating prompt clarity and coherence, being wary of hyper-accurate or idealized responses in knowledge-based tasks, and checking for systematic distortions that may affect fidelity to real-world behavior.
📊 Cool Story, Needs a Graph
Figure 5: “(a) Ultimatum Game TE simulation shows a wide gap in average acceptance rate for different-gender pairs”

Language model simulations reveal gender-based acceptance rate differences in economic decision-making.
This figure highlights systematic behavioral differences by gender pairing in simulated Ultimatum Game interactions using a large language model. Simulated male participants were more likely to accept unfair offers from females, and females were less likely to accept unfair offers from males, supporting a "chivalry" effect noted in human studies. It provides strong evidence that the model internalizes social cues and demographic context, an important factor in designing representative AI personas.
⚔️ The Operator's Edge
The study’s most overlooked but crucial detail is that just changing the subject’s name and title (like “Mr.” or “Ms.”) consistently altered simulated outcomes, revealing that the AI model encodes subtle demographic associations—even when no explicit traits like age, job, or ethnicity are provided. This works because large language models have absorbed billions of real-world linguistic patterns where names and honorifics correlate with social norms, expectations, and biases.
Why it matters: Most experts assume you need rich persona profiles (age, race, job, goals) to simulate behavior changes. But this study shows that simple surface cues—like “Ms. Huang” versus “Mr. Wagner”—are enough to activate realistic shifts in AI behavior, including replicating known effects like gender-based acceptance gaps in economic games. This means researchers can simulate diverse population responses by minimally varying prompts, saving complexity while increasing realism.
Example of use: A product team testing different loan application flows could simulate gender and cultural biases by changing only the applicant name (e.g., “Mr. James Carter” vs. “Ms. Aisha Khan”) without constructing full demographic backstories. This lets them spot unintended disparities in model-based screening or approval suggestions with minimal setup.
Example of misapplication: A UX researcher simulates user reactions to a pricing model using generic personas like “User A” and “User B,” assuming names don’t matter. They get uniform responses and conclude there’s no difference in how segments respond, missing the fact that AI models often adjust tone, trust, or behavior subtly based on name and title cues. Had they used demographically salient names, they might have uncovered gaps in how the feature resonates across perceived identity lines.
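To make this concrete, here is a minimal Python sketch of the name-swap technique. It assumes a placeholder query_model function that you would wire to whatever LLM endpoint you use; the prompt is adapted from the Ultimatum Game template shown in the Prompts section below, and the responder names, the fixed proposer, and the $7/$3 split are illustrative choices, not values from the paper.

```python
# Hedged sketch: probe demographic cues by swapping only the responder's name and title.

def query_model(prompt: str) -> str:
    """Placeholder; replace with a call to your own LLM provider (assumption)."""
    raise NotImplementedError("Wire this to your LLM API of choice.")

# Illustrative, demographically salient responder names (not from the paper).
RESPONDERS = ["Ms. Huang", "Mr. Huang", "Ms. Wagner", "Mr. Wagner"]
PROPOSER = "Mr. Carter"  # held fixed so only the responder cue varies

TEMPLATE = (
    "In the following scenario, {responder} had to decide whether to accept or reject the proposal.\n"
    "Scenario: {proposer} is given $10. {proposer} will propose how to split the money between "
    "himself and {responder}. Then {responder} will decide whether to accept or reject "
    "{proposer}'s proposal. If {responder} accepts, both get the money as agreed. If {responder} "
    "rejects, both receive nothing. {proposer} takes $7 for himself and offers {responder} $3.\n"
    "Answer: {responder} decides to"
)

def acceptance_rates(n_samples: int = 50) -> dict[str, float]:
    """Estimate how often each responder name accepts the same unfair offer."""
    rates = {}
    for responder in RESPONDERS:
        prompt = TEMPLATE.format(responder=responder, proposer=PROPOSER)
        accepted = sum(
            query_model(prompt).strip().lower().startswith("accept")
            for _ in range(n_samples)
        )
        rates[responder] = accepted / n_samples
    return rates
```

Comparing the resulting rates across name and title pairs surfaces gaps like the one in Figure 5 without building any richer persona profile.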
🗺️ What are the Implications?
• Better prompts beat bigger models: The way you ask questions influences results more than which AI model you use. Crafting clearer, more human-like prompts can improve accuracy without extra cost.
• AI personas mirror social patterns—sometimes too well: The AI can reflect real human biases like gender-based decision trends, so researchers should check for exaggerated or unrealistic demographic effects before acting on results.
• Avoid fact-based or numerical questions for now: On estimation tasks, newer AI models tend to give unrealistically perfect answers (e.g., guessing the exact melting point of metals), which makes them less useful for simulating uncertainty or group averages.
• Larger simulations reveal hidden behaviors: Group effects like fairness, influence, or herd behavior only show up with 500–1000+ simulated personas. Use larger sample sizes when testing social dynamics or decision spread.
• Validate with small real samples: If budget is tight, you can run synthetic studies at scale, then validate trends with a smaller human panel to reduce risk before launch.
• Use diverse but simple personas: Varying basic demographic cues like name and gender gave distinct response patterns—no need for complex backstories. Focus on diversity breadth rather than depth.
• Pre-check prompts before launch: Small tweaks in phrasing can change outcomes. It’s worth pre-testing prompts on a few simulated personas to catch unclear or biased wording.
• Be cautious with “crowd wisdom” outputs: Simulated group averages can be distorted by models trying to be too accurate. Don’t over-trust aggregated results without checking variance or outliers; a minimal variance check is sketched below.
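As a rough illustration of the last bullet, the Python sketch below collects many simulated numeric estimates and flags suspiciously low spread, the signature of the hyper-accuracy distortion described in the TLDR. query_model is a placeholder for your own LLM call, the regex-based number extraction is a simplification, and the 2% coefficient-of-variation cutoff is an arbitrary illustrative threshold, not a value from the paper.

```python
import re
import statistics

def query_model(prompt: str) -> str:
    """Placeholder; replace with a call to your own LLM provider (assumption)."""
    raise NotImplementedError

def collect_estimates(prompt: str, n_subjects: int = 500) -> list[float]:
    """Ask many simulated subjects the same estimation question and parse a number from each reply."""
    estimates = []
    for _ in range(n_subjects):
        reply = query_model(prompt)
        match = re.search(r"-?\d+(\.\d+)?", reply)
        if match:
            estimates.append(float(match.group()))
    return estimates

def looks_hyper_accurate(estimates: list[float], min_cv: float = 0.02) -> bool:
    """Flag a simulated crowd whose spread is implausibly small (possible hyper-accuracy)."""
    if len(estimates) < 2:
        return False
    mean = statistics.mean(estimates)
    if mean == 0:
        return False
    return statistics.stdev(estimates) / abs(mean) < min_cv
```

If the flag trips, the aggregated “crowd” answer is probably reflecting the model’s factual recall rather than a distribution of human-like guesses.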
📄 Prompts
Prompt Explanation:
Prompt used to simulate a subject's grammaticality judgment in the Garden Path Sentences TE.
Ms. Olson was asked to indicate whether the following sentence was grammatical or ungrammatical.
Sentence: While the student read the notes that were long and boring blew off the desk.
Answer: Ms. Olson indicated that the sentence was
Prompt Explanation:
Prompt used to simulate a subject's decision in the Ultimatum Game TE.
In the following scenario, Ms. Huang had to decide whether to accept or reject the proposal.
Scenario: Mr. Wagner is given $10. Mr. Wagner will propose how to split the money between himself and Ms. Huang. Then Ms. Huang will decide whether to accept or reject Mr. Wagner’s proposal. If Ms. Huang accepts, then Mr. Wagner and Ms. Huang get the money as they agreed to split. If Ms. Huang rejects, then Mr. Wagner and Ms. Huang both receive nothing. Mr. Wagner takes $6 for himself and offers Ms. Huang $4.
Answer: Ms. Huang decides to
Prompt Explanation:
Free-response prompt used in the Milgram Shock Experiment TE to simulate subject behavior.
The current punishment voltage level is slight shock, 45 volts. The learner selects a wrong answer.
Mr. Wagner
Prompt Explanation:
2-choice prompt used in the Milgram Shock Experiment TE to classify subject action.
The experimenter observed Mr. Wagner’s action and noted whether Mr. Wagner did shock or did not shock the victim.
Action: Mr. Wagner shocks the learner with a slight shock of 45 volts and moves on to the next question.
Experimenter’s note: Mr. Wagner did
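The two Milgram prompts above can be chained: the free-response prompt lets the model write the subject's action, and the 2-choice prompt classifies that action. Below is a hedged Python sketch of that flow; query_model is a placeholder for your own LLM call, and the prompt strings paraphrase the templates shown above.

```python
def query_model(prompt: str) -> str:
    """Placeholder; replace with a call to your own LLM provider (assumption)."""
    raise NotImplementedError

def simulate_milgram_step(subject: str, voltage_label: str, volts: int) -> bool:
    """Return True if the simulated subject administers the shock at this voltage level."""
    # Stage 1: free-response prompt, completed by the model as the subject's action.
    free_prompt = (
        f"The current punishment voltage level is {voltage_label}, {volts} volts. "
        f"The learner selects a wrong answer.\n{subject}"
    )
    action = query_model(free_prompt).strip()

    # Stage 2: 2-choice prompt that classifies the generated action as shock / no shock.
    classify_prompt = (
        f"The experimenter observed {subject}'s action and noted whether {subject} "
        f"did shock or did not shock the victim.\n"
        f"Action: {subject} {action}\n"
        f"Experimenter's note: {subject} did"
    )
    verdict = query_model(classify_prompt).strip().lower()
    return not verdict.startswith("not")
```

Looping this over increasing voltage levels and many simulated subjects yields a compliance rate per level, the kind of quantity an obedience replication would compare against the original study.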
⏰ When is this relevant?
A national grocery chain is considering whether to introduce an in-store “smart cart” that scans items as customers shop, shows running totals, and suggests recipes or deals. Before building a prototype, they want to understand how different customer types might respond to the idea, what benefits or concerns they mention, and whether it influences store choice or basket size.
🔢 Follow the Instructions:
1. Define key shopper personas:
Choose 3–4 typical grocery shopper segments. Example personas:
◦ Tech-savvy urban shopper: Age 32, lives in city, shops for 1–2 people, likes automation, uses digital coupons.
◦ Older traditional shopper: Age 64, suburban, shops weekly, prefers familiar experiences, avoids self-checkout.
◦ Budget-focused parent: Age 40, two kids, always looking for deals, shops in-store for bulk and discounts.
◦ Time-pressed professional: Age 35, single, values speed and convenience, often shops after work.
2. Create prompt template to simulate feedback:
Use this structure to prompt the AI persona:
You are a grocery shopper described as: [insert persona description].
The store is considering introducing a “smart cart” that automatically scans items as you shop, shows your running total, offers personalized deals, and suggests recipes based on what you’ve picked.
You are being interviewed by a retail innovation researcher.
QUESTION: What’s your honest first impression of this idea? What sounds useful or not useful about it?
3. Run the simulation:
Input each persona’s prompt into the language model (like GPT-4). Generate 5–10 variations per persona by slightly changing the question phrasing (e.g., “Would this change how you shop?” or “How would this compare to your current experience?”). An end-to-end sketch of steps 2–5 appears after these steps.
4. Add follow-up prompts:
Choose 1–2 follow-ups based on initial responses. Example follow-ups:
◦ “Would this make you more likely to shop at this store instead of competitors?”
◦ “What would make you trust or reject this cart system?”
5. Tag responses with themes:
Manually or automatically tag answers with themes like “convenience,” “privacy concern,” “budget tracking,” “tech anxiety,” or “interest in recipes.” Note positive or negative sentiment and specific phrases.
6. Compare reactions across segments:
Summarize what each persona segment valued, disliked, or hesitated about. Look for common objections (e.g., privacy, tech complexity) and opportunities (e.g., budget visibility, faster checkout).
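For reference, here is a compact end-to-end Python sketch of steps 2 through 5. It assumes a placeholder query_model wired to your own LLM provider, and the persona descriptions, question variations, and keyword-to-theme mapping are illustrative defaults you would tune for your own study.

```python
from collections import Counter

def query_model(prompt: str) -> str:
    """Placeholder; replace with a call to your own LLM provider (assumption)."""
    raise NotImplementedError

# Step 1: persona descriptions (condensed from the examples above).
PERSONAS = {
    "tech-savvy urban shopper": "Age 32, lives in city, shops for 1-2 people, likes automation, uses digital coupons.",
    "older traditional shopper": "Age 64, suburban, shops weekly, prefers familiar experiences, avoids self-checkout.",
    "budget-focused parent": "Age 40, two kids, always looking for deals, shops in-store for bulk and discounts.",
    "time-pressed professional": "Age 35, single, values speed and convenience, often shops after work.",
}

# Step 3: question phrasings (slight variations of the interview question).
QUESTIONS = [
    "What's your honest first impression of this idea? What sounds useful or not useful about it?",
    "Would this change how you shop?",
    "How would this compare to your current experience?",
]

# Step 5: illustrative keyword-to-theme mapping for simple tagging.
THEME_KEYWORDS = {
    "convenience": ["faster", "easy", "convenient", "save time"],
    "privacy concern": ["privacy", "tracking", "data"],
    "budget tracking": ["running total", "budget", "spend"],
    "tech anxiety": ["complicated", "confusing", "learn"],
}

def build_prompt(persona: str, question: str) -> str:
    """Step 2: fill the interview prompt template for one persona and question."""
    return (
        f"You are a grocery shopper described as: {persona}\n"
        "The store is considering introducing a \"smart cart\" that automatically scans items "
        "as you shop, shows your running total, offers personalized deals, and suggests recipes "
        "based on what you've picked.\n"
        "You are being interviewed by a retail innovation researcher.\n"
        f"QUESTION: {question}"
    )

def run_study(samples_per_question: int = 5) -> dict[str, Counter]:
    """Steps 3-5: query the model per persona and tally themes found in the answers."""
    themes_by_persona: dict[str, Counter] = {}
    for name, description in PERSONAS.items():
        counts: Counter = Counter()
        for question in QUESTIONS:
            for _ in range(samples_per_question):
                answer = query_model(build_prompt(description, question)).lower()
                for theme, keywords in THEME_KEYWORDS.items():
                    if any(kw in answer for kw in keywords):
                        counts[theme] += 1
        themes_by_persona[name] = counts
    return themes_by_persona
```

Step 6's cross-segment comparison is then a matter of summarizing themes_by_persona; swapping the keyword tagger for an LLM-based classifier is a natural refinement.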
🤔 What should I expect?
You’ll get a segmented view of how different shoppers might react to a smart cart feature—who finds it exciting vs. intimidating, what benefits resonate (e.g., time savings, budgeting), and which concerns (e.g., learning curve, privacy) could block adoption. This allows the business to prioritize features, target messaging, and make an early go/no-go decision before investing in hardware.