Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare
Yonchanok Khaokaew, Flora Salim, Andreas Züfle, Hao Xue, Taylor Anderson, C. Raina MacIntyre, Matthew Scotch, David Heslop
Published: 2023-04-17

🔥 Key Takeaway:
Adding detailed demographic or “human-like” examples in a prompt tends to make the model latch onto and amplify stereotypes—flattening out genuine differences between groups. By contrast, using a simple, neutral (zero-shot) prompt preserves real-world variation across demographics much more faithfully.
đź”® TLDR
This paper compared how well four open-source large language models (LLMs)—Llama3, Gemma2, Ministral, and Galactica—simulate real-world healthcare decision-making by matching their responses to a major US survey on COVID-19 vaccine hesitancy, using demographic-based prompts to create digital twins of survey respondents. Key findings: Llama3 tracked early vaccine acceptance trends closely but underestimated later hesitancy, while Galactica and Ministral consistently overestimated skepticism, likely due to their training data; all models showed significant disparities in how they treated demographic groups, with Gemma2 and Ministral especially exaggerating differences by income and education (e.g., Disparate Impact Ratios for income as low as 0.236 for Gemma2). Most models flattened or misrepresented group-level differences, often failing to capture the real-world variation between races or income levels, which could lead to misleadingly neutral or inaccurate synthetic data. The authors recommend bias-aware evaluation, prompt refinement, and improving demographic representation in model training to better align LLM-driven simulations with observed human behavior, and caution that current LLMs can both under- and over-represent real disparities depending on model and prompt design.
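For readers unfamiliar with the Disparate Impact Ratio cited above, it is conventionally computed as the ratio of favorable-outcome rates between an unprivileged and a privileged group, with values near 1.0 indicating parity and values below roughly 0.8 commonly flagged as disparate impact. The sketch below follows that standard definition; the group labels and rates are hypothetical and are not the paper's data.

```python
def disparate_impact_ratio(rate_unprivileged: float, rate_privileged: float) -> float:
    """Ratio of favorable-outcome rates between groups; ~1.0 means parity,
    values well below 0.8 are a common red flag for disparate impact."""
    return rate_unprivileged / rate_privileged

# Hypothetical example: share of simulated personas answering "Yes" to vaccination,
# split by income group (numbers are illustrative, not taken from the paper).
low_income_yes_rate = 0.20
high_income_yes_rate = 0.70
print(disparate_impact_ratio(low_income_yes_rate, high_income_yes_rate))  # ~0.286
```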
📊 Cool Story, Needs a Graph
Figure 2: "Comparison of survey and LLMs decision outputs 4 different situations"

Side-by-side comparison of vaccine hesitancy predictions from all LLMs and the real survey across four pandemic phases.
Figure 2 presents a grouped bar chart where the vaccine hesitancy rates (percentage who would not get vaccinated) are shown for the Understanding America Study (survey baseline) and for each of the four LLMs (Llama3, Ministral, Galactica, Gemma2) across four key periods of the 2020 pandemic. This layout enables immediate visual assessment of which models most closely track real-world trends and where they diverge, making it easy to benchmark the proposed synthetic approach against all baselines in a single view and spot systematic under- or over-estimation patterns for each model.
⚔️ The Operator's Edge
A subtle but crucial detail in this study is the choice to use "majority voting over three generations per prompt" to reduce the response variance of each AI persona (see methods section, page 3). This means that for every scenario and persona, the model generates three separate answers, and the most common answer is taken as the simulated response. This simple ensemble step helps smooth out the randomness and instability that can occur with large language models—especially in cases where prompts are sensitive or outputs are inconsistent.
Why it matters: Many experts focus on prompt design or model selection as the key factors in synthetic research quality, but this majority-voting technique directly tackles the LLM's tendency to give different answers to the same prompt due to sampling variability. By taking the most common of several answers rather than a single draw, it yields results that are more stable and more representative of the persona's intended decision, making the simulation more robust and less prone to outlier or spurious outputs.
Example of use: Imagine a team running synthetic A/B testing for a new product landing page. For each AI persona, instead of relying on a single response to "Would you sign up for this product?", they generate three answers and use the majority view. This ensures that if the LLM "wobbles" on the fence, the dominant signal is preserved, reducing the risk of overreacting to one-off outputs.
Example of misapplication: A team skips majority voting and uses only one response per persona per scenario. For a controversial or ambiguous prompt ("Would you trust this new fintech service?"), the model's randomness could lead to highly volatile and contradictory results—one run says yes, another says no, a third says maybe—making it hard to tell if the AI audience is truly divided or just inconsistent. Without this smoothing step, decision-makers might read too much into noisy synthetic data, drawing the wrong conclusions about customer preferences or risk appetite.
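A minimal sketch of this majority-voting step, assuming a generic `ask_llm` callable that wraps whichever model API you use and returns a short "Yes"/"No" string; the three-generation default mirrors the paper's description, but the code itself is illustrative.

```python
from collections import Counter

def majority_vote_answer(ask_llm, prompt: str, n_generations: int = 3) -> str:
    """Query the model several times with the same persona prompt and keep the
    most common answer, smoothing out sampling variability between generations."""
    answers = [ask_llm(prompt).strip().lower() for _ in range(n_generations)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

# Usage: majority_vote_answer(my_client_call, persona_prompt), where my_client_call
# is whatever function sends a prompt to your LLM endpoint and returns its reply.
```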
🗺️ What are the Implications?
• Be careful with few-shot prompts for demographic questions: The study found that using few-shot prompting—where the AI is shown several example answers before responding—often leads the model to repeat or exaggerate those examples, especially when sensitive fields like race or income are present, which can distort your results.
• Use zero-shot prompts with clear context for more reliable outputs: Simply providing a well-structured prompt with the relevant demographic and scenario information (but no examples) produced more balanced and consistent responses from simulated personas; a short illustration of the two prompt styles follows this list.
• Always test and compare several prompt styles before running a full study: Since prompt design had a big impact on response quality and bias, it's worth piloting different prompt types (few-shot, zero-shot, with/without explicit context) on a small scale to see which best matches real-world trends for your specific topic.
• Check your simulated results against real survey data if possible: Comparing outputs from your AI personas to real survey benchmarks helps catch problems with prompt-induced bias or unrealistic uniformity across demographic groups, improving the credibility of your findings.
• Understand that prompt sensitivity is a moving target: What works for one question type or demographic may not generalize, so regularly revisit and update your prompt strategy as business questions or simulation goals evolve.
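To make the zero-shot versus few-shot distinction concrete, here is a small illustrative pair of prompt builders; the wording and example format are hypothetical and not taken from the paper.

```python
def zero_shot_prompt(persona: str, question: str) -> str:
    """Persona context plus the question only; no example answers for the model to imitate."""
    return f"{persona}\nPlease answer the following question as this persona:\n{question}"

def few_shot_prompt(persona: str, question: str, examples: list[tuple[str, str]]) -> str:
    """Same prompt prefixed with worked examples; the paper found this style can push
    the model to copy or exaggerate demographic patterns implied by the examples."""
    shots = "\n\n".join(f"Persona: {p}\nAnswer: {a}" for p, a in examples)
    return f"{shots}\n\n{zero_shot_prompt(persona, question)}"
```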
đź“„ Prompts
Prompt Explanation: The AI was given a structured prompt template to simulate a persona’s healthcare decision-making by combining contextual (pandemic phase) and demographic information, then asked to answer a vaccination intent question in the voice of that persona, providing both a Yes/No answer and a short rationale.
Imagine yourself in the following situation: [SITU PROMPT]. Your background and personal circumstances are as follows: [You are a AGE-year-old GENDER of RACE ethnicity, living in a diverse country with varying access to healthcare, differing levels of trust in government and medical institutions, and socioeconomic disparities. Your annual Income is INCOME. Your education level is EDU_LEVEL. Over the past two weeks, you have been worrying about your health WORRY_LEVEL]. Please use this persona to answer the question below:
‘How likely are you to get vaccinated for coronavirus once a vaccination is available to the public?’
In this context, please answer based on your persona. Answer: [Yes/ No] Short reason: [FILL IN] based on your persona
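A minimal sketch of filling this template from a respondent record, assuming the survey data arrives as a simple dictionary; the field names and the example values are illustrative, and the [SITU PROMPT] slot carries the pandemic-phase context described in the paper.

```python
PROMPT_TEMPLATE = (
    "Imagine yourself in the following situation: {situation}. "
    "Your background and personal circumstances are as follows: "
    "[You are a {age}-year-old {gender} of {race} ethnicity, living in a diverse country "
    "with varying access to healthcare, differing levels of trust in government and medical "
    "institutions, and socioeconomic disparities. Your annual Income is {income}. "
    "Your education level is {education}. Over the past two weeks, you have been worrying "
    "about your health {worry_level}]. Please use this persona to answer the question below:\n"
    "'How likely are you to get vaccinated for coronavirus once a vaccination is available "
    "to the public?'\n"
    "In this context, please answer based on your persona. "
    "Answer: [Yes/No] Short reason: [FILL IN] based on your persona"
)

# Hypothetical respondent record; field names and values are illustrative.
respondent = {
    "situation": "It is early in the pandemic and a COVID-19 vaccine has just been announced",
    "age": 45, "gender": "woman", "race": "Hispanic",
    "income": "$40,000-$59,999", "education": "some college",
    "worry_level": "somewhat often",
}
prompt = PROMPT_TEMPLATE.format(**respondent)
```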
⏰ When is this relevant?
A national grocery chain wants to understand how shoppers from different backgrounds would respond to a new loyalty app with digital coupons, personalized offers, and a points-based reward system. They want to compare reactions from three key audience segments: tech-savvy young adults, middle-income families, and seniors who prefer traditional shopping.
🔢 Follow the Instructions:
1. Define the audience segments: Create three AI persona profiles with realistic details. For example:
• Tech-savvy young adult: 26, urban, uses smartphone for most purchases, values convenience and personalization.
• Middle-income parent: 42, suburban, two kids, shops for bargains, mixes online and in-store shopping.
• Senior shopper: 68, retired, small town, prefers paper coupons, less comfortable with technology.
2. Prepare the prompt template: Use a direct and context-rich prompt for consistency. For each persona, fill in the details:
Imagine yourself in the following situation: You are shopping for groceries at your usual store. The store is launching a new loyalty app that gives digital coupons, personalized weekly offers, and lets you earn points for every dollar spent, redeemable for discounts.
Your background and personal circumstances are as follows: You are a [AGE]-year-old [persona description].
Please answer the following question as this persona:
What is your first reaction to this new loyalty app? How likely are you to use it, and why?
3. Run the persona prompts through the AI model: For each persona, submit the above prompt and generate multiple (e.g., 5–10) responses to capture a range of realistic reactions (see the sketch after this list).
4. Probe for reasons and barriers: Ask a follow-up question for each response:
What would make you more (or less) likely to try or keep using this app? Is there anything that would stop you from signing up?
5. Group and code the responses: Review the answers and tag common themes such as "enthusiastic about digital rewards," "concerned about data privacy," "prefers paper coupons," or "finds app confusing."
6. Compare across segments: Summarize which persona groups are most receptive to the app, what features drive interest or resistance, and where the biggest gaps or objections appear.
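A compact sketch tying steps 1-5 together for the loyalty-app scenario. The `ask_llm` argument is a stand-in for whichever model client you use, and the persona descriptions and theme keywords are illustrative starting points you would refine by hand.

```python
from collections import Counter

PERSONAS = {
    "tech-savvy young adult": "26-year-old urban shopper who uses a smartphone for most "
                              "purchases and values convenience and personalization",
    "middle-income parent": "42-year-old suburban parent of two who shops for bargains and "
                            "mixes online and in-store shopping",
    "senior shopper": "68-year-old retiree in a small town who prefers paper coupons and is "
                      "less comfortable with technology",
}

SITUATION = ("You are shopping for groceries at your usual store. The store is launching a new "
             "loyalty app that gives digital coupons, personalized weekly offers, and lets you "
             "earn points for every dollar spent, redeemable for discounts.")
QUESTION = ("What is your first reaction to this new loyalty app? "
            "How likely are you to use it, and why?")

# Crude keyword tags for step 5; in practice, review the responses and refine these themes.
THEMES = {
    "enthusiastic about digital rewards": ["points", "reward", "convenient"],
    "concerned about data privacy": ["privacy", "data", "tracking"],
    "prefers paper coupons": ["paper", "rather not use the app"],
}

def run_segment_study(ask_llm, n_responses: int = 5) -> dict:
    """ask_llm is whatever callable wraps your model API and returns a text reply."""
    results = {}
    for segment, description in PERSONAS.items():
        prompt = (f"Imagine yourself in the following situation: {SITUATION}\n"
                  f"Your background and personal circumstances are as follows: "
                  f"You are a {description}.\n"
                  f"Please answer the following question as this persona:\n{QUESTION}")
        responses = [ask_llm(prompt) for _ in range(n_responses)]
        results[segment] = Counter(
            theme
            for reply in responses
            for theme, keywords in THEMES.items()
            if any(kw in reply.lower() for kw in keywords)
        )
    return results  # per-segment theme counts, ready for the cross-segment comparison in step 6
```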
🤔 What should I expect?
You'll get a clear sense of which customer types are eager to use the new app, which ones need more support or education, and what messaging or feature tweaks could boost adoption across each target group—all without running time-consuming human interviews or surveys.