Overview
Large Language Models (LLMs) are increasingly used to role-play as survey respondents or virtual participants in social science research. By conditioning an LLM on a hypothetical person’s profile (e.g. age, gender, race, ideology), researchers can generate synthetic answers to attitudinal or behavioral questions that aim to mirror real demographic groups (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy). Recent studies have shown that LLMs can indeed mimic human-like response patterns in aggregate, for example producing group-level survey results that reasonably approximate actual opinion polls (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy). However, not all demographic groups are represented equally well. Certain populations consistently show larger gaps between LLM-simulated responses and real-world data, raising concerns about bias and validity in these virtual simulations ([2303.17548] Whose Opinions Do Language Models Reflect?). This report examines which demographics are least accurately represented by LLM role-play, why these gaps occur (from training data biases to alignment filters), and how researchers can design more realistic and inclusive virtual “audiences.” We draw on peer-reviewed findings and experimental results to highlight both pitfalls and best practices in using LLMs for open-ended survey-style simulations.
Underrepresented Demographics in LLM Simulations
Multiple empirical evaluations have identified specific demographic subgroups that LLMs struggle to emulate accurately. In these cases, the model’s generated opinions or behaviors deviate markedly from real survey data for those groups. Key findings include:
- Older Adults: LLMs often misrepresent the attitudes of older people. For example, one comprehensive study of 60 U.S. demographic segments found that opinions of 65+ individuals were poorly reflected by current LLMs ([2303.17548] Whose Opinions Do Language Models Reflect?). Even when explicitly prompted to adopt an elderly persona, the model’s responses remained misaligned with actual seniors’ views. This aligns with another report noting that model accuracy declines for older age groups, especially in non-US contexts ([2501.15351] Fairness in LLM-Generated Surveys). In Chilean data, accuracy dropped significantly for seniors, indicating underrepresentation of their perspectives ([2501.15351] Fairness in LLM-Generated Surveys).
- Women (in Certain Contexts): Gender gaps appear in some LLM simulations. In a Chilean survey scenario, a 13B LLaMA model showed notable bias against women, with substantially lower accuracy on female respondents’ answers than on men’s ([2501.15351] Fairness in LLM-Generated Surveys). Men’s opinions were mirrored more closely than women’s, even though a baseline model (a random forest) did not show such a gap, pointing to the LLM itself introducing this bias ([2501.15351] Fairness in LLM-Generated Surveys). In the U.S., gender effects are mixed: some studies found overall performance similar for men and women ([2501.15351] Fairness in LLM-Generated Surveys), but intersectional analyses reveal that certain groups of women are underrepresented. For instance, non-white women’s responses were predicted with especially low accuracy, indicating compounding errors at the intersection of race and gender ([2501.15351] Fairness in LLM-Generated Surveys). One analysis observed that women with left-leaning political views were particularly poorly modeled by a GPT-based agent ([2501.15351] Fairness in LLM-Generated Surveys); the model struggled to capture this subgroup’s distinct combination of gender and ideology.
- Racial and Ethnic Minorities: LLMs tend to be less accurate for non-dominant racial groups. A U.S. study noted that when simulating public opinion, race significantly influenced prediction accuracy ([2501.15351] Fairness in LLM-Generated Surveys). In particular, prompts targeting non-white demographics led to larger errors, with the most pronounced bias again appearing for intersecting categories (e.g. non-white women, as noted above). A broad benchmark of role-play bias (BiasLens) also found that, among various attributes, roles defined by race or culture triggered the highest levels of biased or stereotyped responses across multiple LLMs (Benchmarking Bias in Large Language Models during Role-Playing). This suggests minority cultural groups are at high risk of misrepresentation, with models either defaulting to majority norms or overusing stereotypes.
- Lower Socioeconomic Status (Education/Income): Several experiments show degraded performance for lower-education and lower-income personas. In both the U.S. and Chile, LLM simulations were less accurate for individuals with low education levels ([2501.15351] Fairness in LLM-Generated Surveys), and the Chilean model also fared worse for lower social class groups ([2501.15351] Fairness in LLM-Generated Surveys). In the U.S., a significant drop in accuracy was observed for low-income individuals ([2501.15351] Fairness in LLM-Generated Surveys). These patterns imply that people from lower socioeconomic backgrounds, who may use different language patterns or hold views underrepresented in the training data, are not being faithfully emulated.
- Political Centrists or Atypical Combinations: Interestingly, models often mimic strong partisans better than moderates. In both Chile and the U.S., centrist or politically neutral individuals were less well represented in simulations ([2501.15351] Fairness in LLM-Generated Surveys). The LLMs were “better attuned” to those with clear left or right ideologies (likely because those opinions are more distinctly expressed in training data), whereas middle-of-the-road viewpoints were predicted erratically ([2501.15351] Fairness in LLM-Generated Surveys). Similarly, in Chile the model had difficulty with respondents who had no stated ideology, yielding unpredictable outputs ([2501.15351] Fairness in LLM-Generated Surveys). This indicates a bias in which nuanced or less polarized positions get lost. Furthermore, combinations of traits that defy stereotype can confuse the model: for example, a religious person with left-wing views proved a tough case, showing significantly lower prediction accuracy in one analysis ([2501.15351] Fairness in LLM-Generated Surveys). Real humans often hold cross-cutting beliefs (such as socially conservative but economically liberal mixes), but LLM personas may default to more homogeneous belief sets, failing to capture these nuanced subgroups.
- Religious Minorities: Context-specific findings suggest highly religious groups can be misrepresented. In Chilean data, having a religious affiliation was associated with lower simulation accuracy ([2501.15351] Fairness in LLM-Generated Surveys). The worst performance of all was for an intersectional group: older, low-educated religious women, whom the model almost entirely failed to simulate accurately ([2501.15351] Fairness in LLM-Generated Surveys). This points to compounded underrepresentation: each of those traits (female, elderly, religious, low education) added to the error. While not all studies examined religion, this case flags devout or minority-religion individuals as a potential high-risk category.
- Non-U.S. Cultural Groups: A consistent theme is that LLMs are biased toward Western, especially American, data. An extensive comparison of U.S. and Chilean survey simulations found the model performed much better on U.S. responses, even after controlling for question difficulty ([2501.15351] Fairness in LLM-Generated Surveys). The authors traced this to the US-centric pre-training corpus, which left the model less grounded in Chile’s cultural context ([2501.15351] Fairness in LLM-Generated Surveys). In general, regions and cultures underrepresented in the training data (many non-English-speaking or Global South populations) see larger gaps between real and simulated opinions. For instance, one study observed that ChatGPT’s accuracy and similarity scores were consistently higher for U.S. datasets than for Chilean ones, underscoring the need for caution when simulating publics outside the model’s primary domain ([2501.15351] Fairness in LLM-Generated Surveys).
Checklist: High-Risk Groups for Misrepresentation
Based on the above findings, researchers should be especially careful (and validate results) when simulating the following demographic groups, as LLM role-play may underrepresent or distort their true attitudes:
- Older adults (especially 65+) – Often poorly reflected in LLM outputs ([2303.17548] Whose Opinions Do Language Models Reflect?) (Assessing Political Bias in Language Models | Stanford HAI).
- Women in certain contexts – Particularly women of color or women with non-mainstream ideologies ([2501.15351] Fairness in LLM-Generated Surveys).
- Racial/ethnic minorities – Non-white groups, immigrants, and culturally distinct minorities show higher simulation errors (Benchmarking Bias in Large Language Models during Role-Playing) ([2501.15351] Fairness in LLM-Generated Surveys).
- Low-education or low-income individuals – LLMs tend to be less accurate for these groups’ responses ([2501.15351] Fairness in LLM-Generated Surveys).
- Highly religious individuals – Especially in largely secular datasets, religious respondents’ views may be mis-modeled ([2501.15351] Fairness in LLM-Generated Surveys).
- Ideological centrists or mixed-belief personas – Middle-of-the-spectrum views and atypical belief combinations are often oversimplified by the model ([2501.15351] Fairness in LLM-Generated Surveys).
- Non-Western populations – Groups outside the model’s training focus (e.g. non-English-speaking communities) are at risk due to cultural knowledge gaps ([2501.15351] Fairness in LLM-Generated Surveys) (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy).
- Miscellaneous under-sampled groups – e.g. widowed individuals (a proxy for an older, often female subgroup) and certain religious sects (one study noted Mormons in the U.S. were underrepresented) (Assessing Political Bias in Language Models | Stanford HAI). Any demographic slice with sparse data in training corpora or surveys should be treated with caution.
By recognizing these high-risk categories, analysts can double-check LLM-generated data (or avoid over-reliance on it) for these groups. In many cases, the “majority” viewpoint of affluent, younger, secular, English-speaking users is best captured, whereas marginalized voices deviate more. This imbalance stems from multiple factors explored next.
Why Do These Gaps Occur?
Several intertwined reasons explain why LLM-based simulations struggle with certain demographics:
- Training Data Imbalances: LLMs learn from vast text corpora that are not demographically neutral. Online content skews toward certain communities, predominantly younger, English-speaking, Western, and male voices, so models are far better versed in the vernacular and viewpoints of those dominant groups. The “Fairness in LLM-Generated Surveys” study explicitly found that the model’s superior performance on U.S. data “originates from the U.S.-centric training data”, which gave it deep familiarity with American opinions but left it less effective at interpreting Chilean respondents ([2501.15351] Fairness in LLM-Generated Surveys). Likewise, the underrepresentation of older adults or the rural poor can be traced to their lower presence in internet text (and even in some survey datasets, since certain groups respond less to online polls ([2501.15351] Fairness in LLM-Generated Surveys)). In short, if a demographic’s voice is rare or filtered out in the training pipeline, the model fills the gaps with generalizations or stereotypes. This can yield, for example, an elderly persona that talks more like a middle-aged internet user than a real 80-year-old. Continuous model updates would be needed to capture evolving and diverse population segments (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy), but such updates lag behind reality.
- Alignment and RLHF Biases: Beyond pre-training data, the fine-tuning and alignment process (instruction tuning with human feedback) introduces its own bias. Human annotators and safety guidelines tend to favor certain normative behaviors and moderate views, so reinforcement learning from human feedback (RLHF) tends to “blunt” the model’s depiction of extreme or controversial opinions. Researchers have observed that LLMs fine-tuned with human feedback often reflect liberal, well-educated demographics, regardless of the persona they are asked to play. In other words, an RLHF-tuned model like ChatGPT has a built-in inclination toward progressive, educated stances (likely mirroring the values of the annotators or the intended “polite” assistant persona). One paper noted these models “adopt progressive stances regardless of the demographic background they role-play”, which may enhance safety but “limits their utility as models of human communicative dynamics”. For example, if prompted to simulate an outspoken conservative or a member of a hate group, a heavily aligned model might sanitize the response or refuse outright, due to content filters against hate speech. Even in less extreme cases, alignment tuning can erase nuance: a recent evaluation found newer RLHF-tuned models showed over 99% approval for a certain political figure (U.S. President Joe Biden), despite real public opinion being deeply divided (Assessing Political Bias in Language Models | Stanford HAI). This suggests the model learned to favor a socially desirable answer (near-unanimous approval) far removed from actual sentiment. The same study identified that groups like older adults, widows, and Mormons were effectively “invisible” to the model’s aligned persona (Assessing Political Bias in Language Models | Stanford HAI). Thus, alignment can skew the aggregate “persona” of the model toward specific values and suppress variance, contributing to misalignment for demographics who hold disfavored or less familiar views.
- Biases in Modeling Interactions: LLMs do capture many complex correlations from data, but they may struggle with intersectional and conditional combinations of traits. Social identities are not independent: the experience of a young Black woman in a city differs from that of an older Black woman in a rural area. Models often oversimplify by treating each attribute in isolation. If not carefully prompted, an LLM might simulate a “50-year-old working-class Hispanic woman” by gluing together generic features of “50-year-old,” “working class,” “Hispanic,” and “woman,” without grasping how these factors intersect in real life. This can produce stereotyped or internally inconsistent personas. The logistic regression analysis in one study highlighted that even if a model seemed fair on single dimensions (gender or race alone), it exhibited bias on combined dimensions; for example, being both non-white and female had a uniquely negative effect on prediction accuracy that was not apparent when looking at gender or race separately ([2501.15351] Fairness in LLM-Generated Surveys) (a minimal interaction-term check is sketched after this list). Real human demographics involve many such higher-order interactions (education with age with region, and so on), and if the training data did not adequately cover those joint distributions, the model’s responses will be off. Essentially, LLMs lack a true causal or lived-experience understanding of demographics, so they may conflate independent traits. This can manifest as “one-size-fits-all” answers for minority subgroups, e.g. assuming all rural religious individuals think alike and failing to reflect internal diversity. Without special handling, nuances like generational differences within a racial group or urban–rural splits can be lost, leading to hallucinated homogeneity.
- Content Filtration and Self-Censorship: Related to alignment, the safety filters that prevent toxic or sensitive outputs can also strip away some authentic expressiveness for certain groups. Marginalized and politically extreme groups sometimes express themselves with strong language, passionate intensity, or references to taboo experiences. If an LLM’s safety layer aggressively blocks or dulls any output containing negativity, profanity, or politically charged rhetoric, then a simulated persona from those groups will sound unnaturally restrained. For instance, a simulation of a disenfranchised minority youth talking about discrimination might be toned down if the model avoids slurs or graphic details, whereas in reality such a person’s testimonial could be raw and unfiltered. While no one advocates producing harmful content wholesale, research contexts have found that relaxing content filters can increase realism. Anecdotally, when using uncensored LLM variants (or instructing the model to stay “in character” even if the dialogue is heated), the responses become more emotionally expressive and candid, closer to how real humans might react in passionate discussions. This comes with the obvious trade-off of potentially generating offensive remarks. Indeed, the BiasLens benchmark uncovered tens of thousands of biased or offensive outputs when role-playing 550 diverse social roles, indicating that the underlying model will produce such content if not reined in (Benchmarking Bias in Large Language Models during Role-Playing). Many of those outputs were likely models parroting stereotypes or hateful views when asked to assume roles, which in one sense is “realistic” (if those roles truly hold such views) but ethically problematic. Thus, content filters are a double-edged sword: necessary for safety and bias mitigation, but possibly preventing an LLM from fully inhabiting a persona that holds very unsavory beliefs. This tension contributes to gaps; a neo-Nazi character simulation, for example, might come out sounding implausibly mild or may trigger refusals due to the model’s anti-hate alignment, failing to reflect how an actual member of that group would speak. Researchers must navigate this carefully when the goal is realism.
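The intersectional gaps described above can be probed directly by regressing per-answer accuracy on demographic attributes with interaction terms, in the spirit of the logistic-regression analysis cited in the “Biases in Modeling Interactions” item. Below is a minimal sketch; the DataFrame, its column names (correct, gender, race, age_group, education), and the file path are hypothetical placeholders, not artifacts from any cited study.

```python
# Sketch: does simulation accuracy drop at the *intersection* of attributes,
# beyond what each attribute explains on its own?
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file: one row per (simulated respondent, question), with a 0/1
# flag marking whether the simulated answer matched the real survey answer.
df = pd.read_csv("simulation_accuracy.csv")

# Main effects plus a gender x race interaction; other attributes as controls.
model = smf.logit(
    "correct ~ C(gender) * C(race) + C(age_group) + C(education)",
    data=df,
).fit()
print(model.summary())

# A significantly negative interaction coefficient (e.g. female x non-white)
# flags a subgroup the model handles worse than its single-attribute terms
# would predict, mirroring the intersectional bias pattern discussed above.
```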
In summary, insufficient or skewed training data gives the model an incomplete picture of certain demographics, and alignment tuning further pushes the model toward a generic, often liberal and educated, persona. Meanwhile, the model’s internal representation might not entangle demographic factors in the nuanced ways humans do, and strict safety filters can clip the peaks and troughs of expression for controversial groups. All these factors combined lead to the underrepresentation and inaccuracies discussed. Recognizing these root causes allows us to consider interventions to improve demographic realism in LLM simulations.
Alignment vs. Realism: Effects of Modifying Censorship Filters
A crucial area of inquiry is how an LLM’s alignment (safety/censorship mechanisms) affects the realism and expressiveness of its role-played personas – especially for marginalized or politically extreme groups. Several experiments have explored what happens if those safeguards are loosened or if the model is prompted in ways to sidestep them:
- “Base” Pretrained Models vs. RLHF-tuned Models: One way to gauge alignment effects is to compare an unaligned base model (pre-trained only on raw internet text) with a fine-tuned, instruction-aligned model on the same role-playing task. Santurkar et al. (2023) did exactly this in analyzing political opinion alignment. They found a striking difference: models trained on internet data alone leaned toward the views of less-educated, lower-income, and conservative groups, whereas the RLHF-refined models leaned toward more liberal, higher-educated groups (Assessing Political Bias in Language Models | Stanford HAI). Neither is an exact match to the real population, but the contrast is telling: the base model presumably mirrors the bias of online content (which often amplifies certain fringe or populist views), while the aligned model reflects the bias of human moderators (skewing more progressive). In practical terms, using a base (uncensored) model may yield harsher or more polarized responses that, for some demographics, might actually be closer to reality. For example, a base model might freely generate conspiracy-laden or derogatory statements when asked to role-play a member of an extremist group, whereas ChatGPT might refuse or water down those statements. If one’s research specifically needs an uncensored portrayal of a radical or marginalized perspective, the base model can be more expressive (albeit at the cost of safety). On the other hand, the base model might over-index on internet biases; for instance, it may oversample toxic viewpoints that are not actually prevalent in the broader group. This is why Santurkar’s work emphasizes evaluating representativeness (how well model output aligns with actual survey distributions) and steerability (how well we can get the model to adopt a given subgroup’s stance) (Assessing Political Bias in Language Models | Stanford HAI); a minimal distribution-comparison sketch follows this list. Alignment greatly influences both properties.
- Experimenting with Reduced Filtering: Some researchers have tested models in an “uncensored” mode to see whether that increases persona fidelity. One anecdote from Chuang et al. (2024) describes how role-playing a partisan respondent led ChatGPT to give an incorrect but biased answer that humans of that party commonly gave. Specifically, when asked a factual question (“What was the US unemployment rate when Obama left office?”), ChatGPT normally gives the correct figure, but when told “Answer as a typical Republican”, it returned an inflated number consistent with partisan misinformation. This example illustrates that even an aligned model like ChatGPT can produce more realistic (albeit incorrect) outputs for a persona when prompted to set aside its truth bias and mimic how that persona might err. Essentially, the model has the knowledge of partisan patterns, but alignment usually keeps it truthful; relaxing that truth constraint (by prioritizing the role-play) made the output less factually correct but more true to character. In general, relaxing certain alignment constraints (such as always telling the truth or avoiding any group offense) allows the model to mirror the irrational or biased aspects of human responses, which matters for realism in attitudinal simulations. However, fully removing filters can also let loose undesirable content that goes beyond realism into outright policy violations (hate speech, etc.). One systematic study, “LLM Censorship: The Problem and its Limitations”, discusses how an automated censor can block model outputs and how turning it off reveals the model’s latent tendencies (though that work is more about content moderation logic than demographic realism). The key takeaway is that some degree of controlled de-alignment may be necessary to simulate politically sensitive or marginalized voices authentically. For instance, researchers have used open-source LLMs with minimal moderation to simulate focus group discussions that include angry, distrustful tones a tightly aligned model might avoid. They report that the conversations felt more genuine and captured a wider range of sentiments, but required careful post-hoc filtering to remove truly harmful language.
- Pluralistic Alignment Approaches: An emerging idea is to train models not with a single monolithic alignment (which tends to favor one set of values) but with a pluralistic or persona-dependent alignment. Instead of one safety filter for all, the model could have different modes reflecting different ethical outlooks or cultural norms. For example, one could fine-tune a model on data from a marginalized community, effectively aligning it with that group’s vernacular and values, and then use it to role-play members of that community. Early experiments in this vein (e.g. the PERSONA testbed for pluralistic alignment (PERSONA: A Reproducible Testbed for Pluralistic Alignment - arXiv)) aim to let the model shift alignment when simulating different personas. In a controlled setting, this could mean disabling or altering certain censorship rules when the context calls for it (say, to allow a model playing a comedian to use profanity or a model playing a protester to voice anger). The challenge is ensuring the model does not go rogue: it must still refuse truly dangerous content even in character. Some researchers have tried “self-alignment” methods in which the model is asked to judge whether content is appropriate given a role. While still experimental, these approaches suggest it is possible to have more expressive simulations without wholly sacrificing safety. For now, though, most findings indicate a trade-off: tighter alignment means safer, more generalized outputs, whereas looser alignment means more vivid, group-specific outputs.
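One way to make the representativeness comparison in the first item concrete is to compare the distribution of a model’s simulated answers for a subgroup against that subgroup’s real survey distribution, and to repeat the comparison for a base model and an RLHF-tuned model. A minimal sketch using total variation distance follows; the answer options and all numbers are illustrative, not values from any cited study.

```python
from collections import Counter

def answer_distribution(answers, options):
    """Empirical distribution of categorical answers over a fixed option set."""
    counts = Counter(answers)
    total = sum(counts.values()) or 1
    return {opt: counts.get(opt, 0) / total for opt in options}

def total_variation(p, q):
    """Total variation distance between two categorical distributions (0 = identical)."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

OPTIONS = ["approve", "disapprove", "unsure"]

# Real subgroup distribution from a survey (illustrative numbers only).
real = {"approve": 0.44, "disapprove": 0.47, "unsure": 0.09}

# Answers produced by a model role-playing members of that subgroup.
simulated_answers = ["approve", "approve", "disapprove", "approve", "unsure",
                     "approve", "approve", "approve", "disapprove", "approve"]
simulated = answer_distribution(simulated_answers, OPTIONS)

print("simulated:", simulated)
print("TV distance from real survey:", round(total_variation(real, simulated), 3))
# Running the same comparison for a base model and an aligned model shows
# which direction alignment shifts the simulated subgroup.
```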
In summary, modifying or disabling alignment filters can indeed make simulated responses more lifelike for certain groups: a bigoted persona might actually spew bigotry, and a conspiracy-minded persona might share wild theories, reflecting the true diversity of human attitudes. This boosts realism and captures voices that a polite aligned model would omit. However, it requires extreme caution: the generated content can reinforce harmful stereotypes or misinformation if used incautiously. Any such experiment should be done in consultation with ethics review, and likely kept in vitro (for analysis) rather than deployed. The current consensus is that for research purposes one might use less-filtered models to probe the full range of behaviors, but for practical survey augmentation it is better to incorporate explicit controls or post-processing to handle the potentially problematic outputs of these more “honest” simulations (Benchmarking Bias in Large Language Models during Role-Playing). Ultimately, the goal is to find a balance where LLMs are aligned enough to avoid egregious bias, yet not so aligned that they become an echo chamber of one demographic. The next section focuses on strategies to achieve more balanced and nuanced simulations.
Designing More Realistic and Less Biased Virtual Personas
Given the challenges above, how can we improve LLM-based role-playing for surveys and behavioral research? A number of practical design takeaways have emerged from the literature:
- Enrich Persona Prompts with Nuanced Details: Simple prompts like “Act as a 30-year-old Hispanic woman from Texas” often yield a flat stereotype. Instead, researchers have found it effective to provide richer context or backstory for the persona, including personal history, media exposure, and even contradictory traits. For example, you might prompt: “You are a 30-year-old Hispanic woman from Texas who grew up in a conservative small town but attended a liberal college. You regularly attend church and listen to progressive podcasts, so your views are a mix. Answer the survey questions in her voice.” By injecting such nuance (mixed influences, specific cultural references), the model is guided away from one-dimensional tropes. Indeed, embedding specific personal preferences or experiences directly in the prompt helps achieve “a more realistic and nuanced simulation of survey feedback,” as one framework argued (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy). This approach of dynamically creating a profile (sometimes via retrieved real profiles) has been shown to improve response accuracy over basic prompting (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy). The persona should have some internal “tension” or uniqueness; real people often hold a patchwork of beliefs, so the prompt can explicitly mention a few areas where the persona’s views deviate from what one might assume. This guards against the model falling back on uniform answers (the first sketch after this list shows one way to assemble such a prompt).
- Leverage “Belief Networks” or Anchoring Opinions: A recent experiment by Chuang et al. (2024) introduced an interesting technique: rather than relying on demographics alone, they seeded the LLM persona with one concrete belief and then observed its stances on other issues. They constructed a human-derived belief network (mapping how beliefs co-occur in survey data) and, say, told the model “this persona strongly supports raising taxes on the wealthy” (a single belief). When asked about related topics, the model then produced opinions that aligned more closely with real human patterns than when given demographics without that seed. This suggests a prompting strategy: include an example opinion or value in the prompt to anchor the persona. For instance, “You believe government should support welfare programs, and you also tend to trust traditional authorities.” This gives the model a foothold that triggers correlated beliefs (captured in its training), making the result more internally consistent and human-like. In short, do not just list identity traits; also specify a key belief or attitude as a starting point (the first sketch after this list includes such an anchor). This method greatly improved alignment with survey data in their tests.
- Few-Shot Prompting with Real Examples: Another prompt engineering strategy is to provide example responses from real individuals of the target demographic (if available) as in-context exemplars. For instance, show the model how a real 70-year-old veteran answered a question, then ask it to answer as a similar person. This few-shot approach can imbue the model with the style, level of formality, and typical concerns of that group. It has been noted that few-shot prompting can nudge LLMs to reveal preference associations already present in pre-training data (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy). By giving 2–3 demonstrations of the desired persona’s voice, you reduce the model’s reliance on broad stereotypes and instead steer it to mimic those concrete examples. Researchers caution that few-shot examples must be chosen carefully (they should be representative and not all identical) (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy). Still, this is a straightforward way to reduce bias: if you want the model not to overlook a demographic, show it data from that demographic directly in the prompt. In effect, this hijacks the model’s pattern-matching to favor the provided micro-data over its macro-statistics (the second sketch after this list combines few-shot exemplars with repeated sampling).
- Adjust Sampling and Creativity Settings: When generating open-ended responses, use temperature and other decoding parameters to induce variability that reflects human diversity. If the model is too deterministic (low temperature), it may give a bland, averaged answer for a demographic. A slightly higher temperature can produce a range of plausible opinions, some moderate, some extreme, mirroring the spread within a population. Likewise, nucleus/top-p sampling, which allows less probable word choices, can make responses more idiosyncratic; a persona might unexpectedly mention a local sports team or a recent TV show, details that are not strictly necessary but add realism. One caution: very high creativity can yield nonsensical riffs, but moderate stochasticity helps avoid every simulated person sounding like a median respondent. In virtual audience simulation, researchers sometimes generate multiple personas for the same demographic to capture different subtypes, effectively sampling the space of possible opinions. This technique acknowledges that there is no single “correct” answer for a demographic, so the model should produce a distribution of outputs. Comparing these outputs to the distribution of actual survey answers can highlight biases. If the model’s variance is too low (all its simulated 70-year-olds say nearly the same thing), that is a red flag; prompt adjustments or higher randomness may be needed to inject proper variance (see the second sketch after this list).
- Incorporate Uncertainty and Ambivalence: Real survey respondents often express uncertainty (“not sure,” “it depends”) or even contradictory sentiments in open-ended formats. Training data, however, often comes from confidently written text. To increase realism, prompts can encourage the persona to display uncertainty or conflicting thoughts when appropriate, for instance: “If you have mixed feelings, you might start by saying you’re torn on the issue.” This can counteract the model’s tendency to give a decisive answer every time. Some studies note that LLMs can simulate varied confidence levels, which allows modeling of unsure responses (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy). Including phrases like “hesitates before answering…” or explicitly instructing “Answer in a reflective, uncertain tone if the persona is conflicted” can lead the model to produce the “ums” and caveats that make a response feel more authentic. This also helps break a potential bias where the model always provides a well-structured argument (because it has been trained to do so), whereas real people might be less coherent on complex issues.
- Targeted Fine-Tuning or Retrieval Augmentation: Beyond prompting, fine-tuning an LLM on domain-specific data (if available) can dramatically improve realism for underrepresented groups (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy). For example, researchers have experimented with fine-tuning on transcripts of focus groups with minority communities; this teaches the model the speech patterns and content preferences of those communities, which then show up in role-play. Fine-tuning is costly and risks overfitting, so a lighter alternative is retrieval-augmented generation (RAG): maintain a database of real quotes or statistics about the demographic, retrieve relevant information given the question, and feed it into the prompt. One public opinion simulation used a retrieved profile of a persona (from a database of voter interviews) to ground the model’s answers (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy), and the result was more accurate than either prompting or fine-tuning alone. Essentially, factual grounding about the group (e.g. “80% of people like me feel X about this issue”) helps the model avoid hallucinating. It is an effective way to inject real-world data on the fly, ensuring the persona’s responses align with known survey trends (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy). This method can also correct model bias: if a model tends to skew too liberal, retrieving a conservative-leaning background snippet for that persona can pull its answer into line (the third sketch after this list shows a minimal retrieval-grounded prompt).
- Bias Checks and Iterative Refinement: Treat the LLM’s output as a draft to be post-processed for bias. For instance, one could programmatically check whether the answer uses stereotypes or seems too generic, and if so, adjust the prompt or add instructions to avoid that. Some frameworks (like BiasLens (Benchmarking Bias in Large Language Models during Role-Playing)) automatically flag biased wording in role-play outputs; those could be employed to refine prompts. A simple checklist for each persona’s answer might include: Does this sound too fluent or knowledgeable for the persona’s profile? Is it echoing a known stereotype? Does it ignore part of the prompt? By catching those issues, one can revise the prompt with more specificity. For example, if a model keeps making a working-class character very polite and verbose, one might add “use simple, informal language” to the prompt. Human-in-the-loop evaluation with people from the target demographic can be invaluable; they can tell you whether the voice “feels right” or what is missing. Iteratively updating the prompt or the fine-tuning dataset based on these reviews will yield a more representative final simulation (the third sketch after this list includes a crude automated check of this kind).
- Monitoring High-Risk Outputs: When simulating marginalized or sensitive personas without heavy filters, implement runtime checks on the content. This is not prompt engineering per se but a design safeguard: if the model’s response includes slurs, overt misinformation, or toxic content, one might want to either filter it out or at least flag it in analysis. For research purposes, you might allow the model to produce it but clearly label it as “biased content generated to reflect the persona’s possible view”. Having a checklist of disallowed extreme content that even role-play should not cross (e.g. direct calls for violence) is important, so that prompt instructions can include something like “While you may express anger or bias as this persona, do not explicitly encourage harm.” This preserves expressiveness without enabling the worst-case outputs. Some studies have shown that in 76% of cases, adding a role-play persona actually increased biased responses from the model (Benchmarking Bias in Large Language Models during Role-Playing); hence the need for guardrails. Effective prompt design can set those boundaries (for example, by saying “this person is prejudiced but never uses slurs in public”). It is a delicate balance between authenticity and responsibility.
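To make the prompt-enrichment and belief-anchoring advice above concrete, here is a minimal sketch of assembling a persona prompt that combines demographic details, a backstory with internal tension, one anchoring belief, and an explicit cue permitting ambivalence. The Persona fields, the example profile, and the commented-out complete() call are hypothetical illustrations, not part of any cited framework.

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    # All fields are illustrative; adapt them to the survey's sampling frame.
    age: int
    gender: str
    ethnicity: str
    location: str
    backstory: str                      # mixed influences, internal tension
    anchor_belief: str                  # one concrete opinion to seed correlated views
    quirks: list[str] = field(default_factory=list)

def build_prompt(p: Persona, question: str) -> str:
    quirks = "; ".join(p.quirks) if p.quirks else "none in particular"
    return (
        f"You are role-playing a survey respondent: a {p.age}-year-old "
        f"{p.ethnicity} {p.gender} from {p.location}.\n"
        f"Backstory: {p.backstory}\n"
        f"One firmly held view: {p.anchor_belief}\n"
        f"Personal quirks: {quirks}\n"
        "Answer in this person's own voice. If they would feel torn or unsure, "
        "say so explicitly rather than giving a tidy, decisive answer.\n\n"
        f"Survey question: {question}\n"
        "Answer:"
    )

persona = Persona(
    age=30, gender="woman", ethnicity="Hispanic", location="a small town in Texas",
    backstory=("Grew up in a conservative small town, attended a liberal college, "
               "attends church weekly and listens to progressive podcasts."),
    anchor_belief="Taxes on the wealthy should be raised to fund public schools.",
    quirks=["follows local high-school football", "distrusts national media"],
)

prompt = build_prompt(persona, "How do you feel about expanding Medicaid in your state?")
# Send `prompt` to whatever chat/completions client is in use, e.g.:
# reply = complete(prompt, temperature=0.8)   # `complete` is a placeholder call
print(prompt)
```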
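The few-shot and sampling suggestions can be combined: prepend a couple of exemplar answers in the target group's voice, then draw many completions at moderate temperature and inspect the spread rather than trusting a single response. A minimal sketch follows; generate() is a placeholder that fakes variation (a real pipeline would call the LLM with the given temperature), and the exemplar answers are invented, not real survey quotes.

```python
import random
from collections import Counter

FEW_SHOT = """Example answers from respondents aged 70+ (illustrative placeholders):

Q: Do you trust local news more than national news?
A: Mostly local. I've read the same paper for forty years and I know the reporters.

Q: Should the town spend more on public transit?
A: I don't drive at night anymore, so yes, but I worry about the property taxes.
"""

def build_prompt(question: str) -> str:
    # Few-shot exemplars first, then the new question in the same format.
    return (FEW_SHOT +
            "\nNow answer as a similar respondent, in the same plain, first-person style.\n"
            f"Q: {question}\nA:")

def generate(prompt: str, temperature: float) -> str:
    # Placeholder for a real LLM call; here we fake variation for the sketch.
    return random.choice(["Support", "Oppose", "Not sure"])

question = "Should the retirement age be raised?"
samples = [generate(build_prompt(question), temperature=0.9) for _ in range(30)]

spread = Counter(samples)
print(spread)
# If nearly all 30 simulated respondents give the same answer, the persona is
# collapsing to a median response: raise the temperature, vary the exemplars,
# or enrich the personas before trusting the resulting distribution.
```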
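Lastly, a sketch combining lightweight retrieval grounding with post-hoc monitoring: background facts about the group are retrieved by crude keyword overlap and prepended to the prompt, and each generated answer is flagged (not silently discarded) when it trips simple checks. The fact store, flag list, and heuristics are deliberately simplistic placeholders; a production pipeline would use an embedding index and a proper moderation or bias-detection step.

```python
FACT_STORE = [
    # (topic keywords, grounding snippet) -- illustrative entries only
    ({"medicaid", "healthcare"}, "In recent polling, a majority of this group favored expanding Medicaid."),
    ({"immigration"}, "Views on immigration in this group are split roughly evenly."),
]

FLAG_TERMS = {"slur_placeholder_1", "slur_placeholder_2"}  # replace with a real blocklist or moderation API

def retrieve_facts(question: str) -> list[str]:
    # Crude keyword-overlap retrieval standing in for an embedding search.
    words = set(question.lower().split())
    return [snippet for keys, snippet in FACT_STORE if keys & words]

def build_grounded_prompt(persona_desc: str, question: str) -> str:
    facts = retrieve_facts(question)
    facts_block = "\n".join(f"- {f}" for f in facts) or "- (no background retrieved)"
    return (f"{persona_desc}\n\nBackground about people like this respondent:\n{facts_block}\n\n"
            f"Survey question: {question}\nAnswer in the respondent's voice:")

def review_answer(answer: str) -> dict:
    # Flag, rather than drop, outputs that need human review.
    lowered = answer.lower()
    return {
        "answer": answer,
        "flagged_terms": sorted(t for t in FLAG_TERMS if t in lowered),
        "suspiciously_polished": len(answer.split()) > 150,  # crude proxy for "too fluent"
    }

prompt = build_grounded_prompt("You are a 58-year-old retired nurse in rural Ohio.",
                               "Do you support expanding medicaid in your state?")
print(prompt)
print(review_answer("I lean toward yes, mostly because of what I saw working in the clinic."))
```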
By applying these strategies, practitioners have managed to reduce bias and increase demographic realism in virtual audience simulations. For instance, the “role creation” approach that dynamically builds a profile with injected knowledge achieved significant accuracy gains in public opinion prediction (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy). Likewise, pluralistic prompting that acknowledges within-group diversity leads to a spread of responses that better matches survey distributions (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy) ([2501.15351] Fairness in LLM-Generated Surveys). The overarching lesson is to move beyond static, surface-level personas and instead treat each simulated person as a complex individual shaped by multiple factors. Just as a skilled interviewer might adjust questions to a respondent’s background, a skilled prompt can cue an LLM into adopting a richer, more faithful voice for that respondent.
Conclusion
LLM-based role-playing for surveys and social simulations holds great promise for augmenting research, but fair and realistic representation of all demographic groups remains a challenge. Studies to date highlight that certain groups – especially older adults, women and racial minorities in specific contexts, lower socio-economic strata, and those outside the Western liberal mainstream – are often misrepresented by default LLM behaviors ([2501.15351] Fairness in LLM-Generated Surveys) ([2303.17548] Whose Opinions Do Language Models Reflect?). These gaps stem from a confluence of skewed training data, alignment-driven biases (e.g. a safety-induced progressive tilt), and the difficulty of capturing intersectional nuance in a statistical model. Encouragingly, scholars are actively developing techniques to diagnose and mitigate these issues. Experiments where alignment filters are relaxed demonstrate the importance of calibrating model “censorship” to balance realism with ethical constraints. On the other hand, injecting domain knowledge and nuance into prompts shows that much can be done without sacrificing safety – often, the solution is simply providing the model with more context so it doesn’t default to shallow assumptions.
In practical terms, anyone attempting a virtual audience or survey simulation with LLMs should proactively check for demographic biases and not assume the model “knows” how to be fair. Combining methods, e.g. using Santurkar et al.’s OpinionQA or similar tools to evaluate opinion alignment (Assessing Political Bias in Language Models | Stanford HAI) and then adjusting prompt strategies as described, can iteratively improve the fidelity of the simulation. It’s also wise to validate key results against real survey data whenever possible, particularly for high-stakes demographics (e.g. predicting how a marginalized community feels about a policy). The research community is just beginning to explore the capabilities of LLMs as proxy respondents, and while early results are mixed, they point to a future where carefully tuned AI personas could become a valuable supplement to traditional surveys ([2501.15351] Fairness in LLM-Generated Surveys). Achieving that will require continued attention to fairness, pluralism, and context in model design.
In closing, LLM simulations must be approached with both enthusiasm and caution: enthusiasm for their potential to amplify under-heard voices at scale, and caution to ensure those voices are not distorted by unseen biases. By identifying which groups are at risk of misrepresentation and employing thoughtful prompt engineering and alignment techniques, we can move closer to LLM-generated data that mirrors the true complexity of human opinions, complete with diversity, disagreement, and nuance. Such advancements will not only improve the utility of virtual audience studies but also contribute to the broader goal of making AI systems more equitable and culturally aware (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy) (Assessing Political Bias in Language Models | Stanford HAI).
Sources: The analysis above synthesized findings from recent studies on LLM social simulations, bias, and alignment, including Abeliuk et al. (2025) on fairness across demographics ([2501.15351] Fairness in LLM-Generated Surveys), Santurkar et al. (2023) on opinion alignment gaps ([2303.17548] Whose Opinions Do Language Models Reflect?) (Assessing Political Bias in Language Models | Stanford HAI), Argyle et al. (2023) on simulating human samples (Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy), Chuang et al. (2024) on role-play alignment with belief networks, the BiasLens framework by Li et al. (2024) exposing role-play biases (Benchmarking Bias in Large Language Models during Role-Playing), and others. These works collectively inform the best practices and cautionary notes presented here.