Reproducing Real-World Demographic Biases in AI Agent Simulations
Researchers have increasingly explored using AI agents – especially large language models (LLMs) – to simulate human survey respondents or social behaviors. A key question is whether these synthetic agents, when assigned demographic characteristics mirroring real populations, exhibit the same attitudes, preferences, and biases observed in real demographic groups. Recent studies provide evidence that, under the right conditions, LLM-based simulations can indeed reflect real-world demographic patterns – a property termed “algorithmic fidelity” (Argyle et al., 2023). However, they also highlight important nuances in methodology and limitations in faithfully reproducing human diversity.
LLM Simulations Reflecting Demographic Patterns
Several studies have demonstrated that LLMs can accurately emulate group differences from real survey data when prompted with detailed demographic personas:
- Argyle et al. (2023) showed that GPT-3 could be conditioned to act as “silicon” survey respondents from various subpopulations. By providing GPT-3 with rich socio-demographic backstories of thousands of real survey participants, they found the model’s response distributions closely matched those of real human subgroups. For example, GPT-3’s simulated two-party vote shares for different demographic groups in U.S. elections aligned closely with actual results from those groups (Performance and biases of LLMs in public opinion simulation). The authors define algorithmic fidelity as this fine-grained, demographically correlated alignment of the AI’s outputs with real-world subgroup patterns.
- Lee et al. (2024) extended this approach to public opinion on climate change. They conditioned GPT-4 on respondents’ demographics (and, in some cases, additional psychological variables) to predict answers to climate opinion surveys. Demographically conditioned GPT-4 reproduced known patterns – for instance, it mimicked U.S. presidential voting behavior by subgroup with high accuracy (Lee et al., 2024). When predicting climate-change beliefs, however, demographics alone were insufficient; only when issue-specific factors were included did the model’s predictions approach real-world responses. This suggests that for complex topics, giving the AI more context beyond basic demographics is crucial. Notably, GPT-4 still showed an “algorithmic bias,” systematically underestimating the pro-climate opinions of Black Americans compared to survey data. This highlights that even a strong LLM can miss certain group-specific nuances.
- Open-Ended Behavioral Simulations: Beyond surveys, AI agents have been used to simulate broader behaviors. Stanford researchers constructed 1,052 AI agents modeled on real individuals by feeding a language model the transcript of a two-hour interview with each person (Park et al., 2024). These detailed persona agents achieved about 85% accuracy in reproducing their human counterparts’ responses on the General Social Survey and other tests (LLM Social Simulations Are a Promising Research Method). In other words, a simulated individual answered questions much as the real person did, nearly as consistently as the person’s own answers would be on a re-test. This result required rich personal data; the interview-based agents far outperformed agents given only abstract demographic information or self-descriptive blurbs (Park et al., 2024). It suggests that supplying nuanced background, beyond age and gender labels alone, helps the AI capture a person’s idiosyncratic attitude profile more faithfully.
- Multiple Models and Domains: Recent work has compared various LLMs and even non-LLM approaches. One study used World Values Survey data to test GPT-4, GPT-3.5, Claude, and open models (Llama, Mistral) by prompting them with demographic attributes (e.g., religion, ethnicity) and having them answer hundreds of opinion questions (LLMs, Virtual Users, and Bias). In general, the LLMs could generate plausible population-wide answers without any additional training, often approaching the accuracy of a supervised model trained on the survey. For instance, LLM “virtual users” predicted many survey item outcomes comparably to a machine learning model trained on the structured data. This again underscores that pretrained LLMs encode substantial knowledge about how demographics correlate with opinions. That said, certain groups’ responses were predicted less accurately, indicating gaps or biases in the models’ knowledge. Intriguingly, the authors found that disabling the model’s built-in content filters (“censorship”) improved accuracy for minority groups – the uncensored LLM more freely produced the perspectives of underrepresented demographics, whereas the aligned version had struggled (LLMs, Virtual Users, and Bias). This finding suggests that heavy-handed moderation or alignment tuning might inadvertently suppress authentic subgroup variation.
In summary, when prompted to “role-play” as a person from a given demographic, modern LLMs often do reproduce the broad statistical tendencies of that group. Everything from generational differences in attitudes to gender gaps and partisan splits has been mirrored in these simulations (Argyle et al., 2023; Performance and biases of LLMs in public opinion simulation). This capability has been leveraged to create synthetic public opinion polls and to populate virtual societies with agents that act, in aggregate, like real populations.
Methodologies: Conditioning on Demographics and Personas
How researchers condition AI agents on demographics is critical to achieving realistic bias distributions. Common techniques include:
- Persona Prompts and Backstories: The simplest approach is instructing the LLM with a persona description (e.g., “You are a 45-year-old female teacher from California who is politically independent and has a college degree.”). Argyle et al. used automated backstory templates filled with real respondents’ attributes to give GPT-3 a vivid identity before asking each survey question (Argyle et al., 2023). Bisbee et al. similarly prompted ChatGPT to “adopt different personas” and answer political opinion questions (Bisbee et al., 2024). This role-play method is effective in steering the model toward the response distribution characteristic of that persona’s group – for example, when asked to rate feelings about various social groups, ChatGPT’s average ratings per persona matched real survey averages very closely. Each synthetic persona essentially acts as a stand-in for a demographic subset.
- Demographic Variables as Input: In some studies, instead of free-form role-play, the prompt is structured to include demographic fields. For instance, in a healthcare decision-modeling study, prompts were formatted with an explicit list of attributes (age, gender, race, income, etc.) and a scenario, and the model was asked to respond with a decision (such as vaccine uptake) (Zhang et al., 2025). This mimics feeding features into a predictive model, except that the LLM generates a textual decision. Such structured prompts make it clear which demographic profile the agent is imitating, and researchers can systematically vary those fields to simulate a whole population. (A minimal sketch of both prompting styles appears after this list.)
- Fine-Tuning and Hybrid Models: While many works use off-the-shelf LLMs with prompting, some experiments have fine-tuned models on survey data. For example, one project (Hewitt et al. 2024, as noted in a survey paper) fine-tuned an open LLM on responses from 160 different human experiments (LLM Social Simulations Are a Promising Research Method). This approach essentially trains the AI to become an even closer proxy of human subjects, potentially increasing fidelity in specific domains. However, fine-tuning requires substantial data and careful handling to avoid overfitting to particular samples.
- Rich Persona Embeddings: The Stanford “digital twin” study mentioned above took a different route, conducting lengthy interviews with real people and feeding those transcripts to the model (Park et al., 2024). The AI then internalized not just demographics but personal narratives, opinions, and speech patterns. The result was an agent that could be queried on entirely new questions and still answer much like the real person would. This method is resource-intensive but demonstrates the upper bound of realism when an agent’s persona is extremely well specified.
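To make the first two conditioning styles concrete, below is a minimal Python sketch of how a free-form persona backstory and a structured demographic prompt might be assembled from a respondent record. The field names, wording, and survey question are illustrative assumptions rather than templates from the cited studies, and no particular LLM API is assumed.

```python
# Minimal sketch of the two conditioning styles described above:
# a free-form persona backstory and a structured demographic prompt.
# Field names and the survey question are illustrative, not from any cited study.

from typing import Dict

def persona_backstory_prompt(profile: Dict[str, str], question: str) -> str:
    """Free-form role-play prompt built from a respondent's attributes."""
    backstory = (
        f"You are a {profile['age']}-year-old {profile['gender']} "
        f"{profile['occupation']} from {profile['state']}. "
        f"You identify politically as {profile['party_id']} and "
        f"your highest education is a {profile['education']}."
    )
    return f"{backstory}\n\nAnswer the following survey question as this person.\n{question}"

def structured_demographic_prompt(profile: Dict[str, str], question: str) -> str:
    """Structured prompt that lists demographic fields explicitly."""
    fields = "\n".join(f"- {key}: {value}" for key, value in profile.items())
    return (
        "Respondent profile:\n"
        f"{fields}\n\n"
        f"Question: {question}\n"
        "Respond with a single answer choice, as this respondent would."
    )

if __name__ == "__main__":
    profile = {
        "age": "45", "gender": "female", "occupation": "teacher",
        "state": "California", "party_id": "independent",
        "education": "college degree",
    }
    question = "Do you support stricter background checks for gun purchases? (Yes/No)"
    print(persona_backstory_prompt(profile, question))
    print(structured_demographic_prompt(profile, question))
```

In practice, one such prompt would be generated per real respondent (or per cell of a demographic cross-tabulation), sent to the model, and the answers aggregated into a synthetic sample.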
Validation of these simulations is done by comparing the AI-generated data to real-world data. Common validation methods include:
- Aggregate Distribution Matching: Compare statistics such as means, proportions, or correlations. For instance, does the percentage of simulated “Gen Z” respondents supporting a policy match the percentage in actual surveys? Argyle et al. reported high correlation between GPT-generated samples and real survey samples across many questions (Argyle et al., 2023). In one case, GPT-3’s simulation of U.S. 2012 election voting by subgroup reproduced the true two-party vote split almost exactly (Performance and biases of LLMs in public opinion simulation). (A minimal comparison sketch appears after this list.)
- Cross-Tab and Regression Comparison: Researchers also test whether relationships between variables hold. Bisbee et al. examined regression coefficients derived from synthetic survey responses versus those from the real survey (Bisbee et al., 2024). They found some divergence – meaning that while an LLM might get marginal percentages right, it could misestimate how demographics interact (for example, over- or under-estimating the strength of a gender effect on an opinion).
- Individual-Level Accuracy: In the Stanford study, since each AI agent corresponded to a real person, the researchers could measure what fraction of questions the AI answered the same way the person did. The 85% accuracy on the GSS indicates high individual-level fidelity (LLM Social Simulations Are a Promising Research Method). Other work has used test-retest reliability as a benchmark – essentially asking, if a person’s answers can vary slightly over time, does the AI’s answer fall within that same variability range? An ideal simulation should not be a verbatim copy (which might indicate overfitting or plagiarism of training data), but it should land in the statistical ballpark of real responses.
- Human Believability: A more qualitative check is whether human judges can tell AI-generated respondents from real ones, or whether the AI’s answers sound plausible. While this is not a primary metric (the goal is matching reality, not just sounding realistic), people often find the outputs believable, yet there is a risk of AI answers echoing stereotypes that a casual observer would accept as “realistic.” Thus, believability alone is not sufficient validation of correctness.
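As a concrete illustration of the aggregate and individual-level checks above, the sketch below computes a subgroup-mean correlation and a per-person agreement rate between real and simulated responses. It assumes two tidy tables with hypothetical columns ('group', 'respondent_id', 'item', 'response'); the column names and data layout are assumptions for illustration, not tied to any of the cited datasets.

```python
# Sketch of the validation checks described above: subgroup-mean correlation
# (aggregate fidelity) and per-person agreement (individual-level fidelity).
# Assumes two DataFrames with hypothetical columns: 'group', 'respondent_id',
# 'item', and a numeric 'response'.

import pandas as pd

def subgroup_mean_correlation(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Pearson correlation between real and simulated subgroup means across items."""
    real_means = real.groupby(["group", "item"])["response"].mean()
    synth_means = synthetic.groupby(["group", "item"])["response"].mean()
    aligned = pd.concat([real_means, synth_means], axis=1, keys=["real", "synth"]).dropna()
    return aligned["real"].corr(aligned["synth"])

def individual_agreement(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Fraction of (respondent, item) pairs where the simulated answer matches the real one."""
    merged = real.merge(synthetic, on=["respondent_id", "item"], suffixes=("_real", "_synth"))
    return (merged["response_real"] == merged["response_synth"]).mean()
```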
Patterns Observed vs. Distortions and Biases
Encouraging findings: When properly conditioned, AI agents have shown a remarkable ability to mirror real demographic patterns. They can capture known biases such as generational gaps, education effects, and partisan divides:
- Political and Social Attitudes: As noted, models like ChatGPT and GPT-4, prompted with profiles, produced support/opposition levels on political issues that aligned with actual survey data for those profiles (Argyle et al., 2023; Bisbee et al., 2024). For example, a “conservative older male” agent will favor gun rights more than a “liberal young female” agent, closely mirroring trends in opinion polls. In Argyle et al.’s experiments, GPT-3 could even reflect complex cross-features (e.g., attitudes of college-educated suburban women as a subgroup) with surprising nuance.
- Health Behavior: In a simulation of COVID-19 vaccine hesitancy, different LLM-based agents exhibited the known differences in uptake by education level, race, and income. Some models in that study “captured the effects of gender, race, income, and education in vaccine acceptance quite well,” closely tracking survey data (Zhang et al., 2025). This implies the model had learned, for instance, that historically Black communities showed more skepticism early on, or that higher-educated individuals were faster to accept the vaccine – patterns borne out in real studies and reflected in the synthetic data.
- Global and Cross-Cultural Opinions: Research using the World Values Survey and other international data finds that ChatGPT’s responses vary when the persona’s country or culture is changed, echoing real cultural biases (Performance and biases of LLMs in public opinion simulation). For instance, a simulated respondent from Sweden and one from Singapore will give different answers on questions about traditional values or political trust, much as actual populations differ. However, fidelity can be uneven where the model’s training data is sparse for certain regions. One paper noted that ChatGPT, being English-centric, was less reliable for non-Western populations unless those views were well represented in its training data.
Despite these successes, cautionary findings have also emerged. LLM simulations are not perfect replicas of human data and can introduce their own artifacts or errors:
- Reduced Variability (“Over-Smoothing”): A consistent observation is that synthetic data can be too consistent. Bisbee et al. reported that ChatGPT’s simulated respondents exhibited “less variation in responses than in the real surveys” (Bisbee et al., 2024). In their test, the standard deviation of opinions across simulated individuals was only about one-third of that among actual poll respondents (How Internet-Trained LLMs Exaggerate Our Differences, OSF). In practice, this meant the AI personas were often overly average – e.g., if most people score around 60 out of 100 on a feeling thermometer, the AI might make almost everyone a neat 60, whereas real people’s scores vary widely around that mean. This homogenization could stem from the model averaging likely answers or avoiding extreme positions unless strongly cued. It suggests current LLMs may under-represent the true diversity of thought within demographic groups. (A minimal diagnostic sketch appears after this list.)
- Distorted Relationships: Even if means match, the way variables interact in the simulation might differ from reality. The same study found that regression coefficients from the synthetic data often differed significantly from those in the real data (Bisbee et al., 2024). For example, the true data might show that income is a strong predictor of support for a policy within each ethnic group, but the LLM’s responses might not reflect as steep an income gradient (or vice versa). This could mislead analyses that rely on synthetic data. The problem may stem from the LLM not perfectly capturing conditional distributions or over-generalizing from its training corpus.
- Amplification of Biases: There is a risk that an AI will exaggerate certain demographic stereotypes, essentially creating a caricature. Cheng et al. (2023) introduced the CoMPosT framework to evaluate such caricature in LLM simulations. They found that GPT-4’s simulations of certain political and marginalized groups on generic topics were “highly susceptible to caricature,” meaning the AI’s portrayal was more extreme or one-dimensional than real individuals from those groups. For instance, a simulated “progressive activist” might respond with uniformly far-left views on every issue, whereas real activists hold nuanced or even moderate views on some questions. This exaggeration likely arises from over-reliance on stereotypes present in the model’s training data. The lack of individuation – many synthetic individuals sounding alike – and the exaggeration of group traits are warning signs that must be managed to avoid reinforcing biases (Cheng et al., 2023).
- Selective Accuracy and Gaps: LLM knowledge is not uniformly accurate across all demographics. Lee et al.’s climate opinion study highlighted how some groups’ attitudes were captured well and others poorly (Lee et al., 2024). GPT-4, for example, struggled with the nuanced climate views of Black Americans even when it got white and Hispanic Americans’ trends right. This could reflect historical biases in data availability (e.g., more data on majority groups’ opinions) or alignment choices that caused the model to miss certain perspectives. Similarly, the healthcare survey simulation found that different models yielded different bias patterns: one model might under-predict vaccine hesitancy in a certain group (making the synthetic population too optimistic), while another over-predicts the gap (making differences too stark) (Zhang et al., 2025). The study noted that some models actually “amplify racial, income, or education disparities” beyond what the real data showed, whereas others blunted those differences. Such variability implies that the choice of model and prompt can tilt the balance – either masking real biases or overshooting them – so using multiple models and calibrating against real data is advisable.
- Prompt Sensitivity and Reproducibility: Another challenge is that these simulations can be brittle. Bisbee et al. demonstrated that minor changes in prompt wording led to significant changes in the output distribution, even though the task was logically the same (Bisbee et al., 2024). For example, phrasing the persona prompt in a slightly different style might make the AI shift its answers unpredictably. Moreover, LLMs like ChatGPT are moving targets (with periodic model updates); the authors found that the same prompt given three months later produced significantly different survey results, likely due to model updates or randomness. This raises concerns about the reliability and reproducibility of synthetic survey data. If a policymaker tried to use an AI to gauge public opinion, the method would need to be stable over time or easily recalibrated.
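The sketch below illustrates how the first two failure modes might be checked in practice: a synthetic-to-real standard-deviation ratio per group (values well below 1 signal over-smoothing) and a side-by-side comparison of regression coefficients fitted on real versus synthetic data (divergence signals distorted relationships). Column names are illustrative assumptions, and the plain-numpy least-squares fit is a minimal stand-in for whatever regression specification the original analyses used.

```python
# Diagnostics suggested by the findings above: (1) synthetic-to-real SD ratio
# within each demographic group, and (2) a comparison of OLS coefficients
# fitted separately on real and synthetic data. Column names are illustrative.

import numpy as np
import pandas as pd

def sd_ratio_by_group(real: pd.DataFrame, synthetic: pd.DataFrame,
                      group_col: str, outcome_col: str) -> pd.Series:
    """Synthetic-to-real standard deviation ratio of the outcome within each group."""
    real_sd = real.groupby(group_col)[outcome_col].std()
    synth_sd = synthetic.groupby(group_col)[outcome_col].std()
    return (synth_sd / real_sd).rename("sd_ratio")

def ols_coefficients(df: pd.DataFrame, outcome_col: str, predictor_cols: list) -> pd.Series:
    """Ordinary least squares coefficients (with intercept) via numpy lstsq."""
    X = np.column_stack([np.ones(len(df))] + [df[c].to_numpy(float) for c in predictor_cols])
    y = df[outcome_col].to_numpy(float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(beta, index=["intercept"] + predictor_cols)

def compare_coefficients(real: pd.DataFrame, synthetic: pd.DataFrame,
                         outcome_col: str, predictor_cols: list) -> pd.DataFrame:
    """Side-by-side real vs. synthetic coefficients; large gaps flag distorted relationships."""
    return pd.DataFrame({
        "real": ols_coefficients(real, outcome_col, predictor_cols),
        "synthetic": ols_coefficients(synthetic, outcome_col, predictor_cols),
    })
```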
Given these findings, scholars urge careful validation and bias assessment whenever using AI agents for social simulation (Lee et al., 2024). Simply mirroring demographics in the prompt does not guarantee a fully faithful reproduction of reality – the simulations may carry their own “AI biases.” Some biases come from the training data (e.g., if online text underrepresents a minority viewpoint, the model might too), and some come from the model’s architecture or safety filters, as noted above.
Conclusion and Future Directions
In summary, AI agent simulations conditioned on real-world demographics can reproduce many real-world bias distributions, often with striking accuracy on aggregate measures. Studies from 2023–2025 have shown that LLMs like GPT-3.5, GPT-4, and others can serve as surrogate respondents, yielding attitude distributions by age, gender, race, and so on that resemble those found in actual surveys (Argyle et al., 2023; Lee et al., 2024). This suggests that LLMs have absorbed a great deal of sociocultural knowledge – enough to differentiate, say, a middle-aged Midwestern man’s likely opinions from those of a young urban woman on a range of topics. Such simulated populations open up exciting possibilities for social science: one can test hypotheses quickly, or explore how a policy might be received by different demographics, by querying AI agents instead of running new surveys (LLM Social Simulations Are a Promising Research Method).
However, researchers also consistently warn that these simulations are not a replacement for real data and must be used with caution (Bisbee et al., 2024). To faithfully mirror human diversity, future work is focusing on a few improvements:
- Better Conditioning: Incorporating more than surface demographics – e.g., ideological cues, personality traits, or lived-experience details – can improve fidelity on complex behaviors (Lee et al., 2024). Role-playing with richer personas (up to and including real interview text) has proven to yield more lifelike agents (Park et al., 2024).
- Mitigating Caricature and Bias: Techniques to ensure the model does not default to stereotypes are being studied. These might include adversarial prompts to test for exaggeration (Cheng et al., 2023), or fine-tuning on datasets that emphasize intragroup diversity. Aligning model outputs to match not just the averages but also the variance and covariance of human data is a challenge on the horizon.
- Model and Prompt Selection: Different LLMs have different strengths. Larger models (like GPT-4) generally showed higher fidelity (Lee et al., 2024), but overly restrictive safety tuning can make them “too impartial.” Researchers found that using open models with fewer filters, or explicitly instructing the model to be candid, sometimes yielded more accurate reflections of sensitive group differences (LLMs, Virtual Users, and Bias). Going forward, one may choose or tune a model to balance ethical constraints with fidelity to authentic group voices.
- Validation Against Ground Truth: Every synthetic simulation should ideally be cross-checked against some real data. For example, if using an AI to simulate “Gen Z vs. Boomer” differences, one might compare the AI’s output on a few benchmark questions to recent survey results as a sanity check (a minimal sketch of such a check follows this list). Ongoing work is establishing metrics (like algorithmic fidelity scores) to quantify how close an AI-generated distribution is to the real-world target (Lee et al., 2024).
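As a minimal illustration of such a sanity check, the sketch below compares simulated subgroup percentages on a handful of benchmark questions with recent survey figures and flags the simulation if the mean absolute gap exceeds a chosen tolerance. The question keys, groups, numbers, and tolerance are all illustrative placeholders, not a standardized fidelity metric.

```python
# Minimal sanity-check sketch: flag a simulation whose subgroup percentages on
# benchmark questions drift too far from recent survey figures. All keys and
# numbers below are illustrative placeholders.

def passes_sanity_check(real_pct: dict, simulated_pct: dict, tolerance: float = 5.0) -> bool:
    """True if the mean absolute gap (in percentage points) is within tolerance."""
    shared_keys = real_pct.keys() & simulated_pct.keys()
    if not shared_keys:
        raise ValueError("No overlapping (question, group) keys to compare.")
    gaps = [abs(real_pct[k] - simulated_pct[k]) for k in shared_keys]
    return sum(gaps) / len(gaps) <= tolerance

if __name__ == "__main__":
    real = {("supports_policy_x", "Gen Z"): 62.0, ("supports_policy_x", "Boomer"): 41.0}
    simulated = {("supports_policy_x", "Gen Z"): 58.5, ("supports_policy_x", "Boomer"): 44.0}
    print(passes_sanity_check(real, simulated, tolerance=5.0))  # True: mean gap is 3.25 points
```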
In conclusion, broad AI agent simulations, particularly those powered by LLMs, have shown an emerging ability to reproduce real-world demographic biases and patterns in synthetic populations. From public opinion polls to behavioral experiments, these agents often echo the differences between Gen Z and Millennials, men and women, liberals and conservatives, and so on that we observe in actual societies. The research to date – spanning political science, psychology, and computer science – indicates substantial promise in using “simulated societies” to complement traditional studies (LLM Social Simulations Are a Promising Research Method). Yet it also emphasizes the need for careful methodological design and ethical guardrails. Demographic mirroring can yield realistic outcomes, but only if we account for the nuances – ensuring the AI’s learned biases align with real biases (and not artifacts), and guarding against new biases introduced by the AI itself. With continued refinement, AI agent simulations could become a powerful tool for social scientists, allowing them to explore hypothetical scenarios and rare populations by testing on a “silicon sample” before drawing conclusions about society (Argyle et al., 2023). The consensus of recent studies is that we should proceed optimistically yet carefully, validating these synthetic proxies at each step before trusting their reflection of our diverse world.

Sources:
- Argyle, L. P., et al. (2023). Out of One, Many: Using Language Models to Simulate Human Samples. Political Analysis, 31(3), 337–351. (Finding: GPT-3, when given real demographic backstories, accurately emulated the response distributions of many human subgroups, achieving high correspondence with survey data.)
- Lee, S., et al. (2024). Can large language models estimate public opinion about global warming? PLOS Climate, 3(8), e0000429. (Finding: GPT-4, conditioned on demographics, reproduced patterns in voting and – with additional covariates – in climate opinions, reaching 53–91% accuracy on various belief measures when both demographics and psychological factors were included; noted an underestimation bias for Black Americans’ responses.)
- Park, J. S., et al. (2024). “Digital twin” AI agents of 1,052 Americans. Stanford University study, reported in HAI News. (Finding: AI agents built from interview transcripts predicted individuals’ survey responses with roughly 85% of the test-retest fidelity of the people themselves, far better than demographic-only simulations; demonstrated the feasibility of large-scale, persona-conditioned agents.)
- Zhang, H., et al. (2025). Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare. Preprint. (Finding: Prompting models with demographic profiles to answer vaccine-attitude questions showed that some models reflected real group differences well while others distorted them; identified cases of both attenuation and amplification of racial and income disparities across LLMs, urging bias-aware simulation design.)
- Bisbee, J., et al. (2024). Synthetic Replacements for Human Survey Data? The Perils of Large Language Models. Political Analysis, 32(4), 401–416. (Finding: ChatGPT personas yielded group means closely matching ANES survey averages, but the synthetic data were over-smoothed, with far less variability; regression analyses and slight prompt changes produced inconsistent results, raising reliability concerns.)
- Cheng, M., et al. (2023). CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations. Proceedings of EMNLP 2023. (Finding: Introduced metrics for “caricature” in LLM simulations; found that GPT-4 often over-generalized personas, with simulations of certain political and marginalized groups on neutral topics being overly stereotyped and lacking individual variation; emphasized the need for evaluation beyond replicating known survey statistics.)
- Santurkar, S., et al. (2023). Whose Opinions Do Language Models Reflect? Preprint. (Finding: Created the OpinionQA dataset to compare LLM-stated opinions with those of real demographic groups; identified significant gaps for some groups, highlighting that default LM outputs can be misaligned with minority viewpoints; underscores the importance of alignment if LLMs are used to represent user groups.)