Correcting Bias in LLMs with DSPy

Left unchecked, large language models can massively distort reality. Our baseline simulation showed 90.6% of AI personas voting Democrat, a 40-point deviation from the expected distribution. But with DSPy optimization, we achieved near-perfect political balance, proving that prompt calibration isn't optional; it's essential.

May 21, 2025

If you don’t calibrate your AI personas, they’ll happily vote 90% Democrat

We recently conducted an experiment that reveals just how important calibration is when using AI for synthetic research. Without proper optimization, AI personas show overwhelming political bias—with 90.6% of AI-generated personas voting Democrat in our simulation.

This isn't just an academic concern. As businesses increasingly turn to AI for market research, customer insights, and decision-making, these inherent biases could significantly skew results.

Our Political Persona Experiment

To create these simulated voters, we used Anthropic Claude, with the same method as in our last political experiment on media diets: prompting with aggregate statistics from the US census and validating that the numbers added up.

Generating 100 at a time, we created a dataset of 500 diverse personas with different demographic backgrounds—varying in age, occupation, education level, income, and location based on US census data. Each persona was given a clear political leaning in the original dataset, with the resulting distribution ending up as 51.4% Republican and 48.6% Democrat voters.
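To make that step concrete, here is a rough sketch of what the generation call could look like with the Anthropic Python SDK. The model name, prompt wording, and the `CENSUS_SUMMARY` placeholder are illustrative assumptions, not the exact code behind the experiment.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder for the aggregate US census statistics used to ground the batch.
CENSUS_SUMMARY = "...aggregate statistics on age, occupation, education, income, location, and party registration..."

prompt = f"""Using the aggregate US census statistics below, generate 100 diverse voter
personas as a JSON list with fields: age, occupation, education, income, location, party.
Across the batch, the distribution of each field should match the statistics.

{CENSUS_SUMMARY}"""

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative; any recent Claude model works
    max_tokens=4096,
    messages=[{"role": "user", "content": prompt}],
)

personas_batch = response.content[0].text  # parse the JSON, then validate the totals add up
```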

The test was simple: ask each AI-simulated persona who they would vote for in the 2024 election between Donald Trump and Kamala Harris. This works as a test because GPT-4o’s knowledge cutoff date is prior to the 2024 election, so it doesn’t know who won without making a search query.
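As a minimal sketch of the baseline (unoptimized) setup, each persona can be polled with a plain chat completion like the one below. The `ask_persona` helper and the exact message layout are our own simplification rather than the production code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = (
    "Kamala Harris and Donald Trump are running in the 2024 election. "
    "Who would you vote for?\n"
    "D: Kamala Harris\n"
    "R: Donald Trump\n"
    "Return D or R only."
)

def ask_persona(persona_description: str) -> str:
    """Ask one simulated persona the election question and return 'D' or 'R'."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": persona_description},
            {"role": "user", "content": QUESTION},
        ],
    )
    # Crude parse: keep the first character of the reply.
    return response.choices[0].message.content.strip().upper()[:1]
```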

The Baseline Results Were Biased Towards Blue

Our baseline model—with no optimization techniques applied—produced personas that voted overwhelmingly Democratic:

  • 90.6% voted for Kamala Harris

  • Only 9.4% voted for Donald Trump

This 40-point deviation from the expected distribution just isn't realistic, as we know that party registration is close to 50:50, and the Republicans actually won the popular vote in the last election.

If LLMs are biased in politics, they are likely biased in other areas as well, and you need to be able to account for that when asking tools like ChatGPT for feedback on your ideas. For example, one study found that LLM-generated personas were more likely to favor environmentally friendly cars, to say liberal arts degrees were more valuable than STEM degrees, and to choose La La Land over Transformers. Withholding any value judgement, these hidden preferences deviate significantly from real-world choices, and for better or worse, may guide heavy users of AI away from existing social or cultural norms.

LLM Bias Is A Common Problem

Both LLMs and humans suffer from bias on various topics, but it’s important to know where LLMs differ from humans in their bias, because it has implications for those of us who outsource an increasing amount of our thinking to AI.

This left-leaning bias in AI systems isn't just our observation—it's well-documented:

  • The Centre for Policy Studies found that left-leaning bias is commonplace in AI-powered chatbots

  • Brookings Institution research identified consistent political bias in AI systems

  • OpenAI CEO Sam Altman has acknowledged the left-leaning bias in ChatGPT, while pointing out that other LLMs like Grok suffer from the same problems

It makes perfect sense that models will reflect the opinions and preferences of those who trained them, even if unintentionally. Researchers and those hired to label data make millions of small decisions, which are encoded in the model through reinforcement learning. The training data is another source of bias, where high-quality publications that are also left-leaning, like the New York Times, may be overrepresented.

We also have to remember that the primary training data of LLMs is the internet, which is an imperfect mirror of reality. When social stigma is attached to a candidate, people may be less likely to express support in public, and private thoughts aren’t observed by the LLMs, leading to an imbalance.

I was aware of LLMs’ left-leaning bias, which is why I made sure we generated our personas based on US census data. Researchers at Columbia found that when you ask LLMs to generate personas (without grounding in real data), they become increasingly skewed in opinions and preferences.

https://arxiv.org/pdf/2503.16527 

However, it doesn’t matter that the default answer from an LLM can be wrong, so long as you adapt your prompts to adjust for this bias. LLMs have seen enough of everything on the internet and are capable of roleplaying any type of persona you need–you just need to find the right prompt.

In building Rally I’ve A/B tested hundreds of prompts to make our virtual audiences respond to questions more accurately, but this process takes time and can be hit or miss.

Correcting the Bias with DSPy Optimizers

The good news? We found that optimization techniques can dramatically improve the accuracy of AI-simulated perspectives. 

Rather than manual guesswork, I used DSPy, an automatic prompt optimization framework (actually it’s much more than that, but I use it primarily to optimize prompts). DSPy is a secret weapon, almost too good to share in case your competitors start using it. 

The library is an open-source project based on the earlier Demonstrate-Search-Predict framework, and it essentially treats prompt engineering as a software problem. There's a steep learning curve, which is why last year I built DSPyUI, an open-source user interface for the library, but the results are worth it.

You define the inputs and outputs of your system, give it a dataset of examples to learn from – in our case 80% of the AI personas and their votes – and choose an optimizer to run. The optimizer will set about automatically discovering better instructions by analyzing what works and what doesn't across different examples.
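Here is a minimal sketch of what that setup can look like in DSPy. The signature text, field names, and the tiny `personas` stand-in are illustrative assumptions; the real dataset had 500 labeled personas.

```python
import dspy

# The model whose bias we want to correct (GPT-4o in our experiment).
dspy.configure(lm=dspy.LM("openai/gpt-4o"))

QUESTION = (
    "Kamala Harris and Donald Trump are running in the 2024 election. "
    "Who would you vote for?\nD: Kamala Harris\nR: Donald Trump\nReturn D or R only."
)

class PredictVote(dspy.Signature):
    """Answer the question as the persona described, returning D or R only."""
    persona = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField(desc="D or R")

predict_vote = dspy.Predict(PredictVote)

# Illustrative stand-in for the 500 labeled personas; each has a description and a vote.
personas = [
    {"description": "You are a 21 year old college student from Houston, TX...", "vote": "D"},
    {"description": "You are a 58 year old farmer from rural Iowa...", "vote": "R"},
    # ... the remaining personas
]

examples = [
    dspy.Example(persona=p["description"], question=QUESTION, answer=p["vote"])
    .with_inputs("persona", "question")
    for p in personas
]
split = int(0.8 * len(examples))          # 80% for optimization, 20% held out
trainset, devset = examples[:split], examples[split:]

# The metric every optimizer maximizes: did the predicted vote match the label?
def vote_match(example, pred, trace=None):
    return example.answer.strip().upper() == pred.answer.strip().upper()
```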

We tested three different optimizers for this experiment:

1. BootstrapFewShotWithRandomSearch

This technique is a mouthful, but in practice it just adds a few well-chosen examples to your prompt to improve the accuracy of its predictions. In our case we had 4 labeled demos (examples taken from our 500-persona dataset) and 4 bootstrapped demos (new examples generated by the model itself).

This is what one of the examples added to the prompt looked like:

User message:

[[ ## persona ## ]]

You are a 21 year old College student/Retail from Houston, TX. Your background: You have been working as a college student/retail. You completed your some college and earn $16,000 annually.

[[ ## question ## ]]

Kamala Harris and Donald Trump are running in the 2024 election. Who would you vote for?

D: Kamala Harris
R: Donald Trump

Return D or R only.

Assistant message:

[[ ## answer ## ]]

D

Simply adding 8 good examples to the prompt improved accuracy to 76.8% and reduced political bias significantly (though still leaning Democratic at 61.0%).
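For reference, invoking that optimizer in DSPy looks roughly like this, reusing the `predict_vote` program, `trainset`, and `vote_match` metric sketched earlier. The argument values mirror the 4 + 4 demos described above but are otherwise assumptions.

```python
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# 4 labeled demos drawn from the dataset plus 4 demos bootstrapped by the model itself,
# with a random search over which combination scores best on the metric.
optimizer = BootstrapFewShotWithRandomSearch(
    metric=vote_match,
    max_labeled_demos=4,
    max_bootstrapped_demos=4,
)
bootstrapped_vote = optimizer.compile(predict_vote, trainset=trainset)
```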

2. MIPROv2

MIPRO (Multiprompt Instruction PRoposal Optimizer) looks at the model's errors and automatically generates instructions to address them. This is much more heavy duty and takes longer to run, making thousands of requests before it finds the right combination of prompt instructions and few-shot examples. 

Here’s an example of the type of instruction MIPRO discovered was useful to add to the prompt, to improve realism:

Imagine you are a political consultant tasked with accurately predicting the voting preference of various personas for a high-profile client. Your reputation and future contracts depend on your ability to provide precise predictions.

This method achieved 78.8% accuracy with a 57.4% Democratic vote share—much closer to the actual distribution. We didn’t have to spend any time coming up with prompt instructions, and we can be confident that the final prompt was the best of many candidates.
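A hedged sketch of the equivalent MIPROv2 call, again reusing the pieces from the setup above; `auto="light"` is one of the optimizer's budget presets and simply an assumption about a reasonable starting point.

```python
# MIPROv2 proposes candidate instructions and few-shot sets, then searches over
# combinations of them, scoring each candidate prompt with the metric.
optimizer = dspy.MIPROv2(metric=vote_match, auto="light")
mipro_vote = optimizer.compile(predict_vote, trainset=trainset)
```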

3. SIMBA

SIMBA (Stochastic Introspective Mini-Batch Ascent) analyzes the model's errors and iteratively improves its prompting strategy. This is the newest optimizer and also the best in my testing.

The cool thing about this optimizer is that you can follow along as it ‘thinks’ through the results it is getting and gives itself advice on what to change based on what it has learned:

Advice for self: If the module receives a persona that describes a specific socio-economic background and a question about a political choice, then it should analyze the persona's likely values and experiences more deeply. Specifically, consider factors such as income level, job stability, and regional political trends. For example, an 82-year-old retired individual with a low income may prioritize candidates who support working-class issues, which could lead to a preference for Donald Trump in this context.

It performed with 80.4% accuracy and a nearly perfect political distribution of 53.4% Republican and 46.6% Democrat votes, running faster and cheaper than MIPRO.
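And a sketch of the SIMBA run plus an evaluation on the held-out 20% of personas, with the same caveat that the exact arguments we used differ:

```python
# SIMBA critiques its own mistakes on mini-batches of examples and iteratively
# rewrites the advice it gives itself, climbing towards a higher metric score.
optimizer = dspy.SIMBA(metric=vote_match)
simba_vote = optimizer.compile(predict_vote, trainset=trainset)

# Score the optimized program on the held-out dev set.
evaluate = dspy.Evaluate(devset=devset, metric=vote_match, display_progress=True)
evaluate(simba_vote)
```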

The end result was an unmistakable improvement in realism. We corrected for left-leaning bias in the AI responses, and got extremely close to the true distribution, within two percentage points.

Why This Matters for Business

This experiment has significant implications for any organization using AI for customer research, market analysis, or decision-making:

  1. Synthetic Persona Reliability: Businesses using AI-generated personas need to be aware of and correct for these biases to get reliable insights.

  2. Market Research Accuracy: Uncalibrated AI systems could dramatically skew market research findings in favor of more liberal/progressive preferences.

  3. Cost-Effective Correction: The good news is that techniques like SIMBA can effectively correct these biases without requiring massive computational resources.

We found this issue with bias doesn’t just apply to OpenAI: Elon Musk’s Grok 3 model also voted Democrat 68% of the time without calibration, and surprisingly Anthropic Claude 3.5 was biased in the opposite direction, voting Republican 70% of the time. 

It’s important you don’t rely too much on the recommendations given by black-box synthetic AI systems. Instead, investigate how the models and prompts have been calibrated to reflect real-world opinions, and test that they replicate your own studies.

Key Takeaways

  1. LLMs can be biased: Uncalibrated large language models are not neutral in their responses, and can lead you to the wrong conclusion without proper calibration.

  2. Optimization makes a difference: With the right techniques, AI systems can be calibrated to provide much more accurate and balanced perspectives.

  3. DSPy is a secret weapon: Without having to guess what prompt might remove the bias, it’s possible to automatically optimize your AI audience to reflect real data.

For organizations investing in AI for customer insights or market research, these findings underscore the importance of properly calibrating your models and validating results to ensure you're getting accurate, balanced perspectives—not just reflections of the models' inherent biases.

If you’re interested in calibration experiments or synthetic research, get in touch.


Mike Taylor

Mike Taylor is the CEO & Co-Founder of Rally. He previously co-founded a 50-person growth marketing agency called Ladder, created marketing & AI courses on LinkedIn, Vexpower, and Udemy taken by over 450,000 people, and published a book with O’Reilly on prompt engineering.
