I Predicted The Sentiment of +26,044 Post Replies—Before They Happened—With 82.7% Accuracy


What if you could see tomorrow’s social-media outrage unfold today—predicting thousands of angry comments before anyone even hits “reply”? I simulated future threads using AI personas, anticipated sentiment with near-perfect accuracy (up to 99.9%), and uncovered exactly how tiny shifts in prompts and audience design can make or break predictions. Here’s the play-by-play of that experiment, and why accuracy alone is meaningless without the context that shapes it.

April 22, 2025

Goodhart’s Law nearly bankrupted one of the first startups I worked at: “When a measure becomes a target, it ceases to be a good measure.” In the marketing team we worshipped one metric, clicks, because the CEO pushed for it at every opportunity. Month after month we celebrated rising traffic (“something is happening!”), while quietly ignoring the fact that, despite a healthy conversion rate, hardly anyone ever came back to buy again. The day I ran a retention cohort and proved our funnel was a cul-de-sac, the broken business model was obvious to everyone… so we extended the product to serve higher-frequency use cases and, after a lot of humble pie, found the next 10× leg up.

X outrage behaves the same way. Brands tally likes, reposts, replies, even “net sentiment” after a post blows up—as if the number itself will save them. But by the time your dashboard flashes red, the metric has already become the crowd’s target: trolls pile on to dunk, fans swarm to defend, and the statistic morphs into the very outcome you’re measuring to prevent.

So I flipped the script. Instead of waiting for a crowd of strangers to hijack the narrative, I asked our zero-frills default synthetic US population, 100 AI personas primed to mimic America’s voices, to react first. I tested a post that had made headlines, let the bots argue about overtime and taxes, then tweaked the prompt until the simulation came respectably close to the narratives that played out for real.

I then used those learnings to predict sentiment for several posts that would erupt the next day. In five minutes I saw the next day’s crisis in miniature, then watched the real threads play out, having predicted their sentiment with up to 99.9% accuracy before the comments had actually happened.

This writeup is the play‑by‑play of that experiment—how Goodhart’s warning about bad targets guided the build, why “accuracy” without nuance is a mirage, and the tiny prompt pivots that let me predict thousands of angry replies before they happened.

Testing Sentiment Accuracy

I started by downloading the raw JSON from a high-engagement X thread about the controversial “32-hour work-week”, a topic that has repeatedly sparked debate on Hacker News, making it a perfect starting point since we’d previously built an HN simulator.

After vibe-coding the Twitter thread downloader, I cleaned the data: stripping links and hashtags and limiting each poster to 5 replies. Then I piped the full_text column through VADER to pin a compound sentiment score on each reply: positive, neutral, or “somebody-needs-coffee.”
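Here’s a minimal sketch of that cleaning-and-scoring step; the column names and toy replies are illustrative stand-ins, not my exact pipeline.

```python
import re
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# In practice this frame came from the downloaded thread JSON; a couple of toy
# replies stand in here so the sketch runs on its own.
replies = pd.DataFrame([
    {"author_id": "a1", "full_text": "Less hours, less overtime pay #32hours https://t.co/x"},
    {"author_id": "a2", "full_text": "Honestly this could be great for families"},
])

# Strip links and hashtags, then cap each poster at 5 replies.
clean = lambda t: re.sub(r"https?://\S+|#\w+", "", t).strip()
replies["clean_text"] = replies["full_text"].apply(clean)
replies = replies.groupby("author_id").head(5)

# VADER's compound score: one number per reply in [-1, +1].
analyzer = SentimentIntensityAnalyzer()
replies["compound"] = replies["clean_text"].apply(
    lambda t: analyzer.polarity_scores(t)["compound"]
)
print(replies[["clean_text", "compound"]])
print("average compound:", replies["compound"].mean())
```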

 

Next I blasted the exact same headline at Rally’s default Synthetic U.S. Population and let the 100 AI personas react. Rally lets you export the responses to CSV, so I pulled the data into a Google Colab notebook and reran VADER. With both sets of sentiment scores side by side, real vs. synthetic, annotated with the same rubric, I made the first comparison.

Real Human Responses

OpenAI Fast

The initial synthetic results were far too positive. So I tested all our model / mode combinations and found Anthropic did much better. 

 

Anthropic Fast

Anthropic Smart

 

What Do These Results Really Mean Though? 

VADER assigns each piece of text a compound score: a single, normalized metric that captures its overall sentiment on a scale from −1 (most negative) through 0 (neutral) to +1 (most positive). Averaging those scores across a thread gives the “average compound score” used throughout this piece. Each compound score is computed by:

  1. Scoring each token (word or emoji) in the text according to a lexicon of valence.

  2. Adjusting for rules like punctuation emphasis or degree modifiers (“very”, “extremely”).

  3. Summing those adjusted scores.

  4. Normalizing the sum to be between −1 and +1.

In summary, average compound scores are a vibe check. Are people mad? Jazzed? Or something in between?
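Step 4 above is the least obvious part. In the reference vaderSentiment implementation, the summed valence is squashed into the −1 to +1 range with a simple normalisation; alpha = 15 is the library’s default constant. A sketch:

```python
import math

def normalize(summed_valence: float, alpha: float = 15.0) -> float:
    # Squash the raw lexicon sum into [-1, +1], mirroring vaderSentiment's
    # normalize() helper; larger sums asymptotically approach +/-1.
    score = summed_valence / math.sqrt(summed_valence ** 2 + alpha)
    return max(-1.0, min(1.0, score))

print(normalize(3.1))   # clearly positive text -> ~0.62
print(normalize(-1.2))  # mildly negative text  -> ~-0.30
```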

Interpreting The Average Compound Score

  • Positive (>0): On average, vibe leans upbeat or approving.

  • Negative (<0): On average, vibe leans critical or disapproving.

  • Zero (≈0): Neutral overall; positive and negative vibes roughly balance out.

Calculating Compound Score Accuracy 

To get a quick-and-dirty “accuracy” number, I first took the absolute gap between the bots’ average sentiment score and the real thread’s (|0.2285 − (−0.0134)| ≈ 0.242). Then I normalised that gap against VADER’s full two-point range and flipped it, so accuracy = 1 − (gap ÷ 2): a perfect match is 100% and a complete miss is 0%.
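In code, that back-of-the-envelope metric is only a couple of lines, using the averages quoted above:

```python
# Gap between the synthetic and real average compound scores, normalised
# against VADER's full two-point (-1 to +1) range and flipped into "accuracy".
real_mean, synthetic_mean = -0.0134, 0.2285

gap = abs(synthetic_mean - real_mean)  # ~0.242
accuracy = 1 - gap / 2                 # perfect match = 100%, total miss = 0%
print(f"gap = {gap:.3f}, accuracy = {accuracy:.1%}")  # ~87.9%
```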

Put plainly, the bot crowd scored the headline almost +0.23 while real humans hovered around 0, a quarter‑point gap on VADER’s –1 to +1 scale—that’s the emotional distance between a neutral shrug and a clear thumbs‑up. A quick t‑test confirms the split isn’t noise (p ≪ 0.05), so our first synthetic pass was measurably too cheerful compared with the real‑world mood.
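The significance check itself is a single SciPy call. A minimal sketch using Welch’s unequal-variance t-test (my choice, given the very different group sizes); the toy score lists stand in for the per-reply compound columns:

```python
from scipy import stats

# Toy stand-ins for the per-reply VADER compound scores of each group.
real_scores = [-0.42, 0.05, -0.10, 0.00, -0.25, 0.12, -0.07]
synthetic_scores = [0.31, 0.22, 0.15, 0.40, 0.18, 0.27]

# Welch's t-test (equal_var=False) copes with unequal group sizes/variances.
t_stat, p_value = stats.ttest_ind(real_scores, synthetic_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p -> the gap isn't noise
```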

Breaking The 90% Barrier With Clones

Now that I had locked onto the Anthropic model that consistently delivered better sentiment accuracy for this experiment, I decided to run the test against a different virtual audience, Hacker News readers, because it’s one we’ve actually “designed” to some extent. Our synthetic Hacker News personas were generated by scraping and organising real commenters and their comments, then using Rally’s clone-via-text feature. Here’s an example persona.

How did it perform?

Anthropic (fast)

 

Right out of the gate, I was now hitting over 93% accuracy, well above the 85% threshold most scientific papers quote.

But a 93% mood-match is only the opening gambit; now I want to read the board. So I cracked open the negative buckets from both datasets and ran a battery of deeper tests: n-gram overlap to see whether the same buzz-words surface, TF-IDF “keyness” to spot which terms each crowd over-indexes on, and an LDA topic model compared via cosine similarity. In short, I lifted the hood to ask not just how angry they are, but angry about what, and whether the bots rage about the exact same talking points as the humans.
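For the n-gram piece of that battery, here’s one way to sketch the unigram-overlap check; the toy reply lists, stop-word set, and top-N cutoff are illustrative assumptions, not a record of the exact analysis:

```python
import re
from collections import Counter

# Toy stand-ins for the negative buckets of each corpus.
real_neg = [
    "cutting overtime pay means people make less for the same work",
    "32 hours sounds nice until you see your pay check",
]
synth_neg = [
    "this 32 hour idea sounds like jail time dressed up as freedom",
    "feels like misinformation about how work time really functions",
]

STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "it", "for", "in",
             "this", "that", "like", "you", "your", "how", "as", "up"}

def top_unigrams(texts, n=30):
    words = (w for t in texts for w in re.findall(r"[a-z0-9']+", t.lower()))
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w for w, _ in counts.most_common(n)}

# The overlap of each side's most frequent words; in the real data the shared
# core was little more than "32", "work", and "time".
print(sorted(top_unigrams(real_neg) & top_unigrams(synth_neg)))
```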

From Mood to Meaning: Mapping Real vs Synthetic Pushback

Anthropic (fast) - Hacker News audience

The unigram‑overlap check exposed an immediate gulf: beyond the skeletal trio “32 / work / time,” the two vocabularies hardly touch. Human tweeters drill straight into pocket‑book anxiety—“hours,” “pay,” “overtime,” “people,” “make”—every third word a reminder that less clock‑time means less take‑home. The bots, meanwhile, orbit vague suspicion: hedges like “sounds,” “like,” the quirky metaphor “jail time,” and even a tech‑culture detour into “misinformation.”

Bottom line: we’ve matched the temperature of the vibe, but not the topic. Running the comparison again on only the negative-sentiment data gives the same picture: the outrage temperature still matches, but the topics don’t. Real people fear smaller pay-checks; faux people fear vague conspiracies.

Then I switched back to our general US audience. The synthetic responses did shift toward more concrete, policy-specific terms, but the comparison still shows my synthetic survey isn’t surfacing the actual economic pain points that Twitter users are vocalizing.


I decided to run a TF-IDF “keyness” analysis, which measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus). It told the same story we saw with the n-grams.
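Here’s a sketch of how such a keyness comparison can be wired up with scikit-learn; the toy corpora are illustrative, and the mean-TF-IDF-difference below is a simple stand-in for a formal keyness statistic:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the real and synthetic negative buckets.
real_neg = [
    "less hours means less pay and no overtime, that is the whole problem",
    "they will cut overtime and tax the rest, 40 hours already barely pays",
]
synth_neg = [
    "this sounds like a vague proposal that could spread misinformation",
    "it feels like jail time for productivity, the reasoning sounds thin",
]

docs = real_neg + synth_neg
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs).toarray()
terms = np.array(vec.get_feature_names_out())

# Keyness proxy: how much more weight a term carries in one corpus vs the other.
keyness = X[: len(real_neg)].mean(axis=0) - X[len(real_neg):].mean(axis=0)
order = np.argsort(keyness)
print("real over-indexes on:      ", terms[order[-5:]][::-1])
print("synthetic over-indexes on: ", terms[order[:5]])
```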

Analysis

Real tweeters overwhelmingly focus on financial and labor-related terms: “pay,” “overtime,” “tax,” “40” (the old workweek), plus frank profanity (“bullshit,” “fuck”) to signal strong emotional pushback. Makes sense. Real users are railing against hard numbers: what they’ll earn, what they’ll lose, how many hours or how much overtime they’ll be cut. Synthetic respondents are still operating in a more abstract, heuristic frame. I didn’t test smart mode here, but during my buying-committee simulation I found it produced more layered consideration factors, something I explore further here.

Trading A Lower Average Compound Score For Narrative Alignment

To make the topics behind the rage more representative in this experiment, and give it more predictive power for real-world backlash, I tried tweaking the simulator prompt using anchoring (think biased media diets), something covered in more depth here. Specifically, I added this: “Assume this 32-hour proposal cuts your overtime pay by 50% and you pay more taxes per hour. Write your main objection in your own words.”

Sentiment accuracy took a hit, now hovering just below 90%; however, this last round shows a big leap forward in narrative alignment.

 

 

We just saw how the synthetic poll is now talking money, not just metaphors, so the prompt tweak overcame the biggest early gap. I verified this with an LDA topic model, testing whether the “themes” in the negative post responses (e.g. “lost pay,” “overtime cut,” “quality of life,” “health”) emerge in both corpora and in roughly the same proportions. It produced a similarity of 0.944 (on a 0–1 scale), meaning the high-level thematic structure of the synthetic objections now very closely mirrors the real-tweet objections.
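For the curious, here’s roughly how an LDA-plus-cosine check like that can be assembled with scikit-learn; the toy documents, topic count, and vectorizer settings are my assumptions, and the 0.944 above comes from the real run, not this sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the negative responses from each corpus.
real_neg = [
    "lost pay and cut overtime will wreck my budget",
    "quality of life sounds great until the paycheck shrinks",
    "my health depends on overtime money, do not cut it",
]
synth_neg = [
    "cutting overtime pay hurts the people this claims to help",
    "a smaller paycheck outweighs any quality of life gains",
    "health and family time matter but lost wages matter more",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(real_neg + synth_neg)

# One LDA fit over both corpora, then average each side's topic mixture.
lda = LatentDirichletAllocation(n_components=4, random_state=0)
theta = lda.fit_transform(X)  # per-document topic proportions

real_profile = theta[: len(real_neg)].mean(axis=0, keepdims=True)
synth_profile = theta[len(real_neg):].mean(axis=0, keepdims=True)

# Cosine similarity of the two topic-proportion vectors (1.0 = identical mix).
print(cosine_similarity(real_profile, synth_profile)[0, 0])
```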

 

With more prompt refinements (especially around “overtime”), more effort spent testing audience-design techniques such as personification, or cloning digital twins from real survey data, I’m confident we could hit more of the real users’ top pain-point language and truly mirror actual online backlash. This highlights the importance of simulator prompting and virtual audience design. Just as prompt engineering became a well-paid skill, I suspect early movers here will gain some earning power.

 

Concluding Thoughts On Accuracy

I found that, with carefully calibrated simulator prompting, a small synthetic survey can reproduce both the sentiment polarity and the thematic contours of a large X conversation about a 32-hour workweek. After a few rounds of iterative refinement, anchoring respondents in concrete economic trade-offs and discouraging certain language, our synthetic polls touched just shy of 90% sentiment accuracy with a topic-model cosine similarity of 0.944 against the actual negative tweets. These metrics suggest that a well-designed iterative survey can serve as a rapid, low-cost proxy for real-time narrative analysis of online audiences.

While these alignment scores are high, they depended critically on prompt design: without explicit mention of pay and overtime, the fast model’s synthetic responses drifted into metaphors (“jail time,” “nonsense”). Future work should repeat the test with more carefully designed audiences modeled on real data.

Predicting Vibes

Cracking a single thread was encouraging, but comms teams don’t fight one fire; they juggle dozens every news cycle. So I zoomed out: could the same method forecast the emotional weather across the next day’s slate of posts, tens of thousands of replies, fast enough to let a brand pre-write its responses before the sparks even land? I tested it by running some of Elon’s posts through Rally as they were popping off, before too many commenters had arrived, and posted my predictions so the world could see.

Prediction 1

Rally Simulation

 

Results

Illegal Aliens | Real Commenters | Rally
Average Compound Score | -0.0044 | -0.0022
Sample Size | 4,709 responses | 100 responses

 

Prediction Accuracy: 99.9 % sentiment‑alignment

Synthetic and human replies both landed in the same mildly negative zone, so sentiment was almost a perfect overlay.

 



Prediction 2

Rally Simulation

 

Results

 

Censorship Org Deletion | Real Commenters | Rally
Average Compound Score | 0.1383 | 0.1162
Sample Size | 1,889 responses | 100 responses

 

Prediction Accuracy: 98.9 % sentiment‑alignment

Free‑speech framing is familiar territory for the personas, driving a tight match with real‑world reactions.



Prediction 3

Rally Simulation

 

Results

Impeach Judges | Real Commenters | Rally
Average Compound Score | 0.0548 | -0.6299
Sample Size | 8,051 responses | 100 responses

 

Prediction Accuracy: 65.8 % sentiment‑alignment
Bots missed the partisan heat and sarcasm, swinging hard negative while the crowd stayed only slightly positive. Note: In this prediction, I tried anchoring some bias. 

 

Prediction 4

 

Rally Simulation

Results

Fertility Decline | Real Commenters | Rally
Average Compound Score | 0.0284 | -0.2765
Sample Size | 11,395 responses | 100 responses

 

Prediction Accuracy: 84.8% sentiment‑alignment

Doom‑laden language still pushed the simulation more negative than the mixed human mood. Note: In this prediction, I tried anchoring some bias. 

Analysis 

The model crushed the first two headlines, hitting 99.9% and 98.9% alignment, because both tweets leaned on the same “free-speech vs. establishment” framing our generic U.S. personas know by heart. It stumbled on Impeach Judges (65.8%) and only partially recovered on Fertility Decline (84.8%), but those were the two predictions where we injected anchoring bias. If the earlier learnings hold, we traded some sentiment alignment for a boost in narrative alignment (still to be confirmed).


Final Thoughts

“Accuracy” is not a universal trophy; it’s a tool that changes shape with the job. If your brief is early-warning sentiment radar, a ±5% gap is stellar. If you’re writing rebuttal copy, narrative overlap (why people are mad) matters more than the mood percentage. What this experiment shows is that with the right prompt spices and audience recipes you can get both: sentiment fidelity just shy of 90% and a 0.94 cosine hit on the actual talking points.

We’re still at v1 here: different issues, media diets or linguistic quirks will need different calibration tricks. But the pathway is clear—and if you’re a comms, policy or product team that wants a bespoke synthetic audience (or just someone to stress‑test your messaging before the mob shows up) drop us a line, or sign up to one of our plans and give it a go yourself. We’ll help you decide which “accuracy” really moves the needle for your use‑case, then build the simulator to beat it.

 


Rhys Fisher

Rhys Fisher is the COO & Co-Founder of Rally. He previously co-founded a boutique analytics agency called Unvanity, crossed the Pyrenees coast-to-coast via paraglider, and now watches virtual crowds respond to memes. Follow him on Twitter @virtual_rf
