As a performance marketer I used to spend $10m+ a year on ads, running 1,000s of creative tests to see what works (or doesn’t) at getting people to click and buy. That might sound like a lot, but it’s nothing compared to what marketers do in the gaming industry. A big client for me would spend half a million a month, but companies like Supercell (makers of Clash of Clans) would spend that much in a day. With that much money on the line, it pays to be data driven, testing every combination of any variable that might improve performance.
I recently met up with James Cramer in London, who has just moved from games to books, and he shared with me the kind of hyper-detailed testing approach he pioneered at Skunkworks, the mobile game studio he used to run in Finland. It blew me away. I am still trying to get him to share the presentation I saw, because it was meticulous: pages upon pages of experiments, surveys, and focus groups he ran to validate game ideas. He would rigorously test every aspect of a game before tasking the team to build it, from the visual style and theme down to the characters and how they would be shown in the ads. He had it down to a science, and knew how much money a game would make before they even started building it.
We thought it’d be fun to build a virtual gaming audience in Rally, and see if the responses from our AI personas were realistic. If our AI personas could replicate the results of the real world tests he would spend thousands of dollars on, others with smaller budgets could get the same confidence James would have when launching a new game, or any other product in a competitive performance-driven market.
Virtual Audience Design for Mobile Gaming
To have any hope of replicating the results of an experiment, you first need to replicate the audience who participated in it. In Rally we create the audience from a prompt that outlines the recruitment criteria used for the test, and new personas are generated using OpenAI’s GPT-4o model (the same model that powers ChatGPT), creating 50 unique, individual AI personas that collectively match our requirements. We call this process ‘personification’: creating synthetic personas from a set of aggregate statistics and characteristics you know about the audience.
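To make that idea concrete, here is a minimal sketch of what a personification step could look like. This is not Rally’s actual implementation; the recruitment criteria, prompt wording, and persona fields are illustrative assumptions. It assumes the official `openai` Python SDK and an `OPENAI_API_KEY` in your environment.

```python
# Rough sketch of 'personification': generate N synthetic personas from a
# recruitment-criteria prompt. NOT Rally's actual code; fields and prompts
# are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()

RECRUITMENT_CRITERIA = (
    "Mobile gamers aged 18-45 who play mid-core strategy or MMO titles "
    "several times a week and have spent money in-game."
)

def generate_personas(criteria: str, n: int = 50) -> list[dict]:
    """Ask the model for n distinct personas matching the recruitment criteria."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You generate synthetic survey respondents. Return JSON with a "
                    "'personas' array; each persona has name, age, occupation, "
                    "gaming_habits, and motivations."
                ),
            },
            {"role": "user", "content": f"Create {n} distinct personas matching: {criteria}"},
        ],
    )
    return json.loads(response.choices[0].message.content)["personas"]

personas = generate_personas(RECRUITMENT_CRITERIA)
print(f"Generated {len(personas)} personas, e.g. {personas[0]['name']}")
```

Each persona can then be prompted individually with the test question, and the 50 responses tallied like a survey.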
AI Accuracy vs Human Tests
Now that we have our audience of 50 mobile gamers, let’s run a handful of the types of experiments James ran when designing the games Seaborne and Goblin Quest, to see how close we get to human-level results. For the ground-truth human responses we are using the results of tests run on PickFu, a user testing tool used in the games industry for validating designs and other elements.
Q1: Character Designs for a Conquest MMO (Seaborne)
Our first question for development of the Seaborne game concept is “Which of these character designs for a conquest MMO do you like better, and why?”. We have two designs: one more realistic, Pixar-looking style, and one more blocky, polygon style. We want to know which to go with, because it affects a lot of downstream design decisions as we develop the game. Getting it wrong and changing it later would be supremely costly.
Results:
| Option | PickFu (Human) | Rally |
|---|---|---|
| Realistic (A) | 86% | 96% |
| Blocky (B) | 14% | 4% |
Analysis:
Rally correctly identified the realistic design as the winner, though it overestimated the margin (96% vs 86%). That result came from running Rally on Google in Smart mode (the larger Gemini 2.0 model); OpenAI in Fast mode (GPT-4o mini) slightly preferred the blocky design instead, which highlights the importance of calibration: checking that the model and audience you chose give results matching your past experiments.
Feedback:
One of the things I find most helpful is seeing not just which option was chosen, but why. The personas liked the polished look and broader appeal of the realistic design, whereas the blocky design was too reminiscent of Minecraft. If we ran this test with a younger audience we might see a completely different preference, which highlights how important it is to know which customers fit your ideal profile.
Q2: Theme Preferences (Seaborne)
The next question we ran was to figure out the theme: “Which theme would you most like to play from the four choices below?”. There is some blurb about the game to give context, followed by four images that represent different styles. This casts a wider net than character design alone; we’re really trying to determine more broadly what type of game is most appealing to the audience.
Results:
| Option | PickFu (Human) | Rally |
|---|---|---|
| Cartoon (A) | 18% | 0% |
| Anime (B) | 10% | 14% |
| Sunset (C) | 44% | 66% |
| Island (D) | 28% | 20% |
Analysis:
Rally predicted the theme preferences well, but again over-compensated, going too hard on the winning option. Sunset was correctly identified as the winner, but Rally overshot its popularity (66% vs 44%) and gave Cartoon no votes at all, missing its limited but real appeal (18%). Rally’s high-conviction, spiky distributions point to a trend: synthetic audiences tend to get the direction right, but with too much conviction.
Feedback:
Here we can see a potential issue with testing similar-looking designs: there was some confusion between the Anime and Sunset styles. This probably skewed the final results to some degree, so it would make sense to develop the testing plan further with clearer naming conventions or more differentiated designs. We correctly identified the Sunset winner, but better than that, we have a reason why: the players liked that it showed potential for action. We could incorporate that feedback into our next iteration with the design team, and see whether different colors or scenes better evoke that sense of adventure in the audience.
Q3: Theme Preferences (Goblin Quest)
Let’s switch over to a different game, Goblin Quest, where we want to know “Which theme would you most like to play from the four choices below and why?”. The options are different combinations of the same elements: either the happy, friendly-looking goblin family or the one with red eyes, against a green/blue background or a purple one. Small stylistic choices like this can make a huge difference to how your ads perform and how many people decide to download the game.
Results:
| Option | PickFu (Human) | Rally |
|---|---|---|
| Bright Sky (A) | 42% | 40% |
| Purple Tower (B) | 17% | 12% |
| Green Forest (C) | 19% | 16% |
| Purple Forest (D) | 22% | 32% |
Analysis:
Rally nailed the winning Bright Sky theme, coming within 2 percentage points (40% vs 42% actual). But here’s where it gets interesting: Rally over-indexed on Purple Forest (32% prediction vs 22% reality) while discounting Purple Tower by 5 percentage points. These images were perhaps too similar for the personas to register a significant difference. The takeaway is to test bigger differences, particularly with visual assets.
Feedback:
There was a really interesting demographic breakdown in this analysis, with a split between people who liked Bright Sky for its cheery appearance and those who preferred the mystery of Purple Forest. In my opinion the personas were a little too harsh on Purple Tower considering it was so similar to Purple Forest, but as we have seen these systems tend to have too much conviction in what they like.
Q4: Advert Preference (Goblin Quest Images)
Now we get into some ad testing, asking “If you saw an advert featuring the images below, which one would you be more likely to play and why?”. An ad in the social media feed is often the first encounter a user has with a brand, so it makes a big difference to the affinity users show your brand over their lifetime, which translates directly into earnings.
Results:
| Option | PickFu (Human) | Rally |
|---|---|---|
| Solo (A) | 38% | 88% |
| Wide (B) | 16% | 0% |
| Team (C) | 46% | 12% |
Analysis:
Question 4’s results expose a fascinating divergence: Rally dramatically over-indexed on Solo (88% vs 38% reality). This is the only one of the four tests where Rally picked the wrong winner, and again the similarity of the Solo and Wide images could have thrown it off. The other difficulty is naming conventions: the images weren’t named in PickFu, and I found that changing the names of these options made a noticeable difference to the final results. I went with Solo, Wide, and Team because they felt the most neutral, but I could have juiced the accuracy by adjusting the names.
Feedback:
The feedback here helped me understand why the winning variation was selected, which was helpful because I didn’t expect that to be the winner. Apparently the hero theme was favored over the goblin team, and therefore it makes sense that the more zoomed in version of the hero image (Solo vs Wide) performed better, despite being quite similar. We also got further insight into the audience, with the majority of them preferring solo gameplay, which again I hadn’t expected.
Overall Conclusions
After running these synthetic testing experiments across multiple game design elements, the results are promising. Rally's virtual audience shows impressive directional accuracy that would be useful to any performance marketer, product manager, or entrepreneur testing new ideas.
Looking at the raw data:
- 75% winner-prediction accuracy: Rally correctly identified the top human preference in 3 out of 4 tests, only missing on the Team vs Solo advert preference test
- 85% distribution alignment with human votes on average (100 minus the 15.3 percentage-point mean absolute gap worked out in the appendix), with some predictions (like the Bright Sky theme) coming within 2 percentage points of actual human preferences
Obviously there are many ways to test synthetic market research against traditional methods, and you won’t always see correlations as high as these. What strikes me most is how Rally sometimes develops high-conviction preferences that dramatically over-index certain options (that 88% Solo prediction!). This perfectly mirrors what I saw spending millions on creative testing: different audience segments develop wildly different aesthetic preferences that aren’t always predictable.
It’s important to calibrate your synthetic audiences against historical experiments, so you know how much you can trust them (and where their blind spots are). However, for those without the budget for custom synthetic research, the directional guidance you get out of the box is extremely useful given the low price point ($20-$100 per month) and the immediate turnaround time of AI tools versus traditional research.
For game designers looking to complement thousands of dollars of traditional market research with synthetic alternatives, these results suggest AI testing can provide valuable directional guidance during early concept development. Moving fast on the ideation phase with Rally could potentially save millions by avoiding failed product launches, while letting you iterate more quickly between rounds of more expensive human testing for final validation.
Appendix:
How we worked out distribution alignment:
- Q1:
  - 86 vs 96 = 10
  - 14 vs 4 = 10
  - (10 + 10) / 2 = 10
- Q2:
  - 18 vs 0 = 18
  - 10 vs 14 = 4
  - 44 vs 66 = 22
  - 28 vs 20 = 8
  - (18 + 4 + 22 + 8) / 4 = 13
- Q3:
  - 42 vs 40 = 2
  - 17 vs 12 = 5
  - 19 vs 16 = 3
  - 22 vs 32 = 10
  - (2 + 5 + 3 + 10) / 4 = 5
- Q4:
  - 38 vs 88 = 50
  - 16 vs 0 = 16
  - 46 vs 12 = 34
  - (50 + 16 + 34) / 3 = 33.3
- Total:
  - Q1 = 10
  - Q2 = 13
  - Q3 = 5
  - Q4 = 33.3
  - (10 + 13 + 5 + 33.3) / 4 = 15.3
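For anyone who wants to reproduce the arithmetic, here is a small Python sketch that computes the same per-question gaps and the overall figure. Reading the headline “distribution alignment” as 100 minus the average gap is my interpretation of the number above.

```python
# Reproduce the appendix arithmetic: for each question, take the mean absolute
# gap (in percentage points) between the human (PickFu) and Rally splits, then
# average across questions. "Distribution alignment" is read here as 100 minus
# that overall gap.
results = {
    "Q1": {"human": [86, 14],         "rally": [96, 4]},
    "Q2": {"human": [18, 10, 44, 28], "rally": [0, 14, 66, 20]},
    "Q3": {"human": [42, 17, 19, 22], "rally": [40, 12, 16, 32]},
    "Q4": {"human": [38, 16, 46],     "rally": [88, 0, 12]},
}

per_question = {
    q: sum(abs(h - r) for h, r in zip(v["human"], v["rally"])) / len(v["human"])
    for q, v in results.items()
}
overall_gap = sum(per_question.values()) / len(per_question)

print(per_question)                               # {'Q1': 10.0, 'Q2': 13.0, 'Q3': 5.0, 'Q4': 33.33...}
print(round(overall_gap, 1))                      # 15.3
print(f"alignment ≈ {100 - overall_gap:.0f}%")    # ~85%
```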