🎯 Calibration Experiment

Q1: Character Designs for a Conquest MMO (Seaborne)

Mike Taylor
Related: Concept Testing

This experiment tested whether players prefer realistic, Pixar-style character designs or blocky, polygon-style designs for a conquest MMO. AI personas showed a 96% preference for the realistic design, compared with 86% of human respondents: they identified the correct winner but overstated its margin.

📊 Hypothesis

"Realistic character designs will be preferred over blocky designs by mobile gamers due to their polished appearance and broader market appeal, making them more suitable for mainstream MMO audiences."

Introduction

Character design represents one of the most critical early decisions in game development, as it influences downstream design choices and can be extremely costly to change later. This experiment compares two distinct visual approaches for Seaborne, a conquest MMO, to determine which character style resonates more strongly with target players and why certain aesthetic choices drive preference.

🔬 Methodology

Our first question for developing the Seaborne game concept is: “Which of these character designs for a conquest MMO do you like better, and why?” We have two designs: one more realistic, or Pixar-like, and one more blocky, or polygon-style. We want to know which to go with, because the choice affects many downstream design decisions as we develop the game. Getting it wrong and changing it later would be supremely costly.

Data Source: James Cramer, Skunkworks

Audience: US Mobile Games

Core demographics:
- Geographic focus: United States-based mobile game players
- Age distribution: primarily middle-aged gamers; the largest segment is 35-44 years old (42%), followed by equal shares of 25-34 and 45-54 year-olds (20% each)
- Gender split: male-dominated (66%), with significant female representation (30%), plus small percentages of non-binary and other gender identities

Gaming preferences (diverse interests across multiple genres):
- Top categories: Adventure (11.5%), Strategy (11%), and Role-playing (10%)
- Secondary interests: Word (9%), Card (7%), Trivia (7%), and Simulation (7%)
- Action & casual: Action (12%), Racing (5.5%), and Sports (5.5%)
- Niche segments: Educational (3.5%), Casino (3%), Family (1.5%), and Music (0.5%)

Simulator: chat

📈 Results

Performance Metrics

Metric: Alignment, 90% achieved (baseline 0%, optimized 90%)

| Option | PickFu (Human) | Rally |
|---|---|---|
| Realistic (A) | 86% | 96% |
| Blocky (B) | 14% | 4% |
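The page does not say how the alignment score is computed. One formula that reproduces the reported 90% is 100 minus the total variation distance (in percentage points) between the human and Rally preference splits. The sketch below assumes that formula; `alignment` and `same_winner` are hypothetical helper names, not Rally's documented metric or API.

```python
# Hypothetical alignment score: 100 minus the total variation distance
# (in percentage points) between the human and model preference splits.
# This formula is an assumption that happens to reproduce the reported 90%.

def alignment(human: dict[str, float], model: dict[str, float]) -> float:
    """Return 100 - TVD, where TVD = 0.5 * sum of absolute share gaps."""
    tvd = 0.5 * sum(abs(human[option] - model[option]) for option in human)
    return 100.0 - tvd


def same_winner(human: dict[str, float], model: dict[str, float]) -> bool:
    """True if both splits rank the same option first."""
    return max(human, key=human.get) == max(model, key=model.get)


# Shares from the results of this experiment.
HUMAN = {"Realistic (A)": 86.0, "Blocky (B)": 14.0}  # PickFu
RALLY = {"Realistic (A)": 96.0, "Blocky (B)": 4.0}   # Rally personas

print(alignment(HUMAN, RALLY))    # 90.0
print(same_winner(HUMAN, RALLY))  # True
```

With only two options, the distance equals the gap on either option (10 points here), so the score reduces to 100 minus Rally's overestimate; the same form would also extend to tests with more than two options.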

One of the things I find most helpful is seeing not just which option was chosen, but why. Respondents liked the polished look and broader appeal of the realistic design, whereas the blocky design was too reminiscent of Minecraft. If we ran this test with a younger audience, we might see a completely different preference, which highlights how important it is to know which customers fit your ideal profile.

🔍 Analysis

Rally correctly identified the realistic design as the favorite, though it overestimated its share (96% vs. 86%). That result came from running Rally with Google rather than OpenAI, in Smart mode (the larger Google Gemini 2.0 model): it captured the strong human preference while overshooting the margin. With OpenAI in Fast mode (GPT-4o mini), Rally slightly preferred the blocky design instead. This highlights the importance of calibration: checking that the model and audience you choose give results that match your past experiments.

💡 Conclusions

Rally identified the correct winner but was overconfident, predicting a 96% preference versus the actual 86% human preference, a 10-point overestimate. The AI captured the core insight that realistic designs have broader appeal because of their polished appearance, and it correctly identified that the blocky design reminded players too much of Minecraft. This experiment suggests that AI testing gives directionally accurate results but may overestimate the margin of preference, so its results should be calibrated against historical data.
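One simple way to act on historical data is to shrink the model's margin over 50% by the ratio seen in a past test with known human results. The sketch below fits that ratio from this experiment's single data point (86% human vs. 96% Rally). The linear shrinkage rule and the names `shrink_factor` and `calibrate` are illustrative assumptions, not a Rally feature, and the 80% input is a hypothetical future prediction.

```python
# Hypothetical calibration sketch: shrink a model's margin over 50% by the
# ratio observed in a past test with known human results. The linear rule
# and function names are illustrative assumptions, not a Rally feature.

def shrink_factor(human_share: float, model_share: float) -> float:
    """Human margin over 50% divided by the model's margin over 50%."""
    if model_share == 50.0:
        raise ValueError("model margin is zero; factor is undefined")
    return (human_share - 50.0) / (model_share - 50.0)


def calibrate(model_share: float, k: float) -> float:
    """Scale the model's margin by k and clamp the result to [0, 100]."""
    adjusted = 50.0 + k * (model_share - 50.0)
    return min(100.0, max(0.0, adjusted))


# Fit k on this experiment: 86% human vs. 96% Rally for the realistic design.
k = shrink_factor(86.0, 96.0)  # 36 / 46
print(round(k, 3))                   # 0.783
print(round(calibrate(96.0, k), 1))  # 86.0, by construction
print(round(calibrate(80.0, k), 1))  # 73.5 for a hypothetical 80% prediction
```

A factor fitted on one test is noisy, and it only applies to the model and audience it was measured with; several past experiments on the same configuration would give a more trustworthy estimate.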

About the Researcher

Mike Taylor

Mike Taylor is the CEO & Co-Founder of Rally. He previously co-founded a 50-person growth marketing agency called Ladder, created marketing & AI courses on LinkedIn, Vexpower, and Udemy taken by over 450,000 people, and published a book with O’Reilly on prompt engineering.