OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang
Published: 2025-06-05

🔥 Key Takeaway:
While adding human-like background details and thoughtful rationales to AI personas does improve customer simulation accuracy (per the OPeRA paper), success comes from strategically including relevant persona cues and reasoning rather than dumping all available information.
🔮 TLDR
The OPeRA study presents a comprehensive dataset capturing authentic human online shopping sessions, encompassing detailed user personas, browser actions, context, and users’ self-reported rationales for their choices. This dataset, currently available upon request, comprises 36,691 action-observation pairs and 721 rationales from 54 users. The authors utilized this data to benchmark four leading LLMs (GPT-4.1, Claude-3.7, DeepSeek-R1, Llama-3.3) on their ability to predict users’ next actions and the reasons behind them, given interaction history and persona details. GPT-4.1 outperformed the rest but only managed to achieve 18.5% exact match accuracy for next action prediction and 27.1% when jointly generating action and rationale. Other models scored significantly lower, particularly on personalization. The removal of persona or rationale history consistently led to a decrease in accuracy, indicating their importance for realistic simulation. However, most models found it challenging to effectively utilize persona data. The study suggests the following: (1) collect or simulate detailed, stepwise real user action traces with rich context and rationales for training/benchmarking AI audience models, (2) ensure personas are comprehensive and behavior-grounded, (3) measure accuracy using exact match on concrete behaviors, not just plausibility, and (4) acknowledge that current LLMs—even top-tier ones—still fall significantly short of human-level behavioral fidelity, especially for personalized or complex decision flows.
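To make point (3) concrete, here is a minimal sketch of what exact-match scoring on concrete behaviors can look like. The field names (`type`, `name`, `text`) follow the action space in the prompts below; the paper's precise matching rules may differ.

```python
# Sketch of an exact-match check for predicted next actions. Field names
# follow the action space in the prompts below; the paper's exact
# matching rules may differ.

def exact_match(predicted: dict, gold: dict) -> bool:
    """An action counts as correct only if its type and key arguments agree."""
    if predicted.get("type") != gold.get("type"):
        return False
    # Compare the type-specific fields that make the action concrete.
    fields = {"input": ("name", "text"), "click": ("name",)}
    return all(
        predicted.get(f) == gold.get(f)
        for f in fields.get(gold.get("type"), ())
    )

def accuracy(pairs: list) -> float:
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)

# A correct "click" prediction and a wrong "input" prediction score 0.5.
pairs = [
    ({"type": "click", "name": "buy_now"}, {"type": "click", "name": "buy_now"}),
    ({"type": "input", "name": "search", "text": "shoes"},
     {"type": "input", "name": "search", "text": "sneakers"}),
]
print(accuracy(pairs))  # 0.5
```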
📊 Cool Story, Needs a Graph
Table 3: Evaluation of actions in next action prediction task

Side-by-side accuracy comparison of all LLM baselines and ablations for next action prediction.
Table 3, on page 8 of the paper, presents a clear, side-by-side comparison of four LLM baselines (GPT-4.1, Claude-3.7, DeepSeek-R1, Llama-3.3) and their ablations (removal of persona and rationale inputs) on the next action prediction task. It reports both exact match accuracy and F1 score for each model and ablation, making it easy to compare the proposed method (full-input GPT-4.1) against all other configurations in a single view. The tabular grid format highlights the relative performance drops when persona or rationale inputs are omitted, establishing the strengths and limitations of each configuration in replicating user behavior.
⚔️ The Operator's Edge
A subtle but critical detail in this study is the use of *just-in-time, action-triggered rationale prompts*: the system doesn't ask for generic explanations, but instead collects rationales from participants exactly when they perform meaningful actions (e.g., clicking "Buy Now" or using a filter), and only for a random subset of those moments. This creates a dataset where the reasoning is tightly coupled to the behavior and context, rather than being post-hoc or hypothetical.
Why it matters: Most experts might focus on the overall amount of rationale data or the richness of the persona surveys, but the real signal comes from capturing reasoning at decision points, in context. This minimizes recall bias and encourages explanations that are both relevant and actionable for modeling. In AI-driven simulations, this means the model can learn not just what people do, but why they do it—at the precise moment of choice—leading to more realistic and predictive synthetic agents.
Example of use: In a digital product test, a team could instrument their prototype so that when a user (or an AI persona) clicks "Sign Up" or abandons a cart, the system immediately prompts for a reason. Feeding these *action-linked rationales* into a training or simulation loop would help the AI not only anticipate drop-off points, but also generate plausible explanations for them, enabling much sharper UX or marketing interventions.
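A hypothetical sketch of that instrumentation in Python; the event names, sampling rate, and `ask_user` callback are stand-ins, not the paper's logging stack:

```python
import random
import time

# Hypothetical instrumentation sketch: prompt for a rationale at the moment
# a meaningful action fires, and only for a random subset of those moments,
# mirroring the paper's just-in-time sampling idea.
MEANINGFUL_ACTIONS = {"buy_now_click", "filter_applied", "cart_abandoned", "sign_up_click"}
SAMPLE_RATE = 0.3  # ask only ~30% of the time to limit interruption

def on_user_action(event: str, context: dict, ask_user, log: list) -> None:
    if event in MEANINGFUL_ACTIONS and random.random() < SAMPLE_RATE:
        rationale = ask_user("Why did you do that just now? (one short sentence)")
        log.append({
            "ts": time.time(),
            "action": event,
            "context": context,      # e.g., page URL, product shown
            "rationale": rationale,  # captured in context, not post-hoc
        })

# Example wiring with a stubbed ask_user:
log = []
on_user_action("buy_now_click", {"page": "/product/123"}, lambda q: "It was on sale.", log)
print(log)
```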
Example of misapplication: If a team instead collects rationales in bulk at the end of a session or as part of a survey ("Why did you shop with us today?"), those explanations are likely to be generic, diluted, or influenced by hindsight. Training AI personas on this kind of after-the-fact reasoning risks making their simulated justifications sound either too rationalized or too vague, missing the actionable, context-specific cues that drive actual behavior. The result: synthetic research that appears thoughtful but falls short in predicting real-world choices at critical moments.
🗺️ What are the Implications?
• Include step-by-step context and reasoning in your audience simulations: The study shows that providing AI personas with detailed histories of past actions and self-reported rationales (not just demographics) leads to more accurate and realistic predictions of what people will do next.
• Don’t skip the “why”—ask for rationales, not just choices: When running surveys or user tests with virtual audiences, include prompts that ask personas to explain their reasoning. This extra detail helps the simulation better mimic real decision-making processes.
• Use rich, real-world persona profiles—not just age and gender: Simulations that used fuller persona details (shopping habits, personality traits, and preferences) outperformed those relying on simple or generic demographic data.
• Test ablations before relying solely on AI audience results: Removing persona or rationale data caused consistent drops in accuracy across all models. Before running large-scale synthetic studies, test your setup with and without these elements to see how much they matter for your specific questions (see the ablation sketch after this list).
• Expect limitations in simulating complex or personalized behavior: Even the best models only predicted user actions with 18–27% exact-match accuracy, so synthetic audiences are not yet a full substitute for real user testing—use them for early exploration, not final go/no-go product decisions.
• Choose proven models and prompt designs for critical studies: GPT-4.1 outperformed other leading AIs by a wide margin, especially when given full persona and rationale context. If accuracy is critical, opt for the best model and detailed input prompts, even if they cost more.
• Invest in collecting richer user data for future simulation power: This research suggests that building up your own library of real user journeys, including both actions and reasons, will make future virtual audience studies more reliable and valuable.
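To make the ablation check from the list above concrete, here is a minimal harness sketch; `predict` stands in for whatever model call you use, and the session fields (`history`, `persona`, `rationales`, `gold_action`) are illustrative names, not the paper's schema.

```python
# Ablation harness sketch for the "test ablations" bullet above.
def run_ablations(sessions, predict, match):
    configs = {
        "full":         lambda s: predict(s["history"], s["persona"], s["rationales"]),
        "no_persona":   lambda s: predict(s["history"], None,         s["rationales"]),
        "no_rationale": lambda s: predict(s["history"], s["persona"], None),
    }
    return {
        name: sum(match(make(s), s["gold_action"]) for s in sessions) / len(sessions)
        for name, make in configs.items()
    }

# If "full" beats both ablations by a wide margin on your own sessions,
# persona and rationale context are doing real work for your questions.
```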
📄 Prompts
Prompt Explanation: The AI was instructed to role-play as a real online shopper and predict the immediate next action based on provided persona, action, rationale, and context, using a strict JSON output format.
```python
PROMPT_FOR_ACTION_PREDICTION = """
Your task is to predict the immediate next action of a shopper.
You need to pretend that you are a real user shopping on amazon.com.
The history action, rationale, context and the user persona will be provided to you.
Ensure your prediction follows natural behavior sequences (e.g., users may click a search box before typing, type a query before clicking search)

# Action Space
An action is represented in JSON format, and there are four primary types of actions:

#### 1. `input`:
Type text into an input field.
{
  "type": "input",
  "name": "input_name",
  "text": "input_text"
}

#### 2. `click`:
Click on a button or clickable element identified by `name` or `selector`; it's further classified with `click_type` including `search`, `filter`, `other`.
{
  "type": "click",
  "name": "clickable_name",
  "click_type": "click_type",
  "selector": "selector"
}

#### 3. `terminate`:
When you are unsatisfied with the current search result and you don't want to buy anything, use `terminate` to indicate that you want to close the browser window and terminate the task.
{
  "type": "terminate"
}

#### 4. `purchase`:
Indicate intent to purchase by clicking on any purchase-related button. This includes clicking "Add to Cart", "Buy Now", "Set Subscription", or "Checkout" buttons. Any of these actions are considered as showing purchase intention.
{
  "type": "purchase"
}

# Rationale
Rationale is the reason why the user takes the action. Some of the rationale is provided to you.

# Context
Your context will be the HTML of the amazon page you are looking at. Some interactable elements will be added a unique "name" attribute, which you can use to identify the element to interact with (click or input).

# Persona
The user persona reflects the user's demographics, personality, and shopping preference. First identify which aspects of the persona might be relevant to the current shopping context, then consider them only if they naturally align with the ongoing shopping journey. DO NOT RELY ON IT.

# Output Format
You need to predict the next action. Your output should follow a strict JSON format:
{
  "type": "",
  ...
}
OUTPUT A SINGLE JSON OBJECT, NOTHING ELSE.
"""
```
Prompt Explanation: The AI was guided to simulate a real shopper by generating both a rationale and the next action in a strict JSON format, using short, realistic, human-like reasoning.
```python
PROMPT_FOR_RATIONALE_ACTION_PREDICTION = """
You are simulating a real shopper using Amazon.com.
Your task is to generate a *short rationale (several simple words)* that explains why the user takes a next shopping action.
You need to pretend that you are a real user shopping on amazon.com.
The history action, rationale and context will be provided to you.
Ensure your prediction follows natural behavior sequences (e.g., users may click a search box before typing, type a query before clicking search)
The rationale should reflect a natural, human-like thought process behind the action.
If there are previous rationales, you should follow the style of the previous rationales.
Use short, realistic sentences. Example:
- "I saw a discount."
- "I want to compare more options."
- "I'm looking for something with better reviews."

Same as above ...

# Output Format
You need to predict the next action AND rationale. Your output should follow a strict JSON format:
{
  "rationale": "",  // rationale goes here, a string
  "action": {
    // action goes here
    "type": "",
    ...
  }
}
OUTPUT A SINGLE JSON OBJECT, NOTHING ELSE.
"""
```
⏰ When is this relevant?
An online retailer wants to understand how different types of shoppers would navigate and make decisions on a new website layout, without running a full live test with real users. The team wants to simulate the click, search, and purchase actions of three common customer personas—deal hunters, brand loyalists, and first-time visitors—and see where each gets stuck or decides to buy.
🔢 Follow the Instructions:
1. Define persona profiles: For each customer type, write a short profile including their goals, habits, and attitudes. For example:
• Deal hunter: Looks for discounts and coupons, compares lots of products, rarely pays full price.
• Brand loyalist: Prefers well-known, trusted brands, values product reviews, less concerned with price.
• First-time visitor: Unfamiliar with the website, values simplicity, may hesitate if the process is confusing.
2. Prepare scenario and context description: Write a brief summary of the website situation (e.g., “You land on the homepage and want to find and buy a pair of sneakers. There are several sections: search bar, category menu, featured deals, and a coupon popup.”)
3. Create the simulation prompt template: Use the following format for each persona:
You are simulating the following customer persona:
[Insert persona profile here]
Scenario: [Insert website and shopping context here]
Based on your persona, describe your next action (e.g., click, search, view product, add to cart, or leave the site) and explain why you made that choice in one or two sentences.
Respond using this JSON format:
{
  "action": "[describe action]",
  "rationale": "[short explanation]"
}
4. Simulate step-by-step navigation: For each persona, run the prompt through the AI model, updating the “context” after each action (e.g., “Now you are on the product listing page after searching ‘sneakers’…”) and repeat until the persona either purchases, leaves, or gets stuck. A minimal code sketch of this loop follows the list.
5. Repeat for multiple runs: For each persona, generate 5–10 simulated journeys, varying the initial product or using random small tweaks in the scenario wording to see a range of behaviors.
6. Summarize common actions and obstacles: Collect actions and rationales across all runs. Note where each persona type tends to drop off, get confused, or complete the purchase. Tag rationales for mentions of price, brand, ease of use, confusion, or positive comments.
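For step 4, a minimal Python sketch of the simulation loop, assuming a `call_model` function that returns the JSON reply defined in step 3; the stopping keywords and the context update are deliberate simplifications:

```python
import json

# Sketch of the step-4 loop: prompt, parse, update context, repeat.
def simulate_journey(persona: str, context: str, call_model, max_steps: int = 15) -> list:
    steps = []
    for _ in range(max_steps):
        prompt = (
            f"You are simulating the following customer persona:\n{persona}\n\n"
            f"Scenario: {context}\n\n"
            "Based on your persona, describe your next action (e.g., click, search, "
            "view product, add to cart, or leave the site) and explain why you made "
            "that choice in one or two sentences.\n"
            "Respond using this JSON format:\n"
            '{"action": "[describe action]", "rationale": "[short explanation]"}'
        )
        step = json.loads(call_model(prompt))
        steps.append(step)
        if any(k in step["action"].lower() for k in ("purchase", "buy", "leave", "checkout")):
            break  # journey ended
        # Update the context to reflect the new page state (step 4).
        context = f"After you chose to {step['action']}, you now see the resulting page."
    return steps

# Example wiring with a stubbed model that buys on the first step:
journey = simulate_journey(
    "Deal hunter: looks for discounts, compares lots of products.",
    "You land on the homepage and want to buy sneakers.",
    lambda p: '{"action": "buy now", "rationale": "It was on sale."}',
)
print(journey)
```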
🤔 What should I expect?
You’ll get a clear, step-by-step map of how different customer types might move through the new website, including what drives their actions and where pain points occur. This insight lets the team quickly spot design issues, messaging gaps, or areas for further A/B testing—without waiting for real user data.