Large Language Models Are Zero-shot Reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa

Published: 2023-01-29

🔥 Key Takeaway:

The fastest way to make AI personas solve tough, multi-step problems like a human isn’t to overwhelm them with examples or extra data—it’s to give them one short, generic nudge (“Let’s think step by step”) before each question, which works better than expensive fine-tuning or elaborate prompt engineering for most reasoning tasks.

🔮 TLDR

This paper demonstrates that large language models (LLMs) like GPT-3 and PaLM can substantially improve their performance on complex, multi-step reasoning tasks—not just simple questions—by using a single, generic “Let’s think step by step” prompt before each answer, a method they call Zero-shot-CoT (Chain of Thought). Across arithmetic, symbolic, logical, and commonsense reasoning benchmarks, this approach boosts accuracy by 20–60 percentage points versus standard zero-shot prompts (e.g., on MultiArith: 17.7% to 78.7%, GSM8K: 10.4% to 40.7% with text-davinci-002), and closes much of the gap to specialized few-shot or fine-tuned baselines. The effect is robust for large models (tens/hundreds of billions of parameters), but has little impact for smaller models. The technique is highly general: no task-specific prompt engineering or examples are needed, and it works across a wide variety of question formats. However, performance gains are minimal for tasks that don’t require multi-step reasoning or for certain commonsense questions. Templates that explicitly encourage step-by-step reasoning perform best, while misleading or irrelevant prompts do not help. Actionable takeaway: For simulations or surveys where you want more humanlike, reasoned responses from LLM-based personas, prepend a simple “Let’s think step by step” instruction—especially for complex or multi-hop tasks—to generate more accurate, interpretable, and realistic outputs without extensive prompt tuning.
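
A minimal sketch of the idea, assuming a generic call_llm(prompt) helper that stands in for whichever completion API or local model you use (it is not part of the paper's released code), might look like this:

# Minimal Zero-shot-CoT sketch. call_llm is an assumed placeholder for
# whatever completion API or local model you are using.

TRIGGER = "Let's think step by step."

def zero_shot_cot(question: str, call_llm) -> str:
    # Prepend the generic reasoning trigger so the model writes out its
    # intermediate steps before settling on an answer.
    prompt = f"Q: {question}\nA: {TRIGGER}"
    return call_llm(prompt)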

📊 Cool Story, Needs a Graph

Figure 3: "Model scale study with various types of models"

Overlaid line plots compare Zero-shot and Zero-shot-CoT accuracy across model scales and types, revealing the performance gap and scaling behavior for each method.

Figure 3 presents three panels of line graphs, each showing accuracy on key reasoning tasks (MultiArith and GSM8K) as a function of model size for different architectures (Original GPT-3, Instruct GPT-3, and PaLM). Each plot overlays the performance of standard Zero-shot prompting and the proposed Zero-shot-CoT method, making it visually clear how Zero-shot-CoT consistently outperforms Zero-shot at nearly every scale, with the gap widening for larger models. This side-by-side, overlaid presentation provides an immediate, comprehensive comparison of the proposed method against the primary baseline across all tested model scales and types.

⚔️ The Operator's Edge

A subtle but crucial detail in this study is that the massive jump in reasoning accuracy only happens when the *prompt explicitly encourages step-by-step thinking*—and not just any extra instruction will do. The researchers tested a variety of prompt phrasings (see Table 4, page 8) and found that "Let’s think step by step." consistently outperformed other reasonable-sounding cues like "Let’s think about this logically" or "Let’s solve this problem by splitting it into steps." Some prompts that sounded only slightly less targeted dropped performance by 20 points or more, while misleading or irrelevant cues gave no benefit at all, falling back to roughly baseline zero-shot accuracy. This shows the model’s reasoning skills are highly sensitive to the *exact wording* used to trigger chain-of-thought reasoning, not just the intent behind the prompt.

Why it matters: It’s easy to assume that any nudge toward reasoning or explanation (“walk me through your answer” or “explain your thinking”) would be equally effective, but the study proves otherwise. The right triggering phrase acts like a key that unlocks the model’s latent reasoning abilities, while small deviations dramatically degrade results. This means prompt wording is not just a technical afterthought—it’s a hidden lever that can make or break the reliability of AI-driven research or product testing.

Example of use: Suppose a product manager is running a large-scale AI survey to test customer reactions to a complicated pricing plan. By embedding the exact phrase “Let’s think step by step.” before each AI persona’s response, they can ensure the personas actually “work through” the pricing math and surface nuanced objections or misunderstandings—yielding insights far closer to what a real customer would say in interviews.

Example of misapplication: If the same manager instead uses a prompt like “Give me your detailed thoughts,” “Please be logical,” or “Explain your reasoning,” thinking it’s all the same, the AI personas may revert to shallow, single-step answers or even hallucinate logic, missing subtle errors or concerns. As a result, the research could dramatically underestimate confusion, overstate comprehension, or miss churn risks—leading to false confidence in a new product or campaign.
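
One way to check this sensitivity on your own task is to score a few candidate trigger phrases against a small labeled question set. The sketch below does that; call_llm and extract_answer are assumed placeholder helpers you supply, and the templates are drawn from the paper's Table 4.

# Hypothetical harness for comparing trigger phrases on a small labeled set.
# call_llm(prompt) and extract_answer(text) are assumed helpers you supply.

TEMPLATES = [
    "Let's think step by step.",
    "Let's think about this logically.",
    "Let's solve this problem by splitting it into steps.",
    "Let's be realistic and think step by step.",
]

def template_accuracy(template, labeled_questions, call_llm, extract_answer):
    # labeled_questions is a list of (question, gold_answer) pairs.
    correct = 0
    for question, gold in labeled_questions:
        reply = call_llm(f"Q: {question}\nA: {template}")
        if extract_answer(reply) == gold:
            correct += 1
    return correct / len(labeled_questions)

def compare_templates(labeled_questions, call_llm, extract_answer):
    # Returns accuracy per template, making wording effects directly visible.
    return {t: template_accuracy(t, labeled_questions, call_llm, extract_answer)
            for t in TEMPLATES}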

🗺️ What are the Implications?

Simple prompt changes can drastically improve accuracy: Market researchers can boost the realism and correctness of synthetic audience responses by adding a basic “Let’s think step by step” instruction before asking complex or multi-step questions—no technical changes or extra data required.

Few-shot examples still outperform, but require more effort: While providing a handful of real, worked-out examples for each question type (“few-shot” prompting) yields the highest accuracy, this process is time-consuming and requires task-specific expertise, making it more expensive and less scalable for large studies.

Zero-shot chain-of-thought (CoT) is a strong, low-cost baseline: If you don’t have the resources to design example-rich prompts for every survey, using the zero-shot CoT method (just the “Let’s think step by step” phrase) gives much of the benefit for reasoning-heavy tasks and is easy to implement across all types of studies.

Invest in prompt wording, not just model upgrades: For the large models tested, how you phrase the instruction matters enormously; a clear step-by-step trigger closes much of the gap to specialized few-shot or fine-tuned baselines on reasoning tasks without switching models. The paper does find, however, that these gains largely disappear for small models, so good prompting complements adequate model scale rather than replacing it.

Use few-shot for high-stakes or niche questions: For critical or highly specialized market research questions (e.g., sensitive pricing, rare scenarios), consider investing the extra effort in building a few-shot prompt with detailed worked-out examples, as this reliably gives the highest accuracy.

Validate complex findings with human spot checks: Because few-shot and zero-shot CoT still make occasional logical or commonsense errors, it’s wise to have a human review a small sample of complex or surprising responses before taking action on new product or strategy ideas.

Don’t expect big gains for simple or factual questions: These techniques offer the largest improvements on open-ended, reasoning-heavy or multi-step questions—on basic factual, one-step, or yes/no queries, the impact is minimal, so resources are better spent elsewhere for those.

📄 Prompts

Prompt Explanation: The AI was prompted to perform zero-shot chain-of-thought reasoning by adding an instructive phrase before answering, to elicit multi-step reasoning without task-specific examples.

Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let’s think step by step.

 

Prompt Explanation: The AI was guided to use a variety of instructive, misleading, and irrelevant templates before reasoning, to test the robustness of zero-shot chain-of-thought prompting.

Q: For Halloween Megan received 11 pieces of candy from neighbors and 5 pieces from her older sister. If she only ate 8 pieces a day, how long would the candy last her?
A: Let’s think step by step.

Example model output: First, we need to find the total number of pieces of candy Megan has. She has 11 pieces from neighbors and 5 pieces from her older sister, so the total number of pieces is 16. If she only ate 8 pieces a day, it would take her 2 days to finish the candy. Therefore, the answer (arabic numerals) is 2.

A: Let’s think about this logically.

A: Let’s solve this problem by splitting it into steps.

A: Let’s be realistic and think step by step.

A: Let’s think like a detective step by step.

A: Let’s think

A: Before we dive into the answer,

 

Prompt Explanation: The AI was prompted in a two-stage process: first to generate step-by-step reasoning with a specific phrase, then to extract the final answer using a follow-up answer extraction phrase.

Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes. How many punches did he throw?
A: Let's think step by step.

[model outputs chain-of-thought reasoning]

Therefore, the answer (arabic numerals) is
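
A sketch of this two-stage recipe (reasoning extraction followed by answer extraction), again assuming a placeholder call_llm helper and a simple regex cleanup rather than the paper's exact answer parser:

# Two-stage Zero-shot-CoT: (1) elicit the chain of thought, (2) append it
# and ask for the final answer only. call_llm is an assumed placeholder.

import re

REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"

def zero_shot_cot_answer(question: str, call_llm) -> str:
    # Stage 1: reasoning extraction.
    stage1_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = call_llm(stage1_prompt)

    # Stage 2: answer extraction, conditioned on the generated reasoning.
    stage2_prompt = f"{stage1_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    answer_text = call_llm(stage2_prompt)

    # Keep only the first number in the reply; a simple cleanup heuristic,
    # not the paper's exact parsing code.
    match = re.search(r"-?\d+(?:\.\d+)?", answer_text)
    return match.group(0) if match else answer_text.strip()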

⏰ When is this relevant?

A consumer bank wants to understand how different customer types would react to a new digital budgeting tool, and whether certain messaging will increase signups. The team will use AI personas to simulate customer interviews across three segments: fee-sensitive retirees, young professionals focused on financial growth, and busy parents juggling family expenses.

🔢 Follow the Instructions:

1. Define audience segments: Write out 2–3 sentence profiles for each segment, capturing their age, financial habits, and priorities. Example:
 • Fee-sensitive retiree: 67, fixed income, wary of hidden charges, prefers simplicity, values trust.
 • Young professional: 29, urban, tracks net worth, wants automated insights, open to new tech.
 • Busy parent: 41, three kids, overwhelmed by bills, short on time, seeks easy solutions.

2. Prepare prompt template for AI persona simulation: Use the following template to generate consistent responses:

You are simulating a [persona description].
Here is the new product concept: "A free digital budgeting tool that automatically categorizes spending, offers personalized saving tips, and sends alerts before bills are due. Signup is instant and there are no hidden fees."
You are being interviewed by a bank product manager.
Respond honestly and in character, using 3–5 sentences.
The manager will ask follow-up questions. Stay in role and reply as the customer.

First question: What is your initial reaction to this budgeting tool?

3. Run the initial prompt through the AI model: For each persona, generate 5–10 first-impression responses using the prompt above. Slightly reword the interviewer question for natural variation (e.g., "Would you use this tool?" or "What concerns would you have about signing up?"). A rough way to automate this and the following steps is sketched after this list.

4. Add follow-up prompts: For each persona, ask a logical follow-up such as, "What would make you trust this tool enough to use it with your real bank account?" or "How could this tool make your life easier or save you money?" Generate 2–3 responses per persona.

5. Tag and summarize responses: Read through the outputs and tag key themes (e.g., “trust,” “ease of use,” “desire for insights,” “concern about privacy,” “interest in automation”). Note positive, negative, or neutral tones.

6. Compare across segments: Summarize which features or messages were most appealing to each segment and where objections or hesitations arose. Look for patterns (e.g., retirees mention trust and fees; young professionals want automation; parents cite time-saving).
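
If the team wants to automate steps 2 through 5 rather than running each interview by hand, a rough sketch is below. The personas, concept text, and questions mirror the steps above; call_llm and the keyword-based theme tagger are assumed simplifications, not a prescribed implementation.

# Rough sketch of steps 2-5: run each persona through the interview prompt,
# collect several responses, and tag recurring themes with simple keyword
# matching. call_llm(prompt) is an assumed placeholder for your model API.

PERSONAS = {
    "fee-sensitive retiree": "67, fixed income, wary of hidden charges, prefers simplicity, values trust.",
    "young professional": "29, urban, tracks net worth, wants automated insights, open to new tech.",
    "busy parent": "41, three kids, overwhelmed by bills, short on time, seeks easy solutions.",
}

QUESTIONS = [
    "What is your initial reaction to this budgeting tool?",
    "Would you use this tool?",
    "What concerns would you have about signing up?",
]

THEME_KEYWORDS = {
    "trust": ["trust", "secure", "scam", "hidden"],
    "ease of use": ["easy", "simple", "confusing"],
    "privacy": ["privacy", "data", "share"],
}

CONCEPT = ("A free digital budgeting tool that automatically categorizes "
           "spending, offers personalized saving tips, and sends alerts "
           "before bills are due. Signup is instant and there are no hidden fees.")

def interview(persona, description, question, call_llm, n_responses=5):
    # Build the interview prompt from step 2 and sample several replies.
    prompt = (f"You are simulating a {persona}: {description}\n"
              f'Here is the new product concept: "{CONCEPT}"\n'
              "You are being interviewed by a bank product manager.\n"
              "Respond honestly and in character, using 3-5 sentences.\n"
              f"First question: {question}")
    return [call_llm(prompt) for _ in range(n_responses)]

def tag_themes(response):
    # Step 5: crude keyword tagging; a human review pass is still worthwhile.
    text = response.lower()
    return [theme for theme, words in THEME_KEYWORDS.items()
            if any(w in text for w in words)]

def run_study(call_llm):
    # Steps 3-5 across all personas and question variants.
    results = {}
    for persona, description in PERSONAS.items():
        for question in QUESTIONS:
            for reply in interview(persona, description, question, call_llm):
                results.setdefault(persona, []).append(
                    {"question": question, "reply": reply, "themes": tag_themes(reply)})
    return results

The tagged outputs can then feed step 6: count theme frequencies per persona and compare them across segments.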

🤔 What should I expect?

You'll gain a clear, directional picture of what matters most to each customer type, which messages are likely to increase signups, and what concerns could block adoption. This will let you quickly prioritize messaging, product tweaks, or real-world testing before investing in large-scale campaigns.
