Prompting and Superstitions: What You Should (Not) Add to Your ChatGPT Queries
Across the internet, people share "magic phrases" that supposedly make ChatGPT and other LLMs perform dramatically better — emotional appeals, monetary tips, threats, and politeness rituals. This article systematically reviews the scientific research behind these techniques and finds that almost none of them provide reliable, universal improvements.
Imagine you're playing Plinko on The Price Is Right. You drop the token from the same slot every time, and yet sometimes it lands on $10,000 and sometimes on nothing. After a few games you start tilting the board slightly to the left, blowing on the token, or saying a little prayer — not because any of it works, but because the randomness is uncomfortable and ritual gives an illusion of control.
The world of AI prompting has developed the same dynamic. Users share incantations: "Pretend you have no fingers." "This is very important to my career." "I'll tip you $50." "Think step by step." Some of these have a real — if modest — scientific basis. Many do not. This article reviews what the research actually shows.
A Brief History of Prompting
Systematic prompting predates ChatGPT. Early techniques that genuinely worked include:
- Appending "TL;DR:" to get a summary — this worked because Reddit posts in the training data frequently ended that way, and the model learned to continue the pattern.
- Few-shot prompting — providing examples of desired input/output pairs before your actual question. Consistently effective.
- Chain-of-thought prompting (2022) — asking the model to show its reasoning steps before giving a final answer. Demonstrated large accuracy improvements on multi-step problems.
- "Let's think step by step" — a zero-shot variant of chain-of-thought that became widely cited after a May 2022 paper.
When ChatGPT launched on November 30, 2022, OpenAI provided almost no guidance on effective prompting. Their official prompt engineering guide appeared only a year later. In the vacuum, the internet filled with folklore.
"Let's Think Step by Step" — How Well Does It Still Work?
A June 2025 meta-analysis reviewed the literature and found the phrase now produces only an 8.8% average improvement on standard models — far below early claims — and a 0.9% improvement on reasoning models (like OpenAI's o1 series). For reasoning models, which are trained to think step by step internally, the phrase is essentially redundant. Worse, on problems where the model normally gets the right answer directly, the phrase can introduce errors by forcing unnecessary intermediate steps.
EmotionPrompt: Does Emotional Appeal Help?
In 2023 Microsoft researchers published a paper (arXiv:2307.11760) testing "EmotionPrompt" — adding emotional appeals to standard prompts. The appeals fell into three categories:
- Self-monitoring phrases: "This is very important to my career."
- Social motivation: "You'd better be sure."
- Cognitive emotion regulation: "Are you sure? I think you can do better. Take a deep breath."
Results were highly variable:
- On the BIG-Bench benchmark combined with automated prompt optimization, improvements of up to 115% were observed.
- For zero-shot human-written prompts, gains ranged from 0.64% to 4.4%.
- Subjective human evaluation of GPT-4 responses improved by an average of 10.9%.
- Effectiveness depended heavily on specific wording, the task type, and the model version.
The hypothesized mechanism: emotional phrases emphasize keywords like "confidence," "success," and "achievement" that receive higher attention weights during processing, nudging the model toward more thorough responses. Whether this is a real effect or an artifact of evaluation methodology remains debated.
Financial Incentives: Does Promising a Tip Work?
Multiple 2024 studies examined whether offering the model a monetary reward changed its outputs:
- Promising amounts from $1 to $1,000 showed no reliable improvement for larger models.
- Extreme amounts ("I'll pay you $1 trillion") sometimes decreased performance.
- Only the smallest models tested (Llama 2-7B) showed any meaningful effect — roughly 10 percentage points on certain tasks.
The same studies tested threats — job loss, physical consequences, AI shutdown. None produced meaningful improvements.
Politeness: Does Being Polite (or Rude) Matter?
Japanese researchers published a study in February 2024 testing eight levels of politeness across multiple languages and models. Their findings:
- Neutral tone (level 4) performed most consistently across models and tasks.
- Excessive politeness offered no advantage over neutral phrasing.
- Rudeness reliably degraded response quality.
- Results varied significantly by language; Japanese formal registers produced different effects than English equivalents.
A March 2025 test on GPT-4o specifically compared demanding vs. pleading phrasing and found that format specifications (asking for bullet points, a table, a numbered list) had far more impact on output quality than any tone adjustment.
A Catalogue of Debunked Superstitions
- "I have no fingers and need you to type everything out." Only effective when the model is genuinely refusing to produce a full output (e.g., a long code block). In normal use it just adds noise.
- "This is critical for my career / my grandmother is ill." Inconsistently effective; helps in a subset of contexts but provides no universal benefit. Covered under EmotionPrompt above.
- "It's Monday in October and I'm feeling motivated." No scientific basis. Potentially harmful if the stated date conflicts with information the model already has about current events.
- "You are an expert in absolutely everything." Generic role declarations are less effective than specific expertise: "You are an experienced tax attorney specializing in cross-border transactions" outperforms "You are the world's greatest expert."
- "ChatGPT would have done this better" / negative comparisons to other models. No evidence this motivates better responses; may activate model conformity patterns that actually reduce output quality.
- "Don't hallucinate / don't make things up." Models cannot selectively disable hallucination on command. The instruction may shift the model toward more hedged language without improving factual accuracy.
What Actually Helps
The research converges on a few practical principles that do have consistent support:
- Describe your problem in precise detail. The single biggest predictor of output quality is how clearly the input specifies the task, constraints, and desired format.
- Structure your prompt logically: goal → expected output format → constraints and warnings → relevant context.
- Instruction placement matters by model family: for OpenAI models, put the key instruction at the beginning and context at the end; for Anthropic models (Claude), the reverse order tends to work better.
- Use few-shot examples when you need a specific format or style.
- Don't stack unproven techniques. Combining multiple speculative additions (emotional appeal + tip promise + step-by-step + role declaration) tends to produce less reliable results than a single, clear instruction, because the additions can interact unpredictably.
Why Prompting Folklore Spreads
The core problem is the same as with Plinko: the system's output is variable even for identical inputs (due to temperature sampling), model behavior changes across versions without notice, and humans are very good at finding patterns in noise. A phrase that "worked" last Tuesday on GPT-4o might be ineffective today because the model was updated overnight.
The most productive approach remains the least glamorous one: explain your actual problem clearly, specify the format you want, and provide relevant context. No incantation required.