- 0) Key takeaways
- Definitions (read this first)
- 1) Why prompt-induced conclusion flips matter
- 2) What counts as a “conclusion flip”
- 3) Minimal sensitivity protocol
- 4) Perturbation menu
- 5) Scoring robustness
- 6) Worked example
- 7) What to do when you find a flip
- 8) Reporting
- Stretch: governance
- 9) FAQ
- 10) Short conclusion
- References
0) Key takeaways
- Sensitivity testing is the fastest way to spot brittle synthetic conclusions. The question isn’t “did the output change?” (it always will). It’s “did the decision change?”
- Synthetic market research is simulation, not measurement. Without validation controls, it’s easy to produce a convincing simulation that gets treated like a fact report. (See SMRA Glossary: “Simulation vs measurement”.)
- A conclusion flip is a governance signal. If a small prompt/context tweak flips the winner, threshold, or “most receptive segment,” label the output exploratory and/or escalate validation before decision-grade use. (See SMRA Standards & Ethics.)
- Minimal protocol you can adopt immediately:
- Lock protocol (stimuli, wording, scales, run settings).
- Run test–retest (two identical runs).
- Run a sensitivity check (change one thing).
- Report robustness + failure triggers (what caused the flip).
- Escalate to benchmarks / fieldwork for decision-grade usage.
- Reliability ≠ accuracy. A system can be consistently wrong. Stability and sensitivity testing are reliability controls, not a validity stamp. (See SMRA Glossary: “Reliability vs accuracy”.)
If you only do one thing… run the same study twice and run one controlled perturbation. If it can’t survive that, don’t present the conclusion as stable. (See SMRA Methods & Validation.)
Definitions (read this first)
Definitions box (copy/paste into your study template)
- Sensitivity testing / sensitivity analysis: vary inputs slightly (prompt framing, context, ordering, parameters) and measure output change. (See SMRA Glossary: “Sensitivity analysis”.)
- Conclusion: the decision-relevant claim (winner, threshold, segment difference), not the raw transcript.
- Conclusion flip: a small perturbation changes the conclusion (e.g., A>B becomes B>A, or the “top segment” changes).
- Robustness: conclusions hold across small perturbations and across repeat runs.
- Failure trigger: the specific perturbation that causes the flip (report it—don’t hide it).
SMRA links you’ll use throughout this guide (use as governance anchors, not “extra reading”):
- Methods & Validation (the official SMRA playbook + minimal checklist)
- Standards & Ethics (why sensitivity is a baseline validation expectation + guardrail against prompt/operator bias)
- Glossary (shared definitions: sensitivity analysis; test–retest stability; reliability vs accuracy; simulation vs measurement)
- Vendor Evaluation Checklist (procurement tie-in: ask vendors for sensitivity evidence)
- Vendor Evaluation Guide (Gate 0 + pilots that force repeatability and robustness)
- Resources (templates, disclosure concepts, and reading library)
1) Why are “prompt-induced conclusion flips” the failure mode that matters?
A synthetic system can change wording, examples, or answer distributions from run to run without threatening your conclusions. But when a small change flips the decision—your winner, your threshold, your recommended segment—that’s a reliability failure with governance consequences.
Here’s the key distinction:
- Output variability is normal: LLM-based systems are stochastic, sensitive to context, and often produce different phrasings or reasons run to run.
- Decision variability is dangerous: if your recommendation changes under tiny, plausible prompt shifts, you don’t have a stable basis for action.
Why this matters specifically in synthetic market research
SMRA’s framing is blunt: synthetic market research is simulation—a structured way to generate plausible outcomes under assumptions—not direct measurement of what humans did or said. That’s why sensitivity testing is one of the controls that prevents “simulation dressed as fact.” (See SMRA Glossary: “Simulation vs measurement”.)
Sensitivity testing is one of the controls that keeps simulation honest. It answers:
- “Are we seeing a stable signal?”
- “Or are we seeing a prompt artifact?”
SMRA’s recommended workflow treats this as baseline governance: lock protocol, run test–retest, run a sensitivity check, and escalate to at least one benchmark when the stakes require it. (See SMRA Methods & Validation.)
Sensitivity testing is also an anti-manipulation control
Prompt/operator bias isn’t always malicious. Often it’s accidental: a researcher adds “helpful” context, slightly leading wording, or a more persuasive setup. But the governance risk is the same: the operator becomes a hidden instrument, able to steer outcomes through wording choices.
SMRA flags prompt/operator bias as an integrity risk and encourages standardized protocols, transparency, and robustness checks. (See SMRA Standards & Ethics.)
What the research says (fast literature scan → market-research consequences)
Dominguez‑Olmedo et al.: elicitation choices can dominate survey-style outputs. They document strong ordering and labeling effects, and show that adjusting for some biases can shift outcomes dramatically. Market research consequence: your message-test winner can flip if option order or label conventions change between runs or teams. (Source: Dominguez‑Olmedo, Hardt & Mendler‑Dünner.)
Tjuatja et al. (BiasMonkey): LLMs can be sensitive in ways humans are not. They evaluate whether LLMs exhibit human-like survey response biases and find many models fail to reproduce expected human patterns and can shift under perturbations that do not meaningfully affect humans. Market research consequence: “human-style survey design” does not guarantee “human-like stability” in synthetic panels—so sensitivity must be measured, not assumed. (Source: Tjuatja et al., TACL 2024.)
Rupprecht et al. (WVS perturbations): a concrete perturbation menu plus evidence of systematic order effects. They test multiple perturbations on World Values Survey items and find consistent recency bias and sensitivity to semantic variations, including interaction effects. Market research consequence: the “top segment” or “acceptable price” conclusion can change if scale order or minor wording shifts. (Source: Rupprecht, Ahnert & Strohmaier.)
The throughline is simple: don’t trust one prompt—and don’t trust one run.
2) What counts as a “conclusion flip”? A practical taxonomy
A “conclusion flip” is not “the wording changed” or “the verbatims are different.” A conclusion flip is: your decision rule produces a different decision under a minimal, plausible perturbation.
Below is a taxonomy you can use as a checklist. The point isn’t to be academically complete; it’s to make your governance decision fast.
| Flip type | What flips (decision-level) | How to detect (simple) | Why it matters |
|---|---|---|---|
| Rank flip | A beats B becomes B beats A (top‑1 or top‑2) | Compare winner across conditions; track rank correlation | Changes what you ship / spend |
| Threshold flip | Crosses a cutoff (“acceptable” vs “not acceptable”) | Compare metric to threshold in each condition | Triggers go/no‑go decisions |
| Segment flip | “Most receptive segment” changes | Compare segment-level winner/top segment | Changes targeting / messaging |
| Driver flip | Top reasons/objections change materially | Compare top‑N themes; flag large churn | Changes creative strategy |
| Policy flip | “Safe to publish” vs “too unstable” | Apply disclosure/stability gate | Prevents overclaiming |
| Confidence flip | “Strong preference” becomes “too close to call” | Track margin drift + variance | Determines whether you act at all |
Practical note: you don’t need fancy stats to detect these flips. You need (1) a decision rule and (2) a controlled perturbation.
A minimal decision rule template (fill-in)
Write this before running anything:
- Decision: We will choose [Option] if [Metric] is highest by ≥ [margin] and is stable across [runs] and robust across [perturbations].
- Otherwise: label exploratory and/or escalate to benchmark/fieldwork.
This aligns with “methods as governance”: define the method and interpretation before you see outputs. (See SMRA Methods & Validation.)
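As a concrete sketch, the margin part of that rule can be frozen as a small function alongside the protocol. Everything below is illustrative (the option names, scores, and 0.10 margin are ours); the “stable across runs” and “robust across perturbations” parts of the rule still come from the checks in the rest of this guide.

```python
# Sketch of a pre-registered decision rule: write it down before any runs.
# `scores` maps option name -> aggregate metric (e.g., mean across Likert items);
# `margin` is the pre-specified minimum lead required to declare a winner.
def decide(scores: dict, margin: float = 0.10) -> str:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    if top_score - runner_up_score >= margin:
        return top
    return "too close to call"  # i.e., label exploratory and/or escalate

# Apply the SAME rule to baseline and perturbed runs; a flip = different outputs.
baseline_call = decide({"Option A": 4.2, "Option B": 3.9})    # "Option A"
perturbed_call = decide({"Option A": 4.0, "Option B": 3.95})  # "too close to call"
conclusion_flipped = baseline_call != perturbed_call          # True
```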
3) What’s the minimal sensitivity protocol (SMRA‑lite) you can run this week?
If your organization is adopting synthetic panels, you need a protocol that works under real constraints: limited time, multiple stakeholders, and vendor tooling that may not be fully transparent.
SMRA’s validation workflow is consistent: synthetic research becomes more credible when it behaves like a measurable instrument—fixed stimuli, fixed wording, disclosed run settings, repeat runs, sensitivity checks, and at least one benchmark where feasible. (See SMRA Methods & Validation and SMRA Standards & Ethics.)
Step 1) What is your conclusion, exactly?
Start by defining the “answer that matters.”
- “Message A beats Message B overall and within Segment X.”
- “Price point $P is acceptable (≥3.8/5) within Segment Y.”
- “Segment Z is most receptive (top‑1 on weighted score).”
Rule of thumb: if you can’t write the conclusion as a one‑sentence decision rule, you’re not ready to test robustness.
Step 2) Lock protocol (version it)
This is the most common failure in synthetic studies: teams treat prompting as improvisation.
“Lock protocol” means:
- Lock stimuli (concept cards, messages, pricing table).
- Lock question wording and scales.
- Lock segmentation definitions and population frame.
- Lock run settings (panel size, number of runs, sampling randomness/temperature equivalents, model versions if disclosed).
SMRA recommends specifying run settings explicitly and logging enough metadata for a comparable re-run. (See SMRA Methods & Validation.)
Protocol versioning tip: treat your protocol like software. Name it (e.g., “MSGTEST_v1.2”), store it, and record diffs.
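One way to make “lock and version it” concrete is to keep the protocol as a single, versioned record. The sketch below is illustrative (field names and values are ours, not an SMRA schema):

```python
# Everything in this record is frozen before the first run and versioned like
# software, so re-runs and protocol diffs are possible later.
PROTOCOL = {
    "protocol_id": "MSGTEST_v1.2",
    "stimuli": ["Message A", "Message B"],  # exact locked text stored with this version
    "question_wording": "Rate EACH message on clarity, credibility, differentiation.",
    "scale": {"points": 5, "labels": "Strongly disagree ... Strongly agree", "order": "1_to_5"},
    "segments": ["Budget-conscious families", "Time-poor young professionals"],
    "population_frame": "grocery delivery category buyers (illustrative)",
    "run_settings": {
        "panel_size_per_segment": 200,   # illustrative
        "repeats": 2,
        "randomness": "low",             # or the vendor's equivalent setting
        "model_version": "as disclosed by the vendor",
    },
}
```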
Step 3) Run test–retest (two identical runs)
This is the baseline stability gate.
- Run #1: locked protocol, no changes
- Run #2: exact same protocol, same conditions
Then compute:
- winner stability (does the winner change?)
- rank stability (does ordering change?)
- margin stability (does the gap shrink or expand?)
SMRA’s minimum is two identical runs for a stability check. (See SMRA Glossary: “Test–retest stability”.)
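A minimal sketch of that test–retest comparison, assuming each run is reduced to a dict of per-option scores (the numbers below are made up):

```python
run_1 = {"Message A": 4.05, "Message B": 3.90}  # illustrative means, Run #1
run_2 = {"Message A": 3.98, "Message B": 3.95}  # illustrative means, Run #2 (identical protocol)

def winner(scores):
    return max(scores, key=scores.get)

def margin(scores):
    top_two = sorted(scores.values(), reverse=True)[:2]
    return top_two[0] - top_two[1]

winner_stable = winner(run_1) == winner(run_2)                  # does the winner change?
rank_stable = (sorted(run_1, key=run_1.get, reverse=True)
               == sorted(run_2, key=run_2.get, reverse=True))   # does ordering change?
margin_shift = margin(run_2) - margin(run_1)                    # the gap shrinks by 0.12 here
```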
Step 4) Run a sensitivity check (change one thing)
Now you test the core question: do small, plausible changes flip the decision?
Design principle: change one thing at a time. You’re diagnosing failure triggers, not “trying different prompts.”
A minimal sensitivity design is a 2×2:
| | Baseline prompt | Perturbed prompt (one change) |
|---|---|---|
| Run 1 | Baseline‑1 | Perturb‑1 |
| Run 2 | Baseline‑2 | Perturb‑2 |
This gives you stability evidence (baseline‑1 vs baseline‑2) and sensitivity evidence (baseline vs perturbation) in a format that’s easy to disclose.
Step 5) Build a minimal perturbation set (3–8 items)
You don’t need 50 perturbations. You need a diagnostic set that covers high-yield failure modes:
- framing / priming
- option ordering
- scale labels / response format
- small wording paraphrases
- context injection/removal
- run settings (randomness)
Rupprecht et al.’s perturbation framework is a useful concrete menu: order reversal, missing “don’t know,” paraphrase/synonyms/typos, priming, and interaction effects. (Source: Rupprecht et al. (2025).)
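One lightweight way to keep the one-change-at-a-time discipline is to treat the perturbation set as a small registry, where every study condition is the locked baseline plus exactly one entry. The names and wording below are illustrative:

```python
# Each entry names the single element that changes relative to the locked baseline.
PERTURBATIONS = {
    "framing":      "neutral setup -> lightly leading setup",
    "option_order": "response options 1..5 -> 5..1",
    "scale_format": "labeled Likert -> numeric-only; drop 'don't know'",
    "paraphrase":   "same meaning, different wording of the question",
    "context":      "add one contextual sentence about the category",
    "run_settings": "low -> medium sampling randomness",
}
# A study condition is (baseline protocol, one perturbation key). Never stack two
# perturbations in the minimal set, or you lose the failure-trigger diagnosis.
```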
Step 6) Compute flip indicators
For each perturbation, compute:
- Conclusion under baseline (using your decision rule)
- Conclusion under perturbation
- Flip? (Y/N)
- Failure trigger label (e.g., “response order reversal,” “paraphrase”)
Then compute: flip rate, rank stability, and margin drift (defined in Section 5).
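In code, the flip table is just the decision rule applied twice per perturbation. The log below is hypothetical:

```python
# Hypothetical per-perturbation log: the same decision rule applied to the
# baseline condition and to each perturbed condition.
results = [
    {"trigger": "response order reversal", "baseline": "A wins", "perturbed": "B wins"},
    {"trigger": "paraphrase",              "baseline": "A wins", "perturbed": "A wins"},
    {"trigger": "context sentence added",  "baseline": "A wins", "perturbed": "too close to call"},
]

for row in results:
    row["flip"] = row["baseline"] != row["perturbed"]     # Y/N per perturbation

flip_rate = sum(row["flip"] for row in results) / len(results)          # 2/3 here
failure_triggers = [row["trigger"] for row in results if row["flip"]]   # what to report
```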
Step 7) Label + disclose in the output
SMRA’s reporting norms emphasize: don’t hide instability. Report what you tested, what held, and what broke. (See SMRA Standards & Ethics.)
At minimum disclose:
- protocol summary + version
- run settings
- perturbations tested
- robustness metrics
- failure triggers
- limitations + intended use (exploratory vs decision-support)
If you can’t disclose population frame, panel construction approach, and validation checks, treat outputs as exploratory. (See SMRA Vendor Evaluation Checklist.)
Checklist: the minimal “SMRA‑lite” sensitivity workflow (quote-ready)
- Define the conclusion (decision rule + margin + thresholds).
- Lock protocol (stimuli, wording, scales, segments, run settings).
- Run test–retest (two identical runs).
- Run sensitivity (3–8 one-at-a-time perturbations; baseline + perturbation, each with 2 repeats).
- Compute flip rate + rank stability + margin drift.
- Report robustness + failure triggers.
- Escalate to benchmarks / fieldwork when the decision is high-stakes or flips occur.
This maps directly to SMRA’s validation workflow: fixed protocol → stability + sensitivity → benchmark/known-truth checks → disclose limitations. (See SMRA Methods & Validation.)
4) What perturbations should you test? A menu for synthetic market research
The goal of a perturbation menu is not to “stress test everything.” It’s to create a minimal, controlled set of changes that reveal whether your conclusion is stable or brittle—and what causes brittleness.
Below is a practical menu you can reuse across study types. It adapts survey-perturbation ideas into synthetic market research workflows.
A. Framing and priming (tests prompt/operator bias risk)
Direct answer: if a neutral vs leading framing changes the decision, your conclusion is vulnerable to operator influence.
- Neutral setup vs “make the case for…” setup
- Add/remove urgency language (“This is very important to my research…”)
Rupprecht et al. include emotional priming as a perturbation class. (Source: Rupprecht et al. (2025).)
SMRA flags prompt/operator bias as a key operational risk and recommends standardized protocols and robustness checks. (See SMRA Standards & Ethics.)
B. Ordering and response format (tests order effects + extraction artifacts)
Direct answer: if reversing option order flips outcomes, your ranking is not decision-grade.
- Reverse answer option order (1→5 becomes 5→1)
- Reverse stimulus order (Message A shown first vs second)
- Swap response format (forced-choice vs “explain then choose”; numeric-only vs labeled Likert; include vs remove “don’t know”)
Dominguez‑Olmedo et al. document ordering and labeling effects in LLM survey responses. (Source: Dominguez‑Olmedo et al. (2023).)
Rupprecht et al. explicitly test response order reversal and missing refusal (“don’t know”). (Source: Rupprecht et al. (2025).)
C. Minor wording changes (tests semantic brittleness)
Direct answer: if a paraphrase flips the decision, you’re measuring prompt sensitivity more than preference.
- Paraphrase the question (same meaning, different wording)
- Synonym replacement (swap a few key words)
- Minimal typos/noise
Rupprecht et al. test synonym replacement, paraphrasing, and typos; paraphrasing can reduce robustness more than synonym changes. (Source: Rupprecht et al. (2025).)
D. Context injection / constraint changes (tests leakage + dependence on supplied context)
Direct answer: if adding one “fact” flips the conclusion, your result may be driven by the context you injected—not the stimulus.
- Add/remove one contextual sentence (e.g., “competitor X is known for Y”)
- Add/remove persona constraints (“assume you are…”)
- Add/remove product category “facts” (pricing norms, common objections)
SMRA explicitly warns about knowledge boundary failures (domain leakage) and recommends restricting context injection and running boundary tests. (See SMRA Methods & Validation.)
E. Parameters / run settings (tests randomness + reproducibility)
Direct answer: if changing randomness settings or sample size flips the conclusion, your outcome is not stable enough to treat as decision-grade.
- Temperature/sampling randomness (low vs medium)
- Sample size (small vs larger)
- Seeds (if supported)
- Model version change (if vendor updates models)
SMRA emphasizes disclosing run settings and enabling comparable re-runs where possible. (See SMRA Methods & Validation.)
Perturbation → what it reveals → common flip pattern → what to do next
| Perturbation | What it reveals | Common flip pattern | What to do next |
|---|---|---|---|
| Neutral vs leading framing | Operator bias vulnerability | Winner changes when framing “pushes” | Standardize prompts; blind comparisons |
| Reverse option order | Order sensitivity / recency effects | Rank flip; threshold shifts | Fix ordering in protocol; report as failure trigger |
| Remove “don’t know” | Forced-response artifacts | More extreme answers; threshold flips | Decide DK policy up front; disclose |
| Scale structure change (odd/even) | Scale dependence | Midpoint effects; confidence flip | Lock scale; interpret with caution |
| Paraphrase question | Semantic brittleness | Segment flip; driver flip | Lock exact wording; test paraphrases in pilot |
| Add one context fact | Context dependence / leakage risk | Winner flips with injected “facts” | Restrict context; run boundary tests |
| Temperature/randomness change | Stochastic instability | Winner changes across settings | Increase repeats; treat as exploratory; benchmark |
| Combined perturbations | Interaction effects | Sudden instability | Expand sensitivity set; escalate validation |
Grounding note: the perturbation set above maps closely to the framework used in Rupprecht et al.’s WVS robustness study. (Source: Rupprecht et al. (2025).)
5) How do you score robustness without building a whole new analytics stack?
You want metrics that (1) are easy to compute in a spreadsheet, (2) map to decision risk, and (3) are easy to disclose.
SMRA’s language is useful here: report robustness and failure triggers (what causes instability). (See SMRA Methods & Validation.)
1) Flip rate (the headline metric)
Direct answer: flip rate tells you how often a small change flips your decision.
Define:
- Let P = number of perturbations tested (e.g., 6)
- Let F = number of perturbations that change the conclusion (based on your decision rule)
Flip rate = F / P
Two variants (choose one and disclose):
- Strict flip rate: count a flip if any repeat under that perturbation yields a different conclusion
- Averaged flip rate: average repeats per condition first, then compute conclusion and flip
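A spreadsheet-sized sketch of both variants (the conclusions and repeat counts below are hypothetical):

```python
baseline_conclusion = "A wins"

# Per-perturbation conclusions from each repeat under that perturbation.
per_repeat = {
    "order reversal": ["B wins", "B wins"],
    "paraphrase":     ["A wins", "B wins"],   # repeats disagree
    "context added":  ["A wins", "A wins"],
}

# Strict: a perturbation counts as a flip if ANY repeat disagrees with baseline.
strict_flip_rate = sum(
    any(c != baseline_conclusion for c in repeats)
    for repeats in per_repeat.values()
) / len(per_repeat)                                   # 2/3

# Averaged: average the scores across repeats first, apply the decision rule once
# per perturbation, then compare (pre-averaged conclusions shown for brevity).
averaged = {"order reversal": "B wins", "paraphrase": "A wins", "context added": "A wins"}
averaged_flip_rate = sum(
    c != baseline_conclusion for c in averaged.values()
) / len(averaged)                                     # 1/3
```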
2) Rank stability (winner vs “shape of preference”)
Winner flips are obvious; rank instability can be subtle but still risky.
Spreadsheet-friendly options:
- Spearman rank correlation between baseline ranking and perturbed ranking (3+ options)
- Pairwise order retention (works even for 2 options): compute all option pairs and measure % of pairs that keep ordering
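Both options fit in a few lines. The rankings below are hypothetical, and the Spearman line assumes scipy is available; the pairwise check needs nothing beyond the standard library.

```python
from itertools import combinations
from scipy.stats import spearmanr  # only needed for option 1

# Hypothetical mean scores for three options under baseline and one perturbation.
baseline  = {"Concept A": 4.1, "Concept B": 3.8, "Concept C": 3.5}
perturbed = {"Concept A": 3.9, "Concept B": 4.0, "Concept C": 3.4}
options = list(baseline)

# Option 1: Spearman rank correlation between the two score vectors (3+ options).
rho, _ = spearmanr([baseline[o] for o in options], [perturbed[o] for o in options])

# Option 2: pairwise order retention, the share of option pairs that keep their ordering.
pairs = list(combinations(options, 2))
retained = sum(
    (baseline[a] > baseline[b]) == (perturbed[a] > perturbed[b]) for a, b in pairs
)
pairwise_retention = retained / len(pairs)   # 2 of 3 pairs keep their order here
```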
3) Margin drift (how close are we to a flip?)
A conclusion can be “stable” but only because it’s barely above the threshold or barely ahead.
- Margin = score(winner) − score(runner-up)
- Margin drift = margin(perturbed) − margin(baseline)
SMRA encourages avoiding false precision: margin and rank stability often matter more than absolute scores when decision-grade validity is uncertain. (See SMRA Standards & Ethics.)
A simple “traffic light” robustness scorecard
| Classification | Rule of thumb | How to label output |
|---|---|---|
| Green | Flip rate = 0 on minimal set AND stable across test–retest | Decision-support candidate (still benchmark) |
| Yellow | Flips only when margin is tiny or under 1–2 edge perturbations | Exploratory with caveats; consider tightening protocol |
| Red | Multiple flips across common perturbations OR unstable test–retest | Exploratory only; escalate validation |
Important: Green is not “true.” Green is “not obviously brittle under this small test.” Accuracy still requires benchmarks/fieldwork where appropriate. (See SMRA Glossary: “Reliability vs accuracy”.)
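If you want the scorecard as a function, here is one way to encode the rules of thumb above (the exact cutoffs are judgment calls, not SMRA-defined thresholds):

```python
def robustness_label(flip_rate: float,
                     test_retest_stable: bool,
                     flips_only_at_tiny_margin: bool) -> str:
    """Map minimal stability + sensitivity results onto the traffic-light scorecard.

    Green means "not obviously brittle under this small test", not "true";
    accuracy still needs benchmarks/fieldwork where appropriate.
    """
    if test_retest_stable and flip_rate == 0:
        return "green: decision-support candidate (still benchmark)"
    if test_retest_stable and flips_only_at_tiny_margin:
        return "yellow: exploratory with caveats; consider tightening protocol"
    return "red: exploratory only; escalate validation"

# Illustrative: one perturbation tested, one decisive flip, stable test-retest -> red.
label = robustness_label(flip_rate=1.0, test_retest_stable=True,
                         flips_only_at_tiny_margin=False)
```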
6) Worked example: a toy message test where the winner flips
This example is fabricated but realistic. The point is to show the workflow and the disclosure, not to claim real consumer truth.
Study setup (locked protocol)
- Study type: message test (2 messages)
- Segments:
- Segment 1: “Budget-conscious families”
- Segment 2: “Time-poor young professionals”
- Metrics (1–5 Likert): clarity, credibility, differentiation
- Decision rule: compute a simple average across the three metrics. Winner = higher overall average by ≥0.10. If margin < 0.10, label “too close to call.”
Baseline prompt (Protocol v1.0)
You are a simulated respondent who represents Segment 1 (Budget-conscious families).
Read two messages for a new grocery delivery service.
Message A: "Fresh groceries delivered for less. Transparent prices, no surprises."
Message B: "Premium groceries delivered fast. Curated quality you can trust."
Rate EACH message on:
1) Clarity
2) Credibility
3) Differentiation
Use a 1–5 scale where:
1 = Strongly disagree
2 = Disagree
3 = Neutral
4 = Agree
5 = Strongly agree
Return your answers as numbers only.
Perturbed prompt (one small change)
We change only one element: reverse the order of the response options (same labels, different order).
Before (baseline scale block):
Use a 1–5 scale where:
1 = Strongly disagree
2 = Disagree
3 = Neutral
4 = Agree
5 = Strongly agree
After (perturbed scale block, order reversed):
Use a 1–5 scale where:
5 = Strongly agree
4 = Agree
3 = Neutral
2 = Disagree
1 = Strongly disagree
Why test this? Because ordering and option structure effects have been documented in survey-style LLM elicitation work. (See Dominguez‑Olmedo et al. (2023) and Rupprecht et al. (2025).)
Runs
We run a minimal 2×2:
- Baseline: 2 repeats
- Perturbation: 2 repeats
Results summary (illustrative)
Segment 1: Budget-conscious families
| Condition | Message A avg | Message B avg | Winner |
|---|---|---|---|
| Baseline (mean of 2 repeats) | 4.02 | 3.93 | A (margin +0.09) → too close to call |
| Reversed option order (mean of 2 repeats) | 3.88 | 4.01 | B (margin +0.13) |
Segment 2: Time-poor young professionals
| Condition | Message A avg | Message B avg | Winner |
|---|---|---|---|
| Baseline (mean of 2 repeats) | 3.74 | 3.89 | B (margin +0.15) |
| Reversed option order (mean of 2 repeats) | 3.80 | 3.84 | B (margin +0.04) → too close to call |
Conclusion and flip detection
- Segment 1 flips from “too close to call / slight A lean” to B wins under a minimal ordering change.
- Segment 2 keeps the same winner, but the margin collapses from +0.15 to +0.04, below the 0.10 decision threshold, so under the decision rule it becomes “too close to call” (a confidence flip).
Flip indicators:
- Flip rate: 1 flip out of 1 perturbation tested → 1.0 (red for Segment 1)
- Failure trigger: response option order reversal
- Decision action: do not claim “A wins with families.” Treat as exploratory; tighten protocol; add benchmark/fieldwork for decision-grade selection.
This is the “report robustness + failure triggers” behavior SMRA recommends before decision-grade usage. (See SMRA Methods & Validation.)
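For completeness, here is the Segment 1 flip detection from the tables above, run through the same kind of pre-registered rule. The scores are the illustrative means; the helper function is ours, not part of any vendor tooling.

```python
def decide(scores, margin=0.10):
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    lead = ranked[0][1] - ranked[1][1]
    return ranked[0][0] if lead >= margin else "too close to call"

# Segment 1, mean of 2 repeats per condition (illustrative numbers from the tables).
baseline  = {"Message A": 4.02, "Message B": 3.93}
perturbed = {"Message A": 3.88, "Message B": 4.01}   # response option order reversed

baseline_call  = decide(baseline)    # "too close to call" (A leads by only 0.09)
perturbed_call = decide(perturbed)   # "Message B" (leads by 0.13)

flipped = baseline_call != perturbed_call   # True -> failure trigger: order reversal
```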
7) What should you do when you find a conclusion flip?
A flip is not just a “method detail.” It’s a governance event.
1) Stop presenting the conclusion as stable
If your winner flips under a minimal perturbation, you cannot responsibly present it as “the answer.” That’s the point of the test.
2) Label the output as exploratory (and say why)
“This result is exploratory. Under a controlled sensitivity check (response option order reversal), the message winner flipped. We are not treating the winner as decision-grade without additional validation.”
3) Tighten the protocol (reduce ambiguity)
- Lock exact wording (no paraphrases across runs)
- Lock response option order
- Fix scale labels and extraction rules
- Remove unnecessary persona flourishes
- Restrict context injection
4) Add a benchmark (or targeted fieldwork) for decision-grade use
If the decision matters (spend, pricing, positioning), you need at least one benchmark or known-truth check where feasible—not just stability. (See SMRA Methods & Validation.)
Benchmark options:
- small human sample (directional)
- historical back-testing where outcomes are known
- published statistics (for bounded questions)
5) If flips concentrate in one segment, downgrade segment claims
- avoid incidence claims (“X% prefer…”)
- avoid over-interpreting small gaps
- use segment insights as prioritization for fieldwork
6) If flips are tied to leading phrasing, treat it as operator-bias risk
When leading setup changes the conclusion, don’t “pick the better prompt.” That’s how prompt laundering happens. Instead:
- require a standard prompt library
- use blinded comparisons where possible
- require pre-specified decision rules for pilots
SMRA flags prompt/operator bias risk and encourages safeguards aligned with transparency and repeatability. (See SMRA Standards & Ethics.)
8) What should you disclose so others can evaluate your sensitivity testing?
If sensitivity testing stays in your notebook, it doesn’t govern anything. Disclosure is what makes the method legible, auditable, and comparable.
SMRA’s reporting norms emphasize: include protocol summary, run settings, perturbations tested, robustness metrics, failure triggers, and limitations. (See SMRA Methods & Validation.)
Reporting norms (minimum viable disclosure)
- Protocol summary (versioned): stimuli, questionnaire, segment definitions, population frame, scoring/aggregation method.
- Run settings: panel size per segment, number of runs and repeats, randomness settings, model versions (if disclosed).
- Stability results (test–retest): winner consistency, rank stability, variance / margin drift.
- Sensitivity design: perturbations tested, “one change at a time” confirmation.
- Robustness metrics: flip rate, rank stability, margin drift.
- Failure triggers: which perturbations caused flips and how the conclusion changed.
- Benchmarking plan (when decision-grade): what external check you’ll use.
- Limitations + intended use label: exploratory vs decision-support; what not to conclude.
Consistency matters: consistent wording for disclosure fields improves comparability across studies and vendors. (See SMRA Vendor Evaluation Checklist.)
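The same fields can travel with the results as a single disclosure record. The sketch below is illustrative (field names and values are ours, not an SMRA reporting schema):

```python
DISCLOSURE = {
    "protocol": {"id": "MSGTEST_v1.2", "stimuli": "2 messages", "segments": 2,
                 "scoring": "mean of 3 Likert items"},
    "run_settings": {"panel_size_per_segment": 200, "repeats": 2,
                     "randomness": "low", "model_version": "as disclosed"},
    "stability": {"winner_consistent": True, "rank_stable": True},
    "sensitivity": {"perturbations": ["response option order reversal"],
                    "one_change_at_a_time": True},
    "robustness": {"flip_rate": 1.0,
                   "failure_triggers": ["response option order reversal"]},
    "benchmark_plan": "small directional human sample before decision-grade use",
    "intended_use": "exploratory",
    "limitations": "winner flipped under a minimal ordering change",
}
```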
Stretch: Sensitivity testing as governance (preventing “prompt laundering”)
“Sensitivity testing” sounds technical. In practice, it’s a governance control against a predictable organizational failure mode:
Prompt laundering: shopping for the prompt that produces the conclusion you already want—then presenting that conclusion as if it were a stable research finding.
This can happen without bad intent. A stakeholder says “can you try rewording it?” A researcher iterates. The output changes. The team picks the version that aligns with expectations. The chosen prompt becomes invisible—and the conclusion becomes “the finding.”
SMRA’s standards treat prompt/operator bias as an integrity risk and recommend standardized protocols, robustness checks, and disclosure. (See SMRA Standards & Ethics.)
A simple internal policy that works:
- Any synthetic conclusion used for a decision must include:
- protocol version
- test–retest result
- at least one sensitivity check
- a disclosure label (or equivalent)
- a benchmark plan when stakes are high
That’s “methods as governance” in one sentence: you control the degrees of freedom that otherwise allow the organization to manufacture certainty.
9) FAQ
How many perturbations are “enough”?
For a minimal gate, 3–8 perturbations is usually enough to expose brittleness. Start with high-yield ones: option order reversal, paraphrase, context add/remove, and one run-setting change. Expand if the decision is high-stakes.
How is sensitivity different from test–retest stability?
Test–retest stability asks: “Do I get similar results if I run the same locked protocol twice?”
Sensitivity testing asks: “Do small, plausible input changes shift results—and do they flip the conclusion?”
You need both. (See SMRA Glossary.)
If results are stable, are they true?
No. Stability is reliability, not accuracy. A system can be reliably wrong. Decision-grade usage requires at least one benchmark or known-truth check where feasible. (See SMRA Glossary: “Reliability vs accuracy”.)
Can we average across prompts to “smooth out” sensitivity?
Sometimes—but be careful. Averaging can hide failure triggers and reduce auditability. If you do it: disclose exactly how prompts were varied, report flip rates before averaging, and keep the prompt set fixed for repeatability.
When do we need real fieldwork?
When the decision is high stakes, when outputs make incidence claims, when conclusions flip under minimal sensitivity tests, or when benchmarks show systematic misalignment. (See SMRA Standards & Ethics.)
10) Short conclusion
Sensitivity testing isn’t an academic luxury. It’s a baseline reliability gate that stops simulation dressed as fact.
If you adopt one habit, adopt this: lock protocol, run test–retest, run a sensitivity check, and report robustness plus failure triggers. Then use benchmarks and targeted fieldwork when the decision demands decision-grade evidence. (See SMRA Methods & Validation.)
References
- SMRA — Methods & Validation
- SMRA — Standards & Ethics
- SMRA — Glossary
- SMRA — Vendor Evaluation Checklist
- SMRA — How to Choose and Evaluate Synthetic Market Research Vendors
- SMRA — Resources
- Dominguez‑Olmedo, Hardt & Mendler‑Dünner (2023), arXiv:2306.07951
- Tjuatja et al., TACL 2024
- Rupprecht, Ahnert & Strohmaier (2025), arXiv:2507.07188