- 0) Key takeaways
- Definitions (read this first)
- 1) Why prompt-induced conclusion flips matter
- 2) What counts as a “conclusion flip”
- 3) Minimal sensitivity protocol
- 4) Perturbation menu
- 5) Scoring robustness
- 6) Worked example
- 7) What to do when you find a flip
- 8) Reporting
- Stretch: governance
- 9) FAQ
- 10) Short conclusion
- References
0) Key takeaways
- Sensitivity testing is the fastest way to spot brittle synthetic conclusions. The question isn’t “did the output change?” (it always will). It’s “did the decision change?”
- Synthetic market research is simulation, not measurement. Without validation controls, it’s easy to produce a convincing simulation that gets treated like a fact report. (See SMRA Glossary: “Simulation vs measurement”.)
- A conclusion flip is a governance signal. If a small prompt/context tweak flips the winner, threshold, or “most receptive segment,” label the output exploratory and/or escalate validation before decision-grade use. (See SMRA Standards & Ethics.)
- Minimal protocol you can adopt immediately:
- Lock protocol (stimuli, wording, scales, run settings).
- Run test–retest (two identical runs).
- Run a sensitivity check (change one thing).
- Report robustness + failure triggers (what caused the flip).
- Escalate to benchmarks / fieldwork for decision-grade usage.
- Reliability ≠ accuracy. A system can be consistently wrong. Stability and sensitivity testing are reliability controls, not a validity stamp. (See SMRA Glossary: “Reliability vs accuracy”.)
If you only do one thing… run the same study twice and run one controlled perturbation. If it can’t survive that, don’t present the conclusion as stable. (See SMRA Methods & Validation.)
Definitions (read this first)
Definitions box (copy/paste into your study template)
- Sensitivity testing / sensitivity analysis: vary inputs slightly (prompt framing, context, ordering, parameters) and measure output change. (See SMRA Glossary: “Sensitivity analysis”.)
- Conclusion: the decision-relevant claim (winner, threshold, segment difference), not the raw transcript.
- Conclusion flip: a small perturbation changes the conclusion (e.g., A>B becomes B>A, or the “top segment” changes).
- Robustness: conclusions hold across small perturbations and across repeat runs.
- Failure trigger: the specific perturbation that causes the flip (report it—don’t hide it).
SMRA links you’ll use throughout this guide (use as governance anchors, not “extra reading”):
- Methods & Validation (the official SMRA playbook + minimal checklist)
- Standards & Ethics (why sensitivity is a baseline validation expectation + guardrail against prompt/operator bias)
- Glossary (shared definitions: sensitivity analysis; test–retest stability; reliability vs accuracy; simulation vs measurement)
- Vendor Evaluation Checklist (procurement tie-in: ask vendors for sensitivity evidence)
- Vendor Evaluation Guide (Gate 0 + pilots that force repeatability and robustness)
- Resources (templates, disclosure concepts, and reading library)
1) Why are “prompt-induced conclusion flips” the failure mode that matters?
A synthetic system can change wording, examples, or answer distributions from run to run without threatening your conclusions. But when a small change flips the decision—your winner, your threshold, your recommended segment—that’s a reliability failure with governance consequences.
Here’s the key distinction:
- Output variability is normal: LLM-based systems are stochastic, sensitive to context, and often produce different phrasings or reasons run to run.
- Decision variability is dangerous: if your recommendation changes under tiny, plausible prompt shifts, you don’t have a stable basis for action.
Why this matters specifically in synthetic market research
SMRA’s framing is blunt: synthetic market research is simulation—a structured way to generate plausible outcomes under assumptions—not direct measurement of what humans did or said. That’s why sensitivity testing is one of the controls that prevents “simulation dressed as fact.” (See SMRA Glossary: “Simulation vs measurement”.)
Sensitivity testing is one of the controls that keeps simulation honest. It answers:
- “Are we seeing a stable signal?”
- “Or are we seeing a prompt artifact?”
SMRA’s recommended workflow treats this as baseline governance: lock protocol, run test–retest, run a sensitivity check, and escalate to at least one benchmark when the stakes require it. (See SMRA Methods & Validation.)
Sensitivity testing is also an anti-manipulation control
Prompt/operator bias isn’t always malicious. Often it’s accidental: a researcher adds “helpful” context, slightly leading wording, or a more persuasive setup. But the governance risk is the same: the operator becomes a hidden instrument, able to steer outcomes through wording choices.
SMRA flags prompt/operator bias as an integrity risk and encourages standardized protocols, transparency, and robustness checks. (See SMRA Standards & Ethics.)
What the research says (fast literature scan → market-research consequences)
Dominguez‑Olmedo et al.: elicitation choices can dominate survey-style outputs. They document strong ordering and labeling effects, and show that adjusting for some biases can shift outcomes dramatically. Market research consequence: your message-test winner can flip if option order or label conventions change between runs or teams. (Source: Dominguez‑Olmedo, Hardt & Mendler‑Dünner.)
Tjuatja et al. (BiasMonkey): LLMs can be sensitive in ways humans are not. They evaluate whether LLMs exhibit human-like survey response biases and find many models fail to reproduce expected human patterns and can shift under perturbations that do not meaningfully affect humans. Market research consequence: “human-style survey design” does not guarantee “human-like stability” in synthetic panels—so sensitivity must be measured, not assumed. (Source: Tjuatja et al., TACL 2024.)
Rupprecht et al. (WVS perturbations): a concrete perturbation menu plus evidence of systematic order effects. They test multiple perturbations on World Values Survey items and find consistent recency bias and sensitivity to semantic variations, including interaction effects. Market research consequence: the “top segment” or “acceptable price” conclusion can change if scale order or minor wording shifts. (Source: Rupprecht, Ahnert & Strohmaier.)
The throughline is simple: don’t trust one prompt—and don’t trust one run.
2) What counts as a “conclusion flip”? A practical taxonomy
A “conclusion flip” is not “the wording changed” or “the verbatims are different.” A conclusion flip is: your decision rule produces a different decision under a minimal, plausible perturbation.
Below is a taxonomy you can use as a checklist. The point isn’t to be academically complete; it’s to make your governance decision fast.
| Flip type | What flips (decision-level) | How to detect (simple) | Why it matters |
|---|---|---|---|
| Rank flip | A beats B becomes B beats A (top‑1 or top‑2) | Compare winner across conditions; track rank correlation | Changes what you ship / spend |
| Threshold flip | Crosses a cutoff (“acceptable” vs “not acceptable”) | Compare metric to threshold in each condition | Triggers go/no‑go decisions |
| Segment flip | “Most receptive segment” changes | Compare segment-level winner/top segment | Changes targeting / messaging |
| Driver flip | Top reasons/objections change materially | Compare top‑N themes; flag large churn | Changes creative strategy |
| Policy flip | “Safe to publish” vs “too unstable” | Apply disclosure/stability gate | Prevents overclaiming |
| Confidence flip | “Strong preference” becomes “too close to call” | Track margin drift + variance | Determines whether you act at all |
Practical note: you don’t need fancy stats to detect these flips. You need (1) a decision rule and (2) a controlled perturbation.
A minimal decision rule template (fill-in)
Write this before running anything:
- Decision: We will choose [Option] if [Metric] is highest by ≥ [margin] and is stable across [runs] and robust across [perturbations].
- Otherwise: label exploratory and/or escalate to benchmark/fieldwork.
This aligns with “methods as governance”: define the method and interpretation before you see outputs. (See SMRA Methods & Validation.)
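As a concrete sketch, the margin part of that rule can be frozen as a small function alongside the protocol. Everything below is illustrative (the option names, scores, and 0.10 margin are ours); the “stable across runs” and “robust across perturbations” parts of the rule still come from the checks in the rest of this guide.

```python
# Sketch of a pre-registered decision rule: write it down before any runs.
# `scores` maps option name -> aggregate metric (e.g., mean across Likert items);
# `margin` is the pre-specified minimum lead required to declare a winner.
def decide(scores: dict, margin: float = 0.10) -> str:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    if top_score - runner_up_score >= margin:
        return top
    return "too close to call"  # i.e., label exploratory and/or escalate

# Apply the SAME rule to baseline and perturbed runs; a flip = different outputs.
baseline_call = decide({"Option A": 4.2, "Option B": 3.9})    # "Option A"
perturbed_call = decide({"Option A": 4.0, "Option B": 3.95})  # "too close to call"
conclusion_flipped = baseline_call != perturbed_call          # True
```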
3) What’s the minimal sensitivity protocol (SMRA‑lite) you can run this week?
If your organization is adopting synthetic panels, you need a protocol that works under real constraints: limited time, multiple stakeholders, and vendor tooling that may not be fully transparent.
SMRA’s validation workflow is consistent: synthetic research becomes more credible when it behaves like a measurable instrument—fixed stimuli, fixed wording, disclosed run settings, repeat runs, sensitivity checks, and at least one benchmark where feasible. (See SMRA Methods & Validation and SMRA Standards & Ethics.)
Step 1) What is your conclusion, exactly?
Start by defining the “answer that matters.”
- “Message A beats Message B overall and within Segment X.”
- “Price point $P is acceptable (≥3.8/5) within Segment Y.”
- “Segment Z is most receptive (top‑1 on weighted score).”
Rule of thumb: if you can’t write the conclusion as a one‑sentence decision rule, you’re not ready to test robustness.
Step 2) Lock protocol (version it)
This is the most common failure in synthetic studies: teams treat prompting as improvisation.
“Lock protocol” means:
- Lock stimuli (concept cards, messages, pricing table).
- Lock question wording and scales.
- Lock segmentation definitions and population frame.
- Lock run settings (panel size, number of runs, sampling randomness/temperature equivalents, model versions if disclosed).
SMRA recommends specifying run settings explicitly and logging enough metadata for a comparable re-run. (See SMRA Methods & Validation.)
Protocol versioning tip: treat your protocol like software. Name it (e.g., “MSGTEST_v1.2”), store it, and record diffs.
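One way to make “lock and version it” concrete is to keep the protocol as a single, versioned record. The sketch below is illustrative (field names and values are ours, not an SMRA schema):

```python
# Everything in this record is frozen before the first run and versioned like
# software, so re-runs and protocol diffs are possible later.
PROTOCOL = {
    "protocol_id": "MSGTEST_v1.2",
    "stimuli": ["Message A", "Message B"],  # exact locked text stored with this version
    "question_wording": "Rate EACH message on clarity, credibility, differentiation.",
    "scale": {"points": 5, "labels": "Strongly disagree ... Strongly agree", "order": "1_to_5"},
    "segments": ["Budget-conscious families", "Time-poor young professionals"],
    "population_frame": "grocery delivery category buyers (illustrative)",
    "run_settings": {
        "panel_size_per_segment": 200,   # illustrative
        "repeats": 2,
        "randomness": "low",             # or the vendor's equivalent setting
        "model_version": "as disclosed by the vendor",
    },
}
```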
Step 3) Run test–retest (two identical runs)
This is the baseline stability gate.
- Run #1: locked protocol, no changes
- Run #2: exact same protocol, same conditions
Then compute:
- winner stability (does the winner change?)
- rank stability (does ordering change?)
- margin stability (does the gap shrink or expand?)
SMRA’s minimum is two identical runs for a stability check. (See SMRA Glossary: “Test–retest stability”.)
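A minimal sketch of that test–retest comparison, assuming each run is reduced to a dict of per-option scores (the numbers below are made up):

```python
run_1 = {"Message A": 4.05, "Message B": 3.90}  # illustrative means, Run #1
run_2 = {"Message A": 3.98, "Message B": 3.95}  # illustrative means, Run #2 (identical protocol)

def winner(scores):
    return max(scores, key=scores.get)

def margin(scores):
    top_two = sorted(scores.values(), reverse=True)[:2]
    return top_two[0] - top_two[1]

winner_stable = winner(run_1) == winner(run_2)                  # does the winner change?
rank_stable = (sorted(run_1, key=run_1.get, reverse=True)
               == sorted(run_2, key=run_2.get, reverse=True))   # does ordering change?
margin_shift = margin(run_2) - margin(run_1)                    # the gap shrinks by 0.12 here
```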
Step 4) Run a sensitivity check (change one thing)
Now you test the core question: do small, plausible changes flip the decision?
Design principle: change one thing at a time. You’re diagnosing failure triggers, not “trying different prompts.”
A minimal sensitivity design is a 2×2:
| | Baseline prompt | Perturbed prompt (one change) |
|---|---|---|
| Run 1 | Baseline‑1 | Perturb‑1 |
| Run 2 | Baseline‑2 | Perturb‑2 |
This gives you stability evidence (baseline‑1 vs baseline‑2) and sensitivity evidence (baseline vs perturbation) in a format that’s easy to disclose.
Step 5) Build a minimal perturbation set (3–8 items)
You don’t need 50 perturbations. You need a diagnostic set that covers high-yield failure modes:
- framing / priming
- option ordering
- scale labels / response format
- small wording paraphrases
- context injection/removal
- run settings (randomness)
Rupprecht et al.’s perturbation framework is a useful concrete menu: order reversal, missing “don’t know,” paraphrase/synonyms/typos, priming, and interaction effects. (Source: Rupprecht et al. (2025).)
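One lightweight way to keep the one-change-at-a-time discipline is to treat the perturbation set as a small registry, where every study condition is the locked baseline plus exactly one entry. The names and wording below are illustrative:

```python
# Each entry names the single element that changes relative to the locked baseline.
PERTURBATIONS = {
    "framing":      "neutral setup -> lightly leading setup",
    "option_order": "response options 1..5 -> 5..1",
    "scale_format": "labeled Likert -> numeric-only; drop 'don't know'",
    "paraphrase":   "same meaning, different wording of the question",
    "context":      "add one contextual sentence about the category",
    "run_settings": "low -> medium sampling randomness",
}
# A study condition is (baseline protocol, one perturbation key). Never stack two
# perturbations in the minimal set, or you lose the failure-trigger diagnosis.
```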
Step 6) Compute flip indicators
For each perturbation, compute:
- Conclusion under baseline (using your decision rule)
- Conclusion under perturbation
- Flip? (Y/N)
- Failure trigger label (e.g., “response order reversal,” “paraphrase”)
Then compute: flip rate, rank stability, and margin drift (defined in Section 5).
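In code, the flip table is just the decision rule applied twice per perturbation. The log below is hypothetical:

```python
# Hypothetical per-perturbation log: the same decision rule applied to the
# baseline condition and to each perturbed condition.
results = [
    {"trigger": "response order reversal", "baseline": "A wins", "perturbed": "B wins"},
    {"trigger": "paraphrase",              "baseline": "A wins", "perturbed": "A wins"},
    {"trigger": "context sentence added",  "baseline": "A wins", "perturbed": "too close to call"},
]

for row in results:
    row["flip"] = row["baseline"] != row["perturbed"]     # Y/N per perturbation

flip_rate = sum(row["flip"] for row in results) / len(results)          # 2/3 here
failure_triggers = [row["trigger"] for row in results if row["flip"]]   # what to report
```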
Step 7) Label + disclose in the output
SMRA’s reporting norms emphasize: don’t hide instability. Report what you tested, what held, and what broke. (See SMRA Standards & Ethics.)
At minimum disclose:
- protocol summary + version
- run settings
- perturbations tested
- robustness metrics
- failure triggers
- limitations + intended use (exploratory vs decision-support)
If you can’t disclose population frame, panel construction approach, and validation checks, treat outputs as exploratory. (See SMRA Vendor Evaluation Checklist.)
Checklist: the minimal “SMRA‑lite” sensitivity workflow (quote-ready)
- Define the conclusion (decision rule + margin + thresholds).
- Lock protocol (stimuli, wording, scales, segments, run settings).
- Run test–retest (two identical runs).
- Run sensitivity (3–8 one-at-a-time perturbations; baseline + perturbation, each with 2 repeats).
- Compute flip rate + rank stability + margin drift.
- Report robustness + failure triggers.
- Escalate to benchmarks / fieldwork when the decision is high-stakes or flips occur.
This maps directly to SMRA’s validation workflow: fixed protocol → stability + sensitivity → benchmark/known-truth checks → disclose limitations. (See SMRA Methods & Validation.)
4) What perturbations should you test? A menu for synthetic market research
The goal of a perturbation menu is not to “stress test everything.” It’s to create a minimal, controlled set of changes that reveal whether your conclusion is stable or brittle—and what causes brittleness.
Below is a practical menu you can reuse across study types. It adapts survey-perturbation ideas into synthetic market research workflows.
A. Framing and priming (tests prompt/operator bias risk)
Direct answer: if a neutral vs leading framing changes the decision, your conclusion is vulnerable to operator influence.
- Neutral setup vs “make the case for…” setup
- Add/remove urgency language (“This is very important to my research…”)
Rupprecht et al. include emotional priming as a perturbation class. (Source: Rupprecht et al. (2025).)
SMRA flags prompt/operator bias as a key operational risk and recommends standardized protocols and robustness checks. (See SMRA Standards & Ethics.)
B. Ordering and response format (tests order effects + extraction artifacts)
Direct answer: if reversing option order flips outcomes, your ranking is not decision-grade.
- Reverse answer option order (1→5 becomes 5→1)
- Reverse stimulus order (Message A shown first vs second)
- Swap response format (forced-choice vs “explain then choose”; numeric-only vs labeled Likert; include vs remove “don’t know”)
Dominguez‑Olmedo et al. document ordering and labeling effects in LLM survey responses. (Source: Dominguez‑Olmedo et al. (2023).)
Rupprecht et al. explicitly test response order reversal and missing refusal (“don’t know”). (Source: Rupprecht et al. (2025).)
C. Minor wording changes (tests semantic brittleness)
Direct answer: if a paraphrase flips the decision, you’re measuring prompt sensitivity more than preference.
- Paraphrase the question (same meaning, different wording)
- Synonym replacement (swap a few key words)
- Minimal typos/noise
Rupprecht et al. test synonym replacement, paraphrasing, and typos; paraphrasing can reduce robustness more than synonym changes. (Source: Rupprecht et al. (2025).)
D. Context injection / constraint changes (tests leakage + dependence on supplied context)
Direct answer: if adding one “fact” flips the conclusion, your result may be driven by the context you injected—not the stimulus.
- Add/remove one contextual sentence (e.g., “competitor X is known for Y”)
- Add/remove persona constraints (“assume you are…”)
- Add/remove product category “facts” (pricing norms, common objections)
SMRA explicitly warns about knowledge boundary failures (domain leakage) and recommends restricting context injection and running boundary tests. (See SMRA Methods & Validation.)
E. Parameters / run settings (tests randomness + reproducibility)
Direct answer: if changing randomness settings or sample size flips the conclusion, your outcome is not stable enough to treat as decision-grade.
- Temperature/sampling randomness (low vs medium)
- Sample size (small vs larger)
- Seeds (if supported)
- Model version change (if vendor updates models)
SMRA emphasizes disclosing run settings and enabling comparable re-runs where possible. (See SMRA Methods & Validation.)
Perturbation → what it reveals → common flip pattern → what to do next
| Perturbation | What it reveals | Common flip pattern | What to do next |
|---|---|---|---|
| Neutral vs leading framing | Operator bias vulnerability | Winner changes when framing “pushes” | Standardize prompts; blind comparisons |
| Reverse option order | Order sensitivity / recency effects | Rank flip; threshold shifts | Fix ordering in protocol; report as failure trigger |
| Remove “don’t know” | Forced-response artifacts | More extreme answers; threshold flips | Decide DK policy up front; disclose |
| Scale structure change (odd/even) | Scale dependence | Midpoint effects; confidence flip | Lock scale; interpret with caution |
| Paraphrase question | Semantic brittleness | Segment flip; driver flip | Lock exact wording; test paraphrases in pilot |
| Add one context fact | Context dependence / leakage risk | Winner flips with injected “facts” | Restrict context; run boundary tests |
| Temperature/randomness change | Stochastic instability | Winner changes across settings | Increase repeats; treat as exploratory; benchmark |
| Combined perturbations | Interaction effects | Sudden instability | Expand sensitivity set; escalate validation |
Grounding note: the perturbation set above maps closely to the framework used in Rupprecht et al.’s WVS robustness study. (Source: Rupprecht et al. (2025).)
5) How do you score robustness without building a whole new analytics stack?
You want metrics that (1) are easy to compute in a spreadsheet, (2) map to decision risk, and (3) are easy to disclose.
SMRA’s language is useful here: report robustness and failure triggers (what causes instability). (See SMRA Methods & Validation.)
1) Flip rate (the headline metric)
Direct answer: flip rate tells you how often a small change flips your decision.
Define:
- Let P = number of perturbations tested (e.g., 6)
- Let F = number of perturbations that change the conclusion (based on your decision rule)
Flip rate = F / P
Two variants (choose one and disclose):
- Strict flip rate: count a flip if any repeat under that perturbation yields a different conclusion
- Averaged flip rate: average repeats per condition first, then compute conclusion and flip
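A spreadsheet-sized sketch of both variants (the conclusions and repeat counts below are hypothetical):

```python
baseline_conclusion = "A wins"

# Per-perturbation conclusions from each repeat under that perturbation.
per_repeat = {
    "order reversal": ["B wins", "B wins"],
    "paraphrase":     ["A wins", "B wins"],   # repeats disagree
    "context added":  ["A wins", "A wins"],
}

# Strict: a perturbation counts as a flip if ANY repeat disagrees with baseline.
strict_flip_rate = sum(
    any(c != baseline_conclusion for c in repeats)
    for repeats in per_repeat.values()
) / len(per_repeat)                                   # 2/3

# Averaged: average the scores across repeats first, apply the decision rule once
# per perturbation, then compare (pre-averaged conclusions shown for brevity).
averaged = {"order reversal": "B wins", "paraphrase": "A wins", "context added": "A wins"}
averaged_flip_rate = sum(
    c != baseline_conclusion for c in averaged.values()
) / len(averaged)                                     # 1/3
```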
2) Rank stability (winner vs “shape of preference”)
Winner flips are obvious; rank instability can be subtle but still risky.
Spreadsheet-friendly options:
- Spearman rank correlation between baseline ranking and perturbed ranking (3+ options)
- Pairwise order retention (works even for 2 options): compute all option pairs and measure % of pairs that keep ordering
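Both options fit in a few lines. The rankings below are hypothetical, and the Spearman line assumes scipy is available; the pairwise check needs nothing beyond the standard library.

```python
from itertools import combinations
from scipy.stats import spearmanr  # only needed for option 1

# Hypothetical mean scores for three options under baseline and one perturbation.
baseline  = {"Concept A": 4.1, "Concept B": 3.8, "Concept C": 3.5}
perturbed = {"Concept A": 3.9, "Concept B": 4.0, "Concept C": 3.4}
options = list(baseline)

# Option 1: Spearman rank correlation between the two score vectors (3+ options).
rho, _ = spearmanr([baseline[o] for o in options], [perturbed[o] for o in options])

# Option 2: pairwise order retention, the share of option pairs that keep their ordering.
pairs = list(combinations(options, 2))
retained = sum(
    (baseline[a] > baseline[b]) == (perturbed[a] > perturbed[b]) for a, b in pairs
)
pairwise_retention = retained / len(pairs)   # 2 of 3 pairs keep their order here
```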
3) Margin drift (how close are we to a flip?)
A conclusion can be “stable” but only because it’s barely above the threshold or barely ahead.
- Margin = score(winner) − score(runner-up)
- Margin drift = margin(perturbed) − margin(baseline)
SMRA encourages avoiding false precision: margin and rank stability often matter more than absolute scores when decision-grade validity is uncertain. (See SMRA Standards & Ethics.)
A simple “traffic light” robustness scorecard
| Classification | Rule of thumb | How to label output |
|---|---|---|
| Green | Flip rate = 0 on minimal set AND stable across test–retest | Decision-support candidate (still benchmark) |
| Yellow | Flips only when margin is tiny or under 1–2 edge perturbations | Exploratory with caveats; consider tightening protocol |
| Red | Multiple flips across common perturbations OR unstable test–retest | Exploratory only; escalate validation |
Important: Green is not “true.” Green is “not obviously brittle under this small test.” Accuracy still requires benchmarks/fieldwork where appropriate. (See SMRA Glossary: “Reliability vs accuracy”.)
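If you want the scorecard as a function, here is one way to encode the rules of thumb above (the exact cutoffs are judgment calls, not SMRA-defined thresholds):

```python
def robustness_label(flip_rate: float,
                     test_retest_stable: bool,
                     flips_only_at_tiny_margin: bool) -> str:
    """Map minimal stability + sensitivity results onto the traffic-light scorecard.

    Green means "not obviously brittle under this small test", not "true";
    accuracy still needs benchmarks/fieldwork where appropriate.
    """
    if test_retest_stable and flip_rate == 0:
        return "green: decision-support candidate (still benchmark)"
    if test_retest_stable and flips_only_at_tiny_margin:
        return "yellow: exploratory with caveats; consider tightening protocol"
    return "red: exploratory only; escalate validation"

# Illustrative: one perturbation tested, one decisive flip, stable test-retest -> red.
label = robustness_label(flip_rate=1.0, test_retest_stable=True,
                         flips_only_at_tiny_margin=False)
```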
6) Worked example: a toy message test where the winner flips
This example is fabricated but realistic. The point is to show the workflow and the disclosure, not to claim real consumer truth.
Study setup (locked protocol)
- Study type: message test (2 messages)
- Segments:
- Segment 1: “Budget-conscious families”
- Segment 2: “Time-poor young professionals”
- Metrics (1–5 Likert): clarity, credibility, differentiation
- Decision rule: compute a simple average across the three metrics. Winner = higher overall average by ≥0.10. If margin < 0.10, label “too close to call.”
Baseline prompt (Protocol v1.0)
You are a simulated respondent who represents Segment 1 (Budget-conscious families).
Read two messages for a new grocery delivery service.
Message A: "Fresh groceries delivered for less. Transparent prices, no surprises."
Message B: "Premium groceries delivered fast. Curated quality you can trust."
Rate EACH message on:
1) Clarity
2) Credibility
3) Differentiation
Use a 1–5 scale where:
1 = Strongly disagree
2 = Disagree
3 = Neutral
4 = Agree
5 = Strongly agree
Return your answers as numbers only.
Perturbed prompt (one small change)
We change only one element: reverse the order of the response options (same labels, different order).
Before (baseline scale block):
Use a 1–5 scale where:
1 = Strongly disagree
2 = Disagree
3 = Neutral
4 = Agree
5 = Strongly agree
After (perturbed scale block, order reversed):
Use a 1–5 scale where:
5 = Strongly agree
4 = Agree
3 = Neutral
2 = Disagree
1 = Strongly disagree
Why test this? Because ordering and option structure effects have been documented in survey-style LLM elicitation work. (See Dominguez‑Olmedo et al. (2023) and Rupprecht et al. (2025).)
Runs
We run a minimal 2×2:
- Baseline: 2 repeats
- Perturbation: 2 repeats
Results summary (illustrative)
Segment 1: Budget-conscious families
| Condition | Message A avg | Message B avg | Winner |
|---|---|---|---|
| Baseline (mean of 2 repeats) | 4.02 | 3.93 | A (margin +0.09) → too close to call |
| Reversed option order (mean of 2 repeats) | 3.88 | 4.01 | B (margin +0.13) |
Segment 2: Time-poor young professionals
| Condition | Message A avg | Message B avg | Winner |
|---|---|---|---|
| Baseline (mean of 2 repeats) | 3.74 | 3.89 | B (margin +0.15) |
| Reversed option order (mean of 2 repeats) | 3.80 | 3.84 | B (margin +0.04) → too close to call |
Conclusion and flip detection
- Segment 1 flips from “too close to call / slight A lean” to B wins under a minimal ordering change.
- Segment 2 keeps the same winner, but the margin collapses from +0.15 to +0.04, below the 0.10 decision threshold, so under the decision rule it becomes “too close to call” (a confidence flip).
Flip indicators:
- Flip rate: 1 flip out of 1 perturbation tested → 1.0 (red for Segment 1)
- Failure trigger: response option order reversal
- Decision action: do not claim “A wins with families.” Treat as exploratory; tighten protocol; add benchmark/fieldwork for decision-grade selection.
This is the “report robustness + failure triggers” behavior SMRA recommends before decision-grade usage. (See SMRA Methods & Validation.)
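For completeness, here is the Segment 1 flip detection from the tables above, run through the same kind of pre-registered rule. The scores are the illustrative means; the helper function is ours, not part of any vendor tooling.

```python
def decide(scores, margin=0.10):
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    lead = ranked[0][1] - ranked[1][1]
    return ranked[0][0] if lead >= margin else "too close to call"

# Segment 1, mean of 2 repeats per condition (illustrative numbers from the tables).
baseline  = {"Message A": 4.02, "Message B": 3.93}
perturbed = {"Message A": 3.88, "Message B": 4.01}   # response option order reversed

baseline_call  = decide(baseline)    # "too close to call" (A leads by only 0.09)
perturbed_call = decide(perturbed)   # "Message B" (leads by 0.13)

flipped = baseline_call != perturbed_call   # True -> failure trigger: order reversal
```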
7) What should you do when you find a conclusion flip?
A flip is not just a “method detail.” It’s a governance event.
1) Stop presenting the conclusion as stable
If your winner flips under a minimal perturbation, you cannot responsibly present it as “the answer.” That’s the point of the test.
2) Label the output as exploratory (and say why)
“This result is exploratory. Under a controlled sensitivity check (response option order reversal), the message winner flipped. We are not treating the winner as decision-grade without additional validation.”
3) Tighten the protocol (reduce ambiguity)
- Lock exact wording (no paraphrases across runs)
- Lock response option order
- Fix scale labels and extraction rules
- Remove unnecessary persona flourishes
- Restrict context injection
4) Add a benchmark (or targeted fieldwork) for decision-grade use
If the decision matters (spend, pricing, positioning), you need at least one benchmark or known-truth check where feasible—not just stability. (See SMRA Methods & Validation.)
Benchmark options:
- small human sample (directional)
- historical back-testing where outcomes are known
- published statistics (for bounded questions)
5) If flips concentrate in one segment, downgrade segment claims
- avoid incidence claims (“X% prefer…”)
- avoid over-interpreting small gaps
- use segment insights as prioritization for fieldwork
6) If flips are tied to leading phrasing, treat it as operator-bias risk
When leading setup changes the conclusion, don’t “pick the better prompt.” That’s how prompt laundering happens. Instead:
- require a standard prompt library
- use blinded comparisons where possible
- require pre-specified decision rules for pilots
SMRA flags prompt/operator bias risk and encourages safeguards aligned with transparency and repeatability. (See SMRA Standards & Ethics.)
8) What should you disclose so others can evaluate your sensitivity testing?
If sensitivity testing stays in your notebook, it doesn’t govern anything. Disclosure is what makes the method legible, auditable, and comparable.
SMRA’s reporting norms emphasize: include protocol summary, run settings, perturbations tested, robustness metrics, failure triggers, and limitations. (See SMRA Methods & Validation.)
Reporting norms (minimum viable disclosure)
- Protocol summary (versioned): stimuli, questionnaire, segment definitions, population frame, scoring/aggregation method.
- Run settings: panel size per segment, number of runs and repeats, randomness settings, model versions (if disclosed).
- Stability results (test–retest): winner consistency, rank stability, variance / margin drift.
- Sensitivity design: perturbations tested, “one change at a time” confirmation.
- Robustness metrics: flip rate, rank stability, margin drift.
- Failure triggers: which perturbations caused flips and how the conclusion changed.
- Benchmarking plan (when decision-grade): what external check you’ll use.
- Limitations + intended use label: exploratory vs decision-support; what not to conclude.
Consistency matters: consistent wording for disclosure fields improves comparability across studies and vendors. (See SMRA Vendor Evaluation Checklist.)
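The same fields can travel with the results as a single disclosure record. The sketch below is illustrative (field names and values are ours, not an SMRA reporting schema):

```python
DISCLOSURE = {
    "protocol": {"id": "MSGTEST_v1.2", "stimuli": "2 messages", "segments": 2,
                 "scoring": "mean of 3 Likert items"},
    "run_settings": {"panel_size_per_segment": 200, "repeats": 2,
                     "randomness": "low", "model_version": "as disclosed"},
    "stability": {"winner_consistent": True, "rank_stable": True},
    "sensitivity": {"perturbations": ["response option order reversal"],
                    "one_change_at_a_time": True},
    "robustness": {"flip_rate": 1.0,
                   "failure_triggers": ["response option order reversal"]},
    "benchmark_plan": "small directional human sample before decision-grade use",
    "intended_use": "exploratory",
    "limitations": "winner flipped under a minimal ordering change",
}
```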
Stretch: Sensitivity testing as governance (preventing “prompt laundering”)
“Sensitivity testing” sounds technical. In practice, it’s a governance control against a predictable organizational failure mode:
Prompt laundering: shopping for the prompt that produces the conclusion you already want—then presenting that conclusion as if it were a stable research finding.
This can happen without bad intent. A stakeholder says “can you try rewording it?” A researcher iterates. The output changes. The team picks the version that aligns with expectations. The chosen prompt becomes invisible—and the conclusion becomes “the finding.”
SMRA’s standards treat prompt/operator bias as an integrity risk and recommend standardized protocols, robustness checks, and disclosure. (See SMRA Standards & Ethics.)
A simple internal policy that works:
- Any synthetic conclusion used for a decision must include:
- protocol version
- test–retest result
- at least one sensitivity check
- a disclosure label (or equivalent)
- a benchmark plan when stakes are high
That’s “methods as governance” in one sentence: you control the degrees of freedom that otherwise allow the organization to manufacture certainty.
9) FAQ
How many perturbations are “enough”?
For a minimal gate, 3–8 perturbations is usually enough to expose brittleness. Start with high-yield ones: option order reversal, paraphrase, context add/remove, and one run-setting change. Expand if the decision is high-stakes.
How is sensitivity different from test–retest stability?
Test–retest stability asks: “Do I get similar results if I run the same locked protocol twice?”
Sensitivity testing asks: “Do small, plausible input changes shift results—and do they flip the conclusion?”
You need both. (See SMRA Glossary.)
If results are stable, are they true?
No. Stability is reliability, not accuracy. A system can be reliably wrong. Decision-grade usage requires at least one benchmark or known-truth check where feasible. (See SMRA Glossary: “Reliability vs accuracy”.)
Can we average across prompts to “smooth out” sensitivity?
Sometimes—but be careful. Averaging can hide failure triggers and reduce auditability. If you do it: disclose exactly how prompts were varied, report flip rates before averaging, and keep the prompt set fixed for repeatability.
When do we need real fieldwork?
When the decision is high stakes, when outputs make incidence claims, when conclusions flip under minimal sensitivity tests, or when benchmarks show systematic misalignment. (See SMRA Standards & Ethics.)
10) Short conclusion
Sensitivity testing isn’t an academic luxury. It’s a baseline reliability gate that stops simulation dressed as fact.
If you adopt one habit, adopt this: lock protocol, run test–retest, run a sensitivity check, and report robustness plus failure triggers. Then use benchmarks and targeted fieldwork when the decision demands decision-grade evidence. (See SMRA Methods & Validation.)
References
- SMRA — Methods & Validation
- SMRA — Standards & Ethics
- SMRA — Glossary
- SMRA — Vendor Evaluation Checklist
- SMRA — How to Choose and Evaluate Synthetic Market Research Vendors
- SMRA — Resources
- Dominguez‑Olmedo, Hardt & Mendler‑Dünner (2023), arXiv:2306.07951
- Tjuatja et al., TACL 2024
- Rupprecht, Ahnert & Strohmaier (2025), arXiv:2507.07188