
Vendor Evaluation Checklist

Make synthetic market research vendors comparable on credibility, decision safety, and ethics: population framing, grounding, validation, disclosure, auditability, privacy posture, and misuse safeguards.


This checklist aligns with the association’s emphasis on transparent disclosure, validation discipline, and governance. Use it alongside the Standards & Ethics baseline and related templates on this site.

How to use this checklist
  1. Start with Gate 0 (non-negotiable disclosures). If a vendor cannot provide these, treat the product as exploratory only.
  2. Run a structured demo using Sections 1–10. Ask for evidence, not slogans.
  3. Require a pilot with repeatable study design and benchmark comparisons (Section 11).
  4. Score and decide using the rubric (Appendix). Keep your paper trail for governance and procurement.

Gate 0: Minimum disclosure pack (non-negotiable)

Require the vendor to provide the following before you schedule a pilot or sign an agreement. If they cannot, you do not have a basis to evaluate validity, privacy risk, or comparability.

  • Population frame statement: Who do results represent (geography, language, time window, age range), who is excluded, and why. Include how segments/quotas/weighting are defined.
  • Method card / study disclosure example: A completed “study disclosure label” (or equivalent) for a real client study showing panel design, grounding inputs, workflow, validation performed, limitations, privacy posture, and reproducibility notes. (If the vendor refuses to share an example, require a redacted one.)
  • Grounding and provenance summary: What sources were used (public stats, panels, first-party client data, third-party data), what transformations were applied, and what retention/deletion rules apply.
  • Validation pack: Test-retest stability results, sensitivity tests, at least one external benchmark (or a plan to run one with you), and an explicit “known failure modes” document.
  • Auditability description: What is logged (prompts, seeds/settings if applicable, model versions, grounding inputs), how runs can be reproduced, and what is available to customers vs internal-only.
  • Security and privacy controls: Access controls, encryption posture, isolation between customers, incident response, and red-teaming practices.
  • Use policy: Prohibited and restricted use cases (e.g., sensitive traits, minors, political persuasion, or high-stakes domains), plus enforcement mechanisms.

Procurement rule of thumb: if the vendor cannot disclose population frame, panel construction approach, and validation checks, then outputs should be labelled and treated as exploratory/hypothesis-generating only (not decision-grade).
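
To make the method card / study disclosure requirement above concrete, here is a minimal sketch of a disclosure label represented as structured data. The field names and example values are illustrative assumptions, not a published standard or any vendor's actual schema.

```python
from dataclasses import dataclass

# Illustrative sketch of a study disclosure label as structured data.
# Field names and example values are assumptions, not a published standard.
@dataclass
class StudyDisclosure:
    population_frame: str             # who results represent (geography, language, time window)
    exclusions: str                   # who is excluded and why
    panel_design: str                 # segment/quota/weighting approach
    grounding_inputs: list[str]       # e.g. public stats, panel data, first-party client data
    workflow: str                     # study type and run protocol
    validation_performed: list[str]   # e.g. test-retest, sensitivity, external benchmark
    known_limitations: list[str]
    privacy_posture: str              # retention, isolation, training-use policy
    reproducibility_notes: str        # model version, run settings, logging available to customer

example = StudyDisclosure(
    population_frame="UK adults 18-65, English, Q1 2025",
    exclusions="Non-English speakers; no claim to represent specific individuals",
    panel_design="Quota-matched to national census age/gender/region distributions",
    grounding_inputs=["public statistics", "aggregated category survey data"],
    workflow="Concept test, fixed stimuli and wording, 5 repeat runs",
    validation_performed=["test-retest variance report", "benchmark vs prior fieldwork"],
    known_limitations=["weak coverage of low-income rural segments"],
    privacy_posture="No personal data about identifiable individuals; 90-day log retention",
    reproducibility_notes="Run config and logs exportable; model version pinned per study",
)
```

If a vendor cannot populate something like every field above for a real (or redacted) study, that gap is itself a Gate 0 finding.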

1. Category clarity: what exactly is being sold?

The fastest way to get burned is to buy a “synthetic research” product without pinning down whether you are buying (a) prompt-based personas, (b) a statistically grounded synthetic panel, (c) a twin-like simulation layer, (d) an interface on top of an LLM, or (e) an agency workflow that mixes human and synthetic components.

Questions to ask

  • Is your core product a synthetic panel, a persona generator, digital twins, or a research workflow tool?
  • What is simulated vs measured? What parts (if any) are derived from real respondent data in the specific study?
  • What claims do you explicitly not make (e.g., “this represents real individuals,” “this is a replacement for fieldwork”)?

Evidence to request

  • Example deliverables labelled “synthetic” with clear method disclosure and limitations.
  • Architecture overview at a level suitable for audit (what components exist; what goes in/out; what is persistent vs ephemeral).

Red flags

  • “It’s proprietary” used as a blanket refusal to disclose essentials (population frame, grounding class, validation approach).
  • Marketing language that implies “digital twin” as a magic label rather than a testable modelling claim.

2. Population framing & coverage: who does this represent?

Synthetic market research only has meaning if “who it represents” is explicit. Without a defensible population frame, you cannot interpret results or compare vendors.

Questions to ask

  • What is the target population (country/region, language, time window)?
  • How are segments defined (demographics, psychographics, behaviours), and what evidence supports those definitions?
  • What is your coverage statement: where are you reliable, and where are you weak or out-of-domain?
  • Do personas/panels drift over time? If yes: how, how often, and what is held constant?

Evidence to request

  • A written population frame, plus an example of how it is attached to every study output.
  • A description of how quotas/weighting are implemented (if any) and how you validate that distributions match the intended frame.
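
One way to validate the last point, that generated panel distributions actually match the intended frame, is a simple proportion comparison. The target shares, counts, and tolerance below are illustrative assumptions.

```python
# Sketch: compare a synthetic panel's demographic mix against the target frame.
# Target shares, panel counts, and tolerance are illustrative assumptions.
target_shares = {"18-34": 0.30, "35-54": 0.38, "55+": 0.32}
panel_counts = {"18-34": 270, "35-54": 400, "55+": 330}  # e.g. from the vendor's panel export

total = sum(panel_counts.values())
tolerance = 0.02  # acceptable absolute deviation per cell

for group, target in target_shares.items():
    observed = panel_counts.get(group, 0) / total
    gap = abs(observed - target)
    status = "OK" if gap <= tolerance else "OUT OF TOLERANCE"
    print(f"{group}: target {target:.2f}, observed {observed:.2f}, gap {gap:.3f} -> {status}")
```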

Red flags

  • “Global consumers” claims without region/language validation evidence.
  • No clarity on what changes between runs (freshness updates, retrieval sources, model versioning, etc.).

3. Personas, twins, and comparability: require a clear persona specification (including SPL)

One of the largest sources of mis-selling in this category is the word “persona.” It can mean anything from a single prompt to a persistent agent with memory, state, and longitudinal coherence. To make offerings comparable, require vendors to state what kind of persona they provide and what capabilities are actually implemented.

As a practical convention, you can ask vendors to declare a Synthetic Persona Level (SPL) for each persona type they sell, and to provide evidence for that level. See: The Ten Levels of Synthetic Personas (SPL). Treat SPL as a claim that must be validated, not as a badge.

Minimum persona specification (what you require in writing)

  • Persona type: segment-level persona vs twin-like individual proxy vs respondent generator.
  • Representation target: archetype, micro-cohort, or individual-like proxy (and whether you ever claim to represent specific people).
  • SPL declaration: the SPL level(s) supported, per the referenced SPL ladder, plus what features are implemented to justify it.
  • Memory model: none / retrieval memory / structured episodic + decay / other (and what is persisted).
  • Temporal context: does “today” exist (context streaming), and how do you prevent leakage of irrelevant or privileged information?
  • State variables: what latent state exists (goals, beliefs, affect), how it evolves, and whether it is auditable.
  • World connections: what external tools/feeds are accessed (if any), with logging and allow/deny controls.
  • Interaction model: single-turn Q&A vs multi-turn interviews vs agentic workflows vs multi-agent simulations.

SPL reality-check questions (designed to expose “prompt dressed as product”)

  • If you claim SPL 3+ (memory): show how memory is stored, retrieved, and constrained. How do you prevent perfect recall artefacts?
  • If you claim SPL 4+ (context streaming): what feeds are allowed, what is blocked, and how is provenance logged?
  • If you claim SPL 5+ (state): what are the state variables, and can we audit state transitions across a run?
  • If you claim SPL 8 (closed-loop runtime): what is the runtime loop, and what artefacts can a customer inspect (not just your internal team)?
  • If you claim SPL 9–10 (social/multi-agent): can you show emergent dynamics are stable under reruns, not just theatrical transcripts?

Red flags

  • “Our personas are digital twins” with no operational definition, no calibration story, and no restrictions on person-like inference.
  • Refusal to specify whether personas are prompt-only vs persistent agents with memory/state.
  • Claims of “SPL X” (or equivalent) without any artefacts that allow independent evaluation.

4. Grounding & data provenance: what is this built on, and is it legitimate?

Synthetic outputs do not eliminate provenance risk; they can intensify it by obscuring data lineage behind fluent narratives. Require auditable provenance.

Questions to ask

  • What are your grounding inputs: public statistics, curated corpora, survey microdata, first-party client data, third-party data?
  • What is your position on using personal data about identifiable individuals? If you do: what is your lawful/ethical basis and purpose limitation?
  • What is the “purpose distance” between original data collection and your modelling use?
  • Do you train/improve your models on customer interactions or outputs? Is this opt-in or opt-out?

Evidence to request

  • A provenance statement template you can attach to every study (“what data built this, what changed, what is retained”).
  • Data flow diagram and retention schedule (including logs, prompts, outputs, embeddings, and derived features).

Red flags

  • “Synthetic means no privacy risk” reasoning.
  • Unclear or evasive answers about whether customer data trains shared models.

5. Validation: can they prove reliability (not just plausibility)?

The key ethical failure mode in synthetic research is over-claiming: treating simulation as measurement. Vendors must show validation evidence appropriate to your intended use.

Questions to ask

  • What is your test-retest stability on a standard study design (same stimuli, same wording, same settings)?
  • How sensitive are results to prompt/context changes (controlled perturbations)?
  • What external benchmarks have you run (public stats, known-truth tasks, fieldwork comparisons), and can we replicate them?
  • What are your documented failure modes (domains, populations, question types) and how do you prevent misuse?

Evidence to request

  • A stability report (variance ranges, not cherry-picked “nice” examples).
  • Benchmark results with methodology sufficient for replication (not just screenshots).
  • A “known-truth” task pack relevant to your domain (or willingness to run one during pilot).
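
A stability report can be as simple as variance ranges across repeat runs of a locked study configuration. The sketch below assumes each run returns a top-box share per question; the numbers are illustrative placeholders, not real vendor output.

```python
from statistics import mean, pstdev

# Sketch: summarise test-retest stability across repeat runs of the same locked study.
# Run results are illustrative placeholders.
runs = {
    "q1_concept_appeal": [0.62, 0.65, 0.61, 0.64, 0.63],
    "q2_purchase_intent": [0.41, 0.48, 0.39, 0.52, 0.44],
}

for question, values in runs.items():
    spread = max(values) - min(values)
    print(
        f"{question}: mean={mean(values):.3f}, sd={pstdev(values):.3f}, "
        f"range={min(values):.2f}-{max(values):.2f} (spread {spread:.2f})"
    )
```

A wide spread on a metric the vendor presents as stable is exactly what cherry-picked examples hide.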

Red flags

  • Validation framed as “look how human it sounds.”
  • No willingness to run a blinded benchmark against a small human sample where feasible.

6. Research workflow integrity: do they support real study design?

Synthetic market research is not a chat demo. It is a research workflow that must behave like a measurement system: fixed stimuli, fixed wording, repeatable settings, and disclosed aggregation.

Questions to ask

  • Which study types are supported (concept tests, message tests, pricing exploration, scenario simulation)?
  • How are responses aggregated (panel-level statistics, distributions, uncertainty indicators, segmentation cuts)?
  • Can you lock stimuli, wording, and run settings to enable repeatability?
  • What controls exist to reduce operator prompting bias?

Evidence to request

  • A protocol template and an example of a repeatable run configuration.
  • Documentation of aggregation logic and any weighting/normalisation steps.
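
As a sketch of what a repeatable run configuration could capture (the keys below are illustrative assumptions, not a vendor schema), the point is that everything capable of changing results is pinned and fingerprintable:

```python
import hashlib
import json

# Sketch of a locked run configuration: everything that could change results is pinned.
# Keys and values are illustrative assumptions.
run_config = {
    "study_type": "concept_test",
    "stimuli_ids": ["concept_A_v3", "concept_B_v3"],
    "question_wording_version": "2025-01-15",
    "model_version": "vendor-model-2025.02",
    "run_settings": {"temperature": 0.7, "repeats": 5, "seed": 12345},
    "population_frame_id": "uk_adults_18_65_q1_2025",
    "aggregation": "quota-weighted mean with 95% interval",
}

# A stable hash of the config makes it easy to verify two teams ran the same study.
config_hash = hashlib.sha256(
    json.dumps(run_config, sort_keys=True).encode("utf-8")
).hexdigest()
print("run config fingerprint:", config_hash[:16])
```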

Red flags

  • “We just ask the model” with no stable protocol, controls, or variance reporting.

7. Bias, fairness & representational harm

Bias is not limited to offensive outputs; it includes systematic representational gaps where certain groups are mis-modelled or erased. Require subgroup evaluation and a coverage statement.

Questions to ask

  • How do you evaluate representational coverage across demographics/segments relevant to the population frame?
  • How do you detect and mitigate stereotyping or narrative harm in persona outputs?
  • Do you provide “do not use for X group / X topic” constraints where reliability is weak?

Evidence to request

  • Bias assessment documentation and examples of mitigations (not just principles).
  • Segment-level benchmark results where feasible.
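
Segment-level benchmark results can be reduced to a simple gap table. The sketch below compares synthetic and benchmark proportions per segment and flags large gaps; the values and the flag threshold are illustrative assumptions.

```python
# Sketch: flag segments where synthetic results diverge from a benchmark.
# Values and the flag threshold are illustrative assumptions.
benchmark = {"urban_18_34": 0.55, "rural_55_plus": 0.30, "suburban_35_54": 0.44}
synthetic = {"urban_18_34": 0.57, "rural_55_plus": 0.41, "suburban_35_54": 0.45}

threshold = 0.05  # absolute gap that triggers a "do not rely on" note for that segment

for segment, bench in benchmark.items():
    gap = abs(synthetic[segment] - bench)
    flag = "REVIEW" if gap > threshold else "ok"
    print(f"{segment}: benchmark {bench:.2f}, synthetic {synthetic[segment]:.2f}, gap {gap:.2f} [{flag}]")
```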

Red flags

  • Refusal to discuss subgroup performance (or pretending it is always “neutral”).
  • “We removed bias” claims with no measurement plan.

8. Privacy posture & security: “synthetic” is not automatically anonymous

You are buying a system that can produce large volumes of human-like data. That creates privacy and security risks whether or not you store names. Require explicit threat modelling and testing.

Questions to ask

  • What privacy tests do you run (e.g., uniqueness checks, membership inference-style probes, red teaming)?
  • What data is stored (inputs, prompts, outputs, embeddings, logs), for how long, and who can access it?
  • How is customer data isolated? What prevents cross-customer leakage?
  • Do you have controls for sensitive topics and re-identification attempts?
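
As a sketch of the uniqueness checks referred to above, the snippet below counts synthetic records that are unique on a set of quasi-identifiers; the records and chosen identifiers are illustrative assumptions.

```python
from collections import Counter

# Sketch: count synthetic records that are unique on quasi-identifier combinations.
# Records and the chosen quasi-identifiers are illustrative assumptions.
records = [
    {"age_band": "35-44", "region": "North", "occupation": "teacher"},
    {"age_band": "35-44", "region": "North", "occupation": "teacher"},
    {"age_band": "25-34", "region": "South", "occupation": "nurse"},
    {"age_band": "55-64", "region": "East", "occupation": "surgeon"},
]
quasi_identifiers = ("age_band", "region", "occupation")

combos = Counter(tuple(r[k] for k in quasi_identifiers) for r in records)
unique_combos = [combo for combo, count in combos.items() if count == 1]

print(f"{len(unique_combos)} of {len(combos)} combinations appear only once")
# Unique combinations in synthetic output are not automatically identifying,
# but they are where a re-identification review should start.
```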

Evidence to request

  • Security overview (access control, encryption, incident response) and audit artefacts if available.
  • Retention/deletion policy that explicitly covers prompts and outputs.

Red flags

  • Privacy described as a marketing property rather than a tested property.
  • Vague answers about prompt/output retention or use for training.

9. Auditability & reproducibility

If two teams cannot re-run the same method and get comparable results (within reported variance), you do not have a research instrument; you have a story generator.

Questions to ask

  • What is logged per run (model version, run settings, prompts, grounding sources, aggregation steps)?
  • Can customers export run artefacts for internal governance reviews?
  • How are model updates handled and communicated, and how is drift tracked?

Evidence to request

  • An example audit log and a reproducibility guide for re-running a study configuration.
  • Change logs describing what changed between versions and expected impact on outputs.

Red flags

  • “We can’t share prompts/settings” with no alternative method for reproducibility.
  • No drift monitoring despite frequent model or data updates.
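
Drift tracking does not need to be elaborate. The sketch below compares answer distributions for the same locked study before and after a model update using total variation distance; the distributions and alert threshold are illustrative assumptions.

```python
# Sketch: compare answer distributions for the same locked study across model versions.
# Distributions and the alert threshold are illustrative assumptions.
before = {"would_buy": 0.42, "maybe": 0.35, "would_not_buy": 0.23}
after = {"would_buy": 0.36, "maybe": 0.37, "would_not_buy": 0.27}

# Total variation distance: half the sum of absolute differences across categories.
tvd = 0.5 * sum(abs(before[k] - after[k]) for k in before)
threshold = 0.05

print(f"total variation distance across model update: {tvd:.3f}")
if tvd > threshold:
    print("Drift exceeds threshold: vendor should explain the change and update disclosures.")
```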

10. Misuse safeguards & governance: prevent Cambridge-Analytica-style dynamics

Synthetic market research can lower the cost of iterative profiling and message testing. Without hard boundaries, it can industrialise manipulation. Your procurement process should explicitly test whether the vendor has enforceable safeguards, not just aspirational ethics.

Questions to ask

  • What uses are prohibited (e.g., misinformation, exploitation of vulnerability, discriminatory targeting, sensitive trait inference)?
  • What are the restricted domains requiring enhanced review (minors, health, financial distress, addiction-linked products, elections)?
  • What enforcement exists (policy gates, monitoring, contractual clauses, customer offboarding)?
  • How is accountability assigned across vendor, client, and integrators?

Evidence to request

  • Use policy and enforcement description (including escalation and incident handling).
  • Example disclosures and disclaimers that are embedded in the product (not buried in terms).

Red flags (deal-breakers)

  • Encouraging deception, manipulation, or targeted harassment.
  • Claiming the system describes specific real individuals, or implying “twin” equals a person-like proxy without strict consent boundaries.
  • “Black box by design” posture that prevents audit and governance.

11. Pilot plan: the minimum test you should run

Do not rely on a single demo. Require a pilot that forces repeatability and benchmark discipline. A minimal pilot should include:

  1. Two repeatable study types (e.g., a concept test and a message test), with fixed stimuli, fixed wording, and fixed run settings.
  2. Test-retest runs: run each study multiple times and report variance. Require the vendor to explain variance drivers.
  3. Sensitivity tests: apply controlled perturbations (minor wording changes, context constraints) and measure how conclusions shift.
  4. At least one benchmark: compare outputs against a small human sample, public stats, or a known-truth dataset where feasible.
  5. Disclosure package: every pilot output must include population frame, grounding class, validation performed, limitations, and reproducibility notes.
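
Steps 2 and 3 reduce to one question: does the directional conclusion survive repeat runs and controlled perturbations? A minimal sketch of that comparison, with illustrative scores:

```python
# Sketch: check whether the directional conclusion (which concept wins) survives
# repeat runs and a controlled wording perturbation. All scores are illustrative.
baseline_runs = [
    {"concept_A": 0.61, "concept_B": 0.52},
    {"concept_A": 0.63, "concept_B": 0.50},
    {"concept_A": 0.60, "concept_B": 0.54},
]
perturbed_runs = [  # same study, minor wording change
    {"concept_A": 0.58, "concept_B": 0.55},
    {"concept_A": 0.56, "concept_B": 0.57},
]

def winner(run):
    return max(run, key=run.get)

baseline_winners = {winner(r) for r in baseline_runs}
perturbed_winners = {winner(r) for r in perturbed_runs}

stable = baseline_winners == {"concept_A"} and perturbed_winners == {"concept_A"}
print("baseline winners:", baseline_winners)
print("perturbed winners:", perturbed_winners)
print("directionally stable:", stable)
```

If the winner flips under a minor wording change, require the vendor to explain why before treating any conclusion as decision-grade.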

Pilot acceptance criteria (example): The vendor must demonstrate stable directional conclusions under repeat runs, provide transparent disclosures, and show at least one credible benchmark alignment for a relevant task. If they cannot, treat the tool as hypothesis generation only.

Appendix: Scoring rubric (optional, but recommended)

Score each category 0–2 (0 = absent/evaded, 1 = partial/opaque, 2 = complete/testable). Use this to compare vendors consistently.

Category | 0 | 1 | 2
Population frame & coverage | Unspecified | Stated, weak justification | Stated + defensible + coverage limits
Persona specification (incl. SPL) | Marketing-only | Partial spec | Complete spec + evidence + tests
Grounding & provenance | Opaque | High-level only | Auditable provenance + retention rules
Validation & benchmarking | None / anecdotes | Internal checks only | Stability + sensitivity + external benchmark
Disclosure & limitations | Absent | Partial | Standard disclosure per study + clear limits
Auditability & reproducibility | No logs / no reruns | Limited | Exportable logs + comparable re-runs
Privacy posture & security | Hand-waving | Controls without tests | Controls + threat model + privacy testing
Misuse safeguards | None | Policy only | Policy + enforcement + escalation
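
Applying the rubric can be automated trivially. The sketch below totals scores per vendor and flags zero-scored categories; the vendor names and scores are illustrative assumptions.

```python
# Sketch: total rubric scores per vendor and flag categories scored 0.
# Vendors and scores are illustrative assumptions; each category is scored 0-2.
categories = [
    "population_frame", "persona_spec", "grounding_provenance", "validation",
    "disclosure", "auditability", "privacy_security", "misuse_safeguards",
]
scores = {
    "Vendor A": [2, 1, 2, 1, 2, 2, 1, 1],
    "Vendor B": [1, 0, 1, 0, 1, 1, 1, 1],
}

for vendor, values in scores.items():
    total = sum(values)
    zeroes = [cat for cat, v in zip(categories, values) if v == 0]
    note = f"zero-scored: {', '.join(zeroes)}" if zeroes else "no zero-scored categories"
    print(f"{vendor}: {total}/{2 * len(categories)} ({note})")
```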

Decision guidance (example): Vendors that score low on validation, disclosure, provenance, or auditability should be restricted to exploratory use. Vendors that cannot meet Gate 0 should not be procured for decision support.

Quick tips
Use Gate 0 as a pre-screen; if disclosures are weak, stop there. During pilots, force repeatability and capture variance.