Scope. “Synthetic market research” sits at the intersection of (i) synthetic data generation, (ii) LLM-based “synthetic respondents” that can be prompted to answer surveys or participate in experiments, and (iii) higher-resolution agent and digital-twin-style simulations that can be run repeatedly under counterfactual conditions. The sources below are selected because they do at least one of the following: (a) provide foundational methods, (b) expose known failure modes (representation, bias, instability), (c) establish evaluation and benchmarking discipline, or (d) encode professional standards that guard against methodological over-claiming and unethical deployment.
How to use this list. Treat each paper as a “module” in a responsible synthetic research stack: you need both capability papers (how to generate and simulate) and constraint papers (how to measure, validate, and govern). Each entry covers what the paper is, its main takeaways, why it matters for synthetic market research, and what it unlocked.
1) LLMs as Synthetic Respondents and Population Simulators
Out of One, Many: Using Language Models to Simulate Human Samples (Argyle et al., 2022/2023)
Paper (PDF) | Paper (landing page)
- What it is. A core early paper arguing that language models can be studied as proxies for specific subpopulations under controlled conditioning, with an emphasis on comparing distributions rather than cherry-picking outputs.
- Main takeaways. (i) Treat the model as a noisy proxy for a distribution, not a single “answer”; (ii) conditioning and prompt design can move outputs toward subgroup-like tendencies; (iii) fidelity is highly uneven across topics and groups.
- Why it matters. It gave the field a research vocabulary for “LLMs as population simulators” and pushed practitioners toward distributional comparisons (not anecdotal plausibility).
- What it unlocked. A wave of work on benchmarking LLM outputs against real surveys and on formalising “whose opinions” LLMs reflect.
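To make the “distribution, not a single answer” point concrete, here is a minimal sketch of repeated-sampling elicitation. The `ask_model` stub, the persona string, and the answer scale are hypothetical placeholders for whatever model client and conditioning scheme you actually use.

```python
from collections import Counter

OPTIONS = ["Agree", "Neutral", "Disagree"]  # closed-ended answer scale

def ask_model(prompt: str) -> str:
    """Hypothetical stub: call your LLM with temperature > 0 and return one option."""
    raise NotImplementedError

def estimate_distribution(persona: str, question: str, n_samples: int = 200) -> dict:
    """Sample the same conditioned prompt repeatedly and return answer frequencies."""
    prompt = (
        f"You are answering a survey as the following person: {persona}\n"
        f"Question: {question}\n"
        f"Answer with exactly one of: {', '.join(OPTIONS)}."
    )
    counts = Counter(ask_model(prompt) for _ in range(n_samples))
    total = sum(counts.values())
    return {opt: counts.get(opt, 0) / total for opt in OPTIONS}
```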
Whose Opinions Do Language Models Reflect? (Santurkar et al., 2023)
Paper (PDF) | Paper (arXiv landing page)
- What it is. A quantitative framework and dataset (OpinionQA) to measure alignment between LLM response distributions and public opinion polls across demographic groups.
- Main takeaways. (i) Default model responses can be systematically misaligned with many groups; (ii) “steering” can help but does not fix representational gaps; (iii) apparent alignment can shift with fine-tuning regimes.
- Why it matters. It makes “representation” measurable and turns vague claims about bias into testable statements.
- What it unlocked. A practical template for evaluating synthetic panels: build a survey-backed dataset, measure divergence across groups, and track drift across model versions.
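A minimal sketch of that template, assuming you already have answer distributions per demographic group (for example, from the repeated-sampling estimator above). Jensen-Shannon distance stands in for whichever divergence you prefer; OpinionQA defines its own alignment metric, and the group labels and numbers below are placeholder data.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def alignment_gap(model_dist: dict, survey_dist: dict, options: list) -> float:
    """Jensen-Shannon distance between model and survey answer shares (0 = identical)."""
    p = np.array([model_dist.get(o, 0.0) for o in options])
    q = np.array([survey_dist.get(o, 0.0) for o in options])
    return float(jensenshannon(p, q, base=2))

options = ["Agree", "Neutral", "Disagree"]
survey = {"18-29": {"Agree": 0.5, "Neutral": 0.3, "Disagree": 0.2},
          "65+":   {"Agree": 0.2, "Neutral": 0.3, "Disagree": 0.5}}
model_v1 = {"18-29": {"Agree": 0.7, "Neutral": 0.2, "Disagree": 0.1},
            "65+":   {"Agree": 0.6, "Neutral": 0.2, "Disagree": 0.2}}

for group in survey:
    print(group, round(alignment_gap(model_v1[group], survey[group], options), 3))
# Re-running the same report for each model version gives a simple drift track.
```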
Towards Measuring the Representation of Subjective Global Opinions in Language Models (Durmus et al., 2023)
Paper (PDF) | Paper (landing page)
- What it is. Extends “whose opinions” evaluation beyond one country: builds GlobalOpinionQA from cross-national surveys and measures which countries’ opinions LLM responses resemble.
- Main takeaways. (i) Default outputs can over-represent certain geographies; (ii) prompting can shift perspective but may induce stereotyping; (iii) translation does not guarantee cultural alignment.
- Why it matters. Synthetic market research is inherently multinational for many brands; this paper shows why “global synthetic panels” must be validated per region and language.
- What it unlocked. A rigorous path for country-level calibration and evaluation rather than “one-model-fits-all” assumptions.
Questioning the Survey Responses of Large Language Models (Dominguez-Olmedo et al., 2024)
- What it is. A systematic critique and evaluation of LLMs as survey respondents, interrogating methodological pitfalls in extracting “survey-like” distributions from models.
- Main takeaways. (i) Response distributions can be highly sensitive to elicitation method; (ii) superficially plausible aggregate similarity can mask unstable or artefactual mechanisms; (iii) evaluation must include robustness checks (wording, order, format).
- Why it matters. This is a “cold shower” paper: it pushes synthetic research teams to treat survey simulation as an experimental method requiring controls, not a shortcut.
- What it unlocked. A stronger norm: report elicitation protocol details and run perturbation tests as standard practice.
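A sketch of what “perturbation tests as standard practice” can look like: re-elicit the same item under alternative wordings and option orders, then report the widest swing in any option's share. The `elicit` stub and paraphrase list are placeholders; the paper itself probes a broader set of elicitation choices.

```python
import itertools
import random

def elicit(wording: str, options: list) -> dict:
    """Hypothetical stub: return answer shares for one wording and option order
    (e.g., via the repeated-sampling estimator sketched earlier)."""
    raise NotImplementedError

def perturbation_check(paraphrases: list, options: list, n_orders: int = 4) -> float:
    """Re-elicit under wording and option-order perturbations; return the widest
    swing in any single option's share (0.0 = perfectly stable)."""
    all_orders = list(itertools.permutations(options))
    orders = random.sample(all_orders, k=min(n_orders, len(all_orders)))
    results = [elicit(w, list(order)) for w in paraphrases for order in orders]
    swings = [max(r.get(opt, 0.0) for r in results) - min(r.get(opt, 0.0) for r in results)
              for opt in options]
    return max(swings)
```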
Evaluating the Moral Beliefs Encoded in LLMs (Scherrer et al., 2023)
Paper (PDF) | Paper (arXiv landing page)
- What it is. A careful “survey administration” methodology for LLMs, including measures for uncertainty and consistency, applied to moral scenario questions.
- Main takeaways. (i) Many models show high sensitivity to wording in ambiguous cases; (ii) uncertainty is an observable property that should be measured, not ignored; (iii) model families can exhibit systematic preference patterns.
- Why it matters. Synthetic market research often asks normative questions (trust, fairness, discomfort, acceptability). This paper demonstrates why those results can be brittle without uncertainty-aware instrumentation.
- What it unlocked. Better practice for “LLM surveys”: quantify uncertainty/choice consistency rather than treating outputs as stable attitudes.
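Uncertainty can be instrumented with very little machinery: sample the same scenario repeatedly and report both the normalised entropy of the choices and how often samples agree with the modal answer. This is an illustrative measure, not the paper's exact estimator, and the sampled answers below are placeholder data.

```python
import math
from collections import Counter

def choice_entropy(answers: list, n_options: int) -> float:
    """Normalised Shannon entropy of sampled choices: 0 = fully consistent, 1 = uniform."""
    probs = [c / len(answers) for c in Counter(answers).values()]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(n_options) if n_options > 1 else 0.0

def modal_consistency(answers: list) -> float:
    """Share of samples that agree with the most common choice."""
    return Counter(answers).most_common(1)[0][1] / len(answers)

# Placeholder: 100 sampled answers to one ambiguous scenario
answers = ["refuse"] * 55 + ["comply"] * 45
print(choice_entropy(answers, n_options=2), modal_consistency(answers))
```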
Specializing Large Language Models to Simulate Survey Response Distributions (Cao et al., 2025)
- What it is. Moves beyond prompting by training/specialising LLMs to match survey response distributions (tested on country-level cultural survey results).
- Main takeaways. (i) Distribution-matching objectives can reduce divergence from real survey marginals; (ii) “first-token probability” style training can be used to fit multiple-choice distributions; (iii) specialisation provides a path toward repeatable, auditable survey simulation.
- Why it matters. It provides a credible bridge from “prompt art” to a reproducible modelling approach for synthetic panels.
- What it unlocked. A research direction where synthetic market research can become more like calibrated measurement, at least for certain question types and domains.
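A minimal sketch of the distribution-matching idea, assuming you can read the model's first-token logits for each answer option; this is not the paper's exact objective or training setup, only the shape of the loss.

```python
import torch
import torch.nn.functional as F

def distribution_matching_loss(option_logits: torch.Tensor,
                               target_shares: torch.Tensor) -> torch.Tensor:
    """KL divergence between real survey marginals and the model's answer distribution,
    where option_logits are first-token logits restricted to the option tokens."""
    log_model = F.log_softmax(option_logits, dim=-1)
    return F.kl_div(log_model, target_shares, reduction="batchmean")

# Placeholder batch: 2 questions, 4 answer options each
logits = torch.randn(2, 4, requires_grad=True)
targets = torch.tensor([[0.10, 0.20, 0.30, 0.40],
                        [0.25, 0.25, 0.25, 0.25]])
loss = distribution_matching_loss(logits, targets)
loss.backward()  # during fine-tuning, gradients would flow into the model itself
```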
Large Language Models as Simulated Economic Agents (Horton, 2023)
- What it is. A foundational argument and set of examples for treating LLMs as simulated agents in economic experiments and decision contexts.
- Main takeaways. (i) LLMs can be used to run rapid “agent-style” experiments; (ii) realism must be validated against known results; (iii) the core value is fast hypothesis exploration, not replacing all human experiments.
- Why it matters. Synthetic market research frequently borrows from experimental economics (choice, trade-offs, incentives). This paper legitimises that bridge while emphasising validation discipline.
- What it unlocked. A conceptual foundation for “synthetic experiments,” not just synthetic surveys.
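A sketch of the “synthetic experiment” pattern: hold everything fixed, vary one treatment (here a price), and tabulate simulated choices into a crude demand curve that can then be compared against known results. The product, prompt, and `ask_model` stub are hypothetical.

```python
def ask_model(prompt: str) -> str:
    """Hypothetical stub: returns 'buy' or 'skip' from your LLM of choice."""
    raise NotImplementedError

def price_experiment(personas: list, prices: list, n_reps: int = 20) -> dict:
    """Run the same purchase decision under counterfactual prices and tabulate uptake."""
    uptake = {}
    for price in prices:
        buys, total = 0, 0
        for persona in personas:
            for _ in range(n_reps):
                prompt = (f"You are: {persona}\n"
                          f"A reusable water bottle costs ${price}. "
                          f"Do you buy it? Answer 'buy' or 'skip'.")
                buys += ask_model(prompt).strip().lower() == "buy"
                total += 1
        uptake[price] = buys / total
    return uptake  # a crude synthetic demand curve to validate against known findings
```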
2) Digital-Twin and Agent-Based Simulation Papers (Human Behaviour as Systems)
Generative Agents: Interactive Simulacra of Human Behavior (Park et al., 2023)
Paper (PDF) | Paper (landing page)
- What it is. Introduces an architecture for LLM-driven agents with memory, reflection, and planning in an interactive environment.
- Main takeaways. (i) Memory and retrieval are structural requirements for coherent, time-extended behaviour; (ii) “believability” emerges from combining observation, planning, and reflection; (iii) multi-agent interaction can produce emergent social dynamics.
- Why it matters. If “digital twins” in market research are to be more than a prompt with demographics, they require architectures that handle time, context, and state. This paper is a blueprint.
- What it unlocked. A design pattern for longitudinal synthetic personas (diaries, evolving preferences, multi-touch journeys).
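A schematic of the memory-and-retrieval piece: score each stored memory on recency, importance, and relevance, and retrieve the top-k before the agent acts. The weights, decay, and lexical-overlap relevance below are simplifications of the paper's design (which uses embedding similarity), not its implementation.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Memory:
    text: str
    importance: float                      # e.g., 1-10, scored when the memory is written
    created_at: float = field(default_factory=time.time)

def relevance(query: str, memory: Memory) -> float:
    """Placeholder lexical overlap; the paper uses embedding similarity instead."""
    q, m = set(query.lower().split()), set(memory.text.lower().split())
    return len(q & m) / max(len(q), 1)

def retrieve(query: str, memories: list, k: int = 3, decay: float = 0.995) -> list:
    """Rank memories by a weighted mix of recency, importance, and relevance,
    in the spirit of the Generative Agents retrieval step."""
    now = time.time()
    def score(mem: Memory) -> float:
        hours_old = (now - mem.created_at) / 3600
        recency = decay ** hours_old
        return recency + mem.importance / 10 + relevance(query, mem)
    return sorted(memories, key=score, reverse=True)[:k]

# Usage: store observations as they happen, retrieve before the persona plans or answers
memories = [Memory("tried the new oat-milk latte, liked it", importance=4),
            Memory("complained about delivery delays last week", importance=7)]
print(retrieve("how do you feel about delivery?", memories, k=1)[0].text)
```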
Large Language Models for Agent-Based Modelling: Current and possible uses across the modelling cycle (Vanhée et al., 2025)
Paper (PDF) | Paper (landing page)
- What it is. A structured overview of where LLMs can slot into agent-based modelling (ABM): from problem formulation and rule design to calibration and analysis.
- Main takeaways. (i) LLMs can serve as agent “brains,” scenario generators, or explanation layers; (ii) ABM requires explicit assumptions and validation; (iii) LLM integration changes both capability and risk (hallucination, instability, hidden priors).
- Why it matters. Synthetic market research increasingly wants system-level simulation (word-of-mouth, adoption, churn cascades). ABM + LLMs is a plausible route, but only if governed as a modelling discipline.
- What it unlocked. A roadmap for turning “chatty personas” into model-based simulations with explicit lifecycle steps.
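A toy version of “LLM as agent brain” inside an ABM loop: a small word-of-mouth adoption model where each agent's decision function would normally be an LLM call. The network, personas, and the threshold rule standing in for `llm_decide` are all placeholders; the point is that the simulation loop, state, and outputs stay explicit and inspectable.

```python
import random

def llm_decide(persona: str, adopted_neighbours: int, n_neighbours: int) -> bool:
    """Hypothetical stub: ask an LLM whether this persona adopts, given social context.
    Replaced here by a simple threshold so the loop runs end to end."""
    return adopted_neighbours / max(n_neighbours, 1) > 0.5

def run_adoption_abm(n_agents: int = 50, n_steps: int = 10, seed_share: float = 0.1) -> list:
    random.seed(0)
    neighbours = {i: random.sample([j for j in range(n_agents) if j != i], 4)
                  for i in range(n_agents)}
    adopted = {i: random.random() < seed_share for i in range(n_agents)}
    history = [sum(adopted.values())]
    for _ in range(n_steps):
        nxt = dict(adopted)
        for i in range(n_agents):
            if not adopted[i]:
                k = sum(adopted[j] for j in neighbours[i])
                nxt[i] = llm_decide(f"agent {i}", k, len(neighbours[i]))
        adopted = nxt
        history.append(sum(adopted.values()))
    return history  # adoption counts per step; swap llm_decide for a real LLM call

print(run_adoption_abm())
```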
AI-enabled consumer digital twins as a platform for research aimed at enhancing customer experience (Hornik & Rachamim, 2025)
Paper (publisher landing page)
- What it is. A marketing/consumer research perspective paper on consumer digital twins (CDTs), positioning them as a research platform and proposing how they might reshape consumer and marketing research.
- Main takeaways. (i) Frames CDTs around repeatability and predictability; (ii) highlights potential for customer-experience research; (iii) emphasises the conceptual shift from episodic research to ongoing simulation.
- Why it matters. It represents the “disciplinary uptake” of digital twins into marketing academia, making the topic legible to the research community outside computer science.
- What it unlocked. A mainstream scholarly vocabulary for CDTs that synthetic market research teams can map onto operational practice.
3) Synthetic Data Generation Foundations (The Data Substrate for Synthetic Research)
Modeling Tabular Data using Conditional GAN (CTGAN) (Xu et al., 2019)
- What it is. A classic deep-learning approach for generating synthetic tabular data, addressing mixed discrete/continuous columns and imbalanced categories.
- Main takeaways. (i) Tabular synthesis needs specialised conditioning mechanisms; (ii) naive GANs struggle with mode structure and discrete columns; (iii) benchmarking across datasets is essential.
- Why it matters. Many synthetic market research stacks rely on tabular synthesis (panels, cohorts, microdata). CTGAN is a major baseline and a conceptual anchor for “tabular-first” thinking.
- What it unlocked. A practical template for synthetic panel generation where relational features (income, household, behaviour flags) matter as much as text.
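A minimal sketch, assuming the open-source `ctgan` package (`pip install ctgan`); check the documentation of your installed version. The toy panel and epoch count below are for illustration only.

```python
import pandas as pd
from ctgan import CTGAN  # assumes the open-source ctgan package

panel = pd.DataFrame({
    "age": [34, 51, 27, 45, 62, 29],
    "income_band": ["mid", "high", "low", "mid", "high", "low"],
    "bought_last_30d": [1, 0, 1, 1, 0, 0],
})
discrete_columns = ["income_band", "bought_last_30d"]

model = CTGAN(epochs=10)                 # tiny run for illustration only
model.fit(panel, discrete_columns)
synthetic_panel = model.sample(100)      # synthetic rows with the same schema
print(synthetic_panel.head())
```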
PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees (Jordon et al., 2019)
Paper (PDF) | Paper (OpenReview landing page)
- What it is. A privacy-preserving GAN framework using PATE-style teacher ensembles to provide differentially private training signals.
- Main takeaways. (i) “Synthetic” is not automatically private; privacy must be engineered; (ii) DP mechanisms can be integrated into generative training; (iii) privacy-utility trade-offs are unavoidable and must be measured.
- Why it matters. If synthetic market research is sold as privacy-preserving, DP-grounded approaches are part of the credible foundation.
- What it unlocked. A concrete, publishable route to privacy claims beyond hand-waving.
PrivBayes: Private Data Release via Bayesian Networks (Zhang et al., 2017)
- What it is. A major differential privacy method for releasing synthetic data via probabilistic modelling (Bayesian networks) with DP guarantees.
- Main takeaways. (i) Classical probabilistic structure can outperform some neural methods under privacy constraints; (ii) modelling dependencies explicitly can improve utility; (iii) DP data release is achievable for practical datasets with careful design.
- Why it matters. It remains a key reference for DP synthetic tabular data, useful when your synthetic panel is derived from sensitive customer microdata.
- What it unlocked. A durable “non-neural” path for privacy-preserving synthesis that is easier to reason about in audits.
The Synthetic Data Vault (SDV) (Patki et al., 2016)
Paper (PDF) | Related MIT thesis/record (landing page)
- What it is. A system-level vision for generating synthetic data for relational databases, pairing modelling with usability and repeatable generation.
- Main takeaways. (i) Real-world data is relational; (ii) synthesis needs to respect schema, constraints, and dependencies; (iii) tooling matters as much as models if the method is to be adopted responsibly.
- Why it matters. Synthetic market research is not only text: it is structured data + constraints + traceability. SDV’s worldview matches production reality.
- What it unlocked. A practical ecosystem (and later open-source tooling) that made tabular/relational synthesis accessible to practitioners.
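A minimal single-table sketch, assuming SDV's 1.x API (the library also offers multi-table synthesizers for relational schemas); verify method names against the documentation for your installed version. The `panel` DataFrame is a placeholder.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata            # assumes SDV 1.x
from sdv.single_table import GaussianCopulaSynthesizer

panel = pd.DataFrame({
    "age": [34, 51, 27, 45],
    "region": ["north", "south", "south", "west"],
    "monthly_spend": [42.0, 77.5, 12.9, 55.0],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=panel)      # schema inference: types, categories

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(panel)
synthetic = synthesizer.sample(num_rows=200)    # synthetic rows respecting the schema
print(synthetic.head())
```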
4) Evaluation and Benchmarking: How to Know Your “Synthetic Panel” Is Not Self-Deception
Synthcity: a benchmark framework for diverse use cases of tabular synthetic data (Qian et al., 2023)
- What it is. A benchmarking framework and library that evaluates synthetic tabular data across fidelity, utility, privacy, and other dimensions, spanning multiple modalities/use cases.
- Main takeaways. (i) Synthetic data quality is multi-dimensional; (ii) “good” depends on the downstream task; (iii) benchmarking must be reproducible and threat-model aware.
- Why it matters. Synthetic market research often devolves into “it looks plausible.” Synthcity is the antidote: measure, compare, and report.
- What it unlocked. A lingua franca of metrics and baselines that makes vendor and internal claims testable.
SDNist: Benchmark Data and Evaluation Tools for Synthetic Data Generators on Structured Data (NIST/AAAI PPAI, 2022)
- What it is. A benchmark dataset and evaluation tooling designed to standardise comparisons of synthetic data generators, motivated by NIST challenge work.
- Main takeaways. (i) Evaluation should be packaged and repeatable; (ii) multiple metrics are necessary; (iii) benchmarking should lower the barrier for rigorous comparison.
- Why it matters. If synthetic market research is to have standards, “standard evaluation harnesses” are an essential building block.
- What it unlocked. Practical, shareable evaluation primitives that teams can adopt rather than inventing bespoke scorecards.
Train on Synthetic, Test on Real (TSTR) as an evaluation paradigm (Esteban et al., 2017)
- What it is. A widely cited evaluation idea: train a model on synthetic data and test on a real holdout (and/or the inverse), used to quantify downstream utility.
- Main takeaways. (i) Utility is best measured by downstream performance; (ii) synthetic data can be “statistically similar” yet useless for prediction; (iii) TSTR provides a task-grounded reality check.
- Why it matters. For synthetic market research, the analogue is: do synthetic insights predict real outcomes (purchase, churn, lift) when you validate?
- What it unlocked. A pragmatic norm: synthetic outputs must be judged by real-world predictive alignment, not aesthetics.
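A minimal TSTR sketch with scikit-learn on placeholder data: fit a classifier on synthetic rows, score it on a real holdout, and compare against the train-on-real baseline. A large gap means the synthetic data has lost the signal the downstream task needs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc(X_train, y_train, X_test, y_test) -> float:
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

rng = np.random.default_rng(0)
X_real = rng.normal(size=(500, 5))
y_real = (X_real[:, 0] + rng.normal(size=500) > 0).astype(int)   # placeholder "real" data
X_syn = rng.normal(size=(500, 5))
y_syn = (X_syn[:, 0] > 0).astype(int)                            # stand-in synthetic data
train, test = slice(0, 250), slice(250, 500)

print("TSTR (train synthetic, test real):",
      round(auc(X_syn, y_syn, X_real[test], y_real[test]), 3))
print("TRTR (train real, test real):",
      round(auc(X_real[train], y_real[train], X_real[test], y_real[test]), 3))
```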
5) Surveys That Map the Field (Useful for “What am I missing?”)
Comprehensive Exploration of Synthetic Data Generation: A Survey (Bauer et al., 2024)
Paper (landing page; includes “View PDF”)
- What it is. A broad survey attempting to systematically catalogue synthetic data generation models and trends across categories.
- Main takeaways. (i) The space is fragmented; (ii) evaluation is inconsistent across subfields; (iii) privacy-preserving synthesis remains a distinct, harder track than “high-fidelity” synthesis.
- Why it matters. Synthetic market research touches many subfields; a broad map prevents teams from reinventing partial solutions or missing evaluation norms from adjacent domains.
- What it unlocked. A “field map” useful for standards work (taxonomy, common metrics, recurring gaps).
A Comprehensive Survey of Synthetic Tabular Data Generation (Shi et al., 2025)
- What it is. A focused survey on tabular synthetic data methods, which is directly relevant to synthetic panels, cohorts, and customer microdata synthesis.
- Main takeaways. (i) Tabular synthesis is its own technical regime; (ii) method classes vary by constraints (privacy, imbalance, missingness); (iii) evaluation must reflect both fidelity and downstream utility.
- Why it matters. Market research data is often tabular at the core (profiles, purchases, behaviours). This paper helps teams choose methods and evaluation strategies intentionally.
- What it unlocked. A more systematic approach to “panel realism” beyond demographic marginals.
Synthetic Data Generation Using Large Language Models: Advances in Text and Code (Nadas et al., 2025)
- What it is. A survey of LLM-driven synthetic data generation techniques, with emphasis on workflows, quality pitfalls, and emerging issues like model collapse.
- Main takeaways. (i) LLMs can generate large volumes of synthetic text quickly; (ii) quality control and filtering are central; (iii) feedback loops (training on synthetic) can degrade models if unmanaged.
- Why it matters. Synthetic market research frequently relies on synthetic text: open-ends, persona narratives, diaries. This paper provides the “data pipeline” lens for quality and failure modes.
- What it unlocked. A more industrial view of LLM-synthesis: generation + filtering + evaluation as a reproducible pipeline.
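A skeletal version of the generation → filtering → evaluation pipeline for synthetic open-ends. The `generate_open_end` stub is hypothetical, and the length and exact-duplicate filters are deliberately crude stand-ins for the quality-control stage the survey discusses.

```python
def generate_open_end(persona: str, question: str) -> str:
    """Hypothetical stub: one synthetic open-ended answer from your LLM."""
    raise NotImplementedError

def normalise(text: str) -> str:
    return " ".join(text.lower().split())

def synth_open_ends(personas: list, question: str, min_words: int = 5):
    """Generate, then filter: drop too-short answers and exact duplicates,
    and report how much survives each stage."""
    raw = [generate_open_end(p, question) for p in personas]
    long_enough = [t for t in raw if len(t.split()) >= min_words]
    seen, deduped = set(), []
    for t in long_enough:
        key = normalise(t)
        if key not in seen:
            seen.add(key)
            deduped.append(t)
    report = {"generated": len(raw),
              "after_length_filter": len(long_enough),
              "after_dedup": len(deduped)}
    return deduped, report
```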
6) Standards and “Sources of Discipline” (Because capability without governance is how industries get burned)
ICC/ESOMAR International Code on Market, Opinion and Social Research and Data Analytics (2017 revision)
- What it is. A global professional code that sets baseline expectations for ethical conduct, transparency, and responsibilities in research and data analytics.
- Main takeaways. (i) Don’t misrepresent methods; (ii) respect participants and data; (iii) maintain professional integrity and confidentiality.
- Why it matters. Synthetic market research should inherit these norms explicitly, especially around disclosure and avoidance of deceptive reporting.
- What it unlocked. A shared ethical “floor” that can be extended into synthetic-specific standards (validation, auditability, model governance).
ESOMAR “20 Questions” to Help Buyers of AI-based Services for Market Research and Insights
- What it is. A structured procurement and due-diligence checklist focused on AI-based research services, aimed at forcing clarity on data governance, oversight, and ethical commitments.
- Main takeaways. (i) Buyers should demand explainability about data, models, and oversight; (ii) “ethics statements” must be tied to operational controls; (iii) governance and QA are part of product quality.
- Why it matters. Synthetic market research is becoming vendorised. This checklist is a practical lever to prevent “black-box synthetic insights” from entering enterprises unvetted.
- What it unlocked. A concrete purchasing discipline that can evolve into formal standards.
MRS Code of Conduct (2023)
- What it is. A major national professional code (UK) that operationalises ethical standards for research practice.
- Main takeaways. (i) Integrity, transparency, and duty of care; (ii) clear responsibilities in handling data and reporting; (iii) professional accountability.
- Why it matters. If synthetic market research is to be trusted, it needs to be legible within existing research ethics regimes, not treated as an exemption.
- What it unlocked. A compliance-ready reference point for building internal SOPs (disclosure templates, audit trails, complaint handling).
Closing Notes: What “Top Papers” Tell You About Building a Responsible Synthetic Research Stack
- Evaluation is the centre of gravity. The strongest trend across this literature is a move from “plausible outputs” to measurable distributional alignment, robustness testing, and downstream utility evaluation.
- Representation is an empirical question. Papers like OpinionQA and GlobalOpinionQA show that “whose perspectives” are reflected cannot be assumed; it must be measured per topic, region, and model version.
- Digital-twin rhetoric demands architecture. Agent papers highlight that longitudinal coherence requires memory, state, and explicit modelling choices; otherwise “twins” are merely chat prompts with demographics.
- Standards are not optional. The more synthetic methods influence real decisions, the more they must inherit the discipline of market research ethics (disclosure, integrity) and the discipline of model risk management (validation, auditability).
- In short: pair capability papers (generation/simulation) with constraint papers (evaluation/standards).
- Representation is measurable: use survey-aligned datasets to check alignment and drift.
- Digital twins need architecture (memory, state, validation), not just demographics in a prompt.
- Benchmarking and TSTR-style tests keep “plausible” outputs honest.
- Professional codes (ICC/ESOMAR, MRS) are the ethical floor; extend them for synthetic systems.