Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference

This repository contains the implementation for the NeurIPS 2025 paper: "Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference".

Overview

This repository provides:

Proposed metrics: Novel evaluation metrics for assessing synthetic data quality for causal inference tasks (see src/metrics.py)
Generation methods: STEAM, our proposed synthetic data generation approach (see src/generation.py)
Experimental reproducibility: Code to reproduce all experimental results from the paper (see exps/ folder)

Repository structure

steam/
├── src/                          # Core implementation
│   ├── metrics.py                # Proposed evaluation metrics
│   ├── generation.py             # Generic and STEAM synthetic data generation methods
│   └── catenets_dp/              # Differentially private PO estimators
├── exps/                         # Experimental scripts
├── ...
│   └── cgms/                     # CGMs experiments
|   └── ...
├── data/                         # Data loading and preprocessing
├──results/                       # Results from experiments
├── requirements.txt              # Generic/STEAM generation dependencies
├── requirements_dcm.txt          # Causal generation dependencies
└── setup.py                      # Package installation

Installation

Setup for generic comparison experiments

Create and activate a conda environment:

conda create -n steam python=3.10
conda activate steam

Install dependencies:

pip install -r requirements.txt
pip install .

Setup for CGM generation

For the CGM-specific experiments, create a separate environment, using the dedicated requirements file:

conda create -n steam_cgm python=3.10
conda activate steam_cgm
pip install -r requirements_cgm.txt
pip install .

Data preparation

Before running experiments, prepare the data:

conda activate steam
cd data
python load_data.py
cd ..

Running experiments

Metric evaluation and generic generative model comparisons

Navigate to the experiments folder and run the desired experiment script:

cd exps
python metric_exp.py       # Existing and our metric analysis
python real_generic_exp.py       # Real data generic v STEAM
python simulated_generic_exp.py  # Simulated data generic v STEAM
python hyperparameter_exp.py       # Stability analysis
python ablation_exp.py                       # Joint generation of X,W ablation study
python classifier_ablation_exp.py     # Q_W|X classifier ablation

By default, the experiments will run for all model/data/iterations combinations from the paper. Specific settings can be made with the CLI.

CGM experiments

For CGM-specific experiments, generate the data with:

conda activate steam_cgm
cd exps/cgms
python causal_generation.py
cd ..

By default, the experiments will run for all model/data/iterations combinations from the paper. Specific settings can be made with the CLI.

After generating the CGM data, compare with real data via our metrics:

conda activate steam
python get_cgm_results.py

Citation

@inproceedings{
  amad2025improving,
  title={Improving the generation and evaluation of synthetic data for downstream medical causal inference},
  author={Harry Amad and Zhaozhi Qian and Dennis Frauen and Julianna Piskorz and Stefan Feuerriegel and Mihaela van der Schaar},
  booktitle={The 39th Conference on Neural Information Processing Systems},
  year={2025}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference

Overview

Repository structure

Installation

Setup for generic comparison experiments

Setup for CGM generation

Data preparation

Running experiments

Metric evaluation and generic generative model comparisons

CGM experiments

Citation

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data		data
exps		exps
src		src
README.md		README.md
requirements.txt		requirements.txt
requirements_cgm.txt		requirements_cgm.txt
setup.py		setup.py

harrya32/STEAM

Folders and files

Latest commit

History

Repository files navigation

Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference

Overview

Repository structure

Installation

Setup for generic comparison experiments

Setup for CGM generation

Data preparation

Running experiments

Metric evaluation and generic generative model comparisons

CGM experiments

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages