causal-learn Tutorial 15: Benchmarking, Stability, And Sensitivity
A single causal discovery run is rarely enough. Discovery algorithms make assumptions about independence tests, scores, linearity, noise, hidden variables, sample size, and tuning choices. If a learned graph changes completely when one of those choices changes, the graph should be treated as a fragile hypothesis rather than a reliable structure.
This notebook builds a compact benchmarking workflow for causal-learn. We will simulate data from a known observed DAG, run several algorithm families, and then stress-test the learned graphs across sample size, noise, tuning settings, nonlinearity, and hidden confounding.
The goal is not to crown one universal best algorithm. The goal is to learn how to report causal discovery results honestly: what was stable, what changed, and which assumptions each method needed.
Estimated runtime: about 2-5 minutes. Most cells are fast; the repeated benchmark grids do several PC, GES, and DirectLiNGAM fits.
Learning Goals
By the end of this notebook, you should be able to:
design a small benchmark with known graph truth;
compare constraint-based, score-based, and functional causal discovery methods;
separate skeleton recovery from edge direction recovery;
run sample-size, noise, tuning, and seed stability checks;
recognize when hidden confounding or nonlinear structure breaks an observed-DAG benchmark;
produce compact benchmark tables and figures suitable for a causal discovery report.
Notebook Flow
We will work in this order:
Set up imports, outputs, algorithms, and graph helpers.
Define a reusable six-variable teaching DAG.
Simulate baseline data and draw the true graph.
Run PC, GES, and DirectLiNGAM on the same baseline data.
Compare skeleton and direction metrics.
Run sample-size and seed stability checks.
Run noise and tuning sensitivity checks.
Run stress scenarios: Gaussian noise, nonlinear effects, and hidden confounding.
Run a targeted FCI check under hidden confounding.
Save reporting guidance and an artifact manifest.
Benchmarking Philosophy
Causal discovery benchmarking has a subtle trap: a method can look excellent on data generated exactly under its own assumptions and much weaker elsewhere. That does not make the method bad. It means the benchmark must state which assumptions were actually tested.
This notebook uses three representative methods:
PC: constraint-based discovery using conditional independence tests;
GES: score-based search using a decomposable graph score;
DirectLiNGAM: functional causal discovery for linear non-Gaussian models.
We evaluate two graph layers separately. Skeleton metrics ask whether the right variables are adjacent, ignoring arrow direction. Directed metrics ask whether arrows point the same way as the true DAG. This separation matters because PC and GES often return partially directed equivalence-class graphs, while LiNGAM returns a directed graph.
Setup
This cell imports the scientific stack, causal-learn algorithms, and plotting utilities. It also applies the same local BIC-score compatibility wrapper used in earlier notebooks, which keeps GES working cleanly in this Python and NumPy environment without editing the installed package on disk.
The exact library versions are recorded with the saved outputs. This matters because graph search and numerical thresholds can shift slightly across library releases.
GES BIC Compatibility Wrapper
This wrapper keeps the GES BIC score numerically safe when recent NumPy versions return one-element matrix objects. The score formula is unchanged; we only convert a one-element result into a Python scalar.
def local_score_BIC_from_cov(Data, i, PAi, parameters=None):
    """Safe local BIC score used by GES in this notebook."""
    cov, n = Data
    lambda_value = 1.0 if parameters is None or parameters.get("lambda_value") is None else parameters.get("lambda_value")
    parent_indices = list(PAi)
    if len(parent_indices) == 0:
        residual_variance = cov[i, i]
    else:
        yX = cov[np.ix_([i], parent_indices)]
        XX = cov[np.ix_(parent_indices, parent_indices)]
        beta = np.linalg.solve(XX, yX.T)
        residual_variance = cov[i, i] - float((yX @ beta).item())
    residual_variance = max(float(residual_variance), 1e-12)
    return float(-(n / 2) * np.log(residual_variance) - (len(parent_indices) + 1) * lambda_value * np.log(n) / 2)

ges_module.local_score_BIC_from_cov = local_score_BIC_from_cov
print("GES BIC score wrapper is active.")
GES BIC score wrapper is active.
The wrapper is active for this notebook session only. It does not modify the installed package files.
Define The Benchmark DAG
The benchmark graph has six variables and six directed edges. It is small enough to run many times, but rich enough to include a collider, a chain, and a downstream variable with two parents.
This truth table is used by all benchmark metrics. The same true graph is intentionally used across stress scenarios so we can see when violated assumptions create extra or missing observed relationships.
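To make the target concrete, here is a minimal sketch of how such a truth table can be encoded. The variable names and the specific edge list below are illustrative stand-ins for the notebook's actual definitions, chosen to contain a chain, a collider, and a two-parent downstream node as described above.

# Hypothetical encoding of the six-variable benchmark DAG.
import numpy as np

VARIABLES = ["X1", "X2", "X3", "X4", "X5", "X6"]
# Six directed edges: a chain X1 -> X2 -> X3, a collider at X5 (X3 -> X5 <- X4),
# and X5 as the two-parent downstream variable feeding X6.
TRUE_EDGES = [
    ("X1", "X2"), ("X2", "X3"), ("X2", "X4"),
    ("X3", "X5"), ("X4", "X5"), ("X5", "X6"),
]

# Adjacency-matrix form, convenient for matrix-valued comparisons.
true_adjacency = np.zeros((len(VARIABLES), len(VARIABLES)), dtype=int)
for source, target in TRUE_EDGES:
    true_adjacency[VARIABLES.index(source), VARIABLES.index(target)] = 1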
Simulate Data From The DAG
The simulator can generate several regimes from the same graph:
linear Laplace noise for a baseline that favors DirectLiNGAM;
linear Gaussian noise for Markov-equivalence-focused methods;
nonlinear transformations for a simple functional misspecification check;
hidden common causes that violate causal sufficiency.
The hidden-confounding regime should be read carefully: once a hidden common cause exists, the observed DAG is no longer a complete causal graph.
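As a reference point, here is a minimal simulator sketch covering the four regimes. The function name simulate_dag_data, the coefficients, and the choice of X3 and X4 as the confounded pair are all assumptions of this sketch, not the notebook's exact generator.

import numpy as np
import pandas as pd

def simulate_dag_data(n_samples, regime="linear_laplace", noise_scale=1.0, seed=0):
    """Hypothetical simulator: sample the fixed benchmark DAG in topological order."""
    rng = np.random.default_rng(seed)

    def noise():
        if "gaussian" in regime:
            return rng.normal(0.0, noise_scale, n_samples)
        return rng.laplace(0.0, noise_scale, n_samples)

    link = np.tanh if regime == "nonlinear_laplace" else (lambda x: x)
    # Unobserved common cause of X3 and X4; violates causal sufficiency and
    # can induce a spurious X3-X4 adjacency in observed-DAG methods.
    hidden = rng.laplace(0.0, 1.0, n_samples) if regime == "hidden_confounding" else 0.0
    x1 = noise()
    x2 = link(0.9 * x1) + noise()
    x3 = link(0.8 * x2) + noise() + hidden
    x4 = link(0.7 * x2) + noise() + hidden
    x5 = link(0.6 * x3 + 0.6 * x4) + noise()
    x6 = link(0.8 * x5) + noise()
    return pd.DataFrame({"X1": x1, "X2": x2, "X3": x3, "X4": x4, "X5": x5, "X6": x6})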
The correlation matrix shows strong relationships along the graph paths, but correlation alone cannot decide which adjacencies are direct or which way arrows point.
Draw The True DAG
The figure below is the benchmark target. The same box-and-arrow style is used across the tutorial series to keep graph outputs readable.
Benchmark Metrics
The metric functions make the benchmark transparent. A partially directed PC or GES graph can still score well on skeleton recovery while receiving lower directed recall.
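Here is a minimal sketch of the two metric layers, assuming edges arrive as (source, target) tuples and that undirected edges have already been separated out; the helper name edge_metrics is this sketch's own convention.

def edge_metrics(learned_edges, true_edges):
    """Hypothetical helper: skeleton F1 ignores orientation, directed F1 does not."""
    def f1(pred, truth):
        true_positives = len(pred & truth)
        precision = true_positives / len(pred) if pred else 0.0
        recall = true_positives / len(truth) if truth else 0.0
        return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    directed_pred = set(learned_edges)
    directed_true = set(true_edges)
    # The skeleton layer forgets direction by treating each edge as an unordered pair.
    skeleton_pred = {frozenset(edge) for edge in directed_pred}
    skeleton_true = {frozenset(edge) for edge in directed_true}
    return {"skeleton_f1": f1(skeleton_pred, skeleton_true),
            "directed_f1": f1(directed_pred, directed_true)}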
Algorithm Runners
This cell wraps PC, GES, FCI, and DirectLiNGAM so later benchmark grids can call them consistently. We keep the defaults conservative and expose the main tuning knobs.
These wrappers are intentionally small. They avoid hidden defaults in the benchmark loops and make each algorithm’s tuning parameters visible.
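For orientation, here is a sketch of what such wrappers can look like using causal-learn's public entry points. The dictionary return format and the "seconds" field are conventions of this sketch; the notebook's actual wrappers may differ in detail.

import time
import numpy as np
from causallearn.search.ConstraintBased.PC import pc
from causallearn.search.ScoreBased.GES import ges
from causallearn.search.FCMBased import lingam

def run_pc(data, alpha=0.05):
    start = time.perf_counter()
    cg = pc(np.asarray(data), alpha=alpha, show_progress=False)
    return {"run": "PC", "graph": cg.G, "seconds": time.perf_counter() - start}

def run_ges(data, lambda_value=1.0):
    start = time.perf_counter()
    record = ges(np.asarray(data), score_func="local_score_BIC",
                 parameters={"lambda_value": lambda_value})
    return {"run": "GES", "graph": record["G"], "seconds": time.perf_counter() - start}

def run_direct_lingam(data):
    start = time.perf_counter()
    model = lingam.DirectLiNGAM()
    model.fit(np.asarray(data))
    return {"run": "DirectLiNGAM", "weights": model.adjacency_matrix_,
            "seconds": time.perf_counter() - start}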
Baseline Algorithm Comparison
Now we run PC, GES, and DirectLiNGAM on the same baseline data. This is the cleanest comparison because the data come from a linear, acyclic, causally sufficient model with non-Gaussian noise.
The baseline is deliberately friendly to all three methods. PC and GES usually recover the correct skeleton and many directions; DirectLiNGAM can recover all directions because the data match its linear non-Gaussian assumptions.
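Assuming the simulator, runner, and metric sketches above, the baseline comparison reduces to a short loop. Here extract_directed_edges is a hypothetical helper that reads the oriented edges out of each method's output object.

baseline_data = simulate_dag_data(n_samples=2000, regime="linear_laplace", seed=0)
baseline_rows = []
for result in (run_pc(baseline_data), run_ges(baseline_data), run_direct_lingam(baseline_data)):
    learned = extract_directed_edges(result)  # hypothetical edge-extraction helper
    baseline_rows.append({"run": result["run"], **edge_metrics(learned, set(TRUE_EDGES))})
baseline_metrics = pd.DataFrame(baseline_rows)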
Plot Baseline Metrics
A compact metric plot lets us see the skeleton-versus-direction distinction immediately.
The plot shows why benchmark reports should not collapse everything into one number. Recovering adjacencies and orienting arrows are related but different tasks.
Draw Baseline Learned Graphs
The next cell saves one graph image per method. Undirected or partially oriented edges are drawn as plain lines so they are not mistaken for confirmed arrow directions.
for method_name in baseline_edges["run"].unique():
    method_edges = baseline_edges[baseline_edges["run"] == method_name].copy()
    draw_box_graph(
        method_edges,
        f"Baseline Learned Graph: {method_name}",
        FIGURE_DIR / f"{NOTEBOOK_PREFIX}_baseline_{method_name.lower()}_graph.png",
        note="Directed arrows count toward direction metrics; plain lines count only toward skeleton metrics.",
    )
These plots are useful for qualitative review. The metric table tells us what changed; the graph drawings make the change easier to inspect.
Sample-Size And Seed Stability
A stable discovery method should not depend on one lucky sample. This grid varies sample size and random seed, then reruns all three benchmark methods.
The summary shows how recovery improves as the sample grows. Small samples can preserve the rough skeleton while making directions less reliable.
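The grid itself can be as simple as a nested product over sizes and seeds. The specific grid values below are illustrative, and the helpers come from the earlier sketches.

import itertools

SAMPLE_SIZES = [200, 500, 1000, 2000, 5000]  # illustrative grid
SEEDS = [0, 1, 2, 3, 4]

grid_rows = []
for n_samples, seed in itertools.product(SAMPLE_SIZES, SEEDS):
    data = simulate_dag_data(n_samples=n_samples, regime="linear_laplace", seed=seed)
    for result in (run_pc(data), run_ges(data), run_direct_lingam(data)):
        scores = edge_metrics(extract_directed_edges(result), set(TRUE_EDGES))
        grid_rows.append({"method": result["run"], "n_samples": n_samples, "seed": seed, **scores})

sample_seed_summary = (
    pd.DataFrame(grid_rows)
    .groupby(["method", "n_samples"], as_index=False)
    .agg(mean_skeleton_f1=("skeleton_f1", "mean"), mean_directed_f1=("directed_f1", "mean"))
)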
Plot Sample-Size Stability
The line plot uses the mean F1 scores across seeds for each sample size.
sample_plot = sample_seed_summary.melt(
    id_vars=["method", "n_samples"],
    value_vars=["mean_skeleton_f1", "mean_directed_f1"],
    var_name="metric",
    value_name="score",
)
fig, ax = plt.subplots(figsize=(11, 5))
sns.lineplot(data=sample_plot, x="n_samples", y="score", hue="method", style="metric", markers=True, dashes=False, ax=ax)
ax.set_title("Sample-Size Stability Across Methods")
ax.set_xlabel("Number Of Samples")
ax.set_ylabel("Mean F1 Across Seeds")
ax.set_ylim(0, 1.05)
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_sample_size_stability.png", dpi=160, bbox_inches="tight")
plt.show()
The plot makes the practical sample-size story visible. If the directed score is unstable at small n, a report should avoid strong directional claims from that region.
Noise Sensitivity
Next we hold sample size fixed and vary the measurement noise scale. Higher noise weakens the signal-to-noise ratio and can make direct adjacencies harder to recover.
Noise sensitivity is a useful robustness check because real datasets rarely have the clean signal strength used in synthetic demos.
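Here is a sketch of the scan, with the simulator sketch's noise_scale knob standing in for the notebook's measurement-noise setting; the scale values are illustrative.

noise_rows = []
for noise_scale in [0.5, 1.0, 1.5, 2.0, 3.0]:
    data = simulate_dag_data(n_samples=2000, regime="linear_laplace", noise_scale=noise_scale, seed=0)
    for result in (run_pc(data), run_ges(data), run_direct_lingam(data)):
        scores = edge_metrics(extract_directed_edges(result), set(TRUE_EDGES))
        noise_rows.append({"method": result["run"], "noise_scale": noise_scale, **scores})
noise_sensitivity = pd.DataFrame(noise_rows)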
Plot Noise Sensitivity
The next figure tracks directed F1 as noise increases. Skeleton scores are saved in the table above; the directed score is usually the more fragile layer.
fig, ax = plt.subplots(figsize=(10, 5))
sns.lineplot(data=noise_sensitivity, x="noise_scale", y="directed_f1", hue="method", marker="o", ax=ax)
ax.set_title("Directed F1 Under Increasing Noise")
ax.set_xlabel("Noise Scale")
ax.set_ylabel("Directed F1")
ax.set_ylim(0, 1.05)
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_noise_sensitivity.png", dpi=160, bbox_inches="tight")
plt.show()
If a method’s directed score drops sharply as noise increases, its arrow claims should be reported with caution in low-signal applications.
Tuning Sensitivity: PC Alpha
PC depends on a conditional-independence significance level. Larger alpha values tend to remove fewer edges, while smaller values tend to remove more edges. The right value is data-dependent, so it should be scanned rather than hidden.
The PC alpha scan shows whether the learned graph is stable across reasonable independence-test thresholds.
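A minimal version of the scan under the earlier sketches' assumptions; the alpha grid is illustrative.

alpha_rows = []
for alpha in [0.001, 0.01, 0.05, 0.1]:
    result = run_pc(baseline_data, alpha=alpha)
    scores = edge_metrics(extract_directed_edges(result), set(TRUE_EDGES))
    alpha_rows.append({"alpha": alpha, **scores})
pc_alpha_scan = pd.DataFrame(alpha_rows)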
Tuning Sensitivity: GES Penalty
GES depends on a score penalty: larger penalties favor sparser graphs, while smaller penalties can admit extra edges. We scan the penalty multiplier (the lambda_value read by the local BIC score above).
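A matching sketch for the penalty scan, reusing the lambda_value parameter of the BIC wrapper; the multiplier grid is illustrative.

lambda_rows = []
for lambda_value in [0.5, 1.0, 2.0, 4.0]:
    result = run_ges(baseline_data, lambda_value=lambda_value)
    scores = edge_metrics(extract_directed_edges(result), set(TRUE_EDGES))
    lambda_rows.append({"lambda_value": lambda_value, **scores})
ges_penalty_scan = pd.DataFrame(lambda_rows)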
Tuning Sensitivity: DirectLiNGAM Threshold
DirectLiNGAM returns a weighted adjacency matrix, so reporting a graph requires a coefficient threshold. This cell fits DirectLiNGAM once and scans several thresholds.
The threshold scan is a reminder that weighted-output algorithms still require a reporting rule. A low threshold can over-report weak edges; a high threshold can drop true but weaker effects.
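A sketch of the threshold scan; the threshold grid is illustrative, and the adjacency_matrix_ convention (row i, column j holds the estimated effect of variable j on variable i) follows the LiNGAM implementation bundled with causal-learn.

model = lingam.DirectLiNGAM()
model.fit(np.asarray(baseline_data))
weights = model.adjacency_matrix_

threshold_rows = []
for threshold in [0.05, 0.1, 0.2, 0.4]:
    edges = {
        (VARIABLES[j], VARIABLES[i])
        for i in range(len(VARIABLES))
        for j in range(len(VARIABLES))
        if abs(weights[i, j]) > threshold
    }
    threshold_rows.append({"threshold": threshold, **edge_metrics(edges, set(TRUE_EDGES))})
lingam_threshold_scan = pd.DataFrame(threshold_rows)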
Plot Tuning Sensitivity
This plot puts the main tuning scans on one page. Each panel uses the method-specific tuning parameter, so read the x-axis labels separately.
Stable regions are more reassuring than isolated peaks. In an applied report, a graph chosen from a narrow peak should be described as sensitive to tuning.
Assumption Stress Scenarios
The baseline was friendly. Now we deliberately perturb the assumptions:
linear_laplace: baseline linear non-Gaussian data;
linear_gaussian: directions are harder for LiNGAM because non-Gaussianity is removed;
nonlinear_laplace: linear algorithms are misspecified;
hidden_confounding: causal sufficiency is violated by an unobserved common cause.
The hidden-confounding scores are not a fair observed-DAG target; they are included to show how observed-DAG methods can become unstable when a key assumption fails.
The stress table shows which failures are algorithm-specific and which are assumption-level. Gaussian noise mainly weakens DirectLiNGAM direction recovery; hidden confounding can create extra adjacencies or misleading directions for observed-DAG methods.
Plot Assumption Stress Results
The next figure compares directed F1 across the stress scenarios. Skeleton details are saved in the table above.
fig, ax = plt.subplots(figsize=(12, 5))
sns.barplot(data=stress_metrics, x="scenario", y="directed_f1", hue="method", ax=ax)
ax.set_title("Directed F1 Across Assumption Stress Scenarios")
ax.set_xlabel("Scenario")
ax.set_ylabel("Directed F1")
ax.set_ylim(0, 1.05)
ax.tick_params(axis="x", rotation=20)
ax.legend(title="Method")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_assumption_stress_directed_f1.png", dpi=160, bbox_inches="tight")
plt.show()
The plot turns the warning into something visible. A method that performs beautifully under one data-generating regime can become much less reliable under another.
Targeted Hidden-Confounding Check With FCI
FCI is designed for settings where hidden common causes may exist. This cell runs FCI only on the hidden-confounding scenario. The goal is not to score FCI against the same observed DAG, because the target graph type is different. The goal is to inspect whether the output marks possible hidden structure.
FCI uses endpoint marks that should be read differently from a DAG. Bidirected or partially marked edges are signs that hidden structure may be involved, not ordinary directed arrows from the observed DAG benchmark.
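A minimal FCI call under the simulator sketch's hidden-confounding regime; the point is to read the endpoint marks, not to score against the observed DAG.

from causallearn.search.ConstraintBased.FCI import fci

confounded_data = simulate_dag_data(n_samples=2000, regime="hidden_confounding", seed=0)
pag, fci_edges = fci(np.asarray(confounded_data), alpha=0.05, show_progress=False)
for edge in fci_edges:
    # Endpoint marks: '-' tail, '>' arrowhead, 'o' undetermined; 'o' marks and
    # bidirected edges are where possible hidden structure shows up.
    print(edge)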
Edge Stability Signatures
Stability can also be summarized by graph signatures. The next cell counts how many unique graphs each method produced across the sample-size and seed grid.
Unique graph counts help distinguish stable metrics from stable graphs. Two graphs can have the same F1 score but disagree about which specific edge changed.
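A sketch of the signature idea: canonicalize each learned edge set into a string, then count distinct strings per method across the same grid used earlier.

def graph_signature(edge_set):
    """Canonical string for a learned graph: sorted directed edges joined by ';'."""
    return ";".join(f"{a}->{b}" for a, b in sorted(edge_set))

signatures = {}
for n_samples, seed in itertools.product(SAMPLE_SIZES, SEEDS):
    data = simulate_dag_data(n_samples=n_samples, regime="linear_laplace", seed=seed)
    for result in (run_pc(data), run_ges(data), run_direct_lingam(data)):
        key = graph_signature(extract_directed_edges(result))
        signatures.setdefault(result["run"], set()).add(key)

unique_graph_counts = pd.DataFrame(
    [{"method": method, "unique_graphs": len(sigs)} for method, sigs in signatures.items()]
)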
Runtime Summary
Runtime is part of method selection. A method that is statistically attractive but too slow for repeated sensitivity checks may be hard to use responsibly on a larger graph.
The runtime table is small here because the graph has only six variables. On wider graphs, runtime and stability checks become a much bigger part of practical method choice.
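Because the runner sketches above already record a seconds field, a runtime table is one more small loop.

runtime_rows = [
    {"method": result["run"], "seconds": round(result["seconds"], 3)}
    for result in (run_pc(baseline_data), run_ges(baseline_data), run_direct_lingam(baseline_data))
]
runtime_summary = pd.DataFrame(runtime_rows)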
Benchmark Reporting Checklist
A benchmark should make its assumptions and fragility visible. This checklist is a reusable template for reporting causal discovery comparisons.
reporting_checklist = pd.DataFrame(
    [
        {
            "item": "Graph target",
            "what_to_report": "Whether metrics target a DAG, CPDAG, PAG, skeleton, or weighted adjacency matrix.",
            "why_it_matters": "Different algorithms return different graph objects and should not be collapsed casually.",
        },
        {
            "item": "Data-generating assumptions",
            "what_to_report": "Linearity, noise family, causal sufficiency, stationarity, and sample size.",
            "why_it_matters": "Benchmark performance only applies to the assumptions actually tested.",
        },
        {
            "item": "Skeleton and direction metrics",
            "what_to_report": "Report adjacency recovery separately from arrow-direction recovery.",
            "why_it_matters": "Partially directed graphs can have correct adjacencies but unresolved directions.",
        },
        {
            "item": "Tuning settings",
            "what_to_report": "PC alpha, GES penalty, coefficient threshold, and any prior knowledge used.",
            "why_it_matters": "Graph output can change materially across tuning choices.",
        },
        {
            "item": "Stability checks",
            "what_to_report": "Seed, sample-size, noise, resampling, and graph-signature stability.",
            "why_it_matters": "A single graph can hide high variance across plausible datasets.",
        },
        {
            "item": "Assumption stress tests",
            "what_to_report": "Nonlinearity, Gaussian noise, hidden confounding, missingness, or selection stress.",
            "why_it_matters": "Failure modes are often more informative than the clean baseline result.",
        },
        {
            "item": "Claim strength",
            "what_to_report": "Whether learned edges are final claims, candidate hypotheses, or diagnostic signals.",
            "why_it_matters": "Causal discovery rarely justifies strong claims without domain and design support.",
        },
    ]
)
reporting_checklist.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_reporting_checklist.csv", index=False)
display(reporting_checklist)
(Output: the seven-row reporting checklist table, rendered exactly as constructed in the cell above.)
The checklist is intentionally conservative. Good benchmarking does not make a graph automatically true; it makes the graph’s support and fragility easier to inspect.
Artifact Manifest
The final cell lists the datasets, tables, and figures saved by this notebook.
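A sketch of such a manifest cell, assuming a DATA_DIR exists alongside the FIGURE_DIR and TABLE_DIR used throughout; the glob pattern keys on the notebook prefix.

from pathlib import Path

manifest_rows = []
for directory, kind in [(DATA_DIR, "dataset"), (TABLE_DIR, "table"), (FIGURE_DIR, "figure")]:
    for path in sorted(Path(directory).glob(f"{NOTEBOOK_PREFIX}_*")):
        manifest_rows.append({"kind": kind, "file": path.name})
manifest = pd.DataFrame(manifest_rows)
manifest.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_artifact_manifest.csv", index=False)
display(manifest)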
The notebook leaves us with a practical benchmark pattern: start from a known graph, compare methods by graph layer, scan tuning settings, test stability, then stress the assumptions before making any causal claim from a discovered graph.