causal-learn Tutorial 16: End-To-End Causal Discovery Case Study
This notebook pulls the tutorial pieces together into a complete causal discovery workflow. Instead of focusing on one algorithm, we will walk through the full lifecycle of an applied discovery analysis: problem framing, data dictionary, data audit, domain constraints, candidate graph estimation, stability checks, final graph selection, and report-ready limitations.
The case study is synthetic and company-neutral. We simulate a small product-engagement dataset where the true graph is known for teaching. In a real project, the true graph would not be available, so the same workflow would rely more heavily on domain review, stability checks, and follow-up validation.
Estimated runtime: about 2-4 minutes. The notebook runs PC, GES, DirectLiNGAM, FCI, and several bootstrap checks on a small dataset.
Learning Goals
By the end of this notebook, you should be able to:
turn a vague discovery question into a concrete graph target;
document variables, timing, and domain constraints before running algorithms;
audit a discovery dataset for missingness, scale, dependence, and suspicious structure;
compare candidate graphs from PC, GES, and DirectLiNGAM;
use background knowledge to orient edges that time order already rules out;
summarize edge consensus and bootstrap stability;
produce a final candidate graph with appropriate caveats.
Notebook Flow
The workflow follows the order a careful discovery project should use:
Define the case-study question and variable dictionary.
Simulate a teaching dataset and save the synthetic truth.
Audit the observed data before discovery.
Encode domain timing constraints.
Run candidate discovery algorithms.
Compare graphs to synthetic truth and to each other.
Check tuning and bootstrap stability.
Use FCI as a hidden-confounding screen.
Assemble a final candidate graph and report-ready edge table.
Save outputs and limitations.
Case-Study Question
Suppose a product analytics team wants to understand the causal structure behind a simple engagement funnel. They have observational measurements of user need, catalog fit, onboarding friction, discovery quality, early engagement, support contact, and retention value.
The discovery question is:
Which measured factors appear to be direct causes of downstream engagement and retention outcomes, after accounting for the rest of the observed system?
This is a graph-discovery question, not an effect-estimation question. The output is a candidate causal graph that can guide later experiments, quasi-experimental studies, or targeted causal effect estimation.
Why This Is End-To-End
An end-to-end discovery workflow is not just an algorithm call. The algorithm is one component inside a larger discipline:
define the time order and forbidden directions before looking at graph output;
inspect the data-generating plausibility of algorithm assumptions;
compare several method families rather than trusting one graph;
check edge stability across reasonable perturbations;
label final edges by evidence strength rather than pretending every edge is equally supported.
That full workflow is what this notebook demonstrates.
Setup
This cell imports the scientific Python stack, causal-learn algorithms, and graph drawing utilities. It also sets up output folders and silences known notebook-environment warnings.
The output folder is shared with the rest of the causal-learn tutorial series. Every artifact created here starts with 16_.
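The setup cell itself is not reproduced in this export; a minimal sketch is shown below. The folder name `OUTPUT_DIR` and the specific warning filter are illustrative assumptions, while `FIGURE_DIR`, `TABLE_DIR`, and `NOTEBOOK_PREFIX` match the names used by later cells.

```python
from pathlib import Path
import warnings

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # safe backend for headless runs; notebooks can omit this
import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrowPatch, FancyBboxPatch

# Shared output location for the tutorial series; every artifact from this
# notebook starts with the "16_" prefix.
NOTEBOOK_PREFIX = "16"
OUTPUT_DIR = Path("tutorial_outputs")  # illustrative folder name
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
for folder in (FIGURE_DIR, TABLE_DIR):
    folder.mkdir(parents=True, exist_ok=True)

# Silence known notebook-environment warnings without hiding real errors.
warnings.filterwarnings("ignore", category=FutureWarning)
```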
Package Versions
Saving package versions is part of reproducible graph discovery. Small numerical and graph-format differences can matter when comparing discovered structures.
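A version-capture cell along these lines is assumed here; the package list and the local output filename are illustrative (the notebook saves the table under `TABLE_DIR` with the `16_` prefix).

```python
from importlib.metadata import PackageNotFoundError, version

import pandas as pd

# Record the package versions that produced this run.
packages = ["numpy", "pandas", "matplotlib", "scipy", "causal-learn"]
rows = []
for name in packages:
    try:
        rows.append({"package": name, "version": version(name)})
    except PackageNotFoundError:
        rows.append({"package": name, "version": "not installed"})
version_table = pd.DataFrame(rows)
version_table.to_csv("16_package_versions.csv", index=False)  # illustrative path
print(version_table)
```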
The version table is now saved. This is useful when a graph needs to be reproduced or compared with another run later.
GES BIC Compatibility Wrapper
The GES BIC score needs a small local compatibility wrapper in this environment. The wrapper keeps the same BIC scoring logic while safely converting one-element matrix results into Python scalars.
```python
def local_score_BIC_from_cov(Data, i, PAi, parameters=None):
    """Safe local BIC score used by GES in this notebook."""
    cov, n = Data
    lambda_value = (
        0.5
        if parameters is None or parameters.get("lambda_value") is None
        else parameters.get("lambda_value")
    )
    parent_indices = list(PAi)
    if len(parent_indices) == 0:
        residual_variance = cov[i, i]
    else:
        yX = cov[np.ix_([i], parent_indices)]
        XX = cov[np.ix_(parent_indices, parent_indices)]
        beta = np.linalg.solve(XX, yX.T)
        residual_variance = cov[i, i] - float((yX @ beta).item())
    residual_variance = max(float(residual_variance), 1e-12)
    return float(
        -(n / 2) * np.log(residual_variance)
        - (len(parent_indices) + 1) * lambda_value * np.log(n) / 2
    )

ges_module.local_score_BIC_from_cov = local_score_BIC_from_cov
print("GES BIC score wrapper is active for this notebook session.")
```
GES BIC score wrapper is active for this notebook session.
The wrapper only affects this notebook session. It does not edit causal-learn on disk.
Data Dictionary
The case study uses seven measured variables. The timing column captures a rough process order. This timing information becomes background knowledge later, because later outcomes should not cause earlier setup variables.
```python
VARIABLES = [
    "baseline_need",
    "catalog_fit",
    "onboarding_friction",
    "discovery_quality",
    "early_engagement",
    "support_contact",
    "retention_value",
]
VAR_INDEX = {name: i for i, name in enumerate(VARIABLES)}
DISPLAY_LABELS = {
    "baseline_need": "baseline\nneed",
    "catalog_fit": "catalog\nfit",
    "onboarding_friction": "onboarding\nfriction",
    "discovery_quality": "discovery\nquality",
    "early_engagement": "early\nengagement",
    "support_contact": "support\ncontact",
    "retention_value": "retention\nvalue",
}
variable_dictionary = pd.DataFrame(
    [
        {
            "variable": "baseline_need",
            "plain_language_meaning": "Pre-existing user need or motivation before the experience starts.",
            "timing_tier": 0,
            "role": "upstream context",
        },
        {
            "variable": "catalog_fit",
            "plain_language_meaning": "How well available items fit the user's likely interests.",
            "timing_tier": 1,
            "role": "upstream context",
        },
        {
            "variable": "onboarding_friction",
            "plain_language_meaning": "Early confusion, setup cost, or friction before meaningful use.",
            "timing_tier": 2,
            "role": "early barrier",
        },
        {
            "variable": "discovery_quality",
            "plain_language_meaning": "Quality of the initial discovery or recommendation experience.",
            "timing_tier": 3,
            "role": "candidate mechanism",
        },
        {
            "variable": "early_engagement",
            "plain_language_meaning": "Depth of initial usage after discovery.",
            "timing_tier": 4,
            "role": "intermediate outcome",
        },
        {
            "variable": "support_contact",
            "plain_language_meaning": "Whether the user needed help or support after early use.",
            "timing_tier": 5,
            "role": "friction outcome",
        },
        {
            "variable": "retention_value",
            "plain_language_meaning": "Downstream user value or retention signal.",
            "timing_tier": 6,
            "role": "final outcome",
        },
    ]
)
variable_dictionary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_variable_dictionary.csv", index=False)
display(variable_dictionary)
```
|   | variable | plain_language_meaning | timing_tier | role |
|---|----------|------------------------|-------------|------|
| 0 | baseline_need | Pre-existing user need or motivation before the experience starts. | 0 | upstream context |
| 1 | catalog_fit | How well available items fit the user's likely interests. | 1 | upstream context |
| 2 | onboarding_friction | Early confusion, setup cost, or friction before meaningful use. | 2 | early barrier |
| 3 | discovery_quality | Quality of the initial discovery or recommendation experience. | 3 | candidate mechanism |
| 4 | early_engagement | Depth of initial usage after discovery. | 4 | intermediate outcome |
| 5 | support_contact | Whether the user needed help or support after early use. | 5 | friction outcome |
| 6 | retention_value | Downstream user value or retention signal. | 6 | final outcome |
The tier order does not prove causality. It only encodes directions that are impossible or implausible given the measurement process.
Synthetic Truth For Teaching
Because this is a tutorial, we simulate the dataset from a known graph. In a real case study, this table would not exist; it is included here so we can evaluate how the workflow behaves.
The true graph is intentionally not too dense. That keeps the case study readable while still including multiple parents, downstream mediation, and a support-related pathway.
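The truth table itself is not reproduced in this export. A sketch of one consistent version is shown below, using the edge-table schema (`run`, `source`, `edge_type`, `target`) that the drawing helper expects; the exact edge list is an assumption matched to the funnel edges reviewed later in the notebook.

```python
import pandas as pd

# Assumed teaching truth: two parents into discovery, a friction pathway
# through support, and mediated paths into retention.
TRUE_EDGES = [
    ("baseline_need", "discovery_quality"),
    ("catalog_fit", "discovery_quality"),
    ("onboarding_friction", "early_engagement"),
    ("onboarding_friction", "support_contact"),
    ("discovery_quality", "early_engagement"),
    ("discovery_quality", "retention_value"),
    ("early_engagement", "retention_value"),
    ("support_contact", "retention_value"),
]
true_edge_table = pd.DataFrame(
    [{"run": "truth", "source": s, "edge_type": "-->", "target": t} for s, t in TRUE_EDGES]
)
```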
Simulate The Case-Study Dataset
The simulator generates a clean observational dataset from the true graph. The noise is non-Gaussian so DirectLiNGAM has a fair chance to orient edges, while PC and GES still have useful structure to recover.
The saved dataset contains only observed variables. The hidden-context option is reserved for a later stress check and is not present in the baseline case-study data.
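The simulator cell is not shown in this export; a sketch under stated assumptions follows. The structural coefficients, sample size, and noise scales below are all illustrative, not the notebook's actual values; only the graph shape and the non-Gaussian noise choice follow the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(16)
n = 2000  # assumed sample size

def noise(scale=1.0):
    # Uniform noise keeps the data identifiably non-Gaussian,
    # which gives DirectLiNGAM a fair chance to orient edges.
    return rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n) * scale

baseline_need = noise()
catalog_fit = noise()
onboarding_friction = noise()
discovery_quality = 0.7 * baseline_need + 0.6 * catalog_fit + noise(0.8)
early_engagement = 0.8 * discovery_quality - 0.5 * onboarding_friction + noise(0.8)
support_contact = 0.7 * onboarding_friction + noise(0.9)
retention_value = (
    0.5 * discovery_quality
    + 0.7 * early_engagement
    - 0.4 * support_contact
    + noise(0.7)
)

data = pd.DataFrame(
    {
        "baseline_need": baseline_need,
        "catalog_fit": catalog_fit,
        "onboarding_friction": onboarding_friction,
        "discovery_quality": discovery_quality,
        "early_engagement": early_engagement,
        "support_contact": support_contact,
        "retention_value": retention_value,
    }
)
data = (data - data.mean()) / data.std()  # standardized scales, as the audit reports
```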
Data Audit: Shape, Missingness, And Scale
Before discovery, we check the basics: missing values, variance, and range. A graph search can look sophisticated while quietly failing because of ordinary data quality problems.
The audit is clean: no missing values, no constant columns, and standardized scales. That lets the rest of the notebook focus on graph assumptions rather than preprocessing repairs.
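A basic audit of this kind can be sketched as a small helper; the column choices here are illustrative, and in the notebook it runs on the simulated dataset rather than the tiny demo frame.

```python
import pandas as pd

def audit_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Basic pre-discovery audit: missingness, variation, and range."""
    return pd.DataFrame(
        {
            "missing": df.isna().sum(),
            "n_unique": df.nunique(),
            "std": df.std(),
            "min": df.min(),
            "max": df.max(),
        }
    )

# Tiny demo frame so the helper's output shape is visible.
demo = pd.DataFrame({"x": [0.1, -0.2, 0.3], "y": [1.0, 1.1, 0.9]})
audit = audit_dataset(demo)
print(audit)
```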
Data Audit: Correlations
Correlation does not identify causality, but it is a useful diagnostic. It helps reveal whether the simulated data contain the expected dependence structure before we run graph algorithms.
The correlations show plausible relationships along the funnel, especially around discovery quality, engagement, and retention value. The graph search will now ask which of those associations look direct after conditioning on other variables.
Data Audit: Distribution Shape
Some algorithms rely on distributional clues. DirectLiNGAM, for example, is designed for linear non-Gaussian data. The next cell shows marginal distributions for all variables.
The variables are not perfectly Gaussian, which makes DirectLiNGAM a reasonable candidate method. This is still only a diagnostic, not a guarantee that LiNGAM assumptions are fully true.
Draw The Teaching Truth
This graph is the synthetic answer key. In a real end-to-end analysis, this figure would be replaced by a domain hypothesis diagram rather than known truth.
```python
def trim_edge_to_box(start, end, box_w=0.16, box_h=0.09, gap=0.012):
    x0, y0 = start
    x1, y1 = end
    dx = x1 - x0
    dy = y1 - y0
    distance = (dx**2 + dy**2) ** 0.5
    if distance == 0:
        return start, end
    ux, uy = dx / distance, dy / distance
    candidates = []
    if abs(ux) > 1e-9:
        candidates.append((box_w / 2) / abs(ux))
    if abs(uy) > 1e-9:
        candidates.append((box_h / 2) / abs(uy))
    offset = min(candidates) + gap
    return (x0 + ux * offset, y0 + uy * offset), (x1 - ux * offset, y1 - uy * offset)

GRAPH_POSITIONS = {
    "baseline_need": (0.10, 0.72),
    "catalog_fit": (0.10, 0.46),
    "onboarding_friction": (0.10, 0.20),
    "discovery_quality": (0.40, 0.58),
    "early_engagement": (0.64, 0.48),
    "support_contact": (0.64, 0.20),
    "retention_value": (0.90, 0.48),
}
GRAPH_COLORS = {
    "baseline_need": "#dbeafe",
    "catalog_fit": "#dbeafe",
    "onboarding_friction": "#ede9fe",
    "discovery_quality": "#ecfccb",
    "early_engagement": "#fef3c7",
    "support_contact": "#fde68a",
    "retention_value": "#fee2e2",
}

def draw_box_graph(edge_df, title, path, note=None, highlight_edges=None):
    highlight_edges = highlight_edges or set()
    fig, ax = plt.subplots(figsize=(14, 6.5))
    ax.set_axis_off()
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    for row in edge_df.itertuples(index=False):
        if row.source not in GRAPH_POSITIONS or row.target not in GRAPH_POSITIONS:
            continue
        start, end = trim_edge_to_box(GRAPH_POSITIONS[row.source], GRAPH_POSITIONS[row.target])
        edge_key = (row.source, row.target)
        edge_type = row.edge_type
        color = "#7c3aed" if edge_key in highlight_edges else "#334155"
        arrowstyle = "-|>" if edge_type == "-->" else "-"
        linewidth = 2.4 if edge_key in highlight_edges else 1.8
        arrow = FancyArrowPatch(
            start,
            end,
            arrowstyle=arrowstyle,
            mutation_scale=17,
            linewidth=linewidth,
            color=color,
            connectionstyle="arc3,rad=0.03",
            zorder=2,
        )
        ax.add_patch(arrow)
    for variable, (x, y) in GRAPH_POSITIONS.items():
        rect = FancyBboxPatch(
            (x - 0.082, y - 0.046),
            0.164,
            0.092,
            boxstyle="round,pad=0.014",
            facecolor=GRAPH_COLORS[variable],
            edgecolor="#1f2937",
            linewidth=1.15,
            zorder=4,
        )
        ax.add_patch(rect)
        ax.text(
            x, y, DISPLAY_LABELS[variable],
            ha="center", va="center", fontsize=9.8, fontweight="bold", zorder=5,
        )
    if note:
        ax.text(0.5, 0.07, note, ha="center", va="center", fontsize=10, color="#475569")
    ax.set_title(title, pad=16, fontsize=16, fontweight="bold")
    fig.savefig(path, dpi=160, bbox_inches="tight")
    plt.show()

draw_box_graph(
    true_edge_table,
    "Synthetic Teaching Truth For The Case Study",
    FIGURE_DIR / f"{NOTEBOOK_PREFIX}_true_case_study_graph.png",
    note="This truth graph is available only because the notebook uses simulated data.",
)
```
The truth graph gives us a way to score the workflow. In real work, the same visual style can be used for a pre-analysis domain graph and for the final candidate graph.
Encode Domain Timing Constraints
The variable dictionary gave each variable a timing tier. The next cell converts that tier order into causal-learn background knowledge for PC: later variables are not allowed to cause earlier variables.
```python
def build_tier_background_knowledge(variable_dictionary):
    nodes = {variable: GraphNode(variable) for variable in variable_dictionary["variable"]}
    knowledge = BackgroundKnowledge()
    for row in variable_dictionary.itertuples(index=False):
        knowledge.add_node_to_tier(nodes[row.variable], int(row.timing_tier))
    return knowledge

background_knowledge = build_tier_background_knowledge(variable_dictionary)
constraint_table = variable_dictionary[["variable", "timing_tier", "role"]].copy()
constraint_table.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_timing_constraints.csv", index=False)
display(constraint_table)
```
|   | variable | timing_tier | role |
|---|----------|-------------|------|
| 0 | baseline_need | 0 | upstream context |
| 1 | catalog_fit | 1 | upstream context |
| 2 | onboarding_friction | 2 | early barrier |
| 3 | discovery_quality | 3 | candidate mechanism |
| 4 | early_engagement | 4 | intermediate outcome |
| 5 | support_contact | 5 | friction outcome |
| 6 | retention_value | 6 | final outcome |
The constraints are modest but important. They do not force any edge to exist; they only rule out time-reversing directions.
Graph Parsing And Evaluation Helpers
This helper block converts outputs from causal-learn into one standard edge table format and computes skeleton and direction metrics against the synthetic truth.
The metrics keep adjacency recovery and arrow recovery separate. That distinction is especially important for PC and GES, which can return partially directed graphs.
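The separation between skeleton and directed scoring can be sketched as follows. This is a simplified version under an explicit assumption: every estimated edge is a fully directed `(source, target)` claim, whereas the notebook's actual helper also handles undirected CPDAG edges.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def graph_metrics(estimated, truth):
    """Score directed (source, target) edges against the truth, keeping
    adjacency (skeleton) recovery separate from arrow recovery."""
    est_skeleton = {frozenset(edge) for edge in estimated}
    true_skeleton = {frozenset(edge) for edge in truth}
    skeleton_tp = len(est_skeleton & true_skeleton)
    directed_tp = len(set(estimated) & set(truth))
    return {
        "skeleton_f1": f1(
            skeleton_tp, len(est_skeleton) - skeleton_tp, len(true_skeleton) - skeleton_tp
        ),
        "directed_f1": f1(
            directed_tp, len(estimated) - directed_tp, len(truth) - directed_tp
        ),
    }

# One reversed arrow hurts directed F1 but leaves the skeleton score perfect.
metrics = graph_metrics([("a", "b"), ("c", "b")], [("a", "b"), ("b", "c")])
print(metrics)  # skeleton_f1 = 1.0, directed_f1 = 0.5
```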
Algorithm Runners
The case study compares four views:
unconstrained PC, to see what the data alone suggest;
tier-constrained PC, to respect domain timing;
GES, a score-based candidate graph;
DirectLiNGAM, a linear non-Gaussian directed graph.
The runner functions keep tuning choices explicit. This makes later sensitivity checks easier to read.
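A sketch of such runners against causal-learn's entry points is shown below. The defaults (`alpha`, the score name, the pruning threshold) are illustrative choices, not the notebook's actual settings; imports are kept inside the functions so the sketch loads even without causal-learn installed.

```python
def run_pc(data, alpha=0.05, background_knowledge=None):
    """Constraint-based PC; pass tier background knowledge to constrain it."""
    from causallearn.search.ConstraintBased.PC import pc
    return pc(data, alpha=alpha, background_knowledge=background_knowledge)

def run_ges(data):
    """Score-based GES with the BIC score; returns a record whose 'G' key holds the graph."""
    from causallearn.search.ScoreBased.GES import ges
    return ges(data, score_func="local_score_BIC")

def run_direct_lingam(data, coef_threshold=0.05):
    """DirectLiNGAM; tiny coefficients are pruned by an explicit threshold."""
    from causallearn.search.FCMBased import lingam
    model = lingam.DirectLiNGAM()
    model.fit(data)
    adjacency = model.adjacency_matrix_.copy()
    adjacency[abs(adjacency) < coef_threshold] = 0.0
    return adjacency
```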
Run Candidate Discovery Methods
We now estimate candidate graphs. The unconstrained PC graph is useful as a diagnostic, but the constrained PC graph better respects the known timing order.
The candidate comparison shows the value of method triangulation. The constrained PC graph respects timing, GES provides a score-based view, and DirectLiNGAM gives a directed weighted view under non-Gaussian assumptions.
Plot Candidate Method Metrics
This plot compares skeleton and directed F1 for the candidate methods. The synthetic truth lets us quantify what would normally be judged through stability and domain review.
The plot highlights a common applied pattern: a method can recover a strong skeleton while leaving some arrows unresolved, and a functional method can recover more directions when its assumptions are plausible.
Draw Candidate Graphs
A table is precise, but graph drawings are better for stakeholder review. The next cell saves the constrained PC, GES, and DirectLiNGAM candidate graphs.
```python
for method_name in ["PC_tier_constrained", "GES_BIC", "DirectLiNGAM"]:
    method_edges = candidate_edge_table[candidate_edge_table["run"] == method_name].copy()
    draw_box_graph(
        method_edges,
        f"Candidate Graph: {method_name}",
        FIGURE_DIR / f"{NOTEBOOK_PREFIX}_{method_name.lower()}_candidate_graph.png",
        note="Plain lines are unresolved adjacencies; arrows are directed claims from that method.",
    )
```
The candidate graphs make it easier to inspect where algorithms agree and where they disagree. This is the visual input to the consensus step.
Edge Consensus Across Methods
The next table counts how many candidate methods support each directed edge and each adjacency. We use only the three serious candidates for final review: tier-constrained PC, GES, and DirectLiNGAM.
The consensus table gives a practical review queue. Edges supported by multiple methods and consistent with timing constraints are better candidates for the final graph.
Tuning Sensitivity
Before choosing a final graph, we scan key tuning knobs: PC alpha, GES BIC penalty, and DirectLiNGAM coefficient threshold.
The plot helps decide whether the final graph is a robust finding or a fragile artifact of one tuning choice.
Bootstrap Edge Stability
A final graph should not depend on a single sample draw. We bootstrap rows, rerun the three candidate methods, and count how often each directed edge appears.
The bootstrap summary gives a stability view by method. Next we turn the bootstrap edges into edge-level inclusion rates.
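The resampling loop can be sketched generically; the bootstrap count and seed below are illustrative, and `run_method` stands in for any of the three candidate runners (here demonstrated with a trivial correlation-threshold stand-in).

```python
import numpy as np

def bootstrap_edge_rates(data, run_method, n_boot=30, seed=16):
    """Resample rows with replacement, rerun one discovery method, and
    report the inclusion rate of each directed edge.

    run_method: callable mapping an (n, p) array to a set of
    (source, target) edge tuples.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    counts = {}
    for _ in range(n_boot):
        sample = data[rng.integers(0, n, size=n)]
        for edge in run_method(sample):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: count / n_boot for edge, count in counts.items()}

# Toy check with a trivial "method" that thresholds a correlation.
toy = np.column_stack([np.linspace(0.0, 1.0, 50), np.linspace(0.0, 2.0, 50)])
method = lambda s: {("x0", "x1")} if np.corrcoef(s.T)[0, 1] > 0.5 else set()
rates = bootstrap_edge_rates(toy, method, n_boot=10)
```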
Edge-Level Stability Table
This table reports how often each directed edge appears across all method-bootstrap runs. It also records method-specific support so final graph selection can distinguish broad support from one-method support.
The stability table is the main evidence table for final graph selection. Edges with high inclusion across multiple method families are easier to defend.
Plot Edge Stability
The plot shows the strongest directed edges by overall inclusion rate.
The strongest edges form a sensible funnel from upstream context through discovery, engagement, support, and retention.
Hidden-Confounding Screen With FCI
FCI is useful when hidden common causes may exist. Here we run it as a diagnostic screen on the baseline observed data. We do not force FCI into the same DAG metric table because it returns a different graph type.
The FCI output should be read as a PAG-style diagnostic, not as an ordinary DAG. If it shows many ambiguous or bidirected endpoint marks, that is a warning to be cautious about observed-DAG claims.
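A minimal sketch of the screen, using causal-learn's `fci` entry point with an illustrative `alpha`:

```python
def run_fci_screen(data, alpha=0.05):
    """Run FCI as a hidden-confounding diagnostic.

    Returns causal-learn's PAG and its edge list; endpoint marks (circles,
    bidirected arrows) are read as warnings, not as DAG claims. The import
    stays inside the function so the sketch loads without causal-learn.
    """
    from causallearn.search.ConstraintBased.FCI import fci
    pag, edges = fci(data, alpha=alpha)
    return pag, edges
```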
Hidden-Context Stress Check
To show why the FCI diagnostic matters, we simulate a variant with an unobserved context variable that affects both baseline need and retention value. We then compare the candidate methods on this harder dataset.
The hidden-context stress check shows how quickly observed-DAG recovery can degrade when causal sufficiency is violated. In real work, this would motivate sensitivity analysis or designs that address omitted causes.
Choose A Final Candidate Graph
We now assemble a final candidate graph using three rules:
the edge respects timing constraints;
the edge appears in the DirectLiNGAM graph or has support from at least two candidate methods;
the edge has nontrivial bootstrap inclusion support.
These rules are intentionally transparent rather than clever.
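Assuming the stability table uses the columns reported later (`source`, `target`, `overall_inclusion_rate`, `lingam_rate`, `methods_with_any_support`), the three rules can be sketched as a single filter; the inclusion cutoff and the use of `lingam_rate > 0` as a proxy for "appears in the DirectLiNGAM graph" are illustrative choices.

```python
import pandas as pd

def select_final_edges(stability, tier_of, min_inclusion=0.30):
    """Apply the three transparent selection rules to an edge-stability table."""
    kept = []
    for row in stability.itertuples(index=False):
        respects_timing = tier_of[row.source] < tier_of[row.target]
        method_support = row.lingam_rate > 0 or row.methods_with_any_support >= 2
        stable = row.overall_inclusion_rate >= min_inclusion
        if respects_timing and method_support and stable:
            kept.append(row._asdict())
    return pd.DataFrame(kept)

tiers = {"onboarding_friction": 2, "early_engagement": 4}
demo = pd.DataFrame(
    [
        # kept: respects timing, has LiNGAM support, and is stable
        {"source": "onboarding_friction", "target": "early_engagement",
         "overall_inclusion_rate": 0.87, "lingam_rate": 1.0,
         "methods_with_any_support": 3},
        # dropped: time-reversed direction
        {"source": "early_engagement", "target": "onboarding_friction",
         "overall_inclusion_rate": 0.87, "lingam_rate": 1.0,
         "methods_with_any_support": 3},
    ]
)
final = select_final_edges(demo, tiers)
```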
The final edge table is not just a graph; it carries the evidence used to include each edge. That makes the graph easier to review and challenge.
Draw The Final Candidate Graph
The final graph highlights edges selected by the stability and consensus rules.
```python
final_draw_edges = final_edge_table[["run", "source", "edge_type", "target"]].copy()
highlight_edges = set(zip(final_draw_edges["source"], final_draw_edges["target"]))
draw_box_graph(
    final_draw_edges,
    "Final Candidate Causal Discovery Graph",
    FIGURE_DIR / f"{NOTEBOOK_PREFIX}_final_candidate_graph.png",
    note="Final edges were selected using timing constraints, method consensus, and bootstrap stability.",
    highlight_edges=highlight_edges,
)
```
The final graph should be treated as a candidate structure for follow-up validation, not as a completed causal proof.
Final Graph Score Against Synthetic Truth
Because the notebook is synthetic, we can score the final graph against the known truth. This is a teaching luxury rather than something available in real case studies.
The final score quantifies how well the workflow recovered the synthetic truth. In real work, the equivalent review would be expert plausibility, stability, and validation against interventions or future data.
Report-Ready Edge Summary
The next table turns the final graph into a compact report artifact. It labels each edge by evidence strength and notes the likely role in the funnel.
```python
def evidence_label(row):
    if row["lingam_rate"] >= 0.90 and row["methods_with_any_support"] >= 2:
        return "strong candidate"
    if row["overall_inclusion_rate"] >= 0.30:
        return "moderate candidate"
    return "review only"

edge_role_lookup = {
    ("baseline_need", "discovery_quality"): "upstream need to discovery mechanism",
    ("catalog_fit", "discovery_quality"): "catalog fit to discovery mechanism",
    ("onboarding_friction", "early_engagement"): "friction reducing early engagement",
    ("onboarding_friction", "support_contact"): "friction increasing support burden",
    ("discovery_quality", "early_engagement"): "discovery driving early usage",
    ("discovery_quality", "retention_value"): "direct discovery-to-retention path",
    ("early_engagement", "retention_value"): "engagement-to-retention path",
    ("support_contact", "retention_value"): "support burden to retention path",
}
report_edge_summary = final_edge_table.copy()
report_edge_summary["evidence_label"] = report_edge_summary.apply(evidence_label, axis=1)
report_edge_summary["business_role"] = report_edge_summary.apply(
    lambda row: edge_role_lookup.get((row["source"], row["target"]), "requires review"),
    axis=1,
)
report_edge_summary = report_edge_summary[
    [
        "source",
        "target",
        "evidence_label",
        "business_role",
        "overall_inclusion_rate",
        "pc_rate",
        "ges_rate",
        "lingam_rate",
        "methods_with_any_support",
    ]
]
report_edge_summary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_report_edge_summary.csv", index=False)
display(report_edge_summary.round(3))
```
|   | source | target | evidence_label | business_role | overall_inclusion_rate | pc_rate | ges_rate | lingam_rate | methods_with_any_support |
|---|--------|--------|----------------|---------------|------------------------|---------|----------|-------------|--------------------------|
| 0 | baseline_need | discovery_quality | strong candidate | upstream need to discovery mechanism | 0.983 | 1.00 | 0.95 | 1.0 | 3 |
| 1 | catalog_fit | discovery_quality | strong candidate | catalog fit to discovery mechanism | 0.983 | 1.00 | 0.95 | 1.0 | 3 |
| 2 | discovery_quality | early_engagement | strong candidate | discovery driving early usage | 1.000 | 1.00 | 1.00 | 1.0 | 3 |
| 3 | discovery_quality | retention_value | strong candidate | direct discovery-to-retention path | 0.750 | 0.25 | 1.00 | 1.0 | 3 |
| 4 | early_engagement | retention_value | strong candidate | engagement-to-retention path | 1.000 | 1.00 | 1.00 | 1.0 | 3 |
| 5 | onboarding_friction | early_engagement | strong candidate | friction reducing early engagement | 0.867 | 0.60 | 1.00 | 1.0 | 3 |
| 6 | onboarding_friction | support_contact | strong candidate | friction increasing support burden | 0.750 | 1.00 | 0.25 | 1.0 | 3 |
| 7 | support_contact | retention_value | strong candidate | support burden to retention path | 1.000 | 1.00 | 1.00 | 1.0 | 3 |
This table is the sort of artifact that can move into a written report. It gives the edge, the evidence level, and a plain-language role.
Case-Study Limitations
A final discovery notebook should include limitations directly beside the result. This cell creates a concise limitations table.
```python
limitations_table = pd.DataFrame(
    [
        {
            "limitation": "Observational graph discovery is not causal proof",
            "practical_response": "Use the final graph to prioritize experiments, natural experiments, or effect estimation.",
        },
        {
            "limitation": "Hidden confounding can create misleading adjacencies",
            "practical_response": "Use FCI diagnostics, sensitivity checks, richer covariates, or study designs that address omitted causes.",
        },
        {
            "limitation": "Timing constraints are domain assumptions",
            "practical_response": "Document why each tier is plausible and revisit tiers when measurement timing changes.",
        },
        {
            "limitation": "LiNGAM depends on linear non-Gaussian assumptions",
            "practical_response": "Check residuals, compare with nonparametric methods, and avoid overclaiming directions from one method.",
        },
        {
            "limitation": "Bootstrap row resampling is a simple stability check",
            "practical_response": "For temporal or clustered data, use block, user-level, or time-split stability checks instead.",
        },
        {
            "limitation": "Synthetic truth is available only in this tutorial",
            "practical_response": "In real applications, replace truth scoring with domain review and validation on new data or interventions.",
        },
    ]
)
limitations_table.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_limitations.csv", index=False)
display(limitations_table)
```
|   | limitation | practical_response |
|---|------------|--------------------|
| 0 | Observational graph discovery is not causal proof | Use the final graph to prioritize experiments, natural experiments, or effect estimation. |
| 1 | Hidden confounding can create misleading adjacencies | Use FCI diagnostics, sensitivity checks, richer covariates, or study designs that address omitted causes. |
| 2 | Timing constraints are domain assumptions | Document why each tier is plausible and revisit tiers when measurement timing changes. |
| 3 | LiNGAM depends on linear non-Gaussian assumptions | Check residuals, compare with nonparametric methods, and avoid overclaiming directions from one method. |
| 4 | Bootstrap row resampling is a simple stability check | For temporal or clustered data, use block, user-level, or time-split stability checks instead. |
| 5 | Synthetic truth is available only in this tutorial | In real applications, replace truth scoring with domain review and validation on new data or interventions. |
The limitations are not an apology for the analysis. They are part of the analysis. Causal discovery is most useful when its uncertainty is visible.
Final Workflow Checklist
The checklist below summarizes the end-to-end workflow in reusable form.
The case study is complete. The important pattern is not the specific synthetic graph; it is the disciplined workflow: define, audit, constrain, estimate, stress-test, select, and report with caveats.