causal-learn Tutorial 17: Common Pitfalls, Reporting, And Limitations
This closing notebook is about the habits that keep causal discovery work honest. Earlier notebooks showed how to run PC, FCI, GES, LiNGAM, functional causal models, permutation methods, time-series methods, hidden-representation ideas, benchmarks, and an end-to-end case study. This notebook asks a different question: once you can run these algorithms, how do you avoid overclaiming what the output graph means?
Causal discovery is powerful because it can surface candidate structure from data. It is risky because the algorithms are only as good as their assumptions, variable design, measurement timing, sample size, and validation workflow. A polished graph can still be wrong if it was learned from biased data, if hidden common causes were ignored, if post-outcome variables leaked into the feature set, or if unstable edges were reported as facts.
The goal here is not to make you pessimistic. The goal is to make you precise. A good causal discovery report says what was searched, what assumptions were made, what changed under sensitivity checks, which edges were stable, which edges require domain review, and what downstream validation should happen next.
Estimated runtime: about 1 to 2 minutes on a typical laptop. The examples use small synthetic datasets so the notebook stays fast and reproducible.
Learning Goals
By the end of this notebook, you should be able to:
recognize hidden-confounding risk when a discovered adjacency has a plausible omitted common cause;
see why small samples and weak effects make graph recovery unstable;
explain why many graph directions are not identifiable from conditional independence alone;
detect temporal leakage and post-outcome variables before they enter a discovery run;
demonstrate selection bias from conditioning on a collider;
flag near-duplicate measurements that can create cluttered or misleading graphs;
write reports that treat discovered graphs as candidate structures rather than finished causal proof.
Notebook Flow
The notebook moves from setup to reusable helpers and then through six practical failure-mode labs:
hidden confounding can create convincing but incomplete observed graphs;
weak tests and small samples can make edge recovery highly variable;
Markov-equivalent graphs can share the same conditional-independence evidence;
temporal leakage can smuggle future information into a discovery task;
selection bias can create dependence where no direct causal link exists;
near-duplicate measurements can dominate a learned graph.
The final section turns these examples into a reporting checklist and a reusable graph-discovery report template.
Setup
This cell imports the scientific Python stack, prepares the output folders, configures Matplotlib to write cache files inside the repository, and imports the causal-learn algorithms used in the examples. Progress bars are disabled in algorithm calls later so the notebook stays clean when executed end to end.
The setup cell creates the same output structure used by the rest of the causal-learn tutorial. Keeping figures, tables, datasets, and reports under one outputs folder makes the notebook easy to rerun and easy to audit later.
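The exact setup cell varies by environment, but a minimal sketch that defines every name later cells rely on could look like the following. The prefix, seed, and folder names here are assumptions rather than the series' fixed values; the causal-learn import paths are the library's standard ones.

# Minimal setup sketch; values marked as assumptions may differ from the real notebook.
import os
from pathlib import Path

# Keep Matplotlib cache files inside the repository; MPLCONFIGDIR must be set
# before pyplot is imported. The cache location is an assumption.
mpl_cache = Path("outputs") / "mplcache"
mpl_cache.mkdir(parents=True, exist_ok=True)
os.environ.setdefault("MPLCONFIGDIR", str(mpl_cache))

import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrowPatch, FancyBboxPatch
from sklearn.preprocessing import StandardScaler
from IPython.display import display

from causallearn.search.ConstraintBased.PC import pc
from causallearn.search.ConstraintBased.FCI import fci

NOTEBOOK_PREFIX = "17_pitfalls"  # assumption: the series' actual prefix may differ
RANDOM_SEED = 7                  # assumption: the series' actual seed may differ

OUTPUT_DIR = Path("outputs")     # assumption: shared outputs folder used across the tutorial
FIGURE_DIR = OUTPUT_DIR / "figures"
TABLE_DIR = OUTPUT_DIR / "tables"
DATASET_DIR = OUTPUT_DIR / "datasets"
REPORT_DIR = OUTPUT_DIR / "reports"
for folder in (FIGURE_DIR, TABLE_DIR, DATASET_DIR, REPORT_DIR):
    folder.mkdir(parents=True, exist_ok=True)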
Package Versions
A causal discovery result depends on algorithm implementations and default settings. This cell records the package versions used for the tutorial run so that saved outputs can be traced back to the environment that created them.
This table is small, but it belongs in serious reports. If an edge changes after a package upgrade, the version table helps separate a real data change from an implementation or default-setting change.
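A sketch of a version-recording cell using the standard importlib.metadata API; the package list is an assumption based on the imports above.

from importlib.metadata import PackageNotFoundError, version

packages = ["causal-learn", "numpy", "pandas", "scikit-learn", "matplotlib"]
version_rows = []
for name in packages:
    try:
        version_rows.append({"package": name, "version": version(name)})
    except PackageNotFoundError:
        version_rows.append({"package": name, "version": "not installed"})

version_table = pd.DataFrame(version_rows)
version_table.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_package_versions.csv", index=False)
display(version_table)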
Pitfall Map
Before running examples, we list the major failure modes covered in this notebook. The map links each pitfall to a diagnostic habit and a reporting habit. This is the mental checklist to keep beside any causal discovery graph.
pitfall_map = pd.DataFrame(
    [
        {
            "pitfall": "Hidden confounding",
            "what_goes_wrong": "Observed variables can be connected because they share an omitted cause.",
            "diagnostic_habit": "Compare PC-style output with FCI-style latent-risk output and review omitted-cause stories.",
            "reporting_habit": "Call the edge an observed-data candidate, not proof of direct causation.",
        },
        {
            "pitfall": "Weak tests and small samples",
            "what_goes_wrong": "Conditional-independence tests miss true weak links or add noisy links.",
            "diagnostic_habit": "Run sample-size, alpha, and bootstrap sensitivity checks.",
            "reporting_habit": "Report stable and unstable edge groups separately.",
        },
        {
            "pitfall": "Over-orientation",
            "what_goes_wrong": "Several DAGs can imply the same observable independence pattern.",
            "diagnostic_habit": "Distinguish skeleton evidence from direction evidence.",
            "reporting_habit": "Use cautious language for directions that come from assumptions rather than data alone.",
        },
        {
            "pitfall": "Temporal leakage",
            "what_goes_wrong": "Post-outcome variables can make the graph look predictive but invalid for causal reasoning.",
            "diagnostic_habit": "Build a measurement-time table before discovery.",
            "reporting_habit": "State the time window for every variable used in the graph.",
        },
        {
            "pitfall": "Selection bias",
            "what_goes_wrong": "Filtering on a collider can induce dependence between unrelated causes.",
            "diagnostic_habit": "Compare full-population and selected-sample associations when possible.",
            "reporting_habit": "Document the sampling rule and who is excluded.",
        },
        {
            "pitfall": "Near-duplicate measurements",
            "what_goes_wrong": "Multiple proxies for the same construct can crowd the graph with measurement edges.",
            "diagnostic_habit": "Check high correlations, duplicate definitions, and shared feature engineering.",
            "reporting_habit": "Collapse or label measurement families before reading them as causal mechanisms.",
        },
    ]
)
pitfall_map.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_pitfall_map.csv", index=False)
display(pitfall_map)
pitfall | what_goes_wrong | diagnostic_habit | reporting_habit
--- | --- | --- | ---
Hidden confounding | Observed variables can be connected because they share an omitted cause. | Compare PC-style output with FCI-style latent-risk output and review omitted-cause stories. | Call the edge an observed-data candidate, not proof of direct causation.
Weak tests and small samples | Conditional-independence tests miss true weak links or add noisy links. | Run sample-size, alpha, and bootstrap sensitivity checks. | Report stable and unstable edge groups separately.
Over-orientation | Several DAGs can imply the same observable independence pattern. | Distinguish skeleton evidence from direction evidence. | Use cautious language for directions that come from assumptions rather than data alone.
Temporal leakage | Post-outcome variables can make the graph look predictive but invalid for causal reasoning. | Build a measurement-time table before discovery. | State the time window for every variable used in the graph.
Selection bias | Filtering on a collider can induce dependence between unrelated causes. | Compare full-population and selected-sample associations when possible. | Document the sampling rule and who is excluded.
Near-duplicate measurements | Multiple proxies for the same construct can crowd the graph with measurement edges. | Check high correlations, duplicate definitions, and shared feature engineering. | Collapse or label measurement families before reading them as causal mechanisms.
The theme across the map is simple: discovery output needs context. A graph without assumptions, data provenance, and sensitivity checks is closer to a hypothesis sketch than a decision-ready causal object.
Reusable Graph And Evaluation Helpers
The next cell defines small helpers used throughout the notebook. The graph drawing helper follows the same visual style used across this tutorial: wide white canvas, rounded pastel boxes, bold labels, and dark arrows that stop before the boxes so arrowheads remain visible.
def trim_edge_to_box(start, end, box_w=0.16, box_h=0.09, gap=0.012):
    """Shorten an edge so arrows land outside rounded boxes rather than under labels."""
    x0, y0 = start
    x1, y1 = end
    dx = x1 - x0
    dy = y1 - y0
    distance = (dx**2 + dy**2) ** 0.5
    if distance == 0:
        return start, end
    ux, uy = dx / distance, dy / distance
    candidates = []
    if abs(ux) > 1e-9:
        candidates.append((box_w / 2) / abs(ux))
    if abs(uy) > 1e-9:
        candidates.append((box_h / 2) / abs(uy))
    offset = min(candidates) + gap
    return (x0 + ux * offset, y0 + uy * offset), (x1 - ux * offset, y1 - uy * offset)


def draw_box_graph(edge_df, node_specs, title, path, note=None, highlight_edges=None):
    """Draw a compact causal graph using the notebook's rounded-box figure style."""
    highlight_edges = highlight_edges or set()
    fig, ax = plt.subplots(figsize=(14, 6.5))
    ax.set_axis_off()
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    for row in edge_df.itertuples(index=False):
        if row.source not in node_specs or row.target not in node_specs:
            continue
        start, end = trim_edge_to_box(node_specs[row.source]["xy"], node_specs[row.target]["xy"])
        edge_type = getattr(row, "edge_type", "-->")
        edge_key = (row.source, row.target)
        color = "#7c3aed" if edge_key in highlight_edges else "#334155"
        arrowstyle = "-|>" if edge_type == "-->" else "-"
        linestyle = "--" if edge_type in {"o-o", "<->"} else "solid"
        linewidth = 2.4 if edge_key in highlight_edges else 1.8
        arrow = FancyArrowPatch(
            start,
            end,
            arrowstyle=arrowstyle,
            mutation_scale=18,
            linewidth=linewidth,
            color=color,
            linestyle=linestyle,
            connectionstyle="arc3,rad=0.035",
            zorder=2,
        )
        ax.add_patch(arrow)
    for variable, spec in node_specs.items():
        x, y = spec["xy"]
        rect = FancyBboxPatch(
            (x - 0.082, y - 0.046),
            0.164,
            0.092,
            boxstyle="round,pad=0.014",
            facecolor=spec.get("color", "#dbeafe"),
            edgecolor="#1f2937",
            linewidth=1.15,
            zorder=4,
        )
        ax.add_patch(rect)
        ax.text(x, y, spec.get("label", variable), ha="center", va="center",
                fontsize=9.8, fontweight="bold", zorder=5)
    if note:
        ax.text(0.5, 0.07, note, ha="center", va="center", fontsize=10, color="#475569")
    ax.set_title(title, pad=16, fontsize=16, fontweight="bold")
    fig.savefig(path, dpi=160, bbox_inches="tight")
    plt.show()


def make_edge_table(edges, run="manual"):
    return pd.DataFrame(
        [{"run": run, "source": source, "edge_type": edge_type, "target": target}
         for source, edge_type, target in edges],
        columns=["run", "source", "edge_type", "target"],
    )


def parse_causallearn_edge(edge):
    text = str(edge).strip()
    edge_tokens = [" --> ", " <-- ", " <-> ", " o-> ", " <-o ", " o-o ", " --- "]
    for token in edge_tokens:
        if token in text:
            left, right = text.split(token)
            if token == " <-- ":
                return {"source": right.strip(), "edge_type": "-->", "target": left.strip()}
            if token == " <-o ":
                return {"source": right.strip(), "edge_type": "o->", "target": left.strip()}
            return {"source": left.strip(), "edge_type": token.strip(), "target": right.strip()}
    raise ValueError(f"Could not parse edge: {text}")


def graph_to_edge_table(graph, label):
    rows = [parse_causallearn_edge(edge) for edge in graph.get_graph_edges()]
    edge_table = pd.DataFrame(rows, columns=["source", "edge_type", "target"])
    edge_table.insert(0, "run", label)
    return edge_table


def standardize_frame(df):
    return pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)


def run_pc_edges(df, alpha=0.05, label="PC"):
    started = time.perf_counter()
    graph = pc(
        df.to_numpy(),
        alpha=alpha,
        indep_test="fisherz",
        stable=True,
        verbose=False,
        show_progress=False,
        node_names=list(df.columns),
    ).G
    elapsed = time.perf_counter() - started
    return graph_to_edge_table(graph, label), elapsed


def run_fci_edges(df, alpha=0.05, label="FCI"):
    started = time.perf_counter()
    graph, _ = fci(
        df.to_numpy(),
        independence_test_method="fisherz",
        alpha=alpha,
        verbose=False,
        show_progress=False,
        node_names=list(df.columns),
    )
    elapsed = time.perf_counter() - started
    return graph_to_edge_table(graph, label), elapsed


def skeleton_pairs(edge_df):
    if edge_df.empty:
        return set()
    return {tuple(sorted((row.source, row.target))) for row in edge_df.itertuples(index=False)}


def directed_pairs(edge_df):
    if edge_df.empty:
        return set()
    return {(row.source, row.target) for row in edge_df.itertuples(index=False) if row.edge_type == "-->"}
These helpers keep later cells focused on the lesson rather than notebook plumbing. In particular, the graph parser turns causal-learn edge objects into a tabular format that can be saved, filtered, compared, and reported.
Lab 1: Hidden Confounding
The first failure mode is an omitted common cause. In the synthetic truth below, latent_need affects both discovery_quality and retention_value. The observed dataset leaves latent_need out. That means an algorithm may connect the two observed variables even if the strongest story is shared upstream context rather than a clean direct effect.
This is the classic reason to treat an observed edge as a candidate path, not automatic proof of direct causation.
hidden_specs = {
    "latent_need": {"xy": (0.18, 0.62), "label": "latent\nneed", "color": "#f3f4f6"},
    "discovery_quality": {"xy": (0.55, 0.72), "label": "discovery\nquality", "color": "#ecfccb"},
    "retention_value": {"xy": (0.82, 0.42), "label": "retention\nvalue", "color": "#fee2e2"},
}

hidden_truth_edges = make_edge_table(
    [
        ("latent_need", "-->", "discovery_quality"),
        ("latent_need", "-->", "retention_value"),
    ],
    run="hidden_truth",
)
hidden_truth_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_hidden_confounding_truth_edges.csv", index=False)

draw_box_graph(
    hidden_truth_edges,
    hidden_specs,
    "Hidden Common Cause Behind An Observed Association",
    FIGURE_DIR / f"{NOTEBOOK_PREFIX}_hidden_confounding_truth.png",
    note="The gray variable is not available to the discovery algorithm in the observed-data run.",
)
The graph shows why hidden common causes are not a minor detail. If the gray variable is missing from the data, the observed variables can look causally close even when the underlying structure is incomplete.
Now we simulate the observed data. We keep only discovery_quality and retention_value, even though both were generated from the omitted latent_need. This reproduces a common real workflow: a dataset contains outcomes and intermediate measures, but not all upstream causes that generated them.
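A minimal sketch of the simulation follows; the function name and effect sizes are illustrative assumptions rather than values fixed elsewhere in the series.

def simulate_hidden_confounding(seed, n_samples=3000):
    # latent_need drives both observed variables; coefficients are illustrative assumptions.
    rng = np.random.default_rng(seed)
    latent_need = rng.normal(size=n_samples)
    discovery_quality = 0.9 * latent_need + rng.normal(scale=0.6, size=n_samples)
    retention_value = 0.9 * latent_need + rng.normal(scale=0.6, size=n_samples)
    full = pd.DataFrame({
        "latent_need": latent_need,
        "discovery_quality": discovery_quality,
        "retention_value": retention_value,
    })
    # Drop the latent cause to mimic a dataset where it was never measured.
    observed = standardize_frame(full[["discovery_quality", "retention_value"]])
    return full, observed

hidden_full_df, hidden_observed_df = simulate_hidden_confounding(RANDOM_SEED)
hidden_observed_df.to_csv(DATASET_DIR / f"{NOTEBOOK_PREFIX}_hidden_confounding_observed.csv", index=False)
print(hidden_observed_df.corr().round(3))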
The observed data has only two columns, so no purely observational algorithm can recover the missing gray node. The best we can do from this dataset alone is detect a strong association and then be explicit about omitted-cause risk.
The next cell compares PC and FCI on the observed two-variable dataset. FCI is designed to be more cautious when latent confounding may exist, although with only two observed variables it still cannot name the hidden cause. The point is to compare the edge language, not to expect magic recovery of a variable that was never measured.
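Using the run_pc_edges and run_fci_edges helpers defined earlier, a sketch of that comparison could be:

pc_edges, pc_seconds = run_pc_edges(hidden_observed_df, label="PC_observed_only")
fci_edges, fci_seconds = run_fci_edges(hidden_observed_df, label="FCI_observed_only")

# Stack both edge tables so the edge-mark language can be compared side by side.
hidden_comparison = pd.concat([pc_edges, fci_edges], ignore_index=True)
hidden_comparison.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_hidden_confounding_pc_vs_fci.csv", index=False)
display(hidden_comparison)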
The learned edge is an observed-data clue. It should trigger the question, “what upstream variables could cause both sides?” rather than the statement, “the left side directly causes the right side.”
A useful report makes the hidden-confounding concern visible. This cell writes a compact risk note that can be reused in the final report template.
hidden_risk_note = pd.DataFrame(
    [
        {
            "edge_or_pattern": "discovery_quality connected to retention_value",
            "observed_evidence": "Strong dependence remains in the observed dataset.",
            "main_risk": "The association can be generated by omitted latent_need.",
            "responsible_next_step": "Measure richer pre-exposure context or validate the edge with an intervention or quasi-experiment.",
        }
    ]
)
hidden_risk_note.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_hidden_confounding_risk_note.csv", index=False)
display(hidden_risk_note)
edge_or_pattern | observed_evidence | main_risk | responsible_next_step
--- | --- | --- | ---
discovery_quality connected to retention_value | Strong dependence remains in the observed dataset. | The association can be generated by omitted latent_need. | Measure richer pre-exposure context or validate the edge with an intervention or quasi-experiment.
The note is deliberately plain. Discovery work becomes more useful when every high-risk edge is paired with a validation plan.
Lab 2: Weak Tests And Small Samples
Constraint-based algorithms depend on conditional-independence tests. When effects are weak or samples are small, these tests can fail in both directions: they can miss true adjacencies, and they can keep noisy adjacencies. This lab simulates the same true graph across sample sizes and alpha values so we can see how recovery changes.
weak_specs = {
    "context": {"xy": (0.12, 0.55), "label": "context", "color": "#dbeafe"},
    "activation": {"xy": (0.42, 0.55), "label": "activation", "color": "#ecfccb"},
    "retention": {"xy": (0.72, 0.55), "label": "retention", "color": "#fee2e2"},
    "noise_proxy": {"xy": (0.42, 0.24), "label": "noise\nproxy", "color": "#f3f4f6"},
}

weak_truth_edges = make_edge_table(
    [
        ("context", "-->", "activation"),
        ("activation", "-->", "retention"),
        ("context", "-->", "retention"),
    ],
    run="weak_signal_truth",
)
weak_truth_skeleton = skeleton_pairs(weak_truth_edges)
weak_truth_directed = directed_pairs(weak_truth_edges)
weak_truth_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_weak_signal_truth_edges.csv", index=False)

draw_box_graph(
    weak_truth_edges,
    weak_specs,
    "Weak-Signal Teaching Graph",
    FIGURE_DIR / f"{NOTEBOOK_PREFIX}_weak_signal_truth.png",
    note="The noise proxy has no true causal edge, but weak tests may still connect it in small samples.",
)
The true graph has three real directed edges and one irrelevant proxy. Because the signal strengths below are modest, the algorithm needs enough data and careful tuning to recover the structure consistently.
This simulator creates weak but real dependencies. The noise proxy is correlated with random noise rather than the causal system, so a good discovery run should usually leave it isolated.
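A sketch of such a simulator, with deliberately modest coefficients; the exact values are illustrative assumptions.

def simulate_weak_system(seed, n_samples):
    # Small coefficients keep the true dependencies weak; values are assumptions.
    rng = np.random.default_rng(seed)
    context = rng.normal(size=n_samples)
    activation = 0.25 * context + rng.normal(size=n_samples)
    retention = 0.25 * activation + 0.20 * context + rng.normal(size=n_samples)
    noise_proxy = rng.normal(size=n_samples)  # unrelated to the causal system
    return standardize_frame(pd.DataFrame({
        "context": context,
        "activation": activation,
        "retention": retention,
        "noise_proxy": noise_proxy,
    }))

weak_df_example = simulate_weak_system(RANDOM_SEED, n_samples=400)
display(weak_df_example.head())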
The head of the data looks ordinary. That is part of the problem: weak-signal causal discovery issues rarely announce themselves in the first few rows. We need sensitivity checks.
The next cell runs PC over a small grid of sample sizes, alpha values, and random seeds. This is intentionally modest so the notebook stays quick. The table measures skeleton recall, skeleton precision, directed recall, and whether the irrelevant proxy was connected.
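A compact sketch of that sensitivity grid, reusing run_pc_edges, skeleton_pairs, and directed_pairs; the grid values and seed count are assumptions chosen to keep the run fast.

grid_rows = []
for n_samples in (150, 400, 1200):          # sample-size grid is an assumption
    for alpha in (0.01, 0.05, 0.10):        # alpha grid is an assumption
        for seed in range(5):               # five seeds per setting
            df = simulate_weak_system(RANDOM_SEED + seed, n_samples)
            edges, _ = run_pc_edges(df, alpha=alpha, label=f"n{n_samples}_a{alpha}_s{seed}")
            found_skeleton = skeleton_pairs(edges)
            found_directed = directed_pairs(edges)
            true_hits = len(found_skeleton & weak_truth_skeleton)
            grid_rows.append({
                "n_samples": n_samples,
                "alpha": alpha,
                "seed": seed,
                "skeleton_recall": true_hits / len(weak_truth_skeleton),
                "skeleton_precision": true_hits / max(len(found_skeleton), 1),
                "directed_recall": len(found_directed & weak_truth_directed) / len(weak_truth_directed),
                "proxy_connected": any("noise_proxy" in pair for pair in found_skeleton),
            })

weak_grid = pd.DataFrame(grid_rows)
weak_summary = (
    weak_grid.groupby(["n_samples", "alpha"], as_index=False)
    .mean(numeric_only=True)
    .drop(columns="seed")
)
weak_summary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_weak_signal_sensitivity.csv", index=False)
display(weak_summary.round(3))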
The table shows why a single discovery run is not enough. Smaller samples and more permissive alpha values can change both recovery and false-positive behavior. A report should show this sensitivity rather than hide it.
The next plot turns the grid into a quick visual diagnostic. We look at recall for real adjacencies and the rate at which the irrelevant proxy gets connected.
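A sketch of that diagnostic plot, drawn from the summary table in the previous sketch:

fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))
for alpha, group in weak_summary.groupby("alpha"):
    axes[0].plot(group["n_samples"], group["skeleton_recall"], marker="o", label=f"alpha={alpha}")
    axes[1].plot(group["n_samples"], group["proxy_connected"], marker="o", label=f"alpha={alpha}")
axes[0].set_title("Recall of true adjacencies")
axes[1].set_title("Rate of connecting the irrelevant proxy")
for ax in axes:
    ax.set_xlabel("sample size")
    ax.set_ylim(-0.05, 1.05)
    ax.legend()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_weak_signal_sensitivity.png", dpi=160, bbox_inches="tight")
plt.show()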
The plot makes the tradeoff visible. A setting that recovers more weak signal can also admit more noise. The responsible move is to report this tradeoff and avoid making edge-by-edge claims from one arbitrary alpha.
Lab 3: Markov Equivalence And Over-Orientation
A skeleton can be well supported while some directions remain ambiguous. The chain A -> B -> C and the fork A <- B -> C both imply that A and C become independent after conditioning on B. Conditional-independence evidence alone cannot always choose between these structures.
This lab computes marginal and conditional association for both structures and then shows the shared partially directed story.
The two DAGs disagree about the direction between A and B, but their conditional-independence signature is the same for this simple three-node pattern. The partially directed summary is the safer graph to report when direction is not identified.
This cell simulates both structures and estimates the marginal correlation between A and C, plus the partial correlation after removing linear dependence on B. In both cases, conditioning on B removes the A and C association.
def residual_after_control(y, x):
    design = np.column_stack([np.ones(len(x)), x])
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    return y - design @ coef


def partial_corr_ac_given_b(df):
    a_resid = residual_after_control(df["A"].to_numpy(), df["B"].to_numpy())
    c_resid = residual_after_control(df["C"].to_numpy(), df["B"].to_numpy())
    return np.corrcoef(a_resid, c_resid)[0, 1]


def simulate_chain(seed, n_samples=1500):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n_samples)
    b = 0.85 * a + rng.normal(scale=0.80, size=n_samples)
    c = 0.85 * b + rng.normal(scale=0.80, size=n_samples)
    return standardize_frame(pd.DataFrame({"A": a, "B": b, "C": c}))


def simulate_fork(seed, n_samples=1500):
    rng = np.random.default_rng(seed)
    b = rng.normal(size=n_samples)
    a = 0.85 * b + rng.normal(scale=0.80, size=n_samples)
    c = 0.85 * b + rng.normal(scale=0.80, size=n_samples)
    return standardize_frame(pd.DataFrame({"A": a, "B": b, "C": c}))


chain_df = simulate_chain(RANDOM_SEED)
fork_df = simulate_fork(RANDOM_SEED)
chain_df.to_csv(DATASET_DIR / f"{NOTEBOOK_PREFIX}_equivalence_chain_data.csv", index=False)
fork_df.to_csv(DATASET_DIR / f"{NOTEBOOK_PREFIX}_equivalence_fork_data.csv", index=False)

equiv_stats = pd.DataFrame(
    [
        {
            "scenario": "chain_A_to_B_to_C",
            "corr_A_C": chain_df[["A", "C"]].corr().iloc[0, 1],
            "partial_corr_A_C_given_B": partial_corr_ac_given_b(chain_df),
        },
        {
            "scenario": "fork_B_to_A_and_C",
            "corr_A_C": fork_df[["A", "C"]].corr().iloc[0, 1],
            "partial_corr_A_C_given_B": partial_corr_ac_given_b(fork_df),
        },
    ]
)
equiv_stats.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_markov_equivalence_stats.csv", index=False)
display(equiv_stats.round(4))
scenario | corr_A_C | partial_corr_A_C_given_B
--- | --- | ---
chain_A_to_B_to_C | 0.5570 | -0.0355
fork_B_to_A_and_C | 0.5489 | 0.0418
The conditional association is near zero in both scenarios. This is the mathematical reason over-orientation is dangerous: the data can support adjacency without uniquely supporting a direction.
Now we run PC on the two equivalent datasets. The exact edge marks can vary by implementation details, but the key lesson is that the evidence is about a shared equivalence class, not a guaranteed unique DAG.
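A sketch of those runs, reusing run_pc_edges on the chain and fork datasets:

chain_edges, _ = run_pc_edges(chain_df, label="PC_chain")
fork_edges, _ = run_pc_edges(fork_df, label="PC_fork")

equivalence_edges = pd.concat([chain_edges, fork_edges], ignore_index=True)
equivalence_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_markov_equivalence_pc_edges.csv", index=False)
display(equivalence_edges)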
The safe reporting habit is to separate adjacency evidence from direction evidence. If a direction is coming from timing, background knowledge, or an algorithm-specific assumption, say that explicitly.
Lab 4: Temporal Leakage
Temporal leakage happens when variables measured after the target event are included as if they were pre-event causes. In discovery workflows, this can make a graph look strong while quietly mixing causes, outcomes, and consequences.
The synthetic truth below includes post_event_activity, which is generated after retention_value. It should not be used to discover pre-retention causes.
leak_specs = {
    "user_context": {"xy": (0.10, 0.76), "label": "user\ncontext", "color": "#dbeafe"},
    "discovery_quality": {"xy": (0.38, 0.50), "label": "discovery\nquality", "color": "#ecfccb"},
    "retention_value": {"xy": (0.66, 0.76), "label": "retention\nvalue", "color": "#fee2e2"},
    "post_event_activity": {"xy": (0.90, 0.34), "label": "post-event\nactivity", "color": "#fde68a"},
}

leak_truth_edges = make_edge_table(
    [
        ("user_context", "-->", "discovery_quality"),
        ("user_context", "-->", "retention_value"),
        ("discovery_quality", "-->", "retention_value"),
        ("retention_value", "-->", "post_event_activity"),
    ],
    run="temporal_truth",
)
leak_truth_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_temporal_leakage_truth_edges.csv", index=False)

draw_box_graph(
    leak_truth_edges,
    leak_specs,
    "Temporal Ordering With A Post-Event Variable",
    FIGURE_DIR / f"{NOTEBOOK_PREFIX}_temporal_leakage_truth.png",
    note="The post-event variable is downstream of the target event and should be excluded from pre-event discovery.",
)
The graph gives a clear exclusion rule: post_event_activity is useful for describing what happened later, but it is not a valid pre-retention cause. A measurement-time table should catch this before any algorithm runs.
This cell simulates the temporal-leakage scenario and writes the measurement timing table. The timing table is often more valuable than another algorithm tweak because it prevents invalid variables from entering the problem.
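A sketch of the simulation and timing table; the coefficients and timing labels are illustrative assumptions consistent with the truth graph above.

def simulate_temporal_system(seed, n_samples=3000):
    # Generation order matches the truth graph; coefficient values are assumptions.
    rng = np.random.default_rng(seed)
    user_context = rng.normal(size=n_samples)
    discovery_quality = 0.7 * user_context + rng.normal(scale=0.7, size=n_samples)
    retention_value = 0.5 * user_context + 0.5 * discovery_quality + rng.normal(scale=0.7, size=n_samples)
    post_event_activity = 0.9 * retention_value + rng.normal(scale=0.5, size=n_samples)
    return standardize_frame(pd.DataFrame({
        "user_context": user_context,
        "discovery_quality": discovery_quality,
        "retention_value": retention_value,
        "post_event_activity": post_event_activity,
    }))

temporal_df = simulate_temporal_system(RANDOM_SEED)

measurement_times = pd.DataFrame([
    {"variable": "user_context", "measured": "before exposure", "allowed_in_pre_event_discovery": True},
    {"variable": "discovery_quality", "measured": "during exposure", "allowed_in_pre_event_discovery": True},
    {"variable": "retention_value", "measured": "target event", "allowed_in_pre_event_discovery": True},
    {"variable": "post_event_activity", "measured": "after target event", "allowed_in_pre_event_discovery": False},
])
measurement_times.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_measurement_time_table.csv", index=False)
display(measurement_times)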
The table turns a subtle modeling issue into an auditable rule. Any variable marked as measured after the target event should be excluded from discovery runs meant to explain pre-event causes.
Now we compare a clean run that excludes the post-event variable with a leaky run that includes it. The leaky run is not useful because it mixes a future consequence into the discovery target.
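A sketch of that comparison, dropping the post-event column in the clean run:

clean_columns = ["user_context", "discovery_quality", "retention_value"]
clean_edges, _ = run_pc_edges(temporal_df[clean_columns], label="PC_clean_pre_event")
leaky_edges, _ = run_pc_edges(temporal_df, label="PC_with_post_event_leak")

leak_comparison = pd.concat([clean_edges, leaky_edges], ignore_index=True)
leak_comparison.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_temporal_leakage_pc_comparison.csv", index=False)
display(leak_comparison)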
The run with the post-event variable may look richer, but that richness is misleading for causal discovery about pre-event drivers. More variables are not automatically better when their timestamps are wrong.
Lab 5: Selection Bias From Conditioning On A Collider
Selection bias appears when the analysis dataset includes only cases that passed through a selection mechanism. If selection is influenced by two variables, filtering on selection can make those variables dependent even when neither causes the other.
Here interest and difficulty both affect whether a case is selected. In the full population they are independent. In the selected sample, they become associated.
selection_specs = {
    "interest": {"xy": (0.16, 0.74), "label": "interest", "color": "#dbeafe"},
    "difficulty": {"xy": (0.16, 0.24), "label": "difficulty", "color": "#ede9fe"},
    "selected": {"xy": (0.50, 0.45), "label": "selected", "color": "#fde68a"},
    "retention_value": {"xy": (0.84, 0.74), "label": "retention\nvalue", "color": "#fee2e2"},
}

selection_truth_edges = make_edge_table(
    [
        ("interest", "-->", "selected"),
        ("difficulty", "-->", "selected"),
        ("interest", "-->", "retention_value"),
    ],
    run="selection_truth",
)
selection_truth_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_selection_bias_truth_edges.csv", index=False)

draw_box_graph(
    selection_truth_edges,
    selection_specs,
    "Selection As A Collider",
    FIGURE_DIR / f"{NOTEBOOK_PREFIX}_selection_bias_truth.png",
    note="Filtering to selected cases conditions on a collider and can create dependence among its causes.",
)
The warning sign is the collider: interest -> selected <- difficulty. Once we restrict the dataset to selected cases, interest and difficulty can become statistically linked even without a direct edge.
This cell simulates the full population and a selected subset. The selection rule keeps cases with high combined interest and difficulty, reproducing a common filtered-data pattern.
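A sketch of that simulation; the quantile cutoff is an assumption chosen so roughly the top third of combined scores is kept.

def simulate_selection(seed, n_samples=5000):
    # interest and difficulty are independent in the population; the selection
    # rule and the 0.68 quantile cut are illustrative assumptions.
    rng = np.random.default_rng(seed)
    interest = rng.normal(size=n_samples)
    difficulty = rng.normal(size=n_samples)
    retention_value = 0.8 * interest + rng.normal(scale=0.7, size=n_samples)
    score = interest + difficulty
    selected = score >= np.quantile(score, 0.68)
    full = pd.DataFrame({
        "interest": interest,
        "difficulty": difficulty,
        "retention_value": retention_value,
        "selected": selected,
    })
    selected_sample = (
        full.loc[full["selected"], ["interest", "difficulty", "retention_value"]]
        .reset_index(drop=True)
    )
    return full, selected_sample

selection_full_df, selection_selected_df = simulate_selection(RANDOM_SEED)
print("Full population shape:", selection_full_df.shape)
print("Selected sample shape:", selection_selected_df.shape)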
Full population shape: (5000, 4)
Selected sample shape: (1600, 3)
The selected sample is a biased slice of the full population. Any graph learned only from this slice describes that filtered dataset, not necessarily the population mechanism.
The next cell quantifies the induced dependence. It compares the correlation between interest and difficulty in the full population versus the selected sample.
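A sketch of that comparison:

selection_corrs = pd.DataFrame([
    {
        "dataset": "full_population",
        "corr_interest_difficulty": selection_full_df[["interest", "difficulty"]].corr().iloc[0, 1],
    },
    {
        "dataset": "selected_sample",
        "corr_interest_difficulty": selection_selected_df[["interest", "difficulty"]].corr().iloc[0, 1],
    },
])
selection_corrs.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_selection_bias_correlations.csv", index=False)
display(selection_corrs.round(4))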
A near-zero population association can turn into a strong selected-sample association. This is why reports must describe how rows entered the dataset, not just which columns were used.
Finally, we run PC on the selected sample. If the algorithm connects interest and difficulty, that edge is a selection artifact candidate rather than a clean causal mechanism.
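A sketch of that run, reusing run_pc_edges on the standardized selected sample:

selected_edges, _ = run_pc_edges(standardize_frame(selection_selected_df), label="PC_selected_sample")
selected_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_selection_bias_pc_edges.csv", index=False)
display(selected_edges)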
The practical lesson is to document inclusion rules. If a dataset is restricted to responders, active users, retained customers, completed sessions, or any selected subset, collider bias should be on the checklist.
Lab 6: Near-Duplicate Measurements
Discovery algorithms operate on columns, not concepts. If two columns are near-duplicates of the same construct, the graph may spend edges explaining measurement redundancy instead of the causal system.
This lab creates a true causal chain and then adds a duplicate proxy for the middle variable.
duplicate_specs = {
    "need": {"xy": (0.12, 0.58), "label": "need", "color": "#dbeafe"},
    "engagement": {"xy": (0.45, 0.58), "label": "engagement", "color": "#ecfccb"},
    "engagement_proxy": {"xy": (0.45, 0.28), "label": "engagement\nproxy", "color": "#f3f4f6"},
    "retention": {"xy": (0.78, 0.58), "label": "retention", "color": "#fee2e2"},
}

duplicate_truth_edges = make_edge_table(
    [
        ("need", "-->", "engagement"),
        ("engagement", "-->", "retention"),
        ("engagement", "-->", "engagement_proxy"),
    ],
    run="duplicate_measurement_truth",
)
duplicate_truth_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_duplicate_measurement_truth_edges.csv", index=False)

draw_box_graph(
    duplicate_truth_edges,
    duplicate_specs,
    "Near-Duplicate Measurement In A Discovery Dataset",
    FIGURE_DIR / f"{NOTEBOOK_PREFIX}_duplicate_measurement_truth.png",
    note="The proxy measures the same construct as engagement; it is not a separate business mechanism.",
)
The proxy is downstream of the same latent concept represented by engagement. Without careful feature review, the graph can make this measurement relationship look as important as the mechanism of interest.
This simulator creates a proxy that is almost a copy of engagement. The correlation audit will flag that relationship before we read the discovery graph.
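A sketch of the simulator and the correlation audit; the proxy noise scale and the 0.95 flag threshold are assumptions.

def simulate_duplicate_measurement(seed, n_samples=3000):
    # The proxy is engagement plus a little measurement noise; the scale is an assumption.
    rng = np.random.default_rng(seed)
    need = rng.normal(size=n_samples)
    engagement = 0.8 * need + rng.normal(scale=0.7, size=n_samples)
    engagement_proxy = engagement + rng.normal(scale=0.15, size=n_samples)
    retention = 0.8 * engagement + rng.normal(scale=0.7, size=n_samples)
    return standardize_frame(pd.DataFrame({
        "need": need,
        "engagement": engagement,
        "engagement_proxy": engagement_proxy,
        "retention": retention,
    }))

duplicate_df = simulate_duplicate_measurement(RANDOM_SEED)

# Flag variable pairs whose absolute correlation exceeds a near-duplicate threshold.
corr = duplicate_df.corr().abs()
near_duplicates = [
    {"var_a": a, "var_b": b, "abs_corr": round(corr.loc[a, b], 3)}
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.95
]
display(pd.DataFrame(near_duplicates))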
The near-duplicate flag is a warning to pause before discovery. Highly redundant variables may need to be collapsed, grouped, or documented as measurement proxies.
The next cell runs PC on the duplicate-measurement data. The purpose is not to blame the algorithm; it is doing what the columns ask it to do. The responsibility is on us to design variables that match the causal question.
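A sketch of that run:

duplicate_edges, _ = run_pc_edges(duplicate_df, label="PC_duplicate_measurement")
duplicate_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_duplicate_measurement_pc_edges.csv", index=False)
display(duplicate_edges)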
If a near-duplicate edge appears, it should usually be reported as a measurement issue rather than a substantive causal discovery. This is a good example of why feature engineering review belongs in causal discovery.
Reporting Language Ladder
The examples above all point to a language problem: a discovered graph often looks more certain than it is. This table gives a practical ladder of claims, from safest to strongest. Stronger language requires stronger validation.
language_ladder = pd.DataFrame(
    [
        {
            "claim_level": "association pattern",
            "example_language": "These variables are connected in the observed-data graph.",
            "minimum_support_needed": "Clean data audit and a transparent discovery run.",
        },
        {
            "claim_level": "stable candidate edge",
            "example_language": "This edge is a stable candidate across sensitivity checks.",
            "minimum_support_needed": "Bootstrap, alpha, sample-split, or method-comparison support.",
        },
        {
            "claim_level": "direction supported by assumptions",
            "example_language": "The direction is supported under the stated timing and algorithm assumptions.",
            "minimum_support_needed": "Documented measurement order, background knowledge, and algorithm assumptions.",
        },
        {
            "claim_level": "causal mechanism hypothesis",
            "example_language": "This path is a plausible mechanism to validate next.",
            "minimum_support_needed": "Stable graph evidence plus domain review and a validation plan.",
        },
        {
            "claim_level": "decision-ready causal claim",
            "example_language": "Changing this variable is expected to change the outcome by this amount.",
            "minimum_support_needed": "An experiment, quasi-experiment, or effect-estimation design beyond graph discovery alone.",
        },
    ]
)
language_ladder.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_reporting_language_ladder.csv", index=False)
display(language_ladder)
claim_level | example_language | minimum_support_needed
--- | --- | ---
association pattern | These variables are connected in the observed-data graph. | Clean data audit and a transparent discovery run.
stable candidate edge | This edge is a stable candidate across sensitivity checks. | Bootstrap, alpha, sample-split, or method-comparison support.
direction supported by assumptions | The direction is supported under the stated timing and algorithm assumptions. | Documented measurement order, background knowledge, and algorithm assumptions.
causal mechanism hypothesis | This path is a plausible mechanism to validate next. | Stable graph evidence plus domain review and a validation plan.
decision-ready causal claim | Changing this variable is expected to change the outcome by this amount. | An experiment, quasi-experiment, or effect-estimation design beyond graph discovery alone.
A useful discovery report usually lives in the middle of this ladder. It can identify stable candidate mechanisms and prioritize follow-up work without pretending the graph alone estimates intervention effects.
Reporting Checklist
The checklist below turns the notebook into a reusable review artifact. Before sharing a discovered graph, each row should have an answer. If an answer is missing, the report should say so directly.
reporting_checklist = pd.DataFrame(
    [
        {"category": "Question", "check": "Is the causal discovery target stated in plain language?", "why_it_matters": "Avoids drifting from exploration to overclaiming."},
        {"category": "Variables", "check": "Does every variable have a definition and measurement window?", "why_it_matters": "Prevents temporal leakage and mixed-granularity graphs."},
        {"category": "Rows", "check": "Is the sampling or selection rule documented?", "why_it_matters": "Makes selection bias visible."},
        {"category": "Assumptions", "check": "Are sufficiency, faithfulness, stationarity, and functional assumptions stated?", "why_it_matters": "Connects algorithms to the conditions they need."},
        {"category": "Algorithms", "check": "Are algorithm settings, tests, scores, and package versions saved?", "why_it_matters": "Makes the run reproducible."},
        {"category": "Stability", "check": "Were alpha, sample, seed, threshold, or method sensitivity checks run?", "why_it_matters": "Separates stable candidates from fragile edges."},
        {"category": "Hidden causes", "check": "Were plausible omitted common causes reviewed?", "why_it_matters": "Protects against reading confounded links as direct effects."},
        {"category": "Graph reading", "check": "Are adjacencies, directions, and equivalence-class limits separated?", "why_it_matters": "Reduces over-orientation."},
        {"category": "Validation", "check": "Is there a proposed experiment, quasi-experiment, or effect-estimation follow-up?", "why_it_matters": "Moves from graph hypotheses to causal evidence."},
        {"category": "Communication", "check": "Does the report use cautious language for candidate edges?", "why_it_matters": "Keeps the audience from treating discovery as proof."},
    ]
)
reporting_checklist.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_reporting_checklist.csv", index=False)
display(reporting_checklist)
category | check | why_it_matters
--- | --- | ---
Question | Is the causal discovery target stated in plain language? | Avoids drifting from exploration to overclaiming.
Variables | Does every variable have a definition and measurement window? | Prevents temporal leakage and mixed-granularity graphs.
Rows | Is the sampling or selection rule documented? | Makes selection bias visible.
Assumptions | Are sufficiency, faithfulness, stationarity, and functional assumptions stated? | Connects algorithms to the conditions they need.
Algorithms | Are algorithm settings, tests, scores, and package versions saved? | Makes the run reproducible.
Stability | Were alpha, sample, seed, threshold, or method sensitivity checks run? | Separates stable candidates from fragile edges.
Hidden causes | Were plausible omitted common causes reviewed? | Protects against reading confounded links as direct effects.
Graph reading | Are adjacencies, directions, and equivalence-class limits separated? | Reduces over-orientation.
Validation | Is there a proposed experiment, quasi-experiment, or effect-estimation follow-up? | Moves from graph hypotheses to causal evidence.
Communication | Does the report use cautious language for candidate edges? | Keeps the audience from treating discovery as proof.
The checklist is intentionally operational. It is meant to be used before a graph appears in a slide, report, notebook summary, or handoff document.
A Reusable Report Template
This final code cell writes a markdown template. The template gives students and practitioners a starting point for communicating discovery results without overstating them.
report_template ="""# Causal Discovery Report Template## 1. Discovery QuestionState the graph question in one or two sentences. Name the decision, mechanism, or scientific question the graph is meant to inform.## 2. Data Scope- Row unit:- Time period:- Inclusion and exclusion rules:- Known selection mechanisms:- Missingness handling:## 3. Variable DictionaryFor each variable, include definition, measurement window, allowed causal timing tier, and whether it is an input, mechanism, outcome, or post-outcome measure.## 4. AssumptionsState which assumptions are required by the chosen methods. Include causal sufficiency, faithfulness, stationarity, linearity, non-Gaussianity, or functional-form assumptions where relevant.## 5. Algorithms And SettingsList algorithms, independence tests or scores, alpha values, thresholds, background knowledge, package versions, and random seeds.## 6. Main Candidate GraphShow the graph and separate:- stable adjacencies;- directions supported by timing or method assumptions;- ambiguous or equivalence-class edges;- edges with plausible hidden-confounding risk.## 7. Sensitivity ChecksSummarize sample-size, alpha, threshold, seed, bootstrap, method-comparison, and hidden-confounding checks.## 8. Edge Review TableFor each important edge, report support level, caveats, plausible omitted causes, and recommended validation.## 9. What The Graph Does Not ProveState limits plainly. Graph discovery alone usually does not estimate intervention effects or prove that changing a variable will change the outcome.## 10. Next Validation StepRecommend the next experiment, quasi-experiment, effect-estimation design, or data collection step."""report_path = REPORT_DIR /f"{NOTEBOOK_PREFIX}_causal_discovery_report_template.md"report_path.write_text(report_template)print(report_path)
The template is the bridge from notebook work to communication. It nudges the report toward assumptions, stability, and validation rather than a single attractive graph.
Artifact Manifest
The last table lists the files created by this notebook. Saving artifacts makes it easier to inspect the analysis outside the notebook and compare future reruns.
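A sketch of a manifest cell that scans the output folders defined in setup for files carrying this notebook's prefix:

manifest_rows = []
for folder in (FIGURE_DIR, TABLE_DIR, DATASET_DIR, REPORT_DIR):
    for path in sorted(folder.glob(f"{NOTEBOOK_PREFIX}_*")):
        manifest_rows.append({
            "artifact_type": folder.name,
            "file": path.name,
            "size_kb": round(path.stat().st_size / 1024, 1),
        })

artifact_manifest = pd.DataFrame(manifest_rows)
artifact_manifest.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_artifact_manifest.csv", index=False)
display(artifact_manifest)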
The artifact list is also a useful final audit. If a report references a figure or table, it should appear here with a stable filename.
Closing Notes
The core lesson of this tutorial series is that causal discovery is a disciplined workflow, not just an algorithm call. The library can estimate candidate graph structure, but the analyst is responsible for defining variables, respecting time, checking assumptions, stress-testing output, and communicating uncertainty.
A strong discovery notebook leaves the reader with a graph, a list of caveats, a stability summary, and a validation plan. That combination is much more credible than a graph alone.