causal-learn Tutorial 08: Score-Based Discovery With GES

The earlier causal-learn notebooks focused mostly on constraint-based discovery: PC, FCI, and CD-NOD search for graphs by asking many conditional independence questions. This notebook switches to a different family: score-based discovery.

Score-based discovery assigns each candidate graph a numerical score and searches for a graph that improves that score. The central algorithm here is GES, or Greedy Equivalence Search. GES does not test one conditional independence statement at a time. Instead, it searches over graph structures, adding and deleting edges when those operations improve a decomposable score such as BIC or BDeu.

The practical goals are to run GES with BIC and BDeu scores on matched and mismatched datasets, compare its output with a PC baseline on the same data, and examine how penalty strength and sample size shape the learned graph.

Runtime Note

This notebook is designed to execute quickly by default, usually in about 1-3 minutes on the current dataset and environment. It runs BIC-style GES and a small BDeu example.

causal-learn also exposes kernel/RKHS-style scores such as local_score_CV_general and local_score_marginal_general. Those can be useful for nonlinear settings, but on this 2,500-row teaching dataset they can take 5-15+ minutes depending on the machine and may emit numerical warnings. For a tutorial notebook, that is too much waiting for the main path, so those slower score families are discussed as optional extensions rather than executed by default.

Notebook Flow

We will work through score-based discovery in this order:

  1. Set up imports, paths, GES, PC, and reusable graph helpers.
  2. Load the synthetic datasets created earlier in the tutorial series.
  3. Explain GES, BIC, BDeu, and the forward/backward greedy search idea.
  4. Run BIC-GES on the linear Gaussian dataset.
  5. Compare GES with PC on the same data.
  6. Inspect GES update history, penalty sensitivity, and sample-size sensitivity.
  7. Stress-test BIC-GES on non-Gaussian and nonlinear data.
  8. Run BDeu-GES on the discrete teaching dataset.
  9. Save a reporting checklist and artifact manifest.

As in the previous notebooks, every code cell has explanation before it and a short discussion after it.

GES Theory

GES, or Greedy Equivalence Search, is a score-based causal discovery algorithm. Instead of asking many conditional independence questions directly, it searches for a graph that optimizes a numerical score.

The score typically rewards goodness of fit and penalizes complexity. For continuous linear Gaussian data, BIC-style scores are common. A graph with more parents can fit the data better, but the penalty discourages adding edges that do not improve the model enough.

GES is greedy: it follows local score-improving moves. This makes it much more scalable than exact search, but it also means the path through graph space matters.

Decomposable Scores

GES works efficiently because many graph scores are decomposable. A decomposable score can be written as a sum of local node scores:

\[ S(G) = \sum_i S\big(X_i, \mathrm{Pa}_G(X_i)\big) \]

This means the contribution for X_i depends only on X_i's own data and its parent set. If a proposed move changes only the parents of one node, the algorithm can update only that local score rather than refitting the entire graph from scratch.

Decomposability is one reason score-based discovery can be practical. It is also why parent-set assumptions and score choice matter so much.
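To make decomposability concrete, here is a minimal sketch (toy data and a simplified Gaussian local score, not causal-learn's internals): the graph score is literally a sum of per-node local scores, so a move that changes one node's parents only requires rescoring that node.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:, 2] += 0.8 * X[:, 0]   # column 2 truly depends on column 0

def local_bic(data, i, parents):
    """Simplified Gaussian local BIC-style score for node i given its parents."""
    n = data.shape[0]
    y = data[:, i]
    if parents:
        Z = data[:, parents]
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        y = y - Z @ beta
    sigma2 = max(y.var(), 1e-12)
    return -0.5 * n * (1 + np.log(sigma2)) - 0.5 * (len(parents) + 1) * np.log(n)

def graph_score(data, parent_sets):
    """Decomposable score: the graph score is a sum of local node scores."""
    return sum(local_bic(data, i, pa) for i, pa in parent_sets.items())

empty = {0: [], 1: [], 2: []}
with_edge = {0: [], 1: [], 2: [0]}   # adds 0 -> 2; only node 2's parents change
delta_local = local_bic(X, 2, [0]) - local_bic(X, 2, [])
delta_graph = graph_score(X, with_edge) - graph_score(X, empty)
print(np.isclose(delta_local, delta_graph))   # the move's gain is purely local
```

Because the other nodes' local scores cancel exactly, evaluating a single-edge move costs one local regression, not a full refit.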

BIC And Complexity Penalties

BIC balances model fit against model complexity. In causal discovery, adding an edge usually adds a parameter because the child variable gets another parent. More parameters can improve fit, but they can also overfit sample noise.

A BIC-style score therefore asks: does the improvement in likelihood justify the extra complexity? If yes, the edge may be added. If no, the penalty wins and the simpler graph is preferred.

The penalty strength is not just a technicality. Stronger penalties produce sparser graphs; weaker penalties allow denser graphs. This is why the GES notebook includes penalty sensitivity.
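A toy illustration of penalty strength (hypothetical numbers and a simplified score, not the notebook's implementation): the same weak edge can survive a mild penalty and be rejected by a strong one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = 0.12 * x + rng.normal(size=n)   # weak but real effect

def local_score(target, parent, lam):
    """Toy Gaussian local score with a tunable complexity penalty lam."""
    m = len(target)
    if parent is None:
        resid = target
        n_params = 1
    else:
        beta = parent @ target / (parent @ parent)   # no-intercept least squares
        resid = target - beta * parent
        n_params = 2
    sigma2 = max(resid.var(), 1e-12)
    return -0.5 * m * (1 + np.log(sigma2)) - lam * n_params * np.log(m)

for lam in (0.5, 5.0):
    gain = local_score(y, x, lam) - local_score(y, None, lam)
    verdict = "keeps" if gain > 0 else "rejects"
    print(f"lambda={lam}: the search {verdict} the weak edge (gain {gain:+.1f})")
```

The likelihood improvement from the weak parent is fixed by the data; whether it outweighs the penalty depends entirely on the chosen lambda, which is exactly what the penalty-sensitivity section probes.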

Equivalence Classes Rather Than Single DAGs

GES searches over Markov equivalence classes rather than treating every DAG orientation as uniquely identifiable. Multiple DAGs can imply the same conditional independence structure and receive equivalent scores under common assumptions.

The output is often represented as a CPDAG-like graph. Directed edges are compelled across the equivalence class; undirected edges are not uniquely oriented by the score and assumptions.

This is conceptually similar to PC’s CPDAG output. Both methods can leave edges unoriented when observational information does not compel a direction.
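A small demonstration of score equivalence (a toy sketch, not causal-learn code): for a dependent Gaussian pair, the DAGs X -> Y and Y -> X decompose the same joint likelihood with the same parameter count, so a Gaussian BIC-style score cannot distinguish them.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(size=n)
data = np.column_stack([x, y])
data -= data.mean(axis=0)   # center so no-intercept regressions are exact

def local_bic(data, i, parents):
    """Simplified Gaussian local BIC-style score for node i given its parents."""
    n = data.shape[0]
    target = data[:, i]
    if parents:
        Z = data[:, parents]
        beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
        target = target - Z @ beta
    sigma2 = max(target.var(), 1e-12)
    return -0.5 * n * (1 + np.log(sigma2)) - 0.5 * (len(parents) + 1) * np.log(n)

score_x_to_y = local_bic(data, 0, []) + local_bic(data, 1, [0])   # X -> Y
score_y_to_x = local_bic(data, 1, []) + local_bic(data, 0, [1])   # Y -> X
print(np.isclose(score_x_to_y, score_y_to_x))   # same equivalence class, same score
```

This is why the output is a CPDAG-like object: the score alone gives GES no reason to prefer one orientation of this adjacency over the other.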

Forward And Backward Phases

GES usually has two main phases.

The forward phase starts from an empty graph and greedily adds edges or edge patterns that improve the score. This phase increases model complexity when the fit improvement is worthwhile.

The backward phase starts from the forward result and greedily deletes edges when removal improves the score. This phase corrects some over-addition from the forward phase and searches for a better sparse explanation.

Some implementations also include turning or refinement steps. The key idea remains the same: local moves are accepted when they improve the score.
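The two phases can be sketched as a generic greedy loop. This is a deliberately simplified single-edge version over DAGs (real GES applies insert/delete operators on equivalence classes, and the local score here is a toy Gaussian BIC, not causal-learn's):

```python
import numpy as np
from itertools import permutations

def local_bic(data, i, parents):
    """Toy Gaussian local score: penalized log-likelihood of node i given its parents."""
    n = data.shape[0]
    y = data[:, i]
    if parents:
        Z = data[:, parents]
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        y = y - Z @ beta
    sigma2 = max(y.var(), 1e-12)
    return -0.5 * n * (1 + np.log(sigma2)) - 0.5 * (len(parents) + 1) * np.log(n)

def greedy_edge_search(data):
    """Forward phase adds the best edge; backward phase deletes the best edge."""
    p = data.shape[1]
    parents = {i: set() for i in range(p)}

    def creates_cycle(src, dst):
        # Adding src -> dst creates a cycle iff dst is already an ancestor of src.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents[node])
        return False

    def best_move(candidates):
        # candidates: (dst, proposed_parent_set); keep the largest positive gain.
        best = None
        for dst, new_pa in candidates:
            gain = local_bic(data, dst, sorted(new_pa)) - local_bic(data, dst, sorted(parents[dst]))
            if gain > 1e-9 and (best is None or gain > best[0]):
                best = (gain, dst, new_pa)
        return best

    # Forward phase: repeatedly accept the single best score-improving insertion.
    while True:
        move = best_move(
            (d, parents[d] | {s})
            for s, d in permutations(range(p), 2)
            if s not in parents[d] and not creates_cycle(s, d)
        )
        if move is None:
            break
        parents[move[1]] = move[2]

    # Backward phase: repeatedly accept the single best score-improving deletion.
    while True:
        move = best_move((d, parents[d] - {s}) for d in range(p) for s in list(parents[d]))
        if move is None:
            break
        parents[move[1]] = move[2]
    return parents

# Demo on a three-variable chain x -> y -> z; the chain skeleton should come back.
rng = np.random.default_rng(3)
x = rng.normal(size=800)
y = 0.9 * x + rng.normal(size=800)
z = 0.9 * y + rng.normal(size=800)
result = greedy_edge_search(np.column_stack([x, y, z]))
skeleton = {frozenset((s, d)) for d, pa in result.items() for s in pa}
print(skeleton == {frozenset((0, 1)), frozenset((1, 2))})
```

Because the chain's two orientations score identically, which direction the toy search commits to depends on tie-breaking, but the recovered adjacencies are stable. That mirrors the equivalence-class discussion above.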

Score Assumptions And Model Mismatch

A score is not neutral. A linear Gaussian BIC score is most appropriate when the data are reasonably modeled by linear relationships with Gaussian residuals. If the true mechanisms are nonlinear, discrete, heavy-tailed, or heteroskedastic, the score may prefer a graph that is statistically convenient rather than causally correct.

This is why a score-based workflow should include mechanism checks and alternative scores when appropriate. For discrete data, a BDeu-style score may be more natural than a continuous BIC score.

The graph should be interpreted as the best explanation under the chosen score, penalty, and search procedure, not as a score-free causal truth.
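A one-line illustration of mismatch (a toy example, not the notebook's dataset): a purely quadratic mechanism produces near-zero correlation, so a linear Gaussian score sees almost no reason to add the edge even though the dependence is strong.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=5000)
y = x ** 2 + 0.1 * rng.normal(size=5000)   # strong but nonlinear dependence

# Linear signal is nearly absent, so a linear Gaussian score gains little
# from the true edge x -> y; the squared feature exposes the dependence.
print(abs(np.corrcoef(x, y)[0, 1]) < 0.1)
print(abs(np.corrcoef(x ** 2, y)[0, 1]) > 0.9)
```

This is the kind of failure the nonlinear stress-test section later makes visible on the actual teaching data.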

What GES Can And Cannot Claim

GES can efficiently search large graph spaces and often performs well when its score assumptions match the data. It is useful as a complement to PC because it approaches discovery through model fit rather than direct independence testing.

GES cannot guarantee a global optimum. It is greedy, so local decisions can affect the final graph. It also cannot identify directions that are not identifiable under the observational equivalence class without additional assumptions.

A good GES report documents the score, penalty, data type, search settings, sample-size sensitivity, and any comparison to constraint-based or exact-search results.

Setup

This setup cell imports the scientific stack, GES, PC, and plotting utilities. It also applies a small local compatibility wrapper for causal-learn’s BIC score in this Python/NumPy environment. The wrapper fixes scalar extraction from a one-element matrix and does not edit the installed package on disk.

from pathlib import Path
from importlib.metadata import PackageNotFoundError, version
import contextlib
import io
import os
import warnings

# Keep matplotlib cache writes inside the repository so notebook execution works in restricted environments.
os.environ.setdefault("MPLCONFIGDIR", str(Path.cwd() / ".matplotlib_cache"))
warnings.filterwarnings("ignore", message="IProgress not found.*")
warnings.filterwarnings("ignore", message=".*pkg_resources is deprecated.*")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from matplotlib.patches import FancyArrowPatch, FancyBboxPatch

import causallearn.search.ScoreBased.GES as ges_module
from causallearn.search.ScoreBased.GES import ges
from causallearn.search.ConstraintBased.PC import pc

# Resolve paths whether the notebook is run from the repository root or from this notebook folder.
CWD = Path.cwd()
if CWD.name == "causal_learn" and (CWD / "outputs").exists():
    NOTEBOOK_DIR = CWD
else:
    NOTEBOOK_DIR = (CWD / "notebooks" / "tutorials" / "causal_learn").resolve()

OUTPUT_DIR = NOTEBOOK_DIR / "outputs"
DATASET_DIR = OUTPUT_DIR / "datasets"
TABLE_DIR = OUTPUT_DIR / "tables"
FIGURE_DIR = OUTPUT_DIR / "figures"
for directory in [OUTPUT_DIR, DATASET_DIR, TABLE_DIR, FIGURE_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

NOTEBOOK_PREFIX = "08"
sns.set_theme(style="whitegrid", context="notebook")
plt.rcParams["figure.dpi"] = 120
plt.rcParams["savefig.facecolor"] = "white"


def local_score_BIC_from_cov(Data, i, PAi, parameters=None):
    """Compatibility wrapper for causal-learn BIC score under recent NumPy scalar behavior."""
    cov, n = Data
    lambda_value = 0.5 if parameters is None else parameters.get("lambda_value", 0.5)
    sigma = float(cov[i, i])
    if len(PAi) > 0:
        yX = cov[np.ix_([i], PAi)]
        XX = cov[np.ix_(PAi, PAi)]
        try:
            XX_inv = np.linalg.inv(XX)
        except np.linalg.LinAlgError:
            XX_inv = np.linalg.pinv(XX)
        sigma = float(cov[i, i] - (yX @ XX_inv @ yX.T).item())
    sigma = max(sigma, 1e-12)
    return -0.5 * n * (1 + np.log(sigma)) - lambda_value * (len(PAi) + 1) * np.log(n)

# Patch only the imported module object used during this notebook session.
ges_module.local_score_BIC_from_cov = local_score_BIC_from_cov

packages = ["causal-learn", "numpy", "pandas", "matplotlib", "seaborn"]
version_rows = []
for package in packages:
    try:
        package_version = version(package)
    except PackageNotFoundError:
        package_version = "not installed"
    version_rows.append({"package": package, "version": package_version})

package_versions = pd.DataFrame(version_rows)
package_versions.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_package_versions.csv", index=False)
display(package_versions)
package version
0 causal-learn 0.1.4.5
1 numpy 2.4.4
2 pandas 3.0.2
3 matplotlib 3.10.9
4 seaborn 0.13.2

The version table and compatibility wrapper make the notebook reproducible in this environment. The wrapper preserves the same BIC formula; it only converts a one-element NumPy result into a Python scalar before scoring.
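The scalar-extraction step the wrapper handles can be seen in a standalone sketch. This uses a hand-built covariance matrix and sample size to mirror the (cov, n) input the local score receives; the numbers here are illustrative only.

```python
import numpy as np

# Toy covariance for three variables plus a sample size, mimicking (cov, n).
cov = np.array([[1.0, 0.6, 0.3],
                [0.6, 1.0, 0.5],
                [0.3, 0.5, 1.0]])
n = 1000

i, PAi = 2, [0, 1]
yX = cov[np.ix_([i], PAi)]                      # 1 x k row of cross-covariances
XX_inv = np.linalg.inv(cov[np.ix_(PAi, PAi)])   # k x k parent covariance inverse
quad = yX @ XX_inv @ yX.T                       # a 1 x 1 matrix, not a scalar
residual_variance = float(cov[i, i] - quad.item())   # .item() extracts the scalar
score = -0.5 * n * (1 + np.log(residual_variance)) - 0.5 * (len(PAi) + 1) * np.log(n)
print(isinstance(score, float))   # the local score comes out as a plain Python float
```

Calling `.item()` on the one-element matrix is what keeps this robust: converting a size-1 multidimensional array with `float()` directly is deprecated in recent NumPy and may raise, which is the behavior the wrapper works around.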

Load Teaching Datasets

We use four datasets from the synthetic data factory:

  • 02_linear_gaussian.csv: the best match for BIC-GES because the score assumes linear Gaussian local regressions;
  • 02_linear_nongaussian.csv: same broad graph family but with non-Gaussian noise;
  • 02_nonlinear_continuous.csv: a deliberate mismatch for linear Gaussian BIC;
  • 02_discrete_mixed.csv: a small discrete example for BDeu-GES.

The true edge tables are loaded only for tutorial evaluation. Real causal discovery work would not know the true graph.

# Load datasets and true edge lists created by tutorial notebook 02.
dataset_paths = {
    "linear_gaussian": DATASET_DIR / "02_linear_gaussian.csv",
    "linear_nongaussian": DATASET_DIR / "02_linear_nongaussian.csv",
    "nonlinear_continuous": DATASET_DIR / "02_nonlinear_continuous.csv",
    "discrete_mixed": DATASET_DIR / "02_discrete_mixed.csv",
}
true_edge_path = TABLE_DIR / "02_base_true_dag_edges.csv"
discrete_true_edge_path = TABLE_DIR / "02_discrete_mixed_true_edges.csv"

required_paths = list(dataset_paths.values()) + [true_edge_path, discrete_true_edge_path]
missing_paths = [str(path) for path in required_paths if not path.exists()]
if missing_paths:
    raise FileNotFoundError("Run tutorial notebook 02 first. Missing files: " + ", ".join(missing_paths))

datasets = {name: pd.read_csv(path) for name, path in dataset_paths.items()}
true_edges = pd.read_csv(true_edge_path)
discrete_true_edges = pd.read_csv(discrete_true_edge_path)
VARIABLES = list(datasets["linear_gaussian"].columns)

loaded_summary = pd.DataFrame(
    [
        {
            "dataset": name,
            "rows": len(df),
            "columns": df.shape[1],
            "missing_cells": int(df.isna().sum().sum()),
            "source_file": dataset_paths[name].name,
        }
        for name, df in datasets.items()
    ]
)
loaded_summary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_loaded_dataset_summary.csv", index=False)

display(loaded_summary)
display(datasets["linear_gaussian"].head())
display(true_edges)
dataset rows columns missing_cells source_file
0 linear_gaussian 2500 6 0 02_linear_gaussian.csv
1 linear_nongaussian 2500 6 0 02_linear_nongaussian.csv
2 nonlinear_continuous 2500 6 0 02_nonlinear_continuous.csv
3 discrete_mixed 2500 6 0 02_discrete_mixed.csv
need intent match engagement renewal support
0 0.249820 -0.372094 0.060245 0.667197 0.252766 -0.280058
1 0.683671 -0.210471 0.904969 1.004727 0.320095 -0.332215
2 -0.579752 -1.202671 -0.578579 -0.235444 -0.732431 0.594102
3 -0.902823 -0.077309 -0.771219 -0.531128 -0.105721 -1.503551
4 -1.985745 0.087297 -0.691315 -1.281731 -0.797906 -0.328219
source target edge_type mechanism
0 need match directed Need changes what a good match means.
1 intent match directed Current intent changes recommendation relevance.
2 match engagement directed Better matching increases engagement depth.
3 intent renewal directed Intent directly affects later value.
4 engagement renewal directed Engagement contributes to renewal value.
5 engagement support directed Engagement creates more chances for support co...

All datasets use the same six observed variables, which makes graph comparisons straightforward. The first run will use linear_gaussian because BIC-GES is best aligned with that data-generating setup.

GES Concept Map

GES searches over equivalence classes of DAGs rather than testing individual conditional independence statements. It usually has two major stages: a forward stage that greedily inserts edges to improve the score, and a backward stage that deletes edges if doing so improves the score.

The output should be read as a candidate equivalence-class graph. Some directions may be compelled, while other adjacencies may remain unoriented because several DAGs can have the same score under the assumptions.

# Summarize the key score-based discovery concepts used in this notebook.
ges_concepts = pd.DataFrame(
    [
        {
            "concept": "score-based discovery",
            "plain_language": "Search for a graph that optimizes a numerical fit-plus-penalty score.",
            "why_it_matters": "The result depends on score assumptions and search choices, not only independence tests.",
        },
        {
            "concept": "GES forward phase",
            "plain_language": "Greedily insert edges or operators that improve the graph score.",
            "why_it_matters": "Early additions can shape the graph neighborhood explored later.",
        },
        {
            "concept": "GES backward phase",
            "plain_language": "Delete edges or operators when deletion improves the score.",
            "why_it_matters": "The backward phase can simplify an overly dense forward graph.",
        },
        {
            "concept": "BIC score",
            "plain_language": "Fit a linear Gaussian local model and penalize parent-set complexity.",
            "why_it_matters": "Good match for linear Gaussian data; weaker match for nonlinear mechanisms.",
        },
        {
            "concept": "BDeu score",
            "plain_language": "Score discrete Bayesian-network structures with Dirichlet priors.",
            "why_it_matters": "Useful for categorical data when states and priors are documented.",
        },
    ]
)
ges_concepts.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_ges_concept_map.csv", index=False)
display(ges_concepts)
concept plain_language why_it_matters
0 score-based discovery Search for a graph that optimizes a numerical ... The result depends on score assumptions and se...
1 GES forward phase Greedily insert edges or operators that improv... Early additions can shape the graph neighborho...
2 GES backward phase Delete edges or operators when deletion improv... The backward phase can simplify an overly dens...
3 BIC score Fit a linear Gaussian local model and penalize... Good match for linear Gaussian data; weaker ma...
4 BDeu score Score discrete Bayesian-network structures wit... Useful for categorical data when states and pr...

The concept map sets up the rest of the notebook. The key shift from PC is that we are now asking whether a graph scores well under a modeling assumption, not whether a specific conditional independence test rejects or fails to reject.

Helper Functions

This helper cell defines graph conversion, metric, update-history, and plotting utilities. The graph helpers are similar to earlier notebooks so the visual language stays consistent across the tutorial series.

def parse_causallearn_edge(edge):
    """Convert a causal-learn edge object into source, endpoint pattern, and target strings."""
    parts = str(edge).strip().split()
    if len(parts) != 3:
        return {"source": str(edge), "edge_type": "unknown", "target": "unknown"}
    return {"source": parts[0], "edge_type": parts[1], "target": parts[2]}


def graph_to_edge_table(graph, label):
    """Return a tidy edge table from a causal-learn graph object."""
    rows = [parse_causallearn_edge(edge) for edge in graph.get_graph_edges()]
    edge_df = pd.DataFrame(rows, columns=["source", "edge_type", "target"])
    if edge_df.empty:
        edge_df = pd.DataFrame(columns=["source", "edge_type", "target"])
    edge_df.insert(0, "run", label)
    return edge_df


def directed_pairs(edge_df):
    """Extract definite directed pairs from an edge table."""
    pairs = set()
    for row in edge_df.itertuples(index=False):
        if row.edge_type == "-->":
            pairs.add((row.source, row.target))
        elif row.edge_type == "<--":
            pairs.add((row.target, row.source))
    return pairs


def skeleton_pairs(edge_df):
    """Extract learned adjacencies while ignoring endpoint direction."""
    pairs = set()
    for row in edge_df.itertuples(index=False):
        if row.target != "unknown":
            pairs.add(frozenset([row.source, row.target]))
    return pairs


def summarize_against_truth(edge_df, truth_df, label):
    """Compute compact graph-recovery metrics against the synthetic truth table."""
    true_directed = set(zip(truth_df["source"], truth_df["target"]))
    true_skeleton = {frozenset(edge) for edge in true_directed}
    learned_directed = directed_pairs(edge_df)
    learned_skeleton = skeleton_pairs(edge_df)

    correct_directed = learned_directed & true_directed
    reversed_true = {(src, dst) for src, dst in true_directed if (dst, src) in learned_directed}
    missing_skeleton = true_skeleton - learned_skeleton
    extra_skeleton = learned_skeleton - true_skeleton

    unresolved_true = 0
    for src, dst in true_directed:
        pair = frozenset([src, dst])
        if pair in learned_skeleton and (src, dst) not in learned_directed and (dst, src) not in learned_directed:
            unresolved_true += 1

    directed_count = len(learned_directed)
    return pd.DataFrame(
        [
            {
                "run": label,
                "learned_edges_total": len(edge_df),
                "definite_directed_edges": directed_count,
                "true_edges": len(true_directed),
                "correct_directed_edges": len(correct_directed),
                "directed_precision": len(correct_directed) / directed_count if directed_count else np.nan,
                "directed_recall": len(correct_directed) / len(true_directed) if true_directed else np.nan,
                "reversed_true_edges": len(reversed_true),
                "unresolved_true_adjacencies": unresolved_true,
                "missing_true_adjacencies": len(missing_skeleton),
                "extra_adjacencies": len(extra_skeleton),
            }
        ]
    )


def classify_edges(edge_df, truth_df):
    """Label learned edges relative to the synthetic truth table."""
    true_directed = set(zip(truth_df["source"], truth_df["target"]))
    true_skeleton = {frozenset(edge) for edge in true_directed}
    rows = []
    for row in edge_df.itertuples(index=False):
        pair = frozenset([row.source, row.target])
        learned_direction = None
        if row.edge_type == "-->":
            learned_direction = (row.source, row.target)
        elif row.edge_type == "<--":
            learned_direction = (row.target, row.source)

        if learned_direction in true_directed:
            status = "correct directed edge"
        elif learned_direction and (learned_direction[1], learned_direction[0]) in true_directed:
            status = "reversed true edge"
        elif pair in true_skeleton:
            status = "true adjacency with uncertain or wrong endpoint"
        else:
            status = "extra adjacency"
        rows.append({"source": row.source, "edge_type": row.edge_type, "target": row.target, "status": status})
    return pd.DataFrame(rows)


def run_ges_bic(data_df, label, lambda_value=0.5, maxP=None):
    """Run BIC-GES quietly and return the record plus tidy edge table."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        record = ges(
            data_df.to_numpy(),
            score_func="local_score_BIC",
            node_names=list(data_df.columns),
            lambda_value=lambda_value,
            maxP=maxP,
        )
    edge_table = graph_to_edge_table(record["G"], label=label)
    messages = pd.DataFrame({"run": label, "message": [line for line in buffer.getvalue().splitlines() if line.strip()]})
    return record, edge_table, messages


def run_pc_baseline(data_df, label):
    """Run Fisher-Z PC as a constraint-based baseline."""
    result = pc(
        data_df.to_numpy(),
        alpha=0.05,
        indep_test="fisherz",
        stable=True,
        show_progress=False,
        node_names=list(data_df.columns),
    )
    return graph_to_edge_table(result.G, label=label)


def operation_table(record, variable_names):
    """Summarize GES forward and backward updates using variable names."""
    rows = []
    for phase, updates in [("forward_insert", record.get("update1", [])), ("backward_delete", record.get("update2", []))]:
        for step, update in enumerate(updates, start=1):
            source_idx, target_idx, conditioning_set = update
            rows.append(
                {
                    "phase": phase,
                    "step": step,
                    "source_index": int(source_idx),
                    "target_index": int(target_idx),
                    "source_variable": variable_names[int(source_idx)],
                    "target_variable": variable_names[int(target_idx)],
                    "conditioning_set": [variable_names[int(idx)] for idx in conditioning_set],
                }
            )
    return pd.DataFrame(rows)


GRAPH_POSITIONS = {
    "need": (0.11, 0.72),
    "intent": (0.11, 0.28),
    "match": (0.39, 0.50),
    "engagement": (0.62, 0.50),
    "renewal": (0.89, 0.72),
    "support": (0.89, 0.28),
}
NODE_LABELS = {name: name.title() for name in GRAPH_POSITIONS}
NODE_COLORS = {
    "need": "#e0f2fe",
    "intent": "#dbeafe",
    "match": "#ecfccb",
    "engagement": "#fef3c7",
    "renewal": "#fee2e2",
    "support": "#f3e8ff",
}


def trim_edge_to_box(start, end, box_w=0.145, box_h=0.095, gap=0.012):
    """Return edge endpoints that stop just outside source and target boxes."""
    x0, y0 = start
    x1, y1 = end
    dx = x1 - x0
    dy = y1 - y0
    length = float(np.hypot(dx, dy))
    if length == 0:
        return start, end
    effective_w = box_w + 0.04
    effective_h = box_h + 0.04
    x_limit = (effective_w / 2) / abs(dx) if dx else np.inf
    y_limit = (effective_h / 2) / abs(dy) if dy else np.inf
    t = min(x_limit, y_limit) + gap / length
    return (x0 + dx * t, y0 + dy * t), (x1 - dx * t, y1 - dy * t)


def draw_box_graph(edge_df, title, path, note=None):
    """Draw a DAG/CPDAG-style graph with rounded boxes and visible arrowheads."""
    fig, ax = plt.subplots(figsize=(12, 6.2))
    ax.set_axis_off()
    ax.set_xlim(-0.02, 1.02)
    ax.set_ylim(0.04, 0.96)
    box_w, box_h = 0.145, 0.095

    for row in edge_df.itertuples(index=False):
        if row.source not in GRAPH_POSITIONS or row.target not in GRAPH_POSITIONS:
            continue
        raw_start = GRAPH_POSITIONS[row.source]
        raw_end = GRAPH_POSITIONS[row.target]
        if row.edge_type == "<--":
            raw_start, raw_end = raw_end, raw_start
        start, end = trim_edge_to_box(raw_start, raw_end, box_w=box_w, box_h=box_h)
        if row.edge_type in {"-->", "<--"}:
            arrowstyle = "-|>"
            mutation_scale = 18
            linewidth = 1.8
            color = "#334155"
        else:
            arrowstyle = "-"
            mutation_scale = 1
            linewidth = 1.5
            color = "#64748b"
        arrow = FancyArrowPatch(
            start,
            end,
            arrowstyle=arrowstyle,
            mutation_scale=mutation_scale,
            linewidth=linewidth,
            color=color,
            connectionstyle="arc3,rad=0.035",
            zorder=2,
        )
        ax.add_patch(arrow)

    for node, (x, y) in GRAPH_POSITIONS.items():
        rect = FancyBboxPatch(
            (x - box_w / 2, y - box_h / 2),
            box_w,
            box_h,
            boxstyle="round,pad=0.018",
            facecolor=NODE_COLORS[node],
            edgecolor="#1f2937",
            linewidth=1.1,
            zorder=5,
        )
        ax.add_patch(rect)
        ax.text(x, y, NODE_LABELS[node], ha="center", va="center", fontsize=10.5, fontweight="bold", zorder=6)

    if note:
        ax.text(0.50, 0.08, note, ha="center", va="center", fontsize=10, color="#475569")
    ax.set_title(title, pad=18, fontsize=14, fontweight="bold")
    fig.savefig(path, dpi=160, bbox_inches="tight")
    plt.show()


def truth_as_edge_table(truth_df, label="truth"):
    """Convert a truth edge table to the plotting schema."""
    return truth_df.assign(run=label, edge_type="-->")[["run", "source", "edge_type", "target"]]

The helpers let us compare GES, PC, BIC penalties, sample sizes, and data-generating regimes using the same edge-table and graph-metric vocabulary.

Draw The Reference Graph

The reference graph is the synthetic data-generating structure. It is included for teaching evaluation, not because real causal discovery work would know the answer in advance.

# Draw the reference teaching graph.
true_edge_table = truth_as_edge_table(true_edges, label="true_graph")
true_graph_path = FIGURE_DIR / f"{NOTEBOOK_PREFIX}_true_dag.png"
draw_box_graph(
    true_edge_table,
    title="Reference Teaching DAG",
    path=true_graph_path,
    note="This graph is known only because the data are synthetic; it is used to evaluate discovery behavior.",
)

The reference graph gives us a visual target for the score-based runs. BIC-GES should perform best on the linear Gaussian dataset because that is the closest match to the score assumptions.

Continuous Data Audit

Before running a linear Gaussian score, we audit the linear Gaussian dataset. The BIC score uses covariance and local linear regressions, so the correlation structure is especially relevant.

# Summarize the continuous baseline data and save its correlation matrix.
linear_df = datasets["linear_gaussian"]
linear_audit = pd.DataFrame(
    {
        "variable": VARIABLES,
        "missing_rate": linear_df.isna().mean().reindex(VARIABLES).to_numpy(),
        "mean": linear_df.mean().reindex(VARIABLES).to_numpy(),
        "std": linear_df.std().reindex(VARIABLES).to_numpy(),
        "min": linear_df.min().reindex(VARIABLES).to_numpy(),
        "max": linear_df.max().reindex(VARIABLES).to_numpy(),
    }
)
linear_corr = linear_df.corr()

linear_audit.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_linear_gaussian_data_audit.csv", index=False)
linear_corr.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_linear_gaussian_correlation_matrix.csv")

display(linear_audit)
display(linear_corr.round(3))
variable missing_rate mean std min max
0 need 0.0 3.552714e-18 1.0002 -3.371900 3.316358
1 intent 0.0 -9.947598e-18 1.0002 -3.467962 3.389982
2 match 0.0 -3.552714e-19 1.0002 -3.596272 3.917631
3 engagement 0.0 -1.776357e-18 1.0002 -3.404627 3.469112
4 renewal 0.0 -8.526513e-18 1.0002 -3.418065 3.122549
5 support 0.0 1.563194e-17 1.0002 -3.250120 3.174916
need intent match engagement renewal support
need 1.000 0.007 0.595 0.494 0.211 0.311
intent 0.007 1.000 0.638 0.529 0.735 0.307
match 0.595 0.638 1.000 0.821 0.665 0.489
engagement 0.494 0.529 0.821 1.000 0.670 0.582
renewal 0.211 0.735 0.665 0.670 1.000 0.384
support 0.311 0.307 0.489 0.582 0.384 1.000

The data are complete and centered enough for a clean tutorial example. The correlation matrix shows dependence structure, but GES will use a graph score rather than treating pairwise correlations as graph edges.

Plot The Correlation Matrix

The heatmap gives a visual sense of the dependence structure that the score-based search will try to explain through parent sets.

# Plot the baseline correlation matrix.
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(linear_corr, annot=True, fmt=".2f", cmap="vlag", center=0, square=True, ax=ax)
ax.set_title("Linear Gaussian Correlation Matrix")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_linear_gaussian_correlation_heatmap.png", dpi=160, bbox_inches="tight")
plt.show()

The heatmap is only a diagnostic. The next cell runs GES and asks which graph scores best under the local linear Gaussian BIC objective.

Baseline BIC-GES On Linear Gaussian Data

This is the main GES run. We use the standard BIC penalty coefficient lambda_value=0.5, which corresponds to the usual BIC-style complexity penalty in causal-learn’s implementation.

# Run BIC-GES on the linear Gaussian dataset.
ges_record, ges_edges, ges_messages = run_ges_bic(
    linear_df,
    label="bic_ges_linear_gaussian",
    lambda_value=0.5,
)
ges_metrics = summarize_against_truth(ges_edges, true_edges, "bic_ges_linear_gaussian")
ges_classified = classify_edges(ges_edges, true_edges)

baseline_score_summary = pd.DataFrame(
    [
        {
            "run": "bic_ges_linear_gaussian",
            "score": ges_record["score"],
            "forward_updates": len(ges_record.get("update1", [])),
            "backward_updates": len(ges_record.get("update2", [])),
            "lambda_value": 0.5,
        }
    ]
)

baseline_score_summary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_score_summary.csv", index=False)
ges_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_edges.csv", index=False)
ges_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_metrics.csv", index=False)
ges_classified.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_edge_classification.csv", index=False)
ges_messages.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_messages.csv", index=False)

display(baseline_score_summary)
display(ges_edges)
display(ges_metrics)
display(ges_classified)
run score forward_updates backward_updates lambda_value
0 bic_ges_linear_gaussian -2547.995676 8 2 0.5
run source edge_type target
0 bic_ges_linear_gaussian need --> match
1 bic_ges_linear_gaussian intent --> match
2 bic_ges_linear_gaussian intent --> renewal
3 bic_ges_linear_gaussian match --> engagement
4 bic_ges_linear_gaussian engagement --> renewal
5 bic_ges_linear_gaussian engagement --> support
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies
0 bic_ges_linear_gaussian 6 6 6 6 1.0 1.0 0 0 0 0
source edge_type target status
0 need --> match correct directed edge
1 intent --> match correct directed edge
2 intent --> renewal correct directed edge
3 match --> engagement correct directed edge
4 engagement --> renewal correct directed edge
5 engagement --> support correct directed edge

On the linear Gaussian teaching data, BIC-GES recovers the intended graph cleanly. This is the friendly case: the data-generating assumptions line up well with the score.

Draw The Baseline GES Graph

The graph below is the baseline score-based discovery result. It should look familiar from the PC notebook because this synthetic dataset was intentionally easy for both Fisher-Z PC and BIC-GES.

# Draw and save the baseline GES graph.
baseline_ges_graph_path = FIGURE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_graph.png"
draw_box_graph(
    ges_edges,
    title="BIC-GES On Linear Gaussian Data",
    path=baseline_ges_graph_path,
    note="In this friendly setting, the score assumptions align with the data-generating process.",
)

The graph is clean because the example is well matched to BIC-GES. The next section opens the search record so students can see that GES reached this graph through forward and backward operations.

GES Update History

GES runs in two greedy phases: a forward phase that inserts edges while the score improves, then a backward phase that deletes edges while the score improves. causal-learn records both phases, but the raw update objects are index-based, so this cell maps them back to variable names.

# Summarize the forward and backward GES update history.
ges_updates = operation_table(ges_record, VARIABLES)
ges_updates.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_baseline_ges_update_history.csv", index=False)
display(ges_updates)
phase step source_index target_index source_variable target_variable conditioning_set
0 forward_insert 1 2 3 match engagement []
1 forward_insert 2 4 1 renewal intent []
2 forward_insert 3 4 3 renewal engagement []
3 forward_insert 4 2 0 match need []
4 forward_insert 5 1 0 intent need [match]
5 forward_insert 6 3 5 engagement support []
6 forward_insert 7 1 2 intent match [engagement]
7 forward_insert 8 1 3 intent engagement []
8 backward_delete 1 0 1 need intent [match]
9 backward_delete 2 1 3 intent engagement []

The update table shows the search path rather than just the final graph. This is useful when debugging a score-based result because two final graphs can have similar scores but different search trajectories.
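The index-to-name mapping itself is simple. A minimal sketch, using a hypothetical tuple layout for the raw updates (causal-learn's actual update objects are richer classes), looks like:

```python
# Hypothetical layout for one raw update: (source_index, target_index, conditioning_indices).
# This only illustrates the index-to-name step, not causal-learn's real data structures.
variables = ["need", "intent", "match", "engagement", "renewal", "support"]
raw_updates = [(2, 3, []), (1, 0, [2])]

def name_update(update, names):
    src, dst, cond = update
    return {
        "source": names[src],
        "target": names[dst],
        "conditioning_set": [names[i] for i in cond],
    }

named = [name_update(u, variables) for u in raw_updates]
print(named)
```

The real `operation_table` helper does the same translation for every recorded insert and delete, which is what produces the readable table above.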

Compare GES With PC

Now we run PC on the same linear Gaussian dataset. This is not a contest to crown one method permanently better. It is a controlled comparison between constraint-based and score-based discovery under friendly assumptions.

# Run PC on the same linear Gaussian data and compare with GES.
pc_edges = run_pc_baseline(linear_df, label="pc_fisherz_linear_gaussian")
pc_metrics = summarize_against_truth(pc_edges, true_edges, "pc_fisherz_linear_gaussian")
pc_classified = classify_edges(pc_edges, true_edges)

pc_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_pc_baseline_edges.csv", index=False)
pc_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_pc_baseline_metrics.csv", index=False)
pc_classified.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_pc_baseline_edge_classification.csv", index=False)

method_comparison = pd.concat([ges_metrics, pc_metrics], ignore_index=True)
method_comparison.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_ges_vs_pc_metrics.csv", index=False)

display(pc_edges)
display(method_comparison)
run source edge_type target
0 pc_fisherz_linear_gaussian need --> match
1 pc_fisherz_linear_gaussian intent --> match
2 pc_fisherz_linear_gaussian intent --> renewal
3 pc_fisherz_linear_gaussian match --> engagement
4 pc_fisherz_linear_gaussian engagement --> renewal
5 pc_fisherz_linear_gaussian engagement --> support
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies
0 bic_ges_linear_gaussian 6 6 6 6 1.0 1.0 0 0 0 0
1 pc_fisherz_linear_gaussian 6 6 6 6 1.0 1.0 0 0 0 0

In this controlled case, both methods recover the graph well. That agreement is useful: when methods with different search logic agree under plausible assumptions, the candidate structure is easier to trust.
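Agreement between two edge tables can be checked mechanically. One lightweight pattern, shown here with illustrative tuples rather than the notebook's actual tables, is plain set arithmetic over (source, edge_type, target) triples:

```python
# Illustrative edge tuples; in the notebook these would come from ges_edges and pc_edges.
ges_set = {("need", "-->", "match"), ("intent", "-->", "match"), ("engagement", "-->", "support")}
pc_set = {("need", "-->", "match"), ("intent", "-->", "match"), ("engagement", "-->", "support")}

agreed = ges_set & pc_set      # edges both methods found
only_ges = ges_set - pc_set    # edges unique to GES
only_pc = pc_set - ges_set     # edges unique to PC
print(len(agreed), len(only_ges), len(only_pc))
```

Reporting the disagreement sets, not just a headline accuracy number, makes it clear exactly where two discovery methods diverge.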

Plot GES Versus PC Metrics

A compact plot makes the method comparison easier to scan. We focus on correct directed edges, reversed true edges, unresolved true adjacencies, and extra adjacencies.

# Plot GES and PC metric comparison.
fig, ax = plt.subplots(figsize=(10, 5))
plot_df = method_comparison.melt(
    id_vars="run",
    value_vars=["correct_directed_edges", "reversed_true_edges", "unresolved_true_adjacencies", "extra_adjacencies"],
    var_name="metric",
    value_name="count",
)
sns.barplot(data=plot_df, x="metric", y="count", hue="run", ax=ax, palette=["#38bdf8", "#94a3b8"])
ax.set_title("BIC-GES Versus Fisher-Z PC On Linear Gaussian Data")
ax.set_xlabel("")
ax.set_ylabel("Count")
ax.tick_params(axis="x", rotation=20)
ax.legend(title="Run", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_ges_vs_pc_metrics.png", dpi=160, bbox_inches="tight")
plt.show()

The plot reinforces the friendly-case message. The next sections make the notebook more realistic by changing penalty strength, sample size, and data-generating assumptions.

BIC Penalty Sensitivity

The BIC penalty controls how much complexity is punished. A lower penalty can allow denser graphs; a higher penalty can prefer simpler graphs. This cell runs BIC-GES across several lambda_value choices.

# Run BIC-GES across penalty strengths.
lambda_values = [0.25, 0.5, 1.0, 2.0]
lambda_metric_rows = []
lambda_edge_rows = []
lambda_score_rows = []

for lambda_value in lambda_values:
    label = f"bic_ges_lambda_{lambda_value:g}"
    record, edge_table, messages = run_ges_bic(linear_df, label=label, lambda_value=lambda_value)
    metrics = summarize_against_truth(edge_table, true_edges, label).assign(lambda_value=lambda_value)
    lambda_metric_rows.append(metrics)
    lambda_edge_rows.append(edge_table.assign(lambda_value=lambda_value))
    lambda_score_rows.append(
        {
            "run": label,
            "lambda_value": lambda_value,
            "score": record["score"],
            "forward_updates": len(record.get("update1", [])),
            "backward_updates": len(record.get("update2", [])),
        }
    )

lambda_metrics = pd.concat(lambda_metric_rows, ignore_index=True)
lambda_edges = pd.concat(lambda_edge_rows, ignore_index=True)
lambda_scores = pd.DataFrame(lambda_score_rows)

lambda_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bic_lambda_sensitivity_metrics.csv", index=False)
lambda_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bic_lambda_sensitivity_edges.csv", index=False)
lambda_scores.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bic_lambda_sensitivity_scores.csv", index=False)

display(lambda_scores)
display(lambda_metrics)
display(lambda_edges)
run lambda_value score forward_updates backward_updates
0 bic_ges_lambda_0.25 0.25 -2524.523538 8 2
1 bic_ges_lambda_0.5 0.50 -2547.995676 8 2
2 bic_ges_lambda_1 1.00 -2608.870598 7 0
3 bic_ges_lambda_2 2.00 -2710.583196 7 0
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies lambda_value
0 bic_ges_lambda_0.25 6 6 6 6 1.00 1.000000 0 0 0 0 0.25
1 bic_ges_lambda_0.5 6 6 6 6 1.00 1.000000 0 0 0 0 0.50
2 bic_ges_lambda_1 7 4 6 1 0.25 0.166667 2 3 0 1 1.00
3 bic_ges_lambda_2 7 4 6 1 0.25 0.166667 2 3 0 1 2.00
run source edge_type target lambda_value
0 bic_ges_lambda_0.25 need --> match 0.25
1 bic_ges_lambda_0.25 intent --> match 0.25
2 bic_ges_lambda_0.25 intent --> renewal 0.25
3 bic_ges_lambda_0.25 match --> engagement 0.25
4 bic_ges_lambda_0.25 engagement --> renewal 0.25
5 bic_ges_lambda_0.25 engagement --> support 0.25
6 bic_ges_lambda_0.5 need --> match 0.50
7 bic_ges_lambda_0.5 intent --> match 0.50
8 bic_ges_lambda_0.5 intent --> renewal 0.50
9 bic_ges_lambda_0.5 match --> engagement 0.50
10 bic_ges_lambda_0.5 engagement --> renewal 0.50
11 bic_ges_lambda_0.5 engagement --> support 0.50
12 bic_ges_lambda_1 intent --> need 1.00
13 bic_ges_lambda_1 match --> need 1.00
14 bic_ges_lambda_1 intent --> match 1.00
15 bic_ges_lambda_1 intent --- renewal 1.00
16 bic_ges_lambda_1 engagement --> match 1.00
17 bic_ges_lambda_1 engagement --- renewal 1.00
18 bic_ges_lambda_1 engagement --- support 1.00
19 bic_ges_lambda_2 intent --> need 2.00
20 bic_ges_lambda_2 match --> need 2.00
21 bic_ges_lambda_2 intent --> match 2.00
22 bic_ges_lambda_2 intent --- renewal 2.00
23 bic_ges_lambda_2 engagement --> match 2.00
24 bic_ges_lambda_2 engagement --- renewal 2.00
25 bic_ges_lambda_2 engagement --- support 2.00

The penalty sensitivity table shows that the baseline penalty (0.5) and the weaker 0.25 setting recover the graph exactly, while the stronger penalties (1.0 and 2.0) reverse some edges, leave others unresolved, and add an extra adjacency. This is why score-based results should report tuning choices rather than presenting the graph as automatic truth.
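The mechanics of the penalty can be made concrete with a simplified Gaussian local score on a toy two-variable system. This is a sketch of the idea only, not causal-learn's exact BIC formula:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

def local_bic(child, parents, lam=1.0):
    # Simplified penalized Gaussian local score: log-likelihood minus lam * (k / 2) * log(n).
    # Sketches the role of the penalty coefficient; not causal-learn's implementation.
    n = len(child)
    X = np.column_stack([np.ones(n)] + list(parents))
    beta, *_ = np.linalg.lstsq(X, child, rcond=None)
    sigma2 = np.mean((child - X @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - lam * 0.5 * X.shape[1] * np.log(n)

# The true parent improves the score, and a larger lam shaves more off complex models.
print(local_bic(y, [x]) > local_bic(y, []))
print(local_bic(y, [x], lam=2.0) < local_bic(y, [x], lam=0.5))
```

When the signal is strong, as here, the edge survives any reasonable penalty; the sensitivity table above is interesting precisely where the fit gain and the penalty are closer in size.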

Plot Penalty Sensitivity

The plot tracks graph quality metrics across BIC penalty strengths. It helps students see when a score penalty starts changing the graph materially.

# Plot BIC penalty sensitivity.
fig, ax = plt.subplots(figsize=(10, 5))
lambda_plot = lambda_metrics.melt(
    id_vars=["run", "lambda_value"],
    value_vars=["correct_directed_edges", "reversed_true_edges", "unresolved_true_adjacencies", "extra_adjacencies"],
    var_name="metric",
    value_name="count",
)
sns.lineplot(data=lambda_plot, x="lambda_value", y="count", hue="metric", marker="o", ax=ax)
ax.set_title("BIC-GES Penalty Sensitivity")
ax.set_xlabel("BIC Penalty Coefficient")
ax.set_ylabel("Count")
ax.legend(title="Metric", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_bic_lambda_sensitivity.png", dpi=160, bbox_inches="tight")
plt.show()

The sensitivity curve makes tuning visible. If a graph only appears for one penalty value, the report should treat it as less stable than a graph that persists across nearby choices.

Sample-Size Sensitivity

Score-based search can be unstable with small samples because local score differences are harder to estimate. This cell reruns BIC-GES on subsamples of increasing size.

# Run BIC-GES on increasing sample sizes.
sample_sizes = [100, 250, 500, 1000, len(linear_df)]
sample_metric_rows = []
sample_edge_rows = []
sample_score_rows = []

for sample_size in sample_sizes:
    sample_df = linear_df.sample(n=sample_size, random_state=908 + sample_size)
    label = f"bic_ges_n_{sample_size}"
    record, edge_table, messages = run_ges_bic(sample_df, label=label, lambda_value=0.5)
    metrics = summarize_against_truth(edge_table, true_edges, label).assign(sample_size=sample_size)
    sample_metric_rows.append(metrics)
    sample_edge_rows.append(edge_table.assign(sample_size=sample_size))
    sample_score_rows.append(
        {
            "run": label,
            "sample_size": sample_size,
            "score": record["score"],
            "forward_updates": len(record.get("update1", [])),
            "backward_updates": len(record.get("update2", [])),
        }
    )

sample_metrics = pd.concat(sample_metric_rows, ignore_index=True)
sample_edges = pd.concat(sample_edge_rows, ignore_index=True)
sample_scores = pd.DataFrame(sample_score_rows)

sample_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_sample_size_sensitivity_metrics.csv", index=False)
sample_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_sample_size_sensitivity_edges.csv", index=False)
sample_scores.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_sample_size_sensitivity_scores.csv", index=False)

display(sample_scores)
display(sample_metrics)
display(sample_edges.head(20))
run sample_size score forward_updates backward_updates
0 bic_ges_n_100 100 -79.144994 7 0
1 bic_ges_n_250 250 -245.731976 8 2
2 bic_ges_n_500 500 -511.203702 8 2
3 bic_ges_n_1000 1000 -962.730373 8 2
4 bic_ges_n_2500 2500 -2547.995676 8 2
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies sample_size
0 bic_ges_n_100 7 2 6 2 1.0 0.333333 0 4 0 1 100
1 bic_ges_n_250 6 6 6 6 1.0 1.000000 0 0 0 0 250
2 bic_ges_n_500 6 6 6 6 1.0 1.000000 0 0 0 0 500
3 bic_ges_n_1000 6 6 6 6 1.0 1.000000 0 0 0 0 1000
4 bic_ges_n_2500 6 6 6 6 1.0 1.000000 0 0 0 0 2500
run source edge_type target sample_size
0 bic_ges_n_100 need --- intent 100
1 bic_ges_n_100 need --- match 100
2 bic_ges_n_100 intent --- match 100
3 bic_ges_n_100 intent --> renewal 100
4 bic_ges_n_100 match --- engagement 100
5 bic_ges_n_100 engagement --> renewal 100
6 bic_ges_n_100 engagement --- support 100
7 bic_ges_n_250 need --> match 250
8 bic_ges_n_250 intent --> match 250
9 bic_ges_n_250 intent --> renewal 250
10 bic_ges_n_250 match --> engagement 250
11 bic_ges_n_250 engagement --> renewal 250
12 bic_ges_n_250 engagement --> support 250
13 bic_ges_n_500 need --> match 500
14 bic_ges_n_500 intent --> match 500
15 bic_ges_n_500 intent --> renewal 500
16 bic_ges_n_500 match --> engagement 500
17 bic_ges_n_500 engagement --> renewal 500
18 bic_ges_n_500 engagement --> support 500
19 bic_ges_n_1000 need --> match 1000

Small samples can leave edges unoriented or add extra adjacencies. By a few hundred rows in this synthetic setup, BIC-GES becomes much more stable, which is exactly the kind of sample-size story a good report should include.
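The instability has a simple statistical root: the score gap that justifies an edge is itself an estimate. A toy illustration, using a hypothetical modest coefficient and a simplified BIC-style decision rule rather than the tutorial's data or causal-learn's score:

```python
import numpy as np

rng = np.random.default_rng(1)
x_full = rng.normal(size=2500)
y_full = 0.25 * x_full + rng.normal(size=2500)  # a real but modest edge x -> y

def prefers_edge(xs, ys):
    # Simplified Gaussian decision: does adding x as a parent of y beat the
    # empty model by more than a one-parameter BIC penalty?
    n = len(ys)
    fitted = np.polyval(np.polyfit(xs, ys, 1), xs)
    ll_gap = 0.5 * n * (np.log(np.var(ys)) - np.log(np.var(ys - fitted)))
    return ll_gap > 0.5 * np.log(n)

flips = {}
for n in (50, 1000):
    misses = 0
    for _ in range(200):
        idx = rng.choice(2500, size=n, replace=False)
        if not prefers_edge(x_full[idx], y_full[idx]):
            misses += 1
    flips[n] = misses

print(flips)  # small subsamples miss the true edge far more often
```

The same edge that is essentially always accepted at n=1000 is frequently rejected at n=50, which is the mechanism behind the unoriented and extra edges in the `bic_ges_n_100` run above.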

Plot Sample-Size Sensitivity

This plot makes the sample-size pattern easier to read. It tracks correct directed edges, unresolved true adjacencies, missing true adjacencies, and extra adjacencies as the available data grow.

# Plot sample-size sensitivity for BIC-GES.
fig, ax = plt.subplots(figsize=(10, 5))
sample_plot = sample_metrics.melt(
    id_vars=["run", "sample_size"],
    value_vars=["correct_directed_edges", "unresolved_true_adjacencies", "extra_adjacencies", "missing_true_adjacencies"],
    var_name="metric",
    value_name="count",
)
sns.lineplot(data=sample_plot, x="sample_size", y="count", hue="metric", marker="o", ax=ax)
ax.set_title("BIC-GES Sample-Size Sensitivity")
ax.set_xlabel("Sample Size")
ax.set_ylabel("Count")
ax.legend(title="Metric", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_sample_size_sensitivity.png", dpi=160, bbox_inches="tight")
plt.show()

The sample-size plot is a useful diagnostic for any score-based discovery workflow. Stable graph recovery at larger samples is more reassuring than a graph that only appears in one small subsample.

Model-Mismatch Stress Test

BIC-GES is a linear Gaussian score. This section runs the same method on three datasets with different mechanisms. The purpose is not to say BIC-GES is bad; it is to show that score assumptions matter.

# Run BIC-GES on datasets with increasingly mismatched mechanisms.
regime_metric_rows = []
regime_edge_rows = []
regime_score_rows = []
for dataset_name in ["linear_gaussian", "linear_nongaussian", "nonlinear_continuous"]:
    label = f"bic_ges_{dataset_name}"
    record, edge_table, messages = run_ges_bic(datasets[dataset_name], label=label, lambda_value=0.5)
    metrics = summarize_against_truth(edge_table, true_edges, label).assign(dataset=dataset_name)
    regime_metric_rows.append(metrics)
    regime_edge_rows.append(edge_table.assign(dataset=dataset_name))
    regime_score_rows.append(
        {
            "run": label,
            "dataset": dataset_name,
            "score": record["score"],
            "forward_updates": len(record.get("update1", [])),
            "backward_updates": len(record.get("update2", [])),
        }
    )

regime_metrics = pd.concat(regime_metric_rows, ignore_index=True)
regime_edges = pd.concat(regime_edge_rows, ignore_index=True)
regime_scores = pd.DataFrame(regime_score_rows)

regime_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_mechanism_stress_test_metrics.csv", index=False)
regime_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_mechanism_stress_test_edges.csv", index=False)
regime_scores.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_mechanism_stress_test_scores.csv", index=False)

display(regime_scores)
display(regime_metrics)
display(regime_edges)
run dataset score forward_updates backward_updates
0 bic_ges_linear_gaussian linear_gaussian -2547.995676 8 2
1 bic_ges_linear_nongaussian linear_nongaussian -2561.286088 8 1
2 bic_ges_nonlinear_continuous nonlinear_continuous -4253.218605 9 0
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies dataset
0 bic_ges_linear_gaussian 6 6 6 6 1.000000 1.000000 0 0 0 0 linear_gaussian
1 bic_ges_linear_nongaussian 7 7 6 6 0.857143 1.000000 0 0 0 1 linear_nongaussian
2 bic_ges_nonlinear_continuous 9 6 6 1 0.166667 0.166667 2 3 0 3 nonlinear_continuous
run source edge_type target dataset
0 bic_ges_linear_gaussian need --> match linear_gaussian
1 bic_ges_linear_gaussian intent --> match linear_gaussian
2 bic_ges_linear_gaussian intent --> renewal linear_gaussian
3 bic_ges_linear_gaussian match --> engagement linear_gaussian
4 bic_ges_linear_gaussian engagement --> renewal linear_gaussian
5 bic_ges_linear_gaussian engagement --> support linear_gaussian
6 bic_ges_linear_nongaussian need --> match linear_nongaussian
7 bic_ges_linear_nongaussian intent --> match linear_nongaussian
8 bic_ges_linear_nongaussian intent --> engagement linear_nongaussian
9 bic_ges_linear_nongaussian intent --> renewal linear_nongaussian
10 bic_ges_linear_nongaussian match --> engagement linear_nongaussian
11 bic_ges_linear_nongaussian engagement --> renewal linear_nongaussian
12 bic_ges_linear_nongaussian engagement --> support linear_nongaussian
13 bic_ges_nonlinear_continuous intent --> need nonlinear_continuous
14 bic_ges_nonlinear_continuous match --> need nonlinear_continuous
15 bic_ges_nonlinear_continuous engagement --> need nonlinear_continuous
16 bic_ges_nonlinear_continuous intent --> match nonlinear_continuous
17 bic_ges_nonlinear_continuous intent --- renewal nonlinear_continuous
18 bic_ges_nonlinear_continuous engagement --> match nonlinear_continuous
19 bic_ges_nonlinear_continuous renewal --> match nonlinear_continuous
20 bic_ges_nonlinear_continuous engagement --- renewal nonlinear_continuous
21 bic_ges_nonlinear_continuous engagement --- support nonlinear_continuous

The stress test shows the main limitation of a score: when the score is mismatched to the data-generating process, the discovered graph can deteriorate. The nonlinear dataset is especially hard for a linear Gaussian score.
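Why the linear Gaussian score struggles can be shown with a two-variable toy in which the dependence is purely nonlinear. The coefficients here are illustrative, not taken from the tutorial's data-generating process:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=2000)
y = x ** 2 + 0.1 * rng.normal(size=2000)  # purely nonlinear dependence

# A linear Gaussian score only "sees" linear structure, which is near zero here.
linear_r2 = np.corrcoef(x, y)[0, 1] ** 2

# A quadratic fit recovers almost all of the variance the linear view misses.
coeffs = np.polyfit(x, y, 2)
quadratic_r2 = 1 - np.var(y - np.polyval(coeffs, x)) / np.var(y)

print(round(linear_r2, 3), round(quadratic_r2, 3))
```

A strong dependence that is nearly invisible to a linear score is exactly the situation where BIC-GES can drop, reverse, or misplace edges, as the nonlinear row of the stress-test table shows.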

Plot Model-Mismatch Results

The plot compares graph recovery across the three mechanism regimes. It is a quick way to see where BIC-GES is reliable and where its assumptions start to fray.

# Plot the mechanism mismatch stress test.
fig, ax = plt.subplots(figsize=(10, 5))
regime_plot = regime_metrics.melt(
    id_vars=["run", "dataset"],
    value_vars=["correct_directed_edges", "reversed_true_edges", "unresolved_true_adjacencies", "extra_adjacencies"],
    var_name="metric",
    value_name="count",
)
sns.barplot(data=regime_plot, x="dataset", y="count", hue="metric", ax=ax)
ax.set_title("BIC-GES Under Mechanism Mismatch")
ax.set_xlabel("Dataset")
ax.set_ylabel("Count")
ax.tick_params(axis="x", rotation=15)
ax.legend(title="Metric", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
fig.savefig(FIGURE_DIR / f"{NOTEBOOK_PREFIX}_mechanism_stress_test.png", dpi=160, bbox_inches="tight")
plt.show()

The plot is the cautionary center of the notebook. Score-based discovery is only as credible as the score family and the data representation that support it.

Draw The Nonlinear Stress-Test Graph

The nonlinear graph is worth drawing because it shows what model mismatch looks like visually: extra arrows, reversed directions, and less faithful structure.

# Draw the BIC-GES result on nonlinear continuous data.
nonlinear_edges = regime_edges[regime_edges["dataset"] == "nonlinear_continuous"].drop(columns=["dataset"])
nonlinear_graph_path = FIGURE_DIR / f"{NOTEBOOK_PREFIX}_nonlinear_bic_ges_graph.png"
draw_box_graph(
    nonlinear_edges,
    title="BIC-GES On Nonlinear Continuous Data",
    path=nonlinear_graph_path,
    note="A linear Gaussian score can misread nonlinear dependence structure.",
)

The nonlinear graph is less trustworthy than the baseline graph. This is where a report should consider nonlinear scores, transformations, or a different discovery method rather than forcing BIC-GES to carry the whole analysis.

Discrete Score Example: BDeu-GES

BIC is not the only score. For discrete data, causal-learn supports BDeu-style scoring. This section runs GES on the discrete teaching dataset to show how the score family changes with data type.

# Run BDeu-GES on the discrete dataset.
discrete_df = datasets["discrete_mixed"].astype(int)
r_i_map = {i: int(discrete_df.iloc[:, i].nunique()) for i in range(discrete_df.shape[1])}
bdeu_parameters = {"sample_prior": 1, "structure_prior": 1, "r_i_map": r_i_map}

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    bdeu_record = ges(
        discrete_df.to_numpy(),
        score_func="local_score_BDeu",
        node_names=list(discrete_df.columns),
        parameters=bdeu_parameters,
    )

bdeu_edges = graph_to_edge_table(bdeu_record["G"], label="bdeu_ges_discrete_mixed")
bdeu_metrics = summarize_against_truth(bdeu_edges, discrete_true_edges, "bdeu_ges_discrete_mixed")
bdeu_classified = classify_edges(bdeu_edges, discrete_true_edges)
bdeu_score_summary = pd.DataFrame(
    [
        {
            "run": "bdeu_ges_discrete_mixed",
            "score": bdeu_record["score"],
            "forward_updates": len(bdeu_record.get("update1", [])),
            "backward_updates": len(bdeu_record.get("update2", [])),
            "sample_prior": bdeu_parameters["sample_prior"],
            "structure_prior": bdeu_parameters["structure_prior"],
        }
    ]
)

bdeu_edges.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bdeu_ges_edges.csv", index=False)
bdeu_metrics.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bdeu_ges_metrics.csv", index=False)
bdeu_classified.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bdeu_ges_edge_classification.csv", index=False)
bdeu_score_summary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_bdeu_ges_score_summary.csv", index=False)

display(bdeu_score_summary)
display(bdeu_edges)
display(bdeu_metrics)
display(bdeu_classified)
run score forward_updates backward_updates sample_prior structure_prior
0 bdeu_ges_discrete_mixed -10499.860061 6 0 1 1
run source edge_type target
0 bdeu_ges_discrete_mixed need --- match
1 bdeu_ges_discrete_mixed match --> intent
2 bdeu_ges_discrete_mixed renewal --> intent
3 bdeu_ges_discrete_mixed match --- engagement
4 bdeu_ges_discrete_mixed engagement --- renewal
5 bdeu_ges_discrete_mixed engagement --- support
run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies
0 bdeu_ges_discrete_mixed 6 2 6 0 0.0 0.0 2 4 0 0
source edge_type target status
0 need --- match true adjacency with uncertain or wrong endpoint
1 match --> intent reversed true edge
2 renewal --> intent reversed true edge
3 match --- engagement true adjacency with uncertain or wrong endpoint
4 engagement --- renewal true adjacency with uncertain or wrong endpoint
5 engagement --- support true adjacency with uncertain or wrong endpoint

The BDeu example is intentionally more modest than the BIC baseline. It demonstrates how to switch score families for discrete data, and it shows that discretization and prior choices can make orientation recovery harder.
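At its core, BDeu is a Dirichlet-multinomial marginal likelihood over contingency counts. A compact sketch of a single-node BDeu score, stripped to essentials and not causal-learn's implementation (no structure prior, hypothetical toy data):

```python
import math
from collections import Counter

def bdeu_local(child, parent_cols, ess=1.0):
    # Simplified BDeu local score for one discrete node given its candidate parents.
    # ess is the equivalent sample size; q and r are parent-config and child-state counts.
    r = len(set(child))
    q = 1
    for col in parent_cols:
        q *= len(set(col))
    rows = list(zip(*parent_cols)) if parent_cols else [()] * len(child)
    joint = Counter(zip(rows, child))
    config = Counter(rows)
    a_j, a_jk = ess / q, ess / (q * r)
    score = 0.0
    for j in config:  # unobserved parent configs contribute zero and can be skipped
        score += math.lgamma(a_j) - math.lgamma(a_j + config[j])
        for k in set(child):
            score += math.lgamma(a_jk + joint[(j, k)]) - math.lgamma(a_jk)
    return score

# A child that copies its parent is scored much higher with that parent included.
x = [0, 1] * 100
y = list(x)
print(bdeu_local(y, [x]) > bdeu_local(y, []))
```

The `ess` argument plays the role of `sample_prior` in the causal-learn call above: it controls how strongly the Dirichlet prior smooths the observed counts.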

Draw The BDeu-GES Graph

The discrete graph is drawn with the same layout so it can be compared against the continuous BIC-GES graph. Undirected edges are shown as plain lines because GES returns an equivalence class (a CPDAG), in which some edge directions are genuinely left open.

# Draw the BDeu-GES graph on discrete data.
bdeu_graph_path = FIGURE_DIR / f"{NOTEBOOK_PREFIX}_bdeu_ges_graph.png"
draw_box_graph(
    bdeu_edges,
    title="BDeu-GES On Discrete Teaching Data",
    path=bdeu_graph_path,
    note="Discrete scoring matches categorical data, but orientation can remain uncertain.",
)

The BDeu graph is a good reminder that matching score type to data type is necessary but not sufficient. A score can be appropriate and still leave several edges uncertain.
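Reading those edge types out of a returned graph is mostly a matter of decoding the endpoint matrix. The sketch below uses a toy matrix, not the notebook's actual graph object; the `-1` tail / `1` arrowhead convention follows causal-learn's documented encoding, but verify it against your installed version:

```python
import numpy as np

# Endpoint matrix convention (per causal-learn's docs):
#   graph[i, j] == -1 and graph[j, i] == 1   =>  i --> j
#   graph[i, j] == graph[j, i] == -1         =>  i --- j (left undirected by the CPDAG)
names = ["need", "intent", "match"]
graph = np.array([
    [ 0, -1,  0],
    [ 1,  0, -1],
    [ 0, -1,  0],
])

def edges_from_matrix(graph, names):
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = graph[i, j], graph[j, i]
            if a == -1 and b == 1:
                edges.append((names[i], "-->", names[j]))
            elif a == 1 and b == -1:
                edges.append((names[j], "-->", names[i]))
            elif a == -1 and b == -1:
                edges.append((names[i], "---", names[j]))
    return edges

print(edges_from_matrix(graph, names))
```

This is essentially what the notebook's `graph_to_edge_table` helper does before the edge tables are classified against the ground truth.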

Score Function Runtime And Use Cases

The next table summarizes score functions you may encounter in causal-learn. The notebook runs the fast, stable choices by default and leaves the slower kernel/CV scores as optional experiments.

# Summarize score choices and approximate runtime expectations for this tutorial setup.
score_function_guide = pd.DataFrame(
    [
        {
            "score_function": "local_score_BIC",
            "data_type": "continuous",
            "main_assumption": "linear Gaussian local models",
            "run_by_default": True,
            "approx_runtime_on_this_dataset": "seconds",
            "when_to_try": "baseline score-based discovery for continuous data",
        },
        {
            "score_function": "local_score_BDeu",
            "data_type": "discrete",
            "main_assumption": "categorical Bayesian-network score with documented priors",
            "run_by_default": True,
            "approx_runtime_on_this_dataset": "seconds",
            "when_to_try": "discrete variables with a clear state space",
        },
        {
            "score_function": "local_score_CV_general",
            "data_type": "continuous",
            "main_assumption": "kernel regression with cross-validated likelihood",
            "run_by_default": False,
            "approx_runtime_on_this_dataset": "5-15+ minutes",
            "when_to_try": "nonlinear relationships when longer runtime is acceptable",
        },
        {
            "score_function": "local_score_marginal_general",
            "data_type": "continuous",
            "main_assumption": "kernel marginal likelihood style score",
            "run_by_default": False,
            "approx_runtime_on_this_dataset": "5-15+ minutes or more",
            "when_to_try": "nonlinear sensitivity checks on smaller samples first",
        },
    ]
)
score_function_guide.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_score_function_guide.csv", index=False)
display(score_function_guide)
score_function data_type main_assumption run_by_default approx_runtime_on_this_dataset when_to_try
0 local_score_BIC continuous linear Gaussian local models True seconds baseline score-based discovery for continuous ...
1 local_score_BDeu discrete categorical Bayesian-network score with docume... True seconds discrete variables with a clear state space
2 local_score_CV_general continuous kernel regression with cross-validated likelihood False 5-15+ minutes nonlinear relationships when longer runtime is...
3 local_score_marginal_general continuous kernel marginal likelihood style score False 5-15+ minutes or more nonlinear sensitivity checks on smaller sample...

This guide is included because score choice is both statistical and computational. A slower nonlinear score should usually be tested on a smaller sample or reduced variable set before it becomes part of a default notebook pipeline.

Combined Scenario Summary

This table gathers the main runs into one place: baseline BIC-GES, PC, selected penalty runs, sample-size runs, mechanism stress tests, and BDeu-GES. It becomes the compact summary for the notebook.

# Combine the most important metric tables into one scenario summary.
scenario_summary = pd.concat(
    [
        ges_metrics.assign(section="baseline_bic_ges"),
        pc_metrics.assign(section="constraint_based_comparison"),
        lambda_metrics.assign(section="bic_penalty_sensitivity"),
        sample_metrics.assign(section="sample_size_sensitivity"),
        regime_metrics.assign(section="mechanism_stress_test"),
        bdeu_metrics.assign(section="discrete_bdeu_example"),
    ],
    ignore_index=True,
    sort=False,
)
scenario_summary = scenario_summary[["section"] + [col for col in scenario_summary.columns if col != "section"]]
scenario_summary.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_scenario_summary_metrics.csv", index=False)
display(scenario_summary)
section run learned_edges_total definite_directed_edges true_edges correct_directed_edges directed_precision directed_recall reversed_true_edges unresolved_true_adjacencies missing_true_adjacencies extra_adjacencies lambda_value sample_size dataset
0 baseline_bic_ges bic_ges_linear_gaussian 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN NaN NaN
1 constraint_based_comparison pc_fisherz_linear_gaussian 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN NaN NaN
2 bic_penalty_sensitivity bic_ges_lambda_0.25 6 6 6 6 1.000000 1.000000 0 0 0 0 0.25 NaN NaN
3 bic_penalty_sensitivity bic_ges_lambda_0.5 6 6 6 6 1.000000 1.000000 0 0 0 0 0.50 NaN NaN
4 bic_penalty_sensitivity bic_ges_lambda_1 7 4 6 1 0.250000 0.166667 2 3 0 1 1.00 NaN NaN
5 bic_penalty_sensitivity bic_ges_lambda_2 7 4 6 1 0.250000 0.166667 2 3 0 1 2.00 NaN NaN
6 sample_size_sensitivity bic_ges_n_100 7 2 6 2 1.000000 0.333333 0 4 0 1 NaN 100.0 NaN
7 sample_size_sensitivity bic_ges_n_250 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN 250.0 NaN
8 sample_size_sensitivity bic_ges_n_500 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN 500.0 NaN
9 sample_size_sensitivity bic_ges_n_1000 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN 1000.0 NaN
10 sample_size_sensitivity bic_ges_n_2500 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN 2500.0 NaN
11 mechanism_stress_test bic_ges_linear_gaussian 6 6 6 6 1.000000 1.000000 0 0 0 0 NaN NaN linear_gaussian
12 mechanism_stress_test bic_ges_linear_nongaussian 7 7 6 6 0.857143 1.000000 0 0 0 1 NaN NaN linear_nongaussian
13 mechanism_stress_test bic_ges_nonlinear_continuous 9 6 6 1 0.166667 0.166667 2 3 0 3 NaN NaN nonlinear_continuous
14 discrete_bdeu_example bdeu_ges_discrete_mixed 6 2 6 0 0.000000 0.000000 2 4 0 0 NaN NaN NaN

The scenario summary is the main audit table. It shows where score-based discovery works cleanly and where the result becomes sensitive to penalty, sample size, or model mismatch.

GES Reporting Checklist

A score-based discovery report should document more than the final graph. The checklist below captures the key details that make a GES result easier to evaluate.

# Save a practical checklist for reporting score-based discovery results.
reporting_checklist = pd.DataFrame(
    [
        {
            "topic": "score choice",
            "question_to_answer": "Which score was used, and why does it match the data type and mechanism assumptions?",
            "reporting_note": "BIC, BDeu, and kernel scores answer different modeling questions.",
        },
        {
            "topic": "penalty or priors",
            "question_to_answer": "What penalty coefficient or score prior was used?",
            "reporting_note": "Graph density can change when the complexity penalty changes.",
        },
        {
            "topic": "search path",
            "question_to_answer": "How many forward and backward updates occurred?",
            "reporting_note": "The final graph is easier to debug when the search path is visible.",
        },
        {
            "topic": "sample-size sensitivity",
            "question_to_answer": "Does the graph stabilize as sample size increases?",
            "reporting_note": "Small-sample score differences can be noisy.",
        },
        {
            "topic": "model mismatch",
            "question_to_answer": "What happens when the same score is run on data with different mechanisms?",
            "reporting_note": "A score can be fast and elegant but still wrong for nonlinear data.",
        },
        {
            "topic": "claim strength",
            "question_to_answer": "Which edges are stable candidate structure rather than confirmed causal directions?",
            "reporting_note": "GES returns a discovered graph under assumptions, not a final causal proof.",
        },
    ]
)
reporting_checklist.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_ges_reporting_checklist.csv", index=False)
display(reporting_checklist)

The checklist distills the notebook into reusable reporting practice. Score-based discovery is most useful when the score assumptions, tuning choices, and stability checks are all visible alongside the final graph.
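One way to put the checklist to work is to attach run-specific answers to it before sharing a report. The sketch below is a minimal, hypothetical example: the answers are illustrative placeholders for one GES run, not values computed by this notebook, and only a subset of the checklist topics is shown.

```python
import pandas as pd

# Hypothetical example: pair a few checklist topics with answers for one run.
# The answer strings are placeholders, not outputs of this notebook.
checklist = pd.DataFrame(
    {
        "topic": ["score choice", "penalty or priors", "search path"],
        "question_to_answer": [
            "Which score was used, and why does it match the data type?",
            "What penalty coefficient or score prior was used?",
            "How many forward and backward updates occurred?",
        ],
    }
)
run_answers = pd.DataFrame(
    {
        "topic": ["score choice", "penalty or priors", "search path"],
        "answer_for_this_run": [
            "BIC on the linear Gaussian dataset",
            "default BIC penalty coefficient",
            "see the saved update-history table",
        ],
    }
)

# Left-merge on topic so every checklist row keeps its question,
# with this run's answer alongside it.
answered = checklist.merge(run_answers, on="topic", how="left")
print(answered[["topic", "answer_for_this_run"]])
```

A left merge keeps the checklist as the source of truth: any topic without a recorded answer would surface as a missing value, which is itself a useful reporting signal.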

Artifact Manifest

The final code cell lists the generated figures and tables. This makes the notebook outputs easy to find later.

# Inventory artifacts generated by this notebook.
artifact_rows = []
for folder, artifact_type in [(TABLE_DIR, "table"), (FIGURE_DIR, "figure")]:
    for artifact_path in sorted(folder.glob(f"{NOTEBOOK_PREFIX}_*")):
        artifact_rows.append(
            {
                "artifact_type": artifact_type,
                "file_name": artifact_path.name,
                "relative_path": str(artifact_path.relative_to(NOTEBOOK_DIR)),
                "size_kb": round(artifact_path.stat().st_size / 1024, 1),
            }
        )

artifact_manifest = pd.DataFrame(artifact_rows)
artifact_manifest.to_csv(TABLE_DIR / f"{NOTEBOOK_PREFIX}_artifact_manifest.csv", index=False)
display(artifact_manifest)
artifact_type file_name relative_path size_kb
0 table 08_baseline_ges_edge_classification.csv outputs/tables/08_baseline_ges_edge_classifica... 0.3
1 table 08_baseline_ges_edges.csv outputs/tables/08_baseline_ges_edges.csv 0.3
2 table 08_baseline_ges_messages.csv outputs/tables/08_baseline_ges_messages.csv 0.0
3 table 08_baseline_ges_metrics.csv outputs/tables/08_baseline_ges_metrics.csv 0.2
4 table 08_baseline_ges_score_summary.csv outputs/tables/08_baseline_ges_score_summary.csv 0.1
5 table 08_baseline_ges_update_history.csv outputs/tables/08_baseline_ges_update_history.csv 0.5
6 table 08_bdeu_ges_edge_classification.csv outputs/tables/08_bdeu_ges_edge_classification... 0.4
7 table 08_bdeu_ges_edges.csv outputs/tables/08_bdeu_ges_edges.csv 0.3
8 table 08_bdeu_ges_metrics.csv outputs/tables/08_bdeu_ges_metrics.csv 0.2
9 table 08_bdeu_ges_score_summary.csv outputs/tables/08_bdeu_ges_score_summary.csv 0.1
10 table 08_bic_lambda_sensitivity_edges.csv outputs/tables/08_bic_lambda_sensitivity_edges... 1.1
11 table 08_bic_lambda_sensitivity_metrics.csv outputs/tables/08_bic_lambda_sensitivity_metri... 0.4
12 table 08_bic_lambda_sensitivity_scores.csv outputs/tables/08_bic_lambda_sensitivity_score... 0.2
13 table 08_ges_concept_map.csv outputs/tables/08_ges_concept_map.csv 0.8
14 table 08_ges_reporting_checklist.csv outputs/tables/08_ges_reporting_checklist.csv 0.9
15 table 08_ges_vs_pc_metrics.csv outputs/tables/08_ges_vs_pc_metrics.csv 0.3
16 table 08_linear_gaussian_correlation_matrix.csv outputs/tables/08_linear_gaussian_correlation_... 0.7
17 table 08_linear_gaussian_data_audit.csv outputs/tables/08_linear_gaussian_data_audit.csv 0.6
18 table 08_loaded_dataset_summary.csv outputs/tables/08_loaded_dataset_summary.csv 0.2
19 table 08_mechanism_stress_test_edges.csv outputs/tables/08_mechanism_stress_test_edges.csv 1.4
20 table 08_mechanism_stress_test_metrics.csv outputs/tables/08_mechanism_stress_test_metric... 0.5
21 table 08_mechanism_stress_test_scores.csv outputs/tables/08_mechanism_stress_test_scores... 0.2
22 table 08_package_versions.csv outputs/tables/08_package_versions.csv 0.1
23 table 08_pc_baseline_edge_classification.csv outputs/tables/08_pc_baseline_edge_classificat... 0.3
24 table 08_pc_baseline_edges.csv outputs/tables/08_pc_baseline_edges.csv 0.3
25 table 08_pc_baseline_metrics.csv outputs/tables/08_pc_baseline_metrics.csv 0.3
26 table 08_sample_size_sensitivity_edges.csv outputs/tables/08_sample_size_sensitivity_edge... 1.2
27 table 08_sample_size_sensitivity_metrics.csv outputs/tables/08_sample_size_sensitivity_metr... 0.4
28 table 08_sample_size_sensitivity_scores.csv outputs/tables/08_sample_size_sensitivity_scor... 0.3
29 table 08_scenario_summary_metrics.csv outputs/tables/08_scenario_summary_metrics.csv 1.5
30 table 08_score_function_guide.csv outputs/tables/08_score_function_guide.csv 0.7
31 figure 08_baseline_ges_graph.png outputs/figures/08_baseline_ges_graph.png 66.2
32 figure 08_bdeu_ges_graph.png outputs/figures/08_bdeu_ges_graph.png 66.2
33 figure 08_bic_lambda_sensitivity.png outputs/figures/08_bic_lambda_sensitivity.png 83.6
34 figure 08_ges_vs_pc_metrics.png outputs/figures/08_ges_vs_pc_metrics.png 87.3
35 figure 08_linear_gaussian_correlation_heatmap.png outputs/figures/08_linear_gaussian_correlation... 93.8
36 figure 08_mechanism_stress_test.png outputs/figures/08_mechanism_stress_test.png 78.9
37 figure 08_nonlinear_bic_ges_graph.png outputs/figures/08_nonlinear_bic_ges_graph.png 77.1
38 figure 08_sample_size_sensitivity.png outputs/figures/08_sample_size_sensitivity.png 71.5
39 figure 08_true_dag.png outputs/figures/08_true_dag.png 67.8

This notebook now covers a complete score-based discovery workflow: BIC-GES, a PC comparison, search history, penalty sensitivity, sample-size sensitivity, mechanism mismatch, discrete BDeu scoring, runtime guidance, and reporting artifacts. The next tutorial can go deeper into exact search and score functions on smaller graphs.