00. Getting a Local LLM Running

This setup notebook prepares the environment for the AI for Causal Inference course.

The goal is not to build a production LLM serving stack. The goal is to give us a local pretrained language model that can support lecture examples: drafting estimand cards, critiquing DAGs, generating checklists, summarizing project briefs, and producing memo drafts.

The notebook is designed to be safe for a public portfolio: model downloads and live model loading sit behind explicit switches, deterministic fallbacks keep public rendering reproducible, and no model weights or API keys are committed to the repository.

Learning Goals

By the end of this notebook, you should be able to:

  • Check whether PyTorch can see your GPU.
  • Understand which local model sizes are realistic for lecture notebooks.
  • Load a small instruction-tuned model with transformers when desired.
  • Use a single helper function that can call either a local LLM or a deterministic fallback.
  • Keep public notebooks reproducible even when the local model is unavailable.

Live Model Note

This course treats LLM behavior as an empirical object. These notebooks may include live local-model calls, so outputs can vary across model versions, hardware, decoding settings, prompt wording, package versions, and reruns. That instability is part of the lesson: AI-assisted causal inference requires validation, audit trails, and analyst judgment.

Treat model output as a draft artifact, not as causal evidence. A model may produce valid JSON with weak causal reasoning, or strong prose that fails schema validation.

When live calls are enabled, read the results as experiments about AI behavior:

  • Did the model invent design details?
  • Did it confuse prediction with causation?
  • Did it recommend bad controls?
  • Did it obey the schema?
  • Did it surface missing information?
  • Did it preserve uncertainty?

The goal is not to make every model output perfect. The goal is to learn how to build AI-assisted causal workflows that are auditable, constrained, and reviewed by a human analyst.

1. Local LLM Strategy for This Course

For this course, local LLMs serve three roles:

  1. Pedagogical demos. Show what an LLM can and cannot do in a causal workflow.
  2. Structured artifact generation. Draft estimand cards, variable-role tables, DAG assumptions, checklists, and memos.
  3. Failure-mode analysis. Demonstrate hallucinations, overclaiming, bad adjustment recommendations, and brittle causal reasoning.

We do not need a giant model for every example. A smaller instruction model is enough to demonstrate workflow patterns. But because this environment has access to a high-memory GPU, we can also make model scale part of the course: compare small, medium, and large local models on the same causal reasoning tasks and ask what actually improves.

We will also look beyond a single model family. Qwen gives us a clean scale ladder, but alternative models such as Phi, Mistral, Gemma, and Llama help us ask whether an observed behavior is about model size, model family, instruction tuning, or the prompt itself.

2. Practical Model Choices

A useful setup is to keep model choice as a variable.

| Model | Role in this course | Notes |
| --- | --- | --- |
| Qwen/Qwen2.5-0.5B-Instruct | Smoke test | Confirms the pipeline works quickly; weak causal reasoning |
| Qwen/Qwen2.5-7B-Instruct | Fast local default | Main starting model for everyday examples and quick iteration |
| Qwen/Qwen2.5-14B-Instruct | Strong local analysis model | Better for estimand cards, DAG critique, and memos |
| Qwen/Qwen2.5-32B-Instruct | Scale comparison model | Useful for serious local reasoning and small-vs-large comparisons |
| microsoft/Phi-3.5-mini-instruct | Alternative reasoning model | Compact non-Qwen comparison point |
| mistralai/Mistral-7B-Instruct-v0.3 | Alternative open instruct model | Useful for model-family comparisons at the 7B scale |
| mistralai/Mistral-Small-3.1-24B-Instruct-2503 | Strong non-Qwen comparison | Useful for system-prompt adherence, JSON behavior, and scale comparisons below 32B |
| google/gemma-3-27b-it | Large non-Qwen comparison | Requires the Gemma 3 processor path and torchvision; useful as a large model-family comparison |
| meta-llama/Meta-Llama-3.1-8B-Instruct | Industry-standard instruct baseline | Useful because Llama-family models are common in applied local-LLM workflows |

For this course, start with Qwen/Qwen2.5-7B-Instruct as the fast default, use Qwen/Qwen2.5-14B-Instruct for stronger local drafting, and keep Qwen/Qwen2.5-32B-Instruct for scale comparisons. Notebook 02 will test these choices empirically against Phi, Mistral, Gemma, and Llama models. The important habit is not to assume that bigger automatically means better for causal work.

3. Safety Switches

The two switches below control whether the notebook downloads weights and loads a model into memory. Both default to False in the committed public version:

  • DOWNLOAD_MODEL = False means the notebook will not download model weights.
  • RUN_LOCAL_LLM = False means the notebook will not load the model into memory.

When you are working locally and want to test the model, change both to True and run the notebook interactively. In the run shown here, RUN_LOCAL_LLM has been switched to True, which is why the later cells display live local-model output.

DOWNLOAD_MODEL = False
RUN_LOCAL_LLM = True

LOCAL_SMOKE_TEST_MODEL = 'Qwen/Qwen2.5-0.5B-Instruct'
LOCAL_FAST_MODEL = 'Qwen/Qwen2.5-7B-Instruct'
LOCAL_STRONG_MODEL = 'Qwen/Qwen2.5-14B-Instruct'
LOCAL_SCALE_MODEL = 'Qwen/Qwen2.5-32B-Instruct'

LOCAL_ALT_REASONING_MODEL = 'microsoft/Phi-3.5-mini-instruct'
LOCAL_ALT_OPEN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.3'
LOCAL_MISTRAL_SMALL_MODEL = 'mistralai/Mistral-Small-3.1-24B-Instruct-2503'
LOCAL_GEMMA_MODEL = 'google/gemma-3-27b-it'
LOCAL_LLAMA_MODEL = 'meta-llama/Meta-Llama-3.1-8B-Instruct'

# Choose the model for interactive local runs.
MODEL_ID = LOCAL_FAST_MODEL

# Keep outputs short for teaching notebooks.
MAX_NEW_TOKENS = 256
TEMPERATURE = 0.2

4. Check the Environment

First we check which packages are available and whether PyTorch can see a GPU.

import importlib.util
import os
from functools import lru_cache

import torch
from IPython.display import Markdown, display


def has_package(module_name):
    return importlib.util.find_spec(module_name) is not None


package_status = {
    'torch': has_package('torch'),
    'transformers': has_package('transformers'),
    'sentence_transformers': has_package('sentence_transformers'),
    'accelerate': has_package('accelerate'),
    'bitsandbytes': has_package('bitsandbytes'),
}

package_status
{'torch': True,
 'transformers': True,
 'sentence_transformers': True,
 'accelerate': True,
 'bitsandbytes': False}
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

device_info = {
    'torch_version': torch.__version__,
    'device': DEVICE,
    'cuda_available': torch.cuda.is_available(),
    'cuda_device_count': torch.cuda.device_count(),
    'cuda_device_name': torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none',
}

device_info
{'torch_version': '2.11.0+cu130',
 'device': 'cuda',
 'cuda_available': True,
 'cuda_device_count': 1,
 'cuda_device_name': 'NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition'}

Discussion

If cuda_available is True, the notebook will use the GPU for local model inference. If it is False, the notebook still works, but local generation may be slow.

For public rendering, it is fine if this notebook shows CPU. In your local terminal, you already checked that torch.cuda.is_available() returns True, so local interactive runs should be able to use the GPU.
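
Model-size planning depends mostly on available GPU memory. The optional check below is a small sketch (not part of the required setup) that reports total VRAM using standard PyTorch calls.

# Optional: report total GPU memory, which determines which local model sizes are realistic.
if torch.cuda.is_available():
    total_vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f'Total GPU memory: {total_vram_gb:.0f} GB')
else:
    print('No CUDA device detected; plan for CPU-only generation.')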

5. Optional Install Notes

You already installed the core packages for this course. For basic local generation, the important packages are:

uv add torch transformers sentence-transformers

For larger models, especially 14B and 32B, accelerate is strongly recommended because it enables safer low-memory loading and automatic device placement:

uv add accelerate

Gemma 3 uses the official processor-based loading path in transformers, and that path imports torchvision even for text-only prompts:

uv add torchvision

For quantized local inference, bitsandbytes can help, but it is more sensitive to CUDA and platform details:

uv add bitsandbytes

We will not require quantization for the first pass of this course. With 96 GB VRAM, 7B and 14B should be comfortable, and 24B, 27B, and 32B models should be realistic in half precision or bfloat16 if the surrounding system memory and CUDA setup cooperate.
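
If you do experiment with quantization later, the usual transformers pattern is to pass a BitsAndBytesConfig at load time. The cell below is an illustrative sketch only: it assumes bitsandbytes installs cleanly on your CUDA setup, it is guarded by the existing switches, and nothing else in this course depends on it.

# Hedged sketch: 4-bit quantized loading via bitsandbytes (optional, not required for this course).
if has_package('bitsandbytes') and RUN_LOCAL_LLM:
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    # Note: this still downloads the 32B weights if they are not already cached.
    quantized_model = AutoModelForCausalLM.from_pretrained(
        LOCAL_SCALE_MODEL,
        quantization_config=quant_config,
        device_map='auto',
    )
else:
    print('Skipping the quantization sketch: bitsandbytes unavailable or local LLM disabled.')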

6. Optional Model Download

The next cell downloads the model only if DOWNLOAD_MODEL = True. This keeps the notebook safe to render in environments without internet access.

if DOWNLOAD_MODEL:
    from huggingface_hub import snapshot_download

    model_path = snapshot_download(repo_id=MODEL_ID)
    print(f'Downloaded model to: {model_path}')
else:
    print('DOWNLOAD_MODEL is False. Skipping model download.')
DOWNLOAD_MODEL is False. Skipping model download.

7. Load a Local Instruction Model

The loading function below keeps model initialization lazy. Nothing is loaded unless RUN_LOCAL_LLM = True and you call local_generate.

This direct transformers approach is simple and avoids requiring a separate model server. For larger local models, a server such as Ollama, llama.cpp, or vLLM can be cleaner, but direct loading is best for a first course notebook.
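
For reference, the shared helper in notebooks/_shared/local_llm.py wraps roughly the following direct transformers pattern. The cell below is a hedged sketch of that pattern, not the helper's exact code; the SHOW_DIRECT_LOADING_SKETCH switch and the sketch_* names exist only for this illustration, and the real implementation is the shared module imported afterward.

# Hypothetical switch for this sketch only; the shared helper below is what the course actually uses.
SHOW_DIRECT_LOADING_SKETCH = False

if SHOW_DIRECT_LOADING_SKETCH:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    sketch_tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    sketch_model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # half precision keeps 7B-32B models within VRAM
        device_map='auto',           # automatic placement; works best with accelerate installed
    )
    messages = [{'role': 'user', 'content': 'Name two likely confounders for outreach and churn.'}]
    input_ids = sketch_tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors='pt'
    ).to(sketch_model.device)
    output_ids = sketch_model.generate(
        input_ids, max_new_tokens=MAX_NEW_TOKENS, temperature=TEMPERATURE, do_sample=True
    )
    print(sketch_tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))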

from pathlib import Path
import sys


def find_project_root(start=None):
    start = Path(start or Path.cwd()).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / 'pyproject.toml').exists() and (candidate / 'notebooks').exists():
            return candidate
    return start


PROJECT_ROOT = find_project_root()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from notebooks._shared.local_llm import (
    DEFAULT_MODELS_TO_COMPARE,
    build_chat_inputs,
    clean_generated_text,
    clear_loaded_model_cache,
    decode_generated_response,
    format_chat_prompt,
    get_device,
    has_package,
    load_local_model as _shared_load_local_model,
    local_chat as _shared_local_chat,
    move_inputs_to_model_device,
    prepare_chat_inputs,
    set_generation_seed,
)

DEVICE = get_device()


def load_local_model(model_id=MODEL_ID):
    return _shared_load_local_model(model_id)


def decode_generated_text(tokenizer, generated, prompt_token_count, model_id=MODEL_ID):
    return decode_generated_response(tokenizer, generated, prompt_token_count, model_id=model_id)


def local_chat(user_message, system_message=None, model_id=MODEL_ID, max_new_tokens=MAX_NEW_TOKENS, temperature=TEMPERATURE):
    enabled = globals().get('RUN_LIVE_LOCAL_LLM', globals().get('RUN_LOCAL_LLM', True))
    return _shared_local_chat(
        user_message,
        system_message=system_message,
        model_id=model_id,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        seed=globals().get('SEED', 123),
        enabled=enabled,
    )


def local_generate(user_message, system_message=None, max_new_tokens=MAX_NEW_TOKENS, temperature=TEMPERATURE):
    return _shared_local_chat(
        user_message,
        system_message=system_message,
        model_id=MODEL_ID,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        seed=globals().get('SEED', 123),
        enabled=globals().get('RUN_LOCAL_LLM', globals().get('RUN_LIVE_LOCAL_LLM', True)),
    )

8. Reproducible Fallback for Public Notebooks

Later notebooks should not fail just because a local model is unavailable. The helper below returns a deterministic fallback when local generation is disabled.

This is the pattern we will use throughout Course 05.

def deterministic_fallback(prompt_name):
    examples = {
        'estimand_card': """
Treatment: proactive customer-success outreach.
Outcome: churn within 90 days.
Unit: customer account.
Target population: eligible active accounts.
Main risks: confounding by account health, manager targeting, post-treatment usage controls.
""".strip(),
        'dag_critique': """
Likely confounders include pre-treatment account health, prior usage, support-ticket history, segment, tenure, and region.
Avoid adjusting for post-outreach engagement if the estimand is the total effect of outreach.
""".strip(),
    }
    return examples.get(prompt_name, 'Fallback response: local LLM disabled for reproducible rendering.')


def causal_llm(prompt, prompt_name='default', system_message=None):
    if RUN_LOCAL_LLM:
        return local_generate(prompt, system_message=system_message)
    return deterministic_fallback(prompt_name)


test_prompt = 'Draft a short estimand card for evaluating whether proactive outreach reduces churn.'
print(causal_llm(test_prompt, prompt_name='estimand_card'))
**Estimand Card: Proactive Outreach Effect on Churn Reduction**

- **Target Population:** Customers who are at risk of churning.
- **Intervention:** Proactive outreach (e.g., phone calls, emails, SMS) aimed at retaining customers.
- **Comparator:** No proactive outreach or standard retention efforts.
- **Outcome of Interest:** Churn rate over a specified time period (e.g., 3 months).
- **Time Frame:** [Specify the exact time frame, e.g., from January to March 2023].
- **Definition of Churn:** Customers who cancel their subscription or service within the specified time frame.
- **Assumptions:**
 - The intervention is applied uniformly across the target population.
 - There is no interference between individuals in the target population.
 - Attrition is negligible and does not bias the results.
- **Causal Question:** Does proactive outreach reduce the churn rate among customers at risk of churning compared to no proactive outreach?
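
The output above came from the live local model because RUN_LOCAL_LLM is True in this run. When the switch is False, causal_llm returns the canned text instead; you can preview that branch directly without touching the switch:

# Preview the deterministic branch that public renders fall back to.
print(deterministic_fallback('dag_critique'))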

9. Optional Live Local LLM Smoke Test

To run a real local generation test:

  1. Set DOWNLOAD_MODEL = True and run the download cell once, or let transformers download during model loading.
  2. Set RUN_LOCAL_LLM = True.
  3. Run the cell below.

The first run may take time because model weights need to download and load into memory.

if RUN_LOCAL_LLM:
    response = local_generate(
        user_message=(
            'A product team asks: Did our proactive outreach reduce churn? '
            'Return three causal design questions the analyst should ask before estimating anything.'
        ),
        max_new_tokens=180,
    )
    display(Markdown(response))
else:
    print('RUN_LOCAL_LLM is False. Skipping live generation test.')
  1. How was the proactive outreach treatment assigned to customers?
  2. What are the key characteristics of customers who received proactive outreach compared to those who did not?
  3. Are there any time periods or external events that could confound the relationship between proactive outreach and churn?
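
Because model choice is kept as a variable (section 2), the same diagnostic prompt can be replayed across several local models. The cell below is a minimal, illustrative version of the kind of comparison Notebook 02 will run; it uses the local_chat helper defined above and only executes when RUN_LOCAL_LLM is True.

# Hedged sketch: replay one causal prompt across two local models and compare the drafts.
comparison_models = [LOCAL_SMOKE_TEST_MODEL, LOCAL_FAST_MODEL]
comparison_prompt = 'List the three biggest threats to identifying the effect of proactive outreach on churn.'

if RUN_LOCAL_LLM:
    for compare_model_id in comparison_models:
        draft = local_chat(comparison_prompt, model_id=compare_model_id, max_new_tokens=120)
        print(f'--- {compare_model_id} ---')
        print(draft)
else:
    print('RUN_LOCAL_LLM is False. Skipping the model comparison sketch.')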

10. Local Embedding Model Check

For RAG notebooks, embeddings are often more important than text generation. sentence-transformers can run on the GPU when one is available.

RUN_EMBEDDING_SMOKE_TEST = True

if RUN_EMBEDDING_SMOKE_TEST:
    from sentence_transformers import SentenceTransformer

    embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=DEVICE)
    texts = [
        'The treatment is proactive outreach.',
        'The outcome is churn within ninety days.',
        'A mediator should not be adjusted for when estimating the total effect.',
    ]
    embeddings = embedding_model.encode(texts, normalize_embeddings=True)
    print(embeddings.shape)
else:
    print('RUN_EMBEDDING_SMOKE_TEST is False. Skipping embedding model load.')
(3, 384)
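
Because the embeddings are normalized, cosine similarity is just a dot product. The quick follow-up below (assuming the smoke test above ran, so embeddings exists) shows how close the three example sentences sit in embedding space:

# Cosine similarity between the example texts (valid because normalize_embeddings=True above).
if RUN_EMBEDDING_SMOKE_TEST:
    similarity = embeddings @ embeddings.T
    print(similarity.round(3))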

11. Cache and Storage Notes

Local models can be large. A 32B model can require many tens of gigabytes of disk cache and substantial memory during loading. Hugging Face downloads are usually stored outside the project directory, commonly under ~/.cache/huggingface/. That is good because model weights should not be committed to GitHub.
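
If you want to see what is actually cached, huggingface_hub ships a cache scanner. The sketch below is optional and assumes a recent huggingface_hub release; it reports the total cache size and the largest cached repositories.

# Optional: inspect the local Hugging Face cache (largest repositories first).
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()
print(f'Total cache size: {cache_info.size_on_disk / 1e9:.1f} GB')
for repo in sorted(cache_info.repos, key=lambda r: r.size_on_disk, reverse=True)[:5]:
    print(f'{repo.repo_id}: {repo.size_on_disk / 1e9:.1f} GB')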

For this portfolio repository:

  • keep notebooks and lightweight outputs in the repo;
  • keep downloaded model weights out of the repo;
  • do not commit API keys or local .env files;
  • use deterministic fallbacks when rendering public pages.

12. Key Takeaways

  • Yes, this course can run with a local pretrained LLM.
  • Qwen/Qwen2.5-7B-Instruct is the starting default for this course because it is strong enough for causal artifacts and fast enough for iteration.
  • Small instruction models are enough for teaching AI-assisted causal workflows, but 7B, 14B, 24B, 27B, and 32B models let us study how scale changes causal reasoning quality.
  • Gemma 3 needs a processor-based loading path and torchvision; not all local instruct models use the same loading pattern.
  • Use GPU when available, but keep CPU-safe fallbacks.
  • Do not make public notebook rendering depend on a model download or API key.
  • Later notebooks should call a shared helper such as causal_llm, so we can switch between local LLMs, API LLMs, and deterministic examples without rewriting the lecture logic.
  • Model scale will be treated as an empirical question: bigger models may write better and catch more issues, but they can still overclaim, hallucinate, or recommend invalid adjustment sets.
  • Model family will also be treated as an empirical question: Qwen, Phi, Mistral, Gemma, and Llama models may fail in different ways on the same causal prompt.

The next notebook starts the main course sequence with the AI-assisted causal workflow.