Pipeline Flow¶

This page walks through the complete execution path from specification to output.

1. Specification Loading¶

When you create a Choregraph instance with an XML spec (or build one programmatically), the Parser converts it into a ChoregraphSpec — a set of Python dataclasses:

graph LR
    XML["XML string / file"] --> Parser["ChoregraphSpecParser.parse()"]
    Parser --> Spec["ChoregraphSpec"]
    Spec --> Inputs["List[InputSpec]"]
    Spec --> Nodes["List[NodeSpec]"]
    Spec --> Outputs["List[OutputSpec]"]

Each NodeSpec contains:

Input ports — either connected to a data source (source_ref) or carrying a static parameter (value)
Output ports — with unique IDs, labels, and visibility flags

2. Project Generation¶

The Wrapper (ManagedProjectBuilder) generates a complete Kedro project in a .viz_wrapper directory:

Generated File	Purpose
`pyproject.toml`	Kedro project metadata
`settings.py`	Kedro configuration (hooks, plugins)
`catalog.yml`	Dataset definitions (CSV, Parquet, Memory)
`pipeline_registry.py`	Pipeline module with node wiring

Files are only written when their content changes (via _write_if_changed), preventing unnecessary Kedro Viz reloads.

3. Pipeline Building¶

The Builder converts the spec into a Kedro Pipeline object:

For each NodeSpec, look up the function in TRANSFORM_REGISTRY
Convert static port values to Python types using the XSD catalogue (float, int, bool, list)
Resolve source_ref connections to Kedro dataset names
Create Kedro node() calls with correct inputs, outputs, and function references

4. Execution¶

Choregraph.run() proceeds as follows:

Hash check — compute a hash from the XML spec content and input file modification times
Short-circuit — if hash matches the last run, return cached data immediately
Generate — call the Wrapper and Builder to produce Kedro project files and pipeline
Run — create a KedroSession and execute with SequentialRunner
Cache — store all output datasets in _data_cache and extract metadata into _metadata_cache

5. Data Access¶

After execution, datasets can be accessed through several methods:

get_dataset(data_id) — looks up by ID in spec, then tries cache, parquet files, and catalog
list_data() — returns all available dataset names, including multi-table Excel outputs
get_field_uniques() — returns categorical field values from the metadata cache
get_datasets_metadata() — returns full field-level statistics

6. DIVE Export¶

The DiveConnector translates cached data and metadata into VisuSpec XML:

Determine which datasets to include via _get_outputs_allow_list()
Extract field metadata (types, min/max, distinct counts, unique values)
Generate specifications.xml with <rawData> and <fields> sections
Optionally merge with an existing specifications file via update_visuspec_xml()