Pipeline Flow
This page walks through the complete execution path from specification to output.
1. Specification Loading
When you create a Choregraph instance with an XML spec (or build one programmatically), the Parser converts it into a ChoregraphSpec — a set of Python dataclasses:
graph LR
XML["XML string / file"] --> Parser["ChoregraphSpecParser.parse()"]
Parser --> Spec["ChoregraphSpec"]
Spec --> Inputs["List[InputSpec]"]
Spec --> Nodes["List[NodeSpec]"]
Spec --> Outputs["List[OutputSpec]"]
Each NodeSpec contains:
- Input ports: each is either connected to a data source (source_ref) or carries a static parameter (value)
- Output ports: each has a unique ID, a label, and a visibility flag
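To make the port model concrete, here is a minimal sketch of how such dataclasses might look. The class layout and field names (name, transform, visible, and so on) are assumptions for illustration; only source_ref, value, and the ID/label/visibility attributes come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InputPort:
    # Hypothetical shape: exactly one of source_ref / value should be set.
    name: str
    source_ref: Optional[str] = None  # ID of the dataset feeding this port
    value: Optional[str] = None       # static parameter, kept as a raw string


@dataclass
class OutputPort:
    id: str
    label: str
    visible: bool = True


@dataclass
class NodeSpec:
    id: str
    transform: str  # hypothetical: name of the function the node runs
    inputs: List[InputPort] = field(default_factory=list)
    outputs: List[OutputPort] = field(default_factory=list)


def check_ports(node: NodeSpec) -> None:
    """Reject input ports that are both connected and static, or neither."""
    for port in node.inputs:
        if (port.source_ref is None) == (port.value is None):
            raise ValueError(f"{node.id}.{port.name}: set source_ref or value")
    ids = [p.id for p in node.outputs]
    if len(ids) != len(set(ids)):
        raise ValueError(f"{node.id}: duplicate output port IDs")


node = NodeSpec(
    id="filter_rows",
    transform="filter",
    inputs=[
        InputPort(name="data", source_ref="raw_sales"),  # connected port
        InputPort(name="threshold", value="0.5"),        # static parameter
    ],
    outputs=[OutputPort(id="filtered_sales", label="Filtered sales")],
)
check_ports(node)  # passes: every input port has exactly one source
```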
2. Project Generation
The Wrapper (ManagedProjectBuilder) generates a complete Kedro project in a .viz_wrapper directory:
| Generated File | Purpose |
|---|---|
| pyproject.toml | Kedro project metadata |
| settings.py | Kedro configuration (hooks, plugins) |
| catalog.yml | Dataset definitions (CSV, Parquet, Memory) |
| pipeline_registry.py | Pipeline module with node wiring |
Files are only written when their content changes (via _write_if_changed), preventing unnecessary Kedro Viz reloads.
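The write-only-on-change behaviour can be sketched as follows. This is an assumption about how a helper like _write_if_changed might work, not the project's actual code; the function name, signature, and file content below are illustrative.

```python
import tempfile
from pathlib import Path


def write_if_changed(path: Path, content: str) -> bool:
    """Write content to path only if it differs; return True when written."""
    if path.exists() and path.read_text(encoding="utf-8") == content:
        return False  # identical content: leave the file (and its mtime) alone
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / ".viz_wrapper" / "catalog.yml"
    results = [
        write_if_changed(target, "# placeholder catalog\n"),    # new file
        write_if_changed(target, "# placeholder catalog\n"),    # unchanged
        write_if_changed(target, "# placeholder catalog v2\n"),  # changed
    ]
```

Because an unchanged file keeps its modification time, a file watcher that reacts to timestamp changes has nothing to pick up on the second call.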
3. Pipeline Building
The Builder converts the spec into a Kedro Pipeline object:
- For each NodeSpec, look up the function in TRANSFORM_REGISTRY
- Convert static port values to Python types using the XSD catalogue (float, int, bool, list)
- Resolve source_ref connections to Kedro dataset names
- Create Kedro node() calls with correct inputs, outputs, and function references
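The steps above can be sketched without Kedro installed. In this illustration the registry contents, the dictionary-based node spec, the port_types mapping (standing in for the XSD catalogue), and the conversion rules are all assumptions. The result is a plain descriptor whose func, inputs, and outputs keys mirror the arguments that Kedro's node() accepts.

```python
import functools
from typing import Any, Callable, Dict, List


def scale(data: List[float], factor: float) -> List[float]:
    """Example transform: multiply every value by a static factor."""
    return [x * factor for x in data]


# Hypothetical registry mapping transform names to functions.
TRANSFORM_REGISTRY: Dict[str, Callable[..., Any]] = {"scale": scale}


def convert_value(raw: str, type_name: str) -> Any:
    """Convert a raw string from the spec into a Python value."""
    if type_name == "float":
        return float(raw)
    if type_name == "int":
        return int(raw)
    if type_name == "bool":
        return raw.strip().lower() in ("true", "1")
    if type_name == "list":
        return [item.strip() for item in raw.split(",") if item.strip()]
    return raw  # unknown types stay as strings


def build_node(node_spec: Dict[str, Any], port_types: Dict[str, str]) -> Dict[str, Any]:
    func = TRANSFORM_REGISTRY[node_spec["transform"]]  # step 1: registry lookup
    dataset_inputs: Dict[str, str] = {}
    static_params: Dict[str, Any] = {}
    for port in node_spec["inputs"]:
        if "source_ref" in port:
            # step 3: connected port, resolved to a dataset name
            dataset_inputs[port["name"]] = port["source_ref"]
        else:
            # step 2: static port, converted to its declared type
            static_params[port["name"]] = convert_value(
                port["value"], port_types[port["name"]]
            )
    # step 4: bind static parameters so only datasets remain as inputs
    return {
        "func": functools.partial(func, **static_params),
        "inputs": dataset_inputs,
        "outputs": node_spec["outputs"],
    }


descriptor = build_node(
    {
        "transform": "scale",
        "inputs": [
            {"name": "data", "source_ref": "raw_sales"},
            {"name": "factor", "value": "2.5"},
        ],
        "outputs": ["scaled_sales"],
    },
    port_types={"factor": "float"},
)
scaled = descriptor["func"](data=[1.0, 2.0])
```

Binding static values with functools.partial is one possible design; it keeps the dataset inputs and the configuration values separate, so the runner only has to supply data.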
4. Execution
Choregraph.run() proceeds as follows:
- Hash check: compute a hash from the XML spec content and the input file modification times
- Short-circuit: if the hash matches the last run, return the cached data immediately
- Generate: call the Wrapper and Builder to produce the Kedro project files and pipeline
- Run: create a KedroSession and execute with SequentialRunner
- Cache: store all output datasets in _data_cache and extract metadata into _metadata_cache
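The hash check and short-circuit can be illustrated with a small, Kedro-free sketch. The choice of SHA-256, the CachedRunner class, and the stand-in "execution" are assumptions made for this example; only the inputs to the hash (spec content plus file modification times) come from the description above.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable, Optional


def compute_run_hash(xml_text: str, input_paths: Iterable[Path]) -> str:
    """Hash the spec content together with each input file's mtime."""
    digest = hashlib.sha256(xml_text.encode("utf-8"))
    for path in sorted(input_paths):
        digest.update(f"{path}:{path.stat().st_mtime_ns}".encode("utf-8"))
    return digest.hexdigest()


class CachedRunner:
    """Hypothetical stand-in for the run() short-circuit logic."""

    def __init__(self) -> None:
        self._last_hash: Optional[str] = None
        self._data_cache: Dict[str, Any] = {}
        self.executions = 0

    def run(self, xml_text: str, input_paths: Iterable[Path]) -> Dict[str, Any]:
        run_hash = compute_run_hash(xml_text, list(input_paths))
        if run_hash == self._last_hash:
            return self._data_cache  # nothing changed: reuse cached outputs
        self.executions += 1  # placeholder for the real pipeline execution
        self._data_cache = {"result": f"run #{self.executions}"}
        self._last_hash = run_hash
        return self._data_cache


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "input.csv"
    data_file.write_text("a,b\n1,2\n", encoding="utf-8")
    runner = CachedRunner()
    spec_xml = "<spec/>"  # placeholder spec content

    runner.run(spec_xml, [data_file])
    runner.run(spec_xml, [data_file])  # same hash: no second execution
    runs_before_touch = runner.executions

    # Move the modification time forward by one second to simulate an edit.
    stat = data_file.stat()
    os.utime(data_file, ns=(stat.st_atime_ns, stat.st_mtime_ns + 1_000_000_000))
    runner.run(spec_xml, [data_file])  # mtime changed: executes again
    runs_after_touch = runner.executions
```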
5. Data Access
After execution, datasets can be accessed through several methods:
- get_dataset(data_id): looks up the ID in the spec, then tries the cache, Parquet files, and the catalog
- list_data(): returns all available dataset names, including multi-table Excel outputs
- get_field_uniques(): returns categorical field values from the metadata cache
- get_datasets_metadata(): returns full field-level statistics
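The lookup order described for get_dataset can be pictured as a fallback chain. The sketch below is illustrative only: the function signature, the dictionaries standing in for the cache, Parquet files, and catalog, and the returned source name are all assumptions.

```python
from typing import Any, Callable, List, Optional, Set, Tuple

Loader = Callable[[str], Optional[Any]]


def get_dataset(
    data_id: str, spec_ids: Set[str], loaders: List[Tuple[str, Loader]]
) -> Tuple[str, Any]:
    """Return (source_name, data) from the first source that has the dataset."""
    if data_id not in spec_ids:
        raise KeyError(f"unknown dataset ID: {data_id}")
    for source_name, load in loaders:
        data = load(data_id)
        if data is not None:
            return source_name, data
    raise LookupError(f"dataset {data_id} is not available in any source")


# Dictionaries stand in for the real storage layers.
cache = {"sales": [1, 2, 3]}
parquet_files = {"customers": ["alice", "bob"]}
catalog: dict = {}

loaders: List[Tuple[str, Loader]] = [
    ("cache", cache.get),            # fastest source first
    ("parquet", parquet_files.get),
    ("catalog", catalog.get),
]
spec_ids = {"sales", "customers", "orders"}

from_cache = get_dataset("sales", spec_ids, loaders)
from_parquet = get_dataset("customers", spec_ids, loaders)
```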
6. DIVE Export
The DiveConnector translates cached data and metadata into VisuSpec XML:
- Determine which datasets to include via _get_outputs_allow_list()
- Extract field metadata (types, min/max, distinct counts, unique values)
- Generate specifications.xml with <rawData> and <fields> sections
- Optionally merge with an existing specifications file via update_visuspec_xml()
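As a rough illustration of the first three steps, the sketch below filters datasets by an allow list, derives simple field metadata, and emits XML. It does not reproduce the real VisuSpec schema: the root element, the nesting of fields under rawData, and every attribute name are assumptions, and datasets are modelled as plain lists of dictionaries.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

Rows = List[Dict[str, Any]]


def field_metadata(rows: Rows) -> Dict[str, Dict[str, Any]]:
    """Derive type, distinct count, and min/max or unique values per field."""
    meta: Dict[str, Dict[str, Any]] = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        distinct = sorted(set(values), key=str)
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        info: Dict[str, Any] = {
            "type": "numeric" if numeric else "categorical",
            "distinct": len(distinct),
        }
        if numeric:
            info["min"], info["max"] = min(values), max(values)
        else:
            info["uniques"] = [str(v) for v in distinct]
        meta[name] = info
    return meta


def build_spec_xml(datasets: Dict[str, Rows], allow_list: List[str]) -> str:
    root = ET.Element("specifications")  # hypothetical root element
    for dataset_id in allow_list:
        if dataset_id not in datasets:
            continue  # allow-listed but never produced: skip it
        raw = ET.SubElement(root, "rawData", {"id": dataset_id})
        fields = ET.SubElement(raw, "fields")
        for name, info in field_metadata(datasets[dataset_id]).items():
            attrs = {"name": name, "type": info["type"], "distinct": str(info["distinct"])}
            if info["type"] == "numeric":
                attrs["min"], attrs["max"] = str(info["min"]), str(info["max"])
            ET.SubElement(fields, "field", attrs)
    return ET.tostring(root, encoding="unicode")


datasets = {
    "sales": [
        {"region": "north", "price": 1.5},
        {"region": "south", "price": 4.0},
    ],
    "scratch": [{"x": 1}],  # not on the allow list, so it is left out
}
sales_meta = field_metadata(datasets["sales"])
xml_text = build_spec_xml(datasets, allow_list=["sales"])
```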