Choregraph

Graph-based data processing powered by Kedro

Choregraph is a Python library that turns declarative XML pipeline specifications into executable Kedro data pipelines. It provides 50 built-in transform functions, geolocation enrichment, NLP text processing, and direct export to the DIVE visualization kernel.

Key Features

XML-driven pipelines — Define inputs, transforms, and outputs in a portable XML specification.
50 transform functions — Filtering, aggregation, column/row operations, joins, normalization, discretization, and more.
Geo collection — Geocode location names to coordinates; join country boundary polygons for map visualizations.
NLP collection — Multi-label binarization with automatic language detection, lemmatization, and fuzzy matching.
Excel intelligence — LLM-assisted detection and tidying of messy multi-table spreadsheets.
DIVE integration — Export pipeline results to VisuSpec XML for the DIVE C++ visualization kernel.
Kedro Viz — Built-in pipeline visualization server with custom styling.
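
To make the NLP bullet concrete: multi-label binarization turns a cell holding several labels into one 0/1 indicator per label, and fuzzy matching maps misspelled labels onto a known vocabulary. The sketch below uses only the standard library (`difflib`) and is not Choregraph's implementation. The function name `binarize_labels`, the `;` separator, and the `cutoff` value are assumptions chosen for illustration, and language detection and lemmatization are left out.

```python
from difflib import get_close_matches


def binarize_labels(cells, vocabulary, sep=";", cutoff=0.8):
    """Return one dict of 0/1 indicators per cell, keyed by vocabulary label."""
    rows = []
    for cell in cells:
        row = {label: 0 for label in vocabulary}
        for raw in cell.split(sep):
            token = raw.strip().lower()
            if not token:
                continue
            # Fuzzy step: take the closest known label if it is similar enough.
            match = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
            if match:
                row[match[0]] = 1
        rows.append(row)
    return rows


vocab = ["python", "kedro", "pandas"]
cells = ["Python; Kedro", "pandas;pyhton", "excel"]
for row in binarize_labels(cells, vocab):
    print(row)
```

In this toy input the misspelled `pyhton` is close enough to `python` to be counted, while `excel` matches nothing and yields an all-zero row. A lower `cutoff` accepts more typos at the cost of more false matches.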

Architecture Overview

graph LR
    A["XML Spec"] --> B["Parser"]
    B --> C["ChoregraphSpec"]
    C --> D["Builder"]
    C --> F["Wrapper"]
    D --> E["Kedro Pipeline"]
    F --> G["Kedro Project Files"]
    E --> H["Runner"]
    H --> I["Outputs"]
    I --> J["DiveConnector"]
    J --> K["VisuSpec XML"]

Component	Role
Parser	Converts XML specification into in-memory `ChoregraphSpec` dataclasses
Builder	Translates `ChoregraphSpec` into a Kedro `Pipeline` with wired nodes
Wrapper	Generates a full Kedro project directory (catalog, settings, registry)
Library	Registry of 50 transform functions consumed by the builder
DiveConnector	Exports data and metadata to DIVE-compatible VisuSpec XML
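
The parse, spec, build-and-run sequence in the table can be sketched in plain Python. This is not Choregraph's code: the XML element and attribute names, the `NodeSpec` and `PipelineSpec` dataclasses, the `TRANSFORMS` registry, and `run_spec` are hypothetical stand-ins, and the sketch calls the functions directly instead of building a Kedro `Pipeline`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical XML layout invented for this sketch (not Choregraph's schema).
SPEC_XML = """
<pipeline>
  <input id="sales"/>
  <node id="big" type="filter_min" source="sales" column="revenue" value="100"/>
  <node id="ranked" type="sort_desc" source="big" column="revenue"/>
</pipeline>
"""


@dataclass
class NodeSpec:
    id: str
    type: str
    source: str
    column: str
    value: str = ""


@dataclass
class PipelineSpec:
    inputs: list = field(default_factory=list)
    nodes: list = field(default_factory=list)


def parse_spec(xml_text):
    """Parser step: XML text -> in-memory dataclasses."""
    root = ET.fromstring(xml_text)
    spec = PipelineSpec()
    spec.inputs = [el.attrib["id"] for el in root.findall("input")]
    spec.nodes = [NodeSpec(**el.attrib) for el in root.findall("node")]
    return spec


# Library step: a registry mapping a node type to a plain function.
TRANSFORMS = {
    "filter_min": lambda rows, column, value: [
        r for r in rows if r[column] >= float(value)
    ],
    "sort_desc": lambda rows, column, value: sorted(
        rows, key=lambda r: r[column], reverse=True
    ),
}


def run_spec(spec, datasets):
    """Builder/runner step: resolve each node's function and wire outputs by id."""
    results = dict(datasets)
    for node in spec.nodes:  # assumes nodes are listed in dependency order
        func = TRANSFORMS[node.type]
        results[node.id] = func(results[node.source], node.column, node.value)
    return results


sales = [
    {"region": "N", "revenue": 250.0},
    {"region": "S", "revenue": 80.0},
    {"region": "E", "revenue": 400.0},
    {"region": "W", "revenue": 120.0},
]
outputs = run_spec(parse_spec(SPEC_XML), {"sales": sales})
print([r["region"] for r in outputs["ranked"]])  # ['E', 'N', 'W']
```

Keeping the specification as plain dataclasses is what lets two consumers share it in the diagram: one turns it into an executable pipeline, the other into project files.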

Quick Example

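# Assumption: InputPortSpec is exported at the package top level;
# adjust the import path if it lives in a submodule.
from choregraph import Choregraph, InputPortSpec

cg = Choregraph()

# Register the CSV file as a named pipeline input.
cg.add_input(id="sales", location="data/sales.csv", format="CSV")

# Add a node that keeps the 10 rows with the highest revenue.
cg.add_node(
    id="top10",
    type="get_top_n",
    input_ports=[
        InputPortSpec(name="df", source_ref="sales"),
        InputPortSpec(name="column", value="revenue"),
        InputPortSpec(name="n", value="10"),
    ],
)

# Execute the pipeline, then fetch the node's output dataset.
result = cg.run()
df = cg.get_dataset("top10_result")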

See the Quick Start guide for a full walkthrough.
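
For readers unfamiliar with top-N selection, the following is a minimal pure-Python sketch of the operation the example requests: keep the `n` rows with the largest values in one column. It is an illustration only, not the library's `get_top_n`. The helper name `top_n_rows` and the list-of-dicts row format are assumptions, and the `int(n)` coercion simply mirrors the fact that the example passes `n` as the string `"10"`.

```python
import heapq


def top_n_rows(rows, column, n):
    """Return the n rows with the largest value in `column`, largest first."""
    n = int(n)  # the example passes port values as strings, so coerce here
    return heapq.nlargest(n, rows, key=lambda row: row[column])


sales = [
    {"customer": "A", "revenue": 120.0},
    {"customer": "B", "revenue": 560.0},
    {"customer": "C", "revenue": 90.0},
    {"customer": "D", "revenue": 310.0},
]
top2 = top_n_rows(sales, column="revenue", n="2")
print([row["customer"] for row in top2])  # ['B', 'D']
```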