XML Specification

Choregraph pipelines are defined using an XML format validated against TransformGraph.xsd.

Structure

A pipeline specification has two sections inside the root <choregraph> element:

<choregraph>
    <inputs>
        <!-- Data source definitions -->
    </inputs>
    <pipeline>
        <!-- Transform node definitions -->
    </pipeline>
</choregraph>
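
To see how the sections fit together, the sketch below reads a small specification with Python's standard xml.etree.ElementTree module and lists what each section declares. It is an illustration only, not Choregraph's own parser: the helper name summarize_spec is hypothetical, and no validation against TransformGraph.xsd is attempted.

```python
import xml.etree.ElementTree as ET

# A trimmed-down specification using only elements shown in this document.
SPEC = """
<choregraph>
    <inputs>
        <input id="sales" location="data/sales.csv" format="CSV" />
    </inputs>
    <pipeline>
        <node id="top_sales" type="get_top_n">
            <inputPort name="df" source_ref="sales" />
            <outputPort id="101" name="result" />
        </node>
    </pipeline>
</choregraph>
"""


def summarize_spec(xml_text):
    """Return the input IDs and node IDs declared in a specification.

    Hypothetical helper for illustration; it does not validate the XML.
    """
    root = ET.fromstring(xml_text)
    input_ids = [el.get("id") for el in root.findall("./inputs/input")]
    node_ids = [el.get("id") for el in root.findall("./pipeline/node")]
    return {"inputs": input_ids, "nodes": node_ids}


summary = summarize_spec(SPEC)
print(summary)
```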

Inputs

Each <input> declares a data source:

<inputs>
    <input id="sales"
           label="Sales Data"
           location="data/sales.csv"
           format="CSV"
           visibility="true"
           fieldSeparator=","
           header="true" />

    <input id="regions"
           label="Region Lookup"
           location="data/regions.csv"
           format="CSV" />
</inputs>

| Attribute | Required | Description |
|---|---|---|
| id | Yes | Unique identifier (referenced by nodes) |
| label | No | Human-readable name (auto-generated from ID if omitted) |
| location | Yes | File path or URL |
| format | Yes | Data format: CSV, JSON, EXCEL |
| visibility | No | Whether to expose in visualization (true/false) |
| fieldSeparator | No | CSV column delimiter (default: auto-detect) |
| header | No | Whether the CSV has a header row |
| skipLines | No | Number of lines to skip at the start |
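
The required attributes can be checked before a file ever reaches the builder. The following is a minimal sketch, assuming the format names are case-sensitive and limited to the three listed here; check_input is a hypothetical helper, and the real validation is done by the XSD.

```python
import xml.etree.ElementTree as ET

# Taken from the attribute list in this section.
REQUIRED = ("id", "location", "format")
# Assumption: format names are matched exactly as written here.
KNOWN_FORMATS = {"CSV", "JSON", "EXCEL"}


def check_input(element):
    """Return a list of problems found on one <input> element (hypothetical helper)."""
    problems = [
        f"missing required attribute '{name}'"
        for name in REQUIRED
        if element.get(name) is None
    ]
    fmt = element.get("format")
    if fmt is not None and fmt not in KNOWN_FORMATS:
        problems.append(f"unknown format '{fmt}'")
    return problems


good = ET.fromstring('<input id="sales" location="data/sales.csv" format="CSV" />')
bad = ET.fromstring('<input label="Region Lookup" format="XML" />')

print(check_input(good))
print(check_input(bad))
```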

Nodes

Each <node> defines a transform operation:

<pipeline>
    <node id="top_sales"
          label="Top 10 Sales"
          type="get_top_n">
        <inputPort name="df" source_ref="sales" />
        <inputPort name="column" value="revenue" />
        <inputPort name="n" value="10" />
        <outputPort id="101"
                    name="result"
                    label="Top Sales"
                    visibility="true" />
    </node>

    <node id="summary"
          label="Revenue Summary"
          type="aggregate_sum">
        <inputPort name="df" source_ref="101" />
        <inputPort name="group_columns" value="region" />
        <outputPort id="102"
                    name="result"
                    label="Revenue by Region"
                    visibility="true" />
    </node>
</pipeline>

Input Ports

Ports connect data or pass parameters to transform functions:

| Attribute | Description |
|---|---|
| name | Parameter name matching the Python function signature |
| source_ref | ID of the input or output port providing data (connected port) |
| value | Static parameter value as a string (converted to Python type by the builder) |

A port has either source_ref (connected) or value (static), not both.
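
The either-or rule can be expressed as a small check. This sketch assumes that a port with neither attribute is also invalid, which the text above implies but does not state outright; classify_port is a hypothetical name, not part of Choregraph.

```python
import xml.etree.ElementTree as ET

NODE = """
<node id="top_sales" type="get_top_n">
    <inputPort name="df" source_ref="sales" />
    <inputPort name="column" value="revenue" />
    <inputPort name="n" value="10" />
</node>
"""


def classify_port(port):
    """Return 'connected' or 'static' for an <inputPort> element.

    Raises ValueError when both or neither of source_ref/value are present
    (rejecting the 'neither' case is an assumption).
    """
    has_ref = port.get("source_ref") is not None
    has_value = port.get("value") is not None
    if has_ref == has_value:
        raise ValueError(
            f"port '{port.get('name')}' needs exactly one of source_ref or value"
        )
    return "connected" if has_ref else "static"


kinds = {
    port.get("name"): classify_port(port)
    for port in ET.fromstring(NODE).findall("inputPort")
}
print(kinds)
```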

Output Ports

| Attribute | Description |
|---|---|
| id | Unique integer ID (used as source_ref by downstream nodes) |
| name | Port name (typically result or mask) |
| label | Human-readable label for visualization |
| visibility | Whether to expose this output in DIVE visualization |
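
Because a source_ref may name either an input ID or an output port ID, resolving it means looking in both places. The sketch below shows one way this lookup could work, under the assumption that input IDs and output port IDs share a single namespace; find_producers and node_dependencies are hypothetical helpers, not Choregraph functions.

```python
import xml.etree.ElementTree as ET

SPEC = """
<choregraph>
    <inputs>
        <input id="sales" location="data/sales.csv" format="CSV" />
    </inputs>
    <pipeline>
        <node id="top_sales" type="get_top_n">
            <inputPort name="df" source_ref="sales" />
            <outputPort id="101" name="result" />
        </node>
        <node id="summary" type="aggregate_sum">
            <inputPort name="df" source_ref="101" />
            <outputPort id="102" name="result" />
        </node>
    </pipeline>
</choregraph>
"""


def find_producers(root):
    """Map every referenceable ID to the input or node that provides it."""
    producers = {}
    for el in root.findall("./inputs/input"):
        producers[el.get("id")] = ("input", el.get("id"))
    for node in root.findall("./pipeline/node"):
        for out in node.findall("outputPort"):
            producers[out.get("id")] = ("node", node.get("id"))
    return producers


def node_dependencies(root):
    """For each node, list the producers its connected input ports refer to."""
    producers = find_producers(root)
    deps = {}
    for node in root.findall("./pipeline/node"):
        refs = [
            port.get("source_ref")
            for port in node.findall("inputPort")
            if port.get("source_ref") is not None
        ]
        deps[node.get("id")] = [producers[ref] for ref in refs]
    return deps


deps = node_dependencies(ET.fromstring(SPEC))
print(deps)
```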

Type Conversion

The builder converts string value attributes to Python types according to the port type declared in the XSD:

| XSD Port Type | Python Type | Example |
|---|---|---|
| StaticFloatPort | float | "3.14" → 3.14 |
| StaticIntegerPort | int | "10" → 10 |
| StaticBooleanPort | bool | "true" → True |
| StaticListPort | list | "a,b,c" → ["a", "b", "c"] |
| StaticStringPort | str | "revenue" → "revenue" |
| ConnectedDataFramePort | | Resolved via source_ref |
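
The conversions in the table can be approximated with a small dispatch function. This is a sketch of the mapping, not the builder's actual code: convert_value is a hypothetical name, and two details are assumptions, namely that booleans are compared case-insensitively against "true" and that list items are split on commas with surrounding whitespace trimmed.

```python
def convert_value(port_type, raw):
    """Convert a string attribute to a Python value based on its XSD port type.

    Hypothetical helper mirroring the table above.
    """
    if port_type == "StaticFloatPort":
        return float(raw)
    if port_type == "StaticIntegerPort":
        return int(raw)
    if port_type == "StaticBooleanPort":
        # Assumption: anything other than "true" (any case) is treated as False.
        return raw.strip().lower() == "true"
    if port_type == "StaticListPort":
        # Assumption: items are comma-separated and stay strings.
        return [item.strip() for item in raw.split(",")]
    if port_type == "StaticStringPort":
        return raw
    raise ValueError(f"no static conversion for port type '{port_type}'")


examples = [
    ("StaticFloatPort", "3.14"),
    ("StaticIntegerPort", "10"),
    ("StaticBooleanPort", "true"),
    ("StaticListPort", "a,b,c"),
    ("StaticStringPort", "revenue"),
]
for port_type, raw in examples:
    print(port_type, repr(convert_value(port_type, raw)))
```

Connected ports are deliberately left out of the function: their value comes from another port at run time rather than from a string in the file.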

Programmatic Equivalent

The same pipeline can be built without XML. The example below covers the sales input and the top_sales node:

from choregraph import Choregraph
from choregraph.parser import InputPortSpec, OutputPortSpec

cg = Choregraph()

# Equivalent of the <input id="sales" ... /> element
cg.add_input(id="sales", location="data/sales.csv", format="CSV")

# Equivalent of the <node id="top_sales" type="get_top_n"> element
cg.add_node(
    id="top_sales",
    type="get_top_n",
    input_ports=[
        # Connected port: data comes from the "sales" input
        InputPortSpec(name="df", source_ref="sales"),
        # Static ports: values are passed as strings, as in the XML
        InputPortSpec(name="column", value="revenue"),
        InputPortSpec(name="n", value="10"),
    ],
    output_ports=[
        OutputPortSpec(id=101, name="result", label="Top Sales", visibility=True),
    ],
)