XML Specification¶

Choregraph pipelines are defined using an XML format validated against TransformGraph.xsd.

Structure¶

A pipeline specification has three sections:

<choregraph>
    <inputs>
        <!-- Data source definitions -->
    </inputs>
    <pipeline>
        <!-- Transform node definitions -->
    </pipeline>
</choregraph>

Inputs¶

Each <input> declares a data source:

<inputs>
    <input id="sales"
           label="Sales Data"
           location="data/sales.csv"
           format="CSV"
           visibility="true"
           fieldSeparator=","
           header="true" />

    <input id="regions"
           label="Region Lookup"
           location="data/regions.csv"
           format="CSV" />
</inputs>

Attribute	Required	Description
`id`	Yes	Unique identifier (referenced by nodes)
`label`	No	Human-readable name (auto-generated from ID if omitted)
`location`	Yes	File path or URL
`format`	Yes	Data format: `CSV`, `JSON`, `EXCEL`
`visibility`	No	Whether to expose in visualization (`true`/`false`)
`fieldSeparator`	No	CSV column delimiter (default: auto-detect)
`header`	No	Whether the CSV has a header row
`skipLines`	No	Number of lines to skip at the start

Nodes¶

Each <node> defines a transform operation:

<pipeline>
    <node id="top_sales"
          label="Top 10 Sales"
          type="get_top_n">
        <inputPort name="df" source_ref="sales" />
        <inputPort name="column" value="revenue" />
        <inputPort name="n" value="10" />
        <outputPort id="101"
                    name="result"
                    label="Top Sales"
                    visibility="true" />
    </node>

    <node id="summary"
          label="Revenue Summary"
          type="aggregate_sum">
        <inputPort name="df" source_ref="101" />
        <inputPort name="group_columns" value="region" />
        <outputPort id="102"
                    name="result"
                    label="Revenue by Region"
                    visibility="true" />
    </node>
</pipeline>

Input Ports¶

Ports connect data or pass parameters to transform functions:

Attribute	Description
`name`	Parameter name matching the Python function signature
`source_ref`	ID of the input or output port providing data (connected port)
`value`	Static parameter value as a string (converted to Python type by the builder)

A port has either source_ref (connected) or value (static), not both.

Output Ports¶

Attribute	Description
`id`	Unique integer ID (used as `source_ref` by downstream nodes)
`name`	Port name (typically `result` or `mask`)
`label`	Human-readable label for visualization
`visibility`	Whether to expose this output in DIVE visualization

Type Conversion¶

The builder converts string value attributes to Python types using the XSD schema:

XSD Port Type	Python Type	Example
`StaticFloatPort`	`float`	`"3.14"` → `3.14`
`StaticIntegerPort`	`int`	`"10"` → `10`
`StaticBooleanPort`	`bool`	`"true"` → `True`
`StaticListPort`	`list`	`"a,b,c"` → `["a", "b", "c"]`
`StaticStringPort`	`str`	`"revenue"` → `"revenue"`
`ConnectedDataFramePort`	—	Resolved via `source_ref`

Programmatic Equivalent¶

The same pipeline can be built without XML:

from choregraph import Choregraph
from choregraph.parser import InputPortSpec, OutputPortSpec

cg = Choregraph()
cg.add_input(id="sales", location="data/sales.csv", format="CSV")
cg.add_node(
    id="top_sales",
    type="get_top_n",
    input_ports=[
        InputPortSpec(name="df", source_ref="sales"),
        InputPortSpec(name="column", value="revenue"),
        InputPortSpec(name="n", value="10"),
    ],
    output_ports=[
        OutputPortSpec(id=101, name="result", label="Top Sales", visibility=True),
    ],
)