Core Transforms
The core transform function library: 50 DataFrame operations registered in TRANSFORM_REGISTRY.
The Builder uses these functions when constructing Kedro pipeline nodes
from the XML specification.
All functions follow a consistent pattern: accept a DataFrame (and parameters) and return a
DataFrame (or a scalar). Functions that accept return_mask=True can return both a filtered result
and a boolean mask.
library
Transform function library: the extensible registry of data operations.
Defines 50+ DataFrame transform functions organized by category (filtering,
aggregation, column/row operations, calculations, multi-input joins, advanced
transformations, JSON extraction). All functions are registered in
TRANSFORM_REGISTRY, which the builder uses to look up implementations
when constructing Kedro pipeline nodes from an XML specification.
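The registry lookup described above can be pictured with a minimal sketch. The names `SKETCH_REGISTRY`, `register`, `count_items`, and `run_transform` are hypothetical and exist only for illustration; how the library actually populates TRANSFORM_REGISTRY is not shown on this page.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for the library's registry; the real structure may differ.
SKETCH_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Return a decorator that stores the decorated function under `name`."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        SKETCH_REGISTRY[name] = func
        return func
    return decorator


@register("count_items")
def count_items(rows: list) -> int:
    """Toy transform: count the rows of a list-based table."""
    return len(rows)


def run_transform(name: str, *args: Any, **params: Any) -> Any:
    """Look up a transform by name, as a builder might, and call it."""
    try:
        func = SKETCH_REGISTRY[name]
    except KeyError:
        raise KeyError(f"Unknown transform: {name!r}") from None
    return func(*args, **params)


total = run_transform("count_items", [{"a": 1}, {"a": 2}])
```

A name-to-callable mapping keeps the pipeline specification declarative: the specification only needs to carry a transform name and its parameters.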
JsonTooDeepError
Bases: ValueError
Raised by cartograph_json when the input nests deeper than
MAX_JSON_DEPTH.
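As an illustration of a depth guard of this kind, the sketch below measures nesting iteratively and raises a ValueError subclass once a limit is exceeded. `MAX_DEPTH_SKETCH`, `TooDeepSketchError`, and `check_depth` are hypothetical names, and the limit of 4 is an arbitrary assumption; the real value of MAX_JSON_DEPTH is not given here.

```python
from typing import Any

MAX_DEPTH_SKETCH = 4  # assumed limit for this sketch only


class TooDeepSketchError(ValueError):
    """Illustrative counterpart of JsonTooDeepError."""


def check_depth(value: Any, limit: int = MAX_DEPTH_SKETCH) -> int:
    """Return the nesting depth of `value`, raising once it exceeds `limit`."""
    # Iterative traversal, so very deep input cannot hit Python's recursion limit.
    deepest = 0
    stack = [(value, 0)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            continue  # scalars add no nesting
        depth += 1
        if depth > limit:
            raise TooDeepSketchError(f"nesting exceeds {limit} levels")
        deepest = max(deepest, depth)
        stack.extend((child, depth) for child in children)
    return deepest
```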
calculate_min
Calculate the minimum value from a DataFrame column or a list.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame (mutually exclusive with …). |
| column | Column name to compute the minimum of. |
| input_list | Plain Python list to compute the minimum of. |

| RETURNS | DESCRIPTION |
|---|---|
| float | The minimum value as a scalar. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If neither … |
Source code in src/choregraph/library.py
calculate_max
Calculate the maximum value from a DataFrame column or a list.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame (mutually exclusive with …). |
| column | Column name to compute the maximum of. |
| input_list | Plain Python list to compute the maximum of. |

| RETURNS | DESCRIPTION |
|---|---|
| float | The maximum value as a scalar. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If neither … |
Source code in src/choregraph/library.py
filter_less_than
Filter rows where column < value.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| column | Column name to compare. |
| value | Threshold value. |
| return_mask | If True, return a dict with both the filtered DataFrame and a boolean mask. |

| RETURNS | DESCRIPTION |
|---|---|
| Union[DataFrame, Dict[str, Any]] | Filtered DataFrame, or a dict with the filtered DataFrame and a boolean mask when return_mask is True. |
Source code in src/choregraph/library.py
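The return_mask behaviour can be sketched in plain pandas as follows. This is an illustration, not the library's code: the function name `filter_less_than_sketch` and the dict keys `"result"` and `"mask"` are assumptions, since the actual key names are not shown on this page.

```python
import pandas as pd


def filter_less_than_sketch(df, column, value, return_mask=False):
    """Keep rows where df[column] < value; optionally also return the mask."""
    mask = df[column] < value          # boolean Series aligned with df's index
    filtered = df[mask]
    if return_mask:
        # Key names are assumed for this sketch.
        return {"result": filtered, "mask": mask}
    return filtered


df = pd.DataFrame({"price": [5, 12, 8]})
out = filter_less_than_sketch(df, "price", 10, return_mask=True)
```

Returning the mask alongside the result lets a downstream step mark the selected rows in the unfiltered data without recomputing the comparison.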
filter_greater_than
Filter rows where column > value.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| column | Column name to compare. |
| value | Threshold value. |
| return_mask | If True, return a dict with both the filtered DataFrame and a boolean mask. |

| RETURNS | DESCRIPTION |
|---|---|
| Union[DataFrame, Dict[str, Any]] | Filtered DataFrame, or a dict with the filtered DataFrame and a boolean mask when return_mask is True. |
Source code in src/choregraph/library.py
filter_in_range
Filter rows where min_value <= column <= max_value.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| column | Column name to compare. |
| min_value | Lower bound of the range (inclusive). |
| max_value | Upper bound of the range (inclusive). |
| return_mask | If True, return a dict with both the filtered DataFrame and a boolean mask. |

| RETURNS | DESCRIPTION |
|---|---|
| Union[DataFrame, Dict[str, Any]] | Filtered DataFrame, or a dict with the filtered DataFrame and a boolean mask when return_mask is True. |
Source code in src/choregraph/library.py
filter_equal
Filter rows where column == value.
Works with both numeric and string columns. Numeric conversion is attempted automatically when the column dtype is numeric.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| column | Column name to compare. |
| value | Value to match (string; auto-converted for numeric columns). |
| return_mask | If True, return a dict with both the filtered DataFrame and a boolean mask. |

| RETURNS | DESCRIPTION |
|---|---|
| Union[DataFrame, Dict[str, Any]] | Filtered DataFrame, or a dict with the filtered DataFrame and a boolean mask when return_mask is True. |
Source code in src/choregraph/library.py
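One way the automatic numeric conversion could work is sketched below. The helper `coerce_for_column` is hypothetical, and converting through `float` is an assumption; the library may use a different conversion rule.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def coerce_for_column(series: pd.Series, value):
    """Convert a string value to a number when the column is numeric."""
    if is_numeric_dtype(series) and isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return value  # leave unparseable text untouched
    return value


df = pd.DataFrame({"qty": [1, 2, 2], "city": ["Oslo", "Lima", "Oslo"]})
# The string "2" is converted because "qty" is numeric.
by_qty = df[df["qty"] == coerce_for_column(df["qty"], "2")]
# "Oslo" is compared as-is because "city" holds text.
by_city = df[df["city"] == coerce_for_column(df["city"], "Oslo")]
```

Without such a conversion, comparing an integer column against the string "2" would match no rows.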
filter_not_equal
Filter rows where column != value.
Works with both numeric and string columns. Numeric conversion is attempted automatically when the column dtype is numeric.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| column | Column name to compare. |
| value | Value to exclude (string; auto-converted for numeric columns). |
| return_mask | If True, return a dict with both the filtered DataFrame and a boolean mask. |

| RETURNS | DESCRIPTION |
|---|---|
| Union[DataFrame, Dict[str, Any]] | Filtered DataFrame, or a dict with the filtered DataFrame and a boolean mask when return_mask is True. |
Source code in src/choregraph/library.py
get_top_n
Return the top n rows by column value (descending).
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| column | Column name to rank by. |
| n | Number of rows to keep. |
| return_mask | If True, return a dict with the result and a boolean mask. |

| RETURNS | DESCRIPTION |
|---|---|
| Union[DataFrame, Dict[str, Any]] | DataFrame with the top n rows, or a dict with the result and a boolean mask when return_mask is True. |
Source code in src/choregraph/library.py
get_top_percentage
Return the top fraction of rows by column value (descending).
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| column | Column name to rank by. |
| fraction | Fraction of rows to keep (0.0–1.0). |
| return_mask | If True, return a dict with the result and a boolean mask. |

| RETURNS | DESCRIPTION |
|---|---|
| Union[DataFrame, Dict[str, Any]] | DataFrame with the top rows, or a dict with the result and a boolean mask when return_mask is True. |
Source code in src/choregraph/library.py
get_bottom_n
Return the bottom n rows by column value (ascending).
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| column | Column name to rank by. |
| n | Number of rows to keep. |
| return_mask | If True, return a dict with the result and a boolean mask. |

| RETURNS | DESCRIPTION |
|---|---|
| Union[DataFrame, Dict[str, Any]] | DataFrame with the bottom n rows, or a dict with the result and a boolean mask when return_mask is True. |
Source code in src/choregraph/library.py
get_bottom_percentage
Return the bottom fraction of rows by column value (ascending).
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| column | Column name to rank by. |
| fraction | Fraction of rows to keep (0.0–1.0). |
| return_mask | If True, return a dict with the result and a boolean mask. |

| RETURNS | DESCRIPTION |
|---|---|
| Union[DataFrame, Dict[str, Any]] | DataFrame with the bottom rows, or a dict with the result and a boolean mask when return_mask is True. |
Source code in src/choregraph/library.py
aggregate_mean
Calculates the mean of all numeric columns, optionally grouped.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame |
| group_columns | Optional column(s) to group by |
| suffix | Optional suffix to add to the aggregated column names |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Aggregated DataFrame with mean values per group (or a single-row DataFrame if ungrouped). |
Source code in src/choregraph/library.py
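The grouped and ungrouped cases, together with the suffix option, might look like this in plain pandas. `aggregate_mean_sketch` is a hypothetical illustration; details such as whether the grouping columns are returned as regular columns are assumptions.

```python
import pandas as pd


def aggregate_mean_sketch(df, group_columns=None, suffix=None):
    """Mean of all numeric columns, per group or over the whole frame."""
    if group_columns:
        keys = [group_columns] if isinstance(group_columns, str) else list(group_columns)
        out = df.groupby(keys).mean(numeric_only=True).reset_index()
    else:
        keys = []
        # Series of column means turned into a single-row DataFrame.
        out = df.mean(numeric_only=True).to_frame().T
    if suffix:
        # Only aggregated columns get the suffix, never the grouping keys.
        out = out.rename(columns={c: f"{c}{suffix}" for c in out.columns if c not in keys})
    return out


df = pd.DataFrame({"region": ["N", "N", "S"], "sales": [10.0, 20.0, 5.0]})
grouped = aggregate_mean_sketch(df, "region", suffix="_mean")
overall = aggregate_mean_sketch(df)
```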
aggregate_count
Returns the number of rows, optionally grouped. Only returns the grouping columns and a 'count' column.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame |
| group_columns | Optional column(s) to group by |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with the grouping columns and a 'count' column, or a DataFrame with the total row count if ungrouped. |
Source code in src/choregraph/library.py
aggregate_sum
Calculates the sum of all numeric columns, optionally grouped.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame |
| group_columns | Optional column(s) to group by |
| suffix | Optional suffix to add to the aggregated column names |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Aggregated DataFrame with summed values per group (or a single-row DataFrame if ungrouped). |
Source code in src/choregraph/library.py
aggregate_median
Calculates the median of all numeric columns, optionally grouped.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame |
| group_columns | Optional column(s) to group by |
| suffix | Optional suffix to add to the aggregated column names |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Aggregated DataFrame with median values per group (or a single-row DataFrame if ungrouped). |
Source code in src/choregraph/library.py
hierarchical_rollup
Transform tabular data into hierarchical parent-child-value long format.
Takes N hierarchical columns (broadest to most specific) and produces a DataFrame with path-based ids, parent references, and aggregated values. Supports arbitrary hierarchy depth.
A synthetic root node (root_label) is always prepended so that the
output has a single root, as required by Plotly Treemap / Sunburst.
All numeric columns (except path_columns) are automatically summed at each
hierarchy level and preserved in the output alongside a count column.
This allows downstream channels (e.g. color) to reference any aggregated
numeric variable.
The output serves both Partition (Treemap/Sunburst) and Flow (Sankey) marks:
- Partition reads: ids=id, labels=last_part(id), parents=parent, values=value
- Flow reads: source=parent, target=id, value=value (skip root rows)
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame with hierarchical columns. |
| path_columns | Ordered list of column names defining hierarchy levels (broadest to most specific), e.g. ["continent", "country", "city"]. Also accepts a comma-separated string. |
| value_column | Column to aggregate as the primary … |
| root_label | Label for the synthetic root node (default …). |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with columns: target, source, value, count, and one column per extra numeric field (summed). … based identifier; … the synthetic root). The (source, target, value) triple is the shape that sankey/chord flow marks consume directly. |
Source code in src/choregraph/library.py
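The rollup idea (path-based ids, parent references, values summed at every level under a synthetic root) can be sketched without pandas. Everything here is an assumption made for illustration: the function name `rollup_sketch`, the `/` path separator, the default root label `"root"`, and the output keys `id`, `parent`, and `value`.

```python
from collections import defaultdict


def rollup_sketch(rows, path_columns, value_column, root_label="root"):
    """Sum `value_column` at every hierarchy level, under one synthetic root."""
    totals = defaultdict(float)
    parents = {root_label: ""}            # the root has no parent
    for row in rows:
        value = row[value_column]
        totals[root_label] += value
        parent = root_label
        for column in path_columns:
            node = f"{parent}/{row[column]}"   # path-based id
            parents[node] = parent
            totals[node] += value
            parent = node
    return [{"id": n, "parent": p, "value": totals[n]} for n, p in parents.items()]


rows = [
    {"continent": "Europe", "country": "France", "sales": 3},
    {"continent": "Europe", "country": "Spain", "sales": 2},
    {"continent": "Asia", "country": "Japan", "sales": 4},
]
result = rollup_sketch(rows, ["continent", "country"], "sales")
```

Using the full path as the id keeps nodes distinct even when the same label appears under different parents.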
add_label
Add a new column with a constant value.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame |
| label | Name of the new column to add |
| value | Value to fill in the new column (can be any scalar or object) |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with the new column added. |
Source code in src/choregraph/library.py
select_columns
Extract/select only the specified columns from the DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame |
| columns | Column name(s) to keep. Can be a single string or a list. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with only the specified columns. |
Source code in src/choregraph/library.py
drop_columns
Remove the specified columns from the DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame |
| columns | Column name(s) to drop. Can be a single string or a list. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame without the specified columns. |
Source code in src/choregraph/library.py
rename_column
Rename a column in the DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame |
| old_name | Current column name |
| new_name | New column name |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with the column renamed. |
Source code in src/choregraph/library.py
count_rows
Return the total number of rows in the DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |

| RETURNS | DESCRIPTION |
|---|---|
| int | Row count as an integer scalar. |
slice_rows
Keep only a specific range of rows by positional index.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| start | Start index (inclusive). None means from the beginning. |
| stop | Stop index (exclusive). None means to the end. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Sliced DataFrame. |
Source code in src/choregraph/library.py
sort_values
Sort the DataFrame by one or more columns.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| columns | Column name(s) to sort by. |
| ascending | Sort order. True for ascending, False for descending. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Sorted DataFrame. |
Source code in src/choregraph/library.py
sample_rows
Take a random sample of rows from the DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| n | Exact number of rows to sample (mutually exclusive with fraction). |
| fraction | Fraction of rows to sample (0.0–1.0). |
| seed | Random seed for reproducibility. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Sampled DataFrame. |
Source code in src/choregraph/library.py
calc_distance
Calculate Euclidean distance from a reference point.
Adds a new column with the distance from (ref_x, ref_y) to each row's
(x_col, y_col) values.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| x_col | Column containing X coordinates. |
| y_col | Column containing Y coordinates. |
| ref_x | Reference X coordinate. |
| ref_y | Reference Y coordinate. |
| target_col | Name of the new distance column. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with the distance column added. |
Source code in src/choregraph/library.py
calc_ratio
Calculates the ratio between two columns in the same DataFrame. Creates a new column named 'ratio' containing numerator_col / denominator_col.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame |
| numerator_col | Column name for the numerator |
| denominator_col | Column name for the denominator |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with a new 'ratio' column added. |
Source code in src/choregraph/library.py
join
Join multiple DataFrames on a common key.
Collects inputs from dfs (list or single DataFrame) and any DataFrames
passed as keyword arguments (named ports from the pipeline). When column
name conflicts occur, columns are suffixed with the source name (the
kwargs key from the pipeline) instead of generic _left / _right.
| PARAMETER | DESCRIPTION |
|---|---|
| dfs | Primary DataFrame(s) to join. |
| on | Column name(s) to join on. |
| how | Join type: … |
| **kwargs | Additional DataFrames passed by name. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Merged DataFrame. |
Source code in src/choregraph/library.py
union
Vertically stack (union) multiple DataFrames.
Collects inputs from dfs (list or single DataFrame) and any DataFrames passed as keyword arguments.
| PARAMETER | DESCRIPTION |
|---|---|
| dfs | Primary DataFrame(s) to concatenate. |
| ignore_index | If True, reset the index in the result. |
| **kwargs | Additional DataFrames passed by name. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Concatenated DataFrame. |
Source code in src/choregraph/library.py
melt
Unpivot a wide DataFrame into long format.
Converts columns into rows, turning a wide table (one column per metric)
into a long table with a variable column and a value column.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame in wide format. |
| id_columns | Column(s) to keep as identifiers (not melted). Accepts a single string or a list. If None, all non-value columns are used. |
| value_columns | Column(s) to unpivot. Accepts a single string or a list. If None, all columns not in id_columns are melted. |
| var_name | Name for the new column holding the former column headers. |
| value_name | Name for the new column holding the values. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Long-format DataFrame. |

Examples:
Wide input:
| date | price_cape | price_panama |
|---|---|---|
| 2024-01 | 100 | 200 |

melt(df, id_columns="date", var_name="source", value_name="price"):
| date | source | price |
|---|---|---|
| 2024-01 | price_cape | 100 |
| 2024-01 | price_panama | 200 |
Source code in src/choregraph/library.py
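For comparison, the same wide-to-long reshaping can be reproduced with pandas' own `DataFrame.melt`, using the data from the example above. Whether the library's melt delegates to this call is an assumption; note that pandas names the identifier argument `id_vars`, not `id_columns`.

```python
import pandas as pd

# Wide table: one column per price source.
wide = pd.DataFrame(
    {"date": ["2024-01"], "price_cape": [100], "price_panama": [200]}
)

# Every column other than "date" is unpivoted into (source, price) pairs.
long = wide.melt(id_vars="date", var_name="source", value_name="price")
```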
arithmetic_op
arithmetic_op(df, left_column, right_column=None, constant=None, operator='ADD', output_column='result')
Apply an arithmetic operation between a column and another column or constant.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| left_column | Column name for the left operand. |
| right_column | Column name for the right operand (mutually exclusive with constant). |
| constant | Scalar value for the right operand. |
| operator | One of … |
| output_column | Name of the result column. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with the computed column added. |
Source code in src/choregraph/library.py
normalize_column
Normalize a numeric column using min-max scaling or z-score standardization.
- 'minmax': (x - min) / (max - min)
- 'zscore': (x - mean) / std
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| column | Column to normalize. |
| method | Normalization method: 'minmax' or 'zscore'. |
| output_column | Name of the result column (defaults to …). |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with the normalized column added. |
Source code in src/choregraph/library.py
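The two formulas can be checked on a small Series. `normalize_sketch` is a hypothetical helper; it uses pandas' default sample standard deviation (ddof=1), which is an assumption about the library's choice.

```python
import pandas as pd


def normalize_sketch(series: pd.Series, method: str = "minmax") -> pd.Series:
    """Apply min-max scaling or z-score standardization to a numeric Series."""
    if method == "minmax":
        span = series.max() - series.min()
        return (series - series.min()) / span
    if method == "zscore":
        # pandas' std() defaults to the sample standard deviation (ddof=1).
        return (series - series.mean()) / series.std()
    raise ValueError(f"unknown method: {method!r}")


values = pd.Series([0.0, 5.0, 10.0])
scaled = normalize_sketch(values, "minmax")
standardized = normalize_sketch(values, "zscore")
```

For a constant column both denominators are zero, so this sketch would yield NaN values; a production implementation would likely need to handle that case explicitly.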
discretize
Discretize a continuous column into bins.
- 'uniform': Equal-width bins.
- 'quantile': Equal-frequency bins.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame. |
| column | Column to discretize. |
| bins | Number of bins. |
| strategy | Binning strategy: 'uniform' or 'quantile'. |
| output_column | Name of the result column (defaults to …). |
| labels | Optional list of label names for the bins (e.g. …). |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | DataFrame with the binned column added. |
Source code in src/choregraph/library.py
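The difference between the two strategies shows up on skewed data. The sketch below uses `pd.cut` for equal-width bins and `pd.qcut` for equal-frequency bins; mapping the strategies to these two pandas functions is an assumption, not something this page confirms.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 100])

# Equal-width: the range 1..100 is split at its midpoint, so the outlier
# ends up alone in the upper bin.
uniform = pd.cut(values, bins=2, labels=["low", "high"])

# Equal-frequency: the split falls at the median (3), so the rows are
# divided as evenly as the data allows.
quantile = pd.qcut(values, q=2, labels=["low", "high"])
```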
flatten_json
Convert arbitrary JSON structures into a flat DataFrame.
Auto-detects common JSON-to-table patterns and applies the best flattening strategy:
- Array of objects [{col: val, ...}, ...] → pd.DataFrame(data) directly.
- Dict of paired arrays {key: [[x, y], ...], ...} (e.g. CoinGecko market data) → join arrays on shared first column, one column per key.
- Dict of simple arrays {key: [v1, v2, ...], ...} → one column per key (all same length).
- Keyed array of objects (when root_key is provided) {root_key: [{...}, ...]} → flattens the inner list.
- Nested / complex → pd.json_normalize() as fallback.
| PARAMETER | DESCRIPTION |
|---|---|
| data | Loaded JSON data (dict or list). |
| root_key | Optional top-level key to drill into before flattening. |
| columns | Optional comma-separated column names to assign to the resulting DataFrame (useful for unnamed arrays). |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | A flat DataFrame. |
Examples:
>>> flatten_json({"prices": [[1, 100], [2, 200]],
... "volumes": [[1, 50], [2, 60]]})
timestamp prices volumes
0 1 100 50
1 2 200 60
Source code in src/choregraph/library.py
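The "dict of paired arrays" strategy can be illustrated with a small pure-Python sketch that joins the arrays on their shared first element. `join_paired_arrays` is a hypothetical helper, and naming the shared column `timestamp` simply follows the example output above.

```python
def join_paired_arrays(data, first_column="timestamp"):
    """Merge {key: [[x, y], ...]} into one row per shared x value."""
    rows = {}
    for key, pairs in data.items():
        for shared, value in pairs:
            # Create the row on first sight of `shared`, then add this key's value.
            rows.setdefault(shared, {first_column: shared})[key] = value
    return list(rows.values())


records = join_paired_arrays(
    {"prices": [[1, 100], [2, 200]], "volumes": [[1, 50], [2, 60]]}
)
```

The resulting list of dicts has one entry per shared first value and could be passed to `pd.DataFrame` to obtain a table shaped like the example.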
remove_required_keys
Recursively traverses the dictionary to remove 'required' keys.
Source code in src/choregraph/library.py
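A recursive removal of this kind might look like the sketch below. `strip_required` is a hypothetical name, and returning a cleaned copy rather than mutating the input is an assumption; the library function may modify the dictionary in place.

```python
from typing import Any


def strip_required(node: Any) -> Any:
    """Return a copy of `node` with every 'required' key removed, at any depth."""
    if isinstance(node, dict):
        return {k: strip_required(v) for k, v in node.items() if k != "required"}
    if isinstance(node, list):
        return [strip_required(item) for item in node]
    return node  # scalars are returned unchanged


schema = {
    "type": "object",
    "required": ["id"],
    "properties": {
        "id": {"type": "integer"},
        "tags": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name"],
                "properties": {"name": {"type": "string"}},
            },
        },
    },
}
cleaned = strip_required(schema)
```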
cartograph_json
Produce a structural cartography of a JSON document for the LLM.
Uses genson (https://github.com/wolverdude/genson) to infer a JSON
Schema from the loaded data (one "skeleton" merging every record), then
renders it as a compact ASCII hierarchy that the planning LLM embeds via
DatasetStats.info["extract_with"] (rendered by
MetadataResult._to_markdown).
| PARAMETER | DESCRIPTION |
|---|---|
| data | Loaded JSON value (dict, list, or primitive). |
| max_chars | Upper bound on the rendered tree length. Truncated with … |

| RETURNS | DESCRIPTION |
|---|---|
| dict | |
Source code in src/choregraph/library.py
execute_code
Execute Python code with one or more DataFrame inputs.
All input DataFrames are available in the code by their port name.
The code must assign its result to a variable named result.
All scientific Python libraries installed in the environment are available.
System, IO, and network modules are blocked.
Source code in src/choregraph/library.py
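The contract described above (inputs exposed by port name, output read from `result`, certain imports refused) can be sketched with `exec` and a guarded `__import__`. All names here are hypothetical, the blocklist is an assumption, and this sketch is not a security boundary; it only illustrates the calling convention.

```python
import builtins

BLOCKED = {"os", "sys", "socket", "subprocess"}  # assumed blocklist for this sketch


def guarded_import(name, *args, **kwargs):
    """Refuse blocked top-level modules, otherwise defer to the real import."""
    if name.split(".")[0] in BLOCKED:
        raise ImportError(f"module {name!r} is blocked")
    return builtins.__import__(name, *args, **kwargs)


def run_snippet(code: str, **inputs):
    """Run `code` with inputs bound by name and return its `result` variable."""
    safe_builtins = dict(vars(builtins), __import__=guarded_import)
    namespace = {"__builtins__": safe_builtins, **inputs}
    exec(code, namespace)
    if "result" not in namespace:
        raise ValueError("code must assign to a variable named 'result'")
    return namespace["result"]


total = run_snippet("result = sum(values)", values=[1, 2, 3])
```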
concat_partitions
Concatenate a PartitionedDataset into a single DataFrame.
Loads every partition in sorted key order, tags each row with a
__partition__ column (float index: 0.0, 1.0, …), and concatenates
them into one DataFrame. Use this before applying transforms that need
global context (consistent bin edges, global aggregates, etc.).
Pair with split_partitions to restore the partitioned structure
after the transform.
| PARAMETER | DESCRIPTION |
|---|---|
| partitioned | Kedro PartitionedDataset dict |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | Single DataFrame with an added __partition__ column. |
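The steps described above can be sketched as follows. `concat_partitions_sketch` is a hypothetical illustration; it assumes the input maps partition keys to either DataFrames or zero-argument load callables, which is how Kedro's PartitionedDataset typically hands over lazily loaded partitions.

```python
import pandas as pd


def concat_partitions_sketch(partitioned: dict) -> pd.DataFrame:
    """Load partitions in sorted key order, tag each row, and stack them."""
    frames = []
    for index, key in enumerate(sorted(partitioned)):
        loader = partitioned[key]
        # Accept both lazy load callables and already-loaded DataFrames.
        frame = loader() if callable(loader) else loader
        frames.append(frame.assign(**{"__partition__": float(index)}))
    return pd.concat(frames, ignore_index=True)


partitioned = {
    "b": lambda: pd.DataFrame({"x": [3]}),
    "a": lambda: pd.DataFrame({"x": [1, 2]}),
}
combined = concat_partitions_sketch(partitioned)
```

Because the tag is the position in sorted key order, a later split step can recover the original grouping from the __partition__ column alone.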