Metadata
Extracts field-level statistics from pandas DataFrames — data types, min/max values, distinct
counts, and unique categorical values. Used by DiveConnector to populate the fields section
of VisuSpec XML. Optimized for large datasets with configurable thresholds.
metadata
Metadata extraction and caching for Choregraph datasets. Provides MetadataExtractor for DataFrame analysis and Metadata for persistence.
FieldMetadata
dataclass
FieldMetadata(id, name, data_type, min_value=None, max_value=None, is_unique=False, units='UNITLESS', distinct_count=-1, uniques='', info=None)
Metadata for a single DataFrame column.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| id | Sequential field identifier (string). |
| name | Column name from the DataFrame. |
| data_type | One of INTEGER, FLOAT, DATETIME, STRING, BOOLEAN, OBJECT. |
| min_value | Minimum value (numeric/datetime columns only). |
| max_value | Maximum value (numeric/datetime columns only). |
| is_unique | Whether all values in the column are unique. |
| units | Unit label (default 'UNITLESS'). |
| distinct_count | Number of distinct values (-1 if unknown). |
| uniques | Comma-separated string of unique values (categorical fields). |
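To make the attribute list concrete, here is a minimal sketch of a dataclass that mirrors the constructor signature shown above. The class name `FieldMetadataSketch`, the type annotations, and the sample values are assumptions for illustration only; the real class lives in src/choregraph/metadata.py and may differ.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class FieldMetadataSketch:
    """Illustrative stand-in; field names and defaults follow the documented signature."""

    id: str
    name: str
    data_type: str
    min_value: Optional[Any] = None   # annotation is an assumption
    max_value: Optional[Any] = None   # annotation is an assumption
    is_unique: bool = False
    units: str = "UNITLESS"
    distinct_count: int = -1
    uniques: str = ""
    info: Optional[Any] = None        # annotation is an assumption


# A numeric column: min/max are set, the categorical `uniques` string stays empty.
temperature = FieldMetadataSketch(
    id="0", name="temperature", data_type="FLOAT",
    min_value=-4.5, max_value=38.2, distinct_count=120,
)

# A categorical column: unique values are joined into one comma-separated string.
city = FieldMetadataSketch(
    id="1", name="city", data_type="STRING",
    distinct_count=3, uniques="Lyon,Nice,Paris",
)

# Plain-dict view, convenient for JSON serialization.
record = asdict(city)
```

Splitting `record["uniques"]` on commas recovers the individual categorical values.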
DatasetStats
dataclass
Complete stats for a single dataset.
MetadataResult
Bases: UserDict
Wrapper around the dict of DatasetStats to allow formatting methods. Behaves exactly like a Dict[str, DatasetStats], but adds .format().
format
Format the metadata collection into a string representation.
| PARAMETER | DESCRIPTION |
|---|---|
| format_type | Output format: "markdown" or "json". |
| user_message | Filter fields based on user query context. |
| detailed | Include all stats columns (min/max/uniques). |
Source code in src/choregraph/metadata.py
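The "dict plus .format()" idea can be sketched with `collections.UserDict`. Everything below is hypothetical: the class name, the plain-dict entries (the real object stores DatasetStats instances), the default `format_type`, and the exact markdown layout are assumptions, and the `user_message` filter is left out.

```python
import json
from collections import UserDict


class MetadataResultSketch(UserDict):
    """Hypothetical wrapper: dataset name -> stats dict, plus a format() method."""

    def format(self, format_type="markdown", detailed=False):
        if format_type == "json":
            return json.dumps(self.data, indent=2, sort_keys=True)
        if format_type != "markdown":
            raise ValueError(f"unsupported format_type: {format_type!r}")
        lines = []
        for name, stats in self.data.items():
            lines.append(f"## {name} ({stats['row_count']} rows)")
            for field in stats["fields"]:
                line = f"- {field['name']}: {field['data_type']}"
                if detailed:
                    # Extra stats are appended only on request.
                    line += f" (min={field.get('min_value')}, max={field.get('max_value')})"
                lines.append(line)
        return "\n".join(lines)


result = MetadataResultSketch({
    "sales": {
        "row_count": 2,
        "fields": [{"name": "amount", "data_type": "FLOAT",
                    "min_value": 1.5, "max_value": 9.0}],
    }
})
markdown = result.format("markdown", detailed=True)
```

Because the class derives from `UserDict`, ordinary mapping operations such as `result["sales"]`, `len(result)`, and `"sales" in result` keep working alongside `format()`.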
to_api_format
Convert to list-of-dicts format for the viz API (metadata.json).
| RETURNS | DESCRIPTION |
|---|---|
| list | List of dataset metadata dicts with keys: data_id, name, rows, fields. |
Source code in src/choregraph/metadata.py
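As a rough illustration of the list-of-dicts shape, the sketch below flattens a name-keyed mapping into entries with the four documented keys. The function name and the input keys (`id`, `row_count`) are assumptions about how the stats are stored, not the library's confirmed layout.

```python
def to_api_format_sketch(datasets):
    """Flatten {name: stats} into a list of dicts keyed data_id/name/rows/fields."""
    return [
        {
            "data_id": stats.get("id"),          # assumed storage key for the dataset ID
            "name": name,
            "rows": stats.get("row_count", 0),   # assumed storage key for the row count
            "fields": list(stats.get("fields", [])),
        }
        for name, stats in datasets.items()
    ]


payload = to_api_format_sketch({
    "sales": {"id": "in_1", "row_count": 2, "fields": [{"name": "amount"}]},
})
```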
from_datasets
classmethod
Build a MetadataResult from a list of dataset dicts.
Uses the same DatasetStats.from_dict() / FieldMetadata.from_dict()
deserialization as Metadata.read_from_cache(), so both the
workspace-based web flow and the stateless API produce identical objects.
| PARAMETER | DESCRIPTION |
|---|---|
| datasets | List of dicts with keys matching |
Source code in src/choregraph/metadata.py
MetadataExtractor
Analyzes a pandas DataFrame to extract metadata. Optimized for performance on large datasets.
extract
classmethod
Extract field-level metadata from a DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| df | Input DataFrame to analyze. |

| RETURNS | DESCRIPTION |
|---|---|
| List[FieldMetadata] | List of FieldMetadata objects. |
Source code in src/choregraph/metadata.py
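The following sketch shows one way per-column statistics of this kind could be derived with pandas. It is not the library's implementation: the helper names, the `max_uniques` threshold, and the dtype-to-label mapping are assumptions, and results are returned as plain dicts rather than FieldMetadata objects.

```python
import pandas as pd


def describe_column(field_id, series, max_uniques=10):
    """Collect per-column stats in a plain dict (hypothetical helper)."""
    if pd.api.types.is_bool_dtype(series):
        data_type = "BOOLEAN"
    elif pd.api.types.is_integer_dtype(series):
        data_type = "INTEGER"
    elif pd.api.types.is_float_dtype(series):
        data_type = "FLOAT"
    elif pd.api.types.is_datetime64_any_dtype(series):
        data_type = "DATETIME"
    else:
        data_type = "STRING"

    distinct = int(series.nunique(dropna=True))
    stats = {
        "id": str(field_id),
        "name": str(series.name),
        "data_type": data_type,
        "distinct_count": distinct,
        "is_unique": distinct == len(series),
        "min_value": None,
        "max_value": None,
        "uniques": "",
    }
    if data_type in ("INTEGER", "FLOAT", "DATETIME"):
        # Range statistics only make sense for ordered types.
        stats["min_value"] = series.min()
        stats["max_value"] = series.max()
    elif data_type == "STRING" and distinct <= max_uniques:
        # Only low-cardinality columns get their values listed.
        stats["uniques"] = ",".join(sorted(series.dropna().astype(str).unique()))
    return stats


def extract_sketch(df):
    """Describe every column, numbering fields sequentially from 0."""
    return [describe_column(i, df[col]) for i, col in enumerate(df.columns)]


df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris"], "sales": [10, 25, 7]})
fields = extract_sketch(df)
```

Capping the listed unique values with a threshold is one plausible reading of "configurable thresholds": it keeps the output small when a string column has many distinct values.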
Metadata
Centralized manager for dataset metadata. Reads directly from catalogue_stats.json without in-memory caching.
Source code in src/choregraph/metadata.py
update_stats
Calculate and store stats for a dataset.
| PARAMETER | DESCRIPTION |
|---|---|
| name | Dataset name (Kedro catalog key) |
| df | The data to analyze (DataFrame, dict, or list) |
| dataset_id | Optional spec ID (input ID or output port ID) |
| dataset_type | "input" or "output" |
Source code in src/choregraph/metadata.py
store_stats
Store pre-extracted stats for a dataset directly to JSON file.
| PARAMETER | DESCRIPTION |
|---|---|
| name | Dataset name (Kedro catalog key) |
| fields | Pre-extracted field metadata list |
| row_count | Number of rows in the dataset |
| dataset_id | Optional XML ID of the dataset |
| dataset_type | "input" or "output" |
| dataset_info | Optional structural description (e.g. JSON cartography) stored under the |
Source code in src/choregraph/metadata.py
write_raw_cache
Write a raw JSON string directly to catalogue_stats.json.
Used by the API flow: the Toolkit sends the pre-built catalogue_stats and the server writes it as-is.
Source code in src/choregraph/metadata.py
read_from_cache
Load stats directly from catalogue_stats.json.
| PARAMETER | DESCRIPTION |
|---|---|
| dataset_ids | If provided, only retrieves metadata for these specific dataset IDs. Accepts a single string or a list of strings. If None, retrieves all datasets. |

| RETURNS | DESCRIPTION |
|---|---|
| MetadataResult | MetadataResult (smart dict of dataset name -> DatasetStats) |
Source code in src/choregraph/metadata.py
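The "single string or list of strings" handling of `dataset_ids` can be sketched as follows. This is an illustrative stand-in that returns plain dicts; the function name and the assumption that each entry stores its ID under an `"id"` key are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def read_stats_sketch(path, dataset_ids=None):
    """Load the stats file and optionally keep only entries with matching IDs."""
    path = Path(path)
    if not path.exists():
        return {}
    catalogue = json.loads(path.read_text(encoding="utf-8"))
    if dataset_ids is None:
        return catalogue
    # Normalize: a bare string becomes a one-element set.
    wanted = {dataset_ids} if isinstance(dataset_ids, str) else set(dataset_ids)
    return {name: stats for name, stats in catalogue.items()
            if stats.get("id") in wanted}   # "id" key is an assumption


with tempfile.TemporaryDirectory() as tmp:
    stats_file = Path(tmp) / "catalogue_stats.json"
    stats_file.write_text(json.dumps({
        "sales": {"id": "in_1", "row_count": 2, "fields": []},
        "stores": {"id": "in_2", "row_count": 5, "fields": []},
    }), encoding="utf-8")
    everything = read_stats_sketch(stats_file)
    only_sales = read_stats_sketch(stats_file, "in_1")
    both = read_stats_sketch(stats_file, ["in_1", "in_2"])
```

The file is read on every call, which matches the "no in-memory caching" behavior described for the Metadata class.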
clear
Clear the JSON file on disk.
Source code in src/choregraph/metadata.py
get
__contains__
__len__
remove_datasets
Remove datasets from catalogue_stats.json by name.
| PARAMETER | DESCRIPTION |
|---|---|
| names | Dataset names (filename stems) to remove. |

| RETURNS | DESCRIPTION |
|---|---|
| int | Number of datasets actually removed. |
Source code in src/choregraph/metadata.py
add_partition_field
Add virtual __partition__ field to a partitioned dataset's metadata.
The field doesn't exist in the actual data files — it represents the index of each partition (file) in the dataset.
| PARAMETER | DESCRIPTION |
|---|---|
| dataset_name | Name of the dataset in the catalogue. |
| n_partitions | Number of partitions. |
| partition_label | Semantic label (e.g. "time", "sheet", "slice"). |
Source code in src/choregraph/metadata.py
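A virtual partition field could look something like the sketch below, which appends a `__partition__` entry to an in-memory stats dict. The zero-based index range, the INTEGER type, storing the label under `info`, and the idempotence check are all assumptions made for illustration.

```python
def add_partition_field_sketch(stats, n_partitions, partition_label="partition"):
    """Append a virtual __partition__ field describing partition indices."""
    fields = stats.setdefault("fields", [])
    if any(f["name"] == "__partition__" for f in fields):
        return stats  # already present, so repeated calls change nothing
    fields.append({
        "id": str(len(fields)),                  # next sequential field ID
        "name": "__partition__",
        "data_type": "INTEGER",                  # assumed type for an index
        "min_value": 0,                          # assumed zero-based indexing
        "max_value": max(n_partitions - 1, 0),
        "distinct_count": n_partitions,
        "info": partition_label,                 # assumed place to keep the label
    })
    return stats


series_stats = {
    "row_count": 30,
    "fields": [{"id": "0", "name": "temp", "data_type": "FLOAT"}],
}
add_partition_field_sketch(series_stats, n_partitions=3, partition_label="time")
virtual = series_stats["fields"][-1]
```

Since the field is absent from the data files themselves, a consumer would need to map each index back to the corresponding file when loading a partition.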
merge_datasets
Merge pre-computed dataset entries into catalogue_stats.json.
Each entry should follow the catalogue_stats schema:

    {
        "row_count": int,
        "fields": [{"id", "name", "data_type", ...}],
        "type": "input",
        ...
    }
| PARAMETER | DESCRIPTION |
|---|---|
| entries | Dict of dataset_name -> stats dict. |

| RETURNS | DESCRIPTION |
|---|---|
| int | Number of datasets merged. |
Source code in src/choregraph/metadata.py
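A merge of this kind can be sketched as a read-update-write cycle on the JSON file. The function name is hypothetical, and the choice to let incoming entries overwrite existing entries of the same name is an assumption; the documentation above does not say how conflicts are resolved.

```python
import json
import tempfile
from pathlib import Path


def merge_datasets_sketch(path, entries):
    """Merge {dataset_name: stats} into the JSON file; return how many were merged."""
    path = Path(path)
    catalogue = {}
    if path.exists():
        catalogue = json.loads(path.read_text(encoding="utf-8"))
    catalogue.update(entries)  # assumption: incoming entries win on name clashes
    path.write_text(json.dumps(catalogue, indent=2), encoding="utf-8")
    return len(entries)


with tempfile.TemporaryDirectory() as tmp:
    stats_file = Path(tmp) / "catalogue_stats.json"
    first_count = merge_datasets_sketch(stats_file, {
        "sales": {"row_count": 2, "fields": [], "type": "input"},
    })
    second_count = merge_datasets_sketch(stats_file, {
        "sales": {"row_count": 3, "fields": [], "type": "input"},
        "stores": {"row_count": 5, "fields": [], "type": "input"},
    })
    merged = json.loads(stats_file.read_text(encoding="utf-8"))
```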
compute_file_stats
Compute metadata for one or more data files.
Accepts a single path or a list of paths. When multiple paths are given (e.g. all CSVs in a temporal group), tabular files are aggregated so that min/max/distinct reflect the full range across the group. Non-tabular formats only use the first path.
Supports CSV, TSV, Parquet, JSON, images (PNG/JPG/TIFF/BMP/WEBP/GIF),
and MHD volumes. Returns a stats dict in the same format as
catalogue_stats.json dataset entries, or None if the file type
is unsupported.
This is a standalone function — no Kedro, no workspace, no DB needed.
| PARAMETER | DESCRIPTION |
|---|---|
| file_paths | Absolute path to a file, or a list of paths to aggregate over. |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[dict] | Stats dict in the same format as catalogue_stats.json dataset entries, or None if the file type is unsupported. |
Source code in src/choregraph/metadata.py
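The group aggregation described above (min/max/distinct reflecting the full range across several files) can be illustrated for a single numeric CSV column using only the standard library. The function name, the one-column scope, and the returned keys are assumptions; the real function handles many formats and all columns.

```python
import csv
import tempfile
from pathlib import Path


def aggregate_csv_column(paths, column):
    """Combine min/max/distinct for one numeric column across several CSVs."""
    if isinstance(paths, (str, Path)):
        paths = [paths]  # a single path is treated as a one-element group
    lowest, highest, seen, rows = None, None, set(), 0
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                value = float(row[column])
                lowest = value if lowest is None else min(lowest, value)
                highest = value if highest is None else max(highest, value)
                seen.add(value)  # distinct values are pooled across files
                rows += 1
    return {"row_count": rows, "min_value": lowest, "max_value": highest,
            "distinct_count": len(seen)}


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "t0.csv"
    second = Path(tmp) / "t1.csv"
    first.write_text("temp\n1.5\n3.0\n", encoding="utf-8")
    second.write_text("temp\n3.0\n8.25\n", encoding="utf-8")
    group_stats = aggregate_csv_column([first, second], "temp")
    single_stats = aggregate_csv_column(first, "temp")
```

Pooling distinct values in a set, rather than summing per-file counts, avoids double counting values that appear in more than one file.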