Loaders
CSV sniffing and catalog load argument preparation utilities. Detects separators and headers
for catalog.yml generation, ensuring input files are parsed correctly by Kedro datasets.
loaders
Data loading utilities for CSV sniffing, characterization, and catalog configuration.
Provides CSV dialect detection (separator, header), full heuristic CSV
characterization (skip-line detection, field separator, record separator),
and prepares load_args dictionaries for Kedro catalog.yml generation.
When heuristic detection fails or produces an unusual delimiter, an optional LLM fallback (Google Gemini) is attempted before returning hard-coded defaults.
LLMCsvCharacterization
Bases: BaseModel
Structured output returned by the LLM for CSV characterization.
set_csv_llm_delegate
Register (or clear with None) the CSV LLM characterization delegate.
The delegate takes the file's first sample lines and returns the same dict
shape as _llm_characterize_csv (or None on failure).
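The register-or-clear pattern described above can be sketched as follows. This is a minimal illustration, not the module's code: the names set_delegate_sketch, call_delegate_sketch, and fake_delegate are hypothetical, and the "sep" key in the returned dict is only a placeholder for whatever shape the real delegate returns.

```python
from typing import Callable, List, Optional

# Hypothetical type alias: a delegate receives sample lines and returns a dict or None.
CsvLlmDelegate = Callable[[List[str]], Optional[dict]]

_csv_llm_delegate: Optional[CsvLlmDelegate] = None


def set_delegate_sketch(delegate: Optional[CsvLlmDelegate]) -> None:
    """Register the delegate, or clear it by passing None."""
    global _csv_llm_delegate
    _csv_llm_delegate = delegate


def call_delegate_sketch(sample_lines: List[str]) -> Optional[dict]:
    """Return the delegate's result, or None when no delegate is set or it raises."""
    if _csv_llm_delegate is None:
        return None
    try:
        return _csv_llm_delegate(sample_lines)
    except Exception:
        # A failing delegate is treated like a missing one.
        return None


def fake_delegate(sample_lines: List[str]) -> Optional[dict]:
    # Stand-in for a service call; the "sep" key is illustrative only.
    if sample_lines and ";" in sample_lines[0]:
        return {"sep": ";"}
    return None
```

Keeping the delegate behind a module-level setter lets a service inject the remote call while standalone callers run without any LLM dependency.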
Source code in src/choregraph/loaders.py
split_csv_line
Parse a CSV line with the given delimiter, considering quotes.
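One way to split a single line while respecting quotes is to hand it to the standard library csv.reader. The sketch below assumes a hypothetical function name and signature (split_line_sketch, with a quotechar parameter); the actual signature of split_csv_line is not shown on this page.

```python
import csv
from typing import List


def split_line_sketch(line: str, delimiter: str = ",", quotechar: str = '"') -> List[str]:
    """Split one CSV line into fields, keeping delimiters inside quotes intact."""
    reader = csv.reader([line], delimiter=delimiter, quotechar=quotechar)
    # An empty input line yields an empty field list.
    return next(reader, [])
```

For example, split_line_sketch('a;"b;c";d', ";") keeps the quoted "b;c" as a single field.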
detect_skip_lines_smart
Detect where the data starts by fully parsing the CSV (including multiline quoted fields).
Tries each delimiter in delimiter_list and returns the index of the first line that begins a consistent run of rows with similar column counts.
| PARAMETER | DESCRIPTION |
|---|---|
| lines | Lines read from the CSV file (first ~50 lines). |
| delimiter_list | Delimiters to try (e.g. …). |
| tolerance | Allowed difference in column count between consecutive rows. |
| min_ok_lines | Minimum consecutive consistent rows to confirm data start. |

| RETURNS | DESCRIPTION |
|---|---|
| int | Number of lines to skip (0 if data starts at the beginning). |
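The detection idea can be sketched as below. This is an illustrative reading of the description, not the module's implementation. Assumptions are labeled in the comments: the default delimiter list and default values are made up, a run only counts if every row has at least two columns, and the first delimiter that yields a run wins.

```python
import csv
from typing import List, Sequence, Tuple


def _parse_records(lines: Sequence[str], delimiter: str) -> List[Tuple[int, int]]:
    """Return (start_line_index, column_count) per record, honoring multiline quotes."""
    reader = csv.reader(iter(lines), delimiter=delimiter)
    records: List[Tuple[int, int]] = []
    start = 0
    while True:
        try:
            row = next(reader)
        except (StopIteration, csv.Error):
            break
        records.append((start, len(row)))
        # line_num counts physical lines consumed so far, so it is the
        # 0-based index of the line where the next record begins.
        start = reader.line_num
    return records


def detect_skip_lines_sketch(
    lines: Sequence[str],
    delimiter_list: Sequence[str] = (";", ",", "\t", "|"),  # assumed default
    tolerance: int = 0,  # assumed default
    min_ok_lines: int = 3,  # assumed default
) -> int:
    """Return the index of the first line that starts a consistent run of rows."""
    for delimiter in delimiter_list:
        records = _parse_records(lines, delimiter)
        for i in range(len(records) - min_ok_lines + 1):
            counts = [ncols for _, ncols in records[i:i + min_ok_lines]]
            # Assumption: single-column rows are preamble, not data.
            if any(ncols < 2 for ncols in counts):
                continue
            if all(abs(a - b) <= tolerance for a, b in zip(counts, counts[1:])):
                return records[i][0]
    return 0
```

Tracking the physical start line of each record, rather than the record index, keeps the result correct when a quoted field spans several lines.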
remove_skiplines
Delete the first skip_lines lines from a file using buffered I/O.
| PARAMETER | DESCRIPTION |
|---|---|
| filepath | Path to the file to modify. |
| skip_lines | Number of lines to remove from the top. |
| buffer_size | Read buffer size in bytes. |

| RETURNS | DESCRIPTION |
|---|---|
| bool | |
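A buffered removal of leading lines could look like the following sketch. It is written under assumptions: the function name is hypothetical, the default buffer size is made up, and the boolean is taken to mean "True on success, False on an I/O error", which the page does not confirm.

```python
import os
import shutil
import tempfile


def remove_skiplines_sketch(filepath: str, skip_lines: int, buffer_size: int = 1024 * 1024) -> bool:
    """Drop the first skip_lines lines of a file by streaming the rest to a temp file."""
    if skip_lines <= 0:
        return True  # nothing to remove
    directory = os.path.dirname(os.path.abspath(filepath))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(filepath, "rb") as src:
            for _ in range(skip_lines):
                if not src.readline():
                    break  # file has fewer lines than requested
            # Copy the remainder in chunks of buffer_size bytes.
            shutil.copyfileobj(src, dst, buffer_size)
        os.replace(tmp_path, filepath)
        return True
    except OSError:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        return False
```

Streaming through a temporary file avoids loading the whole CSV into memory, and the final os.replace swaps the files in one step.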
characterize_csv
Full heuristic CSV characterization with optional LLM fallback.
Detects field separator, record separator, header presence, and
non-data preamble lines. When the csv.Sniffer heuristic fails,
detects a separator outside the usual set [; , \t |], or
produces inconsistent column counts across rows, the function
attempts to characterize the file via an LLM.
If the LLM is unavailable or also fails, safe fallback values are
returned.
The LLM step always runs server-side: a service registers a delegate via
set_csv_llm_delegate (file_service registers one that calls the
ai_service /ai/characterize_csv endpoint, the same pattern as Excel
tidying via /ai/preprocess_excel), keeping provider credentials confined
to the ai_service. When no delegate is registered (e.g. metadata stat
reading, or a standalone call), the LLM step is simply skipped and the
heuristic / safe-default values are returned.
| PARAMETER | DESCRIPTION |
|---|---|
| filepath | Path to the CSV file. |

| RETURNS | DESCRIPTION |
|---|---|
| dict | Dict with keys … |
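The fallback order described for characterize_csv (heuristic first, then the LLM delegate, then safe defaults) can be sketched as a small decision function. Everything named here is hypothetical: the dict keys, the SAFE_DEFAULTS values, and the function names are placeholders, since the real keys are not listed on this page.

```python
from typing import Callable, Dict, List, Optional

USUAL_SEPARATORS = {";", ",", "\t", "|"}
# Hypothetical safe defaults; the real values are not shown in this page.
SAFE_DEFAULTS: Dict[str, object] = {"sep": ",", "header": 0, "skip_lines": 0}


def needs_llm(sep: Optional[str], column_counts: List[int]) -> bool:
    """True when the heuristic failed, found an unusual separator, or rows disagree."""
    if sep is None or sep not in USUAL_SEPARATORS:
        return True
    return len(set(column_counts)) > 1


def characterize_sketch(
    heuristic: Optional[Dict[str, object]],
    column_counts: List[int],
    sample_lines: List[str],
    delegate: Optional[Callable[[List[str]], Optional[dict]]] = None,
) -> Dict[str, object]:
    """Pick the heuristic result, the delegate result, or safe defaults, in that order."""
    sep = heuristic.get("sep") if heuristic else None
    if heuristic is not None and not needs_llm(sep, column_counts):
        return heuristic
    if delegate is not None:
        llm_result = delegate(sample_lines)
        if llm_result is not None:
            return llm_result
    # No delegate, or the delegate failed: keep what the heuristic found, else defaults.
    return heuristic if heuristic is not None else dict(SAFE_DEFAULTS)
```

Passing the delegate in as an argument mirrors the registration mechanism: when nothing is registered, the LLM step simply does not run.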
sniff_csv_options
Detect CSV separator and header row from a file sample.
Reads the first 2048 bytes of the file and uses csv.Sniffer to
detect the dialect.
| PARAMETER | DESCRIPTION |
|---|---|
| filepath | Path to the CSV file. |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | Dict with detected … (… detection fails). |
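Sniffing a 2048-byte sample with csv.Sniffer might look like the sketch below. The function name, the "sep" and "header" keys, the candidate delimiter string, and the fallback values are assumptions chosen to resemble pandas-style load arguments; the page does not state the real ones.

```python
import csv
from typing import Any, Dict


def sniff_csv_sketch(filepath: str, sample_size: int = 2048) -> Dict[str, Any]:
    """Guess the separator and header row from the first sample_size bytes."""
    with open(filepath, "rb") as handle:
        # Decode leniently so a multi-byte character cut at the boundary cannot raise.
        sample = handle.read(sample_size).decode("utf-8", errors="replace")
    try:
        sniffer = csv.Sniffer()
        dialect = sniffer.sniff(sample, delimiters=";,\t|")
        has_header = sniffer.has_header(sample)
    except csv.Error:
        # Assumed defaults when detection fails.
        return {"sep": ",", "header": 0}
    return {"sep": dialect.delimiter, "header": 0 if has_header else None}
```

csv.Sniffer raises csv.Error when it cannot determine a delimiter, which is why the defaults are returned from the except branch.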
prepare_load_args
Prepare load_args for Kedro catalog.yml generation.
Merges user-provided options with auto-detected CSV settings.
| PARAMETER | DESCRIPTION |
|---|---|
| fmt | Data format string (e.g. …). |
| location | File path for auto-detection. |
| options | User-provided load options (take precedence over sniffed values). |

| RETURNS | DESCRIPTION |
|---|---|
| Dict[str, Any] | Dict of load arguments suitable for … |
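The precedence rule (user options override sniffed values) can be illustrated with a short sketch. The function name and the injected sniff callable are hypothetical, and restricting auto-detection to the "csv" format is an assumption drawn from the phrase "auto-detected CSV settings".

```python
from typing import Any, Callable, Dict, Optional


def prepare_load_args_sketch(
    fmt: str,
    location: str,
    options: Optional[Dict[str, Any]] = None,
    sniff: Optional[Callable[[str], Dict[str, Any]]] = None,  # hypothetical injection point
) -> Dict[str, Any]:
    """Merge sniffed CSV settings with user options; user options win on conflicts."""
    load_args: Dict[str, Any] = {}
    # Assumption: only CSV inputs are auto-detected.
    if fmt.lower() == "csv" and sniff is not None:
        load_args.update(sniff(location))
    # Applied last so that user-provided keys overwrite sniffed ones.
    load_args.update(options or {})
    return load_args
```

Updating with the user options last is what gives them precedence; keys the user did not set keep their detected values.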