LLM Manager¶

The AI service is provider-agnostic. The LLM provider (model, base URL, API key) is chosen once at startup from the operator's environment (.env, written by the setup) and is used for every request — clients cannot override it.

For the full provider plug-in walk-through (adding a new provider, local models via Ollama/llama.cpp, OpenAI-compatible proxies), see the main AI service README.

Layers¶

Code lives under ai_service/llm_processing/managers/.

Class	Role
`LLMManager` (`base.py`)	Abstract base — defines `ainvoke` / `ainvoke_structured`.
`LangChainLLMManager` (`generic.py`)	Default implementation. Wraps `langchain.init_chat_model` and handles structured-output retries (`json_schema` then `function_calling`, 3 attempts each).
`GeminiLLMManager` (`gemini.py`)	Subclass override that adds Google's context-cache support.
`RouterLLMManager` (`router.py`)	Singleton held by the FastAPI app. Dispatches a request to the right manager based on `llm_config.provider`.

Provider config (`LLMConfig`)¶

shared/llm_config.LLMConfig is the immutable dataclass carried through LangGraph nodes via RunnableConfig.configurable["llm_config"]. It always holds the server's boot-selected provider (same value for every request):

LLMConfig(
    provider="anthropic",
    api_key="...",
    base_url=None,        # optional — for OpenAI-compatible proxies
    model_override=None,  # optional — overrides yaml model for every node
)

At startup, ai_service walks PROVIDER_BOOT_ORDER (google_genai, openai, anthropic, mistralai, ollama, llamacpp) and picks the first provider whose env-level config (matching <PROVIDER>_API_KEY, or <PROVIDER>_BASE_URL + <PROVIDER>_MODEL for locals) passes a one-token smoke-test. That config is the boot-time default. If no provider passes, the service refuses to start — operators provide keys via the setup TUI, which writes them to docker-compose.yaml + .env.

LLMConfig.from_env_ordered() exposes the same boot list to callers (used in ai_service's lifespan). LLMConfig.from_env() returns the top-priority entry. ai_server.get_llm_config() returns the boot-selected default for every request — there is no header-based override.

Model pool¶

build_chat_model caches BaseChatModel instances in a pool keyed on (provider, model, base_url, sha256(api_key)[:12], frozen_kwargs) so users stay isolated and quantized-model swaps don't reuse a stale model object.

Token tracking¶

Every LLM call records token usage via TurnMetrics (turn_metrics.py):

input_tokens — tokens in the prompt
output_tokens — tokens generated
cached_tokens — tokens served from prompt cache (Gemini only currently)
latency_ms — round-trip time
cost_label — human-readable label set by the node ("XML Generation Node", "Feedback Node", …)

Metrics are persisted per-turn in turn_metrics_record for cost analysis and rendered as an end-of-turn table in service logs.

Per-node config (yaml)¶

Per-node kwargs (model, temperature, thinking…) live in llm_processing/configs/<provider>.yaml in the provider's native syntax — no abstraction layer. See the main README for the merging rules.