Media Understanding
Fased can summarize inbound media (image/audio/video) before the reply pipeline runs. It auto-detects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.
This is a Gateway media/tool pipeline. It is not the Advanced > Nodes diagnostics tab. Use Agent > Services for friendly provider setup when available, Agent > Tools for per-Agent tool access, and Advanced > Config only for raw tools.media overrides.
Goals
- Optional: pre‑digest inbound media into short text for faster routing + better command parsing.
- Preserve original media delivery to the model (always).
- Support provider APIs and CLI fallbacks.
- Allow multiple models with ordered fallback (error/size/timeout).
High‑level behavior
- Collect inbound attachments (MediaPaths, MediaUrls, MediaTypes).
- For each enabled capability (image/audio/video), select attachments per policy (default: first).
- Choose the first eligible model entry (size + capability + auth).
- If a model fails or the media is too large, fall back to the next entry.
- On success:
  - Body becomes [Image], [Audio], or [Video] block.
  - Audio sets {{Transcript}}; command parsing uses caption text when present, otherwise the transcript.
  - Captions are preserved as User text: inside the block.
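The selection and fallback steps above can be sketched as a loop over an ordered model list. This is a minimal illustration, not Fased's implementation: the `ModelEntry` shape, the `hasAuth` flag, and the synchronous `run` callback are all hypothetical names chosen for the example.

```typescript
// Hypothetical shapes for illustration; not Fased's actual types.
type Capability = "image" | "audio" | "video";

interface ModelEntry {
  id: string; // e.g. "openai/gpt-4o-mini-transcribe"
  capabilities: Capability[];
  maxBytes: number;
  hasAuth: boolean;
  run: (media: Uint8Array) => string; // throws on failure or timeout
}

interface Outcome {
  modelId: string;
  text: string;
}

// Walk the ordered list: skip ineligible entries, fall back when one fails.
function understand(
  capability: Capability,
  media: Uint8Array,
  models: ModelEntry[],
): Outcome | undefined {
  for (const entry of models) {
    const eligible =
      entry.capabilities.includes(capability) &&
      entry.hasAuth &&
      media.byteLength <= entry.maxBytes;
    if (!eligible) continue;
    try {
      return { modelId: entry.id, text: entry.run(media) };
    } catch {
      // This model failed; try the next entry.
    }
  }
  // Nothing worked. Understanding is best-effort, so the reply still proceeds.
  return undefined;
}
```

Returning `undefined` instead of throwing mirrors the best-effort behavior described on this page: a failed summary never blocks the reply.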
Config overview
tools.media supports shared models plus per‑capability overrides:
- tools.media.models: shared model list (use capabilities to gate).
- tools.media.image / tools.media.audio / tools.media.video:
  - defaults (prompt, maxChars, maxBytes, timeoutSeconds, language)
  - provider overrides (baseUrl, headers, providerOptions)
  - Deepgram audio options via tools.media.audio.providerOptions.deepgram
  - optional per‑capability models list (preferred before shared models)
  - attachments policy (mode, maxAttachments, prefer)
  - scope (optional gating by channel/chatType/session key)
- tools.media.concurrency: max concurrent capability runs (default 2).
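As a rough picture of how these keys nest, here is a sketch written as a TypeScript object literal. The key names under `tools.media` come from this page; the fields inside each `models[]` entry (`provider`, `model`), the `timeoutSeconds` and `language` values, and the empty Deepgram options object are illustrative assumptions.

```typescript
// Sketch of the tools.media layout. Entry fields and all values are assumptions.
const mediaConfig = {
  tools: {
    media: {
      concurrency: 2, // max concurrent capability runs (documented default)
      // Shared list; capabilities gates which media types an entry handles.
      models: [
        {
          provider: "google", // assumed field name
          model: "gemini-3-flash-preview", // assumed field name
          capabilities: ["image", "audio", "video"],
        },
      ],
      image: {
        maxChars: 500,
        maxBytes: 10 * 1024 * 1024, // 10MB, assuming binary megabytes
        timeoutSeconds: 60, // illustrative value
      },
      audio: {
        language: "en", // illustrative value
        providerOptions: { deepgram: {} }, // Deepgram-specific options go here
        attachments: { mode: "first", maxAttachments: 1, prefer: "first" },
      },
    },
  },
};
```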
Model entries
Each models[] entry can be provider or CLI.
CLI templates can use these placeholders:
- {{MediaDir}} (directory containing the media file)
- {{OutputDir}} (scratch dir created for this run)
- {{OutputBase}} (scratch file base path, no extension)
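To illustrate how such placeholders could be filled in, the sketch below substitutes `{{Name}}` tokens in CLI arguments. The `expandTemplate` helper, the example paths, and the `--out` / `--dir` flags are hypothetical; Fased's actual expansion logic may differ.

```typescript
// Replace {{Name}} placeholders in one CLI argument.
// Unknown placeholder names are left untouched.
function expandTemplate(arg: string, vars: ReadonlyMap<string, string>): string {
  return arg.replace(
    /\{\{(\w+)\}\}/g,
    (match: string, name: string) => vars.get(name) ?? match,
  );
}

// Hypothetical values for a single run.
const runVars = new Map([
  ["MediaDir", "/tmp/media"],
  ["OutputDir", "/tmp/run-1"],
  ["OutputBase", "/tmp/run-1/out"],
]);

// Hypothetical CLI arguments using the placeholders.
const expandedArgs = ["--out", "{{OutputBase}}.txt", "--dir", "{{MediaDir}}"].map(
  (arg) => expandTemplate(arg, runVars),
);
```

Leaving unknown names untouched makes a misspelled placeholder visible in the final command instead of silently turning it into an empty string.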
Defaults and limits
Recommended defaults:
- maxChars: 500 for image/video (short, command‑friendly)
- maxChars: unset for audio (full transcript unless you set a limit)
- maxBytes:
  - image: 10MB
  - audio: 20MB
  - video: 50MB
- If media exceeds maxBytes, that model is skipped and the next model is tried.
- If the model returns more than maxChars, output is trimmed.
- prompt defaults to simple “Describe the .” plus the maxChars guidance (image/video only).
- If <capability>.enabled: true but no models are configured, Fased tries the active reply model when its provider supports the capability.
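The size and length limits can be expressed as two small checks. This is a sketch under stated assumptions: the helper names are hypothetical, megabytes are treated as 1024 × 1024 bytes, and trimming is modeled as a plain cut with no ellipsis, which may not match what Fased does.

```typescript
type Capability = "image" | "audio" | "video";

const MB = 1024 * 1024; // assumption: binary megabytes

// Recommended defaults from this page.
const DEFAULT_MAX_BYTES: Record<Capability, number> = {
  image: 10 * MB,
  audio: 20 * MB,
  video: 50 * MB,
};
const DEFAULT_MAX_CHARS: Record<Capability, number | undefined> = {
  image: 500,
  audio: undefined, // full transcript unless a limit is set
  video: 500,
};

// True when the model should be skipped for this media size.
function isTooLarge(capability: Capability, sizeBytes: number, maxBytes?: number): boolean {
  return sizeBytes > (maxBytes ?? DEFAULT_MAX_BYTES[capability]);
}

// Trim model output to the configured (or default) character limit.
function trimOutput(capability: Capability, text: string, maxChars?: number): string {
  const limit = maxChars ?? DEFAULT_MAX_CHARS[capability];
  return limit === undefined ? text : text.slice(0, limit);
}
```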
Auto-detect media understanding (default)
If tools.media.<capability>.enabled is not set to false and you haven’t
configured models, Fased auto-detects in this order and stops at the first
working option:
- Local CLIs (audio only; if installed)
  - sherpa-onnx-offline (requires SHERPA_ONNX_MODEL_DIR with encoder/decoder/joiner/tokens)
  - whisper-cli (whisper-cpp; uses WHISPER_CPP_MODEL or the bundled tiny model)
  - whisper (Python CLI; downloads models automatically)
- Gemini CLI (gemini) using read_many_files
- Provider keys
  - Audio: OpenAI → Groq → Deepgram → Google
  - Image: OpenAI → Anthropic → Google → MiniMax
  - Video: Google
Ensure the CLI is on PATH (we expand ~), or set an explicit CLI model with a full command path.
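The detection order can be sketched as a function over what is available on the host. The `Environment` shape and the `cli:` / `provider:` result strings are hypothetical. The sketch also simplifies in two ways: it treats "installed" as "working" (ignoring requirements such as SHERPA_ONNX_MODEL_DIR), and it assumes the Gemini CLI step applies to every capability.

```typescript
type Capability = "image" | "audio" | "video";

// Orders taken from the list above.
const LOCAL_AUDIO_CLIS = ["sherpa-onnx-offline", "whisper-cli", "whisper"];
const PROVIDER_ORDER: Record<Capability, string[]> = {
  audio: ["openai", "groq", "deepgram", "google"],
  image: ["openai", "anthropic", "google", "minimax"],
  video: ["google"],
};

// Hypothetical snapshot of the host.
interface Environment {
  installedClis: Set<string>; // binaries found on PATH
  providersWithKeys: Set<string>; // providers that have an API key configured
}

// Return the first available option, or undefined when nothing is found.
function autoDetect(capability: Capability, env: Environment): string | undefined {
  if (capability === "audio") {
    const cli = LOCAL_AUDIO_CLIS.find((name) => env.installedClis.has(name));
    if (cli !== undefined) return `cli:${cli}`;
  }
  if (env.installedClis.has("gemini")) return "cli:gemini";
  const provider = PROVIDER_ORDER[capability].find((name) =>
    env.providersWithKeys.has(name),
  );
  return provider !== undefined ? `provider:${provider}` : undefined;
}
```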
Capabilities (optional)
If you set capabilities, the entry only runs for those media types. For shared
lists, Fased can infer defaults:
- openai, anthropic, minimax: image
- google (Gemini API): image + audio + video
- groq: audio
- deepgram: audio
Set capabilities explicitly to avoid surprising matches.
If you omit capabilities, the entry is eligible for the list it appears in.
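A minimal sketch of this gating for a shared list follows. The `SharedEntry` shape and the `runsFor` helper are hypothetical; the inferred defaults are the ones listed above. Treating an entry with no declared or inferable capabilities as eligible is an assumption based on the last sentence.

```typescript
type Capability = "image" | "audio" | "video";

// Inferred defaults for shared lists, as listed on this page.
const INFERRED_CAPABILITIES = new Map<string, Capability[]>([
  ["openai", ["image"]],
  ["anthropic", ["image"]],
  ["minimax", ["image"]],
  ["google", ["image", "audio", "video"]],
  ["groq", ["audio"]],
  ["deepgram", ["audio"]],
]);

// Hypothetical entry shape.
interface SharedEntry {
  provider: string;
  capabilities?: Capability[];
}

// Explicit capabilities win; otherwise fall back to the inferred defaults.
function runsFor(entry: SharedEntry, capability: Capability): boolean {
  const caps = entry.capabilities ?? INFERRED_CAPABILITIES.get(entry.provider);
  // Assumption: nothing declared and nothing inferable means the entry stays eligible.
  return caps === undefined ? true : caps.includes(capability);
}
```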
Provider support matrix (Fased integrations)
| Capability | Provider integration | Notes |
|---|---|---|
| Image | OpenAI / Anthropic / Google / others via the Fased provider registry | Any image-capable model in the registry works. |
| Audio | OpenAI, Groq, Deepgram, Google, Mistral | Provider transcription (Whisper/Deepgram/Gemini/Voxtral). |
| Video | Google (Gemini API) | Provider video understanding. |
Recommended providers
Image
- Prefer your active model if it supports images.
- Good defaults: openai/gpt-5.5, anthropic/claude-opus-4.7, google/gemini-3.1-pro-preview.
Audio
- openai/gpt-4o-mini-transcribe, groq/whisper-large-v3-turbo, deepgram/nova-3, or mistral/voxtral-mini-latest.
- CLI fallback: whisper-cli (whisper-cpp) or whisper.
- Deepgram is configured through tools.media.audio as a media transcription provider, not as an Agent model provider.
Video
- google/gemini-3-flash-preview (fast), google/gemini-3.1-pro-preview (richer).
- CLI fallback: gemini CLI (supports read_file on video/audio).
Attachment policy
Per‑capability attachments controls which attachments are processed:
- mode: first (default) or all
- maxAttachments: cap the number processed (default 1)
- prefer: first, last, path, url
When mode: "all", outputs are labeled [Image 1/2], [Audio 2/2], etc.
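One way to read the policy is sketched below. The `Attachment` shape and both helper names are hypothetical, and the semantics of `prefer: path` / `prefer: url` (moving attachments that have a local path or a URL to the front) are an assumption, since this page does not define them.

```typescript
// Hypothetical attachment shape.
interface Attachment {
  path?: string;
  url?: string;
}

interface AttachmentPolicy {
  mode?: "first" | "all";
  maxAttachments?: number;
  prefer?: "first" | "last" | "path" | "url";
}

// Order attachments by preference, then apply the mode and cap.
function selectAttachments(items: Attachment[], policy: AttachmentPolicy = {}): Attachment[] {
  const prefer = policy.prefer ?? "first";
  let ordered: Attachment[];
  if (prefer === "last") {
    ordered = [...items].reverse();
  } else if (prefer === "path" || prefer === "url") {
    // Assumed meaning: attachments that have the preferred source come first.
    const key: "path" | "url" = prefer;
    const hasKey = (a: Attachment): boolean => a[key] !== undefined;
    ordered = [...items.filter(hasKey), ...items.filter((a) => !hasKey(a))];
  } else {
    ordered = [...items];
  }
  // mode "first" processes one attachment; mode "all" is capped by
  // maxAttachments, which defaults to 1 per the list above.
  const limit = (policy.mode ?? "first") === "first" ? 1 : (policy.maxAttachments ?? 1);
  return ordered.slice(0, limit);
}

// Label a block; numbering is only used when several attachments are processed.
function blockLabel(kind: "Image" | "Audio" | "Video", index: number, total: number): string {
  return total > 1 ? `[${kind} ${index + 1}/${total}]` : `[${kind}]`;
}
```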
Config examples
1) Shared models list + overrides
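A sketch of what such a config might look like, written as a TypeScript object literal. The `provider` and `model` field names inside entries are assumptions; the surrounding keys and the model identifiers appear elsewhere on this page.

```typescript
// Shared list used by every capability, with a per-capability audio override.
const sharedModelsConfig = {
  tools: {
    media: {
      models: [
        { provider: "openai", model: "gpt-5.5", capabilities: ["image"] },
        {
          provider: "google",
          model: "gemini-3-flash-preview",
          capabilities: ["image", "audio", "video"],
        },
      ],
      audio: {
        // Per-capability models are preferred before the shared list.
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
      video: { maxChars: 500 },
    },
  },
};
```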
2) Audio + Video only (image off)
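A possible shape for this case, again as a hedged sketch: `enabled` is the documented switch, while the entry fields (`provider`, `model`) are assumed names.

```typescript
// Image understanding disabled; audio and video each get an explicit model.
const audioVideoOnlyConfig = {
  tools: {
    media: {
      image: { enabled: false },
      audio: {
        enabled: true,
        models: [{ provider: "groq", model: "whisper-large-v3-turbo" }],
      },
      video: {
        enabled: true,
        maxChars: 500,
        models: [{ provider: "google", model: "gemini-3-flash-preview" }],
      },
    },
  },
};
```

Images are still delivered to the reply model in this setup; only the pre-digest step is skipped.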
3) Optional image understanding
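A sketch of an image-only setup with ordered fallback. The limits are the recommended defaults from this page; the entry field names are assumptions.

```typescript
// Image understanding with two provider entries tried in order.
const imageUnderstandingConfig = {
  tools: {
    media: {
      image: {
        enabled: true,
        maxBytes: 10 * 1024 * 1024, // 10MB, assuming binary megabytes
        maxChars: 500,
        models: [
          { provider: "openai", model: "gpt-5.5" },
          { provider: "anthropic", model: "claude-opus-4.7" },
        ],
      },
    },
  },
};
```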
4) Multi-modal single entry (explicit capabilities)
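A sketch of a single shared entry that serves all three media types through explicit capabilities. The entry field names (`provider`, `model`) are assumptions.

```typescript
// One shared entry, explicitly gated to image, audio, and video.
const multiModalConfig = {
  tools: {
    media: {
      models: [
        {
          provider: "google",
          model: "gemini-3.1-pro-preview",
          capabilities: ["image", "audio", "video"],
        },
      ],
    },
  },
};
```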
Status output
When media understanding runs, /status includes a short summary line.
Notes
- Understanding is best‑effort. Errors do not block replies.
- Attachments are still passed to models even when understanding is disabled.
- Use scope to limit where understanding runs (e.g. only DMs).