Media Understanding

Fased can summarize inbound media (image/audio/video) before the reply pipeline runs. It auto-detects local tools or provider keys when available, and you can disable or customize it. If understanding is off, models still receive the original files or URLs as usual.

This is a Gateway media/tool pipeline, not the Advanced > Nodes diagnostics tab. Use Agent > Services for friendly provider setup when available, Agent > Tools for per-Agent tool access, and Advanced > Config only for raw tools.media overrides.

Goals

Optional: turn inbound media into short text for faster routing and better command parsing.
Preserve original media delivery to the model (always).
Support provider APIs and CLI fallbacks.
Allow multiple models with ordered fallback for error, size, and timeout cases.

High‑level behavior

1. Collect inbound attachments (MediaPaths, MediaUrls, MediaTypes).
2. For each enabled capability (image/audio/video), select attachments per policy. Default: first.
3. Choose the first eligible model entry (size + capability + auth).
4. If a model fails or the media is too large, fall back to the next entry.
5. On success:
- Body becomes an [Image], [Audio], or [Video] block.
- Audio sets {{Transcript}}; command parsing uses caption text when present, otherwise the transcript.
- Captions are preserved as User text: inside the block.

If understanding fails or is disabled, the reply flow continues with the original body and attachments.
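
The selection and fallback steps above can be sketched as follows. This is a minimal illustration, not Fased's implementation: ModelEntry, understand, and run_entry are hypothetical names, and the auth check is reduced to a boolean.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ModelEntry:
    # Hypothetical, simplified stand-in for one models[] entry.
    name: str
    capabilities: tuple
    max_bytes: int
    has_auth: bool = True


def understand(
    capability: str,
    size_bytes: int,
    entries: list,
    run_entry: Callable[[ModelEntry], str],
) -> Optional[str]:
    """Return the first successful output, or None if every entry is skipped or fails."""
    for entry in entries:
        eligible = (
            capability in entry.capabilities
            and size_bytes <= entry.max_bytes
            and entry.has_auth
        )
        if not eligible:
            continue  # wrong capability, media too large, or no credentials
        try:
            return run_entry(entry)
        except Exception:
            continue  # error or timeout: fall back to the next entry
    return None  # caller continues with the original body and attachments
```

Returning None rather than raising mirrors the rule that a failed understanding pass never blocks the reply flow.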

Config overview

tools.media supports shared models plus per‑capability overrides:

tools.media.models: shared model list (use capabilities to gate).
tools.media.image / tools.media.audio / tools.media.video:
- defaults (prompt, maxChars, maxBytes, timeoutSeconds, language)
- provider overrides (baseUrl, headers, providerOptions)
- Deepgram audio options via tools.media.audio.providerOptions.deepgram
- optional per‑capability models list (preferred before shared models)
- attachments policy (mode, maxAttachments, prefer)
- scope (optional gating by channel/chatType/session key)
tools.media.concurrency: max concurrent capability runs. Default: 2.

{
  tools: {
    media: {
      models: [
        /* shared list */
      ],
      image: {
        /* optional overrides */
      },
      audio: {
        /* optional overrides */
      },
      video: {
        /* optional overrides */
      },
    },
  },
}
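
As a rough sketch of how the two lists combine, the function below puts per-capability entries first and then appends shared entries gated by capabilities. The function name candidate_models and the plain-dict config shape are assumptions for illustration. For simplicity it only keeps shared entries that declare the capability explicitly.

```python
def candidate_models(media_config: dict, capability: str) -> list:
    """Per-capability entries first, then shared entries gated by capabilities."""
    per_capability = media_config.get(capability, {}).get("models", [])
    shared = [
        entry
        for entry in media_config.get("models", [])
        # Simplification: only shared entries with an explicit match are kept.
        if capability in entry.get("capabilities", [])
    ]
    return list(per_capability) + shared
```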

Model entries

Each models[] entry is either a provider entry or a CLI entry:

{
  type: "provider", // default if omitted
  provider: "openai",
  model: "gpt-5.5",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760,
  timeoutSeconds: 60,
  capabilities: ["image"], // optional, used for multi‑modal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback",
}

{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
  ],
  maxChars: 500,
  maxBytes: 52428800,
  timeoutSeconds: 120,
  capabilities: ["video", "image"],
}

CLI templates can also use:

{{MediaDir}} (directory containing the media file)
{{OutputDir}} (scratch dir created for this run)
{{OutputBase}} (scratch file base path, no extension)
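
To make the placeholders concrete, here is a minimal sketch of how a CLI args template could be expanded. The function expand_args, its parameters, and the "output" base file name are hypothetical. Fased's actual expansion may differ.

```python
import os
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand_args(args, media_path, max_chars, output_dir):
    """Replace {{Name}} placeholders in each CLI argument."""
    values = {
        "MediaPath": media_path,
        "MediaDir": os.path.dirname(media_path),
        "OutputDir": output_dir,
        # Assumed naming: a base path inside the scratch dir, no extension.
        "OutputBase": os.path.join(output_dir, "output"),
        "MaxChars": str(max_chars),
    }
    # Unknown placeholders are left untouched rather than replaced with "".
    return [
        PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), arg)
        for arg in args
    ]
```

For example, expand_args(["--model", "base", "{{MediaPath}}"], "/tmp/in/voice.ogg", 500, "/tmp/out") yields the argument list for the whisper entry with the real file path substituted.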

Defaults and limits

Recommended defaults:

maxChars: 500 for image/video (short, command‑friendly)
maxChars: unset for audio (full transcript unless you set a limit)
maxBytes:
- image: 10MB
- audio: 20MB
- video: 50MB

Rules:

If media exceeds maxBytes, that model is skipped and the next model is tried.
If the model returns more than maxChars, output is trimmed.
prompt defaults to a simple “Describe the {media}.” plus the maxChars guidance (image/video only).
If <capability>.enabled: true but no models are configured, Fased tries the active reply model when its provider supports the capability.
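
The size and length rules can be illustrated with a short sketch. The helper names are hypothetical, the MB values assume 1 MB = 1024 * 1024 bytes (consistent with the 10485760 and 52428800 values used in the examples), and plain truncation is an assumption about how trimming works.

```python
MB = 1024 * 1024

# Recommended defaults from this page.
DEFAULT_MAX_BYTES = {"image": 10 * MB, "audio": 20 * MB, "video": 50 * MB}
DEFAULT_MAX_CHARS = {"image": 500, "video": 500, "audio": None}  # None = no limit


def within_size(capability, size_bytes, max_bytes=None):
    """True if the media does not exceed maxBytes; otherwise the model is skipped."""
    limit = DEFAULT_MAX_BYTES[capability] if max_bytes is None else max_bytes
    return size_bytes <= limit


def trim_output(capability, text, max_chars=None):
    """Trim model output to maxChars; audio is untrimmed unless a limit is set."""
    limit = DEFAULT_MAX_CHARS[capability] if max_chars is None else max_chars
    return text if limit is None else text[:limit]
```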

Auto-detect media understanding (default)

If tools.media.<capability>.enabled is not set to false and you haven’t configured models, Fased auto-detects in this order and stops at the first working option:

1. Local CLIs (audio only; if installed)
- sherpa-onnx-offline (requires SHERPA_ONNX_MODEL_DIR with encoder/decoder/joiner/tokens)
- whisper-cli (whisper-cpp; uses WHISPER_CPP_MODEL or the bundled tiny model)
- whisper (Python CLI; downloads models automatically)
2. Gemini CLI (gemini) using read_many_files
3. Provider keys
- Audio: OpenAI → Groq → Deepgram → Google
- Image: OpenAI → Anthropic → Google → MiniMax
- Video: Google

To disable auto-detection, set:

{
  tools: {
    media: {
      audio: {
        enabled: false,
      },
    },
  },
}

Binary detection is best-effort across macOS/Linux/Windows. Ensure the CLI is on PATH (Fased expands ~), or set an explicit CLI model with a full command path.
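
The detection order can be approximated as below. This is a sketch under stated assumptions: the environment variable names for provider keys are guesses based on common conventions, not confirmed Fased settings, and the sketch only checks that a tool or key is present, not that it actually works.

```python
import os
import shutil

LOCAL_AUDIO_CLIS = ["sherpa-onnx-offline", "whisper-cli", "whisper"]
PROVIDER_ORDER = {
    "audio": ["openai", "groq", "deepgram", "google"],
    "image": ["openai", "anthropic", "google", "minimax"],
    "video": ["google"],
}
# Assumed variable names; check your own provider setup.
KEY_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "groq": "GROQ_API_KEY",
    "deepgram": "DEEPGRAM_API_KEY",
    "google": "GEMINI_API_KEY",
    "minimax": "MINIMAX_API_KEY",
}


def auto_detect(capability, which=shutil.which, env=None):
    """Return ("cli", name) or ("provider", name) for the first hit, else None."""
    env = os.environ if env is None else env
    if capability == "audio":
        for cli in LOCAL_AUDIO_CLIS:
            # sherpa-onnx-offline also needs a model directory.
            if cli == "sherpa-onnx-offline" and not env.get("SHERPA_ONNX_MODEL_DIR"):
                continue
            if which(cli):
                return ("cli", cli)
    if which("gemini"):
        return ("cli", "gemini")
    for provider in PROVIDER_ORDER[capability]:
        if env.get(KEY_ENV[provider]):
            return ("provider", provider)
    return None
```

The which and env parameters exist only so the lookup can be exercised without touching the real PATH or environment.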

Capabilities (optional)

If you set capabilities, the entry only runs for those media types. For shared lists, Fased can infer defaults:

openai, anthropic, minimax: image
google (Gemini API): image + audio + video
groq: audio
deepgram: audio

For CLI entries, set capabilities explicitly to avoid surprising matches. If you omit capabilities, the entry is eligible for the list it appears in.
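
One way to read these rules is sketched below. The function name and dict shape are hypothetical. Treating a shared-list entry with no explicit capabilities and no provider default as eligible for all three media types is this sketch's interpretation of "eligible for the list it appears in", not confirmed behavior.

```python
ALL_CAPABILITIES = ["image", "audio", "video"]

# Provider defaults listed above.
INFERRED = {
    "openai": ["image"],
    "anthropic": ["image"],
    "minimax": ["image"],
    "google": ["image", "audio", "video"],
    "groq": ["audio"],
    "deepgram": ["audio"],
}


def effective_capabilities(entry, list_capability=None):
    """Resolve the media types an entry runs for.

    list_capability is the per-capability list the entry sits in,
    or None when it comes from the shared tools.media.models list.
    """
    if entry.get("capabilities"):
        return list(entry["capabilities"])  # explicit setting wins
    if list_capability is not None:
        return [list_capability]  # eligible for the list it appears in
    provider = entry.get("provider")
    if provider in INFERRED:
        return list(INFERRED[provider])
    return list(ALL_CAPABILITIES)  # assumption: e.g. CLI entries in the shared list
```

The last branch shows why explicit capabilities are recommended for CLI entries: without them, a shared CLI entry could match media types it was never meant to handle.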

Provider support (Fased integrations)

Image: OpenAI, Anthropic, Google, and others through the Fased provider registry. Any image-capable model in the registry works.
Audio: OpenAI, Groq, Deepgram, Google, and Mistral for provider transcription, including Whisper, Deepgram, Gemini, and Voxtral paths.
Video: Google Gemini API for provider video understanding.

Recommended providers

Image

Prefer your active model if it supports images.
Good defaults: openai/gpt-5.5, anthropic/claude-opus-4.7, google/gemini-3.1-pro-preview.

Audio

openai/gpt-4o-mini-transcribe, groq/whisper-large-v3-turbo, deepgram/nova-3, or mistral/voxtral-mini-latest.
CLI fallback: whisper-cli (whisper-cpp) or whisper.
Deepgram is configured through tools.media.audio as a media transcription provider, not as an Agent model provider.

Video

google/gemini-3-flash-preview (fast), google/gemini-3.1-pro-preview (richer).
CLI fallback: gemini CLI (supports read_file on video/audio).

Attachment policy

Per‑capability attachments controls which attachments are processed:

mode: first (default) or all
maxAttachments: cap the number processed (default 1)
prefer: first, last, path, url

When mode: "all", outputs are labeled [Image 1/2], [Audio 2/2], etc.
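
A small sketch of the policy follows. The exact semantics of prefer are not spelled out on this page, so the interpretation here is an assumption: first/last choose by arrival order, while path/url move attachments that have a local path or a URL to the front. The function names and the dict shape of an attachment are hypothetical.

```python
def select_attachments(attachments, mode="first", max_attachments=1, prefer="first"):
    """Pick the attachments to process for one capability."""
    ordered = list(attachments)
    if prefer == "last":
        ordered.reverse()
    elif prefer in ("path", "url"):
        # Stable sort: attachments that have the preferred field come first.
        ordered.sort(key=lambda a: 0 if a.get(prefer) else 1)
    if mode == "first":
        return ordered[:1]
    return ordered[:max_attachments]


def block_label(kind, index, total):
    """Label an output block, e.g. [Image] or [Audio 2/2]."""
    return f"[{kind}]" if total == 1 else f"[{kind} {index}/{total}]"
```

Note that mode: "all" still honors maxAttachments, so raising the cap is what actually allows more than one attachment through.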

Config examples

1) Shared models list + overrides

{
  tools: {
    media: {
      models: [
        { provider: "openai", model: "gpt-5.5", capabilities: ["image"] },
        {
          provider: "google",
          model: "gemini-3-flash-preview",
          capabilities: ["image", "audio", "video"],
        },
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m",
            "gemini-3-flash",
            "--allowed-tools",
            "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
          ],
          capabilities: ["image", "video"],
        },
      ],
      audio: {
        attachments: { mode: "all", maxAttachments: 2 },
      },
      video: {
        maxChars: 500,
      },
    },
  },
}

2) Audio + Video only (image off)

{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
          },
        ],
      },
      video: {
        enabled: true,
        maxChars: 500,
        models: [
          { provider: "google", model: "gemini-3-flash-preview" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
            ],
          },
        ],
      },
    },
  },
}

3) Optional image understanding

{
  tools: {
    media: {
      image: {
        enabled: true,
        maxBytes: 10485760,
        maxChars: 500,
        models: [
          { provider: "openai", model: "gpt-5.5" },
          { provider: "anthropic", model: "claude-opus-4.7" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
            ],
          },
        ],
      },
    },
  },
}

4) Multi-modal single entry (explicit capabilities)

{
  tools: {
    media: {
      image: {
        models: [
          {
            provider: "google",
            model: "gemini-3.1-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
      audio: {
        models: [
          {
            provider: "google",
            model: "gemini-3.1-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
      video: {
        models: [
          {
            provider: "google",
            model: "gemini-3.1-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
    },
  },
}

Status output

When media understanding runs, /status includes a short summary line:

📎 Media: image ok (openai/gpt-5.5) · audio skipped (maxBytes)

This shows per‑capability outcomes and the chosen provider/model when applicable.
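
For illustration, a line in this shape could be assembled as below. The function name and the tuple layout of outcomes are hypothetical and only reproduce the format of the sample line.

```python
def status_line(outcomes):
    """Build the summary from (capability, status, detail) tuples; detail may be None."""
    parts = []
    for capability, status, detail in outcomes:
        text = f"{capability} {status}"
        if detail:
            text += f" ({detail})"  # provider/model on success, reason on skip
        parts.append(text)
    return "📎 Media: " + " · ".join(parts)
```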

Notes

Understanding is best‑effort. Errors do not block replies.
Attachments are still passed to models even when understanding is disabled.
Use scope to limit where understanding runs (e.g. only DMs).
