April 2026

Generate soundtracks for video with an AI agent

A 30-second ad, end to end: an agent reads its own storyboard, plans the score, fans out music, voice, and SFX in parallel, then mixes to streaming loudness with ffmpeg. With concrete durations, prompts, LUFS targets, and per-call costs.

The 30-second ad scenario

A DTC coffee brand wants a 30-second ad for Instagram Reels. The storyboard the agent already produced has six cuts: 0–4s product hero, 4–10s morning routine A-roll, 10–16s pour shot, 16–22s testimonial line, 22–28s second testimonial line, 28–30s logo. The brief calls for warm acoustic indie at 96 BPM, two voiceover lines totalling roughly 240 characters, an espresso pull SFX at 10s and a soft swell at 28s. The agent has 45 seconds and about fifteen cents to deliver a finished MP4 with mixed audio. Here is how it pulls that off without a human in the loop.

Step 1: read duration and cut points

The agent already has the storyboard JSON it generated earlier in the conversation, so reading duration is a memory lookup, not a tool call. If the video came from outside, a single ffprobe -v error -show_entries format=duration shell call returns the duration; cut points come from the storyboard. The agent persists this as a small structured object: total duration 30.0s, cuts at [0, 4, 10, 16, 22, 28], voice slots at [16–19] and [19–22], SFX hits at [10.0, 28.0]. That object becomes the source of truth for every later step.
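
The persisted object might look like the sketch below; the field names are illustrative rather than a fixed schema, and only the values come from the storyboard:

{
  "duration_seconds": 30.0,
  "cuts": [0, 4, 10, 16, 22, 28],
  "voice_slots": [
    { "id": "vo1", "start": 16.0, "end": 19.0 },
    { "id": "vo2", "start": 19.0, "end": 22.0 }
  ],
  "sfx_hits": [
    { "id": "pour", "at": 10.0 },
    { "id": "swell", "at": 28.0 }
  ]
}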

Step 2: plan the score before any tool call

Planning is a pure reasoning step. The agent fixes tone (warm, optimistic), genre (acoustic indie), tempo (96 BPM), instrument palette (fingerpicked acoustic guitar, brushed kick, soft Rhodes, light shaker), and an energy curve (gentle intro, lift at 10s with the pour, pull back under voice from 16s, swell to peak at 28s). This plan turns into a single Stable Audio 2 prompt and two ElevenLabs requests. Doing the plan once means every downstream call gets a consistent musical identity instead of three contradictory improvisations.
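
Persisted alongside the cut sheet, the plan can be as small as this sketch (shape illustrative; the voice ID is the one locked in before any call goes out):

{
  "tone": "warm, optimistic",
  "genre": "acoustic indie",
  "bpm": 96,
  "palette": ["fingerpicked acoustic guitar", "brushed kick", "soft Rhodes", "light shaker"],
  "energy_curve": [
    { "at": 0, "shape": "gentle intro" },
    { "at": 10, "shape": "lift with the pour" },
    { "at": 16, "shape": "pull back under voice" },
    { "at": 28, "shape": "swell to peak" }
  ],
  "voice_id": "21m00Tcm4TlvDq8ikWAM"
}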

Step 3: fan out music, voice, and SFX in parallel

All four generation calls fire concurrently. Stable Audio 2 handles the music bed; it tops out at 32 seconds in a single call, so the agent asks for 32s and trims down later (the original plan of "video length plus 4s" gets capped at the model's 32s ceiling). ElevenLabs handles voice, with the same voice_id reused for both lines so the testimonial sounds like one person, not two strangers. SFX comes from a second Stable Audio 2 call with a 2-second target.

The music request looks like this on the wire:

{
  "tool": "generate_audio",
  "input": {
    "model": "stable-audio-2",
    "prompt": "Warm acoustic indie, 96 BPM, fingerpicked acoustic guitar, brushed kick, soft Rhodes, light shaker, gentle intro lifting at 10 seconds, pulling back under voice from 16 seconds, swelling to a peak at 28 seconds, no vocals, clean tail",
    "duration_seconds": 32,
    "format": "wav",
    "sample_rate": 48000
  }
}

Stable Audio 2 prefers prose over comma-salad: full sentences with tempo and instruments named explicitly land more reliably than tag-style prompts. The two ElevenLabs calls reuse voice_id: "21m00Tcm4TlvDq8ikWAM" with stability: 0.45, similarity_boost: 0.85, and style: 0.15. Locking the voice ID in the planning object before any call goes out is the single biggest lever for consistency; if the agent re-derives it per request, two lines drift into two voices roughly 20% of the time.
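
For comparison, one of the two voice calls might look like the following. This is a sketch reusing the generate_audio envelope from the music request; the model name and flat parameter layout are assumptions, while voice_id and the three voice settings are the locked values above:

{
  "tool": "generate_audio",
  "input": {
    "model": "elevenlabs-tts",
    "text": "First testimonial line, roughly 120 characters.",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "stability": 0.45,
    "similarity_boost": 0.85,
    "style": 0.15,
    "format": "wav",
    "sample_rate": 48000
  }
}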

The SFX call is short: "Espresso machine pull, ceramic clink, close mic, dry" at 2 seconds for the pour, and a separate call for "Soft cinematic riser, warm, 1.5 seconds, no impact" for the logo swell. Each SFX clip is tiny in cost and lets the mix breathe at exactly the right cut.
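
On the wire, the espresso pull reuses the exact shape of the music request, just with a shorter duration:

{
  "tool": "generate_audio",
  "input": {
    "model": "stable-audio-2",
    "prompt": "Espresso machine pull, ceramic clink, close mic, dry",
    "duration_seconds": 2,
    "format": "wav",
    "sample_rate": 48000
  }
}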

Step 4: align voice and SFX to the cut sheet

Once all four assets resolve, the agent has four files plus the video. It checks the actual durations returned by get_generation, then computes offsets: VO1 starts at 16.0s, VO2 starts at 19.2s (after a 200ms gap), SFX1 plays at 10.0s, SFX2 plays at 28.5s. If a voiceover line came back longer than its slot, the agent re-requests that single line at a higher speaking_rate rather than retiming the whole video. Music stays untouched: the agent will fade and trim it in ffmpeg, not by regenerating.
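
Collected into a placement sheet (field names illustrative), with gains matching the mix below:

{
  "music": { "file": "music.wav", "trim_to": 30.0, "fade_out_start": 27.0, "gain": 0.85 },
  "voice": [
    { "file": "vo1.wav", "start": 16.0 },
    { "file": "vo2.wav", "start": 19.2 }
  ],
  "sfx": [
    { "file": "sfx_pour.wav", "start": 10.0, "gain": 0.9 },
    { "file": "sfx_swell.wav", "start": 28.5, "gain": 0.7 }
  ]
}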

Step 5: mix to -14 LUFS with sidechain ducking

Streaming platforms target around -14 LUFS integrated; Reels and TikTok normalise into roughly that band, YouTube sits at -14, and Spotify defaults to -14 with -19 on its Quiet setting. The agent normalises the final mix to I=-14, TP=-1.5, LRA=11, with sidechain compression on the music keyed to the voice bus so the bed ducks roughly 6 dB when either voiceover line is present. One ffmpeg invocation does the lot:

ffmpeg -y \
  -i video.mp4 \
  -i music.wav \
  -i vo1.wav \
  -i vo2.wav \
  -i sfx_pour.wav \
  -i sfx_swell.wav \
  -filter_complex "
    [1:a]atrim=0:30,afade=t=out:st=27:d=3,volume=0.85[bed];
    [2:a]adelay=16000|16000[v1];
    [3:a]adelay=19200|19200[v2];
    [4:a]adelay=10000|10000,volume=0.9[s1];
    [5:a]adelay=28500|28500,volume=0.7[s2];
    [v1][v2]amix=inputs=2:normalize=0[voice];
    [voice]asplit=2[voice_key][voice_mix];
    [bed][voice_key]sidechaincompress=threshold=0.05:ratio=8:attack=20:release=250[ducked];
    [ducked][voice_mix][s1][s2]amix=inputs=4:normalize=0,
      loudnorm=I=-14:TP=-1.5:LRA=11:print_format=summary[mix]
  " \
  -map 0:v -map "[mix]" \
  -c:v copy -c:a aac -b:a 192k -ar 48000 \
  out.mp4

The interesting parts: adelay places each clip on the timeline in milliseconds; the voice bus is summed once and then split with asplit, so the same signal keys the compressor and still lands in the final mix; loudnorm in single-pass mode is good enough for a 30-second asset and saves the round trip of a measurement pass. For longer pieces or anything going to broadcast, run it twice and feed the measured values back in.
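
For that two-pass route, the measurement pass is the same loudnorm filter with print_format=json; it prints a block like the one below (numbers illustrative), and the second pass takes those values back as measured_I, measured_TP, measured_LRA, measured_thresh, and offset:

{
  "input_i": "-17.83",
  "input_tp": "-3.21",
  "input_lra": "9.40",
  "input_thresh": "-28.10",
  "output_i": "-14.02",
  "output_tp": "-1.50",
  "output_lra": "8.70",
  "output_thresh": "-24.30",
  "normalization_type": "dynamic",
  "target_offset": "0.12"
}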

Length matching and graceful tails

Music models do not stop on a beat at an arbitrary second. The reliable trick is to generate longer than you need, fade the tail, and trim. For this 30-second spot the agent asks Stable Audio 2 for its full 32-second ceiling, fades the last 3 seconds with afade=t=out:st=27:d=3, and trims to 30s in the same filter. That fade also covers the soft swell SFX at 28.5s, so the two endings land together rather than fighting. For pieces longer than 32 seconds, switch the music model to Suno, which produces section-aware audio with vocals and runs well past a minute.

Cost math for one finished spot

Stable Audio 2 runs roughly $0.02 per 30-second instrumental clip, so the music bed plus the longer SFX call come to about $0.04; the 1.5-second riser is short enough to round to zero. ElevenLabs charges roughly $0.30 per 1,000 characters; two voiceover lines at ~120 characters each come to about $0.07. Total audio for a 30-second ad lands between $0.10 and $0.15, dwarfed by the video generation that preceded it. See pricing for the per-call breakdown across models.

Voice consistency tactics

Three tactics keep a campaign sounding like one brand. First, lock voice_id in a brand config the agent reads at the top of every job; never let the model invent one. Second, fix stability and similarity_boost at the same values across lines (0.45 and 0.85 work well for warm read-style spots). Third, when latency matters, as with real-time avatars or live overlays, swap ElevenLabs for Cartesia using the same character mapping, and accept slightly less expressive output in return for sub-300ms first-token latency.
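
A minimal brand config along those lines, with entirely hypothetical field names, might be:

{
  "brand": "dtc-coffee",
  "voice": {
    "character": "testimonial",
    "elevenlabs_voice_id": "21m00Tcm4TlvDq8ikWAM",
    "cartesia_voice_id": "<mapped per character>",
    "stability": 0.45,
    "similarity_boost": 0.85,
    "style": 0.15
  }
}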

Where AI scoring still loses

Three jobs are not yet a fair fight for an agent. First, hit-point music for tight cuts: a trailer with 14 sub-second beat hits needs a composer who can write a downbeat to a frame, not a model that gets close on average. Second, brand-licensed jingles: if the brief is "the McDonald's five notes plus four bars of variation", you need cleared rights to a known motif, which neither Stable Audio 2 nor Suno can produce. Third, long-form film score with leitmotifs: a 90-minute feature where the antagonist's theme returns in three different keys across reels still needs a human composer holding the whole structure in their head. Pick AI scoring for ads, social, explainer videos, internal content, and rapid prototyping; pick a human for any of the three above.

Wire it up

AgentFramer exposes generate_audio, generate_video, and get_generation on the same MCP surface, so the entire pipeline above runs from one connection in Claude Code, Cursor, or any MCP host. The agent plans once, fans out four calls, polls for completion, and shells out to ffmpeg, all without leaving the conversation. Read the tool-call patterns guide for the full worked example, including retry and polling behaviour for long-running audio jobs.
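
Polling, sketched in the same envelope as the calls above (the job ID is a placeholder and the documented request shape may differ):

{
  "tool": "get_generation",
  "input": { "generation_id": "gen_abc123" }
}

The returned status and actual clip durations are what step 4 checks before computing offsets.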