
April 2026

AI music generation for agents: tools and patterns

Prompt formulas that work, voice IDs that stay consistent, loudness targets that pass platform checks, and the pipeline that finishes a 30-second ad in under a minute.

The silent-ad problem

An agent that ships a video without audio ships nothing. TikTok's feed algorithm deprioritizes silent uploads, YouTube buries them, and Instagram caps their reach. Audio is half the deliverable, but most agent demos still treat it as an afterthought. The agents that ship good ads run music, voice, and SFX in parallel with video and stitch the whole thing together in one pass. Here is how to build one.

The four audio models worth wiring up

AgentFramer's generate_audio tool covers four production models. Each is good at one thing and mediocre at the others, so the agent picks based on the job (a routing sketch follows the list):

  • Stable Audio 2. Instrumental beds and loops up to 190 seconds. Prose-style prompts. Around $0.02 per 30 seconds. Commercial-use license on outputs.
  • Suno. Full songs with vocals, structured sections, lyrical control. Slower, more expensive, but the only option if the agent needs a hook with words.
  • ElevenLabs. Voiceover and characters. Roughly $0.30 per 1,000 characters on the standard tier. Voice IDs are stable across calls, which is the whole game for multi-shot ads.
  • Cartesia. Low-latency voice. Use it when the agent is generating dialogue inside a longer pipeline and total wall time matters more than the last 5% of quality.
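
One way to make the choice mechanical is a small dispatch function. A minimal sketch; the job fields (isDialogue, latencySensitive, needsVocals, isHook) are our own invention, not part of the generate_audio surface:

// Hypothetical routing helper: job descriptor in, model name out
function pickAudioModel(job) {
  if (job.isDialogue) {
    // Voice work: Cartesia when wall time dominates, ElevenLabs otherwise
    return job.latencySensitive ? "cartesia" : "elevenlabs";
  }
  // Music: Suno only when the track needs words or a finished-song feel
  if (job.needsVocals || job.isHook) return "suno";
  // Everything else: beds, loops, and SFX
  return "stable-audio-2";
}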

Prompt formula: Stable Audio 2

Stable Audio 2 is instrumental-only and reads prose. The reliable formula is tempo + key + instruments + mood + production notes + duration. Skip section markers; it ignores them. Keep prompts under about 200 tokens.

{
  "model": "stable-audio-2",
  "prompt": "120 BPM, A minor, warm analog synth bass, soft Rhodes piano, brushed snare, vinyl crackle, melancholic but hopeful, lo-fi hip-hop production, sidechain compression, mixed for streaming",
  "duration_seconds": 32,
  "sample_rate": 44100,
  "format": "wav"
}

Note the duration of 32 seconds, not 30. Generating two seconds of headroom lets the agent fade the tail without cutting a phrase. Ask for 44.1 kHz WAV; downsample to MP3 only at the final mux step. Resampling and re-encoding twice in a pipeline is how MP3 artifacts sneak into a finished spot.
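
The fade-and-trim is one ffmpeg call. A sketch, assuming the 32-second bed landed in bed.wav: afade fades out the last two seconds and -t trims the output to 30:

ffmpeg -i bed.wav \
  -af "afade=t=out:st=28:d=2" \
  -t 30 \
  bed_30s.wav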

What does not work in Stable Audio prompts: section markers, lyrics, artist names, song titles. The model treats them as noise and your generation comes back generic. Stick to musical vocabulary, name two or three lead instruments, and pick one mood word. Five-word prompts produce mush; 25 to 40 words is the sweet spot.
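
If the agent assembles prompts programmatically, the formula is worth encoding as a template rather than free text. A sketch; buildBedPrompt is a hypothetical helper, not part of the tool surface:

// Hypothetical: tempo + key + instruments + mood + production notes
function buildBedPrompt({ bpm, key, instruments, mood, production }) {
  return [
    `${bpm} BPM`,
    key,
    ...instruments.slice(0, 3), // name two or three lead instruments, no more
    mood,                       // one mood word
    ...production,
  ].join(", ");
}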

Prompt formula: Suno

Suno accepts structure. Use bracketed section tags ([Verse], [Chorus], [Bridge], [Outro]) and write actual lyrics or instrumental descriptions inside each. The model honors them. Keep total runtime under 4 minutes per call to stay inside the standard generation window.

{
  "model": "suno",
  "style": "indie pop, female lead vocal, jangly guitar, 110 BPM, summery",
  "lyrics": "[Verse]\nWoke up to the city humming low\nCoffee on the counter, nowhere to go\n[Chorus]\nWe're chasing yellow lights again\nCounting every almost like a win\n[Outro]\n(soft fade, vocal ad-libs)",
  "instrumental": false,
  "duration_seconds": 60
}

For instrumental Suno, set instrumental: true and skip the lyrics field. Suno's instrumentals tend to be more melodic and more "produced" than Stable Audio's, which makes it the right call for hooks and stings even without vocals. The tradeoff: a Suno generation runs roughly 8 to 12 times the cost of a comparable Stable Audio render. Use Stable Audio for beds and looping background music, Suno for hooks, jingles, and any track that needs to sound like a finished song rather than a loop.

Voice IDs: pick once, remember forever

The number-one mistake agents make with ElevenLabs is picking a fresh voice on every call. The output drifts from line to line. Pick a voice ID once per character, store it in the agent's working memory, and pass the same ID for every line that character speaks. For a multi-character ad, give the agent a cast sheet at the start of the run:

{
  "cast": {
    "narrator":  { "voice_id": "EXAVITQu4vr4xnSDxMaL", "stability": 0.45, "similarity_boost": 0.75 },
    "alex":      { "voice_id": "21m00Tcm4TlvDq8ikWAM", "stability": 0.5,  "similarity_boost": 0.8 },
    "manager":   { "voice_id": "AZnzlk1XvdvUeBnXmlld", "stability": 0.55, "similarity_boost": 0.7 }
  },
  "script": [
    { "speaker": "narrator", "line": "Most teams wait three weeks for a render." },
    { "speaker": "alex",     "line": "I shipped mine before lunch." },
    { "speaker": "manager",  "line": "How?" }
  ]
}
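
Generating the script is then a lookup per line, never a fresh voice pick. A sketch, assuming generate_audio accepts the stability and similarity_boost settings alongside the text (the exact parameter names may differ):

// Every line reuses the character's stored voice ID and settings,
// so timbre stays locked across the whole ad
const takes = await Promise.all(
  script.map(({ speaker, line }) => {
    const voice = cast[speaker];
    return generate_audio({
      model: "elevenlabs",
      voice_id: voice.voice_id,
      stability: voice.stability,
      similarity_boost: voice.similarity_boost,
      text: line,
    });
  })
);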

Stability between 0.4 and 0.5 keeps emotion in the read. Push above 0.7 and lines flatten. For ad reads, keep similarity_boost at 0.75 or higher so the timbre stays locked between takes. If a take comes out too hot where the script needs calm, drop stability to 0.35 and regenerate; do not rewrite the line.

Sound effects and seed control

Short SFX (transitions, UI dings, whooshes) generate fast and cheap through Stable Audio with 2 to 4 second durations. Prompt them concretely: "short metallic whoosh, 1.5 seconds, sharp transient, reverb tail" beats "cool sound effect" every time. For repeated stings inside one ad (a logo sting that plays at 0:05 and 0:25), pass the same seed value across both calls so the two hits are identical. Different seeds will produce two adjacent but noticeably different sounds, and viewers will hear the mismatch even if they cannot name it.
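
A sketch of the seeded call, assuming generate_audio exposes a seed parameter the way most diffusion endpoints do; the seed value itself is arbitrary:

{
  "model": "stable-audio-2",
  "prompt": "short metallic whoosh, 1.5 seconds, sharp transient, reverb tail",
  "duration_seconds": 2,
  "seed": 884213
}

Pass the same seed in the call for the 0:25 sting and the two hits come back identical.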

Loudness: -14 LUFS or your ad gets ducked

Spotify, YouTube, TikTok, and Apple Music all normalize to roughly -14 LUFS integrated. Submit a track at -9 LUFS and the platform turns it down by 5 dB, killing the punch the agent worked for. Submit at -20 and it stays quiet under the dialogue. Pick the target by destination:

  • Streaming and social: -14 LUFS integrated, -1 dBTP true peak.
  • Podcasts: -16 LUFS stereo, -19 LUFS mono, -1 dBTP.
  • Broadcast (EBU R128): -23 LUFS, -1 dBTP.

Run the agent's final mix through ffmpeg's two-pass loudnorm filter. One pass measures, the second pass corrects. The shortcut single-pass version is good enough for 90% of ad work:

ffmpeg -i mix.wav \
  -af "loudnorm=I=-14:TP=-1:LRA=11:print_format=summary" \
  -ar 48000 -ac 2 \
  out.wav

For a final muxed video, set I=-14 for social, I=-16 for podcasts. LRA (loudness range) of 11 keeps dynamics natural; drop it to 7 if the agent is mixing for noisy mobile playback.
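
For the other 10%, the full two-pass shape feeds the first pass's measurements back into the second. A sketch; the measured_* numbers below are placeholders for whatever pass one actually prints:

# Pass 1: measure only, discard output, read the JSON stats from stderr
ffmpeg -i mix.wav -af "loudnorm=I=-14:TP=-1:LRA=11:print_format=json" -f null -

# Pass 2: correct linearly using the values pass 1 reported
ffmpeg -i mix.wav \
  -af "loudnorm=I=-14:TP=-1:LRA=11:measured_I=-9.2:measured_TP=-0.3:measured_LRA=6.8:measured_thresh=-19.5:offset=0.1:linear=true" \
  -ar 48000 -ac 2 \
  out.wav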

The parallel-first pipeline

Audio finishes before video, every time. A 30-second Stable Audio render lands in 8 to 15 seconds. ElevenLabs returns a 100-character line in under 3. A 6-second video shot takes 40 to 90 seconds. The agent that fires both at once and waits on whichever finishes last ships in roughly the time of the slowest video shot. The agent that goes sequential pays for it twice.

// Fire everything at once
const jobs = await Promise.all([
  generate_video({ prompt: shotA, duration_seconds: 6 }),
  generate_video({ prompt: shotB, duration_seconds: 6 }),
  generate_audio({ model: "stable-audio-2", prompt: bedPrompt, duration_seconds: 14 }),
  generate_audio({ model: "elevenlabs", voice_id: cast.narrator.voice_id, text: line1 }),
  generate_audio({ model: "elevenlabs", voice_id: cast.alex.voice_id,     text: line2 }),
]);

// Wait on all generation IDs together; in production this is a polling
// loop per job (see tool call patterns), collapsed here to a single fetch
const assets = await Promise.all(jobs.map(j => get_generation(j.id)));

// Mux when the last one resolves

See tool call patterns for the full polling and retry shape. The pattern that breaks: an agent generating audio first, then using the audio length to drive video shot durations. That is sequential by definition and adds 30 to 60 seconds of wall time. Decide shot lengths up front from the script, generate to those lengths, and trim audio to match.
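
The mux itself is one more ffmpeg call once everything is local. A sketch, assuming the assets were written to video.mp4, bed_30s.wav, and vo.wav; the bed drops 6 dB so the voice sits on top, and normalize=0 (ffmpeg 4.4+) stops amix from rescaling the levels:

ffmpeg -i video.mp4 -i bed_30s.wav -i vo.wav \
  -filter_complex "[1:a]volume=-6dB[bed];[bed][2:a]amix=inputs=2:duration=first:normalize=0[mix]" \
  -map 0:v -map "[mix]" \
  -c:v copy -c:a aac \
  out.mp4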

Cost math for a 30-second ad

Three video shots at 6 seconds each, one music bed, three voice lines averaging 80 characters. Concrete numbers:

  • 3 video shots: variable, but roughly $0.30 to $1.20 total
  • 1 Stable Audio bed (32 s): about $0.02
  • 3 ElevenLabs lines (240 chars): about $0.07
  • Final loudnorm pass: free, runs locally

Audio is under 10 cents on a typical ad. The trap is regeneration. An agent left unsupervised will regenerate a voice line eight times chasing a perfect take, turning a 7-cent job into 56 cents and burning 30 seconds of wall time. Cap the agent at three takes per voice line and one bed regeneration. Have it call get_credits before any batch above five audio jobs. See the pricing page for current per-model rates. Licensing is uniform across the four models above: commercial use of generated outputs is permitted, with the standard caveat that the agent must not prompt with the names of living artists or copyrighted song titles.
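
Capping takes is a few lines of budget logic around the tool call. A sketch; isGoodTake stands in for whatever evaluator the agent already runs:

// Hypothetical guard: at most maxTakes generations per voice line
async function generateWithBudget(params, isGoodTake, maxTakes = 3) {
  let last;
  for (let take = 1; take <= maxTakes; take++) {
    last = await generate_audio(params);
    if (await isGoodTake(last)) return last;
  }
  return last; // budget spent: ship the last take rather than keep burning credits
}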

Where AI music falls down

Most agent audio works. Three categories still do not, and pretending otherwise is how teams burn a week chasing a generated track that will never ship.

  • Lyrical content with brand IP requirements. If legal needs to clear every line, generated lyrics are a non-starter. Suno will hand you a singable hook, but it will also happily echo phrases close to copyrighted material. For any spot where a brand or platform requires written, cleared lyrics, write them by hand and feed them to Suno as input. Do not let the model invent the words.
  • Signature artist sound. "Make it sound like Billie Eilish" is the request that breaks every pipeline. The models will get close enough to be uncanny and not close enough to be useful, and the legal exposure of shipping it is real. If the brief names a living artist, the agent should refuse and surface the constraint to a human. Use genre, era, and instrumentation instead.
  • Syncing music to dialogue beats. AI music is generated as a continuous bed. It does not know where the VO lands, where the punchline hits, or where the cut is. Hit-point scoring (the music swells exactly on the logo reveal) is still a human edit. The agent's job is to deliver a clean bed at the right tempo and key; the sync happens in the editor.

Build the agent so it knows its limits. Refuse artist soundalikes, accept hand-written lyrics, and hand off hit-point work to a human editor. The remaining 80% of audio jobs ship cleanly.

Ship it

Connect over MCP and the same surface gives the agent image, video, and audio with one auth. The quickstart takes about two minutes. Wire up the parallel pipeline above, run everything through loudnorm, lock voice IDs at run start, and the agent will ship audio that does not get ducked, does not drift between lines, and does not cost more than the video it scores.