April 2026
How AI agents generate video, end to end
A 60-second clip is roughly 70 tool calls (most of them cheap status polls), 10 minutes of wall time, and around $4 of compute. Here is the exact playbook an agent uses to plan, generate, poll, and stitch video through MCP, the cost math per model, and the cases where this whole approach falls over.
Why video is not just a longer image call
An image is one tool call returning one URL in 4 to 12 seconds. A video that lasts longer than 8 seconds is never one call. Sora 2 caps a single generation at around 10 seconds. Veo 3 caps at 8. Kling 2 and Runway Gen-4 cap at 10. So a 60-second hero ad becomes 6 to 10 separate clips, each generated independently, each with its own failure rate, each with a job ID the agent has to poll for 60 to 180 seconds before the mp4 URL appears.
On top of that, the audio track is its own pipeline (music from Stable Audio 2 or Suno, voiceover from ElevenLabs or Cartesia), and the final mp4 needs to be cut, joined, and mixed. Depending on clip length, the agent is orchestrating somewhere between a dozen and 85 tool calls in a directed graph, not a linear loop.
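To make the graph concrete, here is a minimal sketch in Python (the job names mirror the storyboard in step 1; the dependency sets are illustrative, not a fixed schema):

pipeline = {
    "shot_01": [],                                 # independent text-to-video
    "shot_02": [],                                 # independent text-to-video
    "shot_03": ["shot_01"],                        # i2v: needs shot_01's last frame
    "music": [],                                   # audio is its own pipeline
    "stitch": ["shot_01", "shot_02", "shot_03", "music"],
}

def ready(done):
    """Jobs whose dependencies are all complete and can be fired now."""
    return [job for job, deps in pipeline.items()
            if job not in done and all(d in done for d in deps)]

print(ready(set()))          # ['shot_01', 'shot_02', 'music'] fire in parallel
print(ready({"shot_01"}))    # shot_03 unblocks as soon as shot_01 lands

Anything returned by ready with more than one entry is a batch the agent can dispatch concurrently; that is the whole difference between a 10-minute render and a 30-minute one.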
Step 1: the storyboard JSON
Before any model spends a credit, the agent emits a storyboard. Treating this as structured JSON (rather than prose) is what makes the rest of the pipeline parallelizable. The style preamble at the top is reused verbatim in every shot prompt to prevent drift.
{
  "duration_s": 18,
  "fps": 24,
  "aspect": "16:9",
  "style_preamble": "cinematic, 35mm film grain, warm tungsten key light, shallow depth of field, anamorphic lens flare, color graded teal and amber",
  "seed": 884412,
  "shots": [
    {
      "id": "shot_01",
      "duration_s": 6,
      "model": "kling-2",
      "prompt": "A barista pours espresso into a white ceramic cup, slow push-in, steam curling, hands in focus",
      "negative": "text, watermark, distorted hands"
    },
    {
      "id": "shot_02",
      "duration_s": 6,
      "model": "kling-2",
      "prompt": "Close-up of crema swirling in the cup, slow motion, droplets visible on rim",
      "negative": "text, watermark"
    },
    {
      "id": "shot_03",
      "duration_s": 6,
      "model": "kling-i2v",
      "init_image": "shot_01.last_frame",
      "prompt": "The barista smiles and slides the cup across the counter toward camera",
      "negative": "extra fingers, blurry face"
    }
  ],
  "audio": {
    "music": { "model": "stable-audio-2", "prompt": "warm acoustic guitar, light percussion, 90 bpm, 18s" },
    "vo": null
  }
}
Step 2: parallel generate plus poll
Every generate_video call returns a job ID in under a second. The agent fires all independent shots at once, then enters a poll loop on get_generation. Recommended cadence: poll every 8 seconds for the first minute, then every 15 seconds. Kling 2 typically resolves in 60 to 90 seconds for a 6-second clip. Sora 2 takes 90 to 180. Veo 3 lands around 120. The agent should never block on the first job; it inspects the whole job set on each tick and queues reshoots immediately when one fails.
Reshoot rates in the wild are the single most useful number to calibrate against. On 6-second shots with clean prompts: Kling 2 fails roughly 1 in 10, Sora 2 around 1 in 14, Veo 3 around 1 in 12, Hailuo and Luma closer to 1 in 6. Failures are usually distorted hands, text artifacts in signage, or a face that drifts off-model. Build the reshoot budget into the storyboard from the start: for a 10-shot sequence, expect to issue at least one extra generate_video call. The agent should not retry the same prompt; bump the seed by one and add a sharper negative prompt on whatever the artifact was.
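A minimal version of that loop, as a sketch: generate_video and get_generation stand in for the MCP tool calls, and the state, url, and artifact field names are assumptions about the response shape, not a documented contract.

import time

POLL_FAST, POLL_SLOW = 8, 15          # seconds, per the cadence above

def run_shots(shots, generate_video, get_generation):
    """Fire all independent shots at once, inspect the whole job set on
    each tick, and queue a reshoot immediately when a job fails."""
    jobs = {s["id"]: generate_video(**s) for s in shots}   # job IDs, sub-second
    retried, done = set(), {}
    start = time.time()
    while len(done) < len(shots):
        for shot in shots:
            sid = shot["id"]
            if sid in done:
                continue
            status = get_generation(jobs[sid])
            if status["state"] == "succeeded":
                done[sid] = status["url"]
            elif status["state"] == "failed":
                if sid in retried:
                    raise RuntimeError(f"{sid} failed twice, flag for a human")
                retried.add(sid)
                shot["seed"] = shot.get("seed", 0) + 1     # never the same prompt:
                shot["negative"] += ", " + status.get("artifact", "artifacts")
                jobs[sid] = generate_video(**shot)         # reshoot immediately
        # poll every 8s for the first minute, then every 15s
        time.sleep(POLL_FAST if time.time() - start < 60 else POLL_SLOW)
    return done                                            # shot id -> mp4 URL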
Step 3: continuity tactics
This is where most agent video pipelines fall down. Three tactics cover 80 percent of cases:
- Seed pinning. Pass the same seed (e.g. 884412) to every shot in the same scene. On text-to-video models this barely helps; on image-to-video chains it matters a lot.
- Identity-locked hero frame. Generate the protagonist once with FLUX 1.1 pro at 1024x1024, optionally with an identity LoRA. Then feed that frame into Kling i2v or Runway Gen-4 image-to-video for every shot the character appears in. The face stays the same because it is literally the same pixels.
- Last-frame chaining. For continuous action, the last frame of shot N becomes the init_image of shot N+1. This avoids hard cuts on continuous motion and is the only way to get smooth 12 to 18 second sequences out of an 8-second-cap model.
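Here is what last-frame chaining looks like as a sketch. generate_video is collapsed to a blocking call for brevity, download is a hypothetical fetch-to-disk helper, and the ffmpeg idiom for grabbing a final frame is standard:

import subprocess

def last_frame(mp4_path, jpg_path):
    """Grab the final frame: seek to 1s before EOF, keep overwriting the
    same image, and the last frame written is the one that survives."""
    subprocess.run(
        ["ffmpeg", "-y", "-sseof", "-1", "-i", mp4_path,
         "-update", "1", "-q:v", "2", jpg_path],
        check=True,
    )
    return jpg_path

def chain(shots, generate_video, download, hero_frame=None):
    """Shot N+1 starts from shot N's last frame. hero_frame can be a FLUX
    identity-locked image to seed the first shot. Seed stays pinned."""
    init = hero_frame
    for shot in shots:
        url = generate_video(**shot, init_image=init, seed=884412)
        mp4 = download(url)
        init = last_frame(mp4, mp4 + ".last.jpg")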
Step 4: stitch with ffmpeg
Once every job has succeeded, the agent has N mp4 URLs and one or two audio URLs. Concatenation is two ffmpeg commands; stream copy avoids a re-encode, which works here because every shot comes back with the same codec, resolution, and frame rate. The agent runs these in a sandbox or as a tool call:
# inputs.txt
file 'shot_01.mp4'
file 'shot_02.mp4'
file 'shot_03.mp4'

# concat clips, copy streams, no re-encode
ffmpeg -f concat -safe 0 -i inputs.txt -c copy video_only.mp4

# overlay music, fade out last 1s, mix to AAC stereo
ffmpeg -i video_only.mp4 -i music.mp3 \
  -filter_complex "[1:a]afade=t=out:st=17:d=1[a]" \
  -map 0:v -map "[a]" -c:v copy -c:a aac -b:a 192k \
  -shortest final.mp4
Tool budget by clip length
Plan your context window and rate limits around these numbers. They assume one reshoot every 8 shots, which is realistic for Kling 2 and Veo 3 in 2026.
- 6 seconds, single shot: 1 generate_video + 6 to 10 polls + 1 audio + 1 stitch. About 9 to 13 tool calls. Wall time under 2 minutes.
- 15 seconds, three shots: 3 generate_video + 15 to 20 polls (parallel) + 1 audio + 3 polls + 1 stitch. About 23 to 28 tool calls. Wall time 3 to 4 minutes.
- 60 seconds, ten shots: 10 generate_video + 1 reshoot + 50 to 70 polls + 2 audio + 1 stitch. About 65 to 85 tool calls. Wall time 8 to 12 minutes.
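Those budgets collapse into a formula worth encoding. A rough helper (a sketch: polls are approximated at 5 to 7 per shot, i.e. resolve time divided by poll cadence, so the bands land near but not exactly on the numbers above):

def tool_budget(shots, polls_per_shot=(5, 7), reshoot_every=8, audio_calls=1):
    """Rough tool-call band: generates + reshoots + polls + audio + stitch."""
    reshoots = shots // reshoot_every
    lo = shots + reshoots + shots * polls_per_shot[0] + audio_calls + 1
    hi = shots + reshoots + shots * polls_per_shot[1] + audio_calls + 1
    return lo, hi

print(tool_budget(1))                    # (8, 10)  ~ the single-shot band
print(tool_budget(3))                    # (20, 26) ~ the three-shot band
print(tool_budget(10, audio_calls=2))    # (64, 84) ~ the ten-shot band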
If you want the exact polling backoff and reshoot heuristics, see the tool-call patterns guide.
Cost example: 60 seconds, three models
Pricing as of April 2026 on AgentFramer routed compute. A 60-second 1080p clip is 10 shots of 6 seconds. Numbers below are list price; in practice you also pay roughly $0.04 to $0.10 in audio and stitch calls.
- Kling 2 (10 x 6s): about $0.35 per shot, total around $3.50 for video. Decent motion, soft on faces. Best for B-roll and abstract sequences.
- Sora 2 (6 x 10s): about $1.20 per shot, total around $7.20. Stronger physics and continuity, slower (3 minutes per shot), bigger context to manage.
- Veo 3 (8 x 8s): about $0.80 per shot, total around $6.40. Strong native audio (saves the music call) and good prompt adherence.
The bigger lesson buried in those numbers: for the full 60 seconds, Sora 2 costs roughly 2x Kling 2 and runs about 2x slower, yet reshoot rates are similar. For anything that does not need Sora-tier physics (water, hair, glass, crowds), Kling 2 ships a finished shot in half the wall time at a third of the per-shot cost. Use Sora 2 for the one or two hero shots that actually require it, then drop back down to Kling 2 for inserts and B-roll. The same mixed-model pattern is documented in the MCP tools reference.
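As a sanity check on the mixed-model pattern, using the list prices above (the two-hero split is illustrative, not a recommendation for every script):

PRICE = {"sora-2": 1.20, "veo-3": 0.80, "kling-2": 0.35}   # per shot, Apr 2026

# Illustrative 60s storyboard: two 10s Sora 2 hero shots for the physics-
# heavy moments, seven 6s Kling 2 shots for inserts and B-roll (~62s total).
shots = ["sora-2"] * 2 + ["kling-2"] * 7
print(f"{sum(PRICE[m] for m in shots):.2f}")   # 4.85, between all-Kling 3.50 and all-Sora 7.20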
Model pick table
- Drafts and iteration: Hailuo MiniMax or Luma Ray 2. 6 seconds in 45 to 60 seconds at roughly $0.18 per shot. Fast enough to keep an agent loop tight.
- Hero clips: Sora 2 for narrative motion, Veo 3 when you also need synced ambient audio.
- Image-to-video: Kling i2v and Runway Gen-4 i2v. Pair with a FLUX 1.1 pro hero frame to lock identity across an entire sequence.
- Stylized loops and social: Kling 2 standard. Cheapest path to a polished 6-second cut.
The full list of supported video models is on the models page.
Where agent-generated video falls apart
Everything above works for product B-roll, social cutdowns, mood films, abstract loops, and most explainer-style sequences under 30 seconds. Past that, four problems show up consistently and none of them are solved by spending more money on tool calls.
The first is long-form narrative. A storyboard with 12 shots is fine. A storyboard with 40 shots that has to maintain a story arc, spatial geography, and consistent secondary characters is not something a current agent can plan reliably. Drift compounds: by shot 25 the wardrobe has changed, the kitchen has rotated 90 degrees, and the dog has become a different breed. The fix is human intervention every 6 to 8 shots, at which point you are using the agent as an assistant rather than a director.
The second is character continuity beyond a single scene. Identity LoRAs and image-to-video chains hold for one location and one outfit. The moment the character has to walk from a kitchen into a forest, the lighting changes, the i2v reference drifts, and seed pinning stops mattering. You can paper over this with strong style preambles and reference frames, but it is not a solved problem at the model layer in 2026.
The third is spoken dialogue with lip-sync. Veo 3 produces ambient diegetic sound well, but synced dialogue still looks uncanny on close-ups longer than 2 seconds. The honest pipeline is: generate silent video, generate the voiceover separately with ElevenLabs, then run a lip-sync model in post. That adds another tool, another failure mode, and another 30 to 60 seconds of latency per shot. For most product use cases the right answer is to write around it: cut to B-roll on every line, or use voiceover-over-visuals instead of on-camera speech.
The fourth is exact brand assets. If the script requires the actual product, the actual logo, the actual packaging, generative video will not get you there. It will get you a thing that looks like the product. For real product hero shots the pattern is inverted: shoot the product practically, then use generative video for the environment, the transitions, and the background plates around it. AgentFramer is happy to generate those plates; it cannot manufacture your logo.
Wire it up
The same MCP surface that exposes generate_image and generate_audio exposes generate_video and get_generation. The storyboard JSON, the polling loop, and the ffmpeg stitch above all run unchanged against any MCP-capable client. Set up the server in five minutes from the quickstart and the next 60-second clip your agent ships will cost you about $4 and roughly 70 tool calls, exactly as budgeted above.
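For completeness, a minimal client-side sketch using the official MCP Python SDK. The server launch command is a placeholder, the argument names follow the storyboard JSON above rather than any documented schema, and only generate_video and get_generation come from this article:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Placeholder launch command; substitute the server from the quickstart.
    server = StdioServerParameters(command="agentframer-mcp", args=[])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Fire one shot; argument names mirror the storyboard JSON above.
            job = await session.call_tool("generate_video", arguments={
                "model": "kling-2",
                "duration_s": 6,
                "prompt": "A barista pours espresso into a white ceramic cup",
                "negative": "text, watermark, distorted hands",
                "seed": 884412,
            })
            job_id = job.content[0].text          # response shape is assumed
            status = await session.call_tool("get_generation",
                                             arguments={"job_id": job_id})
            print(status.content)

asyncio.run(main())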