April 2026
AI agents that generate images: a production walkthrough
A marketing agent gets a Linear ticket at 09:14: ship four launch shots for the new headphones by lunch. Hero, lifestyle, detail, and a social square. Here is exactly what the agent does, what JSON moves over the wire, and where it almost falls over.
The scenario
The agent runs in Claude Code with the AgentFramer MCP server attached. The brief is in a Notion doc; the brand palette and product photography references are linked. Output goes to a Postgres table that powers the campaign CMS. Total budget for the run: about $0.18 in image credits and roughly 90 seconds of wall time if nothing retries. Realistic wall time with one retry: about 110 seconds. Nothing about that is heroic, but every step has a way to fail and the agent has to know which.
Step 1: pre-flight credits and model pick
Before a batch of any size, the agent calls get_credits. Four FLUX 1.1 pro renders at 1024x1024 cost roughly $0.16 in current pricing, plus a 10 percent buffer for retries. If the workspace has under $0.25 of headroom, the agent stops and asks. This single check kills the most embarrassing failure mode: a half-finished launch because credits ran out on shot three.
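The arithmetic behind that check is small enough to sketch. The constant and function names here (RENDER_COST_USD, preflight) are illustrative, not part of the AgentFramer API; get_credits supplies the live balance, and this is only the math the agent runs on it:

```typescript
// Illustrative pre-flight math -- names are assumptions, not the real client API.
const RENDER_COST_USD = 0.04;  // one FLUX 1.1 pro render at 1024x1024, current pricing
const RETRY_BUFFER = 0.10;     // 10 percent headroom for retries
const MIN_HEADROOM_USD = 0.25; // the hard floor mentioned above

function requiredHeadroom(renders: number): number {
  return Math.max(renders * RENDER_COST_USD * (1 + RETRY_BUFFER), MIN_HEADROOM_USD);
}

// "ask" means the agent stops and surfaces the shortfall to a human.
function preflight(balanceUsd: number, renders: number): "proceed" | "ask" {
  return balanceUsd >= requiredHeadroom(renders) ? "proceed" : "ask";
}
```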
The model pick is not a coin flip. For a launch hero with crisp product geometry, no on-image text, and natural light, FLUX 1.1 pro is the right default. The agent uses a small decision tree:
- Needs legible text in the image? Pick Ideogram v2 or Imagen 3. FLUX and SDXL still mangle anything longer than a single word.
- Photoreal product, no text? FLUX 1.1 pro at 1024x1024. 6-12 seconds, very strong composition.
- Draft, thumbnail, or moodboard? SDXL Lightning or FLUX schnell. Sub-3-second runs, fine for iteration before the agent commits to a final.
- Stylized illustration or poster? Playground v3 or HiDream. Better at non-photoreal aesthetics than FLUX.
- Editing an existing asset? Move to a model with image-to-image and inpainting. A pure text-to-image call is the wrong shape for that work.
The full lineup is on the models page. For this run the agent locks in FLUX 1.1 pro and a fixed seed strategy: one base seed for the hero, derived seeds for the other three shots so the lighting and palette stay close.
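The decision tree and the seed strategy together fit in a few lines. A sketch, with two loud caveats: model ids other than "flux-1.1-pro" are assumptions about the catalog's naming, and the +1-per-shot seed offset is one reasonable derivation scheme, not a spec:

```typescript
// Sketch of the model pick. Only "flux-1.1-pro" is confirmed; other ids are assumed.
type Brief = { needsText?: boolean; draft?: boolean; stylized?: boolean; editing?: boolean };

function pickModel(b: Brief): string {
  if (b.editing) return "image-to-image capable model"; // wrong shape for pure text-to-image
  if (b.needsText) return "ideogram-v2";                // legible on-image text
  if (b.draft) return "flux-schnell";                   // sub-3-second iteration
  if (b.stylized) return "playground-v3";               // non-photoreal aesthetics
  return "flux-1.1-pro";                                // photoreal, no text: this run's default
}

const BASE_SEED = 73104;
const SHOTS = ["hero", "lifestyle", "detail", "social"] as const;

// Hero keeps the base seed; the others derive from it so lighting and palette stay close.
const seedFor = (shot: (typeof SHOTS)[number]): number => BASE_SEED + SHOTS.indexOf(shot);
```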
Step 2: fire four jobs in parallel
The agent does not call generate_image in a for-loop. It fires all four tool calls in the same turn. AgentFramer queues them server-side and returns four job IDs with estimated completion times. End-to-end this saves about 18-25 seconds compared to serial calls.
The request shape for one of the four shots looks like this:
{
  "tool": "generate_image",
  "arguments": {
    "model": "flux-1.1-pro",
    "prompt": "studio product shot of matte black over-ear headphones on a brushed concrete plinth, soft top light from camera left, shallow depth of field, no text, no logo overlays, color palette: charcoal, warm grey, single amber accent",
    "negative_prompt": "watermark, text, logo, hands, person, low quality, deformed",
    "width": 1024,
    "height": 1024,
    "seed": 73104,
    "guidance": 3.5,
    "steps": 28,
    "metadata": {
      "campaign": "aurora-launch",
      "shot": "hero",
      "ticket": "MKT-2241"
    }
  }
}

The response comes back fast, well under a second, because no image has been rendered yet:
{
  "id": "gen_01HWX2K8Y4N5R7P9Q1B3T6V8M0",
  "status": "queued",
  "model": "flux-1.1-pro",
  "estimated_seconds": 9,
  "created_at": "2026-04-30T09:17:42Z"
}

Four IDs go into a small in-memory map keyed by shot name. The agent now has work to do while it waits.

Worth pausing on a few request fields. The seed is the single most useful knob the agent owns: pin it across reruns and the same prompt collapses to the same image. The guidance value of 3.5 is the sweet spot for FLUX 1.1 pro on photoreal subjects; push it above 5 and you get plastic skin and over-saturated highlights, drop it below 2.5 and the prompt loses grip on the composition. The steps count of 28 is a reasonable middle: 20 is noticeably softer, 40 buys diminishing detail at 30 percent more wall time. The metadata field is not for the model; it is for the agent's own audit trail. Three months later, when someone asks "where did this hero shot come from," the answer is a single SQL query.
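The same-turn fan-out can be sketched as a Promise.all over the four calls. `mcp.call` is stubbed here so the shape runs standalone, and the argument object is trimmed to the field that varies per shot; the real handle comes from the MCP connection:

```typescript
// Stubbed MCP client -- the real one returns the queued-job response shown above.
type Queued = { id: string; status: "queued"; estimated_seconds: number };

const mcp = {
  async call(_tool: string, args: { metadata: { shot: string } }): Promise<Queued> {
    return { id: `gen_${args.metadata.shot}`, status: "queued", estimated_seconds: 9 };
  },
};

const shots = ["hero", "lifestyle", "detail", "social"];

async function fireBatch(): Promise<Map<string, string>> {
  // All four calls go out in the same turn; the server queues them.
  const queued = await Promise.all(
    shots.map((shot) => mcp.call("generate_image", { metadata: { shot } }))
  );
  // The small in-memory map keyed by shot name from the walkthrough.
  return new Map(shots.map((shot, i) => [shot, queued[i].id]));
}
```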
Step 3: poll with backoff, not a tight loop
Polling is where naive agents burn tool-call budget. The pattern that works: wait the estimated time first, then check, then back off by 1.5x on each subsequent miss, capped at 8 seconds. For a typical FLUX 1.1 pro batch this resolves in 2-3 polls per job.
// Poll-with-backoff loop. `mcp.call` is the MCP client handle; each job's
// nextDelayMs starts at the estimated_seconds from its queue response.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

type Pending = { id: string; shot: string; nextDelayMs: number };

async function settle(jobs: Pending[]) {
  const done: Record<string, { url: string; seed: number }> = {};
  const failed: Record<string, string> = {};
  let pending = [...jobs];
  while (pending.length > 0) {
    // Sleep to the soonest per-job deadline, then sweep everything still pending.
    await sleep(Math.min(...pending.map((j) => j.nextDelayMs)));
    const next: Pending[] = [];
    for (const job of pending) {
      const r = await mcp.call("get_generation", { id: job.id });
      if (r.status === "succeeded") {
        done[job.shot] = { url: r.output_url, seed: r.seed };
      } else if (r.status === "failed") {
        failed[job.shot] = r.error?.code ?? "unknown";
      } else {
        // Still running: back off by 1.5x, capped at 8 seconds.
        next.push({ ...job, nextDelayMs: Math.min(job.nextDelayMs * 1.5, 8000) });
      }
    }
    pending = next;
  }
  return { done, failed };
}

Step 4: the NSFW filter rejection (it will happen)
Across roughly 50 production runs of this size, expect 1-3 of the 200 individual generations to come back with status: "failed" and an error code in the family of safety_filter or policy_block. This is not because the prompt was lewd. It is because text-to-image safety classifiers are jumpy about specific tokens (skin, body, anatomy, certain brand names, certain place names) even in obviously commercial contexts.
On this run, the lifestyle shot fails. The original prompt referenced "a woman wearing the headphones, eyes closed, lit from behind." The recovery is mechanical: the agent rewrites the prompt with softer phrasing and resubmits with the same seed so the composition stays close. "A person wearing the headphones, calm expression, backlit golden hour, head and shoulders only" passes on the second try. Total cost: one extra render, about 9 seconds of wall time, $0.04.
If a second retry fails, the agent does not loop a third time. It logs the prompt, switches the failed shot to a different model (often Imagen 3, which has a different safety profile), and tries once more. If that still fails, it surfaces the shot to a human. Three strikes, then escalate. See the tool call patterns guide for the full retry ladder.
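The ladder reduces to a pure decision function. The step labels below are descriptive, not API values; the actual prompt rewrite and model swap happen between calls:

```typescript
// Three-strikes ladder: soften prompt, swap model, then stop and escalate.
type RetryAction = "retry-softened-prompt" | "retry-different-model" | "escalate-to-human";

function nextAction(failuresSoFar: number): RetryAction {
  if (failuresSoFar <= 1) return "retry-softened-prompt";  // same seed, softer phrasing
  if (failuresSoFar === 2) return "retry-different-model"; // e.g. a model with a different safety profile
  return "escalate-to-human";                              // never loop a fourth time
}
```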
Step 5: write URLs to the database
Output URLs from get_generation are stable, but they are AgentFramer-hosted. For a production CMS the agent should use them as-is (fine for short-lived campaigns), mirror the files into the team's own object storage (S3, R2, Blob), or do both. This run does both: the agent writes the AgentFramer URL, the seed, the model, and the prompt to the campaign_assets table, then enqueues a background job to mirror the file to R2. The seed and prompt fields are non-negotiable. Without them, regenerating a near-identical variant in three weeks is impossible.
One last detail people skip: log the cost. Every get_generation response includes a billed-credit value. Sum it per ticket and per campaign. The first time finance asks "what did the AI cost us last quarter," you want to answer with a query, not a guess. Keeping it in the same row as the asset means the cost story and the asset story never drift apart.
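A sketch of the row shape and the per-ticket rollup. Column names beyond url, seed, model, and prompt are assumptions about this team's schema, not anything AgentFramer prescribes:

```typescript
// One row per generated asset; cost lives next to the asset so the two never drift.
type CampaignAsset = {
  ticket: string;
  shot: string;
  url: string;
  seed: number;
  model: string;
  prompt: string;
  billedCredits: number; // taken from the get_generation response
};

// Per-ticket cost rollup -- the query finance asks for, as in-memory arithmetic.
function costByTicket(rows: CampaignAsset[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of rows) {
    totals.set(r.ticket, (totals.get(r.ticket) ?? 0) + r.billedCredits);
  }
  return totals;
}
```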
Where this approach breaks
Be honest about the failure surface. Agent-driven image generation is excellent for a fairly narrow band of work and bad outside it.
- Brand-strict art direction. If the design team rejects anything that is not pixel-perfect to a 40-page brand book, an agent will burn credits and still not satisfy the reviewer. Use it for drafts and let a human pick the final.
- Character continuity across many shots. Same face, same outfit, ten frames. Even with locked seeds and careful prompts, identity drifts. LoRAs, IP-Adapter, or a video model with a reference image are usually a better shape than four independent text-to-image calls.
- Anything requiring taste in the loop. Editorial covers, awards work, anything where the answer to "is this good?" is "I'll know when I see it." The agent can accelerate the search but should not be the final picker.
- Regulated or sensitive imagery. Medical, legal, political, anything involving real named people. The combination of model bias, safety filters, and licensing exposure is not worth the speedup.
When you do not want an agent generating images
The honest counter-list: do not point an agent at this if a designer can knock it out in fifteen minutes, if the asset ships to a homepage hero and gets seen by millions, or if the review cycle has more than two named approvers. Agent generation pays off on volume, drafts, and internal-facing assets where iteration speed beats final polish. A four-shot launch like this run is right in the sweet spot: enough volume to justify the orchestration, low enough stakes that "good enough on the second try" is genuinely good enough.
Cost math sharpens this. Four FLUX 1.1 pro renders at $0.04 each is $0.16. A junior designer billing 90 minutes at $80 per hour is $120. The agent wins by nearly three orders of magnitude on cost alone, but only if the team accepts that the output is a draft surface, not a final. Treat it as a paint roller, not a paintbrush. Volume is the point.
Ship it
The whole loop above is roughly 150 lines of TypeScript on top of the AgentFramer MCP server. No queue infrastructure, no model hosting, no polling service to operate. Connect the server to your agent, give it the four tools (generate_image, get_generation, list_recent_generations, get_credits), and let it work. Start with the quickstart, then create a workspace and run your first batch. The next launch ticket is already in the queue.