
April 2026

Batch image generation for AI agents: patterns and pitfalls

Five hundred ecommerce product shots, due before the marketing team logs in tomorrow. The agent has eight hours and one job. Here is the shape of work that finishes on time without burning the budget.

The 500-shot overnight scenario

Concrete setup. A catalog team has 500 SKUs. Each SKU needs one hero shot on FLUX 1.1 pro and three lifestyle variants on SDXL Lightning. The agent kicks off at 22:00; results are checked at 07:00. That is 2,000 generations in nine hours. At a naive one-at-a-time pace the window is already lost: at 20 to 30 seconds per round trip, a sequential run needs eleven to seventeen hours. The job only fits inside the window if the agent treats the batch as a set, not a loop.

The patterns below are the ones that actually keep that run on the rails: bounded concurrency, deterministic idempotency keys, retry-with-fallback, a circuit breaker, and budget rules the agent can enforce without a human in the loop.

None of this is theoretical. Most of the failures you will hit on a two-thousand-image run do not show up at twenty. The interesting thing about scale is which problems become statistically certain. At twenty images, a 1% NSFW false positive rate means you probably see zero. At two thousand, you will see twenty, and they will arrive in clusters because the same kind of prompt template trips the same filter rule.

Concurrency, not iteration

Naive agents iterate one job at a time. The right pattern fires jobs concurrently and bounds the parallelism so a slow tail does not block the runway. AgentFramer's per-workspace concurrency limit is set high enough that batches of fifty in flight at once are routine; the agent's job is to feed that pipe without overshooting it.

A semaphore-bounded take on Promise.allSettled is the right primitive. It runs up to N jobs concurrently, never aborts the whole batch on a single failure, and lets you classify outcomes afterward. The 2,000-image run becomes a rolling window of fifty in flight rather than two thousand sequential calls.

The wrong primitive is Promise.all. One rejection unwinds the whole batch, you lose track of which jobs succeeded, and the agent ends up retrying work that already completed. Use allSettled and let the caller decide what counts as a failure.

// Bounded concurrency over a batch with deterministic idempotency keys.
import { createHash } from "node:crypto";

type Job = {
  sku: string;
  variant: "hero" | "lifestyle-1" | "lifestyle-2" | "lifestyle-3";
  model: "flux-1.1-pro" | "sdxl-lightning";
  prompt: string;
  promptVersion: string;
};

const idempotencyKey = (j: Job) =>
  createHash("sha256")
    .update([j.sku, j.variant, j.model, j.promptVersion].join("|"))
    .digest("hex");

async function semaphore<T>(limit: number, tasks: (() => Promise<T>)[]) {
  const results: PromiseSettledResult<T>[] = new Array(tasks.length);
  let cursor = 0;
  // `limit` workers each pull the next unclaimed index. The read-and-increment
  // of `cursor` is synchronous, so the single-threaded event loop keeps it race-free.
  const workers = Array.from({ length: limit }, async () => {
    while (cursor < tasks.length) {
      const i = cursor++;
      try {
        results[i] = { status: "fulfilled", value: await tasks[i]() };
      } catch (err) {
        // Settle in place instead of rethrowing: one failure never unwinds the batch.
        results[i] = { status: "rejected", reason: err };
      }
    }
  });
  await Promise.all(workers);
  return results;
}

export async function runBatch(jobs: Job[], generate: (j: Job, key: string) => Promise<unknown>) {
  const tasks = jobs.map((j) => () => generate(j, idempotencyKey(j)));
  return semaphore(50, tasks);
}
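
Feeding runBatch for the catalog scenario means expanding each SKU into its four variants up front. A minimal sketch; the skus list, the buildPrompt helper, the callGenerateImage wrapper, and the "v3" prompt version are hypothetical stand-ins:

// Hypothetical stand-ins: the SKU list, a prompt-template helper, and the
// generate_image wrapper (one version is sketched in the next section).
declare const skus: string[];
declare function buildPrompt(sku: string, variant: string): string;
declare function callGenerateImage(j: Job, key: string): Promise<unknown>;

// Expand 500 SKUs into 2,000 jobs: one hero, three lifestyle variants each.
const lifestyles = ["lifestyle-1", "lifestyle-2", "lifestyle-3"] as const;

const jobs: Job[] = skus.flatMap((sku): Job[] => [
  { sku, variant: "hero", model: "flux-1.1-pro", prompt: buildPrompt(sku, "hero"), promptVersion: "v3" },
  ...lifestyles.map((variant): Job => ({
    sku,
    variant,
    model: "sdxl-lightning",
    prompt: buildPrompt(sku, variant),
    promptVersion: "v3",
  })),
]);

const settled = await runBatch(jobs, callGenerateImage);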

Idempotency: deterministic keys per logical image

Agents retry. Networks blip. Workers crash. Without idempotency, every retry is a new charge for work you already paid for. AgentFramer's generate_image accepts an idempotency_key. The recipe that holds up under load is a hash of every input that defines the logical image:

sha256(sku + "|" + variant + "|" + model + "|" + prompt_version)

Same inputs, same key, same result. A retried call returns the existing job; a re-run after a prompt edit (new prompt_version) generates fresh images. This is what lets the agent treat generate_image as safe to call twice. See the MCP tool reference for the full parameter list.

A subtle but important rule: keep the seed out of the key unless you intend the seed to be part of the image's identity. If the seed is randomized per call, every retry computes a new key and you have lost idempotency. Pin the seed (or omit it from the key) and you keep the safety net. The same goes for any per-call jitter in prompts — bake stable phrasing into prompt_version rather than mutating the prompt string between attempts.
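
One way to pin the seed without folding it into the key is to derive it from the key itself, so every retry of the same logical image passes the same seed. A sketch, assuming the tool accepts a numeric seed parameter; generateImage is a hypothetical wrapper, and the real parameter names are in the MCP tool reference:

// Hypothetical wrapper around the generate_image tool.
declare function generateImage(args: {
  prompt: string;
  model: string;
  seed: number;
  idempotency_key: string;
}): Promise<unknown>;

// Same key, same seed, on every retry; the seed never feeds back into the key.
const seedFor = (key: string) => parseInt(key.slice(0, 8), 16);

async function callGenerateImage(j: Job, key: string) {
  return generateImage({
    prompt: j.prompt,
    model: j.model,
    seed: seedFor(key),
    idempotency_key: key,
  });
}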

Retry-with-fallback, not retry-the-same-thing

In a batch of two thousand, a few percent will fail. The failures cluster into shapes you can plan for:

  • NSFW filter. Soften the prompt (strip ambiguous adjectives, add neutral context) and retry once. If it still fires, mark and skip.
  • Per-model rate limit. Do not retry the same model. Fall back to an equivalent: FLUX 1.1 pro to FLUX dev for hero, SDXL Lightning to FLUX schnell for variants.
  • 429 / transient. Exponential backoff with jitter. Cap at three attempts.
  • Opaque upstream error. One retry, then mark for human review.

The agent should keep a small state map (jobId → status) and re-shoot the failed subset at the end of the run, not abort the batch. Persist that map outside the agent's working memory — a database row, a JSON file in object storage, anywhere a restart can rehydrate it. Combined with idempotency keys, this means a crashed agent can resume without re-billing a single completed job.

Distinguish between fallback at the job level and fallback at the model level. Job-level fallback (soften the prompt, try again) is cheap and the right first move for content-policy failures. Model-level fallback (FLUX 1.1 pro to FLUX dev) is what saves the batch when an entire upstream is degraded. Wire both, and order them so the cheaper option runs first.
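
A sketch of that ordering. classify and soften are hypothetical helpers (the real error shapes and the right softening pass depend on the provider's responses), generate is the same callback runBatch receives, and a single attempt counter keeps the sketch short:

type FailureKind = "nsfw" | "rate-limit" | "transient" | "opaque";

// Hypothetical helpers: map a provider error onto the taxonomy above, and
// strip ambiguous adjectives / add neutral context to a flagged prompt.
declare function classify(err: unknown): FailureKind;
declare function soften(prompt: string): string;
declare function generate(j: Job, key: string): Promise<unknown>;

const MODEL_FALLBACK: Record<string, string> = {
  "flux-1.1-pro": "flux-dev",
  "sdxl-lightning": "flux-schnell",
};

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function generateWithFallback(job: Job, attempt = 0): Promise<unknown> {
  try {
    return await generate(job, idempotencyKey(job));
  } catch (err) {
    switch (classify(err)) {
      case "nsfw":
        if (attempt > 0) throw err; // softened once already: mark and skip
        // A new promptVersion gives the softened prompt a fresh idempotency key.
        return generateWithFallback(
          { ...job, prompt: soften(job.prompt), promptVersion: job.promptVersion + "-soft" },
          attempt + 1,
        );
      case "rate-limit": {
        const fallback = MODEL_FALLBACK[job.model];
        if (!fallback || attempt > 0) throw err;
        // Real code would widen Job["model"] to include the fallback models.
        return generateWithFallback({ ...job, model: fallback as Job["model"] }, attempt + 1);
      }
      case "transient":
        if (attempt >= 2) throw err; // cap at three attempts total
        await sleep(2 ** attempt * 1000 + Math.random() * 500); // backoff with jitter
        return generateWithFallback(job, attempt + 1);
      default: // opaque upstream error
        if (attempt > 0) throw err; // one retry, then mark for human review
        return generateWithFallback(job, attempt + 1);
    }
  }
}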

Circuit breaker

Every batch eventually meets a degraded provider. Without a brake, the agent burns retries on jobs that were never going to succeed. Rule: if more than 10% of jobs in the last sixty seconds fail, pause for sixty seconds before resuming. Two consecutive trips and the agent halts the run and reports state.

The thresholds are tunable, but the shape matters more than the numbers. A cooldown that is too short turns the breaker into a noise generator; one that is too long stalls a recoverable batch. Sixty seconds is long enough for a transient upstream blip to clear, short enough that the overnight window is not blown by one trip.

class CircuitBreaker {
  private events: { t: number; ok: boolean }[] = [];
  private trippedUntil = 0;
  private trips = 0;

  constructor(
    private windowMs = 60_000,
    private threshold = 0.1,
    private cooldownMs = 60_000,
    private maxTrips = 2,
  ) {}

  record(ok: boolean) {
    const now = Date.now();
    this.events.push({ t: now, ok });
    this.events = this.events.filter((e) => now - e.t <= this.windowMs);
    const total = this.events.length;
    if (total < 20) return; // too small a sample to judge a fail rate
    const failRate = this.events.filter((e) => !e.ok).length / total;
    if (failRate > this.threshold && now >= this.trippedUntil) {
      // Guarding on trippedUntil counts a burst of failures as one trip,
      // not one trip per failing record() call mid-cooldown.
      this.trippedUntil = now + this.cooldownMs;
      this.trips++;
      this.events = []; // judge the post-cooldown window on fresh data
    } else if (failRate <= this.threshold && now >= this.trippedUntil) {
      this.trips = 0; // a healthy window after recovery resets the consecutive count
    }
  }

  async gate() {
    if (this.trips >= this.maxTrips) throw new Error("circuit open: halt batch");
    // Park the caller until the cooldown expires; zero or negative wait is a no-op.
    const wait = this.trippedUntil - Date.now();
    if (wait > 0) await new Promise((r) => setTimeout(r, wait));
  }
}
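
Wiring the breaker into the worker pool is two lines per task: gate before the call, record after. A sketch reusing the pieces above:

const breaker = new CircuitBreaker();

const gatedTasks = jobs.map((j) => async () => {
  await breaker.gate(); // waits out a cooldown; throws once the trip limit is hit
  try {
    const result = await generateWithFallback(j);
    breaker.record(true);
    return result;
  } catch (err) {
    breaker.record(false);
    throw err; // still settles as a rejection in the semaphore's results
  }
});

const settled = await semaphore(50, gatedTasks);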

Cost discipline

Batches are where cost surprises happen. The same prompt accidentally re-issued at fifty-wide concurrency is a five-figure mistake before the agent notices. Four rules that keep it boring:

  1. Read get_credits before the batch. If the projected spend (jobs × model price) exceeds available credits, stop and ask.
  2. Hard per-batch cap. The agent maintains a running cost estimate and aborts the wave if it crosses the cap, even mid-run (see the sketch after this list).
  3. Draft-then-final. SDXL Lightning passes for first-look variants; FLUX 1.1 pro only on the hero shot or after human approval. For this catalog run, that ratio alone cuts spend by roughly 60% versus running the premium model on every variant.
  4. Read get_credits again after the batch and reconcile against the projection. A delta over 5% means a bug, not a pricing surprise.
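
A minimal sketch of rules 1 and 2; the per-image prices are placeholders rather than published numbers, and getCredits is a hypothetical wrapper around the get_credits tool:

declare function getCredits(): Promise<number>; // hypothetical wrapper around get_credits

// Placeholder prices in credits per image; illustrative only.
const PRICE: Record<Job["model"], number> = {
  "flux-1.1-pro": 5,
  "sdxl-lightning": 1,
};

async function preflight(jobs: Job[], batchCap: number): Promise<number> {
  const projected = jobs.reduce((sum, j) => sum + PRICE[j.model], 0);
  if (projected > batchCap) {
    throw new Error(`projected ${projected} credits exceeds the batch cap of ${batchCap}`);
  }
  const available = await getCredits();
  if (projected > available) {
    throw new Error(`projected ${projected} credits but only ${available} available: stop and ask`);
  }
  return projected; // reconcile against actual spend after the run (rule 4)
}

// Rule 2 at run time: add each settled job's price to a running total inside
// the task wrapper and stop claiming new work once the cap is crossed.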

Model selection is a cost lever as much as a quality one — the models page lists the FLUX, SDXL, Ideogram, Imagen, and Playground options with their relative pricing.

Storage at scale: don't carry 500 URLs in context

Fifty signed URLs in an agent's context window is fine. Two thousand is not — the context bill alone makes the batch unprofitable, and the agent starts confusing one URL for another. The pattern: store only job IDs during the run, then call list_recent_generations once the batch is complete to materialize the URLs into the downstream system (catalog DB, asset pipeline, CDN). Signed URLs default to 30-day retention, which is more than enough headroom for an overnight job and a morning review.
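
In code the discipline is small: keep job IDs during the run, make one listing call at the end. list_recent_generations is the real tool; the wrapper shape, the jobId field on results, and the catalog client below are stand-ins:

// Hypothetical shapes; adjust to the actual tool response and downstream API.
declare function listRecentGenerations(args: { limit: number }): Promise<{ jobId: string; url: string }[]>;
declare const catalogDb: { upsertAsset(jobId: string, url: string): Promise<void> };

// During the run: keep only job IDs, never signed URLs, in working state.
const jobIds = settled
  .filter((r): r is PromiseFulfilledResult<{ jobId: string }> => r.status === "fulfilled")
  .map((r) => r.value.jobId);

// After the run: one call materializes URLs straight into the catalog.
const wanted = new Set(jobIds);
for (const g of await listRecentGenerations({ limit: jobIds.length })) {
  if (wanted.has(g.jobId)) await catalogDb.upsertAsset(g.jobId, g.url);
}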

A useful discipline: the agent's chat history should never contain more than a handful of URLs at once. If it does, the batch is mis-shaped. Push results to storage, push pointers to a queue, and keep the conversation focused on decisions, not artifacts.

When batch is the wrong unit

Batch is the right shape when the work is repetitive, the prompt template is settled, and the success criterion is "ship 2,000 images that look like the spec." Batch is the wrong shape when:

  • Heavy art-direction iteration. The hero illustration for a product launch is a one-of-one. You need ten attempts, each informed by the last. That is a conversation, not a queue.
  • Brand reviews. Every frame needs a human eye before the next one is generated. A batch of fifty just produces fifty things to throw away.
  • Creative exploration. When the goal is "find a direction we like," variance is the feature. Generating two thousand variations of the same prompt is the opposite of what you want; you want twenty very different prompts.

For those, drop concurrency to one and put the human in the loop. Batch infrastructure should be able to do this without code changes — the same generate_image tool, same idempotency keys, just a smaller wave size and a pause between each call.

What "done" looks like at 07:00

A successful overnight run leaves three things on the morning review desk: a count (1,994 of 2,000 succeeded), a small list of failures with reasons (six NSFW false positives, prompts attached), and a credit reconciliation that matches the projection within a few percent. No URL dump in chat. No "the batch crashed at item 370 and I do not know what state it is in." That is the difference between batch generation that scales and batch generation that gets paged on.

The whole pattern in one line: bound concurrency, key every job, retry with fallbacks, brake on systemic failure, cap the spend, and pull URLs out of context. More worked examples in the tool-call patterns guide.