Video Production Pipeline

Published articles become narrated videos automatically. The pipeline takes article text as input and produces a finished video with AI-generated slides, professional narration, word-level timing, and a music bed. No manual timing. No editing software. One command.

Standard articles target 5-7 minutes. Deep investigations with multiple source documents and extensive background can run 10-14 minutes. The length is driven by the content, not by a preset.


Pipeline Overview

Six stages run in sequence. Each stage produces output that feeds directly into the next.

1. Newscast Script

The article text is rewritten for broadcast delivery. Print writing and broadcast writing are different disciplines. Print leads with context; broadcast leads with what happened. Print tolerates long sentences; broadcast must hold up when spoken aloud in 30-second segments without the listener losing the thread.

An AI agent rewrites the article in newscast style: punchy sentence structure, active voice, spoken transitions between topics, and a script length calibrated to the target video duration. The script is not a summary. It covers the same ground as the article but in the form that works when read aloud.
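Calibrating script length to a target duration reduces to a words-per-minute estimate. A minimal sketch, assuming a typical broadcast pace of around 150 words per minute (the rate is an assumption; tune it per voice and style):

```python
def target_word_count(minutes, wpm=150):
    """Words needed to fill a given narration length.

    150 wpm is a common broadcast planning figure; the actual
    rate depends on the TTS voice and style parameters.
    """
    return int(minutes * wpm)
```

A 5-7 minute standard article therefore targets roughly 750-1,050 words of script; a 10-14 minute deep investigation targets roughly 1,500-2,100.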

2. TTS Narration

The newscast script is fed to a text-to-speech engine. The voice is prompted with style parameters: pacing, tone, and register appropriate for the publication. The engine outputs a single audio file covering the full script.

A separate voice profile is used for promotional inserts. Mid-roll promos use a distinct voice so listeners can distinguish editorial content from sponsored content. The promo pool is evergreen: the same set of promos rotates across multiple videos rather than requiring new production for each one.
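The evergreen rotation can be as simple as indexing into the promo pool by video number, so each video gets the next promo without new production. A sketch (filenames are hypothetical):

```python
def pick_promo(pool, video_index):
    """Evergreen rotation: each video takes the next promo in the
    pool, wrapping around instead of requiring new production."""
    return pool[video_index % len(pool)]
```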

3. Word-Level Timestamps

The narration audio is processed to extract word-level timestamps. Each word in the script gets a start time and end time in the audio. This is the data that makes automated slide transitions possible without manual timing.

The timestamp extractor maps the script text to the spoken audio, accounting for natural variation in pacing. The result is a timing file: a complete record of when each word was spoken.
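Downstream stages use the timing file to answer one question: when was a given phrase spoken? A minimal lookup, assuming the timing file is a list of per-word entries with "word", "start", and "end" fields (the exact schema is an assumption):

```python
def time_of_phrase(timing, phrase):
    """Return the start time of the first occurrence of `phrase`.

    `timing` is a list of {"word", "start", "end"} entries, one per
    spoken word, in narration order.
    """
    target = phrase.lower().split()
    # Normalize spoken words: lowercase, strip trailing punctuation
    tokens = [entry["word"].lower().strip(".,!?") for entry in timing]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return timing[i]["start"]
    raise ValueError(f"phrase not spoken: {phrase!r}")
```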

4. Slide Generation

Slides are generated by an AI image generation API. Each slide corresponds to a section of the script. The AI receives a description of what the slide should show, derived from the script content, along with constraints for typography, color palette, and layout.

Key design decision: slides render all text natively as part of the generated image. There are no FFmpeg text overlays applied after the fact. This means what you see in the slide is exactly what the image generation API produced. Overlays introduce font inconsistencies and require separate positioning logic; native rendering avoids both problems.

The slide set covers the full video: an abstract title card for the opening, document-composite images for middle sections where real source material is layered onto generated backgrounds, and a closing card. Transition timing for each slide is calculated from the word-level timestamps.
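Once each slide's cue word has been resolved to a start time, the schedule falls out directly: each slide runs until the next slide's cue, and the last slide runs to the end of the narration audio. A sketch of that calculation:

```python
def slide_schedule(cue_times, audio_duration):
    """Convert per-slide start times into (start, duration) pairs.

    cue_times: sorted start times, one per slide (first is 0.0).
    The final slide holds until the narration audio ends.
    """
    schedule = []
    for i, start in enumerate(cue_times):
        end = cue_times[i + 1] if i + 1 < len(cue_times) else audio_duration
        schedule.append((start, end - start))
    return schedule
```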

5. FFmpeg Assembly

FFmpeg receives the slide images, the narration audio, the timestamp-driven transition schedule, and a music bed track. It assembles the final video: slides advance on the timestamp-driven cues, the narration plays over them, and the music bed plays at reduced volume underneath the narration.

The music bed includes a short lead-in before the narration starts. This gives the video an audio opening rather than cutting directly to speech. After the narration ends, the music fades out.

No manual editing step exists in this pipeline. The timestamps drive all transitions. If the narration pacing changes, regenerating the timestamps and re-running FFmpeg produces a corrected video automatically.
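The assembly step can be sketched with FFmpeg's concat demuxer for the timed slide track and an amix filter for the audio mix. This is a simplified version under assumed filenames; the music lead-in and fade-out would add adelay/afade filters, omitted here for brevity:

```python
import subprocess

def write_concat_list(schedule, path="slides.txt"):
    """Write an FFmpeg concat-demuxer list holding each slide
    for its timestamp-derived duration."""
    with open(path, "w") as f:
        for image, duration in schedule:
            f.write(f"file '{image}'\nduration {duration:.3f}\n")
        # Concat demuxer quirk: repeat the last file so its
        # duration directive is honored.
        f.write(f"file '{schedule[-1][0]}'\n")

def assemble(concat_list, narration, music, out="video.mp4"):
    """Mux slides with narration over a quiet music bed."""
    cmd = [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", concat_list,  # slide track
        "-i", narration,                                  # voice track
        "-i", music,                                      # music bed
        "-filter_complex",
        # Duck the bed, then mix; duration=first ends with narration
        "[2:a]volume=0.15[bed];"
        "[1:a][bed]amix=inputs=2:duration=first[mix]",
        "-map", "0:v", "-map", "[mix]",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-shortest", out,
    ]
    subprocess.run(cmd, check=True)
```

The 0.15 volume factor and output codec settings are illustrative defaults, not the pipeline's actual values.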

6. Output

The finished video file is ready for upload. It goes into the article canvas alongside the TTS audio player, so readers can choose between reading the article, listening to narrated audio, or watching the video.


Key Design Decisions

Slides render all text natively. The image generation API creates the final visual, including any text on the slide. No post-processing text overlay step exists. This keeps the visual consistent and eliminates a class of positioning bugs that come with layering text onto images in FFmpeg.

Timestamp-driven transitions. Every slide change is triggered by a word in the narration script, not by a fixed interval. If a section takes longer to say than expected, the slide stays up until the relevant words are spoken. Manual timing is not required at any point.

Fresh build pattern. When something needs to change, the pipeline regenerates from source rather than patching the assembled video. Change the script, regenerate the narration, extract new timestamps, and re-run assembly. This takes minutes and produces a clean output. Patching video files is error-prone and slow; regenerating is faster and more reliable.
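The fresh-build pattern is just the stage sequence rerun from the top. A sketch with the stage implementations injected as callables (the real implementations are the pipeline tools; the names here are placeholders):

```python
def rebuild(article, stages):
    """Fresh build: regenerate every artifact from the article source.

    `stages` maps stage names to callables with the shapes below;
    no intermediate artifact from a previous run is reused.
    """
    script = stages["rewrite"](article)            # stage 1
    audio = stages["tts"](script)                  # stage 2
    timing = stages["timestamps"](audio, script)   # stage 3
    slides = stages["slides"](script)              # stage 4
    return stages["assemble"](slides, audio, timing)  # stage 5
```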


Tools Involved

Tool                     Function
-----------------------  ------------------------------------------------------------
Script rewriter          AI agent that converts article text to newscast script
                         format.
TTS engine               Converts script text to narration audio. Supports speech
                         rate control via a flag. Separate voice profile for mid-roll
                         promos.
Timestamp extractor      Processes narration audio and outputs word-level timing data
                         for the full script.
Slide generation script  Calls the image generation API with slide prompts derived
                         from the script. Returns image files for each slide.
FFmpeg assembler         Takes slides, narration, timestamps, and music bed. Outputs
                         the finished video file.

Agent Dispatch

The full pipeline runs as an agent-dispatched workflow. A dedicated agent spec covers the end-to-end process: inputs, stage sequence, quality checks, output paths, and error handling. The orchestrator invokes the agent with the article source and target publication. The agent handles the rest without further prompting.

Before dispatch, the orchestrator queries the memory server for video production rules. Corrections from prior runs, format decisions, and known edge cases are injected into the agent prompt as a rules block. This means lessons learned from previous video builds reach the agent on every run, not just the ones where someone remembered to mention them.
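Injecting recalled rules can be a simple prompt transform: format the rules as a block and prepend it to the agent prompt. A sketch (the block layout is an assumption; the real format is whatever the agent spec expects):

```python
def inject_rules(agent_prompt, rules):
    """Prepend recalled production rules to the agent prompt.

    Returns the prompt unchanged when no rules were recalled.
    """
    if not rules:
        return agent_prompt
    block = "\n".join(f"- {rule}" for rule in rules)
    return f"Production rules from prior runs:\n{block}\n\n{agent_prompt}"
```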

The /video-production skill is the trigger command. Say "make a video" or "produce video" and the skill fires, performs the pre-dispatch memory recall, and dispatches the agent.