How to Build an AI Agent for Automated Image & Video Creation
Nov 06, 2025
We’ve been assembling the ultimate AI agent, piece by piece. So far, we have agents that can manage email, organize Google Drive, schedule your calendar, and conduct web research. Each one is a powerful tool on its own. But today, we build the creative arm of our operation: the content agent.
This agent is designed to handle the entire content lifecycle. It can create images from a simple idea, edit those images with specific instructions, and even transform static pictures into dynamic videos. It’s a practical step in moving from scattered, manual tasks (what I call Level 1 automation) to building a scalable, departmental system, which is the hallmark of Level 3.
Let’s look at how it works.
At the core, an orchestrator agent acts as the main controller, allowing us to give commands through Telegram. We’ve given this orchestrator a new instruction: when a task involves content, it must use the “Creative Agent.”
This Creative Agent is a specialist. Its instructions are clear: “You are an expert in generating AI images and videos. You have tools to create, edit, and convert media.”
I’ve provided it with specific guidelines:
- Image Prompts: Must be detailed and stylized.
- Video Prompts: Should be concise and energetic, describing sounds and dialogue to create a seamless final product.
This setup ensures that instead of just executing a command, the agent understands the intent behind the request and uses its specialized tools to achieve the best possible outcome.
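As a minimal sketch of the setup above, the Creative Agent's role and guidelines can be kept as structured data and rendered into a system message. The names `AgentSpec` and `build_system_prompt` are hypothetical, and the wording is taken from the guidelines in this post rather than from the actual workflow file. In n8n, the same text would typically be typed straight into the agent's system message.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSpec:
    """Hypothetical container for an agent's role and prompt guidelines."""
    name: str
    role: str
    guidelines: dict[str, str] = field(default_factory=dict)


def build_system_prompt(spec: AgentSpec) -> str:
    """Render the spec into a plain-text system message."""
    lines = [f"You are the {spec.name}. {spec.role}", "", "Guidelines:"]
    for topic, rule in spec.guidelines.items():
        lines.append(f"- {topic}: {rule}")
    return "\n".join(lines)


creative_agent = AgentSpec(
    name="Creative Agent",
    role=(
        "You are an expert in generating AI images and videos. "
        "You have tools to create, edit, and convert media."
    ),
    guidelines={
        "Image Prompts": "Must be detailed and stylized.",
        "Video Prompts": (
            "Should be concise and energetic, describing sounds and dialogue."
        ),
    },
)

system_prompt = build_system_prompt(creative_agent)
print(system_prompt)
```

Keeping the guidelines as data makes it easy to add a rule for a new media type without rewriting the whole prompt.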
The first workflow is straightforward. We send the agent a prompt and a chat ID. The agent takes this, expands on the idea, and generates an image.
For example, I started with a simple command:
“Use the creative agent to generate an image of an influencer man.”
The agent didn’t just take my words literally. It rewrote the request as a more descriptive prompt, “male influencer photo realistic,” and used that to generate two high-quality images. The workflow is simple:
- The request is received.
- It calls the image generation model (in this case, Flux dev).
- After a short wait, the final image is downloaded.
- The image is sent back via Telegram and simultaneously stored in Google Drive for future use.
This is a perfect example of a Level 2 automation: taking a repetitive, manual process and creating a reliable, automated workflow.
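The four steps above can be sketched as a single function. This is an illustration under stated assumptions, not the n8n workflow itself: `submit_job`, `fetch_image`, `send_to_telegram`, and `save_to_drive` are hypothetical callables standing in for the image model, Telegram, and Google Drive nodes, and the polling interval is a guess.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ImageRequest:
    prompt: str
    chat_id: str


def run_image_workflow(
    request: ImageRequest,
    submit_job: Callable[[str], str],                 # hypothetical: returns a job ID
    fetch_image: Callable[[str], Optional[bytes]],    # hypothetical: None while pending
    send_to_telegram: Callable[[str, bytes], None],   # hypothetical delivery step
    save_to_drive: Callable[[bytes], None],           # hypothetical storage step
    poll_seconds: float = 5.0,                        # assumed wait between checks
    max_polls: int = 24,
) -> bytes:
    """Submit a prompt, wait for the image, then deliver and store it."""
    job_id = submit_job(request.prompt)
    for _ in range(max_polls):
        image = fetch_image(job_id)
        if image is not None:
            # Delivered one after the other here; the real workflow
            # may run these two branches in parallel.
            send_to_telegram(request.chat_id, image)
            save_to_drive(image)
            return image
        time.sleep(poll_seconds)
    raise TimeoutError(f"Image job {job_id} did not finish in time")
```

Passing the integrations in as callables keeps the sketch testable without network access: in a test you can hand it simple stand-in functions and set `poll_seconds=0`.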
Creating an image is one thing. Editing it is another. This is where we can test the agent’s ability to handle multi-step, contextual tasks.
For this, we use a tool built on Google’s Gemini 2.5 Flash Image model, better known as Nano Banana. The workflow needs to be slightly more intelligent. It can’t just take a prompt; it needs to find the specific image to edit.
I gave it this command:
“Can you take this photo of Influencer Man portrait.png and edit this image to give him wings. Get the file ID from the Google Drive agent.”
Here’s what happened behind the scenes:
- The orchestrator first tasked the Google Drive agent to search for the file named “Influencer Man portrait.png.”
- Once found, the Drive agent returned the unique file ID.
- The orchestrator then passed this file ID and the editing prompt (“give him wings”) to the Creative Agent.
- The Creative Agent used a workflow to grant temporary shareable permissions to the file, download it, and send it to the Gemini model for editing.
The result was the original image, now with wings. This demonstrates a more advanced automation, where different specialized agents collaborate to complete a complex request.
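Here is a rough sketch of that hand-off between agents. The four callables (`find_file_id`, `share_file`, `download_file`, `edit_image`) are hypothetical stand-ins for the Google Drive agent, the permissions step, the download step, and the Gemini editing call; none of them are real library functions.

```python
from typing import Callable, Optional


def edit_image_from_drive(
    file_name: str,
    edit_prompt: str,
    find_file_id: Callable[[str], Optional[str]],   # Drive agent: name -> file ID
    share_file: Callable[[str], None],              # grant temporary access
    download_file: Callable[[str], bytes],          # fetch the image bytes
    edit_image: Callable[[bytes, str], bytes],      # editing model call
) -> bytes:
    """Look up a file by name, make it readable, download it, and edit it."""
    # Step 1: the orchestrator asks the Drive agent for the file ID.
    file_id = find_file_id(file_name)
    if file_id is None:
        raise FileNotFoundError(f"No Drive file named {file_name!r}")

    # Steps 2 and 3: the Creative Agent's workflow opens up and downloads the file.
    share_file(file_id)
    original = download_file(file_id)

    # Step 4: the image and the instruction go to the editing model.
    return edit_image(original, edit_prompt)
```

Failing early when the file is not found matters here: without a file ID, every later step would run against nothing.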
The final step is video generation. The agent can do this in two ways: creating a video from a text prompt or animating an existing image.
For text-to-video, the workflow is simple. We provide a prompt, title, and aspect ratio. I tested it with, “Create a video of a man walking in the street.” The agent used a model like Google’s Veo 3.1 to generate a short clip. While faster models produce lower-resolution video, you can easily switch to higher-quality options for a more refined result.
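To make the three inputs concrete, here is a small sketch of how a text-to-video request could be validated before it reaches the model. The class name, the payload field names, and the set of allowed aspect ratios are all assumptions for illustration; check your video model's documentation for the values it actually accepts.

```python
from dataclasses import dataclass

# Assumption: landscape and portrait only. Adjust to what your model supports.
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}


@dataclass(frozen=True)
class VideoRequest:
    """Hypothetical input for the text-to-video workflow."""
    prompt: str
    title: str
    aspect_ratio: str = "16:9"

    def __post_init__(self) -> None:
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio: {self.aspect_ratio}")

    def to_payload(self) -> dict:
        """Build the request body (field names are illustrative)."""
        return {
            "prompt": self.prompt,
            "title": self.title,
            "aspect_ratio": self.aspect_ratio,
        }


request = VideoRequest(
    prompt="Create a video of a man walking in the street.",
    title="Man walking",
    aspect_ratio="9:16",
)
print(request.to_payload())
```

Validating up front means a bad aspect ratio fails immediately instead of after a slow, billable generation call.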
The more impressive workflow is image-to-video. This is where the system truly comes together. I asked the agent to take the edited image of the man with wings and convert it into a video.
This process is the most complex:
- The agent first finds the file ID in Google Drive.
- It changes the file’s permissions to make it accessible.
- The file is downloaded and converted into a base64 string.
- This string is sent to the video generation model, which animates the static image.
- The final video is delivered via Telegram and saved to Google Drive.
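The base64 step is the one that trips people up, so here is a minimal sketch of it. The encoding itself uses Python's standard `base64` module; the payload shape and the function names are assumptions, since each video model expects its own field names.

```python
import base64


def image_to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def build_image_to_video_payload(
    image_bytes: bytes,
    prompt: str,
    mime_type: str = "image/png",
) -> dict:
    """Assemble a request body for an image-to-video call.

    The field names below are illustrative, not a real API schema.
    """
    return {
        "prompt": prompt,
        "image": {
            "mime_type": mime_type,
            "data": image_to_base64(image_bytes),
        },
    }
```

Base64 turns binary data into plain text so the image can travel inside a JSON request body. The trade-off is size: the encoded string is roughly a third larger than the original file.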
Having an agent that can handle emails, files, and content creation is more than just a time-saver. It’s a strategic shift. Most companies are stuck in Level 1, using a dozen different tools for a dozen different tasks. One tool for images, another for video, a third for social media scheduling. It’s chaotic and inefficient.
This ultimate AI agent demonstrates the principles of Level 4 automation: integrating disparate systems into a single, cohesive unit. When your agents can think and collaborate, you move beyond simple task automation. You start building an intelligent system that can handle entire operational functions.
The workflows for this content agent are intricate. If you’re not technically savvy, setting up the connections and permissions can be tricky. I will have the full n8n code and detailed instructions in our Corporate Automation Library (CAL), which hosts over 60 high-impact, high-ROI automations. New corporate automations are added weekly.
Once your content is generated, the next logical step is to publish it. You can automate that too. But first, it’s crucial to understand where your business currently stands. Are you operating in the chaos of Level 1, or are you building integrated systems?
Ritesh Kanjee | Automations Architect & Founder Augmented AI
(121K Subscribers | 58K LinkedIn Followers)

