Auto-Generating Meeting Minutes from Video with a Gemini CLI Skill
I created a Gemini CLI skill called video-to-minutes that auto-generates meeting minutes from video files.
Overview of the video-to-minutes Skill
This skill takes a video file as input, executes the following steps sequentially, and ultimately generates meeting minutes in Markdown format.
| Step | Process | Tool Used |
|---|---|---|
| 1 | Check and install prerequisite tools | ffmpeg, whisper |
| 2 | Get video path and capture interval from user (default 60s) | - |
| 3 | Extract audio from video | ffmpeg |
| 4 | Transcribe audio (automated) | whisper (turbo model) |
| 5 | Capture images from video at regular intervals | ffmpeg |
| 6 | Collect proper nouns from user | - |
| 7 | Analyze transcript and generate meeting minutes | Gemini |
| 8 | Save as Markdown file | - |
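Steps 1 and 3 in the table could be sketched in Python roughly as follows. This is a minimal illustration, not the skill's actual implementation: the helper names `missing_tools` and `build_audio_extract_command` are hypothetical, and the audio options (mono, 16 kHz, 16-bit PCM) are an assumption, since the article does not state which ffmpeg options the skill uses for extraction.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str] = ("ffmpeg", "whisper"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def build_audio_extract_command(video_path: str, audio_path: str) -> List[str]:
    """Build an ffmpeg command that extracts the audio track as a WAV file.

    The audio options below are assumed for illustration; they are not
    confirmed settings of the video-to-minutes skill.
    """
    return [
        "ffmpeg",
        "-i", video_path,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM WAV
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        audio_path,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing the arguments as a list rather than a single string avoids quoting problems when the video path contains spaces.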
Why I Turned This into a Skill
In my previous article, “Gemini CLI’s Default Capabilities Are So Powerful That I Reconsidered My Approach to Creating Skills”, I introduced the DCAP cycle: an approach where you first ask the AI directly and turn only repeated operations into skills.
This video-to-minutes workflow was exactly the kind of case that warrants a skill.
| Criteria for Skill Creation | Why It Applies |
|---|---|
| Repeated execution | The same workflow runs for every meeting |
| Fine-tuning has stabilized | ffmpeg options and Whisper model settings are settled |
| Want to share with others | Team members can create minutes using the same process |
| Want to ensure quality | Standardize the meeting minutes format |
Key Points of the Workflow
Automated Whisper Transcription
The agent automatically runs Whisper transcription using the turbo model.
whisper meeting_audio.wav --language ja --model turbo

The generated meeting_audio.txt is auto-detected, and the user is only prompted for the path if the file is not found.
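The auto-detection with a fallback prompt might look something like the sketch below. It assumes Whisper wrote the transcript to the output directory under the audio file's base name with a `.txt` extension, as in the meeting_audio.txt example above. The function names `find_transcript` and `resolve_transcript` and the prompt wording are hypothetical.

```python
from pathlib import Path
from typing import Callable, Optional


def find_transcript(audio_path: str, output_dir: str = ".") -> Optional[Path]:
    """Return the transcript Whisper is expected to have written, or None."""
    candidate = Path(output_dir) / (Path(audio_path).stem + ".txt")
    return candidate if candidate.is_file() else None


def resolve_transcript(
    audio_path: str,
    output_dir: str = ".",
    ask: Callable[[str], str] = input,
) -> Path:
    """Use the auto-detected transcript, asking the user only if it is missing."""
    found = find_transcript(audio_path, output_dir)
    if found is not None:
        return found
    # Fallback: the transcript was not where we expected, so ask for its path.
    answer = ask("Transcript not found. Enter the path to the transcript file: ")
    return Path(answer.strip())
```

Injecting `ask` keeps the prompt out of the detection logic, so the happy path runs without any user interaction.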
Collecting Proper Nouns to Improve Accuracy
A step was added to collect proper nouns (names of people, companies, products, etc.) from the user before generating the meeting minutes. This helps reduce transcription errors and improves the overall accuracy of the minutes.
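One way to pass the collected proper nouns to the minutes-generation step is to turn them into a short glossary section of the prompt. The sketch below is only an illustration under that assumption: the article does not show how the skill hands the nouns to Gemini, and `parse_proper_nouns`, `build_glossary_section`, and the instruction wording are all hypothetical.

```python
from typing import Iterable, List


def parse_proper_nouns(raw: str) -> List[str]:
    """Split comma- or newline-separated input into unique terms, keeping order."""
    terms: List[str] = []
    for piece in raw.replace("\n", ",").split(","):
        term = piece.strip()
        if term and term not in terms:
            terms.append(term)
    return terms


def build_glossary_section(terms: Iterable[str]) -> str:
    """Format the terms as a prompt section; return an empty string if none."""
    term_list = list(terms)
    if not term_list:
        return ""
    lines = ["Proper nouns (use these exact spellings when correcting the transcript):"]
    lines += [f"- {term}" for term in term_list]
    return "\n".join(lines)
```

For example, `build_glossary_section(parse_proper_nouns("Gemini CLI, Whisper"))` yields a heading line followed by one bullet per term, which can be placed ahead of the transcript in the prompt.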
Capturing Images to Preserve Visual Information
In addition to audio transcription, the skill captures images from the video at regular intervals, allowing slide and whiteboard content to be included in the minutes.
ffmpeg -i "<VIDEO_FILE_PATH>" -vf fps=1/<INTERVAL_IN_SECONDS> captures/capture_%03d.png

The capture interval is user-configurable, allowing flexible adjustment based on the meeting content.
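Filling in the two placeholders of the capture command could be sketched as below. The function name `build_capture_command` is hypothetical, and the 60-second default mirrors the default interval from step 2 of the workflow. The interval validation is an addition for illustration, not something the article describes.

```python
from typing import List


def build_capture_command(
    video_path: str,
    interval_seconds: int = 60,
    output_dir: str = "captures",
) -> List[str]:
    """Build the ffmpeg command that saves one frame every `interval_seconds`."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be a positive integer")
    return [
        "ffmpeg",
        "-i", video_path,
        "-vf", f"fps=1/{interval_seconds}",  # one output frame per interval
        f"{output_dir}/capture_%03d.png",    # capture_001.png, capture_002.png, ...
    ]
```

ffmpeg does not create the output directory itself, so the caller should create it first (for example with `os.makedirs("captures", exist_ok=True)`) before running the command.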
Summary
The workflow for generating meeting minutes from video requires executing multiple tools in sequence, and doing it manually every time is tedious. Repeated, multi-step workflows like this one are where turning the process into a skill has the greatest impact.
The skill is published in the oh-my-skills repository.
That’s all from the Gemba, reporting on creating a skill that auto-generates meeting minutes from video.