Auto-Generating Meeting Minutes from Video with a Gemini CLI Skill

I created a Gemini CLI skill called video-to-minutes that auto-generates meeting minutes from video files.

feat: Add video-to-minutes skill #4

Overview of the video-to-minutes Skill

This skill takes a video file as input, runs the following steps in order, and produces meeting minutes in Markdown format.

Step	Process	Tool Used
1	Check and install prerequisite tools	`ffmpeg`, `whisper`
2	Get video path and capture interval from user (default 60s)	-
3	Extract audio from video	`ffmpeg`
4	Transcribe audio (automated)	`whisper` (turbo model)
5	Capture images from video at regular intervals	`ffmpeg`
6	Collect proper nouns from user	-
7	Analyze transcript and generate meeting minutes	Gemini
8	Save as Markdown file	-
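
Steps 1 and 3 are not shown as commands elsewhere in this article, so here is a minimal sketch of how they might look. The function names `missing_tools` and `extract_audio` are hypothetical, and the mono 16 kHz WAV settings and the default output name are assumptions for illustration, not necessarily what the skill itself uses.

```shell
# Hypothetical helper: print each required tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the tool cannot be found
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Hypothetical helper: extract the audio track of a video into a WAV file.
# The output name and audio settings below are assumptions.
extract_audio() {
    video=$1
    audio=${2:-meeting_audio.wav}
    # -vn drops the video stream; -ac 1 and -ar 16000 give mono 16 kHz audio
    ffmpeg -i "$video" -vn -ac 1 -ar 16000 "$audio"
}
```

Calling `missing_tools ffmpeg whisper` prints nothing when both tools are installed, so an empty result means it is safe to continue to the extraction step.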

Why I Turned This into a Skill

In my previous article “Gemini CLI’s Default Capabilities Are So Powerful That I Reconsidered My Approach to Creating Skills”, I introduced the DCAP cycle: an approach where you first ask the AI directly and turn only repeated operations into skills.

This video-to-minutes workflow was exactly the kind of case that warrants a skill.

Criteria for Skill Creation	Why It Applies
Repeated execution	The same workflow runs for every meeting
Fine-tuning has stabilized	ffmpeg options and Whisper model settings are settled
Want to share with others	Team members can create minutes using the same process
Want to ensure quality	Standardize the meeting minutes format

Key Points of the Workflow

Automated Whisper Transcription

The agent automatically runs Whisper transcription using the turbo model.

whisper meeting_audio.wav --language ja --model turbo

The generated meeting_audio.txt is detected automatically, and the user is prompted for the path only if the file is not found.
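
The detection logic could be sketched roughly as follows. This is an illustration only: `find_transcript` and `resolve_transcript` are hypothetical names, and the sketch assumes Whisper wrote its `.txt` output into the current directory (its default output location) under the audio file's base name.

```shell
# Hypothetical helper: print the transcript path if it exists, else fail.
# $1 = audio file path, $2 = directory Whisper wrote to (default: current dir)
find_transcript() {
    audio=$1
    out_dir=${2:-.}
    stem=$(basename "$audio")
    stem=${stem%.*}                 # meeting_audio.wav -> meeting_audio
    candidate="$out_dir/$stem.txt"
    if [ -f "$candidate" ]; then
        printf '%s\n' "$candidate"
    else
        return 1
    fi
}

# Hypothetical wrapper: fall back to asking the user only when detection fails.
resolve_transcript() {
    if ! transcript=$(find_transcript "$1" "${2:-.}"); then
        printf 'Transcript not found. Enter its path: ' >&2
        read -r transcript
    fi
    printf '%s\n' "$transcript"
}
```

With this shape, `resolve_transcript meeting_audio.wav` returns immediately in the common case and only interrupts the user when the expected file is missing.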

update: Improve video-to-minutes skill workflow #5

Collecting Proper Nouns to Improve Accuracy

The skill now asks the user for proper nouns (names of people, companies, products, etc.) before generating the meeting minutes. Supplying the correct spellings helps reduce transcription errors and improves the overall accuracy of the minutes.
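
How the collected names reach Gemini is not covered here, so the following is only one possible sketch: a hypothetical `format_glossary` helper that turns a comma-separated answer from the user into a short list that could be placed ahead of the transcript. The heading text and the comma-separated input format are assumptions.

```shell
# Hypothetical helper: turn "A, B ,C" into a bulleted glossary block.
format_glossary() {
    printf 'Proper nouns (use these spellings):\n'
    printf '%s\n' "$1" \
        | tr ',' '\n' \
        | sed -e 's/^[[:space:]]*//' \
              -e 's/[[:space:]]*$//' \
              -e '/^$/d' \
              -e 's/^/- /'
}
```

Each name ends up on its own line with surrounding whitespace trimmed, and empty entries (for example from a trailing comma) are dropped.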

Capturing Images to Preserve Visual Information

In addition to audio transcription, the skill captures images from the video at regular intervals, allowing slide and whiteboard content to be included in the minutes.

ffmpeg -i "<VIDEO_FILE_PATH>" -vf fps=1/<INTERVAL_IN_SECONDS> captures/capture_%03d.png

The capture interval is user-configurable, allowing flexible adjustment based on the meeting content.
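
As a rough sketch of how the configurable interval could feed the command above, the snippet below builds the `fps` filter from a user-supplied value and falls back to the 60-second default. The function names `capture_filter` and `capture_frames` and the validation rules are hypothetical; creating the `captures` directory first is included because ffmpeg does not create missing output directories.

```shell
# Hypothetical helper: print the -vf filter for one frame every N seconds.
# Falls back to the 60-second default when no interval is given.
capture_filter() {
    interval=${1:-60}
    case $interval in
        *[!0-9]*)
            echo "interval must be a positive integer" >&2
            return 1 ;;
    esac
    if [ "$interval" -le 0 ]; then
        echo "interval must be a positive integer" >&2
        return 1
    fi
    printf 'fps=1/%s\n' "$interval"
}

# Hypothetical wrapper: $1 = video path, $2 = optional interval in seconds.
capture_frames() {
    video=$1
    filter=$(capture_filter "${2:-}") || return 1
    mkdir -p captures               # the output directory must already exist
    ffmpeg -i "$video" -vf "$filter" captures/capture_%03d.png
}
```

For example, `capture_frames meeting.mp4 30` would request one image every 30 seconds, while omitting the second argument keeps the default of one image per minute.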

Summary

Generating meeting minutes from video requires running multiple tools in sequence, and doing that manually every time is tedious. Repeated multi-step workflows like this are where creating a skill has the greatest impact.

The skill is published in the oh-my-skills repository.

That’s all from the Gemba, reporting on creating a skill that auto-generates meeting minutes from video.