Audio-to-Text Meeting Transcription with an LLM
Zack’s assistant had reached a new level. It could not only summarize his inbox but also draft reply options. His mornings were lighter, his writing faster, and his inbox stress much lower.
But there was still one thing weighing him down: meetings.
Every week, Zack sat through hours of calls. Client updates, internal check-ins, planning discussions. And after each one, someone (often Zack) had to write meeting notes.
The problem? Notes were inconsistent. Sometimes Zack forgot details. Sometimes tasks got lost. And sometimes, nobody wrote anything at all.
“If only I could get a transcript of the meeting automatically,” Zack thought. “Then I could summarize it just like my emails.”
That’s when he discovered speech-to-text (ASR).
Step 1: Basics of speech-to-text (ASR)
ASR stands for Automatic Speech Recognition. It’s the technology that turns spoken words into written text.
- When you ask Siri or Google Assistant a question, ASR is working behind the scenes.
- When you use YouTube auto-captions, that’s ASR too.
- For Zack, ASR meant: take the audio recording of a meeting and turn it into plain text.
Once he had text, he could send it to GPT-4o to create summaries, action items, and decisions, the same way he did with emails.
So the plan was simple:
- Record or use a meeting audio file.
- Use an ASR model to transcribe the speech.
- Clean and store the text.
- Use the LLM later to summarize and extract tasks.
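The four-step plan above can be sketched as a tiny pipeline. This is only a scaffold: transcribe_audio is a stub that stands in for the real Whisper call added later in this lesson, and the helper names are illustrative, not part of any library.

```python
# Sketch of the meeting-notes pipeline. transcribe_audio is a stub here;
# it gets replaced by a real Whisper call later in this lesson.
def transcribe_audio(path):
    # Placeholder: pretend the ASR model returned this text
    return "Hi team, we may need to move the deadline."

def clean_text(text):
    # Normalize whitespace; filler-word removal comes later
    return " ".join(text.split())

def store_transcript(text, out_path):
    # Save the cleaned transcript so the LLM can summarize it later
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)

raw = transcribe_audio("meeting_sample.mp3")
cleaned = clean_text(raw)
store_transcript(cleaned, "meeting_sample.txt")
print(cleaned)
```

Each stage can now be swapped out independently: a real ASR model for the stub, a smarter cleaner, a database instead of a text file.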
But first, Zack needed to pick the right tool.
Step 2: Choose Whisper or Distil-Whisper
Zack found two popular open-source models for transcription:
- Whisper (by OpenAI):
  - Strong accuracy across many languages.
  - Handles different accents.
  - Available in sizes from tiny to large.
  - Downside: the larger models can be heavy on slower laptops.
- Distil-Whisper (by Hugging Face):
  - A distilled (smaller, faster) version of Whisper.
  - Runs quicker on modest machines.
  - Slight trade-off in accuracy, but still very good.
Since Zack was just starting, he decided to use Whisper small. It balanced speed and accuracy. Later, he could experiment with Distil-Whisper if he needed something lighter.
Step 3: Setting up Whisper
Here’s what Zack did step by step:
A. Install PyTorch
Whisper runs on PyTorch, a deep learning library.
- Open your terminal (make sure your Python virtual environment is active).
- Install PyTorch with:
```bash
pip install torch torchvision torchaudio
```
If you have a GPU, you can follow PyTorch’s website to install the CUDA-enabled version for speed. But CPU-only works fine for small files.
B. Install Whisper
Next, Zack installed Whisper itself:
```bash
pip install git+https://github.com/openai/whisper.git
```
This gave him the Whisper library and command-line tool.
C. Install FFmpeg
Whisper needs FFmpeg to handle audio formats.
- On macOS (with Homebrew):
```bash
brew install ffmpeg
```
- On Ubuntu/Linux:
```bash
sudo apt-get install ffmpeg
```
- On Windows:
- Download from ffmpeg.org.
- Add the bin/ folder to your PATH.
Now Zack was ready to transcribe audio.
Step 4: Transcribing a sample meeting file
Zack had a sample file called meeting_sample.mp3. It was a short 1-minute clip where his team discussed moving a project deadline.
A. Using Whisper CLI (easiest way)
Whisper comes with a command-line tool. Zack ran:
```bash
whisper meeting_sample.mp3 --model small --language English
```
Whisper downloaded the “small” model and started processing. After a minute, Zack saw:
```
[00:00.000 --> 00:10.000] Hi team, just a quick update about the project.
[00:10.000 --> 00:20.000] We may need to move the deadline from March 10 to March 15.
[00:20.000 --> 00:30.000] Ali, can you confirm if that works with the client?
```
Whisper also saved text files:
- meeting_sample.txt: plain text transcript.
- meeting_sample.srt: subtitles with timestamps.
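The .srt format pairs numbered cues with timestamp ranges. Based on the clip above, meeting_sample.srt would look roughly like this:

```
1
00:00:00,000 --> 00:00:10,000
Hi team, just a quick update about the project.

2
00:00:10,000 --> 00:00:20,000
We may need to move the deadline from March 10 to March 15.

3
00:00:20,000 --> 00:00:30,000
Ali, can you confirm if that works with the client?
```

This file can be loaded directly into most video players as captions.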
B. Using Whisper in Python
Zack wanted to integrate this into his assistant. So he wrote transcribe.py:
```python
import whisper

# Load model
model = whisper.load_model("small")

# Transcribe audio
result = model.transcribe("meeting_sample.mp3")

# Print transcript
print("=== Transcript ===")
print(result["text"])
```
When he ran:
```bash
python transcribe.py
```
The output was:
```
=== Transcript ===
Hi team, just a quick update about the project. We may need to move the deadline from March 10 to March 15. Ali, can you confirm if that works with the client?
```
Clean, readable text. Exactly what Zack needed.
Step 5: Cleaning transcripts
Raw transcripts are good, but Zack noticed some issues:
- Whisper sometimes added filler words (“uh,” “you know”).
- Long pauses created odd breaks.
- Names weren’t always labeled.
So Zack added a post-processing step.
```python
def clean_transcript(text):
    # Remove filler words
    fillers = ["uh", "um", "you know", "like"]
    for f in fillers:
        text = text.replace(f, "")
    # Fix spacing
    return " ".join(text.split())

raw = result["text"]
cleaned = clean_transcript(raw)
print("=== Cleaned Transcript ===")
print(cleaned)
```
Now the transcript was smoother and easier to read.
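One caveat worth knowing: a plain string replace also strips fillers that happen to appear inside longer words ("um" inside "number", "like" inside "likely"). A word-boundary regex avoids that; the sketch below is a self-contained variant of the cleaner, not part of Whisper itself:

```python
import re

def clean_transcript_safe(text):
    # \b word boundaries ensure only whole filler words are removed,
    # so "number" and "likely" survive intact
    fillers = ["uh", "um", "you know", "like"]
    pattern = r"\b(?:" + "|".join(re.escape(f) for f in fillers) + r")\b"
    text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the leftover double spaces
    return " ".join(text.split())

print(clean_transcript_safe("uh the number is likely fifteen"))
```

Here "uh" is dropped but "number" and "likely" are left alone, which the simple replace version would have mangled.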
Step 6: Practice exercise
Here’s your challenge:
- Install Whisper and its dependencies.
- Find or record a 1-minute audio file (.mp3 or .wav). It could be you talking, a podcast clip, or a voice memo.
- Run the Whisper command-line tool or the Python script.
- Print the transcript.
- Optional: Add the clean_transcript function to remove filler words.
Expected output:
```
=== Transcript ===
Welcome to the weekly sync. Today we decided to push the testing deadline by three days to allow more QA. Sara will update the project plan.
```
Congratulations! You just turned audio into text with AI.
Zack’s feedback
When Zack saw his first transcript, he felt a wave of relief.
“This changes everything,” he said. “Now I don’t need to write notes while trying to listen at the same time. I can just record the meeting, run it through Whisper, and get a full transcript.”
For Zack, this meant:
- No more missed details.
- No stress about remembering deadlines.
- A foundation for his assistant to summarize meetings the same way it summarized emails.
Conclusion
In this lesson, Zack’s assistant gained ears. It could now listen to meetings and turn them into text:
- He learned the basics of ASR (Automatic Speech Recognition).
- He installed Whisper (or Distil-Whisper for lighter machines).
- He transcribed a sample audio file into text.
- He cleaned the transcript for better readability.
- He completed a mini-exercise transcribing his own 1-minute clip.
From here, Zack was ready for the next step: summarizing meeting transcripts and extracting action items automatically. This would finally solve his meeting problem: not just hearing what was said, but turning it into clear tasks and decisions.
Frequently Asked Questions
What is ASR?
ASR (Automatic Speech Recognition) is the process of converting spoken words into text, used in tools like Siri, YouTube captions, and Whisper.
Which model should I use for transcription?
You can use OpenAI’s Whisper for accurate transcription, or Distil-Whisper if you want a lighter, faster version on smaller machines.
Do I need a GPU to run Whisper?
No. Whisper can run on CPU for short audio clips. A GPU just makes transcription faster, especially for longer recordings.
Which audio formats does Whisper support?
Whisper works with common formats like MP3, WAV, and M4A. With FFmpeg installed, it can handle most standard audio files.
How do I practice transcription myself?
Record or find a 1-minute audio clip, run it through Whisper, and print out the transcript. Optionally, clean it by removing filler words.