In this post, I'll walk you through fine-tuning a translation model for code-switched Bengali-English speech, focusing on clinical conversations. Using the Whisper Tiny model and FastAPI, I'll show how to adapt a pre-trained model to a specific task: translating mixed-language speech into a single target language. We'll cover the entire workflow, from dataset creation and fine-tuning to deploying the model with FastAPI for real-time predictions.
Step 1: Dataset Creation

Before we can begin fine-tuning the model, we need a reliable dataset. For this task, I created a synthetic dataset called MediBeng. This dataset includes code-switched Bengali-English conversations that simulate clinical discussions. In a real-world scenario, doctors and patients often switch between languages during conversations, making transcription and translation more complex.
The MediBeng dataset is designed specifically for this task, including both speech recognition (ASR) and machine translation (MT) tasks. This dataset simulates conversations in which Bengali and English are mixed in a natural way, representing real clinical dialogue. You can find the dataset hosted on Hugging Face, and it is open-source, meaning you can freely use and modify it for your own fine-tuning projects.
By using synthetic data generation techniques, we ensure that the model receives a broad variety of speech patterns, making it robust for real-world applications, especially in multilingual environments like healthcare.
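To make the dataset's shape concrete, here is a minimal sketch of what one MediBeng-style record could look like, together with a rough check for code-switched text. The field names (`audio`, `transcription`, `translation`) and the example utterance are assumptions for illustration; consult the dataset card on Hugging Face for the actual schema.

```python
# Hypothetical MediBeng-style record (field names are assumptions;
# check the dataset card for the real schema).
record = {
    "audio": {
        "array": [0.0, 0.01, -0.02],  # raw waveform samples (normally thousands)
        "sampling_rate": 16000,        # Whisper models expect 16 kHz audio
    },
    # Code-switched source utterance mixing Bengali and English
    "transcription": "রোগীর blood pressure একটু high আছে।",
    # Target: the same utterance fully in English
    "translation": "The patient's blood pressure is a little high.",
}

def is_code_switched(text: str) -> bool:
    """Rough check: does the text mix Bengali script with Latin script?"""
    has_bengali = any("\u0980" <= ch <= "\u09FF" for ch in text)
    has_latin = any("a" <= ch.lower() <= "z" for ch in text)
    return has_bengali and has_latin

print(is_code_switched(record["transcription"]))  # True: Bengali + English mixed
print(is_code_switched(record["translation"]))    # False: English only
```

A check like this is handy for sanity-testing synthetic data: every source utterance should mix scripts, while every target should be pure English.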
The full process is available in this repository: https://github.com/pr0mila/ParquetToHuggingFace
Step 2: Clone the Repository and Set Up Your Environment

To start fine-tuning the model, we first need to clone the repository where the training code is located. The repository contains everything needed to fine-tune the Whisper Tiny model for the translation task.
The repository is: https://github.com/pr0mila/MediBeng-Whisper-Tiny
After cloning, it’s essential to set up your development environment.
You'll need to install the required dependencies, including libraries for PyTorch, Transformers, and other tools such as Gradio and FastAPI. These tools will not only help with the fine-tuning process but also assist with testing and deployment later on.
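As a rough sketch, the environment for this workflow typically needs packages along these lines (a hypothetical requirements file; the repository's own documentation is authoritative for exact packages and pinned versions):

```text
torch
transformers
datasets
soundfile
fastapi
uvicorn
gradio
```

Installing from the repository's actual requirements file (for example, `pip install -r requirements.txt`) is preferable to assembling this list by hand.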
The configuration files and scripts in the repository are designed to be as simple and modular as possible, so you don’t have to worry about intricate setup steps. However, make sure to check for any system-specific adjustments and installation instructions in the repository’s documentation.
Step 3: Data Loading and Preprocessing

With the repository set up, the next step is loading and preprocessing the dataset. The data_loader.py script provided in the repository is specifically designed to handle the MediBeng dataset. It will take care of loading the audio files, along with the corresponding transcriptions, and then split them into training and testing sets.
Data preprocessing is essential in fine-tuning a model, as it ensures that the input data is in the right format. For Whisper, you'll typically need to:

- Resample every audio clip to 16 kHz, the sampling rate Whisper expects.
- Convert each waveform into log-Mel spectrogram features using the Whisper feature extractor.
- Tokenize the target English translations into label IDs with the Whisper tokenizer.
Once the dataset is processed, it's ready for use in training.
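The train/test split mentioned above can be sketched in a few lines. This is a minimal, pure-Python illustration of a deterministic split by example index; the actual data_loader.py may instead use the `datasets` library's built-in `train_test_split`, so treat the function name and defaults here as assumptions.

```python
import random

def train_test_split_indices(n_examples: int, test_fraction: float = 0.2, seed: int = 42):
    """Deterministically split example indices into train and test sets.

    Shuffling with a fixed seed makes the split reproducible across runs,
    which matters when you want to compare fine-tuning experiments fairly.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_examples * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(100, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 80 20
```

Because the seed is fixed, re-running the script produces the same split, so evaluation numbers stay comparable between training runs.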
Step 4: Fine-Tuning the Model

Now the fun part begins: fine-tuning the Whisper Tiny model. Fine-tuning involves training the pre-trained Whisper model on your specific task, in this case, code-switched Bengali-English translation.
The fine-tuning process in the repository is straightforward. The key steps are:

- Load the pre-trained Whisper Tiny model and its processor from Hugging Face.
- Configure the training arguments (learning rate, batch size, number of training steps, evaluation schedule).
- Define a data collator that pads the audio features and label sequences within each batch.
- Train the model and monitor evaluation metrics on the held-out test split.
After fine-tuning the model, you can upload it to Hugging Face for easy sharing and access. This allows others to try your model and experiment with it.
Here's what you need to do:

- Authenticate with your Hugging Face account (for example, via huggingface-cli login).
- Push the fine-tuned model and its processor to a model repository with push_to_hub.
- Add a model card describing the task, dataset, and intended use.
Once the model is fine-tuned and uploaded, it's time to make it available for real-time predictions. This is where FastAPI comes into play. FastAPI allows you to build an API endpoint that other systems can interact with, sending audio files and receiving translations.
Here's how you can deploy the fine-tuned model with FastAPI:

- Load the fine-tuned model and processor once at application startup.
- Define a POST endpoint that accepts an uploaded audio file.
- Run inference on the audio and return the English translation as JSON.
- Serve the app with an ASGI server such as Uvicorn.
In addition to the FastAPI service, you can also create a Gradio interface for easier interaction. Gradio provides a user-friendly web interface to upload audio files and receive translations. This is a great option for non-technical users who want to try the model without dealing with API calls.
To set up the Gradio interface, simply follow the steps in the repository. Gradio will host the model in a local web interface, where users can interact with the model by uploading their audio files and viewing the translations.
Conclusion

Fine-tuning a model like Whisper Tiny for translation tasks in clinical settings is an excellent way to enhance speech recognition and translation capabilities in multilingual environments. By using FastAPI, you can easily deploy the model for real-time applications, making it accessible for various use cases, such as clinical transcription and multilingual patient records.
Once you’ve followed the steps outlined in this post, you’ll have a fully fine-tuned model that can accurately translate code-switched Bengali-English speech and be deployed for real-time predictions. Additionally, you can experiment with the Gradio interface to make the model more user-friendly.