In this article, we’ll build a speech-to-text application using OpenAI’s Whisper, along with React, Node.js, and FFmpeg. The app will take an audio file from the user, transcribe it with OpenAI’s Whisper API, and output the resulting text. Whisper gives the most accurate speech-to-text transcription I’ve used, even for a non-native English speaker.
Introducing Whisper
OpenAI explains that Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the Web.
Text is easier to search and store than audio. However, transcribing audio to text can be quite laborious. ASRs like Whisper can detect speech and transcribe the audio to text with a high level of accuracy and very quickly, making it a particularly useful tool.
Prerequisites
This article is aimed at developers who are familiar with JavaScript and have a basic understanding of React and Express.
If you want to build along, you’ll need an API key. You can obtain one by signing up for an account on the OpenAI platform. Once you have an API key, make sure to keep it secure and not share it publicly.
Tech Stack
We’ll be building the frontend of this app with Create React App (CRA). All we’ll be doing in the frontend is uploading files, picking time boundaries, making network requests and managing a few states. I chose CRA for simplicity. Feel free to use any frontend library you prefer or even plain old JS. The code should be mostly transferable.
For the backend, we’ll be using Node.js and Express, just so we can stick with a full JS stack for this app. You can use Fastify or any other alternative in place of Express and you should still be able to follow along.
Note: in order to keep this article focused on the subject, long blocks of code will be linked to, so we can focus on the real tasks at hand.
Setting Up the Project
We start by creating a new folder that will contain both the frontend and backend for the project for organizational purposes. Feel free to choose any other structure you prefer:
mkdir speech-to-text-app
cd speech-to-text-app
Next, we initialize a new React application using create-react-app:
npx create-react-app frontend
Navigate to the new frontend folder and install axios for making network requests, react-dropzone for file uploads, react-select for the time dropdowns, and react-toastify for notifications:
cd frontend
npm install axios react-dropzone react-select react-toastify
Now, let’s switch back into the main folder and create the backend folder:
cd ..
mkdir backend
cd backend
Next, we initialize a new Node application in our backend directory, while also installing the required libraries:
npm init -y
npm install express dotenv cors multer form-data axios fluent-ffmpeg ffmetadata ffmpeg-static
npm install --save-dev nodemon
In the code above, we’ve installed the following libraries:
- dotenv: necessary to keep our OpenAI API key away from the source code.
- cors: to enable cross-origin requests.
- multer: middleware for uploading our audio files. It adds a .file or .files object to the request object, which we’ll then access in our route handlers.
- form-data: to programmatically create and submit forms with file uploads and fields to a server.
- axios: to make network requests to the Whisper endpoint.
Also, since we’ll be using FFmpeg for audio trimming, we have these libraries:
- fluent-ffmpeg: provides a fluent API to work with the FFmpeg tool, which we’ll use for audio trimming.
- ffmetadata: used for reading and writing metadata in media files. We need it to retrieve the audio duration.
- ffmpeg-static: provides static FFmpeg binaries for different platforms, and simplifies deploying FFmpeg.
Our entry file for the Node.js app will be index.js. Create the file inside the backend folder and open it in a code editor. Let’s wire up a basic Express server:
const express = require('express');
const cors = require('cors');

const app = express();
app.use(cors());
app.use(express.json());

app.get('/', (req, res) => {
  res.send('Welcome to the Speech-to-Text API!');
});

const PORT = process.env.PORT || 3001;
app.listen(PORT, () => {
  console.log(`Server is running on port ${PORT}`);
});
Update package.json in the backend folder to include start and dev scripts:
"scripts": {
"start": "node index.js",
"dev": "nodemon index.js",
}
The code above registers a simple GET route. When we run npm run dev and go to localhost:3001 (or whatever our port is), we should see the welcome text.
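We can also confirm this from a terminal:

curl http://localhost:3001
# Welcome to the Speech-to-Text API!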
Integrating Whisper
Now it’s time to add the secret sauce! In this section, we’ll:
- accept a file upload on a POST route
- convert the file to a readable stream
- very importantly, send the file to Whisper for transcription
- send the response back as JSON
Let’s now create a .env file at the root of the backend folder to store our API key, and remember to add it to .gitignore:
OPENAI_API_KEY=YOUR_API_KEY_HERE
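Installing dotenv alone isn’t enough: we also need to load it when the server starts so that process.env.OPENAI_API_KEY is populated. Near the top of index.js, add:

require('dotenv').config(); // load variables from backend/.env into process.env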
Next, let’s import some of the libraries we need to handle file uploads, network requests and streaming:
const multer = require('multer');
const FormData = require('form-data');
const { Readable } = require('stream');
const axios = require('axios');

const upload = multer();
Next, we’ll create a simple utility function to convert the file buffer into a readable stream that we’ll send to Whisper:
const bufferToStream = (buffer) => {
  return Readable.from(buffer);
};
We’ll create a new route, /api/transcribe, and use axios to make a request to OpenAI. We’ve already imported axios at the top of index.js, so let’s create the new route, like so:
app.post('/api/transcribe', upload.single('file'), async (req, res) => {
  try {
    const audioFile = req.file;
    if (!audioFile) {
      return res.status(400).json({ error: 'No audio file provided' });
    }

    const formData = new FormData();
    const audioStream = bufferToStream(audioFile.buffer);
    formData.append('file', audioStream, { filename: 'audio.mp3', contentType: audioFile.mimetype });
    formData.append('model', 'whisper-1');
    formData.append('response_format', 'json');

    const config = {
      headers: {
        "Content-Type": `multipart/form-data; boundary=${formData._boundary}`,
        "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
      },
    };

    // Call the OpenAI Whisper API to transcribe the audio
    const response = await axios.post('https://api.openai.com/v1/audio/transcriptions', formData, config);

    const transcription = response.data.text;
    res.json({ transcription });
  } catch (error) {
    res.status(500).json({ error: 'Error transcribing audio' });
  }
});
In the code above, we use the utility function bufferToStream to convert the audio file buffer into a readable stream, then send it over a network request to Whisper and await the response, which is then sent back as a JSON response.
You can check the docs for more on the request and response for Whisper.
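If you’d like to sanity-check the endpoint before building the frontend, you can post a local audio file to it from a terminal (sample.mp3 below is just a stand-in for any audio file you have on hand):

curl -F "file=@sample.mp3" http://localhost:3001/api/transcribe

If everything is wired up correctly, the response should be a small JSON payload of the shape { "transcription": "..." }.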
Installing FFmpeg
We’ll add additional functionality below to allow the user to transcribe a part of the audio. To do this, our API endpoint will accept startTime and endTime, after which we’ll trim the audio with ffmpeg.
Installing FFmpeg for Windows
To install FFmpeg for Windows, follow the simple steps below:
- Visit the download page on the official FFmpeg website.
- Under the Windows icon there are several links. Choose the link that says “Windows Builds”, by gyan.dev.
- Download the build that corresponds to our system (32 or 64 bit). Make sure to download the “static” version to get all the libraries included.
- Extract the downloaded ZIP file. We can place the extracted folder wherever we prefer.
- To use FFmpeg from the command line without having to navigate to its folder, add the FFmpeg bin folder to the system PATH.
Installing FFmpeg for macOS
If we’re on macOS, we can install FFmpeg with Homebrew:
brew install ffmpeg
Installing FFmpeg for Linux
If we’re on Linux, we can install FFmpeg with apt, dnf or pacman, depending on our Linux distribution. Here’s the command for installing with apt:
sudo apt update
sudo apt install ffmpeg
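Whichever platform we’re on, we can confirm that FFmpeg is installed and on the PATH by printing its version from a terminal:

ffmpeg -version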
Trim Audio in the Code
Why do we need to trim the audio? Say a user has an hour-long audio file and only wants to transcribe from the 15-minute mark to the 45-minute mark. With FFmpeg, we can trim to the exact startTime and endTime before sending the trimmed stream to Whisper for transcription.
First, we’ll import the following libraries:
const ffmpeg = require('fluent-ffmpeg');
const ffmpegPath = require('ffmpeg-static');
const ffmetadata = require('ffmetadata');
const fs = require('fs');
ffmpeg.setFfmpegPath(ffmpegPath);
- fluent-ffmpeg is a Node.js module that provides a fluent API for interacting with FFmpeg.
- ffmetadata will be used to read the metadata of the audio file, specifically the duration.
- ffmpeg.setFfmpegPath(ffmpegPath) is used to explicitly set the path to the FFmpeg binary.
Next, let’s create a utility function to convert time passed as mm:ss into seconds. This can be outside of our app.post route, just like the bufferToStream function:
/**
* Convert time string of the format 'mm:ss' into seconds.
* @param {string} timeString - Time string in the format 'mm:ss'.
* @return {number} - The time in seconds.
*/
const parseTimeStringToSeconds = timeString => {
  const [minutes, seconds] = timeString.split(':').map(tm => parseInt(tm));
  return minutes * 60 + seconds;
}
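For example, a value of 15:30 becomes 930 seconds:

parseTimeStringToSeconds('15:30'); // 15 * 60 + 30 = 930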
Next, we should update our app.post route to do the following (a rough sketch of the updated handler follows the list):
- accept the startTime and endTime
- calculate the duration
- deal with basic error handling
- convert the audio buffer to a stream
- trim the audio with FFmpeg
- send the trimmed audio to OpenAI for transcription
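Here’s how the updated handler might fit together. Treat this as an outline only (the full version is in the repo linked at the end of this section); it assumes trimAudio (which we’ll build next) is defined inside the handler so that startSeconds and timeDuration are in scope:

app.post('/api/transcribe', upload.single('file'), async (req, res) => {
  try {
    const audioFile = req.file;
    if (!audioFile) {
      return res.status(400).json({ error: 'No audio file provided' });
    }

    // startTime and endTime arrive as 'mm:ss' form fields alongside the file
    const startSeconds = parseTimeStringToSeconds(req.body.startTime);
    const endSeconds = parseTimeStringToSeconds(req.body.endTime);
    const timeDuration = endSeconds - startSeconds;

    // trimAudio is defined inside this handler (see the breakdown below),
    // so it can use startSeconds and timeDuration from this scope
    const audioStream = bufferToStream(audioFile.buffer);
    const trimmedAudioBuffer = await trimAudio(audioStream, endSeconds);

    // ...append trimmedAudioBuffer to a FormData instance and call Whisper as before
  } catch (error) {
    res.status(500).json({ error: 'Error transcribing audio' });
  }
});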
The trimAudio function trims an audio stream between a specified start time and end time, and returns a promise that resolves with the trimmed audio data. If an error occurs at any point in this process, the promise is rejected with that error.
Let’s break down the function step by step.
Define the trim audio function. The trimAudio function is asynchronous and accepts the audioStream and endTime as arguments. We define temporary filenames for processing the audio:

const trimAudio = async (audioStream, endTime) => {
  const tempFileName = `temp-${Date.now()}.mp3`;
  const outputFileName = `output-${Date.now()}.mp3`;
Write the stream to a temporary file. We write the incoming audio stream into a temporary file using fs.createWriteStream(). If there’s an error, the Promise gets rejected:

  return new Promise((resolve, reject) => {
    audioStream.pipe(fs.createWriteStream(tempFileName))
Read the metadata and set endTime. After the audio stream finishes writing to the temporary file, we read the metadata of the file using ffmetadata.read(). If the provided endTime is longer than the audio duration, we adjust endTime to be the duration of the audio:

      .on('finish', () => {
        ffmetadata.read(tempFileName, (err, metadata) => {
          if (err) reject(err);
          const duration = parseFloat(metadata.duration);
          if (endTime > duration) endTime = duration;
Trim the audio using FFmpeg. We utilize FFmpeg to trim the audio based on the start time (startSeconds) received and the duration (timeDuration) calculated earlier. The trimmed audio is written to the output file:

          ffmpeg(tempFileName)
            .setStartTime(startSeconds)
            .setDuration(timeDuration)
            .output(outputFileName)
Delete the temporary files and resolve the promise. After trimming the audio, we delete the temporary file and read the trimmed audio into a buffer. We also delete the output file using the Node.js file system after reading it into the buffer. If everything goes well, the Promise gets resolved with the trimmedAudioBuffer. In case of an error, the Promise gets rejected:

            .on('end', () => {
              fs.unlink(tempFileName, (err) => {
                if (err) console.error('Error deleting temp file:', err);
              });

              const trimmedAudioBuffer = fs.readFileSync(outputFileName);
              fs.unlink(outputFileName, (err) => {
                if (err) console.error('Error deleting output file:', err);
              });
              resolve(trimmedAudioBuffer);
            })
            .on('error', reject)
            .run();
The full code for the endpoint is available in this GitHub repo.
The Frontend
The styling will be done with Tailwind, but I won’t cover setting up Tailwind. You can read about how to set up and use Tailwind in its documentation.
Creating the TimePicker component
Since our API accepts startTime and endTime, let’s create a TimePicker component with react-select.

Using react-select simply adds other features to the select menu, like searching the options, but it’s not critical to this article and can be skipped.

Let’s break down the TimePicker React component below:
Imports and component declaration. First, we import the necessary packages and declare our TimePicker component. The TimePicker component accepts the props id, label, value, onChange, and maxDuration:

import React, { useState, useEffect, useCallback } from 'react';
import Select from 'react-select';

const TimePicker = ({ id, label, value, onChange, maxDuration }) => {
Parse the value prop. The value prop is expected to be a time string (format HH:MM:SS). Here we split the time into hours, minutes, and seconds:

  const [hours, minutes, seconds] = value.split(':').map((v) => parseInt(v, 10));
Calculate maximum values. maxDuration is the maximum time in seconds that can be selected, based on the audio duration. It’s converted into hours, minutes, and seconds:

  const validMaxDuration = maxDuration === Infinity ? 0 : maxDuration;
  const maxHours = Math.floor(validMaxDuration / 3600);
  const maxMinutes = Math.floor((validMaxDuration % 3600) / 60);
  const maxSeconds = Math.floor(validMaxDuration % 60);
Options for the time selects. We create arrays for the possible hours, minutes, and seconds options, and state hooks to manage the minute and second options:

  const hoursOptions = Array.from({ length: Math.max(0, maxHours) + 1 }, (_, i) => i);
  const minutesSecondsOptions = Array.from({ length: 60 }, (_, i) => i);

  const [minuteOptions, setMinuteOptions] = useState(minutesSecondsOptions);
  const [secondOptions, setSecondOptions] = useState(minutesSecondsOptions);
Update value function. This function updates the current value by calling the onChange function passed in as a prop:

  const updateValue = (newHours, newMinutes, newSeconds) => {
    onChange(`${String(newHours).padStart(2, '0')}:${String(newMinutes).padStart(2, '0')}:${String(newSeconds).padStart(2, '0')}`);
  };
Update minute and second options function. This function updates the minute and second options depending on the selected hours and minutes:

  const updateMinuteAndSecondOptions = useCallback((newHours, newMinutes) => {
    const minutesSecondsOptions = Array.from({ length: 60 }, (_, i) => i);
    let newMinuteOptions = minutesSecondsOptions;
    let newSecondOptions = minutesSecondsOptions;

    if (newHours === maxHours) {
      newMinuteOptions = Array.from({ length: Math.max(0, maxMinutes) + 1 }, (_, i) => i);
      if (newMinutes === maxMinutes) {
        newSecondOptions = Array.from({ length: Math.max(0, maxSeconds) + 1 }, (_, i) => i);
      }
    }

    setMinuteOptions(newMinuteOptions);
    setSecondOptions(newSecondOptions);
  }, [maxHours, maxMinutes, maxSeconds]);
Effect hook. This calls updateMinuteAndSecondOptions when hours or minutes change:

  useEffect(() => {
    updateMinuteAndSecondOptions(hours, minutes);
  }, [hours, minutes, updateMinuteAndSecondOptions]);
Helper functions. These two helper functions convert time integers to select options and vice versa:

  const toOption = (value) => ({
    value: value,
    label: String(value).padStart(2, '0'),
  });

  const fromOption = (option) => option.value;
Render. The rendered output displays the time picker, which consists of three dropdown menus (hours, minutes, seconds) managed by the react-select library. Changing the value in a select box calls updateValue and updateMinuteAndSecondOptions, which were explained above.
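As a rough idea of what one of those dropdowns could look like in the returned JSX (a sketch with hypothetical prop wiring, not necessarily the repo’s exact markup), the hours select might be rendered like this, with minutes and seconds following the same pattern:

  // The useEffect above refreshes the minute and second options when the hours change.
  <Select
    inputId={`${id}-hours`}
    options={hoursOptions.map(toOption)}
    value={toOption(hours)}
    onChange={(option) => updateValue(fromOption(option), minutes, seconds)}
  />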
You can find the full source code of the TimePicker component on GitHub.
The main component
Now let’s build the main frontend component by replacing App.js.
The App component will implement a transcription page with the following functionalities:
- Define helper functions for time format conversion.
- Update startTime and endTime based on selection from the TimePicker component.
- Define a getAudioDuration function that retrieves the duration of the audio file and updates the audioDuration state.
- Handle file uploads for the audio file to be transcribed.
- Define a transcribeAudio function that sends the audio file by making an HTTP POST request to our API.
- Render the UI for file upload.
- Render TimePicker components for selecting startTime and endTime.
- Display notification messages.
- Display the transcribed text.
Let’s break this component down into several smaller sections:
Imports and helper functions. Import the necessary modules and define helper functions for time conversions:

import React, { useState, useCallback } from 'react';
import { useDropzone } from 'react-dropzone'; // for file upload
import axios from 'axios'; // to make network requests
import TimePicker from './TimePicker'; // our custom TimePicker
import { toast, ToastContainer } from 'react-toastify'; // for toast notifications

// Helper functions (timeToSeconds, secondsToTime, timeToMinutesAndSeconds)
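The article doesn’t show the bodies of those helpers, but hypothetical implementations might look like this (the repo has the originals). timeToMinutesAndSeconds is the important one here, since our backend’s parseTimeStringToSeconds expects mm:ss:

// Hypothetical helper implementations (check the repo for the originals)
const timeToSeconds = (time) => {
  // 'HH:MM:SS' -> total seconds
  const [h, m, s] = time.split(':').map(Number);
  return h * 3600 + m * 60 + s;
};

const secondsToTime = (totalSeconds) => {
  // total seconds -> 'HH:MM:SS'
  const h = Math.floor(totalSeconds / 3600);
  const m = Math.floor((totalSeconds % 3600) / 60);
  const s = Math.floor(totalSeconds % 60);
  return [h, m, s].map((v) => String(v).padStart(2, '0')).join(':');
};

const timeToMinutesAndSeconds = (time) => {
  // 'HH:MM:SS' -> 'MM:SS', the format our backend expects
  const totalSeconds = timeToSeconds(time);
  const m = Math.floor(totalSeconds / 60);
  const s = totalSeconds % 60;
  return `${String(m).padStart(2, '0')}:${String(s).padStart(2, '0')}`;
};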
Component declaration and state hooks. Declare the TranscriptionPage component and initialize the state hooks:

const TranscriptionPage = () => {
  const [uploading, setUploading] = useState(false);
  const [transcription, setTranscription] = useState('');
  const [audioFile, setAudioFile] = useState(null);
  const [startTime, setStartTime] = useState('00:00:00');
  const [endTime, setEndTime] = useState('00:10:00'); // 10 minutes default end time
  const [audioDuration, setAudioDuration] = useState(null);
  // ...
Event handlers. Define the various event handlers for handling the start time change, getting the audio duration, handling file drops, and transcribing the audio:

  const handleStartTimeChange = (newStartTime) => {
    //...
  };

  const getAudioDuration = (file) => {
    //...
  };

  const onDrop = useCallback((acceptedFiles) => {
    //...
  }, []);

  const transcribeAudio = async () => {
    // we'll explain this in detail shortly
    //...
  };
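As an example of how one of these handlers, getAudioDuration, might be implemented (a sketch, not necessarily the repo’s version), we can let the browser read the file’s metadata through an Audio element:

  // Sketch: read the duration (in seconds) of the dropped file via an Audio element
  const getAudioDuration = (file) => {
    const audio = new Audio();
    audio.src = URL.createObjectURL(file);
    audio.addEventListener('loadedmetadata', () => {
      setAudioDuration(audio.duration);
      URL.revokeObjectURL(audio.src); // release the temporary object URL
    });
  };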
Use the Dropzone hook. Use the useDropzone hook from the react-dropzone library to handle file drops:

  const { getRootProps, getInputProps, isDragActive, isDragReject } = useDropzone({
    onDrop,
    accept: 'audio/*',
  });
Render. Finally, render the component. This includes a dropzone for file upload, TimePicker components for setting the start and end times, a button for starting the transcription process, and a display for the resulting transcription.
The transcribeAudio function is an asynchronous function responsible for sending the audio file to the server for transcription. Let’s break it down:
const transcribeAudio = async () => {
  setUploading(true);

  try {
    const formData = new FormData();
    audioFile && formData.append('file', audioFile);
    formData.append('startTime', timeToMinutesAndSeconds(startTime));
    formData.append('endTime', timeToMinutesAndSeconds(endTime));

    const response = await axios.post(`http://localhost:3001/api/transcribe`, formData, {
      headers: { 'Content-Type': 'multipart/form-data' },
    });

    setTranscription(response.data.transcription);
    toast.success('Transcription successful.');
  } catch (error) {
    toast.error('An error occurred during transcription.');
  } finally {
    setUploading(false);
  }
};
Here’s a more detailed look:
- setUploading(true); sets the uploading state to true, which we use to indicate to the user that the transcription process has started.
- const formData = new FormData(); creates a FormData object, a web API used to send form data to the server. It allows us to send key–value pairs where the value can be a Blob, File or a string.
- The audioFile is appended to the formData object, provided it’s not null (audioFile && formData.append('file', audioFile);). The start and end times are also appended to the formData object, but they’re converted to MM:SS format first.
- The axios.post method sends the formData to the server endpoint (http://localhost:3001/api/transcribe). Change http://localhost:3001 to your server address. The call uses the await keyword, meaning the function pauses until the Promise is resolved or rejected.
- If the request is successful, the response object contains the transcription result (response.data.transcription). This is set to the transcription state using the setTranscription function, and a success toast notification is shown.
- If an error occurs during the process, an error toast notification is shown.
- In the finally block, regardless of the outcome (success or error), the uploading state is set back to false to allow the user to try again.
In essence, the transcribeAudio function is responsible for coordinating the entire transcription process, including handling the form data, making the server request, and handling the server response.
You can find the full source code of the App component on GitHub.
Conclusion
We’ve reached the end and now have a full web application that transcribes speech to text with the power of Whisper.
We could definitely add a lot more functionality, but I’ll let you build the rest on your own. Hopefully we’ve gotten you off to a good start.
The full source code is available in the GitHub repo linked throughout this article.
Frequently Asked Questions (FAQs) about Speech-to-Text with Whisper, React, and Node
What is Whisper and how does it work with React and Node?
Whisper is an automatic speech recognition (ASR) system developed by OpenAI. It’s designed to convert spoken language into written text. When used with React and Node, Whisper can provide speech transcription for applications. React, a JavaScript library for building user interfaces, can display the transcriptions on the front-end, while Node.js, a back-end JavaScript runtime, can handle the server-side operations such as sending audio data to Whisper and receiving transcriptions.
How can I install and set up Whisper in my React and Node project?
In this article’s setup, you don’t install Whisper itself. Instead, the Node.js backend sends audio to OpenAI’s hosted Whisper API using your API key, and the React frontend uploads files and displays the returned transcriptions. Alternatively, the open-source Whisper model can be installed and run locally as a Python package, with your Node.js server calling it in place of the hosted API.
Can I use Whisper for languages other than English?
Yes. Whisper was trained on multilingual data and supports transcription in many languages besides English, and it can also translate speech from many of those languages into English. Accuracy varies by language, and tends to be highest for languages that are well represented in its training data.
How accurate is Whisper in transcribing speech to text?
Whisper is designed to be highly accurate in transcribing speech to text. However, like any ASR system, its accuracy can be influenced by factors such as the clarity of the speech, background noise, and the speaker’s accent.
How can I improve the accuracy of Whisper’s transcriptions?
You can improve the accuracy of Whisper’s transcriptions by ensuring clear and distinct speech, minimizing background noise, and using a high-quality microphone. The transcription endpoint also accepts optional parameters, such as a prompt and a language hint, which can help guide the output for your specific use case.
Is Whisper suitable for real-time transcription in a production environment?
The Whisper API transcribes uploaded audio files rather than offering a streaming interface, so it isn’t real-time in the strict sense. It’s fast enough for near-real-time workflows, though, and it can be used in production provided your server-side code handles uploads, errors and rate limits properly and you have a stable internet connection.
Can I use Whisper for offline transcription?
The hosted Whisper API requires an internet connection, since the audio is sent to OpenAI’s servers for transcription. However, the open-source Whisper model can be downloaded and run entirely offline on your own hardware if you need local transcription.
How can I handle errors or issues when using Whisper?
When using Whisper, you can handle errors or issues by implementing error handling in your code. This can involve catching and logging errors, retrying operations, and providing user-friendly error messages.
Is there a cost associated with using Whisper?
The open-source Whisper model is free to use and run yourself. The hosted Whisper API used in this article is a paid service billed per minute of audio transcribed, so check the official OpenAI pricing page for current rates.
Can I contribute to the development of Whisper?
Yes, as an open-source project, contributions to the development of Whisper are welcome. You can contribute by submitting pull requests on GitHub, reporting issues, or suggesting improvements.
Abiodun Sulaiman is a seasoned full-stack developer with a decade of hands-on experience in the JavaScript ecosystem. His expertise spans across web and mobile applications, making him adept at navigating complex project requirements and delivering robust solutions.