Convert an mp4 video file into a text summary using python

In this post, we will use FFMPEG and the Python speech_recognition module to convert an mp4 video file (e.g. a YouTube video) into a text summary.

Darren Willenberg
MLthinkbox
5 min read · Nov 8, 2022

[Figure: Development process]

Introduction

Lately I have been curious about how to extract text from video content such as YouTube and TikTok. Potential use cases are analysing sentiment in videos on a specific topic, creating subtitles or (as in my case) producing input for a text analytics application.

To achieve our goal we will follow three steps: 1) installing the FFMPEG software, 2) converting the data from MP4 to MP3 to WAV, and 3) applying the speech_recognition Python module to transcribe the spoken text. Let's get started!

Installing FFMPEG

A convenient way to convert an MP4 video file into a format suitable for speech recognition is to use FFMPEG. FFMPEG is a command-line tool, and its commands can be executed directly from your Python script. We will get into this a bit later. If you do not already have FFMPEG installed you can follow the steps below; otherwise you can skip ahead to the code example.

FFMPEG installation STEP 1: Download the latest build of ffmpeg-git-essentials.7z by clicking here.

[Figures: ffmpeg download options; downloaded ffmpeg file]

FFMPEG installation STEP 2: Create a folder (called FFMPEG, for example) on your C:\ drive and extract the contents of the downloaded 7z file into the newly created folder.

FFMPEG installation STEP 3: Add the path of the new folder to the Path variable in your Windows system environment variables. Make sure to use the path to the bin folder, where the “ffmpeg” executable is located.

If you are using a Mac, look here for how to add environment variables.

FFMPEG installation STEP 4: You can confirm the installation by typing ffmpeg in the command prompt (cmd).
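
The check in STEP 4 can also be done from Python. The snippet below is a minimal sketch rather than part of the original script: the function names are made up for illustration, and the folder C:\FFMPEG\bin is only the example location from STEP 2. Note that changing PATH this way affects the running Python process only; it does not replace STEP 3.

```python
import os
import shutil
import subprocess


def add_to_process_path(directory):
    """Append directory to PATH for the current Python process only."""
    current = os.environ.get("PATH", "")
    if directory not in current.split(os.pathsep):
        os.environ["PATH"] = current + os.pathsep + directory if current else directory


def ffmpeg_available(executable="ffmpeg"):
    """Return True if the executable is on PATH and `-version` exits cleanly."""
    path = shutil.which(executable)
    if path is None:
        return False
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    return result.returncode == 0


# Assumption: on Windows, ffmpeg was extracted to the example folder from STEP 2.
if os.name == "nt" and not ffmpeg_available():
    add_to_process_path(r"C:\FFMPEG\bin")

print("ffmpeg available:", ffmpeg_available())
```

If this prints False, the bin folder is most likely missing from the environment variables or points to the wrong directory.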

Code walkthrough

Once FFMPEG is confirmed to be working we can get into transcribing text from video!

We will import the os, speech_recognition and ffmpeg modules. I am also declaring variables for the location of the project home directory, the video to be converted and the ffmpeg.exe executable.

Issuing ffmpeg commands via Python's os module was a bit troublesome and required a lot of attention to the ffmpeg installation and environment variables. FFMPEG commands typically start with the ffmpeg executable, followed by the input and output files with their format extensions.
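
As a rough sketch of what such a setup might look like (the original code is not reproduced here), the snippet below declares the locations using only the os module. Every folder and file name in it is a hypothetical placeholder, and the third-party imports are left out so that the sketch runs with the standard library alone.

```python
import os

# Hypothetical locations: adjust these to your own project layout.
PROJECT_HOME = os.path.join(os.path.expanduser("~"), "video_to_text")
VIDEO_FILE = os.path.join(PROJECT_HOME, "input_video.mp4")

# Assumed install location, following the example folder from the installation steps.
FFMPEG_EXE = r"C:\FFMPEG\bin\ffmpeg.exe"

# Derive the intermediate audio file names from the video file name.
base_name, _ = os.path.splitext(VIDEO_FILE)
MP3_FILE = base_name + ".mp3"
WAV_FILE = base_name + ".wav"

print(VIDEO_FILE, MP3_FILE, WAV_FILE, sep="\n")
```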

ffmpeg -i <input_file.format> <output_file.format>

You can find the ffmpeg cheatsheet here. Once you are comfortable with ffmpeg commands you can insert the necessary variables into a Python string. You can learn about Python string formatting here.
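
The pattern above (ffmpeg -i <input> <output>, filled in through string formatting) could be sketched as follows. This is an illustration under assumptions, not the original script: the helper names and the file names video.mp4, audio.mp3 and audio.wav are placeholders, and the commands are only executed when run_command is called.

```python
import os


def ffmpeg_command(input_file, output_file, ffmpeg="ffmpeg"):
    """Build the command string: ffmpeg -i "<input_file>" "<output_file>"."""
    # Double quotes protect paths that contain spaces.
    return f'{ffmpeg} -i "{input_file}" "{output_file}"'


def run_command(command):
    """Run a command through the shell and fail loudly on a non-zero exit status."""
    status = os.system(command)
    if status != 0:
        raise RuntimeError(f"Command failed with status {status}: {command}")


# The two conversions described in this post: MP4 to MP3, then MP3 to WAV.
mp4_to_mp3 = ffmpeg_command("video.mp4", "audio.mp3")
mp3_to_wav = ffmpeg_command("audio.mp3", "audio.wav")
print(mp4_to_mp3)
print(mp3_to_wav)
```

Calling run_command(mp4_to_mp3) and then run_command(mp3_to_wav) would perform the conversions. If an output file already exists, ffmpeg asks before overwriting it; adding its -y flag to the command skips that prompt.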

Earlier we imported speech_recognition as “sr”. We use it to create the recognizer object and to load the processed audio file.

Finally, we declare the length of time over which we want to transcribe audio into text and open our audio file as source, using a syntax specific to speech_recognition. More details on this here.

Based on my review, there are minor mistakes in the resulting text where speech is mumbled or an unusual word is used. The output text is, however, highly interpretable and can be used for further analysis. If you are interested in comparing the input and output you can watch the original video here.
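
A minimal sketch of these two steps, assuming the SpeechRecognition package is installed, might look like the code below. The function name and parameters are invented for illustration. The choice of recognize_google (Google's free web API, which needs an internet connection) is also an assumption, since the post does not say which recognition engine was used.

```python
try:
    import speech_recognition as sr  # third-party: pip install SpeechRecognition
except ImportError:
    sr = None  # lets the sketch load even where the package is absent


def transcribe_wav(wav_path, duration_seconds=None):
    """Transcribe the first duration_seconds of a WAV file (all of it if None)."""
    if sr is None:
        raise RuntimeError("The SpeechRecognition package is not installed")
    recognizer = sr.Recognizer()
    # AudioFile is used as a context manager; `source` is the opened audio stream.
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source, duration=duration_seconds)
    # Assumption: Google's web API is the engine; other recognize_* methods exist.
    return recognizer.recognize_google(audio)
```

For example, transcribe_wav("audio.wav", duration_seconds=60) would attempt to transcribe the first minute of a hypothetical audio.wav.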

Conclusions

  • Speech recognition requires audio in a supported format such as WAV
  • Installing FFMPEG takes a bit of patience
  • The output text makes sense and can be used for further text analysis purposes

If you find a better way of doing this, please let me know! The python code can be accessed here. Thanks!

Overview of python dependencies

The os module in Python provides functions for interacting with the operating system. It is part of Python’s standard library and provides a portable way of using operating-system-dependent functionality.

The ffmpeg module is required in order to convert between different audio-visual formats such as MP4, MP3 and WAV.

The speech_recognition module takes the WAV file as input and provides the transcribed text as output.

Darren Willenberg
MLthinkbox

Engineer | Analyst | Data Science Enthusiast | UCT | MLthinkbox Publication Founder