What’s A Video Without Audio?

Published in InSide InVideo · 13 min read · Nov 11, 2020

This piece was written by Ajitesh Singh, Aditya Krishnan, and Kshitij Agrawal.

Did you know that two identical videos with different music can evoke drastically different emotions in a viewer? The choice of music (or background audio) can also be the difference between a professional-looking video and an amateur one.

Sound elements don’t just manipulate the audience’s emotions; they can also help them interpret the story. As director George Lucas has repeatedly explained, “I feel that sound is half the experience.” Nowhere is this idea clearer than in Steven Spielberg’s classic film “Jaws.” Associating a sound, such as the bass thrum, with a menacing character adds an element of foreshadowing and builds heightened tension because the audience knows who is coming. The famous bass line in “Jaws” has become so synonymous with lurking danger that the sound itself can set off a narrative cue for audiences without any visual elements.

Building InVideo has been a fascinating journey of realising the importance of music in a video.

The role of the editor

If you want your video to leave a lasting impression, you must ensure that it has the right audio and audio transitions. Now, who can make that happen? You’re right, an editor!

If you have two minutes, do check out this Super Bowl ad here. You’ll see that even though the video has a lot of dialogue, the background music is striking and definitely catches your attention.

Introducing an InVideo solution — what it is, how it works & why it is helpful.

All the data we collected from our existing users (surveys, user sessions, support feedback, Amplitude data, etc.) showed that almost all use cases require at most two layers of audio. We did not want to increase the complexity of the timeline by offering more.

Below are the new changes to the audio timeline. Hopefully, these changes will accommodate the vast majority of use cases.

Timeline

In our previous timeline, the voice-over was the only movable audio entity, and it could move only within its scene’s duration. Although this always kept it in sync with a scene, many of our users found it restrictive. The fact that a voice-over could not be placed between two scenes, or while a transition was playing, sometimes broke the flow of an otherwise gripping video.

We decided to overhaul our entire audio structure to allow users to place audio wherever they like. This applies to all audio: voice-over, background music, and automated text-to-speech.

Separate Audio lines

There are now two separate lines on the timeline for audio: one for voice-over and automated text-to-speech, and another for background music. All audio is distributed across these lines and labeled for easy recognition. These audio elements are unaffected by changes to scene durations; they do not automatically adjust to match scene timings. They also feature a waveform on their bodies that accurately represents their sound intensity at any given time.

These waveforms are also helpful for trimming on the timeline.

Trimming on the timeline

Previously, users could only drag a voice-over on the timeline, and only within a scene’s duration.

Now, along with dragging, they can also resize any audio.

What does resizing do? It trims your audio, just like the trimmer does. It is meant for quick trims that don’t really need the trimmer to be opened: trims that can be done just by looking at the waveforms on the timeline. This makes trimming much faster.

Easy repositioning

It is often a hassle when you have placed all your audio at the right times and then need to insert one more element in the middle. We did not want you to move every audio element individually to make space for it. So we made sure that audio on the timeline can be pushed by other audio while dragging or resizing, so you only have to move one element to reposition all the others.

Quick options

We also added some quick options on the audio in the timeline itself, which you can see by hovering over it. Options like loop, trim, duplicate, and delete need to be quickly accessible, so we made them readily available.

Voiceover Recording

Our previous recorder was pretty simple: it only showed the duration of your recording. It was missing the feedback that could make it feel more alive and less stopwatch-like. To fix that, we added a live waveform to the recorder that shows the intensity of your voice as you speak. A flowing waveform that keeps around a minute of your latest speech in sight felt both useful and more responsive.

Coming to audio quality, we record at the standard 44.1 kHz sampling frequency. Ever wondered the reason for this number?

The Nyquist–Shannon sampling theorem says the sampling frequency must be greater than twice the maximum frequency one wishes to reproduce. Since the human hearing range is roughly 20 Hz to 20,000 Hz, the sampling rate had to be greater than 40 kHz.

Live Waveform

So, how did we make a live waveform? We made use of the canvas element to draw it. We wanted a smooth wave which flows somewhat slowly so that it doesn’t escape the user’s view quickly.

For this purpose, we use the same AudioBuffer data from recording, which is just a Float32Array with positive and negative values (negative as well, because audio is a wave, stored here in sampled form). We find the maximum intensity of the data points in each 40 ms span to build our filtered array.

  • We consider the vertical centre of the canvas as 0 intensity.
  • We plot points on the canvas relative to the centre and keep connecting them using lines.
  • To make the waveform flow, we keep offsetting it over time so that it moves towards the left from the horizontal centre.
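
To make this concrete, here is a minimal TypeScript sketch of the approach. The function names and drawing parameters (the 40 ms window, the pixel step) are illustrative assumptions, not our production code:

```typescript
const SAMPLE_RATE = 44100;
const WINDOW_SIZE = Math.floor(SAMPLE_RATE * 0.04); // one peak per 40 ms span

// Reduce the raw Float32Array samples to one peak (max |intensity|) per window.
function toPeaks(samples: Float32Array): number[] {
  const peaks: number[] = [];
  for (let i = 0; i < samples.length; i += WINDOW_SIZE) {
    let max = 0;
    for (let j = i; j < Math.min(i + WINDOW_SIZE, samples.length); j++) {
      max = Math.max(max, Math.abs(samples[j]));
    }
    peaks.push(max);
  }
  return peaks;
}

// Draw the peaks as a connected line: vertical centre = 0 intensity, and an
// increasing offset makes the wave flow left from the horizontal centre.
function drawWaveform(ctx: CanvasRenderingContext2D, peaks: number[], offsetPx: number) {
  const { width, height } = ctx.canvas;
  const midY = height / 2;
  const step = 3; // horizontal pixels per peak
  ctx.clearRect(0, 0, width, height);
  ctx.beginPath();
  peaks.forEach((p, i) => {
    const x = width / 2 + i * step - offsetPx;
    const y = midY - p * midY;
    i === 0 ? ctx.moveTo(x, y) : ctx.lineTo(x, y);
  });
  ctx.stroke();
}
```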

Why does the wave not start from the right edge?

The right edge of the recorder is very close to the edge of the screen. Keeping it in the middle felt more natural without straining the eye. Just try staring at the edge of the screen for a minute and you’ll see what we mean.

AudioWorklet

Along with the visuals, we added technical optimisations to the recorder as well. Our previous version used ScriptProcessorNode, which had two problems: its event handling is asynchronous by design, and it executes code on the main thread.

This puts pressure on the main thread, which is commonly crowded with various UI- and DOM-related tasks, causing either the UI to "jank" or the audio to "glitch".

We made use of the newly introduced AudioWorklet and AudioWorkletProcessor in the Web Audio API, which keep the user-supplied JavaScript code within a separate audio processing thread — that is, it doesn’t have to jump over to the main thread to process audio. This ensures zero additional latency and synchronous rendering, which allows the audio to record accurately while the UI runs smoothly.
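
As a rough illustration, here is what a minimal recording worklet can look like. This is a hedged sketch, not our production recorder; the processor name, file names, and buffering strategy are illustrative:

```typescript
// recorder-processor.ts: runs in the AudioWorkletGlobalScope, where
// AudioWorkletProcessor and registerProcessor are available.
class RecorderProcessor extends AudioWorkletProcessor {
  process(
    inputs: Float32Array[][],
    _outputs: Float32Array[][],
    _parameters: Record<string, Float32Array>,
  ): boolean {
    const channel = inputs[0]?.[0];
    if (channel) {
      // Ship each 128-sample render quantum to the main thread for storage;
      // the audio processing itself never touches the main thread.
      this.port.postMessage(channel.slice());
    }
    return true; // keep the processor alive
  }
}
registerProcessor('recorder-processor', RecorderProcessor);

// main.ts: wire the microphone through the worklet and collect the samples.
async function startRecording() {
  const ctx = new AudioContext({ sampleRate: 44100 });
  await ctx.audioWorklet.addModule('recorder-processor.js');
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const chunks: Float32Array[] = [];
  const recorder = new AudioWorkletNode(ctx, 'recorder-processor');
  recorder.port.onmessage = (e) => chunks.push(e.data as Float32Array);
  ctx.createMediaStreamSource(stream).connect(recorder);
}
```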

Trimmer

Our previous audio trimmer, in the right panel, was fine for 20-second recordings, but anything longer made it feel cramped. It also felt jerky, which made precise positioning even more difficult.

To fix this, we gave it a much larger space on the bottom panel and kept it on the same theme as the recorder so that they complement each other. It shows the same waveform you saw while recording, which you can now use for trimming.

Now, what about long recordings, you know, those that exceed 10 or 20 mins?

For this, we introduced a new feature of zooming on the trimmer. You can now zoom in or out on the trimmer using a slider. The levels of zooming are calculated dynamically depending on the audio duration. The longer the audio, the more zoom you will have. With more zoom, you will be able to see the minute highs and lows of the audio in its waveform.
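
As an illustration, here is one way such duration-dependent zoom could be computed. The mapping below (roughly 10 seconds visible at full zoom) is a made-up assumption, not our actual formula:

```typescript
// Hypothetical sketch: derive the maximum zoom level from audio duration so
// that, fully zoomed in, about 10 seconds of audio fill the trimmer width.
function maxZoomLevel(durationSec: number, visibleWindowSec = 10): number {
  // Zoom level 1 shows the whole file; longer audio allows deeper zoom.
  return Math.max(1, durationSec / visibleWindowSec);
}

maxZoomLevel(20);   // 2   (short recordings barely need zoom)
maxZoomLevel(1200); // 120 (a 20-minute recording gets much deeper zoom)
```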

Coming to how we addressed the jerky trim handles: we redesigned the handles to be thicker, for better grabbing, and to sit outside the trimmed duration. You can think of the handles as floating, while only what remains between them becomes the trimmed clip. They never interfere with your trims; instead they move freely and smoothly until they reach the ends of the audio. And yes, the trimmed region can be dragged as well, just like old times!

These trims are also pseudo-trims, which means we do not actually edit your audio to create a new file. We apply the trims programmatically, so you can always go back to the original file if you want to.
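
Conceptually, a pseudo-trim is just a pair of offsets stored alongside a reference to the untouched source file. A simplified sketch, with illustrative field names:

```typescript
// Illustrative shape of a pseudo-trim: the source file is never re-encoded;
// we only store offsets into it.
interface AudioTrim {
  sourceUrl: string; // the original, untouched file
  trimStart: number; // seconds into the source where playback begins
  trimEnd: number;   // seconds into the source where playback stops
}

// "Going back to the original" is just resetting the offsets.
function resetTrim(trim: AudioTrim, fullDurationSec: number): AudioTrim {
  return { ...trim, trimStart: 0, trimEnd: fullDurationSec };
}
```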

Transition from scene-level audio to project-level audio

Those who have used InVideo will be familiar with the concept of scenes, but for those who haven’t, here’s a quick recap.

Just like slides in a PowerPoint, a video in InVideo is made up of multiple scenes. Any number of text/image/video elements can be added to each scene with custom animations. Users could also add background music or a voice-over, but with quite a lot of limitations.

To name a few: you could have only one background music track for the entire video, and voice-overs were strictly scene-level. For example, if you had a video with 4 scenes and scene 1 was 10 seconds long, you could have a voice-over of at most 10 seconds in that scene, and there was no way to extend the voice-over without increasing the duration of the scene. This scene-level voice-over was created for a text-to-video workflow and was really limiting for a quick video-edit workflow. To give users the flexibility to add and adjust audio without constraints, making audio project-level was a no-brainer.


Technical challenges in terms of backward compatibility

To understand why backward compatibility was a major challenge, let me show you how a project is stored. Any project is just a nested recursive JSON.
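
To make the hierarchy concrete, here is a simplified sketch of what such a nested structure looks like. The field names are illustrative, not our actual schema; the point is that in v1 a voice-over lives inside a scene, while in v2 audio moves up to the project level:

```typescript
// Illustrative shapes only; the real schema differs.
interface ProjectV1 {
  duration: number;
  scenes: {
    duration: number;
    elements: unknown[]; // text / image / video layers
    voiceOver?: { src: string; duration: number }; // bounded by the scene
  }[];
}

interface ProjectV2 {
  duration: number;
  scenes: { duration: number; elements: unknown[] }[];
  audios: { src: string; startTime: number; duration: number }[]; // project level
}
```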

An important thing to understand here is a child cannot exceed the duration limits of its parent. Hence, previously, voiceover couldn’t exceed the duration of a scene since hierarchically voiceover is stored inside a scene. We had to bring the voiceover out in such a way that none of the old projects gets affected.

Sounds easy, right?

A script that upgrades the version of the JSON should do the trick. The problem arises when you have to support both versions, because an important white-labeled client of ours that uses InVideo has a slower release cycle. Managing the two versions separately would have been a headache.

Here’s how we solved it.

  • We created a v2 version of the JSON that is compatible with the previous version, and we kept the voice-overs inside scenes only for Reuters.
  • This means a Reuters project can be opened and edited in both the old and the new version. For all other users it was important to delete the voice-over inside the scene, because the voice-over drives the duration of the scene (i.e., a scene cannot be shorter than its voice-over), which could confuse existing users and create chaos for support.
  • We did this in the cleanest way possible, by modifying the save-project API to handle both cases, as sketched below.
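
In code, the branching could look something like this hypothetical sketch, reusing the illustrative ProjectV1/ProjectV2 shapes from above (the real save-project API differs):

```typescript
// Hypothetical sketch of the v1 -> v2 migration inside the save path.
function migrateToV2(project: ProjectV1, keepSceneVoiceOvers: boolean): ProjectV2 {
  const audios: ProjectV2['audios'] = [];
  let cursor = 0; // running start time of each scene on the project timeline
  const scenes = project.scenes.map((scene) => {
    if (scene.voiceOver) {
      // Lift the voice-over to the project level at the scene's start time.
      audios.push({
        src: scene.voiceOver.src,
        startTime: cursor,
        duration: scene.voiceOver.duration,
      });
    }
    cursor += scene.duration;
    // For the white-labeled client, also keep the scene-level copy so that
    // older builds can still open and edit the project.
    return keepSceneVoiceOvers ? scene : { duration: scene.duration, elements: scene.elements };
  });
  return { duration: project.duration, scenes, audios };
}
```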

Volume Adjustment algorithm

Now, if multiple audio elements play simultaneously, there needs to be a way to decide the volume of an audio element wherever it overlaps with another.

So each audio element has two volume levels: a high volume level and a low volume level.

The portion of audio that overlaps with another audio element plays at the low volume level; the rest plays at the high volume level.

HS — high volume state; LS — low volume state

Each audio element has a list of volume states. A change to any audio element’s duration or position impacts the other audio elements and their volume states as well.

Pseudo code of how volume states are calculated:

  • Get a list of intervals (start time and end time) for each audio element. For the above example, it would be:
[
[7, 15],
[2, 10],
[0, 20]
]
  • Mark each start with S and each end with E, then flatten the two-dimensional array and sort it:
['0S', '2S', '7S', '10E', '15E', '20E']
  • Now iterate through this array, keeping counts of the S (start) and E (end) markers visited. At any point where the count of starts visited minus the count of ends visited exceeds 1, an overlap has started. Similarly, when that difference drops back to 1, the overlap ends. Store all the overlapping sections; in our case there is just one, [2, 15]. A sketch of this sweep follows the list below.
  • Finally, for each audio element, split its duration into volume states based on these overlaps.
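
Here is a minimal TypeScript sketch of that sweep; our production code differs, but this is the algorithm exactly as described:

```typescript
type Interval = [start: number, end: number];

function findOverlaps(intervals: Interval[]): Interval[] {
  // Flatten to (time, kind) events; on ties, process ends before starts so
  // merely touching intervals don't count as overlapping.
  const events = intervals
    .flatMap(([s, e]) => [[s, 'S'], [e, 'E']] as [number, 'S' | 'E'][])
    .sort((a, b) => a[0] - b[0] || (a[1] === 'E' ? -1 : 1));

  const overlaps: Interval[] = [];
  let active = 0;        // starts visited minus ends visited
  let overlapStart = 0;
  for (const [time, kind] of events) {
    if (kind === 'S') {
      active++;
      if (active === 2) overlapStart = time;                 // overlap begins
    } else {
      if (active === 2) overlaps.push([overlapStart, time]); // overlap ends
      active--;
    }
  }
  return overlaps;
}

findOverlaps([[7, 15], [2, 10], [0, 20]]); // => [[2, 15]]
```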

How is the audio preview generated?

A naive way of generating an audio preview on the web is with setTimeout and setInterval. But this approach has a lot of drawbacks: setTimeout and setInterval give no accuracy guarantees and can lag arbitrarily, and implementing play/pause for audio on top of them involves a lot of calculation. We wanted granular control over play/pause, so that even fade-in and fade-out work flawlessly no matter where the preview begins.

There is a popular concept in animation called tweening (or inbetweening) that we can use to our benefit. A tween simply generates the intermediate frames between the first and the last frame. We use a similar concept to generate the audio preview with play/pause functionality: a tween is generated for each audio element, and this tween exposes APIs to seek, play, pause, and so on. There are various external libraries that expose such APIs, or, if you are a geek, join InVideo and build it from scratch. 🥳
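
For illustration, here is a minimal hand-rolled tween with play/pause/seek driving an element's volume. Everything here (the class name, requestAnimationFrame-based ticking, the Audio element and file name) is an illustrative assumption, not our actual implementation:

```typescript
class VolumeTween {
  private elapsed = 0; // seconds of the tween already played
  private rafId: number | null = null;
  private lastTick = 0;

  constructor(
    private from: number,
    private to: number,
    private duration: number, // seconds
    private onUpdate: (volume: number) => void,
  ) {}

  seek(time: number) {
    // Seeking lands on the exact interpolated volume, which is why fades
    // stay correct no matter where the preview starts.
    this.elapsed = Math.min(Math.max(time, 0), this.duration);
    const t = this.elapsed / this.duration;
    this.onUpdate(this.from + (this.to - this.from) * t);
  }

  play() {
    this.lastTick = performance.now();
    const tick = (now: number) => {
      this.seek(this.elapsed + (now - this.lastTick) / 1000);
      this.lastTick = now;
      if (this.elapsed < this.duration) this.rafId = requestAnimationFrame(tick);
    };
    this.rafId = requestAnimationFrame(tick);
  }

  pause() {
    if (this.rafId !== null) cancelAnimationFrame(this.rafId);
    this.rafId = null;
  }
}

// Usage: fade in over 2 seconds, starting the preview mid-fade.
const audioElement = new Audio('voiceover.mp3'); // hypothetical file
const fadeIn = new VolumeTween(0, 1, 2, (v) => { audioElement.volume = v; });
fadeIn.seek(0.5); // starting mid-fade still yields the correct volume
fadeIn.play();
```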

Deep diving into Crossfade

A fade is simply a gradual increase (or decrease) in audio volume over a period of time. So to implement a fade-in programmatically, the first thought that pops into your mind is to start at volume 0 and keep increasing it by some delta until it finally reaches volume 1.

But a linear crossfade is not always the right choice. Blending two uncorrelated audio signals with a linear fade exhibits a volume dip, and any professional video creator will notice it.

Here is why that happens.

Let’s first understand the difference between energy and amplitude.

When we transition the volume from 0 to 1, we are actually changing the amplitude, but the loudness (energy) of an audio signal as perceived by the human ear is proportional to the square of the amplitude. For example, to get a perceived loudness of 1/2, the amplitude must be √(1/2) ≈ 0.707.

In an equal-power crossfade, the sum of the squares of the two amplitudes is always equal to 1, and the curves intersect at 0.707, which reduces the volume dip to a negligible value. Hence, we are planning to move to an equal-power fade, in which the fade is non-linear and the intersection during a crossfade happens at a higher amplitude.
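
For reference, the standard equal-power gain curves look like this in TypeScript (this is the textbook technique, not yet our shipped code, since the move is still planned):

```typescript
// Equal-power crossfade gains: gOut^2 + gIn^2 = 1 at every point of the fade,
// and the curves cross at sqrt(1/2) ~ 0.707 instead of 0.5, so the summed
// perceived loudness stays constant and the mid-fade dip disappears.
function equalPowerGains(t: number): { gOut: number; gIn: number } {
  // t runs from 0 (only the outgoing track) to 1 (only the incoming track).
  return {
    gOut: Math.cos(t * 0.5 * Math.PI),
    gIn: Math.cos((1 - t) * 0.5 * Math.PI), // equals sin(t * PI / 2)
  };
}

equalPowerGains(0.5); // { gOut: ~0.707, gIn: ~0.707 }, no volume dip
// Compare with linear: gOut = 1 - t and gIn = t intersect at 0.5 (the dip).
```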

So far we have been controlling audio purely with volume levels to preview it in the browser. But what if we want to permanently reduce the loudness of an audio track by 50%, or increase it by 100%?

In audio engineering, an increase or decrease in loudness is achieved by applying gain, which is measured in dB (decibels).

What is a Gain?

A gain is simply the ratio of the signal measured at the output to the signal measured at the input. For response curves, gain is measured on a logarithmic scale and is almost always expressed in decibels. For amplitude (voltage-like) signals, the formula is:

gain (dB) = 20 × log₁₀(A_out / A_in)

So the gain that needs to be applied to increase the volume level by 100% (double the amplitude) is 20 × log₁₀(2) ≈ 6 dB.

Similarly, the gain to decrease the volume by 50% is 20 × log₁₀(0.5) ≈ −6 dB.
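
In code, the conversion both ways is a one-liner (a small illustrative helper, not an InVideo API):

```typescript
// Converting between amplitude ratios and decibels.
const ratioToDb = (ratio: number): number => 20 * Math.log10(ratio);
const dbToRatio = (db: number): number => Math.pow(10, db / 20);

ratioToDb(2);   // ~ +6.02 dB: doubling the amplitude (+100%)
ratioToDb(0.5); // ~ -6.02 dB: halving the amplitude (-50%)
dbToRatio(6);   // ~ 2: applying +6 dB roughly doubles the amplitude
```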

Impact of the new audio timeline

This graph shows the usage of voice-over over the past few days. As you can see, the number of people using voice-overs that span across scenes is higher than the number using scene-bound voice-overs.

This graph shows the usage of music over the past few days. The blue graph is music usage where the number of tracks is either 0 or more than 1; this accounts for about 20% of total music usage.

What’s next?

  1. Auto-extend music while adding scenes in the storyboard editor — this will make life easier for storyboard users who don’t care about multiple music tracks.
  2. Play audio while trimming — this will make trimming easier by giving audible feedback while trimming on the timeline.
  3. Equal-power crossfade, to make the audio sound even better.
  4. Replace will be next — replacing music tracks is something that will allow users to make changes to audio tracks quickly and easily.
  5. SFX sounds, for richer videos.
