How we built Text at InVideo #ChallengeAccepted

Published in InSide InVideo · 7 min read · Oct 12, 2020

Smooth sailing into an enduring storm..

As a startup, we have noticed the ubiquitous presence of the Pareto principle (80/20 rule) in a lot of things we do. It makes us happy and even over-optimistic sometimes, when so much of the work seems to be done with so little effort.

Case in point: InVideo’s feature-rich text module. Building out HTML-based text editing with highly customisable properties (anything from making text bold and changing the font, to adjusting the kerning of the characters) relied mainly on just a handful of functions, namely the two below.

The applyStyle function was used to assign a specific style property (like font, colour, kerning) to any text element. Since HTML and CSS expose most of the properties we needed right off the bat, a majority of our work seemed to be done!
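To give a feel for how small such a helper can be, here is a minimal sketch. It is not the original applyStyle: the name applyStyleSketch, the StyleTarget shape and the signature are all assumptions made for illustration, and a plain object stands in for a DOM element's style.

```typescript
// Hypothetical shape: anything exposing a CSS-like style map.
// In the browser, an element's `style` object plays a similar role.
interface StyleTarget {
  style: Record<string, string>;
}

// Assign one style property to every element in the current selection.
function applyStyleSketch(targets: StyleTarget[], property: string, value: string): void {
  for (const target of targets) {
    target.style[property] = value;
  }
}

// Usage: make one word bold and widen its kerning.
const word: StyleTarget = { style: {} };
applyStyleSketch([word], "fontWeight", "bold");
applyStyleSketch([word], "letterSpacing", "2px");
```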

Or so we thought..

Let’s talk about the remaining 80% of the effort, shall we?

Like with any semantic entity such as an API, system or framework — things start getting tricky when some functionality is not exposed by default, or the entity is not structurally written to handle a specific use case.

We realised very early on that we had a lot of such functionalities to build and corner-cases to handle, which HTML wouldn’t support straight away. Things like character-based animations, auto-resizing for custom dimensions, cross-platform rendering, multi-language support, etc. were essential to building out a holistic editor and required us to write our own custom implementations to handle them.

Here are some of the challenges we faced while building the text module at InVideo -

1. Challenges with auto-resizing text

We have a mode in InVideo where the font size of a text element is automatically adjusted to optimally fit its bounding area. Adding/removing text, changing the width or height of its bounding box, and any such actions would trigger the auto-resize method to return the new optimal font size in real-time.

Now, this is not a cumbersome problem as is. There are quite a few libraries that do this as well, for example — http://fittextjs.com/
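For the simple, single-entity case, the core idea can be sketched as a binary search over font sizes. This is only an illustration under assumptions: the measurement callback is injected (in a browser it might measure a hidden element), the names fitFontSize and MeasureHeight are hypothetical, and the measured height is assumed to grow with the font size.

```typescript
// Returns the height (in px) the text would occupy at a given font size.
// Injected so the search itself stays independent of the DOM.
type MeasureHeight = (fontSize: number) => number;

// Largest integer font size whose measured height still fits the box.
// Assumes the measured height never shrinks as the font size grows.
function fitFontSize(
  measureHeight: MeasureHeight,
  boxHeight: number,
  minSize = 1,
  maxSize = 500
): number {
  let lo = minSize;
  let hi = maxSize;
  let best = minSize;
  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (measureHeight(mid) <= boxHeight) {
      best = mid; // fits: try a larger size
      lo = mid + 1;
    } else {
      hi = mid - 1; // overflows: try a smaller size
    }
  }
  return best;
}

// Toy measurement: three lines, each twice the font size tall.
const toyMeasure: MeasureHeight = (fontSize) => 3 * 2 * fontSize;
const fitted = fitFontSize(toyMeasure, 360);
```

The search needs only about log2(maxSize) measurements per resize, which matters when the method runs on every keystroke or drag.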

However, it got interesting when combined with the following added functionalities -

Allow Highlighting of Text:

On InVideo, a user can highlight just one word in a sentence, multiple non-adjacent words or the whole sentence. Highlighting changes the font style, colour and/or banding of the highlighted part.

This feature added complexity: we now needed to perform the resizing on multiple sequential and nested HTML tags, instead of a single entity. Very few libraries supported this back in 2018.

As a result, we ended up building out the resizing logic ourselves. This included extending the basic resize logic of the nested entities to the whole parent box through sequential calculations while factoring in the kerning, padding, font-style and scale.
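A rough sketch of that idea follows. It is not our production logic: the TextRun model, the fixed per-character advance and all function names are assumptions, and real code would measure rendered glyphs rather than estimate them.

```typescript
// Hypothetical model of one styled run (e.g. a highlighted span) inside a text box.
interface TextRun {
  text: string;
  relativeScale: number;   // run font size as a multiple of the base font size
  letterSpacingPx: number; // kerning added after each character
  paddingPx: number;       // horizontal padding per side (e.g. a highlight band)
}

// Toy glyph model: every character is `advanceEm` em wide.
function runWidth(run: TextRun, baseFontSize: number, advanceEm = 0.5): number {
  const glyphs = run.text.length * advanceEm * baseFontSize * run.relativeScale;
  const spacing = run.text.length * run.letterSpacingPx;
  return glyphs + spacing + 2 * run.paddingPx;
}

// The width of the line is the sum over its sequential runs.
function lineWidth(runs: TextRun[], baseFontSize: number): number {
  return runs.reduce((total, run) => total + runWidth(run, baseFontSize), 0);
}

// Largest integer base font size at which all runs fit on one line of the box.
function fitRunsToWidth(runs: TextRun[], boxWidth: number, maxSize = 500): number {
  let best = 1;
  for (let size = 1; size <= maxSize; size++) {
    if (lineWidth(runs, size) > boxWidth) break;
    best = size;
  }
  return best;
}
```

Note that padding and kerning do not scale with the font size in this model, which is why the fit cannot be computed from a single ratio and has to account for each run.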

Different zoom levels

While our videos are mainly exported in 1080p (1920x1080), they are edited at a different resolution in the editor window (depending on the user’s screen). We place a lot of emphasis on providing a WYSIWYG (what you see is what you get) experience, which means the editor canvas output needs to look exactly the same as the final video export.

We started off by applying a simple CSS scale transform to the entire editor to resize the dimensions.

However, this method gave us two problems -

Any element would have to go through the unnecessary and repeated scale calculation, every time it would be moved, resized or rotated — leading to a decreased fps.
There are minor precision losses across systems, browsers and servers, which degrade the WYSIWYG experience. For instance, a sentence which fit on one line in the editor would spill onto a second line in the final export, as seen below -

See how the text just changed its structure? That is sometimes caused by loss of precision.

So, we decided to take control in-house and introduced a custom text scaling implementation (‘ScaleFactor’) which, in its simplest form, calculates the font size for a given canvas dimension on the basis of the original video dimension, while keeping the experience consistent.
With text handled this way, every other non-text element could then be handled in percentages, and therefore work with any dimension.
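In its simplest form, the idea might look like the sketch below. The function names and the choice of width as the reference dimension are assumptions for illustration, not the actual ScaleFactor code.

```typescript
// Reference width the video is authored for (1080p export).
const EXPORT_WIDTH = 1920;

// Ratio between the current canvas width and the reference width.
function scaleFactor(canvasWidth: number, referenceWidth: number = EXPORT_WIDTH): number {
  return canvasWidth / referenceWidth;
}

// Font size to use on a canvas of the given width.
function scaledFontSize(referenceFontSize: number, canvasWidth: number): number {
  return referenceFontSize * scaleFactor(canvasWidth);
}

// Non-text elements can be stored as percentages of the canvas instead.
function percentToPx(percent: number, canvasDimension: number): number {
  return (percent / 100) * canvasDimension;
}
```

For example, a 64px heading authored for a 1920px-wide export would be drawn at 32px on a 960px-wide editor canvas, computed once from the stored reference size rather than through a transform applied on every interaction.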

2. Rendering and cross-platform challenges

Scaling precision loss was not the only cause of the text looking different (Exhibit A and B above) between the final video export and the editor canvas. We fought other battles too -

Framework differences

One such battle was regarding an intermittent issue that had a seemingly random frequency of occurrence and was thus rather challenging to resolve.

Only after extensive debugging did we figure out that Bootstrap, which we were using at the time on the editor, overrides a few properties (mainly box-sizing), while our final video rendering, which was done using a vanilla approach, did not override them.

Switching the editor to vanilla CSS as well solved the problem. It seems naive in hindsight, but we never imagined that it could be because of a CSS framework!
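To see why box-sizing alone can move a line break, consider how much width is left for the text under each box model. The sketch below is plain arithmetic with hypothetical names (BoxStyle, contentWidth); it relies only on the standard CSS definitions of content-box (the browser default) and border-box (which Bootstrap applies globally).

```typescript
type BoxSizing = "content-box" | "border-box";

interface BoxStyle {
  width: number;   // CSS `width` in px
  padding: number; // horizontal padding per side, in px
  border: number;  // border width per side, in px
  boxSizing: BoxSizing;
}

// Width available to the text itself under each box model.
function contentWidth(box: BoxStyle): number {
  if (box.boxSizing === "border-box") {
    // `width` includes padding and border, so the content area shrinks.
    return box.width - 2 * (box.padding + box.border);
  }
  // `width` is the content area; padding and border are added outside it.
  return box.width;
}
```

With a 300px-wide box, 10px padding and a 1px border, the text gets 300px under content-box but only 278px under border-box, which is easily enough to push the last word of a sentence onto a new line.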


Operating System Differences

This one is my absolute favourite. We use Ubuntu servers for the final video export, and most of our users work on either Windows or macOS machines. There are certain rare rendering engine differences between operating systems, most of which don’t show up in daily use.

However, we weren’t immune to this and ran into it with certain types of whitespace.

For instance,

had a noticeable difference on Ubuntu versus other operating systems (and only for very specific cases and fonts). We support multiple consecutive spaces in our text, which compounded the problem and made the difference in gap width noticeable to our users.

Eventually, we solved this by using non-breaking spaces instead, which use the intrinsic font-defined space width and are thus consistent across systems. It worked like a charm.
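One way such a replacement could be done is sketched below. The helper name preserveSpaceRuns is hypothetical, and converting only runs of two or more spaces (so that single spaces still allow line wrapping) is an assumption, not a description of our exact rule.

```typescript
// U+00A0 NO-BREAK SPACE
const NBSP = "\u00A0";

// Replace every space inside a run of two or more spaces with a non-breaking space.
// Single spaces are left alone so lines can still wrap between words.
function preserveSpaceRuns(text: string): string {
  return text.replace(/ {2,}/g, (run) => run.replace(/ /g, NBSP));
}
```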

3. Supporting multiple languages

While Windows and macOS come with out-of-the-box support for international language fonts, Ubuntu does not. Luckily, this is configurable on Ubuntu.

The most common way is -

This will take care of all the font installations.

However, this approach overlooked the fact that there would be conflicts between certain font families when it came to representing emojis. As a result, the emojis of those fonts would be rendered as rectangular boxes instead. Something like this -

To resolve this, we had to sift through all the fonts and re-install manually or skip the fonts where the conflicts would occur.

4. Splitting Text for Animations

In order to provide various text animation effects like the typewriter, waterfall, etc — we had to split the whole text entity by lines, words, or characters and then animate the individual split entities sequentially — to produce the desired effects.

Splitting the text sounded like a fairly easy thing to do, with a number of articles and libraries available (GSAP SplitText, jQuery Splitlines). Again, however, none of the standard libraries supported nested HTML handling, which was needed for highlighting and applying banded backgrounds to the text.

We thus had to dive deep and write the splitting logic on our own.
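The crux of nested splitting is that each split character has to remember which tags wrapped it, so that highlights and bands survive the split. A DOM-free sketch of that idea is shown below; the MarkupNode and SplitChar types and the splitIntoChars function are hypothetical and only illustrate the approach.

```typescript
// Hypothetical model of nested markup: a node is either text or a tag with children.
type MarkupNode =
  | { kind: "text"; value: string }
  | { kind: "element"; tag: string; children: MarkupNode[] };

// One animatable piece: a single character plus the tags that wrapped it, outermost first.
interface SplitChar {
  char: string;
  ancestors: string[];
}

// Depth-first walk that emits characters in reading order while remembering enclosing tags.
function splitIntoChars(node: MarkupNode, ancestors: string[] = []): SplitChar[] {
  if (node.kind === "text") {
    // split("") works on UTF-16 code units; emojis would need grapheme-aware splitting.
    return node.value.split("").map((char) => ({ char, ancestors }));
  }
  const path = ancestors.concat(node.tag);
  let pieces: SplitChar[] = [];
  for (const child of node.children) {
    pieces = pieces.concat(splitIntoChars(child, path));
  }
  return pieces;
}
```

Each piece can then be re-wrapped in its ancestor tags before being animated, so a highlighted word keeps its styling even when its characters move independently.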

When we switched to GSAP as the base of our animation engine, we thought GSAP would handle it. However, even then, we found certain inconsistencies between GSAP and InVideo.

So we stayed with our in-house logic.

5. Professional Looking Animation Effects

Continuing from the last point, animating each character or line sequentially looks rather amateur and unprofessional if not done in the right way.

To hit upon the smoothness required, we did many iterations of these animation effects on parameters like -

sequence delay (how long after the first character begins its animation should the second character start animating, and so on...)
bezier curves
animation duration

This became even more challenging because of the dynamic nature of input text on which the effects had to be applied. For instance, the user is free to input a text of font size anywhere from 20 to 200, or variable text length from 3–100 characters or 2–10 lines, etc. The effect still needs to look good.
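One way to keep an effect from dragging on for long text is to shrink the sequence delay as the character count grows. The sketch below illustrates that idea only; the StaggerOptions fields, the capping strategy and the cubic ease-out (used here as a simple stand-in for a tuned bezier curve) are all assumptions rather than our actual parameters.

```typescript
interface StaggerOptions {
  baseDelay: number;    // preferred gap between consecutive characters, in seconds
  itemDuration: number; // how long each character animates, in seconds
  maxTotal: number;     // upper bound for the whole effect, in seconds
}

// Start time of each character, shrinking the gap when the text is long.
function staggerStartTimes(count: number, opts: StaggerOptions): number[] {
  if (count <= 0) return [];
  const budget = Math.max(opts.maxTotal - opts.itemDuration, 0);
  const gap = count > 1 ? Math.min(opts.baseDelay, budget / (count - 1)) : 0;
  const starts: number[] = [];
  for (let i = 0; i < count; i++) {
    starts.push(i * gap);
  }
  return starts;
}

// Cubic ease-out: fast at the start, settling gently at the end.
function easeOutCubic(t: number): number {
  const clamped = Math.min(Math.max(t, 0), 1);
  return 1 - Math.pow(1 - clamped, 3);
}
```

With a 2 second cap and 1 second per character, three characters keep the preferred 0.5 second gap, while five characters are squeezed to 0.25 seconds apart so the last one still finishes on time.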

We were finally able to hit upon the right combination of parameters which produced smooth professional-quality effects on the text, while reasonably handling dynamicity. There is, however, much room for improvement here still.

These little gaps and complexities forced us to build our own implementations for seemingly common and solved problems. In hindsight, this really helped us understand how text works in HTML and get into its nitty-gritty details. It also helps us better anticipate the problems that we will face going forward, as we build this for scale and add many more functionalities.

So you see, the first 20% effort + solving the above-mentioned problems + a few of our secret ingredients = Text at InVideo