How to correctly handle native http/https requests in node — The Snowman That Loved Music (and how your javascript code does not understand that)

Gevorg A. Galstyan

6 min read · Mar 20, 2019

https://the-showman-and-the-g-clef-u8pmjbhb7ixy.runkit.sh/

Let me tell you a story…

There was a snowman that loved music… very very very much…

Even better… Let me show you the story… https://the-showman-and-the-g-clef-u8pmjbhb7ixy.runkit.sh/

JavaScript and Unicode

And here is how the story was written:

Now let’s explain what is happening there.

We have defined 4 constants at the top of the code

const A_LETTER = 'A'; // Letter "A"
const SNOWMAN = '☃︎'; // Snowman
const GROWING_HEARTH = '💗'; // Growing Heart
const G_CLEF = '𝄞'; // Musical Symbol G Clef

It is just the letter “A” and 3 Unicode characters: The Snowman, The Growing Heart, and The Musical G Clef Symbol.

On the next several lines we print each constant and its length, and also the 1st and the 2nd elements of each constant.

console.log(`${A_LETTER} length is ${A_LETTER.length}`); // "A length is 1"
console.log(`${A_LETTER[0]} and ${A_LETTER[1]}`); // "A and undefined"

console.log(`${SNOWMAN} length is ${SNOWMAN.length}`); // "☃︎ length is 2"
console.log(`${SNOWMAN[0]} and ${SNOWMAN[1]}`); // "☃ and ︎"

console.log(`${GROWING_HEARTH} length is ${GROWING_HEARTH.length}`); // "💗 length is 2"
console.log(`${GROWING_HEARTH[0]} and ${GROWING_HEARTH[1]}`); // "� and �"

console.log(`${G_CLEF} length is ${G_CLEF.length}`); // "𝄞 length is 2"
console.log(`${G_CLEF[0]} and ${G_CLEF[1]}`); // "� and �"

As you may have noticed, there is something strange going on here.

The length of A is 1 as you would expect and the 1st element in it (index 0) is the letter A itself. The 2nd element is undefined which is also totally normal and expected.

But the length of the snowman character is 2. As with the letter A, the 1st element is the snowman (☃), but the 2nd element is not undefined. It looks empty?!

The situation is even more confusing for our heart character. Its length is 2, and both its 1st and 2nd elements are printed as question marks (�). We get the same results for our music character.

So what is going on here?

This is all happening because of how JavaScript works with strings and Unicode. For a very detailed investigation, I recommend reading a very good article called “JavaScript’s internal character encoding: UCS-2 or UTF-16?”.

In short…

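Some characters in Unicode do not fit into a single 16-bit code unit, so UTF-16 encodes them using two 16-bit code units. Such a pair is called a surrogate pair. This exists because the Unicode authors underestimated how big Unicode could get and eventually ran out of available space. The heart and the G clef are such characters, and indexing into them returns one lone half of the pair, which is printed as �. The snowman is a slightly different case: the constant above is actually two code points, the snowman itself (U+2603) followed by an invisible variation selector (U+FE0E), which is why its 2nd element looks empty.

Now, most of the tools you use on an everyday basis know this and know how to handle it, including some parts of JavaScript.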
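To make the difference between code units and code points a bit more visible, here is a small sketch. It assumes the same constants as above, written with escape sequences so that the invisible parts can be seen. String indexing and .length count 16-bit code units, while the string iterator (used by the spread operator) and codePointAt() work with whole code points.

```javascript
// Same characters as the constants above, written as escapes (my notation, not the original code).
const SNOWMAN = '\u2603\uFE0E';      // snowman + variation selector U+FE0E
const GROWING_HEARTH = '\u{1F497}';  // growing heart, a surrogate pair in UTF-16
const G_CLEF = '\u{1D11E}';          // musical symbol G clef, a surrogate pair in UTF-16

// .length counts UTF-16 code units.
console.log(G_CLEF.length);                      // 2
// The string iterator walks code points, so the pair stays together.
console.log([...G_CLEF].length);                 // 1
// charCodeAt() sees only the first half of the pair...
console.log(G_CLEF.charCodeAt(0).toString(16));  // "d834"
// ...while codePointAt() decodes the whole pair.
console.log(G_CLEF.codePointAt(0).toString(16)); // "1d11e"

// The snowman really is two code points, so even the iterator reports 2.
console.log([...SNOWMAN].length);                // 2
console.log([...GROWING_HEARTH].length);         // 1
```

So the parts of JavaScript that "know" about surrogate pairs are the code-point-aware ones. Plain indexing is not one of them.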

But… Let me continue with the rest of our code…

const storyArray = [];
for (let i = 0; i < 2 ** 13; i++) {
    storyArray.push(SNOWMAN);
    storyArray.push(GROWING_HEARTH);
    storyArray.push(G_CLEF);
}
console.log(storyArray);
console.log(storyArray.length); // 24576

const responseBody = storyArray.join('');

console.log(responseBody);
console.log(responseBody.length); // 49152
console.log(responseBody.length); // 49152

Here we are pushing the snowman, the heart, and the clef into an array 8192 times, which gives us an array of 24576 items. We then join the array into a single long string which has a length of 49152 (twice the length of the array, and now we know why… each item in the array has a length of 2 usual characters/letters).
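The numbers are easy to double-check with a short sketch. It also prints one extra number that is not in the original code: the size of the same string in bytes, assuming it is sent as UTF-8. The network deals in bytes, not in string lengths, which becomes important later in the story.

```javascript
// One "verse" of the story: snowman (+ variation selector), heart, G clef.
const verse = '\u2603\uFE0E' + '\u{1F497}' + '\u{1D11E}';
const story = verse.repeat(2 ** 13); // same content as joining the array

// Length in UTF-16 code units: 6 per verse.
console.log(story.length);                     // 49152
// Size in bytes when encoded as UTF-8 (assumed encoding): 14 per verse.
console.log(Buffer.byteLength(story, 'utf8')); // 114688
```

Each verse is 6 code units but 14 bytes in UTF-8 (3 + 3 for the snowman and its selector, 4 for the heart, 4 for the clef), so there are many places where a byte boundary can fall in the middle of a character.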

The rest of the code is pretty straightforward too. We are just creating an endpoint which is returning the story of the snowman that loved music in the body if you make a request to https://the-showman-and-the-g-clef-u8pmjbhb7ixy.runkit.sh/.

NodeJS and HTTP requests

I am assuming that you know what NodeJS is and that you have needed to make HTTP or HTTPS requests. You have probably searched for “how to make an http request in nodejs”. And you have probably landed on an answer looking similar to this:

// https://www.twilio.com/blog/2017/08/http-requests-in-node-js.html
const https = require('https');

https.get('https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY', (resp) => {
  let data = '';

  // A chunk of data has been received.
  resp.on('data', (chunk) => {
    data += chunk;
  });

  // The whole response has been received. Print out the result.
  resp.on('end', () => {
    console.log(JSON.parse(data).explanation);
  });
}).on("error", (err) => {
  console.log("Error: " + err.message);
});

Showing what is wrong with the search results

You need to know that when your computer makes a request over the internet, it does not get the whole response in one piece. The response can arrive in chunks. You should also know that the chunks do not have a fixed size, and that a second request to the same URL may arrive in chunks of different sizes than the first one. That is why you see chunk variables in code examples like this one. You have probably figured out by now that this fact has something to do with the first part of our article, where we are returning a long string of “composite” Unicode characters.

The problem

With some minor modifications, we are able to get the story of our snowman.

But when we run the code our console.log(data.length); does not always return the same number. Sometimes it is 49161 or 49158 or even 49163.

And the body is also different:

Did you notice the question marks and other strange symbols? Yes, the chunks have cut our Unicode characters in half in some places.

How can we fix the bug

First of all, we need to understand what a chunk is in this context. If you take a look at the logs you will notice that when we are logging the chunks they are printed as Buffer <E2, 98, 83, EF, B8, 8E, F0, 9F, 92, 97, F0, 9D, 84, 9E, E2, 98, 83, EF, B8, 8E, F0, 9F, 92, 97, F0, 9D, 84, …>, which lets us conclude that each chunk is a Buffer. So when we are doing data += chunk we are trying to add (concatenate) a Buffer to a string. In cases like this, JavaScript calls the toString() method of the Buffer, whose default encoding is utf8. And because of the partial Unicode characters at the end of some chunks, the toString() method returns question marks or other Unicode stuff that matches the partial character. Well, it is obvious now that treating the Buffer as a valid substring at this point of the code is incorrect.
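You do not need a network to see this happen. The sketch below fakes two chunks by cutting the bytes of a single heart in the middle (the split position is my own choice for illustration) and then does exactly what data += chunk does.

```javascript
const heart = '\u{1F497}';                 // the growing heart
const bytes = Buffer.from(heart, 'utf8');  // <Buffer f0 9f 92 97>

// Pretend the response arrived in two chunks, split mid-character.
const firstChunk = bytes.subarray(0, 2);
const secondChunk = bytes.subarray(2);

let data = '';
data += firstChunk;  // implicitly calls firstChunk.toString() with utf8
data += secondChunk; // same here, on bytes that are not a valid character

console.log(data === heart);          // false
console.log(data.includes('\uFFFD')); // true, U+FFFD is the � replacement character
```

Each half is decoded on its own, the decoder cannot make a character out of it, and we get replacement characters instead of the heart. That is also why the total length can end up larger than expected.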

Our Options

We have 2 options to fix the problem:

1. We can set the encoding on the response by calling resp.setEncoding('utf8');
2. We can push all the chunks in an array and then join them into a final Buffer that we can operate with.

The 1st option instructs the HTTP response to deliver its data as strings instead of Buffers. That is fine in almost all cases, but there are some edge cases that can cause this approach to not work, sometimes because of how the server returns the data or sets the headers of the response.

The 2nd option is the best solution you can have at this moment, and for a few good reasons:
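To get a feeling for why the 1st option helps, here is a small sketch using Node's string_decoder module. As far as I know this is the kind of stateful decoding that setEncoding() relies on, but treat that as my assumption and not as a description of the internals. The split position is again made up for illustration.

```javascript
const { StringDecoder } = require('string_decoder');

const clef = '\u{1D11E}';                // the G clef, 4 bytes in UTF-8
const bytes = Buffer.from(clef, 'utf8');

// A stateful decoder keeps incomplete bytes until the rest arrives.
const decoder = new StringDecoder('utf8');
const decoded =
  decoder.write(bytes.subarray(0, 3)) + // incomplete character, nothing is emitted yet
  decoder.write(bytes.subarray(3)) +    // the last byte completes the character
  decoder.end();                        // flush whatever is left (nothing here)

console.log(decoded === clef); // true
```

The decoder remembers the unfinished bytes between chunks, so the character is only emitted once it is complete.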

We are not forcing any encoding on the response which eliminates the concerns related to our 1st option.
We are not calling the toString() method on each chunk midway because we are not treating it as a string, yet.
We are not modifying any of the chunks at all.
We are not concatenating strings each time we receive a chunk.
We are not working with mutable variables (we do not need to have let in the code).
We are calling the toString() method only once.

And here is the code:

The only new thing in this code is the Buffer.concat() function which, as you have already guessed, concatenates an array of Buffers into a single Buffer that includes all of the data of all of the chunks in the correct order, without any characters being modified. And so, calling the toString() method on this Buffer is going to return the exact and correct response of the server.

Now your code understands all of the characters in the story correctly and knows that our ☃︎💗𝄞, thousands of times.
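For reference, a minimal sketch of the 2nd option could look like the code below. It is not necessarily identical to the code of the article: the readBody() helper is a hypothetical name of mine, and instead of a real request I simulate the response with an EventEmitter that emits awkwardly sized chunks, so the example can run without a network.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: collects raw chunks and decodes them only once, at the end.
function readBody(resp, onDone) {
  const chunks = [];
  resp.on('data', (chunk) => chunks.push(chunk)); // keep the Buffers untouched
  resp.on('end', () => onDone(Buffer.concat(chunks).toString('utf8')));
}

// With a real request it would be used roughly like this:
// https.get(url, (resp) => readBody(resp, (body) => console.log(body.length)));

// Simulated response: the story cut into 5-byte chunks, which splits characters.
const story = ('\u2603\uFE0E' + '\u{1F497}' + '\u{1D11E}').repeat(4);
const bytes = Buffer.from(story, 'utf8');
const fakeResp = new EventEmitter();

let result;
readBody(fakeResp, (body) => { result = body; });

for (let i = 0; i < bytes.length; i += 5) {
  fakeResp.emit('data', bytes.subarray(i, i + 5));
}
fakeResp.emit('end');

console.log(result === story); // true
```

Even though every chunk boundary here falls in the middle of some character, the final string is identical to the original, because nothing is decoded until all of the bytes are back together.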

Links:

Unicode® Character Table (unicode-table.com)

JavaScript’s internal character encoding: UCS-2 or UTF-16? (mathiasbynens.be)