Process HTTP body line-by-line

Mayank C
Tech Tonic


In some use cases, a file upload server may need to process the request body line-by-line. For example: read the uploaded text file, count the words in it, and return the word count to the client. There are two ways to do this:

  • Convert the uploaded file to text, split by LF, and process the lines
  • Pipe the uploaded file stream through an LF stream processor, and process the lines

In this article, we’ll learn how to efficiently process an HTTP body line-by-line.

First way — Without streams

The first way is fast, easy to use, and suitable for small files. There are three simple steps:

  • Convert the body to a string (using the .text() API)
  • Split the body by LF (using the .split() API)
  • Process the lines by splitting each line by space and adding to the word count

Here is an implementation of the above algorithm:

import { serve } from "https://deno.land/std/http/mod.ts";

const port = 9000;

async function handleRequest(req: Request) {
  let words = 0;
  if (req.body) {
    // Buffer the entire request body in memory as one string
    const file = await req.text();
    for (const line of file.split("\n")) {
      words += line.split(" ").length;
    }
  }
  return new Response(words.toString() + "\n");
}

serve(handleRequest, { port });
console.log(`Listening on http://localhost:${port}`);

Let’s do a run with a ~2M text file containing ~39K lines:

$ du -kh textFile2M.txt 
1.8M textFile2M.txt
$ wc -l textFile2M.txt
39012 textFile2M.txt
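
If you want to reproduce the test, any large text file will do. Here is a minimal Deno sketch that generates one; the file name, line contents, and line count are illustrative assumptions, not values taken from this article:

// generate_test_file.ts (hypothetical helper, not part of the server)
// Repeats a short ~60-byte line ~39K times, giving a file of roughly 2M.
// Increase the count (e.g. towards ~1M lines) to produce the larger test file.
const line = "lorem ipsum dolor sit amet consectetur adipiscing elit sed do";
const body = new Array(39_000).fill(line).join("\n") + "\n";
await Deno.writeTextFile("textFile2M.txt", body);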

A hundred requests are made using curl’s parallel mode:

$ curl -i --parallel --parallel-immediate --parallel-max 5 --data-binary "@./textFile2M.txt" --config /var/tmp/curl.txt
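
The contents of the curl config file aren't shown here. A hypothetical /var/tmp/curl.txt for this kind of test would simply list the target URL once per request (the port assumed to match the server above):

# assumed contents of /var/tmp/curl.txt: one url entry per request, repeated 100 times
url = "http://localhost:9000/"
url = "http://localhost:9000/"
...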

The processing was very fast. The peak memory usage was:

Physical footprint (peak):  79.5M

Now, let’s repeat the test with a ~50M text file containing ~1M lines:

$ du -kh textFile50M.txt 
50M textFile50M.txt
$ wc -l textFile50M.txt
1053486 textFile50M.txt

A hundred requests are made using curl’s parallel mode:

$ curl -i --parallel --parallel-immediate --parallel-max 5 --data-binary "@./textFile50M.txt" --config /var/tmp/curl.txt

The processing was still very fast. The peak memory usage was:

Physical footprint (peak):  660.3M

That was a sharp increase in memory usage! This is expected: .text() buffers each request body (and all of its split lines) fully in memory, so a few parallel 50M uploads add up quickly. When the load stops, the memory usage does go down.

That’s all about the first way. Let’s have a look at the stream way and find out if we can do better.

Second way — Streams

The second way is also easy, but a bit slower than the first. Its advantage is that it keeps memory usage under control.

There are three simple steps:

  • Import the LineStream API from the standard library’s streams module
  • Pipe the request body through LineStream (using the .pipeThrough() API)
  • Process the yielded lines by splitting each line by space and adding to the word count

Here is the implementation of the above algorithm:

import { serve } from "https://deno.land/std/http/mod.ts";
import { LineStream } from "https://deno.land/std/streams/delimiter.ts";

const port = 9000;
const td = new TextDecoder();
const dec = (b: Uint8Array) => td.decode(b);

async function handleRequest(req: Request) {
  let words = 0;
  if (req.body) {
    // Split the body stream on LF boundaries; each yielded chunk is one line (as bytes)
    const file = req.body.pipeThrough(new LineStream());
    for await (const line of file) {
      words += dec(line).split(" ").length;
    }
  }
  return new Response(words.toString() + "\n");
}

serve(handleRequest, { port });
console.log(`Listening on http://localhost:${port}`);
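
As a side note (not used in the measurements below), newer releases of the standard library deprecate LineStream in favor of TextLineStream, which yields strings instead of Uint8Array chunks, so the manual TextDecoder is no longer needed. A roughly equivalent handler, assuming a std version whose streams module exports TextLineStream, could look like this:

import { serve } from "https://deno.land/std/http/mod.ts";
import { TextLineStream } from "https://deno.land/std/streams/delimiter.ts";

const port = 9000;

async function handleRequest(req: Request) {
  let words = 0;
  if (req.body) {
    // Decode bytes to text first, then split the text stream on line boundaries
    const lines = req.body
      .pipeThrough(new TextDecoderStream())
      .pipeThrough(new TextLineStream());
    for await (const line of lines) {
      words += line.split(" ").length;
    }
  }
  return new Response(words.toString() + "\n");
}

serve(handleRequest, { port });
console.log(`Listening on http://localhost:${port}`);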

Let’s repeat the first test with a ~2M text file containing ~39K lines:

$ du -kh textFile2M.txt 
1.8M textFile2M.txt
$ wc -l textFile2M.txt
39012 textFile2M.txt

A hundred requests are made using curl’s parallel mode:

$ curl -i --parallel --parallel-immediate --parallel-max 5 --data-binary "@./textFile2M.txt" --config /var/tmp/curl.txt

The processing was quite slow compared to the first way. The peak memory usage was:

Physical footprint (peak):  58.4M

The peak memory usage is lower than with the first way, but not by a big margin.

Now, let’s repeat the second test with a ~50M text file containing ~1M lines:

$ du -kh textFile50M.txt 
50M textFile50M.txt
$ wc -l textFile50M.txt
1053486 textFile50M.txt

A hundred requests are made using curl’s parallel mode:

$ curl -i --parallel --parallel-immediate --parallel-max 5 --data-binary "@./textFile50M.txt" --config /var/tmp/curl.txt

The processing was quite slow when compared to the first way. The peak memory usage was:

Physical footprint (peak):  59.1M

Now, that’s a huge difference from the first way, where the peak memory usage for 50M files was ~660M.

Regardless of the file size (2M or 50M), the memory usage stayed at ~59M, as the stream processor only holds a small chunk of the body (and the current line) in memory at any given time.
