Hashing big files with FileReader in JavaScript

Luca Vaccaro
5 min read · May 5, 2018


The purpose of this post is to discuss a client-side technique for hashing big files (more than 1 GB) inside your browser. The file is loaded locally, in the browser, with the standard FileReader object.

In this article I start by showing the easiest way to load and hash a single file; unfortunately, this procedure doesn't work for big files, because browsers have different compatibility and memory constraints. Then I describe two different solutions to read and hash big files.


Simple File Loader

The FileReader object lets web applications asynchronously read the contents of files (or raw data buffers) stored on the user's computer, using File or Blob objects to specify the file or data to read.

The FileReader.onload event is triggered each time a reading operation completes successfully. The following code shows how FileReader works:

// `event` comes from a change listener on an <input type="file"> element
var file = event.target.files[0];
var reader = new FileReader();
reader.onload = function (event) {
    // result holds the whole file content as a binary string
    var data = event.target.result;
    console.log('Data: ' + data);
};
reader.readAsBinaryString(file);

The loaded file is stored in the browser's RAM. In Chrome 45 the limit seems to be 261 MB, but there is no common standard and each browser has a different memory limit. Unfortunately there is no error (FileReader.error == null) when the size is above that limit; the result is just an empty string.
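
Based on the behaviour described above, a simple sanity check can detect this silent failure (a minimal sketch, reusing the file variable from the snippet above):

// Minimal sketch: detect the silent failure described above,
// reusing the file selected in the previous snippet.
var reader = new FileReader();
reader.onload = function (event) {
    var data = event.target.result;
    if (file.size > 0 && data.length === 0 && reader.error === null) {
        // the browser hit its memory limit and returned an empty string
        console.warn('File too big for a single readAsBinaryString(); read it in chunks instead.');
        return;
    }
    console.log('Loaded ' + data.length + ' bytes');
};
reader.readAsBinaryString(file);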

Hashing a file

To hash a file, I need to import a crypto library: CryptoJS. This is a JavaScript library of crypto standards that supports many hashing and encryption functions, such as SHA-1, SHA-256, HMAC-SHA256, AES, and more.

The following code shows how to hash a plain-text string:

var plain = "Hello World!";
var encrypted = CryptoJS.SHA256(plain); // a WordArray; toString() returns the hex digest

CryptoJS also supports building a hash incrementally from chunks. The previous atomic example could also be written in the following way, with the same result:

var plain0 = "Hello ";
var plain1 = "World!";
// create a progressive hasher and feed it chunk by chunk
var hash = CryptoJS.algo.SHA256.create();
hash.update(plain0);
hash.update(plain1);
var encrypted = hash.finalize().toString(); // hex digest
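
As a quick sanity check, the two approaches can be compared directly (reusing the variables defined in the snippets above):

// the atomic and the incremental digest are identical
var atomic = CryptoJS.SHA256(plain0 + plain1).toString();
console.log(atomic === encrypted); // true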

Hashing a loaded file

In order to hash the loaded file, combining the code from the previous examples, the implementation looks like the following:

var file = event.target.files[0];
var reader = new FileReader();
reader.onload = function (event) {
    var data = event.target.result;
    // hash the whole file content in one shot
    var encrypted = CryptoJS.SHA256(data);
    console.log('encrypted: ' + encrypted);
};
reader.readAsBinaryString(file);

This technique keeps all the loading problems described in the "Simple File Loader" section, so it is not feasible for big files.

Loading & hashing in chunks

In order to solve the problems with big files, the best approach is to load the file asynchronously in chunks and hash the chunks incrementally.

The reading procedure is split into multiple reading operations, so that a single chunk is read and hashed asynchronously each time. The following code calls a loading procedure (shown below) with two callback parameters:

  • an onProgress callback for each chunk,
  • an onFinish callback at the end of the reading.

var file = event.target.files[0];
var SHA256 = CryptoJS.algo.SHA256.create(); // progressive hasher
var counter = 0;
loading(file,
    function (data) {
        // onProgress: hash the chunk (an ArrayBuffer) and report progress
        var wordBuffer = CryptoJS.lib.WordArray.create(data);
        SHA256.update(wordBuffer);
        counter += data.byteLength;
        console.log(((counter / file.size) * 100).toFixed(0) + '%');
    },
    function (data) {
        // onFinish: finalize the hash
        console.log('100%');
        var encrypted = SHA256.finalize().toString();
        console.log('encrypted: ' + encrypted);
    });

The loading procedure asynchronously reads chunks of 1 MB. This chunkSize parameter is customisable: a bigger chunk size uses more memory and takes less time; on the other hand, a bigger chunk size can hit the browser's memory limit (remember the reading is asynchronous, so several chunks can exist in memory at the same time). The loading procedure is the following:

function loading(file, callbackProgress, callbackFinal) {
    var chunkSize = 1024 * 1024; // bytes
    var offset = 0;
    var size = chunkSize;
    var partial;
    var index = 0;

    if (file.size === 0) {
        callbackFinal();
    }

    while (offset < file.size) {
        // slice the next chunk and read it asynchronously
        partial = file.slice(offset, offset + size);
        var reader = new FileReader();
        // attach chunk metadata to the reader so the callback knows its position
        reader.size = chunkSize;
        reader.offset = offset;
        reader.index = index;
        reader.onload = function (evt) {
            callbackRead(this, file, evt, callbackProgress, callbackFinal);
        };
        reader.readAsArrayBuffer(partial);
        offset += chunkSize;
        index += 1;
    }
}

Each time a chunk is successfully read, the callbackRead function checks whether the current chunk is the last one, in order to call the right callback.

function callbackRead(reader, file, evt, callbackProgress, callbackFinal) {
    callbackProgress(evt.target.result);
    // last chunk: the end of this chunk reaches (or passes) the end of the file
    if (reader.offset + reader.size >= file.size) {
        callbackFinal();
    }
}

Reordering chunks with time-shifting

In my tests on different browsers, the asynchronous callbacks are not called in order: the size of the last chunk is smaller than the fixed maximum chunk size, so the last chunk can arrive before the others. As a result, the callback functions run out of order and the hash update goes wrong.

For this reason, I built a patch to order chunks with a time-shifting technique: if a chunk is not in order, wait some time (10 ms) and retry.

var lastOffset = 0;

function callbackRead(reader, file, evt, callbackProgress, callbackFinal) {
    if (lastOffset === reader.offset) {
        // chunk is in order: hash it and move the expected offset forward
        lastOffset = reader.offset + reader.size;
        callbackProgress(evt.target.result);
        if (reader.offset + reader.size >= file.size) {
            callbackFinal();
        }
    } else {
        // chunk is out of order: wait 10 ms and retry
        setTimeout(function () {
            callbackRead(reader, file, evt, callbackProgress, callbackFinal);
        }, 10);
    }
}

Reordering chunks with memory-buffering

An alternative approach to reordering chunks is based on memory: store out-of-order chunks in a buffer and, for each new in-order chunk, check the buffer and process any previously received chunks that are now in order. On the other hand, if there are too many chunks in the buffer, the browser's memory can saturate.

var lastOffset = 0;
var previous = []; // buffer of out-of-order chunks

function callbackRead(reader, file, evt, callbackProgress, callbackFinal) {

    if (lastOffset !== reader.offset) {
        // out-of-order chunk: store it in the buffer and wait
        previous.push({ offset: reader.offset, size: reader.size, result: reader.result });
        return;
    }

    function parseResult(offset, size, result) {
        lastOffset = offset + size;
        callbackProgress(result);
        if (offset + size >= file.size) {
            lastOffset = 0;
            callbackFinal();
        }
    }

    // in-order chunk: hash it immediately
    parseResult(reader.offset, reader.size, reader.result);

    // then drain any buffered chunks that are now in order
    var buffered = [{}];
    while (buffered.length > 0) {
        buffered = previous.filter(function (item) {
            return item.offset === lastOffset;
        });
        buffered.forEach(function (item) {
            parseResult(item.offset, item.size, item.result);
            previous.splice(previous.indexOf(item), 1); // remove it from the buffer
        });
    }
}

Performance

I tested both techniques on Chrome 65.0 on macOS 10.12.6, with chunk sizes of 1 MB and 10 MB.

For a 100 MB file, the total number of 1 MB chunks is 100. With memory-buffering, the number of out-of-order chunks that had to be buffered was 32. With time-shifting, the number of delayed (retried) chunk callbacks was 1124, because there is no good way to estimate the delay before the next check.

A better optimisation could be made by tuning the chunkSize value, which is a tradeoff (sketched after this list):

  • small chunkSize requires more time
  • big chunkSize requires more memory
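
One way to experiment with this tradeoff is to turn the hard-coded chunk size of loading() into a parameter. A minimal sketch (loadingWithChunkSize is a hypothetical variant, not part of the original code, reusing the callbackRead from the previous sections):

// Hypothetical variant of loading() where chunkSize is a parameter
// instead of a hard-coded 1 MB, to make the tradeoff easy to benchmark.
function loadingWithChunkSize(file, chunkSize, callbackProgress, callbackFinal) {
    var offset = 0;
    var index = 0;

    if (file.size === 0) {
        callbackFinal();
    }

    while (offset < file.size) {
        var partial = file.slice(offset, offset + chunkSize);
        var reader = new FileReader();
        reader.size = chunkSize;
        reader.offset = offset;
        reader.index = index;
        reader.onload = function (evt) {
            callbackRead(this, file, evt, callbackProgress, callbackFinal);
        };
        reader.readAsArrayBuffer(partial);
        offset += chunkSize;
        index += 1;
    }
}

// e.g. 10 MB chunks: fewer callbacks, but more memory in flight
// loadingWithChunkSize(file, 10 * 1024 * 1024, onProgress, onFinish);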

Compared with memory-buffering, time-shifting is slower but uses much less memory.

Conclusion

I successfully tested these techniques on multiple browsers, such as Chrome, Firefox and Internet Explorer, with files larger than 1 GB. The reading process is fully client side, so there are no constraints related to bandwidth or network.

Maybe there are better techniques available on the Internet; please let me know by sharing and commenting on this post.
