How to split any large file into multiple smaller files by size or by “number of files” using Java, and join them back

Malkeith Singh
5 min read · Oct 22, 2022


This is going to be more than just a “how to” blog. Along with providing a thorough answer to the question, my intention is to clear up some misconceptions about Files and InputStreams in Java and to explain some key concepts along the way. Think of it as a tutorial rather than a plain “how to” post.

If you want to skip ahead and look directly at the solution, you can find it here.

Now that we have got that out of the way, let’s get started.

Let’s start with splitting the large file “by size” first. The method signature should look something like this:

public List<File> splitBySize(File largeFile, int maxChunkSize) {
    // split the file by size
}

The method takes two parameters: largeFile, a File object referencing the large file, and maxChunkSize, the maximum size in bytes that each of the smaller chunked files can be. For example, a value of 1024 for maxChunkSize means that each of the smaller chunks will be at most 1 KB in size. The method returns the list of chunked files as its output.

For those of you wondering whether the largeFile variable already holds the entire file in memory: that’s not the case. When you do a

File input = new File(INPUT_FILE_PATH);

the input object does not hold the entire file in memory, only the metadata about it. To load the actual data into memory, you have to open an InputStream to the file and read from it. How you read from it determines how much data gets stored in memory.
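As a quick illustration (the path here is just an example, not from the article’s code), you can query a file’s metadata without any of its contents being read:

File input = new File("/tmp/large-file.bin"); // example path
System.out.println(input.length()); // size in bytes, taken from file system metadata
System.out.println(input.exists()); // still no file contents loaded into memory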

Our objective is to hold only maxChunkSize bytes of data in memory at a time. This way we can control how much data is read into memory and avoid any OutOfMemoryError.

This is also how you read and process any large file in Java: you pull only a fixed amount of data from an InputStream at a time, process it, then get rid of it, i.e., remove any references to it from your program.
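Here is a minimal sketch of that pattern, assuming a hypothetical processChunk(byte[], int) callback and an example path (it needs the usual java.io and java.nio.file imports):

try (InputStream in = Files.newInputStream(Paths.get("/tmp/large-file.bin"))) {
    byte[] buffer = new byte[8192]; // only this much file data lives in memory at once
    int n;
    while ((n = in.read(buffer)) > -1) {
        processChunk(buffer, n); // handle the n valid bytes, then reuse the buffer
    }
}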

Moving on, let’s look at the implementation.

As mentioned above, the first thing you need to do is open an InputStream to the file, so let’s do that:

InputStream in = Files.newInputStream(largeFile.toPath());

Note: at this point, none of the file’s actual contents have been read into memory yet.

Next we need to ensure that only maxChunkSize bytes of data are read into memory at a time. To that end, let’s create a byte array of size maxChunkSize:

byte[] buffer = new byte[maxChunkSize];

This is where memory is actually allocated. We will use this array as a buffer; only this much data will be loaded into memory at a time.

Now, to read the actual data, we call the read method on the opened input stream and pass the buffer in, like so:

int dataRead = in.read(buffer);

Here, the dataRead variable tells us how many bytes were actually read and written into the buffer. For example, if we reach the end of the file before the buffer is filled, dataRead tells us how many bytes were actually read; once there is nothing left to read, read returns -1.

Finally, the complete implementation of the method:

public List<File> splitBySize(File largeFile, int maxChunkSize) throws IOException {
    List<File> list = new ArrayList<>();
    try (InputStream in = Files.newInputStream(largeFile.toPath())) {
        final byte[] buffer = new byte[maxChunkSize];
        int dataRead = in.read(buffer);
        while (dataRead > -1) {
            File fileChunk = stageFile(buffer, dataRead);
            list.add(fileChunk);
            dataRead = in.read(buffer);
        }
    }
    return list;
}

I have already explained the important bits of the code; the rest should be self-explanatory. If you want more information on the above implementation, let me know in the comments and I will explain it there.

Note: the stageFile method above takes the buffer and creates a file containing only the first dataRead bytes of the buffer as its contents.

For example, if dataRead equals the buffer size, the created file will contain the entire contents of the buffer; if, on the other hand, dataRead is 100, the created file will contain only the first 100 bytes of the buffer.

stageFile implementation:

private File stageFile(byte[] buffer, int length) throws IOException {
    // TEMP_DIRECTORY is a constant holding the path of the directory for the chunk files
    File outputFile = File.createTempFile("temp-", "-split", new File(TEMP_DIRECTORY));
    try (FileOutputStream fos = new FileOutputStream(outputFile)) {
        // write only the first `length` valid bytes of the buffer
        fos.write(buffer, 0, length);
    }
    return outputFile;
}

The above implementation is for the local file system, but you can easily modify it to work with any cloud or network storage.

Extras: you can always abstract the stageFile method out so you can work with different implementations, such as one for cloud storage and one for local storage; a sketch of one such abstraction follows below. I have already done this as part of my use case and you can find that code here.
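As a rough sketch of what that abstraction could look like (the ChunkStore and LocalChunkStore names are hypothetical, not the actual names from my code; it assumes the usual java.io imports):

interface ChunkStore {
    // stages the first `length` bytes of the buffer as a new chunk file and returns it
    File stage(byte[] buffer, int length) throws IOException;
}

class LocalChunkStore implements ChunkStore {
    private final File tempDir;

    LocalChunkStore(File tempDir) {
        this.tempDir = tempDir;
    }

    @Override
    public File stage(byte[] buffer, int length) throws IOException {
        File out = File.createTempFile("temp-", "-split", tempDir);
        try (FileOutputStream fos = new FileOutputStream(out)) {
            fos.write(buffer, 0, length);
        }
        return out;
    }
}

A cloud storage implementation would implement the same interface and upload the buffer contents instead of writing them to the local disk.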

Split by “number of files”:

Moving on to the second part of the question: what if you want to split the large file into a specific number of smaller files? That can be achieved easily. All we need to do is figure out the size in bytes of each smaller file, and then we can reuse the “split by size” implementation discussed above.

The code below does just that: given totalBytes and numberOfFiles, it returns the size in bytes that each of the smaller files should be.

private int getSizeInBytes(long totalBytes, int numberOfFiles) {
    if (totalBytes % numberOfFiles != 0) {
        // round totalBytes up to the next multiple of numberOfFiles,
        // making the division below a ceiling division
        totalBytes = ((totalBytes / numberOfFiles) + 1) * numberOfFiles;
    }
    long x = totalBytes / numberOfFiles;
    if (x > Integer.MAX_VALUE) {
        throw new NumberFormatException("Byte chunk too large");
    }
    return (int) x;
}
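To see the arithmetic in action: getSizeInBytes(1000, 3) finds that 1000 is not a multiple of 3, rounds it up to ((1000 / 3) + 1) * 3 = 1002, and returns 1002 / 3 = 334. Splitting a 1000-byte file with a maxChunkSize of 334 then yields chunks of 334, 334, and 332 bytes, i.e., exactly three files.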

Using the method above together with our “split by size” implementation, we can derive the implementation below:

public List<File> splitByNumberOfFiles(File largeFile, int noOfFiles) throws IOException {
    return splitBySize(largeFile, getSizeInBytes(largeFile.length(), noOfFiles));
}

Finally, let’s move on to the third and last part of the question: joining the files back.

We can join the files back the same way we split them. We read each file, in the order they are to be joined, and copy its contents to a single OutputStream.

Again, note that only a fixed buffer’s worth of memory is consumed at a time; Files.copy streams each file through a small internal buffer rather than loading it whole.

Below is the implementation for the method:

public File join(List<File> list) throws IOException {
    File outputFile = File.createTempFile("temp-", "unsplit", new File(TEMP_DIRECTORY));
    try (FileOutputStream fos = new FileOutputStream(outputFile)) {
        for (File file : list) {
            // streams the chunk into the output; only an internal buffer is held in memory
            Files.copy(file.toPath(), fos);
        }
    }
    return outputFile;
}
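To tie it all together, here is a hypothetical usage example (FileSplitter is an assumed name for a class holding the methods above, and the path is just an example):

FileSplitter splitter = new FileSplitter();
File large = new File("/tmp/large-file.bin");
List<File> chunks = splitter.splitByNumberOfFiles(large, 4); // four roughly equal chunks
File restored = splitter.join(chunks);
System.out.println(restored.length() == large.length()); // true if the round trip worked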

With this we have reached the end of the article. If you have any questions or suggestions, please mention them in the comments and I will get back to you.

Here is a link to the full code.

If you found this article helpful, you can support me by following me; you can also buy me a coffee here. Also, let me know what you would like me to write about next.

That’s all folks, see you in the next one.


Malkeith Singh

Malkeith is a web developer with a strong affinity for backend development. He loves designing and developing backend systems for the web.