Join Files

alex_ber
Published in Geek Culture
3 min read · Apr 12, 2021

Suppose you have a multi-threaded or multi-process application in which each thread or process creates its own file, and you want to join these files into one. The created files follow a naming convention, for example output*.txt, and the resulting file should be output.txt.

Note: You can put, for example, the thread or process name in place of the asterisk to make the file names unique enough.

Now, in the main process/thread, you want to create a final file (such as output.txt) that contains the content of all the created files, which are identified by a file mask such as output*.txt.

Note: If you expect some output-SpawnPoolWorker-2.txt and it is absent, you may want to add an indication of this to the final file. The code below will not do it automatically. If you want, you can create output-SpawnPoolWorker-2.txt with an error indication yourself before joining the results.
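One possible way to create such placeholder files is sketched below. This is not part of AlexBerUtils: the function name ensure_expected_parts, its parameters, and the marker text are hypothetical and only illustrate the idea of filling in absent partial files before joining.

```python
from pathlib import Path


def ensure_expected_parts(directory, expected_names,
                          marker="ERROR: no output produced\n"):
    """Create a placeholder for every expected partial file that is absent.

    Hypothetical helper: the names and the marker text are assumptions,
    not something the article or the library defines.
    """
    created = []
    for name in expected_names:
        part = Path(directory) / name
        if not part.exists():
            # Write the error indication so it ends up in the joined file.
            part.write_text(marker)
            created.append(part)
    return created
```

Calling it with the list of file names you expect, right before joining, leaves existing partial files untouched and returns the paths of the placeholders it created.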

To do this, I’ve created the join_files function.

You can find the source code here. It is available as part of my AlexBerUtils project.

You can install AlexBerUtils from PyPI:

python3 -m pip install -U alex-ber-utils

See here for a more detailed explanation of how to install it.

Let’s look at a code stub for the application:

Code of app.py:

Note: Here I’m using init_app_conf module. You can read about it here.

Note: Typically, the following lines appear at the end of the script/application’s main() function:

See Making relative path to file to work for my alternative setup.

To understand GuardedWorkerException, see Exception Propagation from another Process.

You can look at the Appendix in Exception Propagation from another Process to find the missing part of the code.

There are 2 interesting parts here: _write_partial_output() and join_files().

_write_partial_output() should really be part of some class, and account_id and user_id should be calculated values. The first part, lines 67-74, is just sample code showing how to generate a unique (enough) filename per process/thread: it takes the filename without its suffix from the configuration dict, appends a unique part, and retains the suffix (the directory is retained as well). The second part, lines 75-76, is a very simple-minded way to store the content as CSV. To repeat, this is just a sample and is not intended to be a general-purpose reusable code snippet.
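A minimal sketch of both parts might look like the following. It is not the code from app.py: the helper names unique_part_path and write_partial_output are hypothetical, the unique part is assumed to be built from the current process and thread names, and the standard csv module stands in for whatever the original uses to write the row.

```python
import csv
import multiprocessing
import threading
from pathlib import Path


def unique_part_path(output_file):
    """Return a per-process/thread variant of output_file.

    For /tmp/output.txt in the main process and main thread this gives
    /tmp/output-MainProcess-MainThread.txt (naming scheme is an assumption).
    """
    base = Path(output_file)
    unique = (f"{multiprocessing.current_process().name}"
              f"-{threading.current_thread().name}")
    # with_name() keeps the directory; stem and suffix are reused as-is.
    return base.with_name(f"{base.stem}-{unique}{base.suffix}")


def write_partial_output(output_file, account_id, user_id):
    """Append one CSV row to this process/thread's own partial file."""
    part_path = unique_part_path(output_file)
    with open(part_path, "a", newline="") as f:
        csv.writer(f).writerow([account_id, user_id])
    return part_path
```

Because every process/thread writes only to its own file, no locking is needed while the partial files are being produced.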

join_files(): before calling this function, we retrieve the full filename from the configuration dict (for example, /tmp/output.txt). join_files() looks into the directory that contains this file (for example, /tmp/); the file’s name becomes the name of the resulting file (for example, output.txt), and join_files() looks for output*.txt in /tmp/ to find all the files created by _write_partial_output().

How does join_files() work?

  1. First, it creates a file-like object (for writing) that refers to the output file.
  2. Then it iterates over the files that match the mask (such as output*.txt) and creates a file-like object (for reading) for each of them.
    Implementation note: I’m using pathlib.Path.glob().
  3. Then it copies the data chunk by chunk from the read file-like object to the write file-like object.
    Implementation note: I’m using shutil.copyfileobj().
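The three steps above can be sketched roughly as follows. This is an illustration written for this article, not the actual AlexBerUtils source: the name join_files_sketch and its single parameter are assumptions, and the real join_files() may have a different signature and behavior. The sketch also assumes the partial files should be joined in sorted name order and that the output file itself must be skipped, since the mask output*.txt also matches output.txt.

```python
import shutil
from pathlib import Path


def join_files_sketch(output_file):
    """Concatenate every file matching <stem>*<suffix> into output_file."""
    output_path = Path(output_file)
    # For /tmp/output.txt the mask is "output*.txt".
    mask = f"{output_path.stem}*{output_path.suffix}"
    # Collect the parts first and skip the output file itself,
    # because "output*.txt" also matches "output.txt".
    parts = sorted(
        p for p in output_path.parent.glob(mask)
        if p.is_file() and p.name != output_path.name
    )
    with open(output_path, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                # copyfileobj copies in chunks, so a large part is not
                # loaded into memory all at once.
                shutil.copyfileobj(src, dst)
    return parts
```

Opening the files in binary mode copies the bytes unchanged, so the sketch does not depend on the encoding or line endings of the partial files.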

