Join Files
Suppose you have a multi-threaded or multi-process application in which each thread/process creates its own file, and you want to join those files into one. The created files follow some naming convention, for example output*.txt, and the resulting file should be output.txt.
Note: You can put, for example, the thread/process name in place of the asterisk to make the names unique enough.
Now, in the main process/thread, you want to create a final file (such as output.txt) that contains the content of all the created files (they are identified by a file mask, such as output*.txt).
Note: If you expect some file, say output-SpawnPoolWorker-2.txt, and it is absent, you may want to add some indication of that to the final file. The code below will not do it automatically. If you want, you can create output-SpawnPoolWorker-2.txt with an error indication yourself before joining the results.
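One possible way to create such placeholder files is sketched below. The helper name ensure_partial_files, its parameters, and the marker text are all hypothetical; they are not part of AlexBerUtils. The sketch assumes you know in advance which partial file names to expect.

```python
from pathlib import Path


def ensure_partial_files(directory, expected_names,
                         marker="ERROR: no output was produced\n"):
    """Create a placeholder for every expected partial file that is missing.

    Hypothetical helper: existing files are left untouched; each missing
    file is created with the given marker text as its only content.
    Returns the list of paths that were created.
    """
    created = []
    for name in expected_names:
        path = Path(directory) / name
        if not path.exists():
            path.write_text(marker, encoding="utf-8")
            created.append(path)
    return created
```

Calling it right before the join step means the final file carries a visible marker for each worker that produced nothing.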
To do this, I’ve created the join_files function.
You can find the source code here. It is available as part of my AlexBerUtils project.
You can install AlexBerUtils from PyPI:
python3 -m pip install -U alex-ber-utils
See here for a more detailed explanation of how to install.
Let’s look at a code stub for the application:
Code of app.py:
Note: Here I’m using the init_app_conf module. You can read about it here.
Note: Typically, at the end of the script/application’s main() function, there are the following lines:
See Making more to relative path to file to work for my alternative setup.
To understand GuardedWorkerException, see Exception Propagation from another Process.
You can look at the Appendix in Exception Propagation from another Process to find the missing part of the code.
There are two interesting parts here: _write_partial_output() and join_files().
_write_partial_output() should really be part of some class, and account_id and user_id should be calculated values. The first part, lines 67-74, is just sample code showing how to generate a unique (enough) filename per process/thread: it takes the filename without its suffix from the configuration dict, appends some unique part, and retains the suffix (the directory is retained too). The second part, lines 75-76, is a very simple-minded way to store the content as CSV. To repeat, this is just a sample and is not intended to be a general-purpose reusable code snippet.
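The two steps described above might look roughly like the sketch below. This is not the code from app.py: the function names, the parameters, and the choice of the current process name as the unique part are assumptions made for illustration only.

```python
import csv
import multiprocessing
from pathlib import Path


def partial_output_path(output_file, unique_part=None):
    """Build a per-process filename from the final output filename.

    Hypothetical helper: keeps the directory and the suffix, and appends
    a unique part to the stem. By default (an assumption) the unique part
    is the name of the current process.
    """
    base = Path(output_file)
    if unique_part is None:
        unique_part = multiprocessing.current_process().name
    return base.with_name(f"{base.stem}-{unique_part}{base.suffix}")


def write_partial_output(output_file, rows, unique_part=None):
    """Store rows as CSV in the per-process file and return its path."""
    path = partial_output_path(output_file, unique_part)
    # newline="" lets the csv module control line endings itself
    with path.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return path
```

For example, partial_output_path("/tmp/output.txt", "SpawnPoolWorker-2") yields a path whose filename is output-SpawnPoolWorker-2.txt, which matches the output*.txt mask.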
join_files(): before calling this function, we retrieve the full filename from the configuration dict (for example, /tmp/output.txt). join_files() will look into the directory that the path points to (for example, /tmp/); the path’s filename will be the filename of the resulting file (for example, output.txt), and join_files() will look for output*.txt in /tmp/ to find all the files created in _write_partial_output().
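As an illustration of how a directory, a result filename, and a mask can be derived from a single path, here is a small sketch. The helper split_output_path is hypothetical, and placing the asterisk between the stem and the suffix is an assumption based on the output*.txt example; the library may build its mask differently.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def split_output_path(output_file):
    """Return (directory, result filename, mask) for an output path.

    Hypothetical helper: the mask is the stem, an asterisk, then the suffix.
    """
    path = Path(output_file)
    mask = f"{path.stem}*{path.suffix}"
    return path.parent, path.name, mask


directory, result_name, mask = split_output_path("/tmp/output.txt")
# mask is "output*.txt"; the asterisk also matches the empty string,
# so the result file itself matches the mask once it exists
matches_itself = fnmatchcase(result_name, mask)
```

Because the result file matches its own mask, a hand-written join needs to skip it when collecting the partial files.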
How does join_files() work?
- First, it creates a file-like object (for writing) that refers to the output file.
- Then it iterates over the files that match the mask (such as output*.txt) and creates a file-like object (for reading) for each of them.
Implementation note: I’m using pathlib.Path.glob().
- Then it copies the data chunk by chunk from the read file-like object to the write file-like object.
Implementation note: I’m using shutil.copyfileobj().
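The steps above can be sketched with the standard library alone. This is a minimal illustration, not the actual AlexBerUtils implementation: the name join_files_sketch, the way the mask is built, the sorted order, and the explicit skipping of the output file are all assumptions.

```python
import shutil
from pathlib import Path


def join_files_sketch(output_file):
    """Concatenate all files matching <stem>*<suffix> into output_file.

    Hypothetical sketch of the steps described above. Returns the list
    of partial files that were copied, in the order they were copied.
    """
    out_path = Path(output_file)
    mask = f"{out_path.stem}*{out_path.suffix}"  # e.g. output*.txt
    copied = []
    # Step 1: open the output file for writing (binary, so no newline translation)
    with out_path.open("wb") as dst:
        # Step 2: iterate over the files matching the mask; sorting gives a
        # deterministic order, and the output file itself is skipped because
        # it matches its own mask
        for part in sorted(out_path.parent.glob(mask)):
            if part.name == out_path.name or not part.is_file():
                continue
            # Step 3: copy the data chunk by chunk
            with part.open("rb") as src:
                shutil.copyfileobj(src, dst)
            copied.append(part)
    return copied
```

Opening both sides in binary mode keeps the copy byte-for-byte, so the sketch does not depend on the encoding of the partial files.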
See also:
- Integrating Python’s logging and warnings packages.
- fixabscwd() function in mains module or Making relative path to file to work.
- fix_retry_env() function in mains module or Make path to file on Windows works on Linux.
- FixRelCwd() function in mains module or Making more to relative path to file to work.
- GuardedWorkerException() function in mains module or Exception Propagation from another Process.
- join_files() function in files module or Join Files.
- stdLogging module, or My stdLogging Module