Hideout, a caching tool for developing data-intensive projects

Takahiko Ito
May 18, 2020


Projects such as log data mining or machine learning handle large amounts of data. Developing scripts or libraries that process such data is tedious and time-consuming, since running them takes minutes or hours. This prevents us from applying good practices such as refactoring or unit testing, and in the end we cannot keep such projects clean.

This article shows how we keep data handling projects clean by caching with Hideout.

Problem: keeping the code quality of data handling projects

Common data handling projects have more than one step (or method call) that takes a long time. Some projects download input data from a data warehouse; others train machine learning models, which also takes a long time.

More complex projects contain multiple time-consuming steps.

Such projects, with many long-running steps, consume a lot of development resources and make it difficult to keep the quality of the project code high.

Inadequate solutions

To improve development efficiency, two solutions are commonly applied; neither is sufficient.

Add temporary files to the VCS repository

One solution is to add intermediate data, such as inputs loaded from the DWH or trained machine learning models, to the VCS repository.

├── Makefile
├── README.md
├── config
│   ├── __init__.py
│   └── env.py
├── data
│   ├── dictionary.dic
│   ├── preprocessed
│   │   ├── preprocessed_input1.txt
│   │   └── preprocessed_input2.txt
│   ├── models
│   │   ├── validation_model1.dat
│   │   └── validation_model2.dat
...

In the above project repository, preprocessed data and model files are stored in the data directory. These files are useful for development, since we can skip preprocessing or model creation by loading the data from the files.
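
For example, a development-time loader might look like the following sketch, where load_or_preprocess and preprocess are hypothetical helpers and the file path comes from the tree above:

import os

PREPROCESSED_PATH = "data/preprocessed/preprocessed_input1.txt"

def load_or_preprocess(raw_input_path):
    # During development, reuse the preprocessed file committed to the repository
    if os.path.exists(PREPROCESSED_PATH):
        with open(PREPROCESSED_PATH) as f:
            return f.read().splitlines()
    # Otherwise run the slow preprocessing step from scratch
    return preprocess(raw_input_path)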

Add caching procedures

Another solution is to add a caching procedure to the data generation functions. The following function adds caching steps at its beginning and end.

import os
import pickle

def generate_id_map(cache_file_path, force=False):
    # Load the cached object when it exists and a rebuild is not forced
    if os.path.exists(cache_file_path) and not force:
        with open(cache_file_path, mode='rb') as f:
            return pickle.load(f)

    # Generate the object (the slow step) and save it as a cache file
    id_map = _generate_map_impl()
    with open(cache_file_path, mode='wb') as f:
        pickle.dump(id_map, f)
    return id_map

This solution is a bit better than adding intermediate files to the project repository. We can remove the cache files at any time: when a cache file does not exist, the function above generates and saves it automatically with the latest settings.

This solution still has a problem. We need to add the caching procedure to every function that handles large data. The cost of implementing the procedure in all of these functions is quite high, and such procedures also sacrifice the readability of the functions.

Hideout

To cache the objects returned by specified functions, I made a tiny library, Hideout. Hideout stores and loads caches when we just add a decorator.

Basic usage

The following function takes 1000 seconds to run, and therefore we do not want to run it many times during development.

from time import sleep

def generate_large_object(times):
    sleep(1000)
    return map(lambda x: x * 2, range(times))

To cache the results, we just add the resumable decorator provided by Hideout.

@resumable()
def generate_large_object(times):
    sleep(1000)
    return map(lambda x: x * 2, range(times))

When we run the above function, Hideout does not save or load cache files by default. Only when the environment variable HIDEOUT_ENABLE_CACHE is set to true does Hideout generate cache files under the caches directory.
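
For example, if the decorated function lives in a script called generate.py (an illustrative name, not something Hideout requires), a development run with caching enabled looks like this:

HIDEOUT_ENABLE_CACHE=true python generate.py

The first run still pays the full cost of the slow function; later runs load the result from the cache file under the caches directory instead of recomputing it.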

Skip caching of specified stages

Large projects have multiple components that handle multiple data sources or convert them. In such a project, we sometimes want to suppress caching only for specific functions.

Hideout supports stages for this purpose. When we want to suppress caching for a certain stage, we add its name to the environment variable HIDEOUT_SKIP_STAGES.

For example, consider a project that has multiple data sources and steps to convert and integrate them.

The following is the function used in the final stage.

@resumable(stage="final")
def generate_large_object(data_1, data_2):
    sleep(1000)
    return data_1 + data_2

For example, if we want to suppress caching for the above function (generate_large_object), we run the command with HIDEOUT_SKIP_STAGES=final, and Hideout skips saving and loading cache files only for that stage.
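
With the same illustrative script name as before, and caching otherwise enabled, the run looks like this:

HIDEOUT_ENABLE_CACHE=true HIDEOUT_SKIP_STAGES=final python generate.py

Functions decorated with other stage names keep using their caches; only the final stage is recomputed on every run.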

Summary

This article showed how caching objects in data processing projects reduces development cost. First, we saw two inadequate solutions. Then it introduced a handy caching tool, Hideout, and showed how to apply it to cache objects and keep the code clean.
