Version control system to integrate analysis and simulation

TomHortons
22 min read · Feb 24, 2023


This article describes the system I use privately to go from currency trading analysis to automated trading.

Originally, it was easy to get confused when repeating detailed analyses endlessly, and managing the versions of the programs used in those analyses tended to take time. Data Version Control (DVC) is a management method I considered adopting to solve such problems. In private development, however, not only the data but also the code used for analysis tended to become complex, so the original idea was to manage the entire analysis program by imitating the DVC mechanism.

The three main components of this article are: 1. a version control method to efficiently develop applications using time-series data; 2. building a simulator; and 3. a mechanism to actually run the application.

1. Before reading

This article is about organizing the analytical infrastructure that I had personally built up piecemeal across various data sets before I started automated forex trading. I do not build it for every type of data analysis, but rather when my understanding of the target environment is still vague and I do not yet know the evaluation function to measure performance.

The term “simulator” sounds like an exaggeration, but this is nothing as elaborate as MATLAB. The basic purpose is to do simple calculations and to check that you understand rough environmental changes. I do not envision a situation where you are writing a working program against a target you are already familiar with; rather, you create simulations as a means of understanding the data in depth through your own trial-and-error process.

The building environment is Google Cloud Platform (GCP), the language is Python, and the way the analysis methods are combined is based on the open source Data Version Control (DVC).

The following targets are suitable for building the infrastructure of this article.

  1. Application decisions do not have a significant impact on the target environment
  2. The understanding of the target environment is ambiguous
  3. Undecided on approach to achieve objectives
  4. Long-term analysis required

(1) For example, an application could be based on changes in energy prices. Gas and electricity prices set by energy suppliers are not affected by the application’s decisions.
(2) If you have some understanding of the target environment, but do not understand the terminology or cannot immediately make assumptions such as “if A occurs, it will be affected by B and C and then D,” there is enough merit in creating your own simulation. You can break away from being a beginner who merely thinks they understand.
(3) In an environment where unreliable methods are prevalent, there is great merit in making your own evaluations and drawing conclusions based on numbers.
(4) If you are in a situation where a large amount of trial and error is expected, application version control can help.

Conversely, the following targets are not suitable:

  1. Application decisions affect the target environment
  2. You are familiar with the target environment and do not see the benefit of building your own simulator
  3. The approach to be run in the simulation has been determined and the only bottleneck is the calculation speed
  4. The analysis can be completed quickly and easily in a few months or so

(1) For example, a change in purchasing behavior due to price setting is not suitable, since the amount of the offer has a direct impact on the data. Even if purchasing behavior can be modeled using a probability distribution, it is inappropriate because it is necessary to run the simulation many times to evaluate its validity.
(2) If you are familiar with the target environment, the approach and evaluation function will already be clear. A great application can be created simply by continuing to implement it, and version control for analysis is a waste of man-hours.
(3) Using off-the-shelf simulations or building them in a vendor-dependent language will speed up the computation.
(4) Version control is not necessary if you can get an obvious answer after just touching and graphing the data for a few months.

Creating your own automatic control environment or simulation takes a lot of time and effort, but it is useful for trying approaches that are not supported by commercial platforms and for coming up with original ideas that are not commonly used. Of course, Python is extremely slow compared to C or vendor-dependent languages. However, this is not a problem, since the goal is not raw computation speed. If you find a great approach in your Python analysis, you can then rewrite it in a vendor-dependent language. As I have said many times, I integrate everything from simulation to execution as a means to deepen my understanding of the data and algorithms and to come up with new ideas.

2. What you can do

A series of programs that process data consists of one or more modules (Python scripts). For convenience, I refer to this set of scripts as “Strategy” and the modules as “Logic”.
Here is what the system described in this article accomplishes:

  1. Build versioned DAG applications (Strategy) from the Notebook
  2. Strategy can be evaluated by simulation using time series data as input
  3. Strategy used in simulations can be run in the cloud
Fig.1 Overall image

As a result of achieving 1–3, it is possible to easily manage applications in operation and past performance evaluations without having to worry about version control of the code (Fig. 1). The Data Version Control mechanism is used as the management method for Strategy. No troublesome work is required to change the running Strategy; simply select the active Strategy in the database and the new Strategy will be built on the server automatically.

In the following sections, I will first introduce the flow of Strategy development (item 1). Next, I will introduce items 2 and 3 with specific examples.

3. Build version-controlled DAG applications from within the notebook

While it would be great if Strategy and Logic could be kept perfectly packaged in the same folder, this is not easy when you are dealing with the same data over a long period of time. The commit IDs of the Logic change daily, the modifications may affect the performance of the Strategy for better or worse, and the number of Logic modules that make up the Strategy may increase or decrease from day to day. Sometimes I forget that the source data was changed to see the impact of an input, and am left with the false impression that performance has improved over the Strategy of a few months ago.

A similar situation where these problems tend to occur is in analyses involving machine learning and deep learning.

3.1. DVC: Data Version Control

Unless the data has been prepared for study, such as in a tutorial, improving a machine learning model requires several steps of persistent analysis. Data preprocessing, feature engineering, modification of the evaluation function, source data versioning, and management of the evaluated results are required. If you find yourself asking, “Which version of this evaluation did I record a year ago?” time disappears by the day as you have to relearn and check (my experience as a student).

Fig.2 DAG example

DVC is an open source solution to the version control problems that are common in machine learning. All processes required for analysis and training, including any external data, can be expressed in the form of a Directed Acyclic Graph (DAG), which can be simply described as “having a beginning and an end direction: Directed” and “having no closed loops: Acyclic.” In Fig. 2, each node is a script, such as a .py file, and the order in which the scripts are executed is indicated by the arrows.
DVC registers the scripts used for analysis and training as a DAG, and records the DAG’s evaluations and input data together in a .dvc file. This .dvc file is managed entirely using hash values, similar to .git, and also uses git information (blobs, trees) tied to the DAG. These operations manage 1. the set of scripts used in the analysis, 2. the order in which those scripts are executed, 3. the performance of 2, and 4. the source data used to evaluate 2.

Small and short-term analysis work does not require this much tedious setup. It is rather time-consuming to split the analysis procedure into separate files for each function, or to execute lengthy commands to specify DAGs. However, as the complexity of the work increases, it becomes more convenient to manage the code as DAGs, and reverting to an old DAG by its commit ID becomes a frequent operation. For these reasons, DVC makes it easy to manage, in addition to general analysis, the machine learning training process, which is time-consuming to reproduce.

3.2. Application of DVC to Strategy

Fig.3 Image of getting Logic from a blueprint

Unlike DVC, Strategy does not use a .dvc file; all Strategy information is managed in Firestore. Each Strategy is assigned a unique ID automatically generated by Firestore, and the corresponding Logic’s git information is stored with it (Fig. 3). When the application is launched, the Strategy blueprint is loaded and the Logic code is retrieved from GitHub. The blueprint in Fig. 3 contains three Logic entries and their execution order; each Logic is retrieved from GitHub and built into the Strategy.
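As a rough illustration, a blueprint like the one in Fig. 3 could be read back from Firestore along the following lines. This is a minimal sketch: the collection name, document layout, and field names are my assumptions, not the actual schema.

```python
from google.cloud import firestore

# Minimal sketch: collection/field names ("strategies", "entries", "order",
# "commit") are assumptions, not the real schema used in this environment.
db = firestore.Client()
strategy_id = "<auto-generated-strategy-id>"   # placeholder
blueprint = db.collection("strategies").document(strategy_id).get().to_dict()

# e.g. blueprint["entries"] == [{"logic_id": "...", "commit": "abc123", "order": 1}, ...]
logics = sorted(blueprint["entries"], key=lambda e: e["order"])
for entry in logics:
    print(entry["logic_id"], entry["commit"])   # fetch each Logic from GitHub here
```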

(Note: Due to a problem that occurred during a specification change, pulling Logic from GitHub is currently not available.)

Fig.4 Information associated with Strategy ID

Managing by Strategy has many advantages in addition to ease of building. For example, when a Strategy is evaluated in a simulation, source data information and evaluation information are generated. If the source data is obtained from an external service, the source information is the content of the API request. If the source data comes from CSV files in S3 or Cloud Storage, the source information is the path to those files. Since evaluation information differs greatly depending on the analysis target, I simply link Notebooks filled with evaluation code by their commit IDs.

There are two points to note.
The system described in this article does not use DVC directly. It mimics the DVC management process using Firestore as a means of effortlessly managing analysis-related information (DAG).

Another point to note is the way Strategy is managed. In DVC, if you update an existing DAG, you can revert to a previous version by checking out the commit ID. In the environment described in this article, however, I am either 1. creating a new Strategy or 2. overwriting an existing Strategy. If you really want the same design as DVC, you should version the same Strategy itself: for example, if a new Logic is added to the existing Strategy#1, the old ID and the new ID could coexist under the name Strategy#1. However, I have not developed it that far because the current usability already satisfies me.

3.3 Development Flow of Strategy

The series of mechanisms described in this article is packaged together in a Python module called Exploratory. The basic flow of Strategy development using Exploratory is as follows:

  1. Analyze target data in Notebook
  2. Create Logic in Notebook or edit existing Logic file
  3. Register Logic and Strategy to Firestore
  4. Evaluate Strategy in simulation
  5. Return to 1

If you want to edit an existing Strategy, skip step 3. A Logic contains a handler that takes data and context as input, and is designed to be as close to Google Cloud Functions as possible.

Fig.5 Logic Input

Fig. 5 shows the actual code, where data is the input data and context is the environment information. As of January 2023, two more inputs are also used: store, a collection of functions for reading and writing Firestore documents and trained models, and logger, a collection of functions for writing to the Kubernetes logs. In the simulation environment, parts of store and logger are replaced by dummy functions.
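As a rough sketch of what such a handler can look like (the store and logger interfaces and the field names below are assumptions, not the code in Fig. 5):

```python
def handler(data, context):
    # data: input time-series for this run; context: environment information.
    # context.store / context.logger mirror the Firestore and Kubernetes-log
    # helpers described above; their method names here are hypothetical.
    closes = data["close"]
    state = context.store.load("account")
    signal = "BUY" if closes[-1] > sum(closes) / len(closes) else "HOLD"
    context.logger.info({"signal": signal, "balance": state.get("balance")})
    return {"signal": signal}
```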

Fig.6 Editing Strategy in Notebook

Fig.6 shows how to add a new Logic to an already created Strategy in Notebook. A Logic with the ID “dtUNJ0diwv9neflqYtV8” is added to the entries group of Strategy#7.
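Expressed directly against Firestore (in practice this goes through the Exploratory helper shown in Fig. 6), the operation is roughly the following; the document path and the entries field name are assumptions on my part.

```python
from google.cloud import firestore

db = firestore.Client()
# Append the Logic ID to the "entries" group of Strategy#7 (document ID is a placeholder).
db.collection("strategies").document("<strategy-7-id>").update(
    {"entries": firestore.ArrayUnion(["dtUNJ0diwv9neflqYtV8"])}
)
```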

Fig.7 Image of work becoming independent

I created this system to keep the development of Logic, the development of Strategy, and the selection of the Strategy to run separated by commit ID (Fig. 7). For example, if a Logic receives three commits on different branches, the Strategy side does not need to manage three different versions of that Logic. Even if the Logic’s commits change, the Strategy can be used as is without considering those edits. Also, if you want to compare the performance of an old Strategy with that of a new one, you can easily reproduce the environmental conditions used in the old Strategy’s performance evaluation.

To be honest, if you are just running an application on a Python server, there is no need for such a cumbersome mechanism. However, after several years of analyzing various data in a private environment and trying various ways to improve the efficiency of version control of the small modules corresponding to Logic and the overall program corresponding to Strategy, I have settled on this situation.

4. Strategy can be simulated with time-series data as input

When I start working with new data, I personally find it most useful to build a simulation. Not only does this facilitate the evaluation of strategies, but it also allows for a deeper understanding of the data through reproduction of the environment under analysis.

Fig.8 General view of the simulation

Fig. 8 shows an overview of the simulation. The simulation is executed by the Process and Repository classes. Process acquires the conditions at startup, user information from Firebase, and time-series data for the simulation interval from external sources. Repository consists of a Strategy and an Account that manages the internal state.

Repository is the class that invokes Strategy and executes it once per input. Internally, there is a Strategy that makes decisions and an Account that manages the state.
To illustrate this in a forex trading scenario, if Strategy buys USDJPY using 1% of the money in its account and the amount in the account is $10,000, then $100 is put at risk. Of course, if it loses, the account will be reduced to $9,900, and the next trade will be for $99. In this case, the account information in Account is referred to by Strategy.
Let me consider the well-known strategy, the martingale method. The martingale method is a strategy in which you double your stake every time you lose, and you get back the amount you lost after just one win (I personally do not recommend this strategy). In this case, the Strategy retrieves two states from the Account: the previous stake and the result of the win or loss.
Both examples receive data once per hour and output the results of the calculation. The class that provides this sequence of processing is Repository.
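A rough sketch of this relationship, with hypothetical class and method names rather than the real ones, looks like this:

```python
class Account:
    # Internal state the Strategy can refer to (balance, last stake, last result).
    def __init__(self, balance):
        self.balance = balance       # e.g. 10_000 dollars
        self.last_stake = 0.0        # needed by stateful strategies such as martingale
        self.last_result = None      # "win" or "loss"

class Repository:
    # Invokes the Strategy once per input and exposes the Account state to it.
    def __init__(self, strategy, account):
        self.strategy = strategy
        self.account = account

    def execute(self, window):
        return self.strategy.run(window, self.account)
```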

Process is the core class of the simulation, which executes loop processing and reproduces external services.
Take a forex trading example. One year’s worth of data is acquired from the trade server (an external service) via API, converted into hourly input data, and repeatedly fed into Repository. At a certain time, when USDJPY is at 130.00, Repository outputs Buy (1 lot). If this were not a simulation, the trade server would execute the Buy, and if USDJPY rises thereafter, the unrealized profit in the account increases.
Suppose Repository not only buys (1 lot) at USDJPY 130.00, but also outputs a setting to automatically cut losses when the price drops to 129.50. Unfortunately, when you look at the data 30 minutes later, the price of USDJPY has dropped to 129.00. If you actually held that position, the loss cut would be executed by the trade server after 30 minutes, without waiting for the hourly Repository run. Process is therefore also responsible for reproducing events that occur outside of the Repository’s execution timing.
Thus, in addition to loop processing, Process reproduces the processing of external services that will be executed within an hour.
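A simplified sketch of the Process loop, again under hypothetical naming, might look like this:

```python
class Process:
    # Feeds one window per iteration into Repository and reproduces what the
    # external service would have done between runs (fills, stop-losses, ...).
    def __init__(self, repository, series, sub_windows):
        self.repository = repository
        self.series = series            # full time-series array for the interval
        self.sub_windows = sub_windows  # index matrix built as in Section 4.1

    def run(self):
        results = []
        for window in self.sub_windows:
            chunk = self.series[window]
            orders = self.repository.execute(chunk)
            self._apply_external_events(orders, chunk)
            results.append(orders)
        return results

    def _apply_external_events(self, orders, chunk):
        # Hypothetical hook: e.g. trigger a stop-loss that fires mid-hour.
        pass
```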

In the following sections, I will introduce two of the functions provided by Process in the simulation: 4.1. loop processing using vectorization and 4.2. parameter setting.

4.1. Loop Processing with Vectorization

Loop processing in simulations tends to complicate the design of inputs. Therefore, I vectorize the inputs with a moving window in NumPy. This unifies the format of complex loop processing and keeps it as manageable as possible.

Here is the link to the article I referred to for vectorization with a moving window. It is very clearly organized, using as an example time series data of the state of an electric circuit measured at 9 different locations.
Vectorization here is only a means to unify the loop processing for each input; it is not intended to speed up the process. The premise of that article differs from this environment: here, the processing speed of Strategy is the bottleneck, and no matter how fast NumPy’s own loop processing is, if Strategy is slow, no speedup can be expected. I will discuss this in detail in Section 4.2.

If the time-series data to be looped over is limited to one type, loop processing is very easy to implement, and even if the analysis target changes, the specification barely changes. In a realistic simulation, however, the time-series data must be split and handled as a multivariate analysis. Not only does the number of inputs increase, but the step interval often needs to be larger than 1 and the data format tends to differ across environments. Therefore, I have unified the way inputs are vectorized and loaded into Repository.

Fig.9 Using a moving window for exchange information

Fig.9 shows the time series data of EURAUD in the exchange market. Time step on the horizontal axis is the number of price data arranged in chronological order. OHLC is the opening price (Open), highest price (High), lowest price (Low), and closing price (Close) within a certain timeframe (Timeframe). In reality, the number of inputs increases further because detailed information is calculated for each account currency.

The moving rectangle (window) is the input for one loop, the window size is 30 samples, and the sub window on the right side is the detail. For example, if Strategy calculates the average price over the last 30 hours, Sub window 1060 receives input for the 30 hours from 1030 to 1059.
The colored intervals are the looping interval (green) and the clearing interval (light blue). For example, if a simulation is run from 1030 to 1140, the first 30 samples (the window size) form the clearing interval and are not looped over, and the loop is executed 101 times over the looped interval from 1060 to 1140.

Fig.10 Vectorization code

The actual code used is shown in Fig. 10. S is the start position of the moving window, K is the window size, V is the step size, and T is the maximum time. sub_windows is a vectorized filter that can be applied to time series data to obtain the desired data. This allows me to loop over sub_windows, and to avoid bringing the original time-series data into the loop process as much as possible.
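My reconstruction of the Fig. 10 idea is below; the exact indexing in the real code may differ.

```python
import numpy as np

def make_sub_windows(S, K, V, T):
    # S: start position, K: window size, V: step size, T: maximum time.
    # Each row of the returned matrix holds the K indices of one window.
    starts = np.arange(S, T - K + 1, V)
    return starts[:, None] + np.arange(K)[None, :]

# Usage: data[make_sub_windows(0, 30, 1, len(data))] has shape (num_windows, 30),
# so the simulation loop only iterates over the rows of this index matrix.
```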

4.2. Parameter setting and parallel processing

Parameter validation is inevitable when aiming to improve performance based on analysis results. No matter how much analysis is done in Notebook, there will be few cases where the parameters are determined without any parameter fitting. In the simulations in this article, parameter verification is performed efficiently in parallel processing.

The assumption is that Python is much slower than C or vendor-dependent languages. This is not a matter of superiority or inferiority, but a difference that follows from each language’s purpose. In particular, in this environment Strategy makes very heavy use of Pandas and for loops, so speeding up that processing is not a consideration. Naturally, if the number of parameter candidates grows to several tens, or if the validation interval is very long, the simulation time grows rapidly. However, in my own use of this environment, the slow computation speed has never been a fatal problem.
The strategy that has the potential to perform well is the one that is robust and performs above a certain level with realistic parameter candidates. This means that if performance deteriorates significantly by changing parameters or conditions, then it is not the right time to validate the parameters in the first place. In the case of forex trading, a strategy whose performance fluctuates wildly when the indicator values are changed slightly or the evaluation period is shifted by a few years is poorly designed.

The simulation environment introduced in this article is mostly used to seek a strategy at a level where parameter verification makes sense. Naturally, in such a situation, even the evaluation function that evaluates the performance of the Strategy is not yet set, and the definition of “performance” is considered while analyzing in the Notebook. Conversely, at the stage where the evaluation function and strategy design have been determined and calculation speed becomes a bottleneck, an environment that enables high-speed calculations should be created on an off-the-shelf platform. In such a case, Excel should be a good substitute for Python for analyzing data.

As described above, this environment is not very concerned with calculation speed, but a minimum level of practicality is still necessary. This does not mean I have created any special mechanism myself; I use the Python library Joblib to improve calculation speed through parallel processing.

Fig.11 Image of parallel processing of three parameters

In Fig. 11, three LogicB parameter verifications are entered in the configuration file that determines the simulation conditions. When the simulation is started, the configuration file is read, Strategy is built for the number of parameter candidates, and parallel computation is performed. As a result, the calculation can be done faster than if the strategy is calculated three times in a straightforward manner.
Note that parallel processing can only be executed on targets that have no internal state. To use an example from currency trading, parallel processing cannot be performed in one-year intervals when calculating using three years of USDJPY data. This is because the previous year’s trading results and the database status at the time the year changes will affect the next year’s trades. Parallel processing can only be performed for parameter-specific calculations that are not affected by internal conditions.
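A minimal sketch of this parameter sweep with Joblib follows; evaluate() is a stand-in for the real “build the Strategy for one candidate and simulate it” step.

```python
from joblib import Parallel, delayed

def evaluate(params):
    # Placeholder for: build Strategy with `params`, run Process over the interval,
    # and return the performance numbers. Each candidate is independent (no shared state).
    return {"params": params, "score": 0.0}

candidates = [{"period": 20}, {"period": 50}, {"period": 100}]
results = Parallel(n_jobs=-1)(delayed(evaluate)(p) for p in candidates)
```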

Fig.12 Loading parameters

In order to use the parameters specified by Strategy, they must be loaded in Logic beforehand. All data types supported by Firestore can be used as parameter types.

Supported data types | Firestore | Firebase (firebase.google.com)
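As a rough illustration, a Logic might read its parameters like this; the accessor and field names are hypothetical, and any Firestore-supported type can be stored.

```python
def handler(data, context):
    # Hypothetical accessor: parameters stored on the Strategy document in Firestore.
    params = context.store.load("parameters")
    period = int(params.get("period", 30))           # Firestore number
    threshold = float(params.get("threshold", 0.5))  # Firestore number
    return {"period": period, "threshold": threshold}
```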

Using the above mechanism, Logic with different algorithms is simulated in parallel processing.

5. You can create an application to run Strategy in the cloud

The built Strategy runs on a Python server. Access to the Python server from outside is formatted by the Encoder. The Repository checks which Strategy is active and gets its Logic from GitHub. The Strategy build, the internal state management of Account, and the Repository behavior are all shared with the simulation. The Repository’s output is converted to JSON format by the Encoder.
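A minimal sketch of the Encoder role (the method names are mine, not the actual interface):

```python
import json

class Encoder:
    @staticmethod
    def decode(request_body: bytes) -> dict:
        # Client request (e.g. hourly price information) -> Repository input.
        return json.loads(request_body)

    @staticmethod
    def encode(orders: dict) -> str:
        # Repository output (e.g. a buy order) -> JSON for the client.
        return json.dumps(orders)
```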

Fig.13 Application of strategy

Fig. 13 shows an image of the Repository building a Strategy and executing the application. The Python server receives price information from the client every hour, which the Encoder formats and inputs into the Repository. Once the Strategy decides to take a position, the Encoder formats the order and sends it to the client.

The mechanism is almost the same as in the simulator, except that the Process class does not exist. The simulator reads the Strategy to be built from the configuration file, whereas the application builds the pre-specified active Strategy. In the process of running a Strategy, I do not include any work other than selecting the active Strategy. When I am concentrating on analysis, I do not want to interrupt it with even a single step of work other than analysis. With this system, I only need to touch Firestore once when launching a new Strategy.

This policy also applies to updates at the Repository level, where Strategy runs. The startup and monitoring processes eliminate manual infrastructure work as much as possible. In the following sections, I will introduce 5.1. the CI/CD pipeline using Cloud Build and Kubernetes Engine and 5.2. notifications to the mobile app.

5.1. CI/CD pipeline with CloudBuild and Kubernetes Engine

The Strategy that has been evaluated in the simulation runs on Kubernetes. To change the Strategy, simply select the active Strategy in Firestore. But to change anything above the Strategy (Repository, Account, Encoder, etc.), the entire application needs to be updated. There are two GitHub repositories for this purpose: a repository with the source code for this environment and an env repository containing the Kubernetes manifest. To revert the application to a previous state, the env version is specified in Cloud Build and rolled back.

Prior to organizing the environment for this article, I was manually building the infrastructure environment in the AWS and GCP console screens for each analysis target. However, in organizing this environment, I decided to fully automate the process. At the beginning, I had no idea how large the application would be, so I unified and managed the infrastructure with a Kubernetes manifest that could be extended in any way I wanted.
In addition, before I started focusing on strategy development, I mainly developed around the Repository class, and even after the simulation was finished and the application was up and running, minor updates would be made. When I was looking for a design method in the GCP documentation that would allow for minimum test automation (CI) and efficient updating of the production environment (CD), I found the following article and decided to implement it.

GitOps-style continuous delivery with Cloud Build | Kubernetes Engine | Google Cloud (cloud.google.com)

The documentation uses Cloud Source Repositories, but here I use GitHub instead. The detailed flow is described in the documentation, and anyone can easily build it by following along. Here is what I am happy to have with this CI/CD pipeline:

  • Reviewing the Cloud Build history to distinguish between failed and successful deployments.
  • Check the production branch of the env repository to access the manifest currently in use.
  • Re-run the corresponding Cloud Build build to roll back to the previous version.
Fig.14 CI/CD pipeline including GitHub

Fig. 14 shows the Github and GCP configuration diagram. The blue background is the primary repository with manual modifications, and the orange background is the env repository with the Kubernetes manifest that is automatically updated.
For example, when you add functionality to the Account class, you push it to the primary repository on GitHub, where Cloud Build begins building a new version and, if successful, saves the built image to Artifact Registry. At the same time, Cloud Build updates the env repository and rewrites the Kubernetes manifest with the new image; GitHub then sends the env repository’s trigger to Cloud Build, which updates GKE upon a successful build.

If there is a mistake in the Account with the newly added functionality, click on the version of the env repository you want to roll back to in Cloud Build’s history and run “Rebuild” (in the Applies Manifests section). As a result, GKE will automatically roll back and revert to the previous version.

A simpler design might have been possible, as I am not currently using a particularly complex infrastructure and do not need to scale. However, I am personally happy enough that deployment and rollback preparation happen simply by updating the GitHub repository.

5.2. Notifications to the smartphone application

After getting into automated forex trading, I wanted push notifications on my smartphone whenever the active application takes action, such as when it opens a new position or when the conditions for a trade with a large expected value are met. Any smartphone app would have been fine, so I send notifications to LINE. The notification functions are all packed into the Exploratory module; it is just a toy that sends a message to an endpoint stored in Firestore, but it has been more useful than I had hoped.

Fig.15 Image of push notification

Two push notification functions are used (Fig. 15): one is a health check executed periodically by Cloud Functions, and the other is a trigger that fires when the output matches the condition after Strategy is executed. In both cases, the endpoints and tokens are stored in Firestore, and the destination is changed based on the terminal information in the Account class.

The health check monitors client access frequency, not the Python server on GKE. When a client accesses the Python server, the Python server updates the Firestore timestamp. If there is no update for more than a certain amount of time on a weekday, Cloud Functions will retrieve the notification settings from the Firestore and send an alert to the app.
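A sketch of such a health check as a Cloud Function is shown below. The Firestore document layout and the use of the LINE Notify endpoint are my assumptions, and the weekday check is omitted for brevity.

```python
import datetime
import requests
from google.cloud import firestore

def health_check(event, context):
    # Scheduled background function. Collection/field names are assumptions.
    db = firestore.Client()
    doc = db.collection("health").document("python-server").get().to_dict()
    age = datetime.datetime.now(datetime.timezone.utc) - doc["updated_at"]
    if age > datetime.timedelta(hours=2):
        requests.post(
            "https://notify-api.line.me/api/notify",
            headers={"Authorization": f"Bearer {doc['token']}"},
            data={"message": "No client access for over 2 hours"},
        )
```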

Fig.16 Triggering by Notification Rule

Triggers of push notifications are executed after Repository execution (Fig. 16). This is used to send notifications when the Repository behaves unexpectedly or when you want to manually adjust the decision making.

An example of past use is a notification for a failed Firestore save. When I was using a Strategy that did not carry positions over the weekend, I added functionality to the Account class and updated the Strategy: if a position is still open one hour before the end of trading on Friday, it closes the position at an appropriate price. At the time, testing and validation were still in their infancy, and the system stopped because of a prohibited format (an array nested in an array) that occurs only under special circumstances. I received an alert on LINE and closed the position manually.

Conclusion

This article introduced a version control system that integrates “data analysis on Notebook,” “simple simulation,” and “implementation in a real environment.” The Strategy used for decision making is shared among the three phases, and version management based on DVC keeps development and evaluation independent. As a result, the complexity of the analysis work has been reduced and the management of the built algorithms has improved.

I wanted to make the system as packaged as possible, but the simulation and user state management mechanisms vary widely depending on the environment. As a result, I settled on separating them into several classes (Strategy, Account, Repository, and Process). Since I generally use this infrastructure when I analyze data in private, I intend to continue to develop the framework as much as possible.
