In this tutorial, we start investigating the Robotika platform. As a representative example, we have chosen a well-known fraud detection data set from Kaggle (Credit Card Fraud Detection). The data set is relatively small (~100 MB), but it is enough to explain the main concepts and show you the main features of the Robotika platform.
1. Creating a new account on Robotika
There are two ways to get into Robotika. The first is to be invited by a company that is already on Robotika. In this case, you will get access to all the pipelines created by the company, so be aware that anyone you invite to Robotika will have access to all of your created pipelines. The other way is to register in Robotika yourself via the following link.
2. Pipeline editing
Let us suppose that we have logged into Robotika successfully. In order to start deploying ML algorithms, we need to click “Create pipeline” in the navigation menu. A new “Edit Pipeline” window will open. Here you should fill in the pipeline’s name; in our case, we call it “Fraud_detection.” The name is how your pipeline is identified among the pipelines in your account. A basic description of the pipeline is also mandatory; it can simply repeat the pipeline’s name, or any other description will suffice.
Here we would like to emphasize that there are two ways Robotika can read your Python files. The first is via an S3 bucket: you may grant Robotika access to the bucket, or you may make the bucket public. The other option is to give access to your GitHub account via a git token. After clicking the “Next” button, we proceed to the “Pipeline Component Config” step.
3. Pipeline Component Config step.
Here is where the actual pipeline starts. You can see clickable boxes marked with an exclamation mark, which means that the corresponding pipeline component is not configured or is configured incorrectly. Some of the boxes in the pipeline are blurred, indicating that we cannot modify them during the creation of the pipeline; however, we can observe each component’s execution results after running the pipeline. We will take a closer look at these components in the following tutorials. Let us briefly describe the main components of the pipeline:
- ExampleGen ingests data into the pipeline.
- StatisticsGen generates statistics for the ingested data (view mode only).
- SchemaGen infers a schema for the data from those statistics (view mode only).
- ExampleValidator detects anomalies in the data (view mode only).
- Transform performs feature engineering on the data ingested at the ExampleGen step.
- Trainer trains the TensorFlow model.
- Evaluator evaluates the model produced by the training step.
- InfraValidator checks the model mechanically before bringing it to production.
- Pusher brings the model to production after training or retraining.
We intentionally kept the names of the original TFX components for easier knowledge transfer (see the TFX tutorial).
4. ExampleGen Component.
Let’s click on the ExampleGen box. The main idea of this component is to load the data. Currently, Robotika supports two file formats: CSV and Parquet. The data can be loaded from an S3 bucket.
In the field “Input Data location,” we should indicate the path to the files. Please note that Robotika reads all the files with the supported extensions (CSV, Parquet) in the indicated directory, so there is no need to point to a specific file; the folder’s name suffices.
The split configuration field indicates how the data is split. If the data is already divided into training and validation data sets, you need to provide paths to the corresponding files within the “Input Data location.” Otherwise, the data will be split according to the Training/Evaluation ratio.
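For instance, with a pre-split data set the “Input Data location” might point at a layout like the following (the bucket and file names here are hypothetical):

```
s3://<BUCKET_NAME>/fraud-data/
    train/creditcard_train.csv
    eval/creditcard_eval.csv
```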
It is worth mentioning that Robotika needs access to the S3 bucket, and there are two ways to handle this. First, you can grant Robotika access to the S3 bucket. To do this, specify the following policy with your own <BUCKET_NAME>:
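The exact policy is best copied from the Robotika console. As a sketch, a read-only bucket policy of the kind described might look like the following; the principal ARN is a placeholder, so substitute the account Robotika gives you:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRobotikaReadOnly",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<ROBOTIKA_ACCOUNT_ID>:root" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>",
        "arn:aws:s3:::<BUCKET_NAME>/*"
      ]
    }
  ]
}
```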
Another option would be to make the S3 bucket public. However, we do not recommend this from a security point of view.
5. Transform component
This is one of the most important components: it is where all the preprocessing happens. In the Robotika console you have to indicate the path to the transform file located in the GitHub repository you linked earlier (please see “Pipeline editing”). Next, we would like to show the code for the transform.
We would like to emphasize several major points. First, it is crucially important to name your function preprocessing_fn, because the TFX framework looks for exactly this function name. Second, the input parameter of preprocessing_fn is a map of raw features, and its output is the transformed feature map. In our example we only took a subset of the features for the final prediction; your preprocessing could be more complicated.
6. Trainer component.
The Trainer component is responsible for training the algorithm. This will most likely be the time-consuming part unless you prepared the algorithm in TFX beforehand. You need to specify the path to the code and the instance type; both CPU and GPU instances are available. You can also indicate a requirements file that lists the necessary Python libraries. Here we present an example of the trainer code:
We trained a Deep Learning model with two hidden layers. It is important to emphasize that the function run_fn contains the main functionality: Robotika identifies the function by exactly this name, so we highly recommend not changing it. Overall, the code structure is very similar to the TFX tutorial; we tried to make development maximally intuitive by inheriting the TFX code structure.
7. Evaluator component
In order for Robotika to evaluate your algorithm, you need to indicate the “Label Key.” This key should coincide with the label key used in the trainer code (run_fn). In our case, we use an accuracy score. We can choose whether Robotika compares the evaluation results against a threshold: Robotika can check whether the selected metric is better than some threshold value, and the system will deploy the model if and only if the metric of the trained model satisfies the condition. For example, we indicated that the accuracy score should be higher than 0.95. Therefore, if the model achieves an accuracy above 0.95 at the evaluation step, it will be deployed to production.
InfraValidator is a component that ensures that an ML model is deployed correctly. It is a formal part that checks the model before sending it to production. However, you still have control over the parameters here.
If the “operation type” is set to “Only test model loading,” then no testing is performed after deployment; otherwise, the tests will be performed. We advise setting the maximum loading time to at least 15 seconds, since even the smallest model can take a significant time to deploy on the first attempt.
The parameter “Number of retries before failing” indicates how many attempts the system makes before giving up. We highly recommend setting this parameter to at least 2: model deployment is a complicated procedure whose duration can vary significantly, even for the same model, and retries ensure that the model is properly deployed.
In addition to the described parameters, you can indicate a Model Tag to automate model management and track billing more efficiently.
The Pusher component enables you to actually push the trained model into production. In Robotika, you can parallelize predictions so as to have several concurrent predictions if necessary. For now, this parameter can be set up to 16; however, we will increase this limit very soon. The memory size field indicates the memory available to each prediction process. Finally, you can indicate your personal AWS account, to which Robotika will grant access for the created endpoint.
Hopefully, all the components are now checked, and you can press the “Next” button. Now it is time to schedule the periodicity of retraining for your algorithm. Sometimes the data changes over time, and the algorithm has to be retrained often; for this purpose, we created a Scheduler. You can choose how often the algorithm is retrained. In our example, retraining happens once a week. Please note that the first execution will start as soon as you save the schedule!
The execution of the algorithm happens equidistantly in time. For example, if you choose to retrain the algorithm twice a day, it happens every 12 hours after the scheduler is saved.
Moreover, you can specify the schedule in cron syntax if you are familiar with it. In general, the cron scheduler gives you slightly more flexibility in choosing the timing of your retraining process. For example, you can set your algorithm to run only on Wednesdays at noon.
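For instance, in the standard five-field cron syntax (minute, hour, day of month, month, day of week), “every Wednesday at noon” is written as:

```
0 12 * * 3
```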
Congratulations, you are one click away from running the algorithm on Robotika. The only thing left is to click SAVE, and the algorithm will start executing. On the “PIPELINE RUNS” page you can observe the results of the run. Please keep in mind that executing the entire pipeline usually takes some time, depending on the size of the data and the algorithm’s complexity. For our specific fraud data set, it took Robotika around 5 minutes to read the data and execute the ML algorithm. If everything went well, you will see that the pipeline is COMPLETED. The endpoint is created immediately, and the algorithm is fully deployed in production.
However, if something goes wrong, you will see the red indicator “FAILED.” Please do not be disappointed if your algorithm does not work on Robotika on the first attempt. In the next tutorial, we will explain the details of debugging and what can be done to avoid common pitfalls.