Data Apps: From Local to Live in 10 Minutes

Salma Amr
Published in talabat Tech
Apr 4, 2022 · 6 min read

In a typical Data Science workflow, it is quite common to build prototypes in the form of scripts (e.g. in Python). This applies to anything from analysis and visualization tools to sophisticated algorithms or Machine Learning models. Prototype development takes a significant portion of Data Scientists’ time and effort. Hence, an easy way to deploy a prototype as a web app is much needed. This allows business users to interact with these models and analyses, collect feedback, and iterate, all before rushing into productization.

At Talabat, we have three common ways of bridging our data products to interface with the business and product teams:

  • Data Apps
  • Tables on our Data Platform (facts, dimensions, aggregates …)
  • Direct integration with Engineering services (S3, GCS, API …)

This post explains how the Talabat Machine Learning Ops team built a simple yet elegant pipeline that takes our Machine Learning models and analyses live in a few minutes, with minimal effort required from Data Scientists. It does so by turning their scripts into shareable web apps that business owners can access and use. Let’s explore how this can be done.

Design Principles

We’ve developed this approach with the following five principles in mind:

  • Versatility: Can host any kind of application, including public web applications, APIs, and webhooks, as well as private internal micro-services, data transformations, and background jobs, potentially triggered asynchronously by Pub/Sub events or Cloud Tasks.
  • Seamlessness: Requires minimal development effort and offers a smooth Data Scientist experience and a seamless CI/CD process.
  • Consistency: Produces consistent results across the development and production environments.
  • Scalability: Scales up and down depending on usage.
  • Security: Serves apps behind a secured load balancer so that only the relevant stakeholders can access them.

Tools

We use the following tools in our approach:

  • Streamlit: An open-source Python framework for building web apps for Machine Learning and Data Science. Streamlit lets you develop and deploy web apps instantly, creating an app the same way you write a Python script, and it makes the interactive loop of coding and viewing results in the web app seamless. You could certainly use other Python-based web-app frameworks instead, such as Gradio (which gained popularity in 2021), Dash, or ipywidgets. For us, Streamlit combines the best of both worlds: it offers good control while remaining easy to develop with, especially for the UI. A minimal example of a Streamlit app follows this list.
  • Docker: Builds container images to be deployed at scale.
  • Cloud Run: A managed serverless compute platform on Google Cloud Platform (GCP) that runs containers invocable via requests or events and abstracts away all infrastructure management. Our data infrastructure is mainly built on top of GCP; if yours is not, you can host the app on your own server instead.
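To give a sense of how little code a Streamlit app needs, here is a minimal, illustrative sketch. The title, dataframe, and widgets are made up for the example and are not part of our actual apps:

import pandas as pd
import streamlit as st

# Page title and a short description
st.title("Order Volume Explorer")
st.write("Pick a city to see its order count.")

# Illustrative data; a real app would load this from a file or a warehouse query
data = pd.DataFrame({
    "city": ["Dubai", "Cairo", "Kuwait City"],
    "orders": [1200, 950, 700],
})

# A dropdown and a metric widget wired together
city = st.selectbox("City", data["city"])
st.metric("Orders", int(data.loc[data["city"] == city, "orders"].iloc[0]))

You can preview such an app locally with streamlit run app.py.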

End-to-end Flow

To provide a seamless app hosting experience that minimizes overhead for our end users, i.e. Data Scientists, we designed the following process. The first two steps are straightforward and are the only ones that need to be done by the user:

  • Create a GitHub repo following a specific structure (more details below).
  • Push the app code and merge it into the branch designated for deployment.

This triggers the pipeline which then automatically:

  • Dockerizes the app and pushes the latest container image to the container registry.
  • Transfers dependent data files (if any).
  • Finally, deploys the app on Cloud Run.

Voilà, the app gets deployed behind a secured load balancer, and a few minutes later it can be accessed through a given URL, which we then add to our Apps Directory Portal so that the relevant stakeholders can reach it. Below is the high-level solution architecture.

Data Apps high-level solution architecture of the pipeline flow

Let’s dig deeper into each component separately.

1. Application Development

The first step is the actual development of the application itself. Every app has its own logic and code living in a separate GitHub repository. It is important that we keep the same repo structure across all of our Data Apps repositories, following coding and documentation best practices. This makes it easy for anyone to debug an application. Below is our repo structure.

Data Apps repo structure
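The repo structure image isn’t reproduced here; roughly, each Data App repo contains files along these lines (the README.md name and the data/ folder are illustrative, the rest are the files referenced throughout this post):

app.py             # Streamlit application entry point
requirements.txt   # Python dependencies
run.sh             # script that starts the app inside the container
Dockerfile         # container build steps
cloudbuild.yaml    # CI/CD pipeline steps executed by Cloud Build
data/              # dependent data files, if any
README.md          # documentation for the app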

2. Application Dockerization

To Dockerize the application, the Dockerfile must cover the following steps:

1. Define the base image with FROM

Every Dockerfile must start with the FROM instruction, since we need a starting point to build the image. We can start FROM scratch, where scratch is an explicitly empty image on Docker Hub that is used to build base images like Alpine, Debian, etc. We can also start from any valid image pulled from a public registry; the image we start from is called the base image. In our case, let’s add FROM python:3.7 to our Dockerfile.

2. Install image tools and set app environment

This step installs any required dependencies, updates app- or system-level tools (such as Python), sets the time zone, and adds a user with the appropriate permissions. Concretely:

  • Install Python dependencies from the requirements.txt file.
  • Add any environment variables needed.
  • Specify run.sh, which runs the app inside the container.

The sample Dockerfile below implements the build steps described above and is used to build the image and start up the container.

Sample Dockerfile
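The embedded gist isn’t shown here; a minimal sketch consistent with the steps above might look like the following. The time zone value, user name, and port are assumptions for the example:

FROM python:3.7

# Set the time zone (value is illustrative)
ENV TZ=Asia/Dubai

WORKDIR /app

# Install Python dependencies first to take advantage of Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the app code and the startup script
COPY . .
RUN chmod +x run.sh

# Run as a non-root user
RUN useradd --create-home appuser && chown -R appuser /app
USER appuser

# Streamlit listens on port 8501 by default
EXPOSE 8501

# run.sh launches the Streamlit app inside the container
CMD ["./run.sh"]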

Inside the container, run.sh launches the Streamlit app with the following command:

streamlit run --server.baseUrlPath /apps --server.enableCORS=false app.py
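If you want to sanity-check the image locally before wiring up CI/CD, something like this works (the image name is arbitrary, and 8501 is Streamlit’s default port):

docker build -t data-app .
docker run --rm -p 8501:8501 data-app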

More on writing your own Dockerfile can be found in the official Docker documentation.

3. CI/CD Configuration

A CI/CD pipeline is a series of steps that must be performed to automate the delivery of a new version of software. There are many tools available to set up a CI/CD process, e.g. CircleCI; feel free to pick whichever you are comfortable with. In our case, we chose Cloud Build triggers.

To set up the CI/CD pipeline, we need to:

  1. Create the trigger action event on GCP.
  2. Add the config file cloudbuild.yaml to the repo.

At Talabat, we manage our infrastructure using Terraform but for completeness, we will show how to create the trigger from the UI.
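For reference, the Terraform equivalent is roughly the sketch below; the repository owner, name, and branch are placeholders rather than our actual configuration:

resource "google_cloudbuild_trigger" "data_app" {
  name        = "deploy-data-app"
  description = "Build and deploy the data app on merge"

  github {
    owner = "your-org"       # placeholder
    name  = "your-data-app"  # placeholder
    push {
      branch = "^master$"
    }
  }

  # Cloud Build reads the pipeline definition from the repo
  filename = "cloudbuild.yaml"
}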

This can be done from the Cloud Build UI in the relevant GCP project, as per the screenshots below, by specifying the trigger name, description, branch name, and any variables to be resolved from within cloudbuild.yaml. Cloud Build also provides default variable substitutions, which are listed in its documentation.

Trigger event config part 1
Trigger event config part 2

As for cloudbuild.yaml:

Sample cloudbuild.yaml
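The embedded gist isn’t shown here; a sketch of what this cloudbuild.yaml can look like is below. The bucket, image, service, and region names are placeholders:

steps:
  # 1. Transfer dependent data files (if any) to a storage bucket
  - name: 'gcr.io/cloud-builders/gsutil'
    args: ['-m', 'rsync', '-r', './data', 'gs://your-data-apps-bucket/data']

  # 2. Build the container image from the Dockerfile
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/data-app:$COMMIT_SHA', '.']

  # 3. Push the image to Google Container Registry
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/data-app:$COMMIT_SHA']

  # 4. Deploy the new revision to Cloud Run
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args:
      - 'run'
      - 'deploy'
      - 'data-app'
      - '--image=gcr.io/$PROJECT_ID/data-app:$COMMIT_SHA'
      - '--region=europe-west1'
      - '--platform=managed'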

This executes four main steps:

  1. Transfers the data to a storage bucket. Cloud Run also supports Cloud SQL, in case data operations get heavier.
  2. Builds the container image using the Dockerfile.
  3. Pushes the image to Google Container Registry.
  4. Rolls out the new app revision by deploying the image to Cloud Run.

Note: It’s important to grant the Cloud Run app’s service account permission to access the created storage bucket. When using the default service account, it looks like this:
012345678901-compute@developer.gserviceaccount.com
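A hedged example of granting that read access (the bucket name is a placeholder):

gsutil iam ch serviceAccount:012345678901-compute@developer.gserviceaccount.com:roles/storage.objectViewer gs://your-data-apps-bucket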

Once we’re done creating the trigger event and pushing cloudbuild.yaml to the repo, we can easily roll out new revisions of the app, with a fresh Docker image pushed to Google Container Registry and deployed to Cloud Run, by simply merging the development feature branch into master.

We’ve generalized this pipeline to be our default way of hosting apps at Talabat. We have our own portal that lists a directory of all our data apps.

Talabat Data Apps directory

Acknowledgments

This process has proven to be very effective in enabling our Data Scientists to deliver their Data Products seamlessly. Thanks to Edzo for the development efforts behind this approach and to Fadi for being an early tester and adopter.
