Building a Data Science App: Part 1 — Software Architecture

Published in Empirisys · 6 min read · Sep 22, 2022

What is the purpose of this blog?

As part of my current position as a data scientist at Empirisys, I have been researching the best ways to create a data science based application. In this four-part blog series I will take you through a long-list of principles needed to build such an application. The intention for these posts is to give an introduction to anyone else looking to build an application around a data science use case.

What is architecture?

When building a data science application there are two different architectures to consider: software architecture and data architecture. These topics are fairly nebulous, with copious blogs, articles and books discussing them. In this post I have summarised many of the architecture principles identified by different practitioners, along with those drawn from my own experience working as a data scientist with Empirisys, to hopefully give a comprehensive long-list of the things to consider before building a data science application.

Software Architecture

This includes:

· How the application relates to its users

· How the separate parts of the application interact with each other

Data Architecture

This includes:

· Data lakes

· Databases

· Database schemas

· Data warehouses

· Dashboards

· …and how these things interact with each other and their end users

In order to build any piece of software, there needs to be a set of guiding principles which can be used to build it. These principles need to be defined for both the software and data architecture. In this four-part blog series, I will take you through a non-exhaustive list of these principles, starting with software architecture.

Software Architecture Principles

There are a number of different ways of describing software architecture, but the method I partly discuss here draws on the C4 model for building architecture. I found this method very useful in identifying how the components of a software application are built and how they interact with its front-end and back-end users.

The starting point in this model is the context in which your application sits in the world, and how the components relate to each other.

Taking the banking app below as an example:

Looking at this diagram we can see that there are a number of components. The central square shows the actual banking app which the users will interact with. Below and to the side of this are the e-mail system and Mainframe Banking system which both ‘talk’ to the internet banking system.
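
One way to make a context diagram like this concrete is to write it down as data. The sketch below is only an illustration in Python and is not part of the C4 model's own tooling: the class names, the `neighbours` helper, and the direction and wording of each relationship are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A person or software system in the context diagram."""
    name: str
    kind: str  # "person" or "system"


@dataclass(frozen=True)
class Relationship:
    """A directed 'talks to' link between two elements."""
    source: Element
    target: Element
    description: str


# Elements taken from the banking example in the text.
user = Element("User", "person")
banking = Element("Internet banking system", "system")
email = Element("E-mail system", "system")
mainframe = Element("Mainframe banking system", "system")

# The direction and wording of each link are illustrative assumptions.
relationships = [
    Relationship(user, banking, "uses"),
    Relationship(banking, email, "sends e-mail using"),
    Relationship(banking, mainframe, "reads account data from"),
]


def neighbours(element, links):
    """Return the sorted names of every element directly linked to `element`."""
    names = set()
    for link in links:
        if link.source == element:
            names.add(link.target.name)
        elif link.target == element:
            names.add(link.source.name)
    return sorted(names)


for link in relationships:
    print(f"{link.source.name} --[{link.description}]--> {link.target.name}")
```

Calling `neighbours(banking, relationships)` lists everything the central system touches, which is a quick way to check that no external dependency has been left off the diagram.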

In the table below I propose some basic questions to ask at this point, along with some reasonable answers!

If you want to be able to maintain, develop, and scale your data science app then the above may be some good answers to these questions.

In this first part of the blog series I will take you through some of the software principles needed to meet the above requirements, drawing on the so-called ‘Bezos API Mandate’.

Principle 1: Efficient and Scalable

Every piece of software contains one or more services, which are essentially reusable components in a software application that provide some sort of functionality.

Anyone who has coded as a data scientist or engineer will be familiar with code organised into functions vs. open code. We know it is generally best to create functions which work independently and focus on a fairly specific task; so it is with software architecture.

Software architecture which achieves this more ‘functional’ approach uses microservices, as explained by AWS. Essentially, instead of having one big application which does everything (monolithic architecture), it is more prudent to create lots of little applications (or microservices), each of which provides some independent functionality and can be used by other services.

When services are independent like this they can be deployed, updated, and scaled more easily than a monolithic architecture (think functional coding vs. open code in Python!). Moreover, this flexibility to update the software gives developers more creative freedom when designing or updating the application. See here for more.
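
The idea can be sketched in a few lines of Python. This is a deliberately simplified illustration: real microservices would normally run as separate processes and talk over a network, whereas here each "service" is just an independent function. All of the names (`cleaning_service`, `scoring_service`, `run_pipeline` and so on) are hypothetical.

```python
from statistics import median
from typing import Callable, Dict, List


# Each "service" does one job and knows nothing about the others.
def cleaning_service(readings: List[float]) -> List[float]:
    """Drop negative readings (treated as invalid in this example)."""
    return [r for r in readings if r >= 0]


def scoring_service(readings: List[float]) -> float:
    """Summarise the readings with their mean (0.0 if there are none)."""
    return sum(readings) / len(readings) if readings else 0.0


def reporting_service(score: float) -> str:
    """Format the score for display."""
    return f"Typical reading: {score:.2f}"


# The application is just a composition of whichever services are registered.
services: Dict[str, Callable] = {
    "clean": cleaning_service,
    "score": scoring_service,
    "report": reporting_service,
}


def run_pipeline(readings: List[float], registry: Dict[str, Callable]) -> str:
    cleaned = registry["clean"](readings)
    score = registry["score"](cleaned)
    return registry["report"](score)


print(run_pipeline([1.0, -5.0, 3.0], services))


# Updating one service does not require touching the other two.
def median_scoring_service(readings: List[float]) -> float:
    """Alternative scorer: the median is less sensitive to outliers."""
    return median(readings) if readings else 0.0


upgraded = {**services, "score": median_scoring_service}
print(run_pipeline([1.0, -5.0, 3.0, 10.0], upgraded))
```

Swapping `scoring_service` for `median_scoring_service` leaves the cleaning and reporting code untouched, which is the property that makes independent services easier to update than a monolith.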

Principle 2: Ease of understanding by developers

Having microservices is a good start, but we need an effective way for all the services to communicate with each other. This can be done using an API First Design. There are essentially three principles to this method as discussed here.

1. The API is a user interface

The users are the developers working on each of the separate services. Design with these users in mind — any developer (internal or external) should be able to understand the API. An API shows the functionality of a service to a developer, so this needs to be well defined and share a common vocabulary with other microservices for an efficient user experience.

2. The API is designed before the software implementation

The API is a fundamental part of the software architecture. It should be kept separate from the implementation and treated as a fairly static part of the application, changing only when an essential piece of functionality is needed.

3. The API is well described

Any developer (internal or external) must be able to understand the documentation of the API. Be consistent in the way the documentation is presented, and either avoid jargon or explain it where it is needed. Log any bugs or subtleties in the documentation.
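
As a rough sketch of what "API first" can look like in code, the example below writes the contract down before any implementation exists, documents it, and then has a consumer depend only on that contract. The names `RiskScoreAPI`, `InMemoryRiskScore` and `describe_site`, and the idea of a per-site risk score, are invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict


class RiskScoreAPI(ABC):
    """Contract agreed before any implementation is written.

    Hypothetical example: exposes a risk score between 0 and 1 for a site.
    """

    @abstractmethod
    def get_score(self, site_id: str) -> float:
        """Return the risk score for `site_id`.

        Raises KeyError if the site is unknown.
        """


class InMemoryRiskScore(RiskScoreAPI):
    """One possible implementation; it can change without altering the contract."""

    def __init__(self, scores: Dict[str, float]) -> None:
        self._scores = dict(scores)

    def get_score(self, site_id: str) -> float:
        return self._scores[site_id]


def describe_site(api: RiskScoreAPI, site_id: str) -> str:
    """A consumer that depends only on the contract, not on the implementation."""
    score = api.get_score(site_id)
    label = "high" if score >= 0.5 else "low"
    return f"{site_id}: {label} risk ({score:.2f})"


api = InMemoryRiskScore({"site-a": 0.72, "site-b": 0.10})
print(describe_site(api, "site-a"))
```

Because `describe_site` only knows about `RiskScoreAPI`, the in-memory class could later be replaced by one backed by a database or a model without the consumer changing, and the docstrings double as the documentation a new developer reads first.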

Principle 3: Security

The ‘cloud’ is an excellent place for this: the security measures that cloud providers put in place are likely to be more robust than those of the average user who stores their files locally.

Below is a list of recommended things to look for when selecting a cloud provider as identified by Norton:

1. Encryption of data

Scrambling data will make it harder for hackers to steal it.

2. Back-up data

Back-ups protect against accidental deletion of data and similar losses. Make your own back-ups too, in case the cloud provider loses your data.

3. Two factor authentication

Providing two separate pieces of evidence when logging into a site, e.g. a username and password plus a code sent to your phone.
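
To give a feel for how the second factor can work, the sketch below generates a time-based one-time password (TOTP, RFC 6238), which is one common way authenticator apps produce login codes. It is not necessarily how any particular cloud provider does it. The `verify_login` helper is hypothetical, and the secret is the published RFC test value, used here for demonstration only.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute a time-based one-time password (RFC 6238, HMAC-SHA1)."""
    counter = unix_time // step  # number of time steps since the Unix epoch
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, as defined in RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_login(password_ok: bool, submitted_code: str,
                 secret: bytes, unix_time: int) -> bool:
    """Hypothetical check: both the password and the one-time code must match."""
    expected = totp(secret, unix_time)
    return password_ok and hmac.compare_digest(submitted_code, expected)


demo_secret = b"12345678901234567890"  # RFC test secret, never use in production
print(totp(demo_secret, 59))
print(verify_login(True, totp(demo_secret, 59), demo_secret, 59))
```

The server and the user's phone share the secret in advance, so both can compute the same short-lived code; a stolen password alone is then not enough to log in.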

Principle 4: Ease of understanding by users

The front-end users of your data science application need to be able to easily understand the data presented to them; an excellent example of this principle being applied would be a speedometer dashboard in a car. Below are some of the things to consider to help with this understanding:

1. Provide the right interfaces for users to consume the data

How are they likely to use the application? Are there styles of interface or functionality they are already familiar with that you can utilize? For example, executives will want to see quick summaries like bar charts, whereas a data scientist user may be familiar with more complicated analytics like network diagrams.

2. User Testing

Engage with the users in an iterative testing process to understand their needs.

3. Platforms

Determine on which platforms the users are likely to use the app — mobile, web, desktop etc.

4. Tooltips

Having tooltips appear within your application will help to reduce the need to train users.

Hopefully this first post gives you a flavour of the software principles out there which help in building an application. In the next post in this four-part series, I will discuss some of the data architecture principles which pertain to data pipelines. See you there!

If you found this useful, please let us know by getting in touch, give us a clap or a follow. You can find more about us at empirisys.io or on Twitter at @empirisys or on LinkedIn. And you can drop us an e-mail at info@empirisys.io, or directly to the author of this article, alex.white@empirisys.io.