Why Every Data Scientist Should Learn Streamlit

Seckin Dinc
8 min read · Feb 3, 2023


Photo by Alvaro Reyes on Unsplash

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for data mining and data science work. It was conceived in 1996. Over the years, alternative models such as SEMMA, KDD, and KDDS have been proposed to improve the life cycle of data mining and data science projects.

While these alternative methods were proposed, we focused only on improving the technical parts of the process. We paid far less attention to the most important aspect: how to share our progress in an agile way, creating a feedback, learning, and improvement loop with our stakeholders, product managers, and customers throughout the life cycle of data science and machine learning projects and products.

In this article, I will demonstrate Streamlit and how big an impact it can have on data science and machine learning projects and products.

How NOT to share your data science project with your stakeholders

Jupyter Notebooks

Data scientists love Jupyter Notebooks. Indeed, they are very powerful and useful while you are working on the exploratory steps of your project. But when it comes to sharing your findings in a presentation, sending them to stakeholders for a deep-dive analysis, or going to production, it sucks! Let me give some experience-based examples.

You are in a presentation and you are asked to walk the data product manager and key stakeholders through the data science project. You share your screen and your beloved Jupyter Notebook with 2000 lines of cells is ready to rock! You press “Restart Kernel and Run All” and boom! The CSV file is not accessible, or a library is missing, or Pandas is suddenly refusing to cast a string column to an integer. How did that happen? You checked it just a couple of hours ago! You try to debug it during the presentation. You start sweating, your leader starts sweating, and everyone else starts playing with their cell phones to check their social media updates. You have lost the audience!

Even if everything goes well, the moment one of your stakeholders asks, “how can we analyze your findings?” or “can we test the model to understand how it behaves?”, you are doomed! How are you going to share your Jupyter Notebook with a non-technical stakeholder? You can’t say it works on your computer.

BI Tools

There are many great open-source and commercial BI tools on the market, e.g. Tableau, Metabase, etc. These tools are great for company-wide analytics and reporting. They are mostly operated by data analytics and data governance teams to enforce certain quality standards on what is produced.

In data science projects, especially in the early stages before going into production, you need to be agile. You need to test new features and their impact on the models, and you need feedback from stakeholders on how the model will look in the product UI and how users are going to interact with the product. There are many dynamic components at these stages. Unfortunately, BI tools and company-wide analytics processes are too slow for presenting data science projects. Also, connecting your BI tool to your REST APIs and other services generates plenty of trouble.

Presentation Slide Tools

These tools are great for presenting static objects and adding pictures, text fields, and notes. But you are not giving a presentation about new regulations and how they are going to change payrolls in the next few months, right?

Imagine that you are talking about how you implemented your AI model in the product, how users interacted with it, and how it is going to change the future of the organization, and you are asking your slide-operator buddy, “Next slide, please”. Eh, it doesn’t sound charming.

What is Streamlit?

Streamlit is an open-source application framework in Python. It allows developers to build interactive web applications without any web or front-end development experience. You just need to know Python!

Let’s Get Started

It is quite easy to install and start Streamlit:

pip install streamlit
streamlit hello
Screenshot by the author

It is that easy! You have started Streamlit on your computer. Now you can access its demo sections, technical documentation, etc.

Your first web application at Streamlit

I would like to demonstrate how easy it is to build an application. We are going to write a Python script locally and save it as basic-demo.py:

import pandas as pd
import streamlit as st

df = pd.read_csv("~/Downloads/cars_final.csv")

# streamlit will build a line chart
st.line_chart(df.Price)

Then we open a terminal and run the script above:

streamlit run basic-demo.py

When you execute the command above, Streamlit automatically opens your web application in the browser. It is quite cool and easy, huh?

Screenshot by the author

Used Car Data Analysis and Price Prediction

In my previous article, I demonstrated how to serve machine learning models as data products with FastAPI. Today I want to extend that use case by building a web application in Streamlit, to not only serve the model but also add extra capabilities for understanding the data set.

You can easily imagine this use case as a data science project at your company. You have been working on the data set, your product manager and stakeholders want to interact with your findings, and you need an easy solution to tell the story. Even further, you need to provide them with a solution so that they can continue their analysis without needing your support!
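If you don't have the author's cars_final.csv at hand, a tiny hand-made stand-in with the same columns the snippets in this article rely on (name, company, year, kms_driven, fuel_type, Price) is enough to follow along; the values here are invented:

```python
import pandas as pd

# Invented stand-in for cars_final.csv: same column names the
# article's snippets use, with a handful of made-up rows.
df = pd.DataFrame(
    {
        "name": ["Swift", "City", "i20", "Polo"],
        "company": ["Maruti", "Honda", "Hyundai", "Volkswagen"],
        "year": [2015, 2018, 2017, 2016],
        "kms_driven": [45000, 30000, 52000, 61000],
        "fuel_type": ["Petrol", "Petrol", "Diesel", "Diesel"],
        "Price": [350000, 650000, 450000, 400000],
    }
)

# Count how many rows fall inside a price range
print(df.Price.between(300000, 500000).sum())
```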

Streamlit Basics

Title, Markdown, and Header

Streamlit lets developers add different kinds of text that are visualized differently in the application. Below I will demonstrate the differences between the title, markdown, and header.

st.title("Used Car Prediction Project")
st.markdown(
    "In this project we analyse the dataset, make predictions, and visualise them with Streamlit"
)
st.header("Top 10 rows of the data set")
Screenshot by the author

Streamlit Dataframe

Streamlit supports Pandas, Spark, Snowflake, and many other data frame objects and serves them as interactive tables. The key benefits of this feature are sorting, scrolling, and expanding the view.

st.dataframe(df.head(10))
Screenshot by the author

Static Tables

Static tables are different from data frames in the sense that they don’t allow user interactions.

st.header("Top 5 most frequent companies")
# top 5 most frequent companies
st.table(
    df.groupby("company")["name"]
    .count()
    .reset_index()
    .sort_values(by="name", ascending=False)
    .head()
)
Screenshot by the author

Number Inputs

Number inputs gather numeric values from users. With this widget, you can let users interact with numeric fields that drive other components in your application. If you want, you can put the widget in the sidebar to create a more effective UX for the main area.

# enabling number inputs to collect min and max values to analyse the price distribution
with st.container():
    st.sidebar.header("Price range selector")
    minimum = float(st.sidebar.number_input("Minimum", min_value=float(df.Price.min())))
    maximum = float(
        st.sidebar.number_input(
            "Maximum", min_value=float(df.Price.min()), value=float(df.Price.max())
        )
    )
    if minimum > maximum:
        st.error("Please enter a valid range")
    else:
        # Streamlit "magic" renders this bare expression automatically
        df.query("@minimum <= Price <= @maximum")
Screenshot by the author

Plotly Charts

Plotly is a well-known charting library in Python. The core benefit of this library is its support for interactive components and charts. Streamlit directly supports Plotly charts.

import plotly.express as px

st.header("Price distribution")
f = px.histogram(
    df.query("@minimum <= Price <= @maximum"),
    x="Price",
    nbins=15,
    title="Price distribution",
)
f.update_xaxes(title="Price")
f.update_yaxes(title="Number of cars")
st.plotly_chart(f)
Screenshot by the author

The beauty of a web application is combining different components and widgets. As an example, I use the number input widgets to get the minimum and maximum car price values and pass them to the histogram above.

Screenshot by the author

Radio Buttons

Radio buttons, checkboxes, and many other components allow users to narrow the view down to the selected information. This is widely used in BI tools to sharpen the focus of charts and reports.

Below I will add a radio button to the side panel and use its selected value to generate price descriptive statistics for the chosen company.

with st.sidebar.container():
    st.header("Company selector")
    companies = st.radio("Companies", df.company.unique())

# based on the radio button selection, the price descriptive statistics are updated
def get_availability(companies):
    return df.query("company == @companies").Price.describe().to_frame().T

st.header("Company price descriptive statistics")
st.table(get_availability(companies))
Screenshot by the author

Combining FastAPI and Streamlit to Serve Car Prediction Model

The cherry on the cake is using FastAPI to serve the machine learning model behind the Streamlit web application.

import requests

def car_price_prediction():
    st.header("Car Price Prediction")
    name = st.selectbox("Car Model", df.name.unique())
    company = st.selectbox("Company Name", df.company.unique())
    year = st.number_input("Year")
    kms_driven = st.number_input("Kilometers driven")
    fuel_type = st.selectbox("Fuel type", df.fuel_type.unique())

    data = {
        "name": name,
        "company": company,
        "year": year,
        "kms_driven": kms_driven,
        "fuel_type": fuel_type,
    }

    if st.button("Predict"):
        response = requests.post("http://127.0.0.1:8000/predict", json=data)
        prediction = response.text
        st.success(f"The prediction from the model: {prediction}")

car_price_prediction()
Screenshot by the author

Conclusion

The main problem of being a data scientist is not building machine learning models, data preparation, going to production, cloud services, TensorFlow, etc. The main problem is the communication gap with your stakeholders.

In data science, we try to tame uncertainty and produce acceptably certain predictions based on probabilities. For non-technical stakeholders, these concepts can be too much to handle. As a data scientist, you need to build empathy and help them understand how you can bring value to their projects and products.

Being transparent about what you are working on, building solutions that your stakeholders can interact with, and creating a feedback loop will earn you the support of your stakeholders. Streamlit is a great tool with which to start this cycle in your data science projects.

Thanks a lot for reading 🙏

For my other posts, you can visit my Medium home page.

You can find me on LinkedIn and Mentoring Club.


Seckin Dinc

Building successful data teams to develop great data products