Data Science Salary Prediction using Streamlit

…how I used machine learning to get a raise!

Ulrik Thyge Pedersen
6 min read · Mar 23, 2023
Photo by Cytonn Photography on Unsplash

The art of salary negotiations in tech

The technology industry is booming, with high demand for skilled professionals and competitive compensation packages. However, negotiating a salary in the tech industry can be a challenging process, even for experienced professionals.

Many factors contribute to the difficulty of negotiating a fair salary, including the rapidly changing landscape of the industry, the scarcity of top talent, and the prevalence of non-standard compensation packages.

I used to struggle with advocating for myself in professional settings due to my non-confrontational and introverted nature. This made it difficult for me to negotiate for better working conditions, favors, or a higher salary.

However, during my most recent job interview, I decided to take a unique approach that leveraged my expertise in data science.

I created a machine learning model that could communicate on my behalf and make the case for a fair and competitive compensation package.

This approach allowed me to showcase my technical skills and communicate my value to the company in a way that felt authentic and true to my personality.

Motivation behind the idea

As my wise colleague often reminds me:

“Machine Learning models made in the basement stay in the basement.”

It’s a powerful statement that highlights one of the biggest challenges facing Data Science today: how can we make it accessible and useful for solving real-world problems?

To tackle this challenge, I set out to create a frontend interface that would display the features of my model and use it to make predictions based on those features. And that’s where Streamlit comes in — it’s an incredibly useful tool for making machine learning models more accessible to a wider audience.

By using Streamlit, I was able to create a user-friendly interface that allows anyone to interact with my model, even if they have no prior experience with data science. This kind of accessibility is essential for creating meaningful solutions to real-world problems, and I’m excited to see what kind of impact my model can have in the hands of more people.

Photo by Riccardo Annandale on Unsplash

What is Streamlit?

Streamlit is an open-source Python library used for building custom web applications for data science and machine learning projects. It simplifies the process of building and sharing data-centric web apps by allowing developers to create interactive web apps using just a few lines of Python code.

With Streamlit, data scientists and developers can quickly create interactive visualizations, data exploration tools, and machine learning applications without having to worry about the underlying web infrastructure. Streamlit automatically takes care of the web server, user interface, and data flow, allowing users to focus on building the data analysis or machine learning model.

Overall, Streamlit is a powerful tool for creating data-centric web apps and is becoming increasingly popular in the data science and machine learning communities.

Our dataset

The inspiration and data for this project come from the Danish Data Science Association. They conducted a comprehensive survey that captured important information about:

  • Current salary
  • Work sectors
  • Years of experience
  • Company size
  • Preferred tools
  • Geographic region
  • Job title
  • Education level

The dataset gathered through this survey provides a valuable resource for exploring the relationship between these different factors and understanding how they contribute to overall job satisfaction and career success. I’m grateful to the Danish Data Science Association for providing this rich dataset. The job is to train a model to predict the salaries of Data Scientists, so let’s get modelling!

Housekeeping

First, let’s import the required libraries to build, train and validate our model:

Next, we will read in our dataset. Since the data contains sensitive personal information, here is a slice of our dataframe:

Image by Author
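Reading the data could look roughly like the sketch below. It assumes the survey was exported as a CSV file; the helper name load_survey and the snake_case column clean-up are illustrative choices, not part of the original project.

```python
import pandas as pd


def load_survey(path: str) -> pd.DataFrame:
    """Read the survey export and normalise column names to snake_case."""
    df = pd.read_csv(path)
    # "Job Title" -> "job_title", so columns are easy to reference in code.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df


# Usage (hypothetical file name):
# df = load_survey("survey.csv")
# df.head()
```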

Data Preparation

We have a lot of categorical features in our dataset. Since most models expect numeric input, we need to encode these features. There are many different ways of encoding features, but the one I chose is scikit-learn’s OrdinalEncoder.

This encoder assigns an integer to each category and transforms the dataset into a more machine-learning-friendly format:
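Here is a minimal sketch of that step on a toy dataframe. The column names and rows are made up for illustration, since the real survey data is not shown, and mapping unseen categories to -1 is my own assumption about how to handle unknown input.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows with hypothetical column names; not the real survey data.
df = pd.DataFrame(
    {
        "job_title": ["Data Scientist", "Data Analyst", "Data Engineer", "Data Scientist"],
        "region": ["Copenhagen", "Aarhus", "Copenhagen", "Odense"],
    }
)
categorical_cols = ["job_title", "region"]

# Each category becomes an integer index (categories are sorted alphabetically).
# Categories never seen during fit are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
df[categorical_cols] = encoder.fit_transform(df[categorical_cols])

print(df)
print(encoder.categories_)
```

Keeping the fitted encoder around matters later, because the app has to apply exactly the same mapping to user input.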

In addition, we one-hot encoded our most_used_tools feature manually, since each respondent can list a different number of tools and the tools themselves vary.
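A manual one-hot encoding of a multi-valued column can be sketched as follows. I am assuming here that most_used_tools holds a comma-separated string per respondent; the toy rows and the tool_ column prefix are illustrative only.

```python
import pandas as pd

# Toy rows; the comma-separated format is an assumption about the raw data.
df = pd.DataFrame({"most_used_tools": ["Python, SQL", "R", "Python, Excel, SQL"]})

# Split each answer into a clean list of tool names.
tool_lists = df["most_used_tools"].str.split(",").apply(
    lambda tools: [t.strip() for t in tools]
)

# One 0/1 column per distinct tool, however many tools a row lists.
all_tools = sorted({tool for tools in tool_lists for tool in tools})
for tool in all_tools:
    df[f"tool_{tool.lower()}"] = tool_lists.apply(lambda tools: int(tool in tools))

df = df.drop(columns="most_used_tools")
print(df)
```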

Train/Test split

Now that our data is ready for modelling, the first step is to split our data into a train, test and validation set:
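One common way to get three sets is to call scikit-learn’s train_test_split twice. The sketch below uses random placeholder data, and the 60/20/20 proportions are an assumption, since the article does not state the actual split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the encoded features and salaries.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

# First hold out 20% as the test set, then take 25% of the remainder
# (20% of the total) as the validation set: a 60/20/20 split.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```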

Model training

Because it’s my favorite model, I used an XGBoost Regressor! Is there a better model? Most likely!

Is this approach good enough for now? Yes! (because I said so)

To find the best hyperparameter configuration, I defined a parameter grid and fitted the model across it.

The validation scores are a Mean Absolute Error of 789.02 DKK and an R² score of 0.93. Good enough for now, time to make use of our trained model!
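For reference, both metrics come straight from scikit-learn. The numbers in this sketch are invented purely to show the calls and have nothing to do with the survey data or the scores reported above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values for illustration only; not the survey data.
y_val = np.array([40000.0, 52000.0, 61000.0, 45000.0])
y_pred = np.array([41000.0, 51000.0, 60000.0, 46500.0])

mae = mean_absolute_error(y_val, y_pred)  # average absolute miss, in DKK
r2 = r2_score(y_val, y_pred)              # share of variance explained
print(f"MAE: {mae:.2f} DKK, R2: {r2:.2f}")
```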

Building the app

Streamlit Community Cloud, Streamlit’s hosting service, deploys an app by reading its code from a GitHub repository. So to make the app work, we need to have a repo with the following:

  • A .py file that creates and defines our Streamlit app.
  • A requirements.txt file that specifies the Python packages Streamlit needs to install for the app to run.
  • A .json file that contains our trained model.

You can find my repository for the app here, and my Streamlit app here. Let’s get into how to build the app!

First we need a requirements file so Streamlit knows which dependencies are required to run the app:
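A plausible requirements.txt for an app like this is sketched below. The package list is my assumption based on the libraries mentioned in this article, and versions are left unpinned because the original pins are not shown.

```text
streamlit
xgboost
scikit-learn
pandas
numpy
```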

Once our libraries are installed, we can import them into our app.py file. In addition, we load our model and enable caching for our app, so the model does not have to be reloaded with each prediction:

The next step is to define a function that will predict our label (salary) based on the features we input into Streamlit. To do this, we define the function, call model.predict on our input features and return the final prediction:
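Such a function can be as small as the sketch below. The name predict_salary and the choice to pass the model in as an argument are my own; any object with a scikit-learn style predict method would work.

```python
import numpy as np


def predict_salary(model, features: list) -> float:
    """Return the model's salary prediction for one row of encoded features."""
    # The model expects a 2D array: one row per person, one column per feature.
    row = np.asarray(features, dtype=float).reshape(1, -1)
    return float(model.predict(row)[0])
```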

We have one little issue to fix. Since we encoded our categorical features during training, and the input we get from Streamlit is not encoded, we need to encode the input from Streamlit to match the model:
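One simple way to do this is to keep the category lists in the app and look up each value’s position. The lists, feature order and function name below are hypothetical; in a real app they must match whatever the training encoder produced (OrdinalEncoder sorts categories alphabetically by default).

```python
# Hypothetical category lists, in the same order the training encoder used.
JOB_TITLES = ["Data Analyst", "Data Engineer", "Data Scientist"]
REGIONS = ["Aarhus", "Copenhagen", "Odense"]
TOOLS = ["Excel", "Python", "R", "SQL"]


def encode_input(job_title: str, region: str, years_experience: int, tools: list) -> list:
    """Turn raw widget values into the numeric row the model was trained on."""
    return [
        JOB_TITLES.index(job_title),  # ordinal code for the job title
        REGIONS.index(region),        # ordinal code for the region
        years_experience,             # already numeric
        *[int(tool in tools) for tool in TOOLS],  # one 0/1 flag per tool
    ]


print(encode_input("Data Scientist", "Aarhus", 5, ["Python", "SQL"]))
# [2, 0, 5, 0, 1, 0, 1]
```

An alternative is to save the fitted encoder alongside the model and call its transform method in the app, which avoids keeping two copies of the mapping in sync.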

Designing the Streamlit Frontend App

The model, data preparation and backend are done, nice! However, we still need to design our frontend app. We need a title, a pretty image and a header:

Next, we need interactive boxes to receive our input features. Streamlit offers a variety of input widgets, and since our features differ in type, we are going to take advantage of this and use a few of them! Let’s start with the job title that best matches our daily work:

Now we have a nice, shiny frontend that can take in our features. The very last thing is to make a button that, when pushed, calls our predict function and returns the model’s salary prediction based on the received input features:

And there we have it! With a few lines of code, we made a Streamlit application to deploy our Data Science salary prediction model. To deploy the app, you can use Streamlit’s website and documentation, where you can showcase your app for the whole world to see. The final app looks like this, and I’m kind of proud of my first frontend creation!

Image by Author

Closing Thoughts

The ability to create interactive tools quickly and easily is hugely impactful for both beginners and seasoned professionals in the field of data science.

For those who are just starting out, it’s incredibly rewarding to take your code and turn it into a real-life application.

And for experienced practitioners, presenting your work in an easy-to-understand format can help you sell your ideas to stakeholders and decision-makers.

Streamlit is a powerful tool that allows us to do all of these things and more. Its speed and flexibility, combined with its native integration with Python, make it an ideal choice for prototyping, mock-ups, MVPs, internal tools, and personal projects. While it may not be robust enough for large-scale production deployments, it’s certainly worth your time to learn the package and explore its capabilities.

In conclusion, I hope this article has given you a better understanding of the value of Streamlit in the field of data science!

Thank you for reading my story!

Subscribe for free to get notified when I publish a new story!

Find me on LinkedIn and Kaggle!

…and I would love your feedback!
