A Day In The Life Of A Data Scientist At High Peak

Team High Peak · Published in High Peak AI · Feb 11, 2019 · 5 min read

It’s been only five months since I joined High Peak as the Lead Data Scientist, but it already feels like a year! This is what my typical day looks like:

As I walk to the office from the metro station, ten hours into a new day, I run through today’s tasks in my mind. Of course, the task list depends on what transpired the day before (or a couple of days before, if there was a weekend in between!).

If I expect to be given a new dataset today, then I plan to start right at the beginning: data cleansing, structuring and visualization. If it’s an old dataset, then I know that I have done the preprocessing work and left the model training overnight. In this case, today’s work will be to test the model and analyze the results.

But data can wait. My tea, however, cannot!

Brewing Tea and Battling Big Data

I stride into the room, greeting everyone who’d pitched their tents before me: well begun is half done! After setting up my desk, I check my email and Skype for any important or relevant messages and get to work downloading the new dataset or the trained model from the server. Now’s the time for a cup of steaming hot green tea: after all, health is wealth!

By the time everything’s downloaded and ready for surgery, I have enjoyed my hot cup of tea and am ready to battle datasets with the data science and deep learning team. We have a quick meeting to discuss the day’s tasks and then get started on our respective schedules.

At High Peak Software, we work on a variety of projects, which include classification, regression, forecasting and clustering.

For the non-data-scientist readers: classification and regression are forms of predictive analysis, in which a target column is predicted by analyzing a series of input columns called features. Forecasting is a form of time series analysis, where we estimate the future values of a series given its history. Clustering is the basic method for unsupervised learning, where we do not know the target column and instead group the rows according to the behavior of their features.
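
To make that concrete, here is a toy sketch in Python with scikit-learn, contrasting classification (known target) with clustering (no target). The data is synthetic and purely illustrative; it is not from any of our projects.

```python
# Toy illustration: supervised classification vs. unsupervised clustering
# on the same feature matrix. Synthetic data, for explanation only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data: 1,000 rows of features X with a known 0/1 target y.
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

# Classification: the target column is known, so the model learns to predict it.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted classes for first 5 rows:", clf.predict(X[:5]))

# Clustering: no target column; the algorithm groups rows by their behavior.
clusters = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
print("Cluster assignments for first 5 rows:", clusters[:5])
```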

One of the projects I work on involves binary classification: the target is either 0 or 1. I build machine learning models to learn the patterns from the input values (features) and predict the target column. The raw data for this project involves tens of millions of spreadsheet rows!

Huge datasets, right?

I generally use Python with its extremely helpful array of libraries, but it falls short when faced with mountains of data like this! This calls for the Apache Spark engine, which is the gold standard for big data processing. Once I have cleaned the data and understood its distribution, I put a machine learning model in training. Training takes anywhere from a couple of hours to ten, so that’s it on binary classification for the day!
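
For the curious, the workflow looks roughly like the sketch below, written with PySpark’s ML library. The file path, column names and choice of classifier here are placeholders for illustration, not the actual project schema or model.

```python
# Simplified sketch of a PySpark binary-classification workflow.
# Paths and column names are placeholders, not the real project schema.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier

spark = SparkSession.builder.appName("binary-classification").getOrCreate()

# Load tens of millions of rows; Spark distributes the work across the cluster.
df = spark.read.csv("data/raw_records.csv", header=True, inferSchema=True)

# Basic cleansing: drop rows with missing values in the columns we care about.
feature_cols = ["feature_1", "feature_2", "feature_3"]  # hypothetical features
df = df.dropna(subset=feature_cols + ["label"])

# Assemble the feature columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_df, test_df = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

# Gradient-boosted trees for a 0/1 target; at scale this can run for hours.
model = GBTClassifier(labelCol="label", featuresCol="features").fit(train_df)
predictions = model.transform(test_df)
predictions.select("label", "prediction").show(5)
```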

Clusters, Collaborations and Convolutional Neural Networks

I also work on forecasting. This data is even bigger, as it is recorded every two seconds! Forecasting needs a lot of data visualization, so I write code to plot graphs and make sense of the data. The rows are then grouped (clustered) according to their behavior so that the same forecasting algorithm can be applied to multiple rows with similar behavior. Once I know which row belongs to which cluster, I forecast the values for that row for the next month, then run a model to predict whether the combination of the forecasted features leads to a 0 or a 1!
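
In very rough Python terms, the idea looks something like this; the data, the summary features and the naive forecasting step are made up purely for illustration.

```python
# Rough illustration: cluster series by behavior, then forecast per cluster.
# The data and the naive forecasting method are made up for illustration.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Pretend we have 50 hourly-aggregated series covering 30 days
# (in reality the raw data arrives every two seconds and is resampled).
rng = np.random.default_rng(0)
series = pd.DataFrame(rng.normal(size=(50, 24 * 30)))  # 50 rows x 720 hours

# Cluster rows by behavior, using simple summary statistics as features.
summaries = pd.concat([series.mean(axis=1), series.std(axis=1)], axis=1)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(summaries)

# Naive per-cluster "forecast": project the cluster's recent level forward.
for cluster_id in np.unique(labels):
    cluster_rows = series[labels == cluster_id]
    recent_mean = cluster_rows.iloc[:, -24 * 7:].mean().mean()  # last week
    print(f"Cluster {cluster_id}: naive next-month level {recent_mean:.3f}")
```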

An interesting thing I am doing at the moment is image classification. I take a set of images and classify them into five classes. For simplicity, I will give an example of binary image classification here. Imagine a folder with photos of cats and dogs only; i.e., there is no picture in the folder with anything other than a cat or a dog in it. I want to build a model which learns to classify an image as either containing a dog or a cat, so that when future images come in, they can be readily shifted to the appropriate folders.

I use convolutional neural networks (CNNs) for this purpose. CNNs apply a series of mathematical operations to the input data, such as convolution and pooling, followed by the matrix mathematics of traditional neural networks. It’s not feasible to code everything from scratch, so I use libraries; but it always helps to know the basics. CNN training runs take 8 to 10 hours, so I generally leave them to run overnight.
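
As an illustration, a small CNN in Keras might look like the sketch below; the layer sizes, image dimensions and the five-class setup are indicative only, not my exact architecture.

```python
# A small illustrative CNN in Keras (layer sizes are not the exact ones I use).
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 5  # five image classes, as in the project described above

model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),          # RGB images, resized to 128x128
    layers.Conv2D(32, 3, activation="relu"),   # convolution: learn local filters
    layers.MaxPooling2D(),                     # pooling: downsample feature maps
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),                          # hand off to the "traditional" part
    layers.Dense(128, activation="relu"),      # fully connected layers
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(train_images, train_labels, epochs=30)  # the long overnight run
```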

Machine learning and data science are built on mathematical and statistical foundations. Whenever I am in doubt, I visit my statistician colleagues for a discussion. The environment at High Peak Software is extremely conducive to learning; people are always willing to have a chat and exchange information. Technical discussions and collaborations are greatly encouraged here, and it’s great fun to catch up over another steaming cup of tea or dataset!

All Work and More Play

Once I’ve put the models on training runs and there’s nothing else to be done, I grab my table tennis racquet and head upstairs for a couple of hours of intense TT! I have a training run going almost every day, so there’s a good chance of finding me in the vicinity of the TT table after 4 PM.

Between games, I run down to my desk and check the progress of the training run. Usually, it’s fine; but if there’s an error, it’s back to work.

Training runs started in the morning usually end by 7:30–8 PM. I do a bit of analysis to see whether the model learned well, and depending on that, either put it on a re-run or pack up for the day.
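
That analysis is usually nothing fancier than a few standard metrics on held-out data. A minimal, self-contained example (with dummy labels, just to make the snippet runnable) would look like this:

```python
# Minimal post-training check; the labels below are dummy values for illustration.
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

y_true = [0, 1, 1, 0, 1, 0, 1, 1]                    # held-out ground truth
y_prob = [0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.7, 0.55]   # predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]             # threshold at 0.5

print("Accuracy:", accuracy_score(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```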

So that’s it! A typical day in my life as a data scientist at High Peak Software. A day full of learning, work, and fun!

If you’d like to feature us in your publication, please reach out to us at marketing@highpeaksw.com.
