People are often surprised to learn that my background is chemistry. Data science job adverts almost always specify physics, mathematics, or computer science. So how does a chemist become a data scientist? Here’s what worked for me.
I started my undergraduate degree in chemistry at Imperial College London in 2007. I mostly took physical chemistry courses that covered a decent amount of mathematics. I didn’t recognise it at the time, but all the linear algebra turned out to be particularly important, as it’s fundamental to most applied mathematics.
In summer 2010 I did a research placement and taught myself MATLAB. Up until then I had only used Excel, so learning MATLAB immediately levelled up my data analysis skills.
For my final year project I made quantum dots with microfluidic reactors. My group automated the reactors using optimisation algorithms in MATLAB so that we could produce nanoparticles with specific properties. This was my first taste of computers making “intelligent” decisions.
In 2011, I started a combined MRes and PhD programme at Imperial. As part of the MRes, students were expected to take a lecture course in any department at Imperial, but instead I enrolled on Andrew Ng’s then brand new online Machine Learning course. It’s a superb course and an essential introduction to machine learning.
Around this time I started learning Python. Gradually it replaced MATLAB for all my work. It turned out to be an excellent decision as it’s the language of choice for most data scientists (with R in second place). I’d recommend newcomers go straight to Python.
I was keen to build up a GitHub profile, so I open-sourced pumpy, a Python package to control syringe pumps. I also started using Arduino microcontrollers in my experiments. These are programmed using a subset of C/C++, so that gave me experience of programming in a lower-level language.
With my reactors producing more and more data, I learnt more statistics so that I could design my experiments and analyse my results properly. I mostly used linear regression models and significance tests to determine which reactor conditions really mattered. I don’t use statistics much in my current role, but I think a good understanding of the fundamentals is crucial.
In the final year of my PhD I decided I wanted to pursue a career in data science. I absorbed as much data science as I could from books, blogs, podcasts, and meetups. What I realised from this was that whilst I had strong problem-solving, programming, and data analysis skills, I lacked practical experience with machine learning and SQL.
To address these gaps I embarked on a few projects. I taught myself SQL using pgexercises and kept my experimental results in a SQLite database. I now use SQL every day.
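For readers who want to try something similar, here is a minimal sketch of keeping experimental results in SQLite from Python, using the standard-library `sqlite3` module. The table name, columns, and values are hypothetical and made up purely for illustration; they are not the schema or data from the experiments described above.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory SQLite database with a few made-up reactor runs."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per reactor run.
    conn.execute(
        """
        CREATE TABLE runs (
            id INTEGER PRIMARY KEY,
            temperature_c REAL,
            flow_rate_ul_min REAL,
            peak_wavelength_nm REAL
        )
        """
    )
    # Illustrative values only, not real measurements.
    rows = [
        (180.0, 50.0, 520.0),
        (200.0, 50.0, 545.0),
        (200.0, 100.0, 530.0),
    ]
    conn.executemany(
        "INSERT INTO runs (temperature_c, flow_rate_ul_min, peak_wavelength_nm) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def mean_peak_by_temperature(conn):
    """Return (temperature, average peak wavelength) pairs, one per temperature."""
    cursor = conn.execute(
        "SELECT temperature_c, AVG(peak_wavelength_nm) "
        "FROM runs GROUP BY temperature_c ORDER BY temperature_c"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    for temperature, mean_peak in mean_peak_by_temperature(connection):
        print(temperature, mean_peak)
```

Swapping `":memory:"` for a file path keeps the results on disk between sessions, and the `GROUP BY` query is the kind of everyday SQL that a small project like this gives you practice with.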
To get experience building machine learning models, I trained classifiers using the Kaggle Titanic dataset. In hindsight, I wish I had spent less time reading about machine learning and more time building models using datasets freely available online.
In late 2015 I applied for data science jobs in London. Interview questions about the complexity of functions and data structures came up a few times, so I bit the bullet and ploughed through Khan Academy’s Algorithms course. I recommend it for two reasons. Firstly, engineers love to ask these questions in interviews (whether they should is another issue). Secondly, the “wrong” data structure can have a huge impact on memory and CPU usage.
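To make that second point concrete, here is a small sketch comparing membership tests on a Python `list` and a `set` holding the same integers. The function names and the collection size are arbitrary choices for illustration, and exact timings and sizes will vary by machine and Python version.

```python
import sys
import timeit


def build_containers(n=100_000):
    """Return a list and a set containing the same n integers."""
    as_list = list(range(n))
    as_set = set(as_list)
    return as_list, as_set


def membership_time(container, target, repeats=200):
    """Time `repeats` membership checks for `target` in `container`."""
    return timeit.timeit(lambda: target in container, number=repeats)


if __name__ == "__main__":
    as_list, as_set = build_containers()
    worst_case = len(as_list) - 1  # last element: the list must scan every item

    # A list scans linearly, O(n); a set uses hashing, O(1) on average.
    print("list lookup time:", membership_time(as_list, worst_case))
    print("set lookup time: ", membership_time(as_set, worst_case))

    # getsizeof reports the container itself, not the integers it refers to.
    print("list size in bytes:", sys.getsizeof(as_list))
    print("set size in bytes: ", sys.getsizeof(as_set))
```

On CPython the set answers lookups far faster, but its hash table takes more memory than the list, so the “right” choice depends on whether CPU or memory is the constraint.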
In the end, it all paid off and in November 2015 I accepted a job offer from DueDil. So, can chemists start a career in data science? There’s a lot to learn, but absolutely. Identify the data science skills and knowledge you lack, then use your research as a means to fill in the gaps. Rather than listing technologies and models on your CV, list projects, either from your research or spare time, to show employers that you’re a data scientist. Data science is an exciting and intellectually stimulating career and I’m glad I made the leap!
Got questions? (Update 24th August 2018)
Many people contact me to ask more questions about starting a career in data science. I don’t have the time to answer all of them. Please get in touch only if you’re a person from an underrepresented group in technology.