Lessons For New Data Scientists We Wish We Had Known
The start of your career can often feel like a whirlwind — from your anxious initial days, to your first presentation and eventually your first promotion. Often, it is only after you have fully settled into an industry that you pause to take stock and realise how much you have actually learned — and how valuable these lessons would have been if you had known them from the start.
With this in mind, we have considered the key tips we would have found most useful at the beginning of our careers. Hopefully these will be helpful — regardless of whether you are settling into a new role, searching for that perfect first position or embarking on your studies.
Before we dive in, a few words about us. We are both data scientists at QuantumBlack. Diana came to the UK from Romania for her studies; she did a BSc in Computer Science at the University of Birmingham, followed by an MSc in Advanced Computing at Imperial College. She joined QuantumBlack after a couple of internships at Goldman Sachs and Expedia. Outside of work, she is a theatre and literature nerd and started QuantumBlack’s book club.
Margaux is Franco-British. She did most of her studies in France (Telecom Paristech), followed by an MSc in Statistics at Imperial College, and started working in data science consulting in Paris, at BearingPoint (HyperCube team). She joined QuantumBlack in Sydney for a year before moving to London. Margaux is a language nerd and tries to seize each business trip as an opportunity to learn a new language.
1. Data Scientists Are Not Unicorns
There is no ‘catch-all’ definition of a data scientist. Unlike many jobs, your job description will vary depending on the company and the industry you settle in. Your responsibilities will overlap with those of data engineers, business analysts and machine learning engineers, and you might even be expected to cover all these positions interchangeably. We found that a great way to find out more about the role you will actually play is to reach out to companies you are interested in, or to speak to friends from university.
To give you a high-level idea, we use the following succinct definitions at QuantumBlack:
- Data Scientist: Writes and optimises Machine Learning and Statistical algorithms, explains findings and predicts results
- Data Engineer: Builds data pipelines that extract, wrangle, cleanse and link data, presenting it in a form ready for analysis
- Machine Learning Engineer: Links Data Scientists and Data Engineers, turning the code developed by the Data Scientists into scalable and highly efficient code for a production environment
2. Expect — But Don’t Fear — the Imposter Syndrome
The fluid definition of ‘data scientist’ means that you’ll likely meet a range of software developers, mathematicians and statisticians working under the same title.
To give you a flavour, our London team alone features a range of academic (STEM) backgrounds, including:
- Master of Science in Physics followed by a PhD in Astrophysics
- Statistical genetics, bioinformatics
- Bachelor of Science in Computer Science (began as Liberal Arts with a major in Computer Science and a minor in Literature) followed by an MSc in Machine Learning
- Bachelor of Science in Economics, a Master of Philosophy in Econometrics followed by a Master of Science in Computational Statistics and Machine Learning
Project discussions can sometimes become overly complex and technical, so it is easy for anyone new to the industry to feel overwhelmed and suffer from imposter syndrome. This is natural, so try to use it as an opportunity to learn and develop. Seek out senior colleagues and ask them for guidance on specific subjects. You will be surprised how open they are to sharing their expertise, and how comfortable they are with not knowing everything themselves. In our experience, people were always happy to help when we reached out to them. At QuantumBlack, we are encouraged to practise the ‘see one, do one, teach one’ approach: learn new things from those around you, apply them independently and then speak about them or teach them to others.
You will likely find yourself mixing with people with incredibly specialist knowledge and educational backgrounds. You might not initially realise it, but others will also seek your expertise and advice if you have studied a specific subject during your MSc or PhD, and you will in turn be able to share a fresh perspective with your team.
Finally, realise that it is OK to be wrong, as long as you acknowledge it. If we only speak up when we are 100% confident, we might miss those micro-moments that have a ripple effect. We should view meetings as discussions from which we can learn. We all learn from both our successes and failures. You will spend about 40 years in the workplace, so you will have the time to enjoy both!
3. Write Production-Ready Code
As a new data scientist, you may expect that any code you write will not see the light of day and that you could survive with proof-of-concept standard code. However, this is not the case. Production-ready models are increasingly becoming key deliverables for many data science projects.
You will collaborate frequently with other data engineers and scientists on the same codebase. Remember that you will be far more productive, and earn the thanks of your grateful new team, if you write robust, modular code, document it, and become comfortable with version control tools (e.g. Git) and the command line.
The list below offers some useful starting points for coding, but is by no means exhaustive:
- Make small commits, apply linting (Pylint) and leverage formatting libraries, such as black
- Pull from develop at least every day; this will reduce the number of merge conflicts that you end up having
- If you commit notebooks, strip out their output
- Schedule pair programming sessions to share code understanding with fellow data scientists; this is also a great way to learn from one another
- Plan peer code review sessions to go through the pipeline with the full team; this helps you spot bugs and code inconsistencies
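To illustrate the point about stripping notebook output, here is a minimal sketch that uses only the Python standard library. It assumes notebooks saved in the usual Jupyter JSON layout (a top-level `cells` list in which code cells carry `outputs` and `execution_count`). The function names are hypothetical, and in practice a dedicated tool or a pre-commit hook would usually do this job for you.

```python
import copy
import json


def strip_notebook_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook (loaded from JSON) with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)  # leave the caller's object untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop printed text, tables, images
            cell["execution_count"] = None  # drop the run counter as well
    return cleaned


def strip_notebook_file(path: str) -> None:
    """Rewrite the notebook at `path` in place with its outputs removed."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    cleaned = strip_notebook_outputs(notebook)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(cleaned, handle, indent=1, ensure_ascii=False)
        handle.write("\n")
```

Calling `strip_notebook_file("analysis.ipynb")` (a made-up file name) before committing keeps diffs small and avoids pushing client data that happened to be printed in a cell.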
4. Anticipate Messy Data
The datasets you will process will rarely be as clean as those encountered on Kaggle. Data is often siloed across a range of systems and collection practices vary widely, so prepare to spend time mitigating this. Learn how to deal with it early: master a few imputation methods and outlier-handling policies.
As your model performance will be capped by the quality of your data, the time you spend on data processing and feature engineering will never be wasted. Our team has sometimes spent months meeting with different departments across a client organisation, simply to identify which people knew the most about the raw data gathered from various data systems.
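As a starting point, the sketch below shows one simple imputation method (filling gaps with the median) and one simple outlier policy (clipping to the interquartile-range fences). These two are our choice for illustration only; the right method depends on the data and on why values are missing or extreme. The function names and the default multiplier `k=1.5` are assumptions, and the code uses only the Python standard library.

```python
import statistics
from typing import List, Optional, Sequence


def impute_median(values: Sequence[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_outliers_iqr(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Clip every value to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [min(max(v, lower), upper) for v in values]


if __name__ == "__main__":
    raw = [10.0, None, 12.0, 11.0, None, 13.0, 250.0]  # made-up readings
    filled = impute_median(raw)
    print(clip_outliers_iqr(filled))
```

Clipping keeps every row while limiting the influence of extreme values; dropping rows or flagging them for review are equally valid policies, and it is worth agreeing on one with the people who know the raw data.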
We hope that these lessons provide a useful overview of how best to navigate your initial days as a data scientist. Remember to maintain an open mind and never stop learning. If you would be interested in beginning your journey with QuantumBlack, do consult our current career opportunities.