Interview with Kaggle Grandmaster, Data Scientist at Point API (NLP startup): Pavel Pleskov
Index and about the series: “Interviews with ML Heroes”
Today I’m honored to be interviewing a Kaggle Grandmaster from the ods.ai community.
I’m excited to be talking to Competitions Grandmaster (ranked #4 on Kaggle: @ppleskov) and Discussions Expert (ranked #29): Pavel Pleskov.
Pavel has a background in Math and Economics. Currently, he is working as a Data Scientist at Point API (NLP startup). He has worked as a Financial Consultant and as a Quant Researcher earlier.
Sanyam Bhutani: Hello Grandmaster, Thank you for taking the time to do this.
Pavel Pleskov: My pleasure!
Sanyam Bhutani: Currently, you’re one of the top 5 ranked Competitions Grandmasters and are actively working on data science projects.
You have a background in Economics and Math. When and how did Kaggle first come into the picture for you?
Pavel Pleskov: The first time I learned about Kaggle was four years ago when I was hiring quants for my proprietary HFT trading firm. The platform seemed like an excellent place for finding outstanding researchers. We even thought about holding our own competition. Thank God, I did not know how much it costs back then :) By the way, that was when I first got to know Stanislav Semenov (former Kaggle #1), and we even attempted to hire him. Little did I know that trading firms all around the world were trying to do the same thing. Of course, he didn’t join us, but we began communicating.
I started actively participating in Kaggle a year and a half ago after deciding to switch my career path from trading to data science. Like many others, I began the journey by studying popular machine/deep learning courses on Coursera including the great Andrew Ng course. What was typical of all these courses is a lack of practice, which Kaggle had no shortage of. Since childhood, I have participated in numerous math olympiads, so the competitive part of Kaggle was also appealing. Additionally, from my time spent in trading, I gained a great understanding of how to find abnormalities (aka leaks) in data and how to come up with creative solutions. It’s all played a role in my active interest in solving puzzles on Kaggle.
Regardless of all achievements, there is one thing I regret the most: not starting to compete earlier. It is a typical rookie mistake, waiting until you are “ready” to fight against such monsters as Giba or bestfitting. So, my advice to everyone: don’t waste precious time and begin practicing as soon as possible — you will close the gaps in theory later on.
Sanyam Bhutani: You’re currently working as a Data Scientist at Point API.
How is Kaggle related to the projects at your job?
Pavel Pleskov: It’s an NLP startup, so participation in NLP-related competitions was very beneficial. Sometimes I have the privilege of using pipelines from the competition, for example, for binary classification with unbalanced classes. However, on a daily basis competition experience is narrowed down to finding and understanding model errors and building workarounds for them, which is the primary skill of a data scientist.
Sanyam Bhutani: Can you tell us more about the projects you are currently working on?
Pavel Pleskov: Our main product is called Point Scribe for Gmail — it’s a productivity tool which allows you to write emails faster by using autocompletion and the power of AI. Like Google’s Smart Compose, only better :) It may be useful for recruiters, tech support, sales people, or anyone who routinely writes similar answers to emails. The task of predicting the next sentence in arbitrary text can be challenging, but for emails, it can be solved quite well. Feel free to check out the product at the Chrome Web Store!
Sanyam Bhutani: Contrary to how one might think — Pavel isn’t a complete geek, he is a digital nomad who has already traveled to 55/195 countries, is a great photographer, and has a fantastic Instagram profile.
How do you find the time for activities outside of Kaggle, and why do you choose to dedicate time to them as well?
Pavel Pleskov: Thanks for your undisguised flattery of my photography skills. When two years ago I decided to change my career, the first idea that came to mind was to become a travel blogger and a popular instagrammer. After spending some time using Instabot and gaining 30K followers (you didn’t expect me to grow organically, did you?) I quickly gave up on it. Luckily, I found a much better fit in DS.
One of the reasons I chose DS over everything else is because it gives you the freedom of self-expression and allows for exercising creativity. Freedom is my core value, and I’m looking for it in every aspect of life. That’s why my favorite hobbies are traveling, motorcycling and kitesurfing. Freedom is also the reason I chose remote work over working in the office. By the way, I’m writing these lines from Mui Ne, Vietnam — one of the rare kitesurfing spots with a steady wind for nearly six months, warm weather in winter and fast internet connection.
Honestly, there are a lot of drawbacks to being a digital nomad they don’t write about in lifestyle blogs. Among those is a lack of communication with friends and family, unreliable and weak internet connection, inconvenient working spaces, safety issues, inadequate health facilities, etc. It’s not just palms and beaches as one might think. Worst of all, returning to the office after being location independent for four years is not an easy task either. Nevertheless, in a couple of years from now, especially after having children and finally visiting 100 countries, I will probably settle down. Hopefully, somewhere in the Bay Area ;)
Let’s get back to your original question. Obviously, it’s tough to perform consistently well on Kaggle while traveling, working full-time and doing all kinds of sports. I’m not even talking about spending time with family and friends. Luckily, my wife is also working remotely in the DS industry as a PM and event organizer. Thus, we have the luxury of supporting each other and spending a lot of time together. “Behind every great man there’s a great woman” — that’s my biggest secret!
Following the advice of your previous guest and an intelligent man, Dr. Vladimir I. Iglovikov, I cannot help but include a photo from Burning Man 2018 where I proposed to my wife. Special thanks to Vladimir for hosting us in San Francisco before and after the event.
Sanyam Bhutani: You’ve had many high finishes. Can you name one or maybe a few competitions that were an excellent experience for you?
Pavel Pleskov: I had a couple of first place finishes on a team with a lot of work done and emotions experienced. However, the most vivid memories came from the third place solo at Text Normalization Challenge — Russian Language. There is quite a story behind this achievement. During the competition, my wife (still girlfriend at that time) and I were traveling around California. On our way to Sequoia National Park, we stayed via Couchsurfing with Robert Osak, a teacher from Visalia. Rob is a super friendly host, with a big house and a great family. It turned out that the previous guests from Spain, a family with two young children, had to stay longer because their VW van broke down. We also extended our stay because my wife got sick. So, I remember it as if it were yesterday: two days before the competition ended, I was spending the entire day in front of an annoyingly blinking laptop screen surrounded by screaming kids while adding more and more rules to my purely heuristics-based solution. Long story short, we got the results on the way to the Bay Area during a stop at some shitty Denny’s restaurant with wi-fi. You can only imagine my surprise at winning $5000 under such circumstances :)
Sanyam Bhutani: You’ve had great results, both in solo finishes and team finishes.
For a noob Kaggler, what tips do you have for deciding whether or not to form a team?
Pavel Pleskov: It’s a straightforward strategy. You always start solo and try to get as close to the top as possible, let’s say the top 10%. You try out all the hypotheses and methods you are familiar with, since that significantly boosts your hands-on skills. Closer to the last week of the competition you will probably run out of ideas, and this is the ideal moment to start looking for a team on Kaggle’s forum. Before merging you would most likely want to share some details about your solutions, such as individual model CV/LB scores and the methods you have tried and failed with. Make sure that potential teammates do the same to ensure diversity in your approaches. Merging with somebody who just blended public kernels is usually not a good idea :) If all is done right, there is an excellent chance to end up in the top 1% of the LB.
Sanyam Bhutani: What kind of challenges do you look for today? How do you decide whether to enter a new competition?
Pavel Pleskov: Frankly speaking, I try to enter and spend some time solving every competition. Even just running public kernels locally helps you to improve practical skills. It allows you to try out different deep learning frameworks, makes you aware of modern libraries people use, and so on. Of course, given an 8-hour work day, it is hard to spend a lot of time on Kaggle. However, being a Competition Grandmaster helps here a lot, because you can join a top performing team closer to the end of the competition without showing outstanding results beforehand. Even in this case during the last week of the competition, I’m able to contribute substantially to the team performance.
Sanyam Bhutani: What best pieces of advice do you have for beginners who want to score well in the Deep Learning Competitions, but do not have a beefy GPU box setup with them?
How can you do well against the others that are doing tremendous stacking at times?
Pavel Pleskov: Let’s first split the issue: CV competitions require a lot of GPUs for training CNNs, and tabular data competitions demand a lot of RAM for stacking. Both types of competitions can be solved with little computational power including Kaggle kernels. For images, it is usually advised to work with a data sample rather than with the entire dataset; otherwise testing multiple hypotheses will take ages regardless of the number of GPUs. If you want to win the competition closer to the end, you most likely should join a team with a beefy GPU box. However, for just improving your practical skills any setup is sufficient. As for stacking, you can always rent a high-performance Google Cloud setup. Google gives you $300 worth of credits for free, which is enough for running stacking for the last couple of days before the competition ends.
I believe that beginners on Kaggle should not be intimidated by the experienced “zookeepers.” There are numerous examples of smart algorithms outperforming brute-force solutions. Since I built my 4 GPUs / 110 GB RAM / 32 cores dev box, I have had fewer incentives for finding smart algorithms. I’m not even talking about people with access to considerable clusters in universities or IT companies. Why bother yourself with thinking when you can run an enormous k-NN?
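The advice above about working with a sample rather than the entire dataset can be sketched as follows. This is a minimal illustration on a synthetic dataset (all names and sizes here are hypothetical, not from any particular competition); a stratified sample keeps class proportions intact, so metrics on the sample remain comparable to the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large labeled competition dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))
y = rng.integers(0, 2, size=100_000)

# Take a stratified 10% sample for fast hypothesis testing.
# Iterating on this subset is an order of magnitude faster than
# retraining on the full data for every idea.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=42
)
print(X_small.shape)  # (10000, 20)
```

Once a hypothesis looks promising on the sample, it can be re-validated on the full dataset before spending a submission on it.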
Sanyam Bhutani: What are your first steps and go-to techniques when starting on a new competition?
Pavel Pleskov: Here are several simple steps for starting on a new competition without regretting it later:
0) Read the rules carefully, learn about evaluation, timelines, how big the dataset is, and so on. It might not be a good time to participate in the competition, because, for example, you are not eligible for winning the prize due to the country of origin.
1) Read EDA from @artgor (Andrew Lukyanenko).
2) Look at best scoring non-blending kernels, try to run them locally and fix errors. You will quickly understand what the data science job is all about — installing packages and the right CUDA drivers or setting up the proper TF or PyTorch version.
3) Set up local cross-validation and check if there is a correlation with the LB. It might also be an excellent moment to stop participating in the competition if CV is not correlated with LB. That could be due to two reasons: either your validation sucks, or the competition sucks. Either way, you will be highly disappointed after the final shake-up.
4) Aim to improve your CV/LB score every day by building new features or tuning parameters. Try not to use many LB submissions. First, they provide no additional information when your validation is strongly correlated with the LB; you only need to check from time to time that the correlation holds as the score grows. Second, when merging into a team, there is a restriction on the total number of submissions made by all members: the combined total must not exceed the maximum number of submissions possible since the very first day of the competition. It is very disappointing not to be able to merge because of this rule, trust me.
5) If you are stuck then try to find old similar competitions. Also, read the forum carefully every day — there are plenty of insights there.
6) Look at the data and model errors. There is no way to win by blindly doing “fit-predict.” All the juicy stuff is hidden within the data itself.
7) Merge with another team, or even better, invite an experienced GM to your team ;)
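Step 3 above — checking whether local CV tracks the leaderboard — can be sketched like this. The score pairs below are hypothetical placeholders for the (local CV, public LB) scores you would record across your own submissions:

```python
import numpy as np

# Hypothetical (local CV score, public LB score) pairs recorded
# across several submissions of the same pipeline with small changes.
cv_scores = np.array([0.812, 0.825, 0.831, 0.840, 0.848])
lb_scores = np.array([0.799, 0.810, 0.818, 0.829, 0.835])

# Pearson correlation between local CV and public LB scores.
corr = np.corrcoef(cv_scores, lb_scores)[0, 1]
print(f"CV/LB correlation: {corr:.3f}")

# A strong correlation suggests the local validation scheme is
# trustworthy; a weak one is a warning sign of a final shake-up.
if corr < 0.8:
    print("Warning: validation may not reflect the leaderboard.")
```

If the correlation breaks down as scores improve, that is exactly the "either your validation sucks, or the competition sucks" situation described above.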
Sanyam Bhutani: You’re one of the faces of the ODS community. Can you tell us more about ODS?
Are beginners that are non-Russian speakers welcome in the community as well?
Pavel Pleskov: ODS.ai is the largest active DS community in the world (almost 30K members at the time of writing), primarily Russian-speaking. As follows from the name, Open Data Science, everybody is welcome. We have plenty of English-speaking members there, including me. Come and join the party; you will not regret it!
Sanyam Bhutani: For the readers and noobs like me who want to become better Kagglers, what would be your best advice?
Pavel Pleskov: Look at the data. Human beings evolved a great ability to recognize visual patterns, and ML/DL is all about finding the signal in the noise. Use your most potent Neural Network (aka brain), which was trained for thousands of years and allowed you to read this text rather than your ancestors being eaten by a tiger.
Sanyam Bhutani: Given the explosive growth rate of ML, how do you stay updated with the recent developments?
Pavel Pleskov: Twitter definitely helps. It is quite easy to subscribe to a couple of dozens of famous data scientists and to keep up with the latest news. My favorite, of course, is Jeremy Howard, the founder of fast.ai. He is always on the edge of modern developments in DL.
Also, reading post-competition write-ups is a great way to stay updated. People share what works and what doesn’t, sometimes even with the code attached. It’s even better than reading papers because it’s more practical.
Sanyam Bhutani: Which developments in the field do you find to be the most exciting?
Pavel Pleskov: Rapid improvements in NLP like ULMFiT, BERT, GPT-2, etc. Some language tasks still seem to be hard enough for ML algorithms, but feeding algorithms with more and more data appears to be working well for now. The only thing I wish is that NLP were as sexy as CV. I have never seen people as excited about text summarisation or end-to-end language translation as they were about Prisma, for example.
Sanyam Bhutani: What are your thoughts on Machine Learning as a field? Do you think it is overhyped?
Pavel Pleskov: Let me quote here a great Kaggle Master and one of the strongest DS managers, Valeriy Babushkin: “ML is hype. CS is forever.”
Sanyam Bhutani: Before we conclude, any tips for the beginner who aspires to be like you someday but feels entirely overwhelmed even to start competing?
Pavel Pleskov: Just do it! Put aside this interview (thanks for getting through it, by the way), go to kaggle.com, register (if you haven’t already) and make your first submission to any competition. Then improve it. Then do it again tomorrow. Next thing you know, you are as addicted to Kaggle as I am :)
Sanyam Bhutani: Thank you so much for doing this interview.
Pavel Pleskov: My pleasure!
If you’re interested in reading about Deep Learning and Computer Vision news, you can check out my newsletter here.
If you’re interested in reading the best advice from Machine Learning Heroes: Practitioners, Researchers, and Kagglers, please click here.