In this challenge, competitors had a fantastic opportunity to experience the work of a quant trader first-hand. They were given 85,000 masked features and had to produce signals for the series of target variables we set.
The three main stages people had to go through were:
- Data cleaning
- Feature engineering
- Finding and optimising the best model
This was a tough challenge and there were lots of good attempts made. Here today we have the overall winner, Abhinav, talking about the way he approaches these types of problems.
He has some really good advice about building a base case model and submitting it early, so we would recommend giving it a read below!
So Abhinav, what’s your background?
Firstly, I’m not a student; I graduated from IIT Roorkee in 2016. I studied engineering but didn’t like motors and that stuff, and became interested in applied statistics. Whilst at university I used to compete in the WorldQuant program, which is an offline trading and research program where you get a stipend based on how well you do.
Then post-college I took up a job with a data science healthcare consultancy for a year and a half. After that, I was part of a startup for 6 months where we worked on an IPL-based idea that helped the teams in auctions. We met two of the big teams and tried to help them. We realised in the end that the teams weren’t willing to spend any money, so it became more of a hobby thing. I think it would have made money in the UK, where there are betting companies.
Since then I’ve been working on ideas in statistical arbitrage around improving trading signals. The competition is very similar to what we actually do in real life. Currently, I’m working for a firm in Mumbai, where I’ve just moved to from Bangalore.
How and when did you start making steps from engineering into data science?
At college, I studied a combined 5-year BSc and MSc, so I told myself that each year (from my second year onwards) I would try something new. I had 2 semesters each year, so 8 things to learn. For example, I originally tried computer vision back when Fourier transforms were big, before deep learning became popular.
Eventually, I found that stats are actually really useful; it’s not just MBAs making stuff up. On top of this, I did an internship in electrical engineering, which I realised wasn’t something I could do for 20 years. So after that, I decided to become a fully-fledged developer and realised that stats could be applied in lots of different domains, which would keep things interesting.
I got started by doing courses on edX. I slowly did more and more until I paid more attention to it than my own degree. Then I did the WorldQuant stuff, and by the time I graduated it was becoming really popular, but I managed to win a couple of competitions before that. I’ve won an iPhone and iPad in the past.
How did you find this competition?
It was mainly a learning experience for me; I was getting back into competitions. What I liked, though, was that a lot of competitions are far-fetched and a long way from real-life problems, but this competition was very similar to what I do in real life.
After a while, it stops making sense to take part in competitions just for vanity reasons, which is why I haven’t done any in a long time. Here there were a lot of features, and that meant there must be an interesting approach. I had a hunch and I wanted to try it out.
Was there any particular part of the problem that stood out as being hard?
I think the biggest thing is that there were a lot of features and the data was super noisy. This means many approaches like deep learning will overfit to the data and just won’t work straight away, so you have to build a model that is much more robust to noise. On the other hand, there were so many features that something like linear regression would have problems with multicollinearity, so that was the other side of the challenge.
The way I overcame this was to reduce the variance without increasing the bias, using a random-forest-style approach. You take lots of models and average them out: this reduces the variance and keeps the bias the same. In this case, you can use the same idea but with a series of linear regressions instead of decision trees. The only thing you need to make sure is that the models are all as independent as possible!
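The idea described above can be sketched in a few lines. This is an illustrative reconstruction, not Abhinav’s actual code: the function name, parameters, and the choice to decorrelate the regressions via bootstrap sampling plus random feature subsets are all assumptions about how one might implement a bagged ensemble of linear models.

```python
import numpy as np

def bagged_linear_predict(X, y, X_new, n_models=50, feature_frac=0.3, seed=0):
    """Average many linear regressions, each fit on a bootstrap sample
    and a random subset of features. Averaging reduces variance on noisy
    data while (roughly) preserving the bias of a single linear model.
    Illustrative sketch only; names and parameters are hypothetical."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = max(1, int(p * feature_frac))  # features per model
    preds = np.zeros(X_new.shape[0])
    for _ in range(n_models):
        rows = rng.integers(0, n, size=n)            # bootstrap sample of rows
        cols = rng.choice(p, size=k, replace=False)  # random feature subset: helps decorrelate models
        Xb = np.column_stack([np.ones(n), X[rows][:, cols]])  # add intercept
        beta, *_ = np.linalg.lstsq(Xb, y[rows], rcond=None)   # ordinary least squares
        Xn = np.column_stack([np.ones(X_new.shape[0]), X_new[:, cols]])
        preds += Xn @ beta
    return preds / n_models  # ensemble average
```

Using a small feature subset per model also sidesteps the multicollinearity problem mentioned above, since each individual regression only ever sees a fraction of the 85,000 features.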
Was there a particular skill you felt you’d practised or learnt during the competition?
For me there’s no direct, off the shelf approach that will always work, instead, you have to approach it from a purely statistical point of view. This competition made me scratch that part of my brain and be creative. It was fun actually, I really enjoyed doing this competition.
Was there anything that made you join this competition in particular?
I am one of those people who try out everything.
I think part of the reason I did this competition was that it was super quick to get something that you could test. It took me like 45 minutes from end to end; you didn’t have to build the toolbox, and you [Auquan] provide a ready-made template, so you only have to work on the core part of it.
What I’ve found in myself is that if I get the first iteration done quickly then I get hooked. If it takes time then I often get distracted by something and won’t come back. Even if it works or doesn’t work it will get my attention and I’ll sit down and figure things out.
What do you mean by getting the first iteration down quickly?
In this case, by the first iteration, I mean seeing your name on the leaderboard. Whether you do well or not well you see yourself up there and you want to do better. I think it’s really important to get a base case that you can build from. Plus, the first iteration has the most inertia, if you get over that then you want to improve.
How long should it take you to produce a good answer and your first iteration?
I think you should be able to get something down in 15 minutes, so you have a base-case submission you can compare with. In this competition, it took me just 15 minutes. Then it actually only took me 45–50 minutes to get to my final submission, as the script was very small so it could be run really quickly.
Being honest, though, I only got it that quickly because I’d read the books An Introduction to Statistical Learning and The Elements of Statistical Learning. I’d understood this independent-models approach, which just clicked for this problem. The implementation never takes that long, so the fact I had a hunch about the noisy data meant it was really quick.
What do you think is the most important part of getting started then?
For me, it is just making sure you have this base case quickly. If you don’t, you end up going down… It’s like in cricket, right: you don’t want to come out to the crease and start hitting sixes straight away, you want to get off strike and build on that.
Unless you’re Stokes in the World Cup final, then you can just punch the ball away?
Yeh yeh, it was a crazy match. Seriously. There will never be a match like it; getting that close and then tying on the super over was insane.