GSoC 2020 with Shogun — Phase 1

Phase 1 of Google Summer of Code 2020 with Shogun.

Tej Sukhatme
5 min read · Jul 1, 2020

The first month of GSoC with Shogun has been quite a journey for me. I was worried that my university might schedule my final examinations during the GSoC period, so I began a week before the actual coding period started.

Deliverables

Here is a basic summary of what I have accomplished until now:

  • Out of all the regression models available in the Shogun library, I chose the one giving the best results (Random Forest) and trained it on the available data.
  • The web-tool is ready, and I have also uploaded the Docker image to Docker Hub. You can find it here. All you have to do is pull the image and run the web-tool.

Project-related pull requests:

Meta-example code pull requests:

Meta-example data pull requests:

Issues opened:

Experience

Before you read about my experience, make sure you have some background on what this project is all about:

The plan for this project was to get the basic pipeline ready and then fit in the other features one by one. Originally I had planned to have the entire thing ready before phase 1 of coding, but I couldn't, as it also depended on which data pre-processing methods I would end up using. I built the website backend and a really basic frontend before the GSoC coding period, and then started working on the notebooks which cleaned, combined, and normalized the data. Once I had figured out from the notebooks everything needed to prepare the data, I finished the rest of the pipeline, which systematically collects the data, combines all the data files, cleans them, and applies the necessary pre-processing methods, such as one-hot encoding the categorical variables and scaling the data. This pipeline takes the data from the git submodule prepared directly from the Zenodo dataset and produces a single final cleaned data file.
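
Concretely, the pipeline does something like the following. This is only a rough sketch assuming pandas; the file layout and the incidence/country column names are made up for illustration:

```
import glob
import pandas as pd

# Collect and combine the per-country CSV files from the data submodule
frames = [pd.read_csv(path) for path in glob.glob("data/**/*.csv", recursive=True)]
df = pd.concat(frames, ignore_index=True)

# Clean: drop rows where the target (incidence) is missing
df = df.dropna(subset=["incidence"])

# Scale the numeric feature columns (everything but the target)
# to zero mean and unit variance
numeric = df.select_dtypes("number").columns.drop("incidence")
df[numeric] = (df[numeric] - df[numeric].mean()) / df[numeric].std()

# One-hot encode the categorical variables
df = pd.get_dummies(df, columns=["country"])

# Produce the single final cleaned data file
df.to_csv("final_cleaned.csv", index=False)
```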

For the data normalization, I used the Yeo-Johnson transform. This is in fact a really cool way of normalizing data: it searches for the power parameter that gives the best result (the least skewed, most nearly normal transformed data) and transforms accordingly. I had to use SciPy for this, as power transformations like Box-Cox and Yeo-Johnson haven't been implemented in Shogun yet. I opened an issue for this and plan to work on implementing them later if time permits.
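
With SciPy this is almost a one-liner: when no lambda is given, scipy.stats.yeojohnson searches for the power that makes the data most normal and returns the transformed values along with that lambda. The numbers below are made-up page-view counts:

```
import numpy as np
from scipy.stats import yeojohnson

views = np.array([0.0, 3.0, 10.0, 180.0, 2400.0])  # made-up counts
transformed, lmbda = yeojohnson(views)  # lmbda found by maximum likelihood
print(lmbda, transformed)
```

Unlike Box-Cox, Yeo-Johnson also handles zero and negative values, which matters for page-view counts that can be zero.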

Once this was ready, all that remained was to build the machine learning model that would learn from this data and apply it to real-time page-view data.

The first thing I tried was to run a simple Linear Ridge Regression model on the data. This seemed to give good enough results.

When I later applied the Random Forest model, I got even better accuracy, so I proceeded to apply Random Forest to the actual dataset in the web-tool.
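
To give a feel for the comparison, here is a rough sketch using Shogun's classic modular Python API; the data is a random stand-in, and the exact calls may differ between Shogun versions:

```
import numpy as np
from shogun import (RealFeatures, RegressionLabels, LinearRidgeRegression,
                    RandomForest, MeanRule, PT_REGRESSION, MeanSquaredError)

# Stand-in data: RealFeatures expects one sample per column,
# i.e. a (n_features x n_samples) matrix
rng = np.random.RandomState(0)
feats_train, feats_test = RealFeatures(rng.rand(8, 200)), RealFeatures(rng.rand(8, 50))
labels_train, labels_test = RegressionLabels(rng.rand(200)), RegressionLabels(rng.rand(50))

# Linear Ridge Regression with regularization constant tau
ridge = LinearRidgeRegression(0.001, feats_train, labels_train)
ridge.train()

# Random Forest configured for regression: average the trees' outputs
forest = RandomForest(feats_train, labels_train, 100)
forest.set_machine_problem_type(PT_REGRESSION)
forest.set_combination_rule(MeanRule())
forest.train()

mse = MeanSquaredError()
print("ridge MSE: ", mse.evaluate(ridge.apply_regression(feats_test), labels_test))
print("forest MSE:", mse.evaluate(forest.apply_regression(feats_test), labels_test))
```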

Once that was done, the ML part of the project was ready.

Following this, I started working on the web tool. We had decided to use Flask for the backend, as it would make it really simple to use Python for the data processing as well as the other server-side code. Also, the other Python framework alternative, Django, seemed too heavy for this project. This was my first time using Flask for a real-life project, but fortunately, the documentation on the official Flask website is really exhaustive.
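
To show why Flask felt so lightweight, a minimal app of this kind looks roughly like the sketch below; the routes and names here are simplified placeholders, not the project's actual code:

```
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # Serve the Jinja-templated landing page
    return render_template("index.html")

@app.route("/predict")
def predict():
    # Hypothetical endpoint: fetch page views, run the trained
    # model, and return the estimate as JSON
    country = request.args.get("country", "italy")
    return {"country": country, "estimated_incidence": 0.0}

if __name__ == "__main__":
    app.run(debug=True)
```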

For the server-side code, one thing I was a little unsure about was querying the data from Wikipedia. This data is returned in JSON format, and we have to parse it to be able to use it. However, again, to my luck, I found a Python package, pageviewsapi, which did all the heavy lifting for me. All I had to do was pass the query parameters as function arguments, and it returned the data as a Python dictionary. It's really cool that someone thought to make a Python API especially for this purpose.
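
Under the hood this just wraps the Wikimedia Pageviews REST API, so the equivalent raw query with requests looks roughly like this (the article and date range are only examples):

```
import requests

# Daily page views of the "Influenza" article on English Wikipedia
url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/all-agents/Influenza/daily/20200601/20200630")
resp = requests.get(url, headers={"User-Agent": "flu-prediction-demo"})
resp.raise_for_status()

# The JSON payload is a list of per-day records under "items"
for item in resp.json()["items"]:
    print(item["timestamp"], item["views"])
```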

With the querying out of the way, what remained was to build the basic architecture of the server. I created several wrapper classes, like DataGateway and WikiGateway, which greatly helped organize the code. I also created a class called ModelGateway, which trains the model and applies it to the real-time data.
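
In sketch form, the structure looks something like this; the class names are real, but the method signatures are simplified for illustration:

```
class WikiGateway:
    """Fetches raw page-view counts from the Wikimedia API."""
    def get_pageviews(self, project, article, start, end):
        ...

class DataGateway:
    """Loads and serves the cleaned historical training data."""
    def load_training_data(self):
        ...

class ModelGateway:
    """Trains the model once and applies it to real-time data."""
    def __init__(self, data_gateway):
        self.data_gateway = data_gateway
        self.model = None

    def train(self):
        ...

    def predict(self, pageviews):
        ...
```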

Lastly came the frontend, and I can say this was a fairly overwhelming experience: learning CSS formatting, getting puzzled over which JavaScript framework to use (and whether to use a framework at all), and then deciding whether or not to use Ajax. Realizing that JavaScript offered a lot of utility I wasn't really looking for, I settled on Bootstrap along with Jinja for the frontend code.

In the later phases, once I am done writing the code for the Generalised Linear Machine and adding it to my model, I plan to work a little more on the frontend. It is still a static website, and I can surely add several features to it using frontend scripting tools.

Conclusion

The most important thing I have learned is that writing bug-free and maintainable code is no doubt a challenging feat, but actually getting it to run is a difficult task as well. Setting up the work environment, installing everything needed, and making sure every requirement is met is quite challenging. When you look for help online with code or an algorithm, you can usually help yourself very easily; but with environment issues you can't really do much, as every system is different. What works on someone else's system may not work on yours, and figuring out which library is actually missing can take hours.

Besides this, it has been an amazing experience so far; my mentors as well as the entire Shogun community have been super helpful whenever I faced a difficult situation. I'm sure I have a lot more to learn in the two months to go, and I hope I come across some really tough obstacles, so that at the end of this summer break I emerge like a phoenix from the ashes, having learnt tons more than I knew before.

The Phoenix, aka me
