CODEX
Science Shorts #2: Bengali Character Recognition, Perceptible Colour Maps & Python Newsletters
Kaggle Grandmasters tackle character recognition for the fifth most popular native language, how to choose the right colour scheme for your plot and useful Python newsletters.
Scope
The Nvidia Machine Learning (ML) Grandmasters took part in the Kaggle competition for character recognition of the world’s fifth most popular native language, Bengali. The team addresses some of the “unwritten” rules of the language in tuning their models. Rather than choosing either the default colour scheme from your favourite plotting library or one based on personal preference, you can find a more scientific basis in the Colorcet library. Finally, I cover two Python-based newsletters that I recently subscribed to.
Introduction
The following three articles were randomly selected from my Pocket list, which I’ve curated over the past 5 years in the field of Data Science; the motivations and background are discussed in a previous post: Data Science Shorts: An Introduction to my Pocket List.
Bengali Character Recognition
Summary
Between December 2019 and March 2020, Kaggle ran the Bengali.AI Handwritten Grapheme Classification challenge. The article describes the challenges with the training set data and the strategies used to mitigate its shortcomings. The highest position for the team was fifth, but most members finished in the top 30, which is very impressive. The approach to the learning rate was as critical as the choice of model.
Context
For many people (including me), the majority of contact with Machine Learning begins (and ends) with the excellent scikit-learn, Keras and PyTorch. What this blog shows is some of the “art” involved in fine-tuning existing models for new applications. At the same time, it shows some of the privilege that English speakers enjoy in the level of research conducted into our shared language.
Modern Colour Maps for Plots with Colorcet
Summary
The two images below compare similar colour maps, hot from matplotlib and its alternative from Colorcet called fire, using a 256-colour gradient:
The right-hand side of both images shows the added fidelity of the fire colormap relative to the historical default for matplotlib. Notice also the differences in the middle of the spectrum, where changes are easier to discern with the fire map. These proposals have been adopted by various plotting libraries including Matplotlib, which now has extensive documentation on the subject. The article shows a great example using a Datashader plot for the U.S.
Context
We’ve all been there when that plot doesn’t quite look right, or we’ve spent hours choosing our categorical colours in a line plot or bar chart. If you’ve experimented with colour customisation, then you know it’s an easy productivity trap to fall into. The reason is that it’s easy to get caught up in the moment and lose track of the original purpose.
As an experienced Data Scientist, you learn quickly to pick an existing colour (or color) map and move on. What Colorcet does is make choosing a sensible (i.e. easy to read and understand) colormap nearly effortless. Rather than manually fine-tuning a colour scheme or choosing a default one that doesn’t quite look right, Colorcet allows you to pick colours that, independent of individual taste, at least work well.
Python Newsletters
Summary
Recently I’ve come across two newsletters that have introduced me to new concepts in Python, pandas and the related Data Science ecosystem. The first is PyCoders, an excellent resource that covers the wide range of Python applications. The second is, of course, the Real Python newsletter PyTricks, which focuses on Python language snippets.
Context
I enjoy the code snippets from Real Python.
However, since signing up I’ve had 12 emails from Dan of Real Python, of which 7 were related to PyTricks and the rest were about joining Real Python; that’s more than 40% of the emails spent on advertising. Still, it’s good to receive these snippets: they make email less boring and, unlike most newsletters, you want to keep the messages.
PyCoders is, however, more subtle with its sponsored material. Each sponsored link is clearly tagged as such, and about 3 of the 17–20 links are sponsored, so roughly 15% to 18%. In fairness, both have a role: the PyTricks emails are a real treat and stand alone, whereas the PyCoders email is more in depth.
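The advertising shares quoted above follow directly from the counts given in this section. As a quick check, here is a small sketch in Python; the helper name share is made up for this illustration.

```python
def share(part: int, total: int) -> float:
    """Return part/total as a percentage, rounded to one decimal place."""
    return round(100 * part / total, 1)

# Real Python: 12 emails in total, 7 about PyTricks, the rest promotional.
real_python_promo = share(12 - 7, 12)

# PyCoders: about 3 sponsored links in an issue of 17 to 20 links.
pycoders_low = share(3, 20)
pycoders_high = share(3, 17)

print(f"Real Python promotional share: {real_python_promo}%")
print(f"PyCoders sponsored share: {pycoders_low}% to {pycoders_high}%")
```

Note that the two figures measure different things: the Real Python share counts whole emails, while the PyCoders share counts links within a single issue.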
Conclusion
Three varied topics from my Pocket list. The first shows the depth of experience required and the empirical nature of tuning existing Machine Learning models for new but similar applications in the field of handwritten character recognition. The second article shows that using a library that has researched the use of effective colour schemes can potentially enhance any visualisation. Given that for any Data Scientist, communication of the results is a critical element of the role, Colorcet should be the default. It should be noted that many tools within the PyViz ecosystem have already adopted these maps. Finally, two sources for Python news and the trade-off with sponsored material.