Why Python is the most popular language used for Machine Learning
By- Prince Patel
Back in 1991, when Guido van Rossum released Python as a side project, he didn't expect that it would become one of the world's fastest-growing programming languages. If we follow the trends, Python has turned out to be the go-to language for fast prototyping.
A statement from the Stack Overflow Developer Survey 2017:
"Python shot to the most wanted language this year."
Why this trend?
If you look at the philosophy of the Python language, you could say it was built for readability and low complexity. You can understand it easily yourself, and help someone else understand it just as fast.
You can read it for yourself. Just run the command below in Python.
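The command in question is Python's well-known Easter egg, which prints "The Zen of Python", the design principles behind the language:

```python
# Prints "The Zen of Python" to the console, including lines such as
# "Readability counts." and "Simple is better than complex."
import this
```

Running this in any Python interpreter shows the aphorisms that summarize the language's philosophy of readability.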
Why in Machine Learning?
Now let's understand why anyone would want to use Python for a Machine Learning project. Machine learning, in layman's terms, is using data to make a machine take intelligent decisions. For example, you can build a spam detection algorithm whose rules are learned from data, detect rare events by looking for anomalies in previous data, or arrange your email by learning from the tags you assigned across your email history, and so on.
Machine learning is nothing but recognising patterns in your data.
An important part of a Machine Learning engineer's work life is to extract, process, define, clean, arrange and then understand data in order to develop intelligent algorithms.
So why would a Machine Learning/Computer Vision engineer like me, or a budding Data Scientist, Algorithm Engineer or Deep Learning engineer, recommend Python? Because it's easy to understand.
Sometimes the concepts of linear algebra and calculus are complex enough that they take the bulk of the effort. A quick implementation in Python helps an ML engineer validate an idea.
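As an illustration of what "quickly validating an idea" can look like (the matrix and data here are made up), one can check a linear-algebra identity numerically, e.g. that the gradient of f(x) = xᵀAx is (A + Aᵀ)x, by comparing it against a finite-difference approximation:

```python
import numpy as np

# Idea to validate: the gradient of f(x) = x^T A x is (A + A^T) x.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

analytic = (A + A.T) @ x

# Central finite-difference approximation of the gradient
eps = 1e-6
numeric = np.empty(3)
for i in range(3):
    e = np.zeros(3)
    e[i] = eps
    numeric[i] = ((x + e) @ A @ (x + e) - (x - e) @ A @ (x - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # should print True
```

A few lines like these settle whether a derivation is right before any time is spent on a full implementation.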
Data is the key
So it totally depends on the type of task where you want to apply machine learning. I work on computer vision projects, so the input data for me is images or video. For someone else it might be a series of points over time, a collection of language documents spread across various domains, audio files, or just numbers.
Imagine that everything around you is data. And it's raw, unstructured, messy, incomplete and large. How can Python tackle all of that? Let's see.
Packages, Packages everywhere !
Yes, you guessed it right. It's the collection of open-source packages, developed by the community (and still in active development) to continuously improve upon existing methods.
Want to work with images: numpy, opencv, scikit-image
Want to work with text: nltk, numpy, scikit-learn
Want to work with audio: librosa
Want to solve a machine learning problem: pandas, scikit-learn
Want to see your data clearly: matplotlib, seaborn
Want to use deep learning: tensorflow, pytorch
Want to do scientific computing: scipy
Want to integrate web applications: Django
Want to take a shower …. Well
The best thing about these packages is their shallow learning curve: once you have a basic understanding of Python, you can start using them right away. They are free and open source, most under permissive licenses. Just import the package and use it.
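To show how little ceremony is involved, here is a tiny sketch of a typical data-cleaning step using numpy (the "sensor readings" are invented for illustration):

```python
import numpy as np

# A toy set of sensor readings with a missing value encoded as NaN
readings = np.array([21.5, 22.1, np.nan, 20.9, 23.0])

clean = readings[~np.isnan(readings)]              # drop missing entries
normalized = (clean - clean.mean()) / clean.std()  # zero mean, unit variance

print(normalized)
```

One import, three lines, and the data is ready to feed to a model. The same pattern (import, a handful of calls) carries over to all the packages listed above.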
If you do not want to use any of them, you can always implement the functionality from scratch, which is a useful exercise in itself.
Yes, it's not fast and takes more space, but…
The main reason Python will never be used everywhere is the overhead it brings. But to be clear, it was never built for the system; it was built for usability. Small processors or low-memory hardware won't accommodate a Python codebase today, but for such cases we have C and C++ as our development tools.
In my case, when we implement an algorithm (a neural network) for a particular task, we use Python (TensorFlow). But for deployment in real systems where speed matters, we switch to C.
Easter egg: Cython has been in development for many years (https://en.wikipedia.org/wiki/Cython). You get the readability of Python with the efficiency of C.
Okay, enough talk. Now show me the way.
Now we know the why. Let's see the how.
- Understand the basic concepts of Data structure.
Before jumping into any field of computer science, it's very important to understand how the machine perceives data. The atomic unit of value is the byte, and with bytes we can encode any input from the universe. Rather than listing everything here, work through the implementations of the core data structures yourself; this tutorial (https://www.geeksforgeeks.org/data-structures/) is a good starting point.
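A quick illustration of the "everything is bytes" point, using only the standard library (the values are arbitrary examples):

```python
import struct

# Text bottoms out in bytes: each ASCII character is one byte
raw = "ML".encode("ascii")
print(list(raw))  # [77, 76]

# A 32-bit integer occupies four of those same bytes (little-endian here)
packed = struct.pack("<i", 1991)
print(len(packed), list(packed))
```

Whether the input is a character, an integer, or an image pixel, the machine sees the same raw bytes; data structures are just conventions for organizing them.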
- Learn python the hard way.
Once you get an understanding of the basics, jump into the tutorial series Learn Python the Hard Way by Zed Shaw. One of the statements from the book is that the hard way is easier. The foundation should always be strong.
- Machine Learning — Implementation matters.
Implementing a clustering algorithm yourself will give you far more insight into the problem than just reading about it. And when you implement things in Python, prototyping and testing the code is much faster. One simple case, K-means clustering, is explained in the following blog: K means in Python
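To give a flavour of what such a from-scratch implementation looks like, here is a bare-bones K-means sketch in numpy (the two-blob dataset and all names are illustrative, not from the linked blog):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Bare-bones K-means: assign points to the nearest centroid, recompute."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated 2-D blobs as a toy dataset
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10, 10], scale=0.5, size=(50, 2))
data = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(data, k=2)
```

Writing the assignment and update steps yourself makes it obvious why K-means is sensitive to initialization and why it converges: each iteration can only reduce the within-cluster distances.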
- Simplicity is the best
Whenever you implement a piece of code, keep in mind that an equivalent, better-optimised version always exists. Keep asking your peers whether they can understand the underlying functionality just by looking at the code. Meaningful variable names, modular code, comments and no hard-coding are the key areas that make a piece of code complete.
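A small before-and-after sketch of those points (the function and the tax rate are invented for illustration):

```python
# Hard to review: cryptic name, magic number hard-coded inline
def f(p):
    return p * 1.18

# Easier to review: named constant, meaningful names, documented behaviour
VAT_RATE = 0.18

def price_with_vat(net_price, vat_rate=VAT_RATE):
    """Return the gross price after adding value-added tax."""
    return net_price * (1 + vat_rate)

print(price_with_vat(100))
```

Both functions compute the same thing, but a peer can verify the second one at a glance, and the rate can be changed in one place.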
What about others ?
Among the world's most popular tools for data scientists are Excel and SAS. The problem with them is that they struggle with large datasets and have less community support for a wide variety of uses; for instance, you can't use Excel to handle a company's raw data at scale.
MATLAB also provides great libraries and packages for specific tasks such as image analysis, and you can find a great number of toolboxes for a given task. The main drawback of MATLAB is its slow execution time: it works for prototyping but not for deployment. It's also not free to use, unlike Python, which is open source.
Another great tool is R. It's open source, free and made for statistical analysis. In my view, Python is a great tool for developing programs that perform data manipulation, whereas R is statistical software that works on a particular format of dataset. Python also provides various development tools that can be used to work with other systems.
R has a learning curve of its own: its predefined functions need predefined inputs. In Python you can play around with the data.
Well, if we focus on the overall task of training, validating and testing models, then as long as it satisfies the aim of the problem, any language, tool or framework can be used, be it for extracting raw data from an API, analysing it, doing an in-depth visualization, or building a classifier for the given task.
But the main reasons for using Python would be its readability, versatility and ease of use.
About the Author | Prince Patel
Prince Patel is a Machine Learning Engineer by profession.