I’ve been in the Data Science field for more than 6 years and have tried and tested different tools, from programming in the terminal to text editors and cloud platforms. I’ve used both Python and R, but for the past few years I have worked only in Python.
In this article, I’ll write about
- Why I choose Python over R
- My preferred text editor
- And other tools I use
Why I prefer Python over R
Python is a general-purpose programming language and can be used for almost anything: web scraping, automation, building websites, building APIs, and of course machine learning models. It is also well supported on cloud platforms like GCP, alongside Node.js.
Python is easy to learn (not that R is difficult), and it supports object-oriented programming: classes and objects, abstraction, inheritance (parent-child relationships), and more.
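As a minimal sketch of those OOP features (the class names here are my own illustration, not from any particular library):

```python
from abc import ABC, abstractmethod

# Abstract parent class: defines an interface that child classes must implement
class Model(ABC):
    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def predict(self, x: float) -> float:
        ...

# Child class inheriting from Model (a parent-child relationship)
class DoubleModel(Model):
    def predict(self, x: float) -> float:
        return 2 * x

m = DoubleModel("doubler")
print(m.name, m.predict(3.0))  # doubler 6.0
```

The abstract base class enforces that every subclass provides a `predict` method, which is the kind of structure larger ML codebases rely on.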
Community support for Python is immense. I rarely had trouble resolving errors or warnings, and Stack Overflow or GitHub issues are good places to search.
Machine Learning Libraries
You have all kinds of ML libraries available in Python, from basic linear algebra to deep learning and reinforcement learning libraries with tensor/GPU support.
I’ve written an article on setting up a python environment in macOS using Homebrew, Pyenv, and Pipenv — link below.
Setting up python environment in macOS using Pyenv and Pipenv
We often run into dependency and version conflicts when working on different projects on a local system, and these tools keep each project's environment isolated.
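A rough sketch of that setup on macOS (the Python version and project name here are illustrative; see the linked article for details):

```shell
# Install pyenv and pipenv via Homebrew
brew install pyenv pipenv

# Install a specific Python version (3.11 is just an example)
pyenv install 3.11

# Create a per-project virtual environment pinned to that version
cd my-project          # hypothetical project directory
pipenv --python 3.11
pipenv install pandas  # dependencies are recorded in a Pipfile
pipenv shell           # activate the environment
```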
Text Editor or IDE?
Now that we have seen the advantages of Python, let’s look at how different tools can take productivity to the next level.
While a lot of data scientists prefer the cloud for data science work, it is generally expensive and requires a lot of knowledge of cloud resources to optimize usage and keep bills down. A lot of the time, a local system is sufficient for data science work, unless we are dealing with image processing or deep learning tasks.
When choosing between a text editor and an IDE (Integrated Development Environment), I don’t see any great reason to use heavyweight IDEs like the JetBrains suite or Eclipse. I prefer a simple text editor with plenty of plugins, and this has been sufficient for most of the projects I have worked on personally.
My preferences are—
Visual Studio Code
Well, technically, Visual Studio Code is not an IDE, just a text editor, but for data science work it is more than sufficient.
- It has support for Jupyter Notebook — this is a major plus, as I do not want to work in a browser.
- It has a terminal (or a command prompt for Windows) integrated to run code right from the editor.
- Ability to browse files from the Explorer panel on the left.
- Support for version control hosts like GitHub or Bitbucket.
- A lot of plugins for handling different file types: Python, YAML, Dockerfiles, a CSV previewer, and more.
- Auto-suggestions / IntelliSense
- Comparing different versions of the same file (when using Git)
If you don’t want to install software and can live without many of these features, then I would suggest JupyterLab and working in the browser. A lot of posts talk about PyCharm, but I like VSCode much better than PyCharm CE; I can’t speak for the licensed version.
I know people who use Vim for everything I described above. I also hear Atom and Sublime Text mentioned as the next best alternatives. The other reason I like VSCode over other text editors is that it supports cross-platform app development using Flutter, so I’m a little biased: it is my one-stop solution for all my needs.
Google Colab
Google Colab is not a tool so much as a ready-made environment for exploring Python packages. It is linked to Google Drive, and if you are working in a team that shares files on Google Drive, it is a good tool for collaborative work.
Apart from collaborative work, I use this while taking any training or learning new stuff, for example — I learned PyTorch and PySpark using Google Colab, as I had some issues installing PySpark locally (though I resolved it later).
So, Google Colab is a ready environment without the hassle of installing required libraries to explore and learn.
Additionally, if you are doing Kaggle competitions, I suggest using Kaggle Notebooks, as they have versioning and are an easy way to refer to previous versions without any Git versioning.
Now let’s dive into some important Python packages. I’ll divide them into 4 categories:
- Data Exploration & Processing
- Modeling and Evaluation
- Big Data Analytics / Deep Learning
- Model Monitoring
Data Exploration & Processing
- Pandas — no introduction is required for this package. It handles all kinds of data exploration and processing, from text files to most other data sources, and can even do some basic plotting.
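A short sketch of that exploration-and-processing workflow (the inline CSV below stands in for a real data file):

```python
import io
import pandas as pd

# A small inline CSV stands in for a real data source
csv = io.StringIO("name,score\nann,90\nbob,85\ncara,78\n")
df = pd.read_csv(csv)

print(df.describe())           # quick numeric summary
print(df[df["score"] > 80])    # basic filtering
df["rank"] = df["score"].rank(ascending=False)  # a simple processing step
```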
Modeling and Evaluation
- Scikit-Learn — be it building Linear Regression or Random Forest, this package has almost all the classic algorithms and is actively developed. Only a few, like XGBoost or LightGBM, need to be installed separately, but this package has most of the ones you need.
- Seaborn — I like the customization it offers, and its color schemes are better than plain Matplotlib or ggplot. The plots are worth adding to presentations (I’ve done that).
- Plotly — if you need more interactive plotting, Plotly is one of the best options I’ve found for Python.
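A minimal modeling-and-evaluation sketch with Scikit-Learn, using its bundled iris dataset as a stand-in for real data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hold out a test set for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a Random Forest and score it on the held-out data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {acc:.3f}")
```

Swapping `RandomForestClassifier` for almost any other estimator keeps the same fit/predict shape, which is what makes the package so convenient.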
Note: most data science work can be done with the few packages above and their dependencies (NumPy, SciPy, Matplotlib). Below are some advanced, nice-to-have packages that I’ve used quite frequently.
Big Data Analytics / Deep Learning
- Apache Spark / PySpark — when you have huge amounts of data to process, or are working in a cluster setup, PySpark is best. Reading a large dataset can be many times faster than with Pandas.
- TensorFlow & PyTorch — these two packages, developed by giants (TensorFlow by Google and PyTorch by Facebook), are primarily used for deep learning tasks like image processing or text analytics. I personally like PyTorch, but if you are working on cloud platforms, you might find TensorFlow more compatible, including for exporting models to mobile apps.
- Dash by Plotly — at times I had to present to stakeholders in a more interactive format, and initially I found Dash by Plotly great. But it requires a fair amount of HTML knowledge along with Python, and many might not find it easy. Then I came across the next package: Streamlit.
- Streamlit — this is very useful for developing an ML app in a few days (not kidding, I built one in just a week). The only problem is that it doesn’t allow much customization, whereas Dash does.
Model Monitoring in Production
- MLflow — a bit of an underrated package. It is especially useful when a lot of ML models are deployed in production and need monitoring on a regular basis. With it, one can log the parameters used in a model, the metrics for each run (for example, accuracy or F1 score), and model artifacts for later reference. For building an end-to-end solution, this is an important package in production.
There are numerous other important packages, like NumPy, Matplotlib, and SciPy, but these get installed as dependencies of the packages above. I’ll talk about a few more interesting ones.
- Airflow — this isn’t exactly a data-science library, but a nice task scheduler for running Python scripts at regular intervals. It also gives a clear view of which step in a flow failed and how many times, along with detailed logs.
Though I do most of my stuff using the tools I mentioned above, there are other tools I use at times. These are optional tools and may not be related to data science.
- Terminal — to access Heroku or GCP (Google Cloud Platform), or to run a local server.
- Bitbucket — to version control my code. I started using Bitbucket before GitHub was acquired by Microsoft, when GitHub’s free tier had no private repositories; GitHub has since added that option.
- SourceTree — a GUI tool from Atlassian, Bitbucket’s parent company, for tracking local vs. remote Git repositories. It is compatible with GitHub, Bitbucket, and GitLab.
- Postman — this has no direct relation to data science, but it is a great tool when working with APIs, and I often use it to extract data from web APIs.
- Oracle SQL Developer — I can’t end an article on a data science setup without touching on database querying tools. Though we mostly use CSVs, or BigQuery on Google Cloud and similar cloud services, at times the data lives in SQLite, PostgreSQL, or another traditional RDBMS. In that case, I use Oracle SQL Developer with the relevant JDBC drivers.
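When the data is in SQLite, you don’t even need a GUI tool: Python’s standard library can query it directly. A sketch, using an in-memory database and a made-up `sales` table as a stand-in for a real RDBMS:

```python
import sqlite3

# An in-memory SQLite database stands in for a real RDBMS
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 50.0)],
)

# An aggregate query, like you would run in a GUI query tool
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 170.0), ('south', 80.0)]
conn.close()
```

For PostgreSQL or another server-based RDBMS, the pattern is the same with the appropriate driver in place of `sqlite3`.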
- Cloud Platforms — all popular cloud platforms provide ML capabilities of one kind or another. I prefer Google Cloud Platform (GCP), which has built-in ML tools (AutoML, AutoML Vision, AutoML Translate, AutoML Natural Language).
Python is a versatile programming language that can be used in a multitude of areas: data science, automation, building websites, web scraping. With the countless packages available for every task at hand, it is hard to deny that Python is becoming the de facto language for data science (until Julia or Scala overtakes it), and it has a package for every kind of data science work. We have seen some of the popular ones, along with some that I use for additional work.
VSCode is also my go-to tool, as it integrates a lot of things: a terminal, Git, Jupyter Notebooks. And it is extensible through a vast set of plugins.
In the end, it’s all about personal preference and comfort. I find these tools very productive and intuitive to use. As I said before, people have been productive with just the terminal or Vim. So go ahead, try various tools, and find the ones that make you productive.