Python Development Setup for Data Scientists in 2022
Many useful tools and libraries have appeared in recent years. Some of them are widely used by engineers but are not well known among data scientists. In this article, I introduce my favorite Python development tools for data science, especially for data scientists who are new to Python or software development.
This article is intended for data scientists who want to …
- use both Mac and Windows (WSL)
- deploy code to cloud services like Google Cloud Run
- handle several projects simultaneously
- manage environment settings with Git
Table of Contents
- Visual Studio Code(vscode); free and useful editor
- Peacock; color scheme manager [Recommended]
- Rainbow CSV; coloring CSV file
- autoDocstring; document generator
- pyenv; version manager
- Poetry; powerful package manager [Recommended]
- Black, Flake8, isort, and Mypy; formatter and linter
Visual Studio Code(vscode); free and useful editor
https://code.visualstudio.com/
Visual Studio Code (vscode) is one of the most popular editors.
Vscode also suits data scientists because you can work with both Jupyter Notebooks and Python files in it. You don't have to code in a browser anymore.
Peacock; color scheme manager
Peacock is one of my favorite extensions in vscode.
You can change the color scheme with Peacock in the following steps.
- "Ctrl(Command) + Shift + P" in vscode
- type "Peacock: Change to a Favorite Color"
- select your favorite one
Of course, you can set your own color by typing "Peacock: Enter a Color" and entering the hex code.
Advantages for data scientists:
When you work on several projects simultaneously, Peacock is quite helpful.
Because each project window looks different, you can tell projects apart at a glance and avoid mixing them up.
In addition, you can track the color setting with Git, so the same color is used on different computers.
Rainbow CSV; coloring CSV file
https://marketplace.visualstudio.com/items?itemName=mechatroner.rainbow-csv
As a data scientist, you often need to look at CSV files. Rainbow CSV shows each column of a CSV file in a different color. Excel is a good tool for viewing CSVs, but it takes a long time to open files. Try this extension if you want to inspect a CSV at a glance.
autoDocstring; document generator
https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring
autoDocstring is a docstring generator that helps you write maintainable code. Once you define the arguments and return value of a function or method, this extension generates a docstring template for it.
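To show what that looks like in practice, here is a sketch of a function after the template has been filled in. The function normalize and its parameters are hypothetical, and the layout assumes a Google-style docstring; the exact template you get depends on the docstring format configured in the extension.

```python
from typing import List


def normalize(values: List[float], scale: float = 1.0) -> List[float]:
    """Rescale values so that the largest absolute value equals scale.

    Args:
        values (List[float]): Numbers to rescale.
        scale (float, optional): Target maximum magnitude. Defaults to 1.0.

    Returns:
        List[float]: Rescaled numbers, or a copy of the input if every value is zero.
    """
    # Largest magnitude in the input; 0.0 when the list is empty.
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return list(values)
    return [v / peak * scale for v in values]
```

The extension only produces the skeleton (section headers, argument names, and types); the descriptions are still yours to write.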
pyenv; version manager
https://github.com/pyenv/pyenv
pyenv is a popular version manager for Python. To install it on Mac, you can use the brew install pyenv command. If you are a Windows (WSL) user, try the following commands.
git clone https://github.com/pyenv/pyenv.git ~/.pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
Then install a specific version (e.g., 3.9.11) of Python.
pyenv install 3.9.11
I recommend pinning the version for the working directory with this command.
pyenv local 3.9.11
The command generates a .python-version file in the working directory, so you can track the Python version with Git.
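If you want to confirm that the interpreter you are running matches the pinned version, a small check like the following can help. This is a minimal sketch, assuming the pinned version is stored in a file named .python-version with a full version string such as 3.9.11 as its first entry; the helper name matches_pinned_version is hypothetical.

```python
import platform
from pathlib import Path
from typing import Optional


def matches_pinned_version(directory: str = ".", running: Optional[str] = None) -> bool:
    """Return True if the running Python matches the first version in .python-version."""
    version_file = Path(directory) / ".python-version"
    if not version_file.is_file():
        return False  # no version is pinned in this directory
    entries = version_file.read_text().split()
    if not entries:
        return False  # the file exists but is empty
    # Use the given version string if provided, else the current interpreter's version.
    current = running if running is not None else platform.python_version()
    return current == entries[0]


if __name__ == "__main__":
    print("matches pinned version:", matches_pinned_version())
```

Running this from the project directory should report a match once pyenv is initialized in your shell; if it does not, the shell is probably picking up a different Python.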
Poetry; a powerful package manager
Poetry is a Python package manager that resolves dependencies between libraries. Compared to pip, Poetry manages libraries more intelligently. It separates libraries into two lists: pyproject.toml holds the packages you asked to install, and poetry.lock holds every package that actually gets installed, including the dependencies of the former. (This is similar to how npm works in JavaScript.)
For instance, if you install pandas with Poetry, pandas is recorded in pyproject.toml, and pandas together with all of its dependencies is recorded in poetry.lock.
These files are updated automatically when you install new packages, so you no longer need to run the pip freeze command.
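The relationship between pyproject.toml and poetry.lock can be sketched in plain Python. This is a toy model, not Poetry's resolver: it ignores version constraints entirely, and the dependency graph is hand-written for illustration (the names DEPENDS_ON and locked_packages are hypothetical). The requested list plays the role of pyproject.toml, and everything reachable from it plays the role of poetry.lock.

```python
from typing import Dict, List, Set

# Hand-written, illustrative dependency graph (an assumption, not read from PyPI).
DEPENDS_ON: Dict[str, List[str]] = {
    "pandas": ["numpy", "python-dateutil", "pytz"],
    "python-dateutil": ["six"],
    "numpy": [],
    "pytz": [],
    "six": [],
}


def locked_packages(requested: List[str]) -> Set[str]:
    """Collect the requested packages plus everything they depend on."""
    seen: Set[str] = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already recorded, avoid revisiting
        seen.add(name)
        stack.extend(DEPENDS_ON.get(name, []))
    return seen


requested = ["pandas"]  # what you asked for
print(sorted(locked_packages(requested)))  # requested packages plus their dependencies
```

The real lock file also records the exact version of each package, which is what makes the environment reproducible on another machine.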
Moreover, Poetry can create a virtual environment so that you can run Python in isolation. Therefore, you don't need to worry about unintended dependencies.
Here is a quick start to Poetry.
$ pip install poetry # install Poetry
$ poetry config virtualenvs.in-project true --local # generate venv in working directory
$ poetry init # initial settings of Poetry
$ poetry add pandas # install package e.g. pandas
$ poetry shell # launch virtual environment
After setting up Poetry, don't forget to select Poetry's virtual environment as the default Python interpreter in vscode.
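One quick way to verify that the selected interpreter really comes from a virtual environment is to compare sys.prefix with sys.base_prefix, which differ inside a virtual environment. A minimal sketch; the function name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtualenv())
```

With virtualenvs.in-project set to true as in the quick start above, the printed interpreter path should point inside the .venv folder of your project.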
Once you've set up Poetry and put pyproject.toml, poetry.lock, and poetry.toml under Git, you and your teammates can share the same environment you've created.
Black, Flake8, isort, and Mypy; formatter and linter
These packages speed up your coding and keep your programs tidy.
They are only used during development, so you can install them with the -D option.
poetry add -D black flake8 isort mypy
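To give a feel for the style these tools encourage, here is a small module written the way they would typically leave it: sorted imports, consistent formatting, and type hints that a type checker can verify. It is a sketch of the intended style rather than verified tool output, and the function column_means is a hypothetical example.

```python
import statistics
from typing import Dict, List


def column_means(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Return the mean of each column across a list of row dictionaries."""
    columns: Dict[str, List[float]] = {}
    for row in rows:
        for name, value in row.items():
            # Group values by column name.
            columns.setdefault(name, []).append(value)
    return {name: statistics.mean(values) for name, values in columns.items()}


print(column_means([{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}]))
```

Passing strings where floats are expected, for example column_means([{"a": "1"}]), is the kind of mistake Mypy reports before the code ever runs.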
Next, enable the linters and formatters explicitly in vscode through settings.json.
{
  "python.formatting.provider": "black",
  "python.linting.flake8Enabled": true,
  "[python]": {
    "editor.codeActionsOnSave": {
      "source.organizeImports": true
    }
  },
  "python.linting.mypyEnabled": true
}
Conclusion
I've introduced several valuable tools that help data scientists set up a Python environment. The source files are available in this repository (https://github.com/koyaaarr/python-setup).
I hope this article is helpful to you.