One step further to improve the Conda environment: NumPy implementation [Part 2]

Mostafa Farrag
Published in Hydroinformatics
10 min read · Dec 22, 2022

In this article, we take an in-depth look at the installation directory of Miniconda and reduce the size of the installation by cleaning the cache and tarballs. Then we compare the MKL and OpenBLAS implementations of BLAS, a dependency of NumPy, and see how the choice between them affects the size of the environment.

The content of the article is arranged as follows:

  • An in-depth look at the installation directory of miniconda (site-packages/pkgs - lib - envs - scripts)
  • Size of environment
  • Clean cache and tarballs (from 2.4 GB to 1.4 GB)
  • Numpy Installation (900 MB depending on implementation)

An in-depth look at the installation directory of miniconda

  • First, to check where conda is installed, use the which command on Linux or the where command on Windows:
(base) ubuntu@Mufasa:~$ which conda
WSL Ubuntu Linux terminal
Windows terminal

Then browse to the installation directory that the command points to and list its contents:

(base) ubuntu@Mufasa:~/miniconda3$ ls -laF
total 112
drwxr-xr-x 16 ubuntu ubuntu 4096 Dec 21 22:03 ./
drwxr-xr-x 8 ubuntu ubuntu 4096 Dec 21 23:31 ../
-rw-r--r-- 1 ubuntu ubuntu 10721 Apr 21 2022 LICENSE.txt
drwxr-xr-x 2 ubuntu ubuntu 4096 Dec 21 22:03 bin/
drwxr-xr-x 2 ubuntu ubuntu 4096 Dec 21 22:03 compiler_compat/
drwxr-xr-x 2 ubuntu ubuntu 4096 Dec 21 22:03 conda-meta/
drwxr-xr-x 2 ubuntu ubuntu 4096 Dec 21 22:03 condabin/
drwxr-xr-x 3 ubuntu ubuntu 4096 Dec 21 22:34 envs/
drwxr-xr-x 4 ubuntu ubuntu 4096 Dec 21 22:03 etc/
drwxr-xr-x 8 ubuntu ubuntu 4096 Dec 21 22:03 include/
drwxr-xr-x 15 ubuntu ubuntu 4096 Dec 21 22:03 lib/
drwxr-xr-x 268 ubuntu ubuntu 36864 Dec 21 22:50 pkgs/
drwxr-xr-x 10 ubuntu ubuntu 4096 Dec 21 22:03 share/
drwxr-xr-x 3 ubuntu ubuntu 4096 Dec 21 22:03 shell/
drwxr-xr-x 3 ubuntu ubuntu 4096 Dec 21 22:03 ssl/
drwxr-xr-x 3 ubuntu ubuntu 4096 Dec 21 22:03 x86_64-conda-linux-gnu/
drwxr-xr-x 3 ubuntu ubuntu 4096 Dec 21 22:03 x86_64-conda_cos6-linux-gnu/
  • The most important folders in the conda installation directory are pkgs, lib (which contains site-packages), envs, and scripts (Scripts on Windows, bin on Linux).
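The lookup done with which/where can also be sketched in Python. This is a minimal sketch, not part of the original walkthrough: the helper names are hypothetical, and it assumes the usual layout in which the conda executable sits two levels below the installation root (for example <root>/bin/conda or <root>/condabin/conda).

```python
import shutil
from pathlib import Path
from typing import Optional


def conda_root_from_executable(executable: str) -> Path:
    """Return the installation root for a conda executable path.

    Assumes the usual layout <root>/bin/conda, <root>/condabin/conda
    or <root>/Scripts/conda.exe, i.e. the root is two levels up.
    """
    return Path(executable).parent.parent


def find_conda_root() -> Optional[Path]:
    """Look up conda on PATH (like which/where) and derive its root."""
    executable = shutil.which("conda")
    if executable is None:
        return None  # conda is not on PATH in this shell
    return conda_root_from_executable(executable)


if __name__ == "__main__":
    root = find_conda_root()
    if root is None or not root.is_dir():
        print("conda was not found on PATH")
    else:
        print(f"conda root: {root}")
        # List the top-level folders, similar to running ls in the root.
        for entry in sorted(root.iterdir()):
            if entry.is_dir():
                print(f"  {entry.name}/")
```

If conda is only available as a shell function and not on PATH, the lookup returns None, so the result should be read as a hint and not as a guarantee.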

site-packages/pkgs

  • The site-packages folder is the directory that holds third-party Python packages, that is, packages that are not part of the standard library shipped with the base Python installation. It is a different folder from pkgs, which is conda's package cache.
  • These packages are typically installed with the conda install command, which downloads them from a conda channel, or with the pip install command, which downloads them from the Python Package Index (PyPI) or another package index.
  • The site-packages folder is located inside the lib folder of the Python installation (lib/pythonX.Y/site-packages on Linux, Lib\site-packages on Windows), and it is part of Python's module search path (sys.path), so Python can find and import the packages it contains.
site-packages folder from a miniconda installation directory in Windows
  • In the pkgs folder, each package subdirectory contains the files and resources of one package, including the executables, libraries, documentation, and other files needed to use the package. The pkgs folder also contains a cache subdirectory, which stores cached data downloaded from the package repository.
  • Conda uses the pkgs folder to store and manage the packages that are installed on the system.
  • By default, the pkgs folder is located in the root directory of the conda installation, not inside an individual environment. For example, if Miniconda is installed in ~/miniconda3 (or /opt/miniconda3), the pkgs folder is ~/miniconda3/pkgs (or /opt/miniconda3/pkgs), as in the directory listing above.
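To see where the site-packages folder of the active interpreter lives, and whether it is on the import search path, a small check like the following may help. It is only a sketch: the function name is hypothetical, and it relies on the standard sysconfig and sys modules, not on anything conda-specific.

```python
import sys
import sysconfig


def site_packages_info() -> dict:
    """Report where the running interpreter installs third-party packages."""
    purelib = sysconfig.get_path("purelib")  # directory for pure-Python packages
    return {
        "prefix": sys.prefix,                # root of the active environment
        "purelib": purelib,
        "on_sys_path": purelib in sys.path,  # is the directory on the import path?
    }


if __name__ == "__main__":
    for key, value in site_packages_info().items():
        print(f"{key}: {value}")
```

Run inside an activated conda environment, the prefix should point at that environment's folder under envs, and purelib at its site-packages folder.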

lib

  • The lib folder contains the shared libraries and other dynamically loadable code used by the programs, Python packages, and tools installed in Miniconda.
  • It also holds the Python standard library and the site-packages folder of the base environment (under lib/pythonX.Y on Linux).

envs

  • The envs folder contains one subdirectory for each conda environment that you have created. A conda environment is a self-contained environment that includes a specific set of packages and their dependencies.
  • By default, the envs folder is located in the root directory of the conda installation.
envs folder from a miniconda installation directory in Windows

scripts

  • The scripts folder (Scripts on Windows, bin on Linux) contains the executables and command-line scripts that belong to an environment. It is located in the root directory of that environment.
  • The executables for Jupyter and JupyterLab are usually located in this folder, which is useful if you want to create a desktop shortcut for these applications.
  • When you activate an environment, everything in this folder becomes available in the terminal.

Now that we have a good understanding of the folder structure of a conda installation, let's see how we can reduce its size.

Size of environment

  • To check the size of each folder in the conda installation on Linux, sorted in descending order, use:
du -shc ~/miniconda3/* | sort -rh
  • We have one environment (pyramids) besides the base environment, and the total size is 2.4 GB.
  • The sizes of the folders inside the pyramids environment are as follows.
  • Notice that the lib folder is the largest, as it holds all the installed packages.
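On systems without du (for example a plain Windows terminal), a rough equivalent of the size check can be sketched in Python. The function names and the ~/miniconda3 location are assumptions for illustration. The sketch sums apparent file sizes, so files that conda hard-links between pkgs and the environments are counted in every place they appear, and the numbers can come out higher than what du reports.

```python
import os
from pathlib import Path


def dir_size(path: Path) -> int:
    """Total apparent size in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            if not file_path.is_symlink():  # skip symlinks so targets are not counted
                total += file_path.stat().st_size
    return total


def sizes_by_child(parent: Path) -> list:
    """(name, bytes) for each entry directly under parent, largest first."""
    sizes = []
    for child in parent.iterdir():
        if child.is_symlink():
            continue
        size = dir_size(child) if child.is_dir() else child.stat().st_size
        sizes.append((child.name, size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    root = Path.home() / "miniconda3"  # assumed install location
    if root.is_dir():
        for name, size in sizes_by_child(root):
            print(f"{size / 1024**2:10.1f} MB  {name}")
```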

Clean cache and tarballs (from 2.4 GB to 1.4 GB)

Now let's clean up the leftovers from our installations to free up disk space, using the conda clean command.

  • The conda clean command has several options that allow you to specify which types of files and packages to remove. For example, you can use the --all option to remove all files and packages that are not needed by any environment on your system. You can also use the --tarballs option to remove cached package files, or the --index-cache option to remove the package index cache.
conda clean -afy
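  • As we see, the envs folder did not change, while the pkgs folder in the base installation, which was 1.1 GB, disappeared. This is the effect of the -f (--force-pkgs-dirs) flag, which removes the whole package cache; the conda documentation warns that this can break environments whose packages are symlinked to the cache.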
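Before deleting anything, it can be useful to estimate how much space the cached package archives occupy. The sketch below only reads the pkgs folder and sums the archive files in it; the function names and the ~/miniconda3/pkgs path are assumptions, and the two suffixes are the conda package archive formats (.tar.bz2 and .conda).

```python
from pathlib import Path

# File extensions of the two conda package archive formats.
ARCHIVE_SUFFIXES = (".tar.bz2", ".conda")


def cached_archives(pkgs_dir: Path) -> list:
    """Package archives stored directly in the pkgs folder."""
    return sorted(
        entry for entry in pkgs_dir.iterdir()
        if entry.is_file() and entry.name.endswith(ARCHIVE_SUFFIXES)
    )


def archives_size(pkgs_dir: Path) -> int:
    """Total size in bytes of the cached package archives."""
    return sum(entry.stat().st_size for entry in cached_archives(pkgs_dir))


if __name__ == "__main__":
    pkgs = Path.home() / "miniconda3" / "pkgs"  # assumed cache location
    if pkgs.is_dir():
        print(f"{len(cached_archives(pkgs))} archives, "
              f"{archives_size(pkgs) / 1024**2:.1f} MB")
```

The result covers only the archives; the extracted package folders in pkgs, which -f also removes, are not included.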

Numpy Installation (900 MB depending on implementation)

NumPy is a Python library for working with large, multi-dimensional arrays and matrices of numerical data. It provides efficient operations on arrays and matrices, as well as functions for performing mathematical operations on these data structures.

Numpy

NumPy has a few dependencies that are required for it to work properly. These dependencies include:

  • Python: NumPy is a Python library, so you need to have a working Python installation in order to use NumPy.
  • BLAS: NumPy depends on a low-level library called BLAS (Basic Linear Algebra Subprograms) to perform some of its numerical operations. There are several implementations of BLAS available, including the reference implementation called “Netlib BLAS” and optimized implementations like “OpenBLAS” and “Intel MKL”.

MKL & OpenBLAS

  • NumPy can be built to use one of these BLAS implementations.
  • These low-level libraries are installed together with NumPy, so you don’t need to install them separately.

According to the NumPy website [Link]:

  • The NumPy wheels on PyPI, which is what pip installs, are built with OpenBLAS. The OpenBLAS libraries are included in the wheel. This makes the wheel larger, and if a user installs (for example) SciPy as well, they will now have two copies of OpenBLAS on disk.
  • In the conda defaults channel, NumPy is built against Intel MKL.
  • In the conda-forge channel, NumPy is built against a dummy “BLAS” package. When a user installs NumPy from conda-forge, that BLAS package then gets installed together with the actual library — this defaults to OpenBLAS, but it can also be MKL (from the defaults channel).
  • The MKL package is a lot larger than OpenBLAS, it’s about 700 MB on disk while OpenBLAS is about 30 MB.
  • MKL is typically a little faster and more robust than OpenBLAS.

Check which dependency package you installed

  • To check which BLAS package was installed with numpy in the pyramids environment (the pyramids package has numpy as a dependency), run:
conda list | grep numpy
  • The first part of the command lists all installed packages and pipes the list to grep, which filters it and displays only the lines that contain the word numpy.
  • As we see, numpy was installed from the conda-forge channel, which, according to the NumPy documentation, uses OpenBLAS as a dependency by default, a lightweight package (about 30 MB).
  • Checking the openblas file in the lib folder shows the size of the shared library installed with numpy: 36 MB.
  • libopenblasp-r0.3.21.so is a shared object file that contains compiled code for the OpenBLAS library. The “r0.3.21” part of the filename indicates the version of the library. The “so” extension indicates that this is a shared object file, which means that it can be dynamically loaded by a program at runtime.
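The file name convention described above can be read programmatically. The following is a minimal sketch: the regular expression is an assumption derived from the single file name shown here, not an official OpenBLAS naming rule, and the function name is hypothetical.

```python
import re
from typing import Optional

# Matches names such as "libopenblasp-r0.3.21.so". The pattern is an
# assumption based on the file name above, not an official naming rule.
_OPENBLAS_NAME = re.compile(r"^libopenblas\w*-r(?P<version>\d+(?:\.\d+)*)\.so$")


def openblas_version(filename: str) -> Optional[str]:
    """Extract the version from an OpenBLAS shared object name, if it matches."""
    match = _OPENBLAS_NAME.match(filename)
    return match.group("version") if match else None


if __name__ == "__main__":
    print(openblas_version("libopenblasp-r0.3.21.so"))
```

From inside Python, numpy.show_config() prints the BLAS and LAPACK information NumPy was built with, which is another way to confirm which implementation is in use.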

Switch BLAS implementation

  • Now we want to check the size of the environment when numpy is installed with the MKL implementation of BLAS [Link].
  • You can switch the BLAS implementation without removing the installed numpy version using the following command:
conda install "blas=*=mkl"
## Package Plan ##
environment location: /mnt/c/MyComputer/miniconda3/envs/pyramids
added / updated specs:
- blas[build=mkl]
The following packages will be downloaded:
package | build
---------------------------|-----------------
blas-1.0 | mkl 6 KB
ca-certificates-2022.10.11 | h06a4308_0 124 KB
certifi-2022.12.7 | py39h06a4308_0 150 KB
fftw-3.3.9 | h27cfd23_1 2.3 MB
intel-openmp-2021.4.0 | h06a4308_3561 4.2 MB
mkl-2021.4.0 | h06a4308_640 142.6 MB
mkl-service-2.4.0 | py39h7f8727e_0 59 KB
mkl_fft-1.3.1 | py39hd3c417c_0 182 KB
mkl_random-1.2.2 | py39h51133e4_0 309 KB
numpy-1.23.4 | py39h14f4228_0 10 KB
numpy-base-1.23.4 | py39h31eccc5_0 6.7 MB
openssl-1.1.1s | h7f8727e_0 3.6 MB
scikit-learn-1.1.3 | py39h6a678d5_0 7.0 MB
scipy-1.9.3 | py39h14f4228_0 22.2 MB
------------------------------------------------------------
Total: 189.5 MB
The following NEW packages will be INSTALLED:
blas pkgs/main/linux-64::blas-1.0-mkl
fftw pkgs/main/linux-64::fftw-3.3.9-h27cfd23_1
intel-openmp pkgs/main/linux-64::intel-openmp-2021.4.0-h06a4308_3561
mkl pkgs/main/linux-64::mkl-2021.4.0-h06a4308_640
mkl-service pkgs/main/linux-64::mkl-service-2.4.0-py39h7f8727e_0
mkl_fft pkgs/main/linux-64::mkl_fft-1.3.1-py39hd3c417c_0
mkl_random pkgs/main/linux-64::mkl_random-1.2.2-py39h51133e4_0
numpy-base pkgs/main/linux-64::numpy-base-1.23.4-py39h31eccc5_0
The following packages will be REMOVED:
libblas-3.9.0-16_linux64_openblas
libcblas-3.9.0-16_linux64_openblas
liblapack-3.9.0-16_linux64_openblas
The following packages will be SUPERSEDED by a higher-priority channel:
ca-certificates conda-forge::ca-certificates-2022.12.~ → pkgs/main::ca-certificates-2022.10.11-h06a4308_0
certifi conda-forge/noarch::certifi-2022.12.7~ → pkgs/main/linux-64::certifi-2022.12.7-py39h06a4308_0
numpy conda-forge::numpy-1.24.0-py39h223a67~ → pkgs/main::numpy-1.23.4-py39h14f4228_0
openssl conda-forge::openssl-1.1.1s-h0b41bf4_1 → pkgs/main::openssl-1.1.1s-h7f8727e_0
scikit-learn conda-forge::scikit-learn-1.2.0-py39h~ → pkgs/main::scikit-learn-1.1.3-py39h6a678d5_0
scipy conda-forge::scipy-1.9.3-py39hddc5342~ → pkgs/main::scipy-1.9.3-py39h14f4228_0
Proceed ([y]/n)?
  • The plan lists the packages that will be installed as part of the MKL implementation of BLAS.
  • The OpenBLAS implementation will be removed.
  • The download (tarball) for the mkl package alone is 142.6 MB.
  • Now check the size of the whole environment again after cleaning the cache, the index cache, and the tarballs.
  • As you can see, the environment size jumped from 1.2 GB to 2.1 GB, a 900 MB difference caused by the MKL implementation of BLAS.
  • However, the NumPy documentation also mentions performance differences between OpenBLAS and MKL that favor MKL, so take this into consideration when deciding which one to install.
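The download sizes in the package plan above can be added up to see how much of the download is MKL itself. This is a small worked example with the numbers copied from the plan; treating 1 MB as 1024 KB is an assumption about how conda rounds, so the sum may differ slightly from the reported total of 189.5 MB.

```python
# Download sizes copied from the package plan, as (value, unit) pairs.
DOWNLOADS = {
    "blas": (6, "KB"),
    "ca-certificates": (124, "KB"),
    "certifi": (150, "KB"),
    "fftw": (2.3, "MB"),
    "intel-openmp": (4.2, "MB"),
    "mkl": (142.6, "MB"),
    "mkl-service": (59, "KB"),
    "mkl_fft": (182, "KB"),
    "mkl_random": (309, "KB"),
    "numpy": (10, "KB"),
    "numpy-base": (6.7, "MB"),
    "openssl": (3.6, "MB"),
    "scikit-learn": (7.0, "MB"),
    "scipy": (22.2, "MB"),
}

# Assumption: 1 MB = 1024 KB.
UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1.0}


def to_mb(value: float, unit: str) -> float:
    """Convert a (value, unit) pair to megabytes."""
    return value * UNIT_TO_MB[unit]


total_mb = sum(to_mb(*size) for size in DOWNLOADS.values())
mkl_share = to_mb(*DOWNLOADS["mkl"]) / total_mb

print(f"total download: {total_mb:.1f} MB")
print(f"mkl share of the download: {mkl_share:.0%}")
```

These are compressed download sizes, which is why they are much smaller than the roughly 900 MB growth measured on disk.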

Installing numpy for the first time

  • As we saw, installing numpy from conda-forge comes with the OpenBLAS implementation by default. However, if you install from a channel that defaults to MKL, such as the defaults channel, and want numpy without MKL, you can use the following command:
conda install nomkl numpy
  • We have now seen how much disk space cleaning the cache and the tarballs can save, and how the BLAS implementation behind numpy can change the size of the environment by almost 900 MB.

You might want to know:

What is a shared object?

  • Shared object files are used to package compiled code that can be used by multiple programs. When a program uses a shared object, the operating system loads the shared object into memory and resolves the symbols (function and variable names) that are needed by the program. This allows programs to use code from shared objects without having to statically link the code into the program at compile time.
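Loading a shared object at runtime can be illustrated with Python's ctypes module. The sketch below is an illustration under assumptions, not something the article relies on: it loads the C math library (libm) and calls its cos function, and it returns None on platforms where that library cannot be located or loaded. The function name load_c_cos is hypothetical.

```python
import ctypes
import ctypes.util
from typing import Callable, Optional


def load_c_cos() -> Optional[Callable[[float], float]]:
    """Load cos() from the C math shared library at runtime, if available."""
    lib_name = ctypes.util.find_library("m")  # locate the math shared library
    if lib_name is None:
        return None  # no separate math library found on this platform
    try:
        libm = ctypes.CDLL(lib_name)  # the OS loader maps the shared object
    except OSError:
        return None  # the library was located but could not be loaded
    # Declare the C signature: double cos(double)
    libm.cos.argtypes = [ctypes.c_double]
    libm.cos.restype = ctypes.c_double
    return libm.cos


if __name__ == "__main__":
    c_cos = load_c_cos()
    if c_cos is None:
        print("C math library not found")
    else:
        print(c_cos(0.0))
```

NumPy reaches OpenBLAS or MKL through the same kind of dynamic loading, which is why the BLAS implementation can be swapped without rebuilding NumPy-dependent code.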
