Difference between revisions of "Managing Python environments and Jupyter kernels"

From UFRC

Revision as of 15:40, 24 May 2022

Managing project-specific application Python environments

Background

Many projects that use Python code require careful management of the respective Python environments. Rapid changes in package dependencies, package version conflicts, deprecation of APIs (function calls) by individual projects, and obsolescence of system drivers and libraries make it virtually impossible to use an arbitrary set of packages or create one all-encompassing environment that will serve everyone's needs over long periods of time. The high velocity of changes in the popular ML/DL frameworks and packages and GPU computing exacerbates the problem.

<img src="python_environment.png" alt="Python environment conundrum" width='200' align="right">

The problem with pip install

Most guides and project documentation for installing Python packages recommend using pip install for package installation. While pip is easy to use and works for many use cases, there are some major drawbacks. If you have spent any time working in Python, you will likely have seen (and may have run) suggestions to pip install ____, or within Jupyter !pip install ____, to install one or more packages. There are a few issues with doing pip install on a supercomputer like HiPerGator, though:

  • Pip by default installs binary packages (wheels), which are often built on systems incompatible with HiPerGator. If you pip install a package and attempt to import it you might see an error about missing symbols or GLIBC version.
  • Pip install of a package with no binary distribution (wheel) will attempt to build a package from source, but that build will likely fail without additional configuration.
  • If you pip install a package that is already installed or will be later installed in an environment provided by UFRC, your version will take precedence over the packages installed in an environment provided by an environment module (or Jupyter kernel). Eventually package dependencies will become incompatible and you will encounter installation errors, import errors, or missing or wrong function calls (API changes). An innocuous pip install of a single package can result in a drastic change of the environment rendering it unusable.
  • Different packages may require different versions of the same package as dependencies, leading to installation scenarios that are impossible to reconcile. This is a challenge to manage with pip, as it has no method to swap active versions.
  • On its own, `pip` installs **everything** in one location: ~/.local/lib/python3.X/site-packages/. All packages installed are in the same location for any given version of Python.
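To see which per-user directory applies to the Python interpreter you are running, you can query the standard library `site` module. This is a minimal sketch; the helper name `user_site_info` is made up for illustration and is not part of any UFRC tooling.

```python
import site
import sys


def user_site_info():
    """Return the per-user site-packages directory and whether it is enabled."""
    return {
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
        # Directory used for per-user installs,
        # e.g. ~/.local/lib/python3.X/site-packages on Linux
        "user_site": site.getusersitepackages(),
        # ENABLE_USER_SITE is False or None when the user site is disabled
        # (for example when Python is started with the -s option)
        "enabled": bool(site.ENABLE_USER_SITE),
    }


if __name__ == "__main__":
    info = user_site_info()
    print(f"Python {info['python_version']} user site: {info['user_site']}")
    print(f"User site enabled: {info['enabled']}")
```

Running this with different Python versions shows that each minor version has its own single directory, shared by every project that uses that version.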

Conda and Mamba to the rescue!

<img src='https://mamba.readthedocs.io/en/latest/_static/logo.png' alt='Mamba logo' width='200' align='right'>

conda and mamba, its newer, faster, drop-in replacement, were written to solve some of these issues. They represent a higher level of packaging abstraction that can combine compiled packages, applications, and libraries as well as pip-installed Python packages. They also allow easier management of project-specific environments and switching between environments as needed. They make it much easier to report the exact configuration of packages in an environment, facilitating reproducibility (recreation of an environment on a different system). Moreover, conda environments don't even have to be activated to be used. In most cases, adding the path to the conda environment's 'bin' directory to the $PATH in the shell environment is sufficient for using them.
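As a rough illustration of using an environment without activating it, the sketch below builds a copy of the process environment with the environment's 'bin' directory placed first on PATH. The helper name `env_with_conda_bin` and the example path are assumptions made for illustration; substitute the path to your own environment.

```python
import os
import shutil


def env_with_conda_bin(prefix, base_env=None):
    """Return a copy of the environment with <prefix>/bin first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    bin_dir = os.path.join(prefix, "bin")
    old_path = env.get("PATH", "")
    # Prepend so that executables in the conda environment are found first
    env["PATH"] = bin_dir + os.pathsep + old_path if old_path else bin_dir
    return env


if __name__ == "__main__":
    # Hypothetical environment path; replace with your own
    prefix = "/blue/mygroup/share/project42/conda"
    env = env_with_conda_bin(prefix)
    # Show which python executable the modified PATH resolves to
    print(shutil.which("python", path=env["PATH"]))
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` to launch a program from that environment.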

Check out the [UFRC Help page on conda](https://help.rc.ufl.edu/doc/Conda) for additional information.

A caveat

conda and mamba get packages from channels, or repositories of prebuilt packages. While there are several available channels, like the main conda-forge, not every Python package is available from a conda channel, as packages have to be packaged for conda first. You may still need to use pip to install some packages, as noted later. However, conda still helps manage environments by installing packages into separate directory trees rather than trying to install all packages into a single folder as pip does.

Getting started: Conda Configuration

The ~/.condarc configuration file

conda's behavior is controlled by a configuration file in your home directory called .condarc. The dot at the start of the name means that the file is hidden from the 'ls' file listing command by default. If you have not run conda before, you won't have this file. Whether the file exists or not, the steps here will help you modify the file to work best on HiPerGator. The first time you load the conda environment module on HiPerGator, it will put the current best practice .condarc into your home directory.

conda package cache location

conda caches (keeps a copy of) all downloaded packages by default in the ~/.conda/pkgs directory tree. If you install a lot of packages, you may end up filling up your home quota. You can change the default package cache path. To do so, add or change the pkgs_dirs setting in your ~/.condarc configuration file e.g.:

pkgs_dirs:
  - /blue/mygroup/share/pkgs

or

  - /blue/mygroup/$USER/pkgs

Replace mygroup with your actual group name.

conda environment location

conda puts all packages installed in a particular environment into a single directory. By default, named conda environments are created in the ~/.conda/envs directory tree. They can quickly grow in size and, especially if you have many environments, fill the 40GB home directory quota. For example, the environment we will create in this training is 5.3GB in size. As such, it is important to use path based (conda create -p PATH) conda environments, which allow you to use any path for a particular environment. For example, you can keep a project-specific conda environment close to the project data in `/blue/`, where your group has terabyte(s) of space.
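To check how much of your home quota the default conda directories are using, a short script like the following can help. It is a sketch only: the helper name `dir_size_bytes` is made up for illustration, and the total is an apparent size, so files that are hard-linked into several directories are counted each time they appear.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # File disappeared or is unreadable; skip it
                pass
    return total


if __name__ == "__main__":
    # Default conda locations in the home directory
    for sub in ("envs", "pkgs"):
        path = os.path.join(os.path.expanduser("~"), ".conda", sub)
        if os.path.isdir(path):
            print(f"{path}: {dir_size_bytes(path) / 1e9:.2f} GB")
        else:
            print(f"{path}: not present")
```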

You can also change the default path for the _named_ environments (conda create -n NAME) if you prefer to keep all conda environments in the same directory tree. To do so, add or change the envs_dirs setting in the ~/.condarc configuration file e.g.:

envs_dirs:
  - /blue/mygroup/share/envs

or

  - /blue/mygroup/$USER/envs

Replace mygroup with your actual group name.

Instructions for editing your ~/.condarc file

One way to edit your ~/.condarc file is to type: `nano ~/.condarc`

If the file is empty, paste in the text below, editing the envs_dirs and pkgs_dirs settings as below. If the file has contents, update those lines.

Your ~/.condarc should look something like this when you are done editing (again, replacing group and user in the paths with your group and username).
channels:
- conda-forge
- bioconda
- defaults
envs_dirs:
- /blue/group/user/conda/envs
pkgs_dirs:
- /blue/group/user/conda/pkgs
auto_activate_base: false
auto_update_conda: false
always_yes: false
show_channel_urls: false

Create your first environment

Load the conda module

Before we can run conda or mamba on HiPerGator, we need to load the conda module:

module load conda

Create your first environment

Create a _name based_ environment

To create your first _name based_ conda environment (see the path based instructions below), run the following command. In this example, I am creating an environment named hfrl:

mamba create -n hfrl

Here's a screenshot of the output from running that command. Yours should look similar.

Mamba create.png
**Note:** You do not need to manually create the folders that you set up in step 1. `mamba` will take care of that for you.

Create a _path based_ environment

To create a _path based_ conda environment, use the '-p PATH' argument: mamba create -p PATH, e.g. mamba create -p /blue/mygroup/share/project42/conda

Activate the new environment

To activate our environment (whether created with `mamba` or `conda`), we use the `conda activate env_name` command. Let's activate our new environment:

`conda activate hfrl` or `conda activate /blue/mygroup/share/project42/conda`

Notice that your command prompt changes when you activate an environment to indicate which environment is active, showing that in parentheses before the other information:

> `(hfrl) [magitz@c0907a-s23 magitz]$ `
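Besides the prompt, you can confirm from inside Python which environment is in effect. `conda activate` sets the `CONDA_PREFIX` and `CONDA_DEFAULT_ENV` environment variables, and `sys.prefix` shows where the running interpreter lives. The sketch below compares the two; the helper name `active_environment` is made up for illustration. Note that if the activated environment does not have Python installed yet, `python` resolves to an interpreter elsewhere and the two prefixes will not match.

```python
import os
import sys


def active_environment():
    """Report the activated conda environment and the running interpreter."""
    conda_prefix = os.environ.get("CONDA_PREFIX")
    matches = (
        conda_prefix is not None
        and os.path.realpath(conda_prefix) == os.path.realpath(sys.prefix)
    )
    return {
        # Where the running Python interpreter is installed
        "sys_prefix": sys.prefix,
        # Set by `conda activate`; None when no environment is active
        "conda_prefix": conda_prefix,
        "conda_env_name": os.environ.get("CONDA_DEFAULT_ENV"),
        # True only when this interpreter belongs to the activated environment
        "interpreter_in_active_env": matches,
    }


if __name__ == "__main__":
    for key, value in active_environment().items():
        print(f"{key}: {value}")
```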