How to Use Jupyter
Jupyter is a frequently used platform to do data science and HEP is no exception. Using Jupyter on the cluster here can be done without too much hassle; however, if the amount of data you are analyzing is small enough to fit on your desktop/laptop, you will see better response performance if you simply copy the data there to analyze.
Start Up
First you need to install various python
libraries that will allow you to run jupyter.
This should be done on the cluster where jupyter will be run. You will see improved performance of JupyterLab
if you install it into a python virtual environmenton /export/scratch
.
This is because /export/scratch
is not network mounted.
cd /export/scratch/users/
mkdir $USER
cd $USER
python3 -m venv pyvenv --prompt jlab
source pyvenv/bin/activate
pip install --upgrade pip
pip install --upgrade setuptools
pip install --upgrade jupyterlab
Next, on your desktop/laptop update the SSH configuration settings to connect the port on the cluster that jupyter will talk to to the same port on your laptop.
# inside of ~/.ssh/config on your desktop/laptop
Host <shortname> :
User <umn-username>
HostName <full-computer-name>.spa.umn.edu
LocalForward 1234 localhost:1234
Finally, go to the cluster and launch jupyter to the same port that you put into your SSH configuration.
ssh <shortname>
cd <working directory>
# if you installed it to /export/scratch you need to re-enter the python virtual environment
source /export/scratch/users/$USER/pyvenv/bin/activate
jupyter lab --no-browser --port 1234
This last command will print out a few links which you can click on and open in the browser on your desktop/laptop.
Comments
- Ports are shared between all users on a given computer, so the specific port you choose may be used by another person. In this case, jupyter will add one to the port number until an available port is found. This will cause you to not be able to connect to the jupyter session anymore since your SSH configuration points to one port while jupyter points to a different one. You can resolve this by either changing nodes or updating yoru SSH configuration.
- Jupyter runs from within a working directory and makes a lot of I/O operations in order to save progress and render
images. This means running jupyter from a network-attached filesystem will be noticeably slow. It is suggested to
put your jupyter notebooks into
/export/scratch/...
and back them up to GitHub similar to other code.
Newer Python
There are two methods available on the cluster for obtaining a different version of Python compared to the version installed at the system level. Either may fit your use case.
Since Jupyter is running on the cluster, the other steps from the start-up section above are not changed. The only steps that are changed are how jupyter is installed and how it is run.
CVMFS
The cluster has access to CVMFS which is a way to distribute pre-built software
that was developed at CERN for the large experiments like CMS.
This is a helpful method because it contains a variety of pre-built Python versions while also allowing you
to have access to the system libraries that do not interfere with this pre-built Python version.
One example is LaTeX: one could use this method to enable plotting using a newer version of matplotlib
(or some other Python package) while also allowing for matplotlib to access the system installation
of latex
for constructing any equation-based labels.
The major downside of this method is that it is often difficult to find what Python verisons are
available and how to activate them - below is just an example to help guide you and may not work
out of the box.
# setup Python 3.9.14 from CVMFS
. /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/python3/3.9.14-4612d00f9f0430a19291545f1e47b4a4/etc/profile.d/init.sh
# initialize a python venv with this python version
python3 -m venv venv
# make sure the original init is always source when the venv is activated
# the follow prepends the source command above into the venv activation file
sed -i.bak \
'1s|^|. /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/python3/3.9.14-4612d00f9f0430a19291545f1e47b4a4/etc/profile.d/init.sh\n|' \
venv/bin/activate
Then you can install jupyterlab (along with any other python packages you want) after activating the venv.
. venv/bin/activate
pip install jupyterlab # and anything else
and run it only after activating the venv.
. venv/bin/activate
jupyter lab --no-browser --port 1234
Containers
We can use containers in order to aquire a newer python version that isn't currently available on the cluster. The example below uses denv similar to the case study which focused more on using command line tools.
The benefit of using containers is that they provide a truly isolated environment and, specifically for Python,
an image is built for each Python release so you can pick whatever Python release you wish.
No packages from the system clutter the environment which is nice for reproducibility;
however, it may mean you "lose" access to certain packages from the host system (like latex
in the example above).
Rather than using a virtual environment in /export/scratch
, we can create a denv within /export/scratch
referencing the newest python version available. Note: before using denv
(or any containers), make
sure to move the caching directory to a larger directory than your home (e.g. with export APPTAINER_CACHEDIR=/export/scratch/users/${USER}
).
cd /export/scratch/users/${USER}
denv init python:3 # or whatever version you want https://hub.docker.com/_/python/tags
denv python3 -m pip install --user --upgrade jupyterlab
Then, whenever you wish to launch jupyter lab you just need to prefix the program with denv
.
cd /export/scratch/users/${USER}
denv jupyter lab --port 1234