JupyterHub Installation

Objective

We aim to deploy a JupyterHub providing a multi-user / multi-group Jupyter notebook service that supports multiple IPython kernels (from different Python toolchains) running on a distributed-memory HPC system, for accessing different computing resources (large-memory systems, multi-core compute nodes, etc.).

In particular, we want to deploy different instances of JupyterHub (with different user access lists) through different Python toolchains, each one capable of running each user's IPython kernel as a job in our HPC system (based on Slurm) under particular resource allocations. All this, keeping in mind the security concerns raised by running IPython kernels on compute nodes, since Jupyter notebooks give users access to a shell console on the compute node.

Rationale

We want to deploy the JupyterHub software on a dedicated server configured as a submitter node of the target HPC system. This means the JupyterHub server shall have access to all HPC filesystems, the private network and the authentication services (LDAP for instance). It shall also have access to the software toolchains deployed in the HPC system (at least all of them related to Python). In simple words, a user shall be able to submit a Python-based job into the cluster from this server using their preferred Python toolchain. This leads us to have each Python toolchain properly configured to run a JupyterHub instance (meaning Python 2 and Python 3 with all dependencies installed). For instance, the CMM's astrolab has its software toolchain in the HPC system (accessed via environment modules), and this toolchain provides all the Python 2/3 libraries and dependencies needed to run the notebooks self-contained within it. In this way, other user groups may have their own JupyterHub notebooks bound to the Python they use to work in the HPC system.

As the Jupyter notebook API gives users access to a shell console on the computer where the IPython kernel is running, we shall prevent access to this API from external networks, so that the system is protected against URL exploit attacks. Therefore, the JupyterHub service shall run locally on the server, bound to localhost or some other private network interface, and we use the reverse proxy feature of apache2 to provide SSL-based access to the JupyterHub service. Security is thus handled twofold: JupyterHub does the authentication (using PAM), and Apache takes care of secure network access. In addition, we want to provide an extra security layer to the different JupyterHubs served by Apache by using client SSL certificates to control who is establishing an SSL connection to the service, so that we can use the common names (CN) of the certificates to control and differentiate access to the Secure Socket Layer associated with each JupyterHub. An attacker will thus need a valid certificate in order to run a URL exploit attack. The drawback of this approach is having to administer a PKI (Public Key Infrastructure) in order to issue user certificates. We believe this is a reasonable cost given the security level the approach provides.

Requirements

From the rationale, we state the following requirements:

  • Multi-instance JupyterHub installation and control.
  • JupyterHub integration with Slurm.
  • Customizable kernel job submission.
  • A Public Key Infrastructure (PKI).
  • An easy way for users to request signed certificates.
  • A fast and easy-to-use signing tool.
  • An Apache SSL frontend for reverse proxying the JupyterHubs.
  • A directory of available hubs.

Setup

Base installation

The JupyterHub server base configuration is the following:

  • CentOS 7.3.
  • Public network configured.
  • Private HPC network configured on a single interface.
  • Slurm client configured for submitting, cancelling and querying jobs.
  • Environment modules access to HPC toolchains.

Basic JupyterHub installation

In order to bind a JupyterHub instance to a particular toolchain (and so permit multiple instances of JupyterHub), we need to ensure that the toolchain provides all the software required to run JupyterHub. For this case, we name the toolchain used for this instance "astro". We first ensure that Python 3 and NodeJS are installed properly, and then use pip3 to install JupyterHub.

$ module load astro
$ pip3 install jupyterhub
$ pip3 install jupyter
$ pip2 install jupyter

We install the Jupyter notebook in both Python 2 and Python 3 in order to provide both kernels to users.
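You can verify at any time which kernels are registered with the Jupyter installation (the path in the output below is just illustrative):

$ jupyter kernelspec list
Available kernels:
  python3    /home/apps/astro/share/jupyter/kernels/python3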

Next, we install the Slurm spawner for JupyterHub from the batchspawner JupyterHub Git repository.
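One way to install it, within the same toolchain, is with pip3 directly from the repository:

$ module load astro
$ pip3 install git+https://github.com/jupyterhub/batchspawner.git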

Then, we create a configuration directory for the different hubs, say /etc/jupyterhub, where we will put the hubs' configuration files.
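For instance, the directory can be created with:

$ mkdir -p /etc/jupyterhub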

Here is the template used for configuring a single instance. PRIVATE_IP is the IP address of the private network interface (participating in the private HPC system network). All compute nodes should have access to this IP.

port and hub_port should be unique for each JupyterHub instance; here we use the JupyterHub defaults, but when setting up an additional hub you should change these ports.

statsd_prefix, cookie_secret_file and db_url are also specific to a JupyterHub instance, so they need to point to unique files.

Finally, base_url and proxy_auth_token allow the hub to be served under a URL like https://jupyter.nlhpc.cl/astro and secure the NodeJS proxy (used by JupyterHub).

$ cat /etc/jupyterhub/jupyterhub_astro.py
## Example JupyterHub configuration for the astro toolchain
c = get_config()

# Network: the hub listens on the private HPC interface so compute nodes can reach it
c.JupyterHub.hub_ip = 'PRIVATE_IP'
c.JupyterHub.port = 8000
c.JupyterHub.hub_port = 8081

# Spawn each user's notebook server as a Slurm job
c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'

# Per-instance files and identifiers
c.JupyterHub.statsd_prefix = 'jupyterhub-astro'
c.JupyterHub.cookie_secret_file = '/var/run/jupyterhub/jupyterhub-astro_cookie_secret'
c.JupyterHub.db_url = 'sqlite:////var/run/jupyterhub/jupyterhub-astro.sqlite'

# Serve under https://jupyter.nlhpc.cl/astro and secure the NodeJS proxy
c.JupyterHub.base_url = '/astro'
c.JupyterHub.proxy_auth_token = 'astro-hub'

# Default Slurm resources requested for each notebook job
c.BatchSpawnerBase.req_nprocs = '4'
c.BatchSpawnerBase.req_partition = 'slims'
c.BatchSpawnerBase.req_memory = '4G'
c.BatchSpawnerBase.batch_script = """#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --output={homedir}/myjupyterhub_%j.log
#SBATCH --job-name=my_jupyterhub
#SBATCH --workdir={homedir}
#SBATCH --mem={memory}
#SBATCH --reservation=harvard
#SBATCH --export={keepvars}
#SBATCH --uid={username}
#SBATCH --get-user-env=L
#SBATCH {options}

# Load the toolchain so that the correct jupyterhub-singleuser is found
module load astro

# Log which binary is used, then start the single-user server with a
# runtime directory shared between the hub server and the compute nodes
which jupyterhub-singleuser
JUPYTER_RUNTIME_DIR=$HOME/.jupyter/rt {cmd}
"""

c.Authenticator.admin_users = {'user1','user2'}
c.Authenticator.whitelist = set(['user1','user2','student01','student02'])

Next, we configure the batchspawner module in order to define the job used to run the Jupyter notebook kernel on a compute node. More information about the options provided by this spawner can be found at https://github.com/jupyterhub/batchspawner.

It is important to load the proper module (astro in this case) before calling the jupyterhub-singleuser script, so that the correct one is invoked. We set the Jupyter runtime directory under the user's home directory since, by default, Jupyter uses a path under /var/run, which is not shared by all compute nodes (this directory needs to be shared by the JupyterHub server and the compute nodes).
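As a sanity check, you can ask Jupyter, with the variable set by hand, which runtime directory it will actually use (output shown for a hypothetical user user1):

$ JUPYTER_RUNTIME_DIR=$HOME/.jupyter/rt jupyter --runtime-dir
/home/user1/.jupyter/rt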

Finally, we set the administrators and the set of users allowed to use this hub instance.
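For reference, a second hub instance must override at least the ports and the per-instance files. A minimal sketch for a hypothetical astro-students hub (proxied on port 8010 in the Apache configuration further below; the hub_port value here is our own choice) would be:

c.JupyterHub.port = 8010
c.JupyterHub.hub_port = 8091
c.JupyterHub.base_url = '/astro-students'
c.JupyterHub.statsd_prefix = 'jupyterhub-astro-students'
c.JupyterHub.cookie_secret_file = '/var/run/jupyterhub/jupyterhub-astro-students_cookie_secret'
c.JupyterHub.db_url = 'sqlite:////var/run/jupyterhub/jupyterhub-astro-students.sqlite'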

Testing

To test our configuration, we can run JupyterHub manually, binding the hub IP to the public IP of the server, in the following way:

$ module load astro
$ jupyterhub --no-ssl --ip PUBLIC_IP -f /etc/jupyterhub/jupyterhub_astro.py
...

You can then access the JupyterHub at http://PUBLIC_IP:8000/astro/

Note that the --no-ssl flag is used intentionally in order to avoid the complexity of setting up the SSL layer at this point. As mentioned in the rationale, we will provide SSL at the Apache level, configured further on.

At the moment, the only kernel available is Python 3, so if a user wants to use a Python 2 kernel, they can open a terminal from the hub and issue the following commands:

$ module load astro
$ python2 -m ipykernel install --user

Then the user will have the Python 2 notebook available (that is why we need to install Jupyter for Python 2 as well).

Starting a JupyterHub as a service

As we are using a CentOS 7 based server, we configure our JupyterHub instance as a systemd service. The configuration templates are the following:

$ cat /lib/systemd/system/jupyterhub-astro.service 
[Unit]
Description=Jupyterhub-astro
After=syslog.target network.target

[Service]
User=root
EnvironmentFile=/etc/default/jupyterhub_astro
# systemd on CentOS 7 needs an absolute executable path; env resolves jupyterhub from the PATH set above
ExecStart=/usr/bin/env jupyterhub --no-ssl --ip 127.0.0.1 -f /etc/jupyterhub/jupyterhub_astro.py
WorkingDirectory=/var/run/jupyterhub_astro
ExecStartPost=/usr/bin/bash -c "/usr/bin/echo \"$URL,$DESCRIPTION\" >> /var/www/html/ws/hubs/astro.hub"
ExecStopPost=/usr/bin/rm /var/www/html/ws/hubs/astro.hub

[Install]
WantedBy=multi-user.target
$ cat /etc/default/jupyterhub_astro
# systemd performs no shell expansion in an EnvironmentFile, so paths are written out in full
PATH=/home/apps/astro/bin:/home/apps/astro/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/bin
LD_LIBRARY_PATH=/home/apps/astro/lib
ASTRO_HOME=/home/apps/astro
PKG_CONFIG_PATH=/home/apps/astro/lib/pkgconfig
DESCRIPTION="CMM's AstroInformatics Lab Hub"
URL="astro/"

Regarding this configuration, note the following points:

  1. We use an environment file that sets the same variables provided by the environment module astro. Just loading the module from this file does not work: an EnvironmentFile only contains variable assignments and is not executed as a shell script, so commands such as module load have no effect there, and shell variables like $PATH are not expanded.
  2. We bind the JupyterHub locally (127.0.0.1) in order to keep it hidden from public access. As mentioned before, we will use an Apache SSL reverse proxy to access it.
  3. We create and remove a description file in order to let a webpage determine which hubs are up and running (see the example below).
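With the environment file shown above, the description file for the astro hub contains a single URL,description line while the service is running:

$ cat /var/www/html/ws/hubs/astro.hub
astro/,CMM's AstroInformatics Lab Hub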

Once configured, you set the service to start automatically at boot time:

$ systemctl enable jupyterhub-astro.service

And start/stop the service as follows:

$ systemctl start jupyterhub-astro
$ systemctl stop jupyterhub-astro
$ systemctl restart jupyterhub-astro
$ systemctl status jupyterhub-astro

Public Key Infrastructure

There are many ways of setting up a PKI, so we aim to keep ours as simple as possible. Here we provide the skeleton we used to create our PKI; all these files should be located at /etc/pki.

From this template, we configure the pki.mk file with the information we want in our certificates. In addition, the pki/tls/openssl.cnf file should be reconfigured with our data; this file is used by the web-based certificate signing request.

$ cat /etc/pki/pki.mk 
DN="University of Chile"
ORG="Center for Mathematical Modeling"
OU="NLHPC"

ROOT_CN="JupyterHub Root CA"
SIGNING_CN="JupyterHub Signing CA"

PKI_ROOT=/etc/pki
CA_ROOT=$(PKI_ROOT)/ca

VALID_DAYS=3560
$ cat /etc/pki/tls/openssl.cnf
...
[ user_dn ]
0.domainComponent       = "University of Chile"
organizationName        = "Center for Mathematical Modeling"
organizationalUnitName  = "NLHPC"
...
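As an illustration of the DN these settings produce, a user-side certificate signing request for a hypothetical user user1 could be generated with openssl:

$ openssl req -new -newkey rsa:2048 -nodes -keyout user1.key -out user1.csr \
    -subj "/DC=University of Chile/O=Center for Mathematical Modeling/OU=NLHPC/CN=user1"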

Once the configuration files are done, we install the required files and create the root (self-signed) certificate from the /etc/pki directory.

$ make install
$ make root-ca

Then, we create our signing certificate, which we will use to sign server and user certificates.

$ make signing-ca

Your root and signing certificates will be located at /etc/pki/ca/root-ca and /etc/pki/ca/signing-ca respectively.
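You can check that the signing certificate was indeed issued by the root CA (assuming the Makefile names the certificate files root-ca.crt and signing-ca.crt inside those directories):

$ openssl verify -CAfile /etc/pki/ca/root-ca/root-ca.crt /etc/pki/ca/signing-ca/signing-ca.crt
/etc/pki/ca/signing-ca/signing-ca.crt: OK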

Next, you need to create a server certificate in order to configure SSL in Apache. For that, use:

$ make server

You will be asked for the Fully Qualified Domain Name (FQDN) of the server, in our case jupyter.nlhpc.cl. The generated certificate signing request (CSR) will be located at /etc/pki/certs. Then you will be asked for the signing-CA password in order to sign the CSR. After you provide the signing password, the server certificate and key will be located under /etc/pki/certs/, named after the FQDN of the server with .crt and .key extensions.
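To double-check the issued server certificate, you can inspect its subject, issuer and validity dates:

$ openssl x509 -in /etc/pki/certs/jupyter.nlhpc.cl.crt -noout -subject -issuer -dates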

Once the server certificate is created, we can configure the Apache SSL reverse proxy to serve our JupyterHubs.

Apache Configuration

We will set up Apache HTTP and HTTPS in order to allow users to download the root-ca and signing-ca bundle via HTTP, and only then allow them to request their personal certificates via HTTPS.
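On CentOS 7, make sure the web server and its SSL module are installed and enabled; mod_proxy and mod_proxy_wstunnel (needed for the websocket proxying below) ship with the base httpd package:

$ yum install -y httpd mod_ssl
$ systemctl enable httpd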

$ cat /etc/httpd/conf.d/jupyterhub.conf
<VirtualHost *:80>
    ServerName jupyter.nlhpc.cl

    ErrorLog "logs/jupyter_error_log"
    TransferLog "logs/jupyter_access_log"
    CustomLog "logs/jupyter_access_log" combined
    LogLevel warn

</VirtualHost>

<VirtualHost *:443>
    ServerName jupyter.nlhpc.cl

    ErrorLog "logs/jupyter_ssl_error_log"
    TransferLog "logs/jupyter_ssl_access_log"
    CustomLog "logs/jupyter_ssl_access_log" combined
    LogLevel warn

    SSLEngine on
    SSLProtocol all -SSLv2 -SSLv3

    SSLCipherSuite HIGH:MEDIUM:!aNULL:!eNULL:!EXPORT:!RC4:!MD5
    SSLCertificateFile /etc/pki/certs/jupyter.nlhpc.cl.crt
    SSLCertificateKeyFile /etc/pki/certs/jupyter.nlhpc.cl.key
    SSLCertificateChainFile /etc/pki/certs/jupyter-bundle.crt
    SSLCACertificateFile /etc/pki/tls/certs/jupyter-bundle.crt

    ProxyRequests off
    ProxyPreserveHost On

    <Location />
        SSLVerifyClient      optional
        SSLVerifyDepth       10
        SSLOptions           +FakeBasicAuth +StrictRequire +ExportCertData +StdEnvVars
        SSLRequireSSL
        SSLRequire           %{SSL_CIPHER_USEKEYSIZE} >= 128
        SSLRequire           %{SSL_CLIENT_S_DN_OU} eq "NLHPC"
        SSLRequire           %{SSL_CLIENT_I_DN_CN} eq "JupyterHub Signing CA"

        Satisfy              any

    </Location>	

    <Location /certs>
        SSLVerifyClient      none
        SSLVerifyDepth       3
        SSLOptions           +FakeBasicAuth +StrictRequire +ExportCertData +StdEnvVars
        SSLRequireSSL
        Satisfy              any
    </Location>	

    <Location /status>
        SSLVerifyClient      none
        SSLVerifyDepth       3
        SSLOptions           +FakeBasicAuth +StrictRequire +ExportCertData +StdEnvVars
        SSLRequireSSL
        Satisfy              any
    </Location>	

    <Location /status/cert-check>
        SSLVerifyClient      require
        SSLVerifyDepth       10
        SSLOptions           +FakeBasicAuth +StrictRequire +ExportCertData +StdEnvVars
        SSLRequireSSL
        SSLRequire           %{SSL_CIPHER_USEKEYSIZE} >= 128
        SSLRequire           %{SSL_CLIENT_S_DN_OU} eq "NLHPC"
        SSLRequire           %{SSL_CLIENT_I_DN_CN} eq "JupyterHub Signing CA"
    </Location>	

    # CMM's AstroInformatic Lab Hub
    <Location /astro>
        ProxyPass http://127.0.0.1:8000/astro
        ProxyPassReverse http://127.0.0.1:8000/astro

        SSLVerifyClient      require
        Satisfy all
    </Location>

    <LocationMatch "/astro/(user/[^/]*)/(api/kernels/[^/]+/channels|terminals/websocket)(.*)">
        ProxyPassMatch ws://127.0.0.1:8000/astro/$1/$2$3
        ProxyPassReverse ws://127.0.0.1:8000/astro/$1/$2$3

        SSLVerifyClient      require
        Satisfy all

    </LocationMatch>	

     # AstroStudents Hub
     <Location /astro-students>
        ProxyPass http://127.0.0.1:8010/astro-students
        ProxyPassReverse http://127.0.0.1:8010/astro-students
        SSLVerifyClient      require
        Satisfy all
     </Location>

     <LocationMatch "/astro\-students/(user/[^/]*)/(api/kernels/[^/]+/channels|terminals/websocket)(.*)">
        ProxyPassMatch ws://127.0.0.1:8010/astro-students/$1/$2$3
        ProxyPassReverse ws://127.0.0.1:8010/astro-students/$1/$2$3
        SSLVerifyClient      require
        Satisfy all
     </LocationMatch>	
</VirtualHost>

This configuration is mainly composed of three sections:

  1. The base site configuration.
  2. The status and check URLs, which require a different configuration.
  3. The JupyterHubs' reverse-proxy configuration.

The first one is similar to any other Apache SSL configuration. Note that you need to create a bundle file containing both the root-ca and the signing-ca certificates (referenced by the SSLCACertificateFile parameter). We locate this file under /etc/pki/certs only to keep the configuration clean.
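A minimal sketch of creating that bundle, assuming the CA certificate file names used earlier:

$ cat /etc/pki/ca/root-ca/root-ca.crt /etc/pki/ca/signing-ca/signing-ca.crt > /etc/pki/certs/jupyter-bundle.crt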

Some URLs require SSL client authentication, others require only SSL, and the root of the site accepts both; there the client certificate is used only to obtain the user's DN and show it in the webpage. This could be improved by splitting the site into two sub-sites, one with SSL client authentication and one without. We handle both at once with SSLVerifyClient optional, but if you consider this too relaxed, you can separate the sites without any problem. The /status and /status/cert-check URLs are used to determine whether the root CA and the user certificate are installed in the browser. If you try to access the HTTPS site without installing the root CA and you accept the security exception, you will reach the SSL site, but the user certificate verification may fail in some browsers. That is why it is important to enforce the installation of the root and signing CAs as a first step. At the moment we have no way to do this, but as soon as we figure out how to enforce it, we will update this page.
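A quick way to exercise the /status/cert-check URL from the command line, using a hypothetical signed client certificate user1.crt/user1.key and the CA bundle created above:

$ curl --cacert /etc/pki/certs/jupyter-bundle.crt \
       --cert user1.crt --key user1.key \
       https://jupyter.nlhpc.cl/status/cert-check

If the client certificate is missing or was not issued by the signing CA, Apache rejects the request instead.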

WebPage integration

The web page intends to provide a simple front-end for users to obtain their certificates, and a sort of "index" of which hubs are up and running.

(WIP)

User Signing Request

(WIP)