JupyterHub on Kubernetes Deployment Tutorial

Overview

JupyterHub is a multi-user Jupyter Notebook server that lets teams share compute resources. This article walks through deploying JupyterHub on a Kubernetes cluster and integrating bioinformatics tools such as Spark, Hail, and GenoDig.

Project Structure

The project consists of three main repositories:

  1. jhub_on_k8s: core JupyterHub deployment configuration
  2. kubernetes-aws: AWS infrastructure configuration
  3. genodig-deploy: deployment of the genome discovery application

Deployment Architecture

Core Components

  1. JupyterHub Hub: user authentication and spawning center
  2. Configurable HTTP Proxy: routing proxy
  3. Single-user Notebook Servers: per-user notebook environments
  4. GitLab OAuth: user identity verification

Environment Preparation

1. Kubernetes Cluster Requirements

  • Kubernetes 1.14+
  • Helm 2.14.3+
  • A storage class (e.g. AWS EBS)
  • An Ingress controller

2. Installing Dependencies

# Install kubectl
curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

# Install Helm
curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
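The config.prod.yaml in Step 2 needs a proxy.secretToken and an auth.state.cryptoKey, each a 32-byte random hex string. A minimal sketch for generating them with Python's standard library (equivalent to `openssl rand -hex 32`):

```python
import secrets

# 32 random bytes, hex-encoded -> 64 hex characters,
# suitable for proxy.secretToken and auth.state.cryptoKey
secret_token = secrets.token_hex(32)
crypto_key = secrets.token_hex(32)

print(secret_token)
print(crypto_key)
```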

Detailed Deployment Steps

Step 1: Create the Namespace and Service Account

apiVersion: v1
kind: Namespace
metadata:
  name: jhub
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: jhub
  namespace: jhub

Step 2: Prepare the Configuration Files

2.1 Production configuration (config.prod.yaml)
proxy:
  secretToken: "your-secret-token"
  nodeSelector:
    kops.k8s.io/instancegroup: nodes

hub:
  nodeSelector:
    kops.k8s.io/instancegroup: nodes
  db:
    type: sqlite-pvc
    pvc:
      storage: 1Gi
      storageClassName: "gp2-topology"
  extraConfig: |
    import os
    c.JupyterHub.authenticator_class = 'oauthenticator.gitlab.GitLabOAuthenticator'
    c.GitLabOAuthenticator.oauth_callback_url = "http://your-domain.com/hub/oauth_callback"
    c.GitLabOAuthenticator.client_id = "your-client-id"
    c.GitLabOAuthenticator.client_secret = "your-client-secret"
    
    async def add_auth_env(spawner):
      auth_state = await spawner.user.get_auth_state()
      if not auth_state:
          return
      spawner.environment['GITLAB_ACCESS_TOKEN'] = auth_state['access_token']
      spawner.environment['GITLAB_USER_LOGIN'] = auth_state['gitlab_user']['username']
    c.KubeSpawner.pre_spawn_hook = add_auth_env

auth:
  state:
    enabled: true
    cryptoKey: "your-crypto-key"

ingress:
  enabled: true
  hosts: ['your-domain.com']
  tls: []

singleuser:
  defaultUrl: "/lab"
  nodeSelector:
    usage: jhub
  image:
    name: your-registry/jupyterhub/k8s-singleuser-sample
    tag: latest
  lifecycleHooks:
    postStart:
      exec:
        command:
          - "sh"
          - "-c"
          - |
            # initialization script
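The add_auth_env hook above only runs inside the hub process, but its logic can be exercised in isolation with stand-in objects. A sketch of such a check (FakeUser and FakeSpawner are hypothetical stubs, not KubeSpawner classes):

```python
import asyncio

async def add_auth_env(spawner):
    # Same logic as the hook in config.prod.yaml: copy the GitLab
    # auth state into the spawned pod's environment variables.
    auth_state = await spawner.user.get_auth_state()
    if not auth_state:
        return
    spawner.environment['GITLAB_ACCESS_TOKEN'] = auth_state['access_token']
    spawner.environment['GITLAB_USER_LOGIN'] = auth_state['gitlab_user']['username']

class FakeUser:
    # Stub standing in for the JupyterHub user object.
    async def get_auth_state(self):
        return {'access_token': 'tok123',
                'gitlab_user': {'username': 'alice'}}

class FakeSpawner:
    # Stub standing in for KubeSpawner.
    def __init__(self):
        self.user = FakeUser()
        self.environment = {}

spawner = FakeSpawner()
asyncio.run(add_auth_env(spawner))
print(spawner.environment)
# {'GITLAB_ACCESS_TOKEN': 'tok123', 'GITLAB_USER_LOGIN': 'alice'}
```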
2.2 Test configuration (config.test.yaml)

Similar to the production configuration, but with lower resource limits and different domain and authentication settings.

Step 3: Build Custom Notebook Images

3.1 Base image (Dockerfile_base)
FROM jupyter/minimal-notebook:base
USER root

# Install system dependencies
RUN apt-get update && apt-get install -y \
    openjdk-8-jdk \
    vim \
    less \
    g++ \
    git \
    && rm -rf /var/lib/apt/lists/*

USER $NB_USER

# Install Python packages
RUN pip install jupyterlab==1.2.4 \
    && jupyter labextension install jupyterlab-drawio \
    && pip install torch \
    && pip install nbgitpuller \
    && jupyter serverextension enable --py nbgitpuller --sys-prefix

WORKDIR /home/jovyan/work
3.2 Full production image (Dockerfile)
FROM your-registry/jupyterhub/k8s-singleuser-sample:base_01

# Build-time arguments (component versions)
ARG glow_version=0.2.0
ARG delta_version=0.5.0

USER root

# Create the directory structure
RUN mkdir -p /opt/soft/{hadoop,spark,hail} \
    && mkdir -p /usr/share/aws \
    && mkdir -p /mnt/s3

# Copy Hadoop/Spark components
COPY ./jars/hadoop /opt/soft/hadoop
COPY ./jars/spark /opt/soft/spark
COPY ./spark/spark-env.sh /opt/soft/spark/conf/

# Install Python packages and extensions
USER $NB_USER
RUN pip install --upgrade hail \
    && pip install glow.py==${glow_version} \
    && pip install --upgrade genodig \
    && pip install --upgrade genodig_core

# Set environment variables
ENV SPARK_HOME=/opt/soft/spark \
    HADOOP_HOME=/opt/soft/hadoop \
    PATH=$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin \
    PYSPARK_PYTHON="/opt/anaconda/bin/python" \
    PYSPARK_DRIVER_PYTHON="/opt/conda/bin/python"

WORKDIR /home/jovyan/work

Step 4: Deploy JupyterHub

4.1 Deploy with Helm

Create a deployment Job (deploy.yaml):

apiVersion: batch/v1
kind: Job
metadata:
  name: jhubstart-job{{version}}
spec:
  ttlSecondsAfterFinished: 300
  template:
    spec:
      serviceAccountName: jhub
      containers:
      - name: jhubrestart
        image: lachlanevenson/k8s-helm:v2.14.3
        args: ["upgrade", "--install", "jhub", "/tmp/jupyterhub", "--namespace", "jhub", "--version=0.8.2", "--values", "/tmp/config.yaml"]
      restartPolicy: Never
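The Job name contains a `{{version}}` placeholder; Kubernetes requires Job names to be unique, so the placeholder must be substituted before `kubectl apply`. A minimal sketch of that substitution (the placeholder syntax comes from deploy.yaml above; the rendering helper itself is an assumption about how it is filled in):

```python
def render(template: str, version: str) -> str:
    # Substitute the {{version}} placeholder used in deploy.yaml
    # so each deployment run gets a uniquely named Job.
    return template.replace("{{version}}", version)

manifest = "name: jhubstart-job{{version}}"
print(render(manifest, "-v42"))
# name: jhubstart-job-v42
```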
4.2 Node scheduling patch

Add a node selector and toleration to the hub and proxy deployments:

patch_hub.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: jhubstartpatchhub-job{{version}}
spec:
  template:
    spec:
      serviceAccountName: jhub
      containers:
      - name: jhubstartpatchhub
        image: lachlanevenson/k8s-kubectl:v1.14.10
        args: ['-n', 'jhub', 'patch', 'deployment', 'hub', '--patch', '{"spec":{"template":{"spec":{"nodeSelector":{"kops.k8s.io/instancegroup":"nodes"},"tolerations":[{"key":"sjzn-proj","effect":"NoSchedule"}]}}}}']
      restartPolicy: Never
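The `--patch` argument embeds a JSON document in a shell-quoted string, which is easy to get wrong. A sketch that validates the patch with Python's json module before it goes into the Job spec:

```python
import json

# The exact patch string from patch_hub.yaml above.
patch = ('{"spec":{"template":{"spec":{'
         '"nodeSelector":{"kops.k8s.io/instancegroup":"nodes"},'
         '"tolerations":[{"key":"sjzn-proj","effect":"NoSchedule"}]}}}}')

# json.loads raises ValueError on malformed JSON, catching quoting
# mistakes before the patch ever reaches kubectl.
doc = json.loads(patch)
spec = doc["spec"]["template"]["spec"]
print(spec["nodeSelector"]["kops.k8s.io/instancegroup"])  # nodes
print(spec["tolerations"][0]["effect"])                   # NoSchedule
```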

Step 5: Configure the Spark Environment

5.1 Spark environment settings (spark-env.sh)
export SPARK_HOME=/opt/soft/spark
export HADOOP_HOME=/opt/soft/hadoop
export HADOOP_CONF_DIR=/opt/soft/hadoop/etc/hadoop/
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/jovyan/.ivy2/jars/*
export PYSPARK_PYTHON='/opt/anaconda/bin/python'
export PYSPARK_DRIVER_PYTHON='/opt/conda/bin/python'
5.2 Extra Spark settings (spark-extra.conf)
spark.jars /opt/soft/hail/hail-all-spark.jar
spark.pyspark.python /opt/anaconda/bin/python
spark.pyspark.driver.python /opt/conda/bin/python
spark.speculation true
spark.speculation.interval 100
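spark-extra.conf uses Spark's properties-file format: one `key value` pair per line, split on the first whitespace. A sketch of how such a file is read into a dict (illustrating the format; this is not Spark's actual parser):

```python
conf_text = """\
spark.jars /opt/soft/hail/hail-all-spark.jar
spark.pyspark.python /opt/anaconda/bin/python
spark.pyspark.driver.python /opt/conda/bin/python
spark.speculation true
spark.speculation.interval 100
"""

def parse_spark_conf(text: str) -> dict:
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Key is everything before the first space, value is the rest.
        key, _, value = line.partition(" ")
        props[key] = value.strip()
    return props

props = parse_spark_conf(conf_text)
print(props["spark.speculation"])     # true
print(props["spark.pyspark.python"])  # /opt/anaconda/bin/python
```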

Step 6: Lifecycle Management

6.1 Post-start script (poststart.sh)
#!/bin/bash
# Sync the user's workspace down from S3
AWS_CLI=/home/jovyan/.local/bin/aws
if [ -f "$AWS_CLI" ]; then
    $AWS_CLI s3 sync s3://your-bucket/workspaces/jupyterhub/${JUPYTERHUB_USER}/ /home/jovyan/work/
fi

# Configure Git
if [ -x /usr/bin/git ]; then
    echo "https://oauth2:${GITLAB_ACCESS_TOKEN}@git.23cube.com" > ~/.git-credentials
    git config --global credential.helper store
    git config --global user.email "${GITLAB_USER_EMAIL}"
    git config --global user.name "${GITLAB_USER_LOGIN}"
fi
6.2 Pre-stop script (prestop.sh)
#!/bin/bash
# Back up the workspace to S3
AWS_CLI=/home/jovyan/.local/bin/aws
if [ -x "$AWS_CLI" ]; then
    ${AWS_CLI} s3 sync /home/jovyan/work/ s3://your-bucket/workspaces/jupyterhub/${JUPYTERHUB_USER}/ --exclude "hail*.log"
fi
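Both scripts derive a per-user S3 prefix from $JUPYTERHUB_USER, so each user's workspace survives pod restarts independently. A sketch of the same path construction in Python (the bucket name is the placeholder from the scripts above):

```python
def workspace_prefix(bucket: str, user: str) -> str:
    # Mirrors the path used by poststart.sh / prestop.sh:
    # s3://<bucket>/workspaces/jupyterhub/<user>/
    return f"s3://{bucket}/workspaces/jupyterhub/{user}/"

print(workspace_prefix("your-bucket", "alice"))
# s3://your-bucket/workspaces/jupyterhub/alice/
```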

Advanced Configuration

User Profile List

profileList:
  - display_name: "Spark && Hail && GD environment"
    description: "Full environment with Spark, Hail, and GenoDig"
    default: true
  - display_name: "Large resource environment"
    description: "Large-memory environment, 28 GB RAM"
    kubespawner_override:
      mem_limit: "28G"
      mem_guarantee: "28G"
  - display_name: "Minimal environment"
    description: "Minimal Python environment"
    kubespawner_override:
      image: jupyter/minimal-notebook:latest
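When a user picks a profile, each key in kubespawner_override is applied on top of the spawner's defaults. A sketch of that merge semantics with plain dicts (illustrative only, not KubeSpawner's actual implementation):

```python
# Defaults taken from the singleuser section of this config.
defaults = {
    "image": "your-registry/jupyterhub/k8s-singleuser-sample:latest",
    "mem_limit": "10G",
    "mem_guarantee": "2G",
}

# Override from the "Large resource environment" profile above.
profile_override = {"mem_limit": "28G", "mem_guarantee": "28G"}

# Each override key replaces the corresponding default;
# untouched keys (here: image) keep their values.
effective = {**defaults, **profile_override}
print(effective["mem_limit"])  # 28G
print(effective["image"])      # your-registry/jupyterhub/k8s-singleuser-sample:latest
```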

Resource Settings

singleuser:
  cpu:
    limit: 2
    guarantee: 0.5
  memory:
    limit: 10G
    guarantee: 2G
  startTimeout: 600
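The memory limit and guarantee use Kubernetes-style quantity strings, where a guarantee larger than the limit is a misconfiguration. A sketch that converts the suffixes used above to bytes for a sanity check (minimal parser covering only the suffixes in this document, not the full Kubernetes quantity grammar):

```python
def to_bytes(quantity: str) -> int:
    # Minimal parser for the decimal quantities used here ("10G", "2G").
    units = {"G": 10**9, "M": 10**6, "K": 10**3}
    if quantity[-1] in units:
        return int(float(quantity[:-1]) * units[quantity[-1]])
    return int(quantity)

limit, guarantee = to_bytes("10G"), to_bytes("2G")
assert guarantee <= limit  # guarantee must never exceed the limit
print(limit, guarantee)  # 10000000000 2000000000
```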

Monitoring and Maintenance

1. Check Deployment Status

# Check pod status
kubectl -n jhub get pods

# View logs
kubectl -n jhub logs deployment/hub
kubectl -n jhub logs deployment/proxy

# Check services
kubectl -n jhub get svc

2. Upgrade the Deployment

# Apply the updated configuration
kubectl apply -f deploy.yaml

# Restart components
kubectl -n jhub rollout restart deployment/hub
kubectl -n jhub rollout restart deployment/proxy

3. Troubleshooting

# View events
kubectl -n jhub get events

# Check the Ingress
kubectl -n jhub get ingress

# Check PVCs
kubectl -n jhub get pvc

Security Considerations

1. Secret Management

  • Store sensitive values in Kubernetes Secrets
  • Rotate OAuth client secrets regularly
  • Use IAM roles instead of hard-coded AWS credentials

2. Network Policy

networkPolicy:
  enabled: true
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
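Note that `cidr: 0.0.0.0/0` permits egress to anywhere, which largely defeats the policy. A sketch of a more restrictive variant, assuming standard Kubernetes NetworkPolicy egress-rule fields (the CIDR and port are placeholders to adapt to your network):

```yaml
networkPolicy:
  enabled: true
  egress:
    # Allow HTTPS to an internal range only,
    # instead of unrestricted 0.0.0.0/0 egress.
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8
      ports:
        - protocol: TCP
          port: 443
```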

Performance Tuning Suggestions

1. Resource Scheduling

  • Configure resource quotas per user group
  • Use node affinity and tolerations to optimize scheduling
  • Enable automatic scale-up and scale-down

2. Storage Optimization

  • Use an SSD storage class for better I/O performance
  • Back up user workspaces regularly
  • Set appropriate storage quotas

Summary

This article covered the full workflow for deploying JupyterHub on Kubernetes, including:

  1. Environment preparation and dependency installation
  2. Building custom notebook images
  3. Helm configuration and deployment
  4. Spark and bioinformatics tool integration
  5. User authentication and workspace management
  6. Monitoring and maintenance strategy

The result is a scalable, multi-user Jupyter environment that is well suited to data science and bioinformatics teams. With sensible resource limits and scheduling policies, the system stays stable and performant.
