There are a few possible ways to provide CVMFS to the containers:
- Install CVMFS on every Kubernetes node and mount it into the containers.
- At CERN, use cvmfs-csi with a Kubernetes StorageClass (SC) and PersistentVolumeClaim (PVC) to mount CVMFS in the container.
After finishing the installation and configuration, please install the Kubernetes Python client:
pip install kubernetes
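As a quick sanity check that the client can reach your cluster, here is a minimal sketch (assuming a valid kubeconfig in the default location, e.g. ~/.kube/config) that lists the cluster nodes:
# minimal connectivity check with the Kubernetes Python client (illustrative only)
from kubernetes import client, config

config.load_kube_config()          # assumes kubeconfig at the default location
core_v1 = client.CoreV1Api()
for node in core_v1.list_node().items:
    print(node.metadata.name)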
We use the Job kind so that the Pod does not keep occupying resources after it has completed.
The following table shows the parameters that need to be modified:
Name | Description |
---|---|
metadata.name | Job name |
spec.template.spec.containers.name | Container name |
spec.template.spec.containers.image | Container image, e.g. atlasadc/atlas-grid-centos7 |
The command content will be executed to set up the container environment and then pull the pilot to run.
For example,
apiVersion: batch/v1
kind: Job
metadata:
  name: atlasadc-job
spec:
  template:
    spec:
      containers:
      - name: atlas-grid-centos7
        image: atlasadc/atlas-grid-centos7
        volumeMounts:
        - name: cvmfs
          mountPath: /cvmfs
        imagePullPolicy: IfNotPresent
        command:
        - "sh"
        - "-c"
        - >
          curl -4 https://bootstrap.pypa.io/get-pip.py -o /root/get-pip.py;
          if [ -s /root/get-pip.py ]; then
          python /root/get-pip.py;
          pip install requests subprocess32;
          fi;
          echo -e "export computingSite=$computingSite\nexport pandaQueueName=$pandaQueueName\nexport resourceType=$resourceType\nexport proxyContent='$proxyContent'\nexport workerID=$workerID\nexport logs_frontend_w=$logs_frontend_w\nexport logs_frontend_r=$logs_frontend_r\n" > /etc/profile.d/job-setup.sh;
          groupadd -g 1308 atlasprd;
          useradd -u 41000 -g 1308 atlasprd;
          su - atlasprd -c "cd /home/atlasprd; curl -4 https://raw.githubusercontent.com/HSF/harvester/master/pandaharvester/harvestercloud/k8s_startup_script.py -o /home/atlasprd/k8s_startup_script.py; python /home/atlasprd/k8s_startup_script.py";
        securityContext:
          allowPrivilegeEscalation: false
      restartPolicy: Never
      volumes:
      - name: cvmfs
        hostPath:
          path: /cvmfs
          type: Directory
Setting up more than one container in the YAML also works, but note that one worker is mapped to one container; the second (and any further) container set up in the YAML will be ignored.
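For illustration only (this is not the Harvester submitter code itself), the sketch below shows how such a YAML file can be submitted as a Job with the Kubernetes Python client installed above; the file name atlas_job.yaml and the default namespace are assumptions:
# illustrative sketch: submit the job YAML with the Kubernetes Python client;
# in production this is done by the Harvester K8s submitter plugin
import yaml
from kubernetes import client, config

config.load_kube_config()                      # kubeconfig location is an assumption
with open("atlas_job.yaml") as f:              # hypothetical file name
    job_body = yaml.safe_load(f)

batch_v1 = client.BatchV1Api()
resp = batch_v1.create_namespaced_job(namespace="default", body=job_body)
print("Created Job:", resp.metadata.name)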
If you run jobs on a Kubernetes cluster at CERN, please refer to the CERN cloud documentation to set up CVMFS: http://clouddocs.web.cern.ch/clouddocs/containers/tutorials/cvmfs.html. An example of setting up CVMFS on the nodes:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cvmfs-atlas
provisioner: csi-cvmfsplugin
parameters:
  repository: atlas.cern.ch
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cvmfs-sft
provisioner: csi-cvmfsplugin
parameters:
  repository: sft.cern.ch
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cvmfs-grid
provisioner: csi-cvmfsplugin
parameters:
  repository: grid.cern.ch
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-cvmfs-atlas-pvc
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-cvmfs-atlas
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-cvmfs-sft-pvc
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-cvmfs-sft
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-cvmfs-grid-pvc
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-cvmfs-grid
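Once the StorageClasses and PVCs are created, you can check that the CVMFS claims are bound, e.g. with kubectl get pvc or, sticking to the Python client, with the small sketch below (the default namespace is an assumption):
# illustrative check that the csi-cvmfs-*-pvc claims are Bound
from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()
for pvc in core_v1.list_namespaced_persistent_volume_claim("default").items:
    print(pvc.metadata.name, pvc.status.phase)   # expect "Bound" for each CVMFS PVC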
The job YAML then mounts these PVCs:
apiVersion: batch/v1
kind: Job
metadata:
  name: atlasadc-job
spec:
  template:
    spec:
      containers:
      - name: atlas-grid-centos7
        image: atlasadc/atlas-grid-centos7
        volumeMounts:
        - name: atlas
          mountPath: /cvmfs/atlas.cern.ch
        - name: sft
          mountPath: /cvmfs/sft.cern.ch
        - name: grid
          mountPath: /cvmfs/grid.cern.ch
        imagePullPolicy: IfNotPresent
        command:
        - "sh"
        - "-c"
        - >
          curl -4 https://bootstrap.pypa.io/get-pip.py -o /root/get-pip.py;
          if [ -s /root/get-pip.py ]; then
          python /root/get-pip.py;
          pip install requests subprocess32;
          fi;
          echo -e "export computingSite=$computingSite\nexport pandaQueueName=$pandaQueueName\nexport resourceType=$resourceType\nexport proxyContent='$proxyContent'\nexport workerID=$workerID\nexport logs_frontend_w=$logs_frontend_w\nexport logs_frontend_r=$logs_frontend_r\n" > /etc/profile.d/job-setup.sh;
          groupadd -g 1308 atlasprd;
          useradd -u 41000 -g 1308 atlasprd;
          su - atlasprd -c "cd /home/atlasprd; curl -4 https://raw.githubusercontent.com/HSF/harvester/master/pandaharvester/harvestercloud/k8s_startup_script.py -o /home/atlasprd/k8s_startup_script.py; python /home/atlasprd/k8s_startup_script.py";
        securityContext:
          allowPrivilegeEscalation: false
      restartPolicy: Never
      volumes:
      - name: atlas
        persistentVolumeClaim:
          claimName: csi-cvmfs-atlas-pvc
          readOnly: true
      - name: sft
        persistentVolumeClaim:
          claimName: csi-cvmfs-sft-pvc
          readOnly: true
      - name: grid
        persistentVolumeClaim:
          claimName: csi-cvmfs-grid-pvc
          readOnly: true
  backoffLimit: 0
Before starting the Kubernetes plugins, the module and class names should be set in $PANDA_HOME/etc/panda/panda_queueconfig.json. Some parameters also need to be adjusted:
Name | Description |
---|---|
proxySecretPath | Path of the proxy file inside the container. Works together with the k8s secret that manages the proxy |
x509UserProxy | Proxy file path on the Harvester node to pass to the container. Only used if proxySecretPath is NOT set |
cpuAdjustRatio | Ratio used to adjust the CPU request before the pod is created (default is 100; see the sketch after the example below) |
memoryAdjustRatio | Ratio used to adjust the memory request before the pod is created (default is 100) |
k8s_yaml_file | Path of the YAML file that defines a Kubernetes job |
k8s_config_file | Path of the configuration file for Kubernetes client authentication |
k8s_namespace | Namespace to use if you want to distinguish multiple teams or projects on the cluster. See: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/#when-to-use-multiple-namespaces |
For example,
"ANALY_TAIWAN_TEST": {
…...
"submitter": {
"name":"K8sSubmitter",
"module":"pandaharvester.harvestersubmitter.k8s_submitter",
"x509UserProxy": "/root/atlas-production.proxy",
"cpuAdjustRatio": 90,
"memoryAdjustRatio": 100
…...
"monitor": {
"name":"K8sMonitor",
"module":"pandaharvester.harvestermonitor.k8s_monitor"
},
"sweeper": {
"name": "K8sSweeper",
"module": "pandaharvester.harvestersweeper.k8s_sweeper"
},
"common": {
"k8s_yaml_file": "/home/harvesteruser/atlas_job.yaml",
"k8s_config_file": "/home/harvesteruser/.kube/config",
"k8s_namespace": "default"
}
},
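As we understand cpuAdjustRatio and memoryAdjustRatio, they simply scale the resource requests derived from the PanDA job before the pod is created; the arithmetic sketch below is illustrative only (the exact rounding is up to the submitter implementation):
# illustrative arithmetic for cpuAdjustRatio / memoryAdjustRatio (values are examples)
n_core = 8                   # cores requested by the job
ram_mb = 16000               # memory requested by the job, in MB
cpu_adjust_ratio = 90        # as in the ANALY_TAIWAN_TEST example above
memory_adjust_ratio = 100

cpu_request = n_core * cpu_adjust_ratio / 100            # -> 7.2 cores
memory_request = ram_mb * memory_adjust_ratio / 100      # -> 16000 MB (unchanged)
print(cpu_request, memory_request)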
Harvester now has a credential manager plugin k8s_secret_cred_manager to create/update a k8s secret object.
One can thus export proxy files into containers via a k8s secret, and configure Harvester to use k8s_secret_cred_manager to update the proxy periodically.
One needs to create a configuration file in JSON format for k8s_secret_credmanager. The important keys are k8s_namespace, k8s_config_file, and proxy_files.
Note that the proxy files listed in proxy_files must themselves be renewed periodically, e.g. by the Harvester no_voms_credmanager or in some other way, so that k8s_secret_credmanager can push the newest proxy files into the k8s secret.
Example of the config JSON file of k8s_secret_credmanager (/opt/harvester_k8s/k8s_secret_cred_manager_config.json, used as certFile in the panda_harvester.cfg example below):
{
    "k8s_namespace": "",
    "k8s_config_file": "/opt/harvester_k8s/kubeconf",
    "proxy_files": ["/data/atlpan/atlas.prod.proxy", "/data/atlpan/atlas.pilot.proxy"]
}
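For illustration (this is not the plugin's actual code), the sketch below shows the mechanism: read the local proxy files and store their content, base64-encoded, in a k8s secret named proxy-secret; the namespace "default" is an assumption:
# illustrative sketch of what a k8s-secret based credential manager does
import base64, os
from kubernetes import client, config

config.load_kube_config(config_file="/opt/harvester_k8s/kubeconf")
core_v1 = client.CoreV1Api()

proxy_files = ["/data/atlpan/atlas.prod.proxy", "/data/atlpan/atlas.pilot.proxy"]
data = {os.path.basename(p): base64.b64encode(open(p, "rb").read()).decode()
        for p in proxy_files}

secret = client.V1Secret(metadata=client.V1ObjectMeta(name="proxy-secret"), data=data)
try:
    core_v1.create_namespaced_secret(namespace="default", body=secret)   # namespace is an assumption
except client.rest.ApiException:
    # the secret already exists: patch it with the newest proxy content
    core_v1.patch_namespaced_secret(name="proxy-secret", namespace="default", body=secret)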
In panda_harvester.cfg, one needs to add the lines for k8s_secret_credmanager in the credmanager block.
Here moduleName is pandaharvester.harvestercredmanager.k8s_secret_cred_manager and className is K8sSecretCredManager.
Put the path of the k8s_secret_credmanager configuration file mentioned above in certFile.
Most other attributes are not used by k8s_secret_credmanager.
Example of the credmanager block in panda_harvester.cfg:
[credmanager]

# module name
moduleName =
 ...
 pandaharvester.harvestercredmanager.k8s_secret_cred_manager

# class name
className =
 ...
 K8sSecretCredManager

# original certificate file to generate new short-lived certificate
certFile =
 ...
 /opt/harvester_k8s/k8s_secret_cred_manager_config.json

# the name of short-lived certificate
outCertFile =
 ...
 useless_string

# voms
voms =
 ...
 useless_string

# sleep interval in sec
sleepTime = 1800
When using k8s_secret_cred_manager, a k8s secret object named proxy-secret will be created.
In the YAML file, one needs to add a secret volume with secretName: proxy-secret and mount it in the container with a proper mountPath.
Example YAML file of the k8s job:
apiVersion: batch/v1
kind: Job
metadata:
  name: atlasadc-job
spec:
  template:
    spec:
      containers:
      - name: atlas-grid-centos7
        image: atlasadc/atlas-grid-centos7
        volumeMounts:
        - name: atlas
          mountPath: /cvmfs/atlas.cern.ch
        - name: atlas-condb
          mountPath: /cvmfs/atlas-condb.cern.ch
        - name: atlas-nightlies
          mountPath: /cvmfs/atlas-nightlies.cern.ch
        - name: sft
          mountPath: /cvmfs/sft.cern.ch
        - name: grid
          mountPath: /cvmfs/grid.cern.ch
        - name: proxy-secret
          mountPath: /proxy
        imagePullPolicy: IfNotPresent
        #resources:
        #  requests:
        #    memory: "1.5Gi"
        command:
        - "sh"
        - "-c"
        - >
          curl -4 https://bootstrap.pypa.io/get-pip.py -o /root/get-pip.py;
          if [ -s /root/get-pip.py ]; then
          python /root/get-pip.py;
          pip install requests subprocess32;
          fi;
          echo -e "export computingSite=$computingSite\nexport pandaQueueName=$pandaQueueName\nexport resourceType=$resourceType\nexport proxySecretPath=$proxySecretPath\nexport proxyContent='$proxyContent'\nexport workerID=$workerID\nexport logs_frontend_w=$logs_frontend_w\nexport logs_frontend_r=$logs_frontend_r\n" > /etc/profile.d/job-setup.sh;
          groupadd -g 1308 atlasprd;
          useradd -u 41000 -g 1308 atlasprd;
          su - atlasprd -c "cd /home/atlasprd; curl -4 https://raw.githubusercontent.com/HSF/harvester/flin/pandaharvester/harvestercloud/k8s_startup_script.py -o /home/atlasprd/k8s_startup_script.py; python /home/atlasprd/k8s_startup_script.py";
        securityContext:
          allowPrivilegeEscalation: false
      restartPolicy: Never
      volumes:
      - name: atlas
        persistentVolumeClaim:
          claimName: csi-cvmfs-atlas-pvc
          readOnly: true
      - name: atlas-condb
        persistentVolumeClaim:
          claimName: csi-cvmfs-atlas-condb-pvc
          readOnly: true
      - name: atlas-nightlies
        persistentVolumeClaim:
          claimName: csi-cvmfs-atlas-nightlies-pvc
          readOnly: true
      - name: sft
        persistentVolumeClaim:
          claimName: csi-cvmfs-sft-pvc
          readOnly: true
      - name: grid
        persistentVolumeClaim:
          claimName: csi-cvmfs-grid-pvc
          readOnly: true
      - name: proxy-secret
        secret:
          secretName: proxy-secret
  backoffLimit: 0
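Once the credential manager has run, you can verify that the proxy-secret mounted at /proxy actually contains the expected proxy files, e.g. with kubectl describe secret proxy-secret or with the small sketch below (the default namespace is an assumption):
# illustrative check: list the keys stored in the proxy-secret secret
from kubernetes import client, config

config.load_kube_config(config_file="/opt/harvester_k8s/kubeconf")
core_v1 = client.CoreV1Api()
secret = core_v1.read_namespaced_secret("proxy-secret", "default")
print(list(secret.data.keys()))   # expect e.g. ['atlas.prod.proxy', 'atlas.pilot.proxy']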
In the submitter block of the queue configuration, one needs to add the proxySecretPath line.
Note that the value of proxySecretPath must be the proxy file path inside the container, i.e. it corresponds to the mountPath set up in the YAML and to the proxy_files defined in the configuration JSON of k8s_secret_cred_manager.
Example of the queue configuration JSON file:
"CERN-EXTENSION_K8S_HARVESTER": {
"queueStatus": "online",
"prodSourceLabel": "managed",
"nQueueLimitWorker": 100,
"maxWorkers": 1000,
"maxNewWorkersPerCycle": 30,
"runMode":"slave",
"mapType": "NoJob",
"truePilot": true,
"preparator": {
"name": "DummyPreparator",
"module": "pandaharvester.harvesterpreparator.dummy_preparator"
},
"submitter": {
"name":"K8sSubmitter",
"module":"pandaharvester.harvestersubmitter.k8s_submitter",
"proxySecretPath":"/proxy/atlas.prod.proxy",
"x509UserProxy": "/data/atlpan/x509up_u25606_production",
"cpuAdjustRatio": 100,
"memoryAdjustRatio": 100
},
"workerMaker": {
"name": "SimpleWorkerMaker",
"module": "pandaharvester.harvesterworkermaker.simple_worker_maker"
},
"messenger": {
"name": "SharedFileMessenger",
"module": "pandaharvester.harvestermessenger.shared_file_messenger",
"accessPoint": "/data/atlpan/harvester_wdirs/${harvesterID}/${_workerID_3.2}/${_workerID_1.0}/${workerID}"
},
"stager": {
"name": "DummyStager",
"module": "pandaharvester.harvesterstager.dummy_stager"
},
"monitor": {
"name":"K8sMonitor",
"module":"pandaharvester.harvestermonitor.k8s_monitor"
},
"sweeper": {
"name": "K8sSweeper",
"module": "pandaharvester.harvestersweeper.k8s_sweeper"
},
"common": {
"k8s_yaml_file": "/opt/harvester_k8s/k8s_atlas_job_prod_secret.yaml",
"k8s_config_file": "/opt/harvester_k8s/kubeconf",
"k8s_namespace": ""
}
}
The default K8s scheduling spreads pods across the nodes with a round-robin algorithm. This can cause single-core pods to spread across all nodes and prevent multi-core pods from being scheduled. You can define a custom scheduling policy; here is an example that worked for us:
- On the master node, define the policy file at /etc/kubernetes/scheduler-policy.json, including the priority strategy {"name" : "MostRequestedPriority", "weight" : 1}:
{
    "kind" : "Policy",
    "apiVersion" : "v1",
    "predicates" : [
        {"name" : "GeneralPredicates"},
        {"name" : "MatchInterPodAffinity"},
        {"name" : "NoDiskConflict"},
        {"name" : "NoVolumeZoneConflict"},
        {"name" : "PodToleratesNodeTaints"}
    ],
    "priorities" : [
        {"name" : "MostRequestedPriority", "weight" : 1},
        {"name" : "InterPodAffinityPriority", "weight" : 2}
    ]
}
- In /etc/kubernetes/scheduler, refer to the policy config file in KUBE_SCHEDULER_ARGS:
KUBE_SCHEDULER_ARGS="--leader-elect=true --policy-config-file /etc/kubernetes/scheduler-policy.json"
- Then restart the scheduler for the changes to take effect:
$ systemctl restart kube-scheduler.service
It seems that more recent k8s clusters, including those built with kubespray, deploy schedulers as pods rather than systemctl services. Here are instructions for deploying a pod to run a custom node-packing scheduler, and using the custom scheduler to schedule production jobs.
You'll need to build the kube-scheduler binary for your k8s version. You can check your k8s version as follows:
kubectl version
I get the following output, indicating that my k8s version is 1.14.3:
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", [...]
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", [...]
In order to build the k8s code, you'll need recent versions of Go and gcc. Install them if needed:
# Install gcc
yum -y install gcc
# Install go
wget https://dl.google.com/go/go1.12.7.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.12.7.linux-amd64.tar.gz
# Add /usr/local/go/bin to the PATH environment variable
export PATH=$PATH:/usr/local/go/bin
To clone and build the k8s code, run:
git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes
# If needed, check out the version of k8s that you're using
git checkout release-[your version (e.g. 1.13)]
make
Copy the kube-scheduler binary to the main repo level, then you can remove the (rather large...) kubernetes directory:
cd ..
cp kubernetes/_output/local/bin/linux/amd64/kube-scheduler .
rm -rf kubernetes
First, create a file named scheduler-policy.json (touch scheduler-policy.json) and fill it with the custom node-packing scheduler policy:
# scheduler-policy.json
{
    "kind" : "Policy",
    "apiVersion" : "v1",
    "predicates" : [
        {"name" : "GeneralPredicates"},
        {"name" : "MatchInterPodAffinity"},
        {"name" : "NoDiskConflict"},
        {"name" : "NoVolumeZoneConflict"},
        {"name" : "PodToleratesNodeTaints"}
    ],
    "priorities" : [
        {"name" : "MostRequestedPriority", "weight" : 1},
        {"name" : "InterPodAffinityPriority", "weight" : 2}
    ]
}
Create a Dockerfile (touch Dockerfile) with the following content:
# Dockerfile
FROM busybox
ADD kube-scheduler /usr/local/bin/kube-scheduler
ADD ./scheduler-policy.json /etc/kubernetes/scheduler-policy.json
Lastly, build the custom kube-scheduler image and push it to Docker Hub (assuming you have a Docker Hub account):
docker login -u [your-docker-username]
docker build -t [your-docker-username]/node-packing-scheduler . # Name it whatever you want, just be sure to include your docker username at the beginning
docker push [your-docker-username]/node-packing-scheduler
Create a file named node-packing-scheduler.yaml with the following content:
# node-packing-scheduler.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-packing-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-packing-scheduler-as-kube-scheduler
subjects:
- kind: ServiceAccount
  name: node-packing-scheduler
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: system:kube-scheduler
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: node-packing-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: node-packing-scheduler
      tolerations:
      - key: "key"
        operator: "Equal"
        value: "value"
        effect: "NoSchedule"
      containers:
      - command:
        - /usr/local/bin/kube-scheduler
        - --address=0.0.0.0
        - --leader-elect=false
        #- --lock-object-namespace=lock-object-namespace
        #- --lock-object-name=lock-object-name
        - --scheduler-name=node-packing-scheduler
        - --policy-config-file=/etc/kubernetes/scheduler-policy.json
        image: [your-docker-username]/node-packing-scheduler
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10251
          initialDelaySeconds: 15
        name: kube-second-scheduler
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10251
        resources:
          requests:
            cpu: '0.1'
        securityContext:
          privileged: false
        volumeMounts: []
      hostNetwork: false
      hostPID: false
      volumes: []
Now, create the custom kube-scheduler pod:
kubectl create -f node-packing-scheduler.yaml
Check that the scheduler pod is running:
kubectl get pods --namespace=kube-system
You should see something like:
NAME READY STATUS RESTARTS AGE
node-packing-scheduler-7df5697487-55n27 0/1 Running 0 5s
Edit the system:kube-scheduler cluster role:
kubectl edit clusterrole system:kube-scheduler
Add - node-packing-scheduler under kube-scheduler in resourceNames, and copy the following to the bottom of the file:
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  verbs:
  - watch
  - list
  - get
The scheduler can now be tested by creating a pod that gets scheduled by the custom scheduler. Create the scheduler_test_pod.yaml file (see line 8 in scheduler_test_pod.yaml, i.e. the schedulerName field, for the syntax that specifies jobs should use the custom scheduler) and run:
kubectl create -f scheduler_test_pod.yaml
After 30s or so, you should see it scheduled and running:
kubectl get pods
should include something like:
NAME READY STATUS RESTARTS AGE
...
annotation-second-scheduler 1/1 Running 0 30s
...
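To see whether the node-packing behaviour actually kicks in, you can check how the running pods are distributed over the nodes, e.g. with kubectl get pods -o wide or with the small Python sketch below (illustrative only):
# illustrative sketch: count running pods per node to verify the packing behaviour
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()
pods = core_v1.list_pod_for_all_namespaces(field_selector="status.phase=Running").items
print(Counter(pod.spec.node_name for pod in pods))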
If you run pilot2 on a Kubernetes PQ, please set
container_type="docker:wrapper"
in the AGIS PQ fields.
This means that the wrapper will be in charge of starting the pilot inside a Docker container.
Because the containers already use the user namespaces of the machines, we disabled Singularity on Kubernetes PQs.
Authored by FaHui Lin, MingJyuan Yang