Installation and Troubleshooting of LitmusChaos on Rancher
Pre-requisites
- Rancher 2.4.3 or later
- Kubernetes 1.15 or later
Getting Started
Running chaos on your application involves the following steps:
NOTE:
This guide describes the steps to inject container-kill chaos on an nginx application already deployed in the nginx namespace. It is a mandatory requirement to ensure that the chaos custom resources (chaosexperiment and chaosengine) and the experiment specific serviceaccount are created in the same namespace (typically, the same as the namespace of the Application Under Test (AUT), in this case nginx). This is done to ensure that the developers/users of the experiment isolate the chaos to their respective work-namespaces in shared environments.
In all subsequent steps, please follow these instructions by replacing the nginx namespace and labels with that of your application.
Creating the nginx Pod for testing
Deplying nginx process:
$ kubectl create namespace nginx
namespace/nginx created
$ kubectl create deployment nginx --image=nginx -n nginx
deployment.apps/nginx created
$ kubectl create service nodeport nginx --tcp=80:80 -n nginx
service/nginx created
$ kubectl get svc -n nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx NodePort 10.x.x.x <none> 80:31281/TCP 0s
The nginx default web site should be available now.
Install Litmus
Followed the steps in the Getting Started Guide* to install litmus in a nginx
namespace with an nginx application.
Download litmus-operator-v1.13.3.yaml
from https://litmuschaos.github.io/litmus/litmus-operator-v1.13.3.yaml.
Modify it to use the nginx
namespace in three places (at lines 10, 41, and 47 approximately).
Install the litmus-operator in nginx
application namespace using kubectl.
$ kubectl apply -f litmus-operator.yaml -n nginx
Expected output:
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
namespace/nginx configured
serviceaccount/litmus created
clusterrole.rbac.authorization.k8s.io/litmus created
clusterrolebinding.rbac.authorization.k8s.io/litmus created
deployment.apps/chaos-operator-ce created
customresourcedefinition.apiextensions.k8s.io/chaosengines.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosexperiments.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosresults.litmuschaos.io created
The above command install all the CRDs, required service account configuration, and chaos-operator.
Verify your installation Before you start running a chaos experiment, verify if Litmus is installed correctly.
- Verify if the ChaosOperator is running
$ kubectl get pods -n nginx
Expected output:
NAME READY STATUS RESTARTS AGE
chaos-operator-ce-76d5d844f7-ptr2m 1/1 Running 0 21s
nginx-86c57db685-6lqbw 1/1 Running 0 58s
- Verify if chaos CRDs are installed
$ kubectl get crds | grep chaos
Expected output:
chaosengines.litmuschaos.io 2020-06-24T17:37:01Z
chaosexperiments.litmuschaos.io 2020-06-24T17:37:01Z
chaosresults.litmuschaos.io 2020-06-24T17:37:01Z
Verify if the chaos API resources are successfully created in the desired (application) namespace.
Note: Sometimes, it may take few seconds for the resources to be available post the CRD installation
$ kubectl api-resources | grep chaos
Expected output:
chaosengines litmuschaos.io true ChaosEngine
chaosexperiments litmuschaos.io true ChaosExperiment
chaosresults litmuschaos.io true ChaosResult
Install Chaos Experiments
Chaos experiments contain the actual chaos details. These experiments are installed on your cluster as Kubernetes Custom Resources (CRs). The Chaos Experiments are grouped as Chaos Charts and are published on Chaos Hub.
The generic chaos experiments such as pod-delete
, container-kill
,pod-network-latency
are available under Generic Chaos Chart.
This is the first chart you are recommended to install.
$ kubectl apply -f https://hub.litmuschaos.io/api/chaos/master?file=charts/generic/experiments.yaml -n nginx
Expected output:
chaosexperiment.litmuschaos.io/node-drain created
chaosexperiment.litmuschaos.io/disk-fill created
chaosexperiment.litmuschaos.io/pod-cpu-hog created
chaosexperiment.litmuschaos.io/pod-memory-hog created
chaosexperiment.litmuschaos.io/pod-network-corruption created
chaosexperiment.litmuschaos.io/pod-delete created
chaosexperiment.litmuschaos.io/pod-network-loss created
chaosexperiment.litmuschaos.io/disk-loss created
chaosexperiment.litmuschaos.io/pod-network-latency created
chaosexperiment.litmuschaos.io/node-cpu-hog created
chaosexperiment.litmuschaos.io/kubelet-service-kill created
chaosexperiment.litmuschaos.io/node-memory-hog created
chaosexperiment.litmuschaos.io/container-kill created
Verify if the chaos experiments are installed.
$ kubectl get chaosexperiments -n nginx
Expected output:
NAME AGE
container-kill 0s
disk-fill 4s
disk-loss 2s
kubelet-service-kill 1s
node-cpu-hog 1s
node-drain 4s
node-memory-hog 0s
pod-cpu-hog 3s
pod-delete 2s
pod-memory-hog 3s
pod-network-corruption 3s
pod-network-latency 1s
pod-network-loss 2s
Setup Service Account
A service account should be created to allow chaosengine to run experiments in your application namespace. Copy the following
into a rbac.yaml
manifest and run kubectl apply -f rbac.yaml
to create one such account on the nginx
namespace. This service account
has just enough permissions needed to run the container-kill chaos experiment.
NOTE:
- For rbac samples corresponding to other experiments such as, say, pod-delete, please refer the respective experiment folder in the chaos-charts repository.
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: container-kill-sa
namespace: nginx
labels:
name: container-kill-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: container-kill-sa
namespace: nginx
labels:
name: container-kill-sa
rules:
- apiGroups: [""]
resources: ["pods","events"]
verbs: ["create","list","get","patch","update","delete","deletecollection"]
- apiGroups: [""]
resources: ["pods/exec","pods/log","replicationcontrollers"]
verbs: ["list","get","create"]
- apiGroups: ["batch"]
resources: ["jobs"]
verbs: ["create","list","get","delete","deletecollection"]
- apiGroups: ["apps"]
resources: ["deployments","statefulsets","daemonsets","replicasets"]
verbs: ["list","get"]
- apiGroups: ["apps.openshift.io"]
resources: ["deploymentconfigs"]
verbs: ["list","get"]
- apiGroups: ["argoproj.io"]
resources: ["rollouts"]
verbs: ["list","get"]
- apiGroups: ["litmuschaos.io"]
resources: ["chaosengines","chaosexperiments","chaosresults"]
verbs: ["create","list","get","patch","update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: container-kill-sa
namespace: nginx
labels:
name: container-kill-sa
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: container-kill-sa
subjects:
- kind: ServiceAccount
name: container-kill-sa
namespace: nginx
Annotate your application
Your application has to be annotated with litmuschaos.io/chaos="true"
. As a security measure, and also as a means
to reduce blast radius the ChaosOperator checks for this annotation before invoking chaos experiment(s) on the application.
Replace nginx
with the name of your deployment.
$ kubectl annotate deploy/nginx litmuschaos.io/chaos="true" -n nginx
Expected output:
deployment.apps/nginx annotated
Prepare ChaosEngine
ChaosEngine connects the application instance to a Chaos Experiment. Copy the following YAML snippet into a file called
chaosengine.yaml
.
NOTE: You may update the values of applabel
, appns
, appkind
and experiments
as per your deployment and choices.
Change the chaosServiceAccount
to the name of service account created in above previous steps if you modified the rbac.yaml
.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: nginx-chaos
namespace: nginx
spec:
# It can be active/stop
engineState: 'active'
#ex. values: ns1:name=percona,ns2:run=nginx
auxiliaryAppInfo: ''
appinfo:
appns: 'nginx'
applabel: 'app=nginx'
appkind: 'deployment'
chaosServiceAccount: container-kill-sa
# It can be delete/retain
jobCleanUpPolicy: 'delete'
experiments:
- name: container-kill
spec:
components:
env:
# provide the total chaos duration
- name: TOTAL_CHAOS_DURATION
value: '20'
# provide the chaos interval
- name: CHAOS_INTERVAL
value: '10'
# provide the name of container runtime
# for litmus LIB, it supports docker, containerd, crio
# for pumba LIB, it supports docker only
- name: CONTAINER_RUNTIME
value: 'docker'
# provide the socket file path
- name: SOCKET_PATH
value: '/var/run/docker.sock'
- name: PODS_AFFECTED_PERC
value: ''
- name: TARGET_CONTAINER
value: ''
Override Default Chaos Experiments Variables
From LitmusChaos v1.1.0, the default environment variable values in chaosexperiments can be overridden by specifying
them in the chaosengine under experiments.<experiment_name>.spec.components.env
with the desired value. In the
example below, the TARGET_CONTAINER is being set to a desired value based on the application instance.
...
experiments:
- name: container-kill
spec:
components:
env:
- name: TARGET_CONTAINER
value: nginx
Run Chaos
$ kubectl apply -f chaosengine.yaml -n nginx
Expected output:
chaosengine.litmuschaos.io/nginx-chaos created
Observe Chaos results
Describe the ChaosResult CR to know the status of each experiment. The spec.verdict
is set to Awaited
when the experiment is in progress, eventually changing to either Pass
or Fail
.
NOTE: ChaosResult CR name will be <chaos-engine-name>-<chaos-experiment-name>
$ kubectl describe chaosresult nginx-chaos-container-kill -n nginx
Expected output:
Name: nginx-chaos-container-kill
Namespace: nginx
Labels: chaosUID=e0f398da-27ad-43c3-b7be-88a7cde2b257
type=ChaosResult
Annotations: API Version: litmuschaos.io/v1alpha1
Kind: ChaosResult
Metadata:
Creation Timestamp: 2020-07-06T15:54:34Z
Generation: 4
Resource Version: 67650
Self Link: /apis/litmuschaos.io/v1alpha1/namespaces/nginx/chaosresults/nginx-chaos-container-kill
UID: 0c848944-f594-4213-ada0-48efb41a6ba1
Spec:
Engine: nginx-chaos
Experiment: container-kill
Status:
Experimentstatus:
Fail Step: N/A
Phase: Completed
Verdict: Pass
Events: <none>
Uninstallation
You can uninstall Litmus by deleting the namespace.
kubectl delete -f chaosengine.yaml -n nginx
kubectl delete -f rbac.yaml -n nginx
kubectl delete -f https://hub.litmuschaos.io/api/chaos/master?file=charts/generic/experiments.yaml -n nginx
kubectl delete -f litmus-operator.yaml -n nginx
Troubleshooting
Following the steps in the Getting Started Guide to install litmus in a nginx
namespace with an nginx deployment, you may do everything correctly and not get the container-kill
experiment to work. The tips below are to assist you in troubleshooting the issue.
The environment used below is Rancher running on AWS.When I ran the ChaosEngine using the content provide in the Getting Started Guide:
$ kubectl apply -f chaosengine.yaml -n nginx
chaosengine.litmuschaos.io/nginx-chaos created
Check to see the status of the nginx
namespace:
$ kubectl get all -n nginx
NAME READY STATUS RESTARTS AGE
pod/chaos-operator-ce-76d5d844f7-ptr2m 1/1 Running 0 7m56s
pod/container-kill-sfeyq0-r97db 1/1 Running 0 54s
pod/nginx-86c57db685-6lqbw 1/1 Running 0 8m37s
pod/nginx-chaos-runner 1/1 Running 0 63s
pod/pumba-sig-kill-uc7rlt 0/1 ContainerCreating 0 2s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/chaos-operator-metrics ClusterIP 10.43.184.74 <none> 8383/TCP 7m53s
service/nginx NodePort 10.43.201.219 <none> 80:31281/TCP 8m36s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/chaos-operator-ce 1/1 1 1 7m56s
deployment.apps/nginx 1/1 1 1 8m37s
NAME DESIRED CURRENT READY AGE
replicaset.apps/chaos-operator-ce-76d5d844f7 1 1 1 7m56s
replicaset.apps/nginx-86c57db685 1 1 1 8m37s
NAME COMPLETIONS DURATION AGE
job.batch/container-kill-sfeyq0 0/1 54s 54s
Everything looks good at this point. You can also look at the Rancher UI to drill down into the Workloads of nginx
namespace and see the pods.
The first time took a while to run but the Verdict was Fail
.
$ kubectl describe chaosresult nginx-chaos-container-kill -n nginx
Name: nginx-chaos-container-kill
Namespace: nginx
Labels: chaosUID=4920e2e0-e241-482a-8f96-c96fff74638a
type=ChaosResult
Annotations: API Version: litmuschaos.io/v1alpha1
Kind: ChaosResult
Metadata:
Creation Timestamp: 2020-06-24T17:44:43Z
Generation: 1
Resource Version: 82340
Self Link: /apis/litmuschaos.io/v1alpha1/namespaces/nginx/chaosresults/nginx-chaos-container-kill
UID: 8dc90e0d-fa0e-4b55-9a4d-1c7c3319175a
Spec:
Engine: nginx-chaos
Experiment: container-kill
Status:
Experimentstatus:
Phase: Running
Verdict: Fail
Events: <none>
Checking the status of the pods, either with kubectl or Rancher:
$ kubectl get pods -n nginx
NAME READY STATUS RESTARTS AGE
chaos-operator-ce-76d5d844f7-ptr2m 1/1 Running 0 8m46s
container-kill-sfeyq0-r97db 1/1 Running 0 104s
nginx-86c57db685-6lqbw 1/1 Running 0 9m27s
nginx-chaos-runner 1/1 Running 0 113s
pumba-sig-kill-uc7rlt 0/1 Error 0 52s
The Error status of the pumba
pod gives us place to focus our attention as we start troubleshooting. Checking the logs on the pumba
pod:
$ kubectl logs pumba-sig-kill-uc7rlt -n nginx
time="2020-06-24T17:44:58Z" level=error msg="failed to list containers" app=pumba error="Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/containers/json?limit=0: dial unix /var/run/docker.sock: connect: permission denied" function=github.com/sirupsen/logrus.WithError source="container/client.go:90"
time="2020-06-24T17:44:58Z" level=error msg="failed to list containers" app=pumba error="Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/containers/json?limit=0: dial unix /var/run/docker.sock: connect: permission denied" function=github.com/sirupsen/logrus.WithError source="docker/kill.go:86"
time="2020-06-24T17:44:58Z" level=error msg="failed to run chaos command" app=pumba error="Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/containers/json?limit=0: dial unix /var/run/docker.sock: connect: permission denied" function=github.com/sirupsen/logrus.WithError source="chaos/command.go:79"
time="2020-06-24T17:44:58Z" level=fatal msg="Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/containers/json?limit=0: dial unix /var/run/docker.sock: connect: permission denied" app=pumba function=main.main source="cmd/main.go:162"
A search of the internet and the pumba image website doesn't reveal any useful information. Next step is to find the /var/run/docker.sock
and the permissions on it that are causing the pumba
pod to fail. Working under the assumption that /var/run/docker.sock
is on the node where the pumba
pod is running and pumba is attempting to volumeMount the file, we need to connect to the worker node via SSH.
First determine which worker node the pod is runnning on:
kubectl describe pod pumba-sig-kill-uc7rlt -n nginx | grep Node:
Expected output
Node: <node_name>/x.y.z.212
After SSHing into the worker node, check the file permissions:
$ ls -l /var/run/docker.sock
srw-rw----. 1 root docker 0 Jun 30 04:44 /var/run/docker.sock
$ ls -Z /var/run/docker.sock
srw-rw----. root docker system_u:object_r:container_var_run_t:s0 /var/run/docker.sock
First fix is to try and change the regular file permissions to srw-rw-rw
on /var/run/docker.sock
.
To test, we delete the chaosengine
kubectl delete -f chaosengine.yaml -n nginx
To attempt the continer-kill
experiment again, apply chaosengine.yaml
again :
kubectl apply -f chaosengine.yaml -n nginx
The pumba
pod errored out with the same log entries. To set-up for later testing, delete the chaosengine
again.
kubectl delete -f chaosengine.yaml -n nginx
Now we need to get into the Selinux access controls on /var/run/docker.sock
NOTE: The logs for Selinux auditing are in /var/log/audit/audit.log
and require superuser privileges to review. To set Selinux policies also require superuser privileges. The rest of the steps in this section are run as root.
With elevated privileges, check monitored the audit log on both worker nodes for the file we having issues with docker.sock
:
$ grep docker.sock /var/log/audit/audit.log
The output was a number of lines that looked like this:
type=AVC msg=audit(1594053128.110:14301): avc: denied { connectto } for pid=15370 comm="pumba" path="/run/docker.sock" scontext=system_u:system_r:container_t:s0:c152,c861 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=0
This tells us that we really do have an Selinux access control issue to correct.
Selinux Troubleshooting process:
Note: The sample Litmus experiment is set-up with the exception of applying chaosengine.yaml.
- We need to log into all worker node the cluster where the
pumba
pod will try to run. This is required since we don't know on which worker node the pod will be created. Elevate privileges with sudo. - Checked the permissions on /var/run/docker.sock and /run/docker.sock:
$ ls -lZ /run/docker.sock /var/run/docker.sock
srw-rw----. root docker system_u:object_r:container_var_run_t:s0 /run/docker.sock
srw-rw----. root docker system_u:object_r:container_var_run_t:s0 /var/run/docker.sock
- Monitored the audit log on all worker nodes for the word
pumba
:
$ tail -f /var/log/audit/audit.log | grep pumba
- Applied
chaosengine.yaml
via kubectl in another terminal window. - Found an
avc: denied
message related todocker.sock
.
type=AVC msg=audit(1594053128.110:14301): avc: denied { connectto } for pid=15370 comm="pumba" path="/run/docker.sock" scontext=system_u:system_r:container_t:s0:c152,c861 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=0
Solution:
- Created a SeLinux policy for the error. (Ref: Red Hat SELinux User's and Administrator's Guide)
$ grep pumba /var/log/audit/audit.log | audit2allow -M pumba_pol
To make this policy package active, execute:
semodule -i pumba_pol.pp
- Applied the pumba selinux policy.
$ semodule -i pumba_pol.pp
Testing:
- Deleted
chaosengine.yaml
and then applied again. - This time no
avc: denied
message. - Ran the check on the litmus output and received
Pass
for the Verdict.
kubectl describe chaosresult nginx-chaos-container-kill -n nginx
Name: nginx-chaos-container-kill
Namespace: nginx
Labels: chaosUID=e0f398da-27ad-43c3-b7be-88a7cde2b257
type=ChaosResult
Annotations: API Version: litmuschaos.io/v1alpha1
Kind: ChaosResult
Metadata:
Creation Timestamp: 2020-07-06T15:54:34Z
Generation: 4
Resource Version: 67650
Self Link: /apis/litmuschaos.io/v1alpha1/namespaces/nginx/chaosresults/nginx-chaos-container-kill
UID: 0c848944-f594-4213-ada0-48efb41a6ba1
Spec:
Engine: nginx-chaos
Experiment: container-kill
Status:
Experimentstatus:
Fail Step: N/A
Phase: Completed
Verdict: Pass
Events: <none>
NOTE: The pumba_pol.pp
file is a binary policy file. The policy generation process also creates a type enforcement file (.te) that is a text representation of the policy. The contents of pumba_pol.te
:
module pumba_pol 1.0;
require {
type default_t;
type container_runtime_t;
type container_t;
class sock_file write;
class unix_stream_socket connectto;
}
#============= container_t ==============
#!!!! The file '/run/docker.sock' is mislabeled on your system.
#!!!! Fix with $ restorecon -R -v /run/docker.sock
allow container_t container_runtime_t:unix_stream_socket connectto;
#!!!! WARNING: 'default_t' is a base type.
allow container_t default_t:sock_file write;
More Chaos Experiments
- For more details on supported chaos experiments and the steps to run them, refer the Experiments section.
Join our community
If you have not joined our community, do join us here.