Kafka Broker Pod Failure Experiment Details
Experiment Metadata
Type | Description | Kafka Distribution | Tested K8s Platform |
---|---|---|---|
Kafka | Fail kafka leader-broker pods | Confluent, Kudo-Kafka | AWS Konvoy, GKE, EKS |
Prerequisites
- Ensure that the Litmus Chaos Operator is running by executing `kubectl get pods` in the operator namespace (typically, `litmus`). If not, install from here
- Ensure that the `kafka-broker-pod-failure` experiment resource is available in the cluster by executing `kubectl get chaosexperiments` in the desired namespace. If not, install from here
- Ensure that Kafka & Zookeeper are deployed as Statefulsets
- If the Confluent/Kudo Operators have been used to deploy Kafka, note the instance name, which will be used as the value of the `KAFKA_INSTANCE_NAME` experiment environment variable
  - In case of Confluent, specified by the `--name` flag
  - In case of Kudo, specified by the `--instance` flag

  Zookeeper uses this to construct a path in which the Kafka cluster data is stored.
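For reference, the prerequisite checks above can be run as shown below; the `litmus` and `default` namespaces are assumptions based on a typical setup, so adjust them to your environment.

# verify the Litmus Chaos Operator is running (assumes the 'litmus' namespace)
kubectl get pods -n litmus

# verify the kafka-broker-pod-failure ChaosExperiment CR is available (assumes the 'default' namespace)
kubectl get chaosexperiments -n default | grep kafka-broker-pod-failure

# verify Kafka & Zookeeper are deployed as Statefulsets
kubectl get statefulsets -n default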
Entry Criteria
- Kafka Cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy
Exit Criteria
- Kafka Cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy
- Kafka Message stream (if enabled) is unbroken
Details
- Causes (forced/graceful) pod failure of specific/random Kafka broker pods
- Tests deployment sanity (replica availability & uninterrupted service) and recovery workflows of the Kafka cluster
- Tests for an unbroken message stream when the `KAFKA_LIVENESS_STREAM` experiment environment variable is set to `enabled`
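To sanity-check message flow manually (outside of the built-in liveness stream), you can run the standard Kafka console clients against the broker service. The sketch below uses the service and port from the sample manifest further down; the topic name litmus-test is an illustrative placeholder, and older client builds use --broker-list in place of --bootstrap-server for the producer.

# produce a few test messages against the Kafka service
kafka-console-producer --bootstrap-server kafka-cp-kafka-headless:9092 --topic litmus-test

# consume them from another shell to confirm the stream stays unbroken during chaos
kafka-console-consumer --bootstrap-server kafka-cp-kafka-headless:9092 --topic litmus-test --from-beginning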
Integrations
- Pod failures can be effected using one of these chaos libraries: `litmus`, `powerfulseal`
- The desired chaos library can be selected by setting one of the above options as the value of the `LIB` environment variable
Steps to Execute the Chaos Experiment
This Chaos Experiment can be triggered by creating a ChaosEngine resource on the cluster. To understand the values to provide in a ChaosEngine specification, refer to Getting Started.
Follow the steps in the sections below to create the chaosServiceAccount, prepare the ChaosEngine & execute the experiment.
Prepare chaosServiceAccount
- Use this sample RBAC manifest to create a chaosServiceAccount in the desired (app) namespace. This example consists of the minimum necessary role permissions to execute the experiment.
Sample RBAC Manifest
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kafka-broker-pod-failure-sa
  namespace: default
  labels:
    name: kafka-broker-pod-failure-sa
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: kafka-broker-pod-failure-sa
  labels:
    name: kafka-broker-pod-failure-sa
rules:
- apiGroups: ["","litmuschaos.io","batch","apps"]
  resources: ["pods","deployments","pods/log","events","jobs","pods/exec","statefulsets","configmaps","chaosengines","chaosexperiments","chaosresults"]
  verbs: ["create","list","get","patch","delete"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get","list"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: kafka-broker-pod-failure-sa
  labels:
    name: kafka-broker-pod-failure-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kafka-broker-pod-failure-sa
subjects:
- kind: ServiceAccount
  name: kafka-broker-pod-failure-sa
  namespace: default
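Once saved to a file (the name rbac.yaml below is just an example), apply the manifest to create the service account, role, and binding:

kubectl apply -f rbac.yaml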
Prepare ChaosEngine
- Provide the application info in `spec.appinfo`
- Provide the experiment tunables. While many tunables have default values specified in the ChaosExperiment CR, some need to be explicitly supplied in `experiments.spec.components.env`.
- To understand the values to be provided in a ChaosEngine specification, refer to ChaosEngine Concepts
Supported Experiment Tunables
Parameter | Description | Specify In ChaosEngine | Notes |
---|---|---|---|
KAFKA_NAMESPACE | Namespace of Kafka Brokers | Mandatory | May be same as value for `spec.appinfo.appns` |
KAFKA_LABEL | Unique label of Kafka Brokers | Mandatory | May be same as value for `spec.appinfo.applabel` |
KAFKA_SERVICE | Headless service of the Kafka Statefulset | Mandatory | |
KAFKA_PORT | Port of the Kafka ClusterIP service | Mandatory | |
ZOOKEEPER_NAMESPACE | Namespace of Zookeeper Cluster | Mandatory | May be same as value for KAFKA_NAMESPACE or other |
ZOOKEEPER_LABEL | Unique label of the Zookeeper statefulset | Mandatory | |
ZOOKEEPER_SERVICE | Headless service of the Zookeeper Statefulset | Mandatory | |
ZOOKEEPER_PORT | Port of the Zookeeper ClusterIP service | Mandatory | |
KAFKA_BROKER | Kafka broker pod (name) to be deleted | Optional | A target selection mode (random/liveness-based/specific) |
KAFKA_KIND | Kafka deployment type | Optional | Same as `spec.appinfo.appkind`. Supported: `statefulset` |
KAFKA_LIVENESS_STREAM | Kafka liveness message stream | Optional | Supported: `enabled`, `disabled` |
KAFKA_LIVENESS_IMAGE | Image used for liveness message stream | Optional | Image as `<registry>/<repo>:<tag>` |
KAFKA_REPLICATION_FACTOR | Number of partition replicas for liveness topic partition | Optional | Necessary if KAFKA_LIVENESS_STREAM is `enabled` |
KAFKA_INSTANCE_NAME | Name of the Kafka chroot path on zookeeper | Optional | Necessary if installation involves use of such path |
KAFKA_CONSUMER_TIMEOUT | Kafka consumer message timeout, post which it terminates | Optional | Defaults to 30000ms, Recommended timeout for EKS platform: 60000 ms |
TOTAL_CHAOS_DURATION | The time duration for chaos insertion (seconds) | Optional | Defaults to 15s |
CHAOS_INTERVAL | Time interval b/w two successive broker failures (sec) | Optional | Defaults to 5s |
KILL_COUNT | No. of application pods to be deleted | Optional | Defaults to `1`. A kill count > 1 is supported only by the `litmus` lib, not by `powerfulseal` |
LIB | The chaos lib used to inject the chaos | Optional | Defaults to `litmus`. Supported: `litmus`, `powerfulseal` |
INSTANCE_ID | A user-defined string that holds metadata/info about current run/instance of chaos. Ex: 04-05-2020-9-00. This string is appended as suffix in the chaosresult CR name. | Optional | Ensure that the overall length of the chaosresult CR is still < 64 characters |
Sample ChaosEngine Manifest
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: kafka-chaos
  namespace: default
spec:
  # It can be true/false
  annotationCheck: 'true'
  # It can be active/stop
  engineState: 'active'
  # ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ''
  appinfo:
    appns: 'default'
    applabel: 'app=cp-kafka'
    appkind: 'statefulset'
  chaosServiceAccount: kafka-broker-pod-failure-sa
  monitoring: false
  # It can be delete/retain
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: kafka-broker-pod-failure
      spec:
        components:
          env:
            # choose based on available kafka broker replicas
            - name: KAFKA_REPLICATION_FACTOR
              value: '3'
            # get via 'kubectl get pods --show-labels -n <kafka-namespace>'
            - name: KAFKA_LABEL
              value: 'app=cp-kafka'
            - name: KAFKA_NAMESPACE
              value: 'default'
            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_SERVICE
              value: 'kafka-cp-kafka-headless'
            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_PORT
              value: '9092'
            # Recommended timeout for EKS platform: 60000 ms
            - name: KAFKA_CONSUMER_TIMEOUT
              value: '30000' # in milliseconds
            # ensure to set the instance name if using the KUDO operator
            - name: KAFKA_INSTANCE_NAME
              value: ''
            - name: ZOOKEEPER_NAMESPACE
              value: 'default'
            # get via 'kubectl get pods --show-labels -n <zk-namespace>'
            - name: ZOOKEEPER_LABEL
              value: 'app=cp-zookeeper'
            # get via 'kubectl get svc -n <zk-namespace>'
            - name: ZOOKEEPER_SERVICE
              value: 'kafka-cp-zookeeper-headless'
            # get via 'kubectl get svc -n <zk-namespace>'
            - name: ZOOKEEPER_PORT
              value: '2181'
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            # set chaos interval (in sec) as desired
            - name: CHAOS_INTERVAL
              value: '20'
            # pod failures without '--force' & default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'
Create the ChaosEngine Resource
Apply the ChaosEngine manifest prepared in the previous step to trigger the chaos.
kubectl apply -f chaosengine.yml
If the chaos experiment is not executed, refer to the troubleshooting section to identify the root cause and fix the issues.
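Before digging into troubleshooting, you can confirm that the operator has picked up the engine by inspecting its status and events (name and namespace taken from the sample manifest above):

kubectl describe chaosengine kafka-chaos -n default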
Watch Chaos progress
View pod terminations & recovery by setting up a watch on the pods in the Kafka namespace
watch -n 1 kubectl get pods -n <kafka-namespace>
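It can also be useful to watch the chaos-runner and experiment job pods that Litmus creates. These typically run in the ChaosEngine's namespace, and the runner pod is usually named <chaosengine-name>-runner (e.g. kafka-chaos-runner for the sample above):

watch -n 1 kubectl get pods -n default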
Abort/Restart the Chaos Experiment
To stop the kafka-broker-pod-failure experiment immediately, either delete the ChaosEngine resource or execute the following command:
kubectl patch chaosengine <chaosengine-name> -n <namespace> --type merge --patch '{"spec":{"engineState":"stop"}}'
To restart the experiment, either re-apply the ChaosEngine YAML or execute the following command:
kubectl patch chaosengine <chaosengine-name> -n <namespace> --type merge --patch '{"spec":{"engineState":"active"}}'
Check Chaos Experiment Result
Check whether the Kafka deployment is resilient to the broker pod failure once the experiment (job) has completed. The ChaosResult resource name is derived as <ChaosEngine-Name>-<ChaosExperiment-Name>.
kubectl describe chaosresult kafka-chaos-kafka-broker-pod-failure -n <kafka-namespace>
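To check only the experiment verdict (Pass/Fail) rather than the full description, dump the ChaosResult and filter for the verdict field; the exact field path varies slightly across Litmus versions, so a simple grep is the most portable option:

kubectl get chaosresult kafka-chaos-kafka-broker-pod-failure -n <kafka-namespace> -o yaml | grep -i verdict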
Kafka Broker Pod Failure Demo
- TODO: A sample recording of this experiment execution is provided here.