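Requirement: monitor, in near real time, abnormal events in the xx namespace of a production K8s cluster (specifically OCI creation failures), persist the inspection logs, and send alerts.
Toolchain: Helm + K8s + CronJob + shell + Feishu bot alerts
The shell script is mounted through a ConfigMap. Like the CronJob, it is written as a template and managed by Helm.
Helm
The Helm layout is a parent/child chart structure. To keep things easy to follow, I only show the relevant parts.
I wrote this by hand and did not follow Helm conventions. Helm best practice for a feature like this would mean adding the corresponding fields to the _helpers.tpl definitions and so on. For a single CronJob that is unnecessary, and it is never going to be migrated anyway.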
template
{{- if and .Values.k8scheck .Values.k8scheck.enabled }}
apiVersion: batch/v1
kind: CronJob
metadata:
  name: {{ .Values.k8scheck.name | default "k8s-cluster-check" }}
spec:
  schedule: "{{ .Values.k8scheck.schedule }}"
  concurrencyPolicy: {{ .Values.k8scheck.concurrencyPolicy | default "Forbid" }}
  startingDeadlineSeconds: {{ .Values.k8scheck.startingDeadlineSeconds | default 60 }}
  successfulJobsHistoryLimit: {{ .Values.k8scheck.successfulJobsHistoryLimit | default 3 }}
  failedJobsHistoryLimit: {{ .Values.k8scheck.failedJobsHistoryLimit | default 5 }}
  jobTemplate:
    spec:
      template:
        spec:
          imagePullSecrets:
            {{- range .Values.global.common.imagePullSecrets }}
            - name: {{ .name }}
            {{- end }}
          containers:
          - name: {{ .Values.k8scheck.name }}
            image: {{ .Values.k8scheck.image.repository }}:{{ .Values.k8scheck.image.tag }}
            imagePullPolicy: {{ .Values.k8scheck.image.pullPolicy | default "IfNotPresent" }}
            {{- with .Values.k8scheck.env }}
            env:
              {{- toYaml . | nindent 14 }}
            {{- end }}
            command:
              - "/bin/sh"
              - "-c"
              - |
                {{- .Values.k8scheck.script | nindent 16 }}
            {{- with .Values.k8scheck.volumeMounts }}
            volumeMounts:
              {{- toYaml . | nindent 14 }}
            {{- end }}
            securityContext:
              runAsUser: 0
              privileged: false
          {{- with .Values.k8scheck.volumes }}
          volumes:
            {{- toYaml . | nindent 12 }}
          {{- end }}
          {{- with .Values.k8scheck.nodeSelector }}
          nodeSelector:
            {{- toYaml . | nindent 12 }}
          {{- end }}
          restartPolicy: {{ .Values.k8scheck.restartPolicy | default "OnFailure" }}
{{- end }}
values.yaml
k8scheck:
  enabled: true
  name: k8s-cluster-check
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 60
  successfulJobsHistoryLimit: 1
  # Note: the template pipes this value through `default 5`, and Helm's
  # `default` treats 0 as empty, so a 0 here renders as 5.
  failedJobsHistoryLimit: 0
  image:
    repository: ad4-qa-registry-vpc.cn-hangzhou.cr.aliyuncs.com/ad4-qa/kubectl
    tag: "1.24.0"
    pullPolicy: IfNotPresent
  script: |
    LOG_FILE="/check-logs/check-$(date +%F).log"
    cp /scripts/check-script.sh /tmp/check-script.sh
    chmod +x /tmp/check-script.sh
    /tmp/check-script.sh --kubeconfig /kube/config --all -n ${NAMESPACE} >> "$LOG_FILE" 2>&1
    cat "$LOG_FILE"
  restartPolicy: OnFailure
  env:
    - name: MEM_LIMIT
      value: "60"
    - name: CPU_LIMIT
      value: "60"
    - name: NAMESPACE
      value: rg-biz
  nodeSelector:
    kubernetes.io/hostname: cn-hangzhou.xxx
  volumeMounts:
    - name: kube-config
      mountPath: /kube
    - name: zoneinfo
      mountPath: /usr/share/zoneinfo/Asia/Shanghai
      readOnly: true
    - name: script-volume
      mountPath: /scripts
    - name: logs
      mountPath: /check-logs
  volumes:
    - name: kube-config
      configMap:
        name: kube-config
    - name: zoneinfo
      hostPath:
        path: /usr/share/zoneinfo/Asia/Shanghai
    - name: script-volume
      configMap:
        name: k8s-cluster-check-script
        defaultMode: 0744
    - name: logs
      hostPath:
        path: /k8s-check-log/
        type: DirectoryOrCreate
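With a `*/5 * * * *` schedule it can take a few minutes to see whether a change to the values works. One way to avoid waiting, sketched here as an assumption rather than part of the original setup, is to create a one-off Job from the CronJob with `kubectl create job --from`. The function name `trigger_k8scheck` and the namespace argument are hypothetical; pass whatever namespace the release is installed in.

```shell
#!/bin/sh
# Sketch only: trigger_k8scheck is a hypothetical helper, not part of the chart.
# $1 = namespace the CronJob is deployed in (required)
# $2 = CronJob name (defaults to the name used in values.yaml)
trigger_k8scheck() {
  ns="$1"
  name="${2:-k8s-cluster-check}"
  # Create a one-off Job from the CronJob's jobTemplate; the timestamp suffix
  # keeps repeated manual runs from colliding on the Job name.
  kubectl -n "$ns" create job "${name}-manual-$(date +%s)" --from="cronjob/${name}"
}
```

After the Job finishes, the run appends to the same daily file under `/k8s-check-log/` on the pinned node, so the result can be read there or from the pod logs.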
Script
It is also managed by Helm.
The script is fairly long, a few hundred lines, so I put it on OSS:
https://ws-blog-img.oss-cn-hangzhou.aliyuncs.com/wangsheng/k8s-check-configmap.yaml
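Notes:
1. My CronJob runs every 5 minutes, which matches the time window the script checks.
The script uses awk to pick out events whose age has a number before the `m` that is less than or equal to 5, and treats those as the current abnormal events.
2. The request body for the Feishu bot is built with jq, which generates the JSON fields.
3. The node selector kubernetes.io/hostname: cn-hangzhou.xxx pins the pod to a single node.
The container also runs as root so that it has read/write access on that node, which is what makes the log persistence work.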
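Notes 1 and 2 can be illustrated with a minimal sketch. This is not the actual script (that is the file at the OSS link above); the function names `recent_events` and `feishu_payload`, the `FEISHU_WEBHOOK` variable, and the match string "OCI runtime create failed" are assumptions made for illustration. It assumes the age is the first column, as in the default `kubectl get events` output, and that the bot accepts the usual text message body with `msg_type` and `content.text`.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not the author's script.

# Keep event lines whose first column (the age) is within MAX_MIN minutes.
# $1 = MAX_MIN (default 5), $2 = optional fixed substring the line must contain.
recent_events() {
  max_min="${1:-5}"
  pattern="${2:-}"
  awk -v max="$max_min" -v pat="$pattern" '
    function wanted() { return pat == "" || index($0, pat) > 0 }
    # Ages such as "45s": compare the seconds against the window.
    $1 ~ /^[0-9]+s$/           { if (($1 + 0) <= max * 60 && wanted()) print; next }
    # Ages such as "3m" or "4m12s": "$1 + 0" yields the number before "m".
    $1 ~ /^[0-9]+m([0-9]+s)?$/ { if (($1 + 0) <= max + 0 && wanted()) print; next }
    # Everything else (header line, hours, days) is dropped.
  '
}

# Build a text message body with jq so the event text is JSON-escaped safely.
feishu_payload() {
  jq -n --arg text "$1" '{msg_type: "text", content: {text: $text}}'
}

# Example wiring (FEISHU_WEBHOOK is a placeholder for the bot webhook URL):
#   hits=$(kubectl get events -n "$NAMESPACE" | recent_events 5 "OCI runtime create failed")
#   [ -n "$hits" ] && curl -sS -X POST -H 'Content-Type: application/json' \
#       -d "$(feishu_payload "$hits")" "$FEISHU_WEBHOOK"
```

Building the body with jq rather than string concatenation matters because event messages often contain quotes and newlines, which would otherwise produce invalid JSON.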