Akemi

Automating Kubernetes alert inspection with a CronJob

2025/03/22

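Requirement: continuously watch the xx namespace of the production K8s cluster for abnormal events (OCI creation failures), persist the inspection logs, and raise an alert.

Toolchain: Helm + K8s + CronJob + shell + Feishu bot alerts

The shell script is mounted from a ConfigMap. Like the CronJob, it is written as a template and managed by Helm.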

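Helm

The Helm layout uses a parent/child chart structure. To keep things easy to follow, I only show the relevant parts.

I wrote this by hand and did not follow Helm's standard conventions. Helm best practice would route every related field through _helpers.tpl, but for a single CronJob that is unnecessary, and it will never be migrated anyway.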

template

{{- if and .Values.k8scheck .Values.k8scheck.enabled }}
apiVersion: batch/v1
kind: CronJob
metadata:
  name: {{ .Values.k8scheck.name | default "k8s-cluster-check" }}
spec:
  schedule: "{{ .Values.k8scheck.schedule }}"
  concurrencyPolicy: {{ .Values.k8scheck.concurrencyPolicy | default "Forbid" }}
  startingDeadlineSeconds: {{ .Values.k8scheck.startingDeadlineSeconds | default 60 }}
  successfulJobsHistoryLimit: {{ .Values.k8scheck.successfulJobsHistoryLimit | default 3 }}
  failedJobsHistoryLimit: {{ .Values.k8scheck.failedJobsHistoryLimit | default 5 }}
  jobTemplate:
    spec:
      template:
        spec:
          imagePullSecrets:
          {{- range .Values.global.common.imagePullSecrets }}
          - name: {{ .name }}
          {{- end }}
          containers:
          - name: {{ .Values.k8scheck.name }}
            image: {{ .Values.k8scheck.image.repository }}:{{ .Values.k8scheck.image.tag }}
            imagePullPolicy: {{ .Values.k8scheck.image.pullPolicy | default "IfNotPresent" }}
            {{- with .Values.k8scheck.env }}
            env:
            {{- toYaml . | nindent 14 }}
            {{- end }}
            command:
            - "/bin/sh"
            - "-c"
            - |
              {{- .Values.k8scheck.script | nindent 16 }}
            {{- with .Values.k8scheck.volumeMounts }}
            volumeMounts:
            {{- toYaml . | nindent 14 }}
            {{- end }}
            securityContext:
              runAsUser: 0 # run as root
              privileged: false
          {{- with .Values.k8scheck.volumes }}
          volumes:
          {{- toYaml . | nindent 12 }}
          {{- end }}
          {{- with .Values.k8scheck.nodeSelector }}
          nodeSelector:
            {{- toYaml . | nindent 12 }}
          {{- end }}
          restartPolicy: {{ .Values.k8scheck.restartPolicy | default "OnFailure" }}
{{- end }}

values.yaml

k8scheck:
  enabled: true
  name: k8s-cluster-check
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid # forbid concurrent runs
  startingDeadlineSeconds: 60
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 0
  image:
    repository: ad4-qa-registry-vpc.cn-hangzhou.cr.aliyuncs.com/ad4-qa/kubectl
    tag: "1.24.0"
    pullPolicy: IfNotPresent
  script: |
    LOG_FILE="/check-logs/check-$(date +%F).log"
    cp /scripts/check-script.sh /tmp/check-script.sh
    chmod +x /tmp/check-script.sh
    /tmp/check-script.sh --kubeconfig /kube/config --all -n ${NAMESPACE} >> "$LOG_FILE" 2>&1
    cat "$LOG_FILE"
  restartPolicy: OnFailure
  env:
    - name: MEM_LIMIT
      value: "60"
    - name: CPU_LIMIT
      value: "60"
    - name: NAMESPACE
      value: rg-biz
  nodeSelector:
    kubernetes.io/hostname: cn-hangzhou.xxx
  volumeMounts:
    - name: kube-config
      mountPath: /kube
    - name: zoneinfo
      mountPath: /usr/share/zoneinfo/Asia/Shanghai
      readOnly: true
    - name: script-volume
      mountPath: /scripts
    - name: logs
      mountPath: /check-logs
  volumes:
    - name: kube-config
      configMap:
        name: kube-config
    - name: zoneinfo
      hostPath:
        path: /usr/share/zoneinfo/Asia/Shanghai
    - name: script-volume
      configMap:
        name: k8s-cluster-check-script
        defaultMode: 0744
    - name: logs
      hostPath:
        path: /k8s-check-log/
        type: DirectoryOrCreate
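
With a five-minute schedule, checking a change to the chart means waiting for the next tick. One way around that, as a rough sketch rather than part of my setup, is to create a one-off Job from the CronJob. The function name run_check_now and the "-manual-" job suffix are made up for illustration, and the namespace the release is installed in is passed as an argument because it depends on your deployment:

```shell
# Sketch only: trigger the inspection once, outside the schedule.
# Assumes kubectl already points at the right cluster.
run_check_now() {
  ns="$1"                               # namespace the CronJob lives in (assumption: passed by caller)
  cronjob="${2:-k8s-cluster-check}"     # default matches k8scheck.name in values.yaml
  job="${cronjob}-manual-$(date +%s)"   # hypothetical naming scheme for the one-off Job

  # Create the Job from the CronJob's jobTemplate, wait for it, then print its log.
  kubectl -n "$ns" create job "$job" --from="cronjob/${cronjob}" &&
    kubectl -n "$ns" wait --for=condition=complete --timeout=120s "job/${job}" &&
    kubectl -n "$ns" logs "job/${job}"
}
```

Calling run_check_now with the release namespace runs the same container, script and volumes as a scheduled run, so the day's log file under /check-logs gets the output appended as usual.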

Script

This is also managed by Helm.

The script is fairly long, a few hundred lines, so I put it on OSS:

https://ws-blog-img.oss-cn-hangzhou.aliyuncs.com/wangsheng/k8s-check-configmap.yaml

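Notes:
1. My CronJob runs every 5 minutes, which matches the time window the script checks.
The script uses awk to pick out abnormal events whose age shows a number of 5 or less before the "m".

2. The request body for the Feishu bot is built with jq, which generates the JSON fields.

3. The node selector kubernetes.io/hostname: cn-hangzhou.xxx pins the pod to a single node.
The pod also runs as root so that it can read and write on that node, which is how the logs are persisted.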
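
Notes 1 and 2 can be sketched roughly like this. It is not the script from the OSS link, only a minimal illustration under a few assumptions: the input is the default table printed by kubectl get events, with the age in the first column; ages shown in seconds only are treated as recent; the payload is the plain text message format of a Feishu custom bot. The names recent_events, feishu_payload, send_alert and the FEISHU_WEBHOOK variable are hypothetical.

```shell
# Keep only events whose first column (age) is inside the window.
# $1: window in minutes (default 5); stdin: `kubectl get events` output.
recent_events() {
  awk -v max="${1:-5}" '
    # Ages such as "45s": seconds only, so always inside the window (assumption).
    $1 ~ /^[0-9]+s$/ { print; next }
    # Ages such as "3m" or "3m12s": compare the number before "m".
    # The header row and ages like "2h5m" match neither pattern and are dropped.
    $1 ~ /^[0-9]+m([0-9]+s)?$/ {
      split($1, parts, "m")
      if (parts[1] + 0 <= max) print
    }
  '
}

# Build the JSON body with jq so quotes and newlines in the text are escaped.
# $1: alert text.
feishu_payload() {
  jq -n --arg text "$1" '{msg_type: "text", content: {text: $text}}'
}

# POST the body to the bot. FEISHU_WEBHOOK is a hypothetical variable
# holding the webhook URL of your bot.
send_alert() {
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "$(feishu_payload "$1")" "$FEISHU_WEBHOOK"
}

# Possible wiring (not executed here):
#   events=$(kubectl --kubeconfig /kube/config get events -n "$NAMESPACE" | recent_events 5)
#   [ -n "$events" ] && send_alert "$events"
```

Because only the number before the "m" is compared, an event aged 5m30s still passes the filter. Together with the 5-minute schedule this leaves a small overlap between runs, so an event can occasionally be reported twice but is unlikely to be missed.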
