
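In a military AI compute cluster, Kubernetes priority scheduling ensures that critical tasks obtain resources first. Combined with KMS-encrypted persistent volumes (RWO with replication), SELinux/AppArmor sandboxing, and a disaster-tolerance design (multiple replicas, health checks, and HPA-based autoscaling), this forms a containerized deployment scheme that delivers high reliability and meets the strict security requirements of military environments.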
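
| Scheduling strategy | Definition | Characteristics | Use case |
|---|---|---|---|
| Default scheduling | Spreads Pods evenly across available resources | No priority | Ordinary non-critical tasks (e.g., test jobs) |
| Priority scheduling | Allocates resources according to task weight | Higher-priority tasks are scheduled first and may preempt lower-priority Pods | Core military AI training/inference tasks (e.g., core algorithms) |
| Node affinity/anti-affinity | Node selection rules | Affinity (pin to specific nodes); anti-affinity (avoid sharing a node) | Workloads that depend on specific hardware or must avoid resource contention (e.g., multi-container deployments) |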
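
Priority scheduling depends on a PriorityClass object that Pods reference by name. The following is a minimal sketch, assuming the class is called `high-priority` (the name the Deployment in this section references) and carries the weight of 1000 mentioned in the interview answer; the `description` text and the `preemptionPolicy` choice are illustrative assumptions.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000                              # weight quoted in the interview answer
globalDefault: false                     # only Pods that name this class receive it
preemptionPolicy: PreemptLowerPriority   # assumed: allow eviction of lower-priority Pods
description: "Critical military AI training and inference workloads."  # illustrative text
```

Pods that omit `priorityClassName` fall back to priority 0 unless some other class is marked `globalDefault: true`.
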
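
| Mechanism | Definition | Analogy | Caveats |
|---|---|---|---|
| cgroup | Caps CPU, memory, and other resources | A "resource budget" for each container: CPU use above the limit is throttled, and memory use above the limit gets the container OOM-killed | Set requests and limits sensibly to avoid both waste and starvation |
| Namespace | Isolates network, storage, and processes | Each container gets its own "room" and shares nothing | Combine with cgroups for strong isolation against resource leakage or attacks |
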
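
One way to make the advice about requests and limits enforceable is a namespace-level LimitRange, which supplies defaults to containers that omit them. A minimal sketch, assuming the `default` namespace and reusing the 2 CPU / 4Gi request and 4 CPU / 8Gi limit from the Deployment in this section; the object name `ai-calc-limits` is hypothetical.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ai-calc-limits     # hypothetical name
  namespace: default       # assumed namespace
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container sets no requests
        cpu: "2"
        memory: 4Gi
      default:             # applied when a container sets no limits
        cpu: "4"
        memory: 8Gi
```
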
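# KMS-encrypted persistent volume claim (RWO + replication, 2 copies)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ai-data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: "ai-storage-class"
  resources:
    requests:
      storage: 10Gi
  volumeMode: Filesystem  # Filesystem or Block; the access mode belongs under accessModes
  # Replication (2 copies) and KMS encryption are not PVC fields. They are
  # requested through the parameters of the StorageClass named above, and the
  # parameter names depend on the CSI driver, for example:
  #   replication: "2"   # RWO + replication, 2 copies
  #   encryption: "kms"  # KMS encryption marker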
---
# Deployment (3 replicas + health checks + node selector + security context)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-calc-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-calc
  template:
    metadata:
      labels:
        app: ai-calc
    spec:
      securityContext:
        fsGroup: 2000  # fsGroup is only valid at Pod level
      containers:
        - name: ai-calc
          image: ai-calc:1.0
          imagePullPolicy: IfNotPresent
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
          volumeMounts:
            - name: ai-data
              mountPath: /data
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          securityContext:
            runAsUser: 1000
            seLinuxOptions:
              # Placeholder MLS/MCS level in sensitivity[:categories] form;
              # replace it with the label defined by the site policy.
              level: "s0:c123,c456"
      volumes:
        - name: ai-data
          persistentVolumeClaim:
            claimName: ai-data-pvc
            # Storage class, replication, and encryption are properties of the
            # claim and its StorageClass; they cannot be set on the volume entry.
      nodeSelector:
        node-role.kubernetes.io/critical: ""  # schedule only onto nodes carrying this label
      affinity:
        nodeAffinity:  # same constraint as nodeSelector above; either one suffices
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/critical
                    operator: In
                    values:
                      - ""
      priorityClassName: "high-priority"
      serviceAccountName: ai-calc-sa
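
The design names HPA as part of its disaster-tolerance story, but no autoscaler manifest accompanies the Deployment. A minimal sketch, assuming a CPU-utilization target and a metrics source such as metrics-server in the cluster; the object name `ai-calc-hpa`, the ceiling of 6 replicas, and the 70% threshold are all hypothetical values chosen for illustration.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-calc-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-calc-deployment     # the Deployment defined in this section
  minReplicas: 3                 # keep at least the 3 replicas the design calls for
  maxReplicas: 6                 # assumed ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # assumed threshold, relative to the CPU request
```

HPA reacts to load metrics only. Replacing Pods lost to a node failure is the job of the Deployment's ReplicaSet, so the two mechanisms complement each other rather than overlap.
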
---
# RBAC (access control)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ai-calc-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ai-calc-role-binding
subjects:
  - kind: ServiceAccount
    name: ai-calc-sa
    namespace: default
roleRef:
  kind: Role
  name: ai-calc-role
  apiGroup: rbac.authorization.k8s.io
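---
# Network policy (security isolation)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-calc-network-policy
spec:
  podSelector:
    matchLabels:
      app: ai-calc
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: "trusted-ns"  # the trusted namespace must carry this label
      ports:
        - protocol: TCP
          port: 8080
  egress:
    # Egress is limited to TCP 8080 in the trusted namespace, so DNS lookups
    # are blocked unless a rule for port 53 is added.
    - to:
        - namespaceSelector:
            matchLabels:
              name: "trusted-ns"
      ports:
        - protocol: TCP
          port: 8080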
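# SELinux/AppArmor configuration (assumed to be applied through the Kubernetes security context)
# Example: SELinux settings on each node (e.g., /etc/selinux/config)
# Add:
#   SELINUX=enforcing
#   SELINUXTYPE=mls   # stock policy types are targeted, minimum, and mls; a custom policy would be named here
# Example: AppArmor profile file (e.g., /etc/apparmor.d/ai-calc)
# Add:
#   #include <tunables/global>
#   profile ai-calc {
#     #include <abstractions/base>
#     # Restrict the container process to the specified directory
#     /data/ r,
#     /data/** rw,
#     # In enforce mode AppArmor denies any path that is not explicitly allowed,
#     # so / and /dev need no separate deny rules. A rule such as "deny / r,"
#     # would match only the root directory itself, not the files below it.
#   }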
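
A profile loaded on the node has no effect until a Pod references it. The fragment below is a sketch of that reference, assuming the profile is loaded under the name `ai-calc` on every eligible node and that the cluster runs Kubernetes 1.30 or later, where the `appArmorProfile` field is available. Older clusters use the annotation `container.apparmor.security.beta.kubernetes.io/<container-name>` with the value `localhost/ai-calc` instead.

```yaml
# Fragment of the Deployment's Pod template (spec.template.spec)
containers:
  - name: ai-calc
    securityContext:
      appArmorProfile:
        type: Localhost
        localhostProfile: ai-calc   # assumed profile name; must already be loaded on the node
```
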
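Notes:

`replication: "2"` and `encryption: "kms"` express the intent of RWO plus replication plus encryption, so that data is stored in multiple copies and encrypted. These keys are not standard Kubernetes PVC fields; in practice they are supplied through the parameters of the StorageClass, under driver-specific names. `nodeSelector` pins the Pods to nodes labeled `node-role.kubernetes.io/critical` rather than keeping them off those nodes. Keeping replicas off the same node requires a `podAntiAffinity` rule, because `nodeAffinity` only restricts which nodes are eligible.

“Hello, interviewer. For a highly reliable containerized deployment on a military AI compute cluster, my design is as follows. First, for data security, we use KMS-encrypted persistent volumes (the storage class is bound to an encrypting backend) together with SELinux/AppArmor sandboxing, so that data is encrypted in storage and in transit and container privileges are restricted to prevent malicious code execution. For scheduling, critical AI tasks get a high priority (weight 1000), and Pod anti-affinity keeps replicas off the same node to preserve resource isolation. For resource isolation, cgroups cap container CPU and memory (for example, requests of 2 cores and 4 GiB), and namespaces isolate network and storage to prevent attacks between containers. For disaster tolerance, we run 3 replicas with health checks every 10 seconds. When a node fails, the Deployment recreates the lost replicas while HPA adjusts the replica count to the load; failover to the remaining replicas (RTO) is targeted at the seconds level, and the target for data loss (RPO) is 0, which depends on the storage backend replicating synchronously. Together these measures provide high reliability and meet the security and availability requirements of military environments.”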
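
Spreading replicas across nodes, as the answer describes, is expressed with a `podAntiAffinity` rule in the Pod template. A minimal sketch, assuming the `app: ai-calc` label from the Deployment in this section and the standard `kubernetes.io/hostname` node label as the topology key. Replicas on different nodes cannot all mount a single `ReadWriteOnce` claim, so this also assumes that each replica gets its own volume (for example through a StatefulSet with `volumeClaimTemplates`) or that the storage supports `ReadWriteMany`.

```yaml
# Fragment of the Pod template (spec.template.spec)
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: ai-calc                      # match the other replicas of this workload
        topologyKey: kubernetes.io/hostname   # at most one matching Pod per node
```

With a hard (`required`) rule, a replica stays Pending when no node free of other `ai-calc` Pods is available; the `preferred` variant trades that guarantee for schedulability.
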
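
To verify encryption, inspect the storage configuration (`encryption: "kms"`) and the storage backend logs, and check that data is encrypted both on the storage medium and in transit.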
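
Encryption and replication are normally configured on the StorageClass that the claim names, which also makes it the first object to inspect during verification. The following is a sketch only, assuming a hypothetical CSI driver `csi.example.com`. The parameter keys are placeholders borrowed from the `replication` and `encryption` markers in this section; real drivers define their own parameter names, including how the KMS key is referenced.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ai-storage-class
provisioner: csi.example.com              # hypothetical CSI driver
parameters:
  replication: "2"                        # placeholder key: number of copies
  encryption: "kms"                       # placeholder key: enable KMS-backed encryption
reclaimPolicy: Retain                     # assumed: keep the volume when the claim is deleted
volumeBindingMode: WaitForFirstConsumer   # assumed: bind once the Pod is scheduled
```

Running `kubectl get storageclass ai-storage-class -o yaml` shows the parameters actually in effect on the cluster.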