Initial commit: initialize the project

fei
2026-02-05 00:11:05 +08:00
commit 26eaf8110b
171 changed files with 17105 additions and 0 deletions

@@ -0,0 +1,429 @@
# PostgreSQL 16 K3s Deployment Guide
This directory contains the complete configuration for deploying a PostgreSQL 16 database in a K3s cluster.
## 📋 Directory Structure
```
001-pg16/
├── README.md                # This file - deployment guide
└── k8s/                     # K8s manifest directory
    ├── namespace.yaml       # infrastructure namespace
    ├── secret.yaml          # Database passwords
    ├── configmap.yaml       # Initialization script
    ├── pvc.yaml             # PersistentVolumeClaim
    ├── deployment.yaml      # PostgreSQL Deployment
    ├── service.yaml         # Services
    └── README.md            # Detailed K8s manifest notes
```
## 🚀 Quick Deployment
### Prerequisites
1. **K3s is installed**
```bash
# Check that K3s is running
sudo systemctl status k3s
# Check node status
sudo kubectl get nodes
```
2. **Configure kubectl access** (optional, avoids using sudo every time)
```bash
# Method 1 (recommended): copy the kubeconfig into your home directory
mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown $USER:$USER ~/.kube/config
chmod 600 ~/.kube/config
# Verify
kubectl get nodes
```
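Alternatively, you can point kubectl at the K3s kubeconfig directly. This mirrors what `003-helm/install_helm.sh` elsewhere in this repo records; note that the file is root-owned by default, so this variant still needs sudo unless you relax its permissions:
```bash
# Method 2: reuse the K3s kubeconfig via an environment variable
echo "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml" >> ~/.bashrc
source ~/.bashrc
kubectl get nodes
```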
### One-Command Deployment
```bash
# Enter the manifest directory
cd /path/to/001-pg16/k8s
# Deploy all resources
kubectl apply -f .
# Or with sudo (if kubectl access was not configured)
sudo kubectl apply -f .
```
### Check Deployment Status
```bash
# Pod status
kubectl get pods -n infrastructure
# Pod details
kubectl describe pod -n infrastructure -l app=pg16
# Initialization logs (follow)
kubectl logs -n infrastructure -l app=pg16 -f
# Service status
kubectl get svc -n infrastructure
# PVC status
kubectl get pvc -n infrastructure
```
## ✅ Verify the Deployment
### 1. Check that the Pod is running
```bash
kubectl get pods -n infrastructure
```
Expected output:
```
NAME                   READY   STATUS    RESTARTS   AGE
pg16-xxxxxxxxx-xxxxx   1/1     Running   0          2m
```
### 2. Verify database creation
```bash
# Count all databases (should be 303)
kubectl exec -n infrastructure deploy/pg16 -- psql -U postgres -c "SELECT count(*) FROM pg_database;"
# List the first 10 databases
kubectl exec -n infrastructure deploy/pg16 -- psql -U postgres -c "SELECT datname FROM pg_database WHERE datname LIKE 'pg0%' ORDER BY datname LIMIT 10;"
# List the last 10 databases
kubectl exec -n infrastructure deploy/pg16 -- psql -U postgres -c "SELECT datname FROM pg_database WHERE datname LIKE 'pg2%' ORDER BY datname DESC LIMIT 10;"
```
Expected results:
- Total databases: 303 (300 application databases + postgres + template0 + template1)
- Database names: pg001, pg002, ..., pg300
- Database owner: fei
### 3. Test a database connection
```bash
# Method 1: run SQL directly in the Pod
kubectl exec -n infrastructure deploy/pg16 -- psql -U fei -d pg001 -c "SELECT current_database(), version();"
# Method 2: interactive session inside the Pod
kubectl exec -it -n infrastructure deploy/pg16 -- bash
# Inside the Pod
psql -U fei -d pg001
# Quit
\q
exit
```
## 🔌 Connecting to the Database
### From inside the cluster
From another Pod in the cluster:
```
Host:     pg16.infrastructure.svc.cluster.local
Port:     5432
User:     fei
Password: feiks..
Database: pg001 ~ pg300
```
Connection string example:
```
postgresql://fei:feiks..@pg16.infrastructure.svc.cluster.local:5432/pg001
```
### From outside the cluster
#### Method 1: NodePort (recommended)
```bash
# Get the node IP
kubectl get nodes -o wide
# Connect via the NodePort
psql -h <node-IP> -U fei -d pg001 -p 30432
```
Connection details:
- Host: node IP address
- Port: 30432
- User: fei
- Password: feiks..
#### Method 2: Port forwarding
```bash
# Forward the service port to localhost
kubectl port-forward -n infrastructure svc/pg16 5432:5432
# Connect from another terminal
psql -h localhost -U fei -d pg001 -p 5432
```
## 📊 Database Details
### Defaults
- **PostgreSQL version**: 16
- **Namespace**: infrastructure
- **Databases**: 300 (pg001 ~ pg300)
- **Superuser**: fei (password: feiks..)
- **System user**: postgres (password: adminks..)
- **Persistent storage**: 20Gi (K3s default local-path StorageClass)
### Resources
- **CPU request**: 500m
- **CPU limit**: 2000m
- **Memory request**: 512Mi
- **Memory limit**: 2Gi
### Service ports
- **ClusterIP service**: pg16, port 5432
- **NodePort service**: pg16-nodeport, port 30432
## 🔧 Common Operations
### Logs
```bash
# Last 50 lines
kubectl logs -n infrastructure -l app=pg16 --tail=50
# Follow the log
kubectl logs -n infrastructure -l app=pg16 -f
# Logs of the previous container (if the Pod restarted)
kubectl logs -n infrastructure -l app=pg16 --previous
```
### Entering the container
```bash
# Open a shell in the PostgreSQL container
kubectl exec -it -n infrastructure deploy/pg16 -- bash
# Jump straight into psql
kubectl exec -it -n infrastructure deploy/pg16 -- psql -U postgres
```
### Restarting the Pod
```bash
# Delete the Pod (the Deployment recreates it automatically)
kubectl delete pod -n infrastructure -l app=pg16
# Or restart the Deployment
kubectl rollout restart deployment pg16 -n infrastructure
```
### Scaling (not recommended for a database)
```bash
# Check the current replica count
kubectl get deployment pg16 -n infrastructure
# Note: this single-instance PostgreSQL setup does not support multiple replicas; keep replicas=1
```
## 🗑️ Uninstall
### Remove the deployment (keep the data)
```bash
# Delete the Deployment and Services
kubectl delete deployment pg16 -n infrastructure
kubectl delete svc pg16 pg16-nodeport -n infrastructure
# The PVC and its data are kept
```
### Remove everything (including data)
```bash
# Delete all resources
kubectl delete -f k8s/
# Or delete them one by one
kubectl delete deployment pg16 -n infrastructure
kubectl delete svc pg16 pg16-nodeport -n infrastructure
kubectl delete pvc pg16-data -n infrastructure
kubectl delete configmap pg16-init-script -n infrastructure
kubectl delete secret pg16-secret -n infrastructure
kubectl delete namespace infrastructure
```
**⚠️ Warning**: deleting the PVC permanently destroys all database data; it cannot be recovered!
## 🔐 Security Recommendations
### Change the default passwords
Change the default passwords immediately after deployment:
```bash
# Open psql in the Pod
kubectl exec -it -n infrastructure deploy/pg16 -- psql -U postgres
# Change the fei user's password
ALTER USER fei WITH PASSWORD 'new-password';
# Change the postgres user's password
ALTER USER postgres WITH PASSWORD 'new-password';
# Quit
\q
```
Then update the Secret so the manifests stay in sync:
```bash
# Edit k8s/secret.yaml and replace the passwords. The Secret uses stringData,
# so plain-text values are fine; base64 is only needed if you switch to the
# data field instead:
#   echo -n "new-password" | base64
# Apply the updated Secret
kubectl apply -f k8s/secret.yaml
```
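If you prefer not to edit the YAML by hand, a sketch that regenerates the Secret in place (key names match `k8s/secret.yaml`; the passwords below are placeholders):
```bash
kubectl create secret generic pg16-secret -n infrastructure \
  --from-literal=POSTGRES_USER=postgres \
  --from-literal=POSTGRES_PASSWORD='new-postgres-password' \
  --from-literal=FEI_PASSWORD='new-fei-password' \
  --dry-run=client -o yaml | kubectl apply -f -
```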
### Network security
- The default configuration exposes the service on NodePort 30432
- For production, consider:
  - Restricting source IPs with a firewall
  - Or deleting the NodePort service and allowing in-cluster access only
  - Adding a NetworkPolicy to restrict access (see the sketch below)
```bash
# Delete the NodePort service (keep in-cluster access only)
kubectl delete svc pg16-nodeport -n infrastructure
```
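A minimal NetworkPolicy sketch for the last point, assuming only Pods in the `infrastructure` namespace should reach PostgreSQL (K3s ships an embedded network policy controller, so this should take effect without extra components; adjust the selectors for your environment):
```bash
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: pg16-allow-same-namespace
  namespace: infrastructure
spec:
  podSelector:
    matchLabels:
      app: pg16
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}        # any Pod in the infrastructure namespace
      ports:
        - protocol: TCP
          port: 5432
EOF
```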
## 🐛 Troubleshooting
### Pod won't start
```bash
# Pod status and events
kubectl describe pod -n infrastructure -l app=pg16
# Namespace events
kubectl get events -n infrastructure --sort-by='.lastTimestamp'
# Logs
kubectl logs -n infrastructure -l app=pg16
```
Common causes:
- **ImagePullBackOff**: the image cannot be pulled; check network connectivity
- **CrashLoopBackOff**: the container fails on startup; check the logs
- **Pending**: the PVC cannot be bound; check the StorageClass
### PVC cannot be bound
```bash
# PVC status
kubectl describe pvc pg16-data -n infrastructure
# Available StorageClasses
kubectl get storageclass
# Check the local-path provisioner
kubectl get pods -n kube-system | grep local-path
```
### Database connection failures
```bash
# Check the services
kubectl get svc -n infrastructure
# Check that the Pod is ready
kubectl get pods -n infrastructure
# Test an in-cluster connection
kubectl run -it --rm debug --image=postgres:16 --restart=Never -- psql -h pg16.infrastructure.svc.cluster.local -U fei -d pg001
```
### Initialization script did not run
If the 300 databases were not created:
```bash
# Look for initialization output in the logs
kubectl logs -n infrastructure -l app=pg16 | grep -i "init\|create database"
# Check that the ConfigMap is mounted
kubectl exec -n infrastructure deploy/pg16 -- ls -la /docker-entrypoint-initdb.d/
# Inspect the script
kubectl exec -n infrastructure deploy/pg16 -- cat /docker-entrypoint-initdb.d/01-init.sh
```
**Note**: the PostgreSQL entrypoint only runs init scripts on first start, when the data directory is empty. To re-initialize:
```bash
# Delete the Deployment and the PVC
kubectl delete deployment pg16 -n infrastructure
kubectl delete pvc pg16-data -n infrastructure
# Redeploy
kubectl apply -f k8s/
```
## 📝 Backup and Restore
### Backups
```bash
# Back up the pg001 database
kubectl exec -n infrastructure deploy/pg16 -- pg_dump -U fei pg001 > pg001_backup.sql
# Back up all databases
kubectl exec -n infrastructure deploy/pg16 -- pg_dumpall -U postgres > all_databases_backup.sql
```
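A hedged variation: PostgreSQL's custom format (`-Fc`) produces compressed dumps that `pg_restore` can restore selectively; a sketch:
```bash
# Compressed custom-format dump of pg001
kubectl exec -n infrastructure deploy/pg16 -- pg_dump -U fei -Fc pg001 > pg001.dump
# Restore it later with pg_restore (into an existing, empty pg001)
kubectl exec -i -n infrastructure deploy/pg16 -- pg_restore -U fei -d pg001 < pg001.dump
```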
### Restore
```bash
# Restore a single database
cat pg001_backup.sql | kubectl exec -i -n infrastructure deploy/pg16 -- psql -U fei pg001
# Restore all databases
cat all_databases_backup.sql | kubectl exec -i -n infrastructure deploy/pg16 -- psql -U postgres
```
### Data persistence
Data lives in K3s local-path storage; the default location is:
```
/var/lib/rancher/k3s/storage/pvc-<uuid>_infrastructure_pg16-data/
```
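To resolve the exact directory for this deployment, a sketch (the PV name comes from the PVC's `volumeName`):
```bash
PV=$(kubectl get pvc pg16-data -n infrastructure -o jsonpath='{.spec.volumeName}')
kubectl describe pv "$PV" | grep -i path
```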
## 📚 Further Reading
- PostgreSQL documentation: https://www.postgresql.org/docs/16/
- K3s documentation: https://docs.k3s.io/
- Kubernetes documentation: https://kubernetes.io/docs/
## 🆘 Getting Help
If something goes wrong, check:
1. Pod logs: `kubectl logs -n infrastructure -l app=pg16`
2. Pod status: `kubectl describe pod -n infrastructure -l app=pg16`
3. Events: `kubectl get events -n infrastructure`
---
**Version Info**
- PostgreSQL: 16
- Created: 2026-01-29
- Last updated: 2026-01-29

@@ -0,0 +1,112 @@
# PostgreSQL 16 K3s Manifests
## Files
- `namespace.yaml` - creates the infrastructure namespace
- `secret.yaml` - holds the PostgreSQL passwords and other sensitive values
- `configmap.yaml` - holds the init script (creates the user and 300 databases)
- `pvc.yaml` - PersistentVolumeClaim (20Gi)
- `deployment.yaml` - PostgreSQL 16 Deployment
- `service.yaml` - service exposure (ClusterIP + NodePort)
## Deployment Steps
### 1. Apply all resources
```bash
kubectl apply -f namespace.yaml
kubectl apply -f secret.yaml
kubectl apply -f configmap.yaml
kubectl apply -f pvc.yaml
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
```
Or apply everything at once:
```bash
kubectl apply -f .
```
### 2. Check the deployment
```bash
# Pod status
kubectl get pods -n infrastructure
# Pod logs
kubectl logs -n infrastructure -l app=pg16 -f
# Services
kubectl get svc -n infrastructure
```
### 3. Access the database
**From inside the cluster** (e.g. from another Pod):
```bash
# Via the ClusterIP service
psql -h pg16.infrastructure.svc.cluster.local -U postgres -p 5432
```
**From outside the cluster:**
```bash
# Via the NodePort (port 30432)
psql -h <node-IP> -U postgres -p 30432
```
**Via kubectl port-forward:**
```bash
kubectl port-forward -n infrastructure svc/pg16 5432:5432
psql -h localhost -U postgres -p 5432
```
## Configuration Notes
### Storage
- Uses the K3s default `local-path` StorageClass
- Requests 20Gi by default
- Data is stored at `/var/lib/postgresql/data/pgdata`
### Resource limits
- Requests: 512Mi memory, 0.5 CPU
- Limits: 2Gi memory, 2 CPU
### Initialization
- Automatically creates the superuser `fei`
- Automatically creates 300 databases (pg001 through pg300)
### Service exposure
- **ClusterIP service**: in-cluster access, service name `pg16`
- **NodePort service**: external access on port `30432`
## Data Migration
### Migrating existing Docker data
If you already have a pgdata directory, you can:
1. Deploy PostgreSQL first without the data
2. Scale the Pod down
3. Copy the data into the host path backing the PVC (see the sketch below)
4. Scale the Pod back up
```bash
# Find the PV backing the PVC
kubectl get pv
# Scale the Pod down
kubectl scale deployment pg16 -n infrastructure --replicas=0
# Copy the data into the host path (usually under /var/lib/rancher/k3s/storage/)
# then scale back up
kubectl scale deployment pg16 -n infrastructure --replicas=1
```
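A sketch of the copy step, assuming the old Docker data lives at `/path/to/old/pgdata` (hypothetical path) and that PGDATA inside the Pod is `/var/lib/postgresql/data/pgdata` as set in `deployment.yaml`:
```bash
# 1) Find the PV bound to the claim and read its host path ("Path:" line)
PV=$(kubectl get pvc pg16-data -n infrastructure -o jsonpath='{.spec.volumeName}')
kubectl describe pv "$PV" | grep -i path
# 2) Copy the old data into <that-path>/pgdata; wipe any freshly initialized
#    pgdata first, then fix ownership (the official postgres image runs as uid/gid 999)
DEST=/var/lib/rancher/k3s/storage/pvc-<uuid>_infrastructure_pg16-data   # value from step 1
sudo rm -rf "$DEST/pgdata"
sudo cp -a /path/to/old/pgdata "$DEST/pgdata"
sudo chown -R 999:999 "$DEST/pgdata"
```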
## Uninstall
```bash
kubectl delete -f .
```
Note: deleting the PVC deletes all data; proceed with caution.

@@ -0,0 +1,19 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: pg16-init-script
namespace: infrastructure
data:
01-init.sh: |
#!/bin/bash
set -e
# Create the superuser fei
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
CREATE USER fei WITH SUPERUSER PASSWORD 'feiks..';
EOSQL
# Create 300 databases
for i in $(seq -w 1 300); do
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" -c "CREATE DATABASE pg${i} OWNER fei;"
done

@@ -0,0 +1,76 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: pg16
namespace: infrastructure
labels:
app: pg16
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app: pg16
template:
metadata:
labels:
app: pg16
spec:
containers:
- name: postgres
image: postgres:16
ports:
- containerPort: 5432
name: postgres
env:
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
name: pg16-secret
key: POSTGRES_USER
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: pg16-secret
key: POSTGRES_PASSWORD
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
volumeMounts:
- name: postgres-data
mountPath: /var/lib/postgresql/data
- name: init-scripts
mountPath: /docker-entrypoint-initdb.d
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "2000m"
livenessProbe:
exec:
command:
- pg_isready
- -U
- postgres
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
readinessProbe:
exec:
command:
- pg_isready
- -U
- postgres
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
volumes:
- name: postgres-data
persistentVolumeClaim:
claimName: pg16-data
- name: init-scripts
configMap:
name: pg16-init-script
defaultMode: 0755

@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
name: infrastructure

@@ -0,0 +1,12 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pg16-data
namespace: infrastructure
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
storageClassName: local-path

@@ -0,0 +1,10 @@
apiVersion: v1
kind: Secret
metadata:
name: pg16-secret
namespace: infrastructure
type: Opaque
stringData:
POSTGRES_PASSWORD: "adminks.."
POSTGRES_USER: "postgres"
FEI_PASSWORD: "feiks.."

@@ -0,0 +1,34 @@
apiVersion: v1
kind: Service
metadata:
name: pg16
namespace: infrastructure
labels:
app: pg16
spec:
type: ClusterIP
ports:
- port: 5432
targetPort: 5432
protocol: TCP
name: postgres
selector:
app: pg16
---
apiVersion: v1
kind: Service
metadata:
name: pg16-nodeport
namespace: infrastructure
labels:
app: pg16
spec:
type: NodePort
ports:
- port: 5432
targetPort: 5432
nodePort: 30432
protocol: TCP
name: postgres
selector:
app: pg16

@@ -0,0 +1,131 @@
# MinIO S3 Object Storage Deployment
## Features
- ✅ MinIO object storage service
- ✅ Automatic SSL certificates (via Caddy)
- ✅ New buckets automatically set to public read-only
- ✅ Web management console
- ✅ S3-compatible API
## Before You Deploy
### 1. Adjust the configuration
Edit `minio.yaml` and replace the following:
**Domains (3 occurrences):**
- `s3.u6.net3w.com` → your S3 API domain
- `console.s3.u6.net3w.com` → your console domain
**Credentials (4 occurrences):**
- `MINIO_ROOT_USER: "admin"` → your admin account
- `MINIO_ROOT_PASSWORD: "adminks.."` → your admin password (at least 8 characters)
**Architecture (1 occurrence):**
- `linux-arm64` → pick the value for your CPU architecture (check it with the command below):
  - ARM64: `linux-arm64`
  - x86_64: `linux-amd64`
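Not sure which architecture the node runs? Check with:
```bash
uname -m   # aarch64 → linux-arm64, x86_64 → linux-amd64
```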
### 2. Configure DNS
Point the domains at your server IP:
```
s3.yourdomain.com          A    your-server-ip
console.s3.yourdomain.com  A    your-server-ip
```
### 3. Configure Caddy
Add to your Caddy configuration (if Caddy terminates SSL):
```
s3.yourdomain.com {
    reverse_proxy traefik.kube-system.svc.cluster.local:80
}
console.s3.yourdomain.com {
    reverse_proxy traefik.kube-system.svc.cluster.local:80
}
```
## Deployment
```bash
# 1. Deploy MinIO
kubectl apply -f minio.yaml
# 2. Check the status
kubectl get pods -n minio
# 3. Check the logs
kubectl logs -n minio -l app=minio -c minio
kubectl logs -n minio -l app=minio -c policy-manager
```
## Accessing the Service
- **Web console**: https://console.s3.yourdomain.com
- **S3 API endpoint**: https://s3.yourdomain.com
- **Credentials**: the MINIO_ROOT_USER and MINIO_ROOT_PASSWORD you configured
## Automatic Bucket Policy
Newly created buckets are set to **public read-only (download)** within about 30 seconds:
- ✅ Anyone can download objects (no authentication)
- ✅ Uploads/deletes still require authentication
To keep a bucket private, switch it back to PRIVATE in the console.
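To verify the policy took effect, an unauthenticated download should succeed (bucket and object names below are placeholders):
```bash
# Expect HTTP 200 for an object in a bucket with the download policy;
# a private bucket returns 403 instead
curl -I https://s3.yourdomain.com/<bucket>/<object>
```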
## Storage
50Gi is allocated by default. To change it, edit the PersistentVolumeClaim in `minio.yaml`:
```yaml
resources:
  requests:
    storage: 50Gi  # change to the size you need
```
## Troubleshooting
### Pod won't start
```bash
kubectl describe pod -n minio <pod-name>
```
### Detailed logs
```bash
# MinIO main container
kubectl logs -n minio <pod-name> -c minio
# Policy manager sidecar
kubectl logs -n minio <pod-name> -c policy-manager
```
### Check the Ingress
```bash
kubectl get ingress -n minio
```
## Architecture
```
User HTTPS request
  ↓
Caddy (SSL termination)
  ↓ HTTP
Traefik (routing)
  ↓
MinIO Service
  ├─ MinIO container (9000: API, 9001: Console)
  └─ Policy Manager container (sets bucket policies automatically)
```
## Uninstall
```bash
kubectl delete -f minio.yaml
```
Note: this deletes all data; back up important files first.

@@ -0,0 +1,169 @@
apiVersion: v1
kind: Namespace
metadata:
name: minio
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: minio-data
namespace: minio
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
storageClassName: local-path
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: minio
namespace: minio
spec:
replicas: 1
selector:
matchLabels:
app: minio
template:
metadata:
labels:
app: minio
spec:
containers:
- name: minio
image: minio/minio:latest
command:
- /bin/sh
- -c
- minio server /data --console-address ":9001"
ports:
- containerPort: 9000
name: api
- containerPort: 9001
name: console
env:
- name: MINIO_ROOT_USER
value: "admin"
- name: MINIO_ROOT_PASSWORD
value: "adminks.."
- name: MINIO_SERVER_URL
value: "https://s3.u6.net3w.com"
- name: MINIO_BROWSER_REDIRECT_URL
value: "https://console.s3.u6.net3w.com"
volumeMounts:
- name: data
mountPath: /data
livenessProbe:
httpGet:
path: /minio/health/live
port: 9000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /minio/health/ready
port: 9000
initialDelaySeconds: 10
periodSeconds: 5
- name: policy-manager
image: alpine:latest
command:
- /bin/sh
- -c
- |
# Install the MinIO client (mc)
wget https://dl.min.io/client/mc/release/linux-arm64/mc -O /usr/local/bin/mc
chmod +x /usr/local/bin/mc
# Wait for MinIO to start
sleep 10
# Configure the mc alias
mc alias set myminio http://localhost:9000 ${MINIO_ROOT_USER} ${MINIO_ROOT_PASSWORD}
echo "Policy manager started. Monitoring buckets..."
# Continuously watch buckets and apply the policy to new ones
while true; do
# List all buckets
mc ls myminio 2>/dev/null | awk '{print $NF}' | sed 's/\///' | while read -r BUCKET; do
if [ -n "$BUCKET" ]; then
# Check the current anonymous policy
POLICY_OUTPUT=$(mc anonymous get myminio/${BUCKET} 2>&1)
# If the bucket is private (output contains "Access permission for" but not "download")
if echo "$POLICY_OUTPUT" | grep -q "Access permission for" && ! echo "$POLICY_OUTPUT" | grep -q "download"; then
echo "Setting download policy for bucket: ${BUCKET}"
mc anonymous set download myminio/${BUCKET}
fi
fi
done
sleep 30
done
env:
- name: MINIO_ROOT_USER
value: "admin"
- name: MINIO_ROOT_PASSWORD
value: "adminks.."
volumes:
- name: data
persistentVolumeClaim:
claimName: minio-data
---
apiVersion: v1
kind: Service
metadata:
name: minio
namespace: minio
spec:
type: ClusterIP
ports:
- port: 9000
targetPort: 9000
name: api
- port: 9001
targetPort: 9001
name: console
selector:
app: minio
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: minio-api
namespace: minio
spec:
ingressClassName: traefik
rules:
- host: s3.u6.net3w.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: minio
port:
number: 9000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: minio-console
namespace: minio
spec:
ingressClassName: traefik
rules:
- host: console.s3.u6.net3w.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: minio
port:
number: 9001

@@ -0,0 +1,65 @@
1. MinIO configuration updated
  - MINIO_SERVER_URL: https://s3.u6.net3w.com
  - MINIO_BROWSER_REDIRECT_URL: https://console.s3.u6.net3w.com
2. Confirmed in the MinIO logs
  API: https://s3.u6.net3w.com
  WebUI: https://console.s3.u6.net3w.com
3. Access tests passed
  - https://s3.u6.net3w.com - redirects correctly to the Console (no port number shown anymore)
  - https://console.s3.u6.net3w.com - reaches the Console directly
Access:
- Web console: https://console.s3.u6.net3w.com
- S3 API endpoint: https://s3.u6.net3w.com
- Credentials:
  - Username: admin
  - Password: adminks..
Configuration file locations:
- MinIO: /home/fei/k3s/009-基础设施/002-s3/minio.yaml
- Caddy: /home/fei/k3s/caddy-ssl-deployment.yaml
All services are up; MinIO object storage is ready to use.
Why not use a fixed bucket name?
In object-storage practice, creating buckets frequently is an anti-pattern.
Suggested approach: create a single bucket named public-assets and set it to the download policy.
Organization: separate businesses or users by prefix ("folder"), e.g. public-assets/user1/, public-assets/user2/.
Result: you set the permission once when deploying the Pod, and every file uploaded to that bucket is downloadable automatically.
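A one-time setup sketch for that approach, using the mc alias already configured in the policy-manager container (the bucket name is the one suggested above):
  mc mb myminio/public-assets
  mc anonymous set download myminio/public-assets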
Configuration complete:
From the logs:
  Policy manager started. Monitoring buckets...
  Setting download policy for bucket: test
  Access permission for `myminio/test` is set to `download`
How it works:
1. Auto-detection - all buckets are scanned every 30 seconds
2. Auto-configuration - any private (PRIVATE) bucket is set to download (public read-only)
3. Already applied - your test bucket has been set to public read-only
Effective permissions:
- ✅ Anyone can download files (no authentication)
- ✅ Uploads/deletes require authentication (Access Key + Secret Key)
How to test:
  # Objects in the test bucket are now directly accessible (no authentication)
  curl https://s3.u6.net3w.com/test/your-file.txt
Going forward:
- Any newly created bucket is set to public read-only within about 30 seconds
- To keep a bucket private, switch it back to PRIVATE in the web console
- The policy manager only touches fully private buckets; it never overwrites a policy you have already set
The configuration is live and ready to use.

@@ -0,0 +1,25 @@
#!/bin/sh
# Automatically set the download (public read-only) policy on newly created buckets
# Configure the mc alias
mc alias set myminio http://localhost:9000 ${MINIO_ROOT_USER} ${MINIO_ROOT_PASSWORD}
# Continuously watch buckets and apply the policy to new ones
while true; do
  # List all buckets
  BUCKETS=$(mc ls myminio 2>/dev/null | awk '{print $NF}' | sed 's/\///')
  for BUCKET in $BUCKETS; do
    # Check the current anonymous policy
    CURRENT_POLICY=$(mc anonymous get myminio/${BUCKET} 2>/dev/null | grep -o "download\|upload\|public" || echo "none")
    # If the policy is none (private), set it to download
    if [ "$CURRENT_POLICY" = "none" ]; then
      echo "Setting download policy for bucket: ${BUCKET}"
      mc anonymous set download myminio/${BUCKET}
    fi
  done
  # Check every 30 seconds
  sleep 30
done

@@ -0,0 +1,4 @@
# Install Helm 3 via the official install script
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# Record the K3s kubeconfig location
echo "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml" >> ~/.bashrc

@@ -0,0 +1,8 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: longhorn-backup-config
namespace: longhorn-system
data:
backup-target: "s3://longhorn-backup@us-east-1/"
backup-target-credential-secret: "longhorn-crypto"

@@ -0,0 +1,10 @@
# 1. Create the namespace
kubectl create namespace longhorn-system
# 2. Apply the S3 secret
kubectl apply -f s3-secret.yaml
# 3. Install the chart from the remote repo using the local values.yaml
helm install longhorn longhorn/longhorn \
--namespace longhorn-system \
-f values.yaml

@@ -0,0 +1,130 @@
Installation Summary:
✅ Namespace created: longhorn-system
✅ S3 Secret created: longhorn-crypto with MinIO credentials
✅ Helm deployment: Longhorn v1 deployed successfully
✅ Components starting:
- longhorn-manager: Running (2/2)
- longhorn-ui: Running (2 replicas)
- longhorn-driver-deployer: Initializing
S3 Configuration Verified:
- AWS_ACCESS_KEY_ID: admin ✓
- AWS_SECRET_ACCESS_KEY: adminks.. ✓
- AWS_ENDPOINTS: https://s3.u6.net3w.com ✓
Next Steps:
The Longhorn components are still initializing. It will take a few minutes for all components (CSI drivers, Engine Images, Instance Managers) to be fully ready.
You can monitor the progress with:
kubectl get pods -n longhorn-system -w
This is a great habit! In production practice it is called **IaC (Infrastructure as Code)**. Even when we install from a remote Helm repository, we should keep a local "configuration manifest" so the cluster can be maintained, migrated, or rebuilt later.
To support this, create two standardized folders under `/home/fei/k3s/009-基础设施/`.
---
## 1. Recommended directory layout
Create the directories with:
```bash
mkdir -p /home/fei/k3s/009-基础设施/003-helm
mkdir -p /home/fei/k3s/009-基础设施/004-longhorn
```
The layout:
* **003-helm**: the Helm install script and environment configuration.
* **004-longhorn**: Longhorn's customized configuration (`values.yaml`) and the S3 secret manifest.
---
## 2. What goes in each folder?
### Folder: 003-helm
Helm is a tool, so this folder mainly holds the **install record** and **environment setup**.
* **Create the file `install_helm.sh`** (a record of how Helm was installed):
```bash
# Install Helm 3 via the official install script
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# Record the K3s kubeconfig location
echo "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml" >> ~/.bashrc
```
* **Create the file `readme.md`**: note the current Helm version and what it is used for (e.g. with the command below).
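For that readme, the installed version can be captured with:
```bash
helm version --short
```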
---
### Folder: 004-longhorn
This is the core part. Even though the chart is installed from the remote repo, the **configuration parameters** are pinned locally.
**Step 1: export the chart's default values**
Inside the folder, run:
```bash
helm repo add longhorn https://charts.longhorn.io
helm repo update
helm show values longhorn/longhorn > values.yaml
```
You now have a several-hundred-line `values.yaml`, the "soul" of the Longhorn installation.
**Step 2: create the S3 secret manifest `s3-secret.yaml`**
Since backups go to S3, the credentials belong in a Kubernetes Secret:
```yaml
apiVersion: v1
kind: Secret
metadata:
name: longhorn-crypto
namespace: longhorn-system
type: Opaque
stringData:
AWS_ACCESS_KEY_ID: "your-access-key-id"
AWS_SECRET_ACCESS_KEY: "your-secret-key"
AWS_ENDPOINTS: "your-S3-endpoint"
```
**Step 3: create a one-shot install script `deploy.sh`**
```bash
# 1. Create the namespace
kubectl create namespace longhorn-system
# 2. Apply the S3 secret
kubectl apply -f s3-secret.yaml
# 3. Install the chart from the remote repo using the local values.yaml
helm install longhorn longhorn/longhorn \
--namespace longhorn-system \
-f values.yaml
```
---
## 3. Why this is worth doing
1. **Version control**: to change the S3 endpoint or replica count later, edit `/004-longhorn/values.yaml` and run `helm upgrade`.
2. **Clarity**: opening `009-基础设施` shows at a glance how S3, Helm, and Longhorn relate to each other.
3. **Fast rebuilds**: if the server ever has to be reinstalled, walking through these folders and running each `deploy.sh` in order brings the whole infrastructure back within minutes.
---
## 4. One more note: where does Redis go?
Following the same scheme, Redis belongs outside `009-基础设施` (or in a new `010-中间件` middleware folder), because Redis sits in the **application layer** while Longhorn sits in the **storage layer**.
**Next: would you like me to pull out the key S3 backup settings in the `values.yaml` above and show you how to change them?**
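For reference, the backup-related block as it already appears in this repo's exported `values.yaml` (pointed at the MinIO endpoint via the `longhorn-crypto` Secret):
```yaml
defaultBackupStore:
  backupTarget: "s3://longhorn-backup@us-east-1/"
  backupTargetCredentialSecret: "longhorn-crypto"
  pollInterval: 300
```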

@@ -0,0 +1,10 @@
apiVersion: v1
kind: Secret
metadata:
name: longhorn-crypto
namespace: longhorn-system
type: Opaque
stringData:
AWS_ACCESS_KEY_ID: "admin"
AWS_SECRET_ACCESS_KEY: "adminks.."
AWS_ENDPOINTS: "https://s3.u6.net3w.com"

@@ -0,0 +1,660 @@
# Default values for longhorn.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
global:
# -- Global override for container image registry.
imageRegistry: "docker.io"
# -- Global override for image pull secrets for container registry.
imagePullSecrets: []
# -- Set container timezone (TZ env) for all Longhorn workloads. Leave empty to use container default.
timezone: ""
# -- Toleration for nodes allowed to run user-deployed components such as Longhorn Manager, Longhorn UI, and Longhorn Driver Deployer.
tolerations: []
# -- Node selector for nodes allowed to run user-deployed components such as Longhorn Manager, Longhorn UI, and Longhorn Driver Deployer.
nodeSelector: {}
cattle:
# -- Default system registry.
systemDefaultRegistry: ""
windowsCluster:
# -- Setting that allows Longhorn to run on a Rancher Windows cluster.
enabled: false
# -- Toleration for Linux nodes that can run user-deployed Longhorn components.
tolerations:
- key: "cattle.io/os"
value: "linux"
effect: "NoSchedule"
operator: "Equal"
# -- Node selector for Linux nodes that can run user-deployed Longhorn components.
nodeSelector:
kubernetes.io/os: "linux"
defaultSetting:
# -- Toleration for system-managed Longhorn components.
taintToleration: cattle.io/os=linux:NoSchedule
# -- Node selector for system-managed Longhorn components.
systemManagedComponentsNodeSelector: kubernetes.io/os:linux
networkPolicies:
# -- Setting that allows you to enable network policies that control access to Longhorn pods.
enabled: false
# -- Distribution that determines the policy for allowing access for an ingress. (Options: "k3s", "rke2", "rke1")
type: "k3s"
image:
longhorn:
engine:
# -- Registry for the Longhorn Engine image.
registry: ""
# -- Repository for the Longhorn Engine image.
repository: longhornio/longhorn-engine
# -- Tag for the Longhorn Engine image.
tag: v1.11.0
manager:
# -- Registry for the Longhorn Manager image.
registry: ""
# -- Repository for the Longhorn Manager image.
repository: longhornio/longhorn-manager
# -- Tag for the Longhorn Manager image.
tag: v1.11.0
ui:
# -- Registry for the Longhorn UI image.
registry: ""
# -- Repository for the Longhorn UI image.
repository: longhornio/longhorn-ui
# -- Tag for the Longhorn UI image.
tag: v1.11.0
instanceManager:
# -- Registry for the Longhorn Instance Manager image.
registry: ""
# -- Repository for the Longhorn Instance Manager image.
repository: longhornio/longhorn-instance-manager
# -- Tag for the Longhorn Instance Manager image.
tag: v1.11.0
shareManager:
# -- Registry for the Longhorn Share Manager image.
registry: ""
# -- Repository for the Longhorn Share Manager image.
repository: longhornio/longhorn-share-manager
# -- Tag for the Longhorn Share Manager image.
tag: v1.11.0
backingImageManager:
# -- Registry for the Backing Image Manager image. When unspecified, Longhorn uses the default value.
registry: ""
# -- Repository for the Backing Image Manager image. When unspecified, Longhorn uses the default value.
repository: longhornio/backing-image-manager
# -- Tag for the Backing Image Manager image. When unspecified, Longhorn uses the default value.
tag: v1.11.0
supportBundleKit:
# -- Registry for the Longhorn Support Bundle Manager image.
registry: ""
# -- Repository for the Longhorn Support Bundle Manager image.
repository: longhornio/support-bundle-kit
# -- Tag for the Longhorn Support Bundle Manager image.
tag: v0.0.79
csi:
attacher:
# -- Registry for the CSI attacher image. When unspecified, Longhorn uses the default value.
registry: ""
# -- Repository for the CSI attacher image. When unspecified, Longhorn uses the default value.
repository: longhornio/csi-attacher
# -- Tag for the CSI attacher image. When unspecified, Longhorn uses the default value.
tag: v4.10.0-20251226
provisioner:
# -- Registry for the CSI Provisioner image. When unspecified, Longhorn uses the default value.
registry: ""
# -- Repository for the CSI Provisioner image. When unspecified, Longhorn uses the default value.
repository: longhornio/csi-provisioner
# -- Tag for the CSI Provisioner image. When unspecified, Longhorn uses the default value.
tag: v5.3.0-20251226
nodeDriverRegistrar:
# -- Registry for the CSI Node Driver Registrar image. When unspecified, Longhorn uses the default value.
registry: ""
# -- Repository for the CSI Node Driver Registrar image. When unspecified, Longhorn uses the default value.
repository: longhornio/csi-node-driver-registrar
# -- Tag for the CSI Node Driver Registrar image. When unspecified, Longhorn uses the default value.
tag: v2.15.0-20251226
resizer:
# -- Registry for the CSI Resizer image. When unspecified, Longhorn uses the default value.
registry: ""
# -- Repository for the CSI Resizer image. When unspecified, Longhorn uses the default value.
repository: longhornio/csi-resizer
# -- Tag for the CSI Resizer image. When unspecified, Longhorn uses the default value.
tag: v2.0.0-20251226
snapshotter:
# -- Registry for the CSI Snapshotter image. When unspecified, Longhorn uses the default value.
registry: ""
# -- Repository for the CSI Snapshotter image. When unspecified, Longhorn uses the default value.
repository: longhornio/csi-snapshotter
# -- Tag for the CSI Snapshotter image. When unspecified, Longhorn uses the default value.
tag: v8.4.0-20251226
livenessProbe:
# -- Registry for the CSI liveness probe image. When unspecified, Longhorn uses the default value.
registry: ""
# -- Repository for the CSI liveness probe image. When unspecified, Longhorn uses the default value.
repository: longhornio/livenessprobe
# -- Tag for the CSI liveness probe image. When unspecified, Longhorn uses the default value.
tag: v2.17.0-20251226
openshift:
oauthProxy:
# -- Registry for the OAuth Proxy image. Specify the upstream image (for example, "quay.io/openshift/origin-oauth-proxy"). This setting applies only to OpenShift users.
registry: ""
# -- Repository for the OAuth Proxy image. Specify the upstream image (for example, "quay.io/openshift/origin-oauth-proxy"). This setting applies only to OpenShift users.
repository: ""
# -- Tag for the OAuth Proxy image. Specify OCP/OKD version 4.1 or later (including version 4.18, which is available at quay.io/openshift/origin-oauth-proxy:4.18). This setting applies only to OpenShift users.
tag: ""
# -- Image pull policy that applies to all user-deployed Longhorn components, such as Longhorn Manager, Longhorn driver, and Longhorn UI.
pullPolicy: IfNotPresent
service:
ui:
# -- Service type for Longhorn UI. (Options: "ClusterIP", "NodePort", "LoadBalancer", "Rancher-Proxy")
type: ClusterIP
# -- NodePort port number for Longhorn UI. When unspecified, Longhorn selects a free port between 30000 and 32767.
nodePort: null
# -- Class of a load balancer implementation
loadBalancerClass: ""
# -- Annotation for the Longhorn UI service.
annotations: {}
## If you want to set annotations for the Longhorn UI service, delete the `{}` in the line above
## and uncomment this example block
# annotation-key1: "annotation-value1"
# annotation-key2: "annotation-value2"
labels: {}
## If you want to set additional labels for the Longhorn UI service, delete the `{}` in the line above
## and uncomment this example block
# label-key1: "label-value1"
# label-key2: "label-value2"
manager:
# -- Service type for Longhorn Manager.
type: ClusterIP
# -- NodePort port number for Longhorn Manager. When unspecified, Longhorn selects a free port between 30000 and 32767.
nodePort: ""
persistence:
# -- Setting that allows you to specify the default Longhorn StorageClass.
defaultClass: true
# -- Filesystem type of the default Longhorn StorageClass.
defaultFsType: ext4
# -- mkfs parameters of the default Longhorn StorageClass.
defaultMkfsParams: ""
# -- Replica count of the default Longhorn StorageClass.
defaultClassReplicaCount: 3
# -- Data locality of the default Longhorn StorageClass. (Options: "disabled", "best-effort")
defaultDataLocality: disabled
# -- Reclaim policy that provides instructions for handling of a volume after its claim is released. (Options: "Retain", "Delete")
reclaimPolicy: Delete
# -- VolumeBindingMode controls when volume binding and dynamic provisioning should occur. (Options: "Immediate", "WaitForFirstConsumer") (Defaults to "Immediate")
volumeBindingMode: "Immediate"
# -- Setting that allows you to enable live migration of a Longhorn volume from one node to another.
migratable: false
# -- Setting that disables the revision counter and thereby prevents Longhorn from tracking all write operations to a volume. When salvaging a volume, Longhorn uses properties of the volume-head-xxx.img file (the last file size and the last time the file was modified) to select the replica to be used for volume recovery.
disableRevisionCounter: "true"
# -- Set NFS mount options for Longhorn StorageClass for RWX volumes
nfsOptions: ""
recurringJobSelector:
# -- Setting that allows you to enable the recurring job selector for a Longhorn StorageClass.
enable: false
# -- Recurring job selector for a Longhorn StorageClass. Ensure that quotes are used correctly when specifying job parameters. (Example: `[{"name":"backup", "isGroup":true}]`)
jobList: []
backingImage:
# -- Setting that allows you to use a backing image in a Longhorn StorageClass.
enable: false
# -- Backing image to be used for creating and restoring volumes in a Longhorn StorageClass. When no backing images are available, specify the data source type and parameters that Longhorn can use to create a backing image.
name: ~
# -- Data source type of a backing image used in a Longhorn StorageClass.
# If the backing image exists in the cluster, Longhorn uses this setting to verify the image.
# If the backing image does not exist, Longhorn creates one using the specified data source type.
dataSourceType: ~
# -- Data source parameters of a backing image used in a Longhorn StorageClass.
# You can specify a JSON string of a map. (Example: `'{\"url\":\"https://backing-image-example.s3-region.amazonaws.com/test-backing-image\"}'`)
dataSourceParameters: ~
# -- Expected SHA-512 checksum of a backing image used in a Longhorn StorageClass.
expectedChecksum: ~
defaultDiskSelector:
# -- Setting that allows you to enable the disk selector for the default Longhorn StorageClass.
enable: false
# -- Disk selector for the default Longhorn StorageClass. Longhorn uses only disks with the specified tags for storing volume data. (Examples: "nvme,sata")
selector: ""
defaultNodeSelector:
# -- Setting that allows you to enable the node selector for the default Longhorn StorageClass.
enable: false
# -- Node selector for the default Longhorn StorageClass. Longhorn uses only nodes with the specified tags for storing volume data. (Examples: "storage,fast")
selector: ""
# -- Setting that allows you to enable automatic snapshot removal during filesystem trim for a Longhorn StorageClass. (Options: "ignored", "enabled", "disabled")
unmapMarkSnapChainRemoved: ignored
# -- Setting that allows you to specify the data engine version for the default Longhorn StorageClass. (Options: "v1", "v2")
dataEngine: v1
# -- Setting that allows you to specify the backup target for the default Longhorn StorageClass.
backupTargetName: default
preUpgradeChecker:
# -- Setting that allows Longhorn to perform pre-upgrade checks. Disable this setting when installing Longhorn using Argo CD or other GitOps solutions.
jobEnabled: true
# -- Setting that allows Longhorn to perform upgrade version checks after starting the Longhorn Manager DaemonSet Pods. Disabling this setting also disables `preUpgradeChecker.jobEnabled`. Longhorn recommends keeping this setting enabled.
upgradeVersionCheck: true
csi:
# -- kubelet root directory. When unspecified, Longhorn uses the default value.
kubeletRootDir: ~
# -- Configures Pod anti-affinity to prevent multiple instances on the same node. Use soft (tries to separate) or hard (must separate). When unspecified, Longhorn uses the default value ("soft").
podAntiAffinityPreset: ~
# -- Replica count of the CSI Attacher. When unspecified, Longhorn uses the default value ("3").
attacherReplicaCount: ~
# -- Replica count of the CSI Provisioner. When unspecified, Longhorn uses the default value ("3").
provisionerReplicaCount: ~
# -- Replica count of the CSI Resizer. When unspecified, Longhorn uses the default value ("3").
resizerReplicaCount: ~
# -- Replica count of the CSI Snapshotter. When unspecified, Longhorn uses the default value ("3").
snapshotterReplicaCount: ~
defaultSettings:
# -- Setting that allows Longhorn to automatically attach a volume and create snapshots or backups when recurring jobs are run.
allowRecurringJobWhileVolumeDetached: ~
# -- Setting that allows Longhorn to automatically create a default disk only on nodes with the label "node.longhorn.io/create-default-disk=true" (if no other disks exist). When this setting is disabled, Longhorn creates a default disk on each node that is added to the cluster.
createDefaultDiskLabeledNodes: ~
# -- Default path to use for storing data on a host. An absolute directory path indicates a filesystem-type disk used by the V1 Data Engine, while a path to a block device indicates a block-type disk used by the V2 Data Engine. The default value is "/var/lib/longhorn/".
defaultDataPath: ~
# -- Default data locality. A Longhorn volume has data locality if a local replica of the volume exists on the same node as the pod that is using the volume.
defaultDataLocality: ~
# -- Setting that allows scheduling on nodes with healthy replicas of the same volume. This setting is disabled by default.
replicaSoftAntiAffinity: ~
# -- Setting that automatically rebalances replicas when an available node is discovered.
replicaAutoBalance: ~
# -- Percentage of storage that can be allocated relative to hard drive capacity. The default value is "100".
storageOverProvisioningPercentage: ~
# -- Percentage of minimum available disk capacity. When the minimum available capacity exceeds the total available capacity, the disk becomes unschedulable until more space is made available for use. The default value is "25".
storageMinimalAvailablePercentage: ~
# -- Percentage of disk space that is not allocated to the default disk on each new Longhorn node.
storageReservedPercentageForDefaultDisk: ~
# -- Upgrade Checker that periodically checks for new Longhorn versions. When a new version is available, a notification appears on the Longhorn UI. This setting is enabled by default
upgradeChecker: ~
# -- The Upgrade Responder sends a notification whenever a new Longhorn version that you can upgrade to becomes available. The default value is https://longhorn-upgrade-responder.rancher.io/v1/checkupgrade.
upgradeResponderURL: ~
# -- The external URL used to access the Longhorn Manager API. When set, this URL is returned in API responses (the actions and links fields) instead of the internal pod IP. This is useful when accessing the API through Ingress or Gateway API HTTPRoute. Format: scheme://host[:port] (for example, https://longhorn.example.com or https://longhorn.example.com:8443). Leave it empty to use the default behavior.
managerUrl: ~
# -- Default number of replicas for volumes created using the Longhorn UI. For Kubernetes configuration, modify the `numberOfReplicas` field in the StorageClass. The default value is "{"v1":"3","v2":"3"}".
defaultReplicaCount: ~
# -- Default name of Longhorn static StorageClass. "storageClassName" is assigned to PVs and PVCs that are created for an existing Longhorn volume. "storageClassName" can also be used as a label, so it is possible to use a Longhorn StorageClass to bind a workload to an existing PV without creating a Kubernetes StorageClass object. "storageClassName" needs to be an existing StorageClass. The default value is "longhorn-static".
defaultLonghornStaticStorageClass: ~
# -- Number of minutes that Longhorn keeps a failed backup resource. When the value is "0", automatic deletion is disabled.
failedBackupTTL: ~
# -- Number of minutes that Longhorn allows for the backup execution. The default value is "1".
backupExecutionTimeout: ~
# -- Setting that restores recurring jobs from a backup volume on a backup target and creates recurring jobs if none exist during backup restoration.
restoreVolumeRecurringJobs: ~
# -- Maximum number of successful recurring backup and snapshot jobs to be retained. When the value is "0", a history of successful recurring jobs is not retained.
recurringSuccessfulJobsHistoryLimit: ~
# -- Maximum number of failed recurring backup and snapshot jobs to be retained. When the value is "0", a history of failed recurring jobs is not retained.
recurringFailedJobsHistoryLimit: ~
# -- Maximum number of snapshots or backups to be retained.
recurringJobMaxRetention: ~
# -- Maximum number of failed support bundles that can exist in the cluster. When the value is "0", Longhorn automatically purges all failed support bundles.
supportBundleFailedHistoryLimit: ~
# -- Taint or toleration for system-managed Longhorn components.
# Specify values using a semicolon-separated list in `kubectl taint` syntax (Example: key1=value1:effect; key2=value2:effect).
taintToleration: ~
# -- Node selector for system-managed Longhorn components.
systemManagedComponentsNodeSelector: ~
# -- Resource limits for system-managed CSI components.
# This setting allows you to configure CPU and memory requests/limits for CSI attacher, provisioner, resizer, snapshotter, and plugin components.
# Supported components: csi-attacher, csi-provisioner, csi-resizer, csi-snapshotter, longhorn-csi-plugin, node-driver-registrar, longhorn-liveness-probe.
# Notice that changing resource limits will cause CSI components to restart, which may temporarily affect volume provisioning and attach/detach operations until the components are ready. The value should be a JSON object with component names as keys and ResourceRequirements as values.
systemManagedCSIComponentsResourceLimits: ~
# -- PriorityClass for system-managed Longhorn components.
# This setting can help prevent Longhorn components from being evicted under Node Pressure.
# Notice that this will be applied to Longhorn user-deployed components by default if there are no priority class values set yet, such as `longhornManager.priorityClass`.
priorityClass: &defaultPriorityClassNameRef "longhorn-critical"
# -- Setting that allows Longhorn to automatically salvage volumes when all replicas become faulty (for example, when the network connection is interrupted). Longhorn determines which replicas are usable and then uses these replicas for the volume. This setting is enabled by default.
autoSalvage: ~
# -- Setting that allows Longhorn to automatically delete a workload pod that is managed by a controller (for example, daemonset) whenever a Longhorn volume is detached unexpectedly (for example, during Kubernetes upgrades). After deletion, the controller restarts the pod and then Kubernetes handles volume reattachment and remounting.
autoDeletePodWhenVolumeDetachedUnexpectedly: ~
# -- Blacklist of controller api/kind values for the setting Automatically Delete Workload Pod when the Volume Is Detached Unexpectedly. If a workload pod is managed by a controller whose api/kind is listed in this blacklist, Longhorn will not automatically delete the pod when its volume is unexpectedly detached. Multiple controller api/kind entries can be specified, separated by semicolons. For example: `apps/StatefulSet;apps/DaemonSet`. Note that the controller api/kind is case sensitive and must exactly match the api/kind in the workload pod's owner reference.
blacklistForAutoDeletePodWhenVolumeDetachedUnexpectedly: ~
# -- Setting that prevents Longhorn Manager from scheduling replicas on a cordoned Kubernetes node. This setting is enabled by default.
disableSchedulingOnCordonedNode: ~
# -- Setting that allows Longhorn to schedule new replicas of a volume to nodes in the same zone as existing healthy replicas. Nodes that do not belong to any zone are treated as existing in the zone that contains healthy replicas. When identifying zones, Longhorn relies on the label "topology.kubernetes.io/zone=<Zone name of the node>" in the Kubernetes node object.
replicaZoneSoftAntiAffinity: ~
# -- Setting that allows scheduling on disks with existing healthy replicas of the same volume. This setting is enabled by default.
replicaDiskSoftAntiAffinity: ~
# -- Policy that defines the action Longhorn takes when a volume is stuck with a StatefulSet or Deployment pod on a node that failed.
nodeDownPodDeletionPolicy: ~
# -- Policy that defines the action Longhorn takes when a node with the last healthy replica of a volume is drained.
nodeDrainPolicy: ~
# -- Setting that allows automatic detaching of manually-attached volumes when a node is cordoned.
detachManuallyAttachedVolumesWhenCordoned: ~
# -- Number of seconds that Longhorn waits before reusing existing data on a failed replica instead of creating a new replica of a degraded volume.
replicaReplenishmentWaitInterval: ~
# -- Maximum number of replicas that can be concurrently rebuilt on each node.
concurrentReplicaRebuildPerNodeLimit: ~
# -- Maximum number of file synchronization operations that can run concurrently during a single replica rebuild. Right now, it's for v1 data engine only.
rebuildConcurrentSyncLimit: ~
# -- Maximum number of volumes that can be concurrently restored on each node using a backup. When the value is "0", restoration of volumes using a backup is disabled.
concurrentVolumeBackupRestorePerNodeLimit: ~
# -- Setting that disables the revision counter and thereby prevents Longhorn from tracking all write operations to a volume. When salvaging a volume, Longhorn uses properties of the "volume-head-xxx.img" file (the last file size and the last time the file was modified) to select the replica to be used for volume recovery. This setting applies only to volumes created using the Longhorn UI.
disableRevisionCounter: '{"v1":"true"}'
# -- Image pull policy for system-managed pods, such as Instance Manager, engine images, and CSI Driver. Changes to the image pull policy are applied only after the system-managed pods restart.
systemManagedPodsImagePullPolicy: ~
# -- Setting that allows you to create and attach a volume without having all replicas scheduled at the time of creation.
allowVolumeCreationWithDegradedAvailability: ~
# -- Setting that allows Longhorn to automatically clean up the system-generated snapshot after replica rebuilding is completed.
autoCleanupSystemGeneratedSnapshot: ~
# -- Setting that allows Longhorn to automatically clean up the snapshot generated by a recurring backup job.
autoCleanupRecurringJobBackupSnapshot: ~
# -- Maximum number of engines that are allowed to concurrently upgrade on each node after Longhorn Manager is upgraded. When the value is "0", Longhorn does not automatically upgrade volume engines to the new default engine image version.
concurrentAutomaticEngineUpgradePerNodeLimit: ~
# -- Number of minutes that Longhorn waits before cleaning up the backing image file when no replicas in the disk are using it.
backingImageCleanupWaitInterval: ~
# -- Number of seconds that Longhorn waits before downloading a backing image file again when the status of all image disk files changes to "failed" or "unknown".
backingImageRecoveryWaitInterval: ~
# -- Percentage of the total allocatable CPU resources on each node to be reserved for each instance manager pod. The default value is {"v1":"12","v2":"12"}.
guaranteedInstanceManagerCPU: ~
# -- Setting that notifies Longhorn that the cluster is using the Kubernetes Cluster Autoscaler.
kubernetesClusterAutoscalerEnabled: ~
# -- Enables Longhorn to automatically delete orphaned resources and their associated data or processes (e.g., stale replicas). Orphaned resources on failed or unknown nodes are not automatically cleaned up.
# You need to specify the resource types to be deleted using a semicolon-separated list (e.g., `replica-data;instance`). Available items are: `replica-data`, `instance`.
orphanResourceAutoDeletion: ~
# -- Specifies the wait time, in seconds, before Longhorn automatically deletes an orphaned Custom Resource (CR) and its associated resources.
# Note that if a user manually deletes an orphaned CR, the deletion occurs immediately and does not respect this grace period.
orphanResourceAutoDeletionGracePeriod: ~
# -- Storage network for in-cluster traffic. When unspecified, Longhorn uses the Kubernetes cluster network.
storageNetwork: ~
# -- Specifies a dedicated network for mounting RWX (ReadWriteMany) volumes. Leave this blank to use the default Kubernetes cluster network. **Caution**: This setting should change after all RWX volumes are detached because some Longhorn component pods must be recreated to apply the setting. You cannot modify this setting while RWX volumes are still attached.
endpointNetworkForRWXVolume: ~
# -- Flag that prevents accidental uninstallation of Longhorn.
deletingConfirmationFlag: ~
# -- Timeout between the Longhorn Engine and replicas. Specify a value between "8" and "30" seconds. The default value is "8".
engineReplicaTimeout: ~
# -- Setting that allows you to enable and disable snapshot hashing and data integrity checks.
snapshotDataIntegrity: ~
# -- Setting that allows disabling of snapshot hashing after snapshot creation to minimize impact on system performance.
snapshotDataIntegrityImmediateCheckAfterSnapshotCreation: ~
# -- Setting that defines when Longhorn checks the integrity of data in snapshot disk files. You must use the Unix cron expression format.
snapshotDataIntegrityCronjob: ~
# -- Setting that controls how many snapshot heavy task operations (such as purge and clone) can run concurrently per node. This is a best-effort mechanism: due to the distributed nature of the system, temporary oversubscription may occur. The limiter reduces worst-case overload but does not guarantee perfect enforcement.
snapshotHeavyTaskConcurrentLimit: ~
# -- Setting that allows Longhorn to automatically mark the latest snapshot and its parent files as removed during a filesystem trim. Longhorn does not remove snapshots containing multiple child files.
removeSnapshotsDuringFilesystemTrim: ~
# -- Setting that allows fast rebuilding of replicas using the checksum of snapshot disk files. Before enabling this setting, you must set the snapshot-data-integrity value to "enable" or "fast-check".
fastReplicaRebuildEnabled: ~
# -- Number of seconds that an HTTP client waits for a response from a File Sync server before considering the connection to have failed.
replicaFileSyncHttpClientTimeout: ~
# -- Number of seconds that Longhorn allows for the completion of replica rebuilding and snapshot cloning operations.
longGRPCTimeOut: ~
# -- Log levels that indicate the type and severity of logs in Longhorn Manager. The default value is "Info". (Options: "Panic", "Fatal", "Error", "Warn", "Info", "Debug", "Trace")
logLevel: ~
# -- Specifies the directory on the host where Longhorn stores log files for the instance manager pod. Currently, it is only used for instance manager pods in the v2 data engine.
logPath: ~
# -- Setting that allows you to specify a backup compression method.
backupCompressionMethod: ~
# -- Maximum number of worker threads that can concurrently run for each backup.
backupConcurrentLimit: ~
# -- Specifies the default backup block size, in MiB, used when creating a new volume. Supported values are 2 or 16.
defaultBackupBlockSize: ~
# -- Maximum number of worker threads that can concurrently run for each restore operation.
restoreConcurrentLimit: ~
# -- Setting that allows you to enable the V1 Data Engine.
v1DataEngine: ~
# -- Setting that allows you to enable the V2 Data Engine, which is based on the Storage Performance Development Kit (SPDK). The V2 Data Engine is an experimental feature and should not be used in production environments.
v2DataEngine: ~
# -- Applies only to the V2 Data Engine. Enables hugepages for the Storage Performance Development Kit (SPDK) target daemon. If disabled, legacy memory is used. Allocation size is set via the Data Engine Memory Size setting.
dataEngineHugepageEnabled: ~
# -- Applies only to the V2 Data Engine. Specifies the hugepage size, in MiB, for the Storage Performance Development Kit (SPDK) target daemon. The default value is "{"v2":"2048"}"
dataEngineMemorySize: ~
# -- Applies only to the V2 Data Engine. Specifies the CPU cores on which the Storage Performance Development Kit (SPDK) target daemon runs. The daemon is deployed in each Instance Manager pod. Ensure that the number of assigned cores does not exceed the guaranteed Instance Manager CPUs for the V2 Data Engine. The default value is "{"v2":"0x1"}".
dataEngineCPUMask: ~
# -- This setting specifies the default write bandwidth limit (in megabytes per second) for volume replica rebuilding when using the v2 data engine (SPDK). If this value is set to 0, there will be no write bandwidth limitation. Individual volumes can override this setting by specifying their own rebuilding bandwidth limit.
replicaRebuildingBandwidthLimit: ~
# -- This setting specifies the default depth of each queue for Ublk frontend. This setting applies to volumes using the V2 Data Engine with Ublk front end. Individual volumes can override this setting by specifying their own Ublk queue depth.
defaultUblkQueueDepth: ~
# -- This setting specifies the default number of queues for the ublk frontend. This setting applies to volumes using the V2 Data Engine with Ublk front end. Individual volumes can override this setting by specifying their own number of queues for ublk.
defaultUblkNumberOfQueue: ~
# -- In seconds. The setting specifies the timeout for the instance manager pod liveness probe. The default value is 10 seconds.
instanceManagerPodLivenessProbeTimeout: ~
# -- Setting that allows scheduling of empty node selector volumes to any node.
allowEmptyNodeSelectorVolume: ~
# -- Setting that allows scheduling of empty disk selector volumes to any disk.
allowEmptyDiskSelectorVolume: ~
# -- Setting that allows Longhorn to periodically collect anonymous usage data for product improvement purposes. Longhorn sends collected data to the [Upgrade Responder](https://github.com/longhorn/upgrade-responder) server, which is the data source of the Longhorn Public Metrics Dashboard (https://metrics.longhorn.io). The Upgrade Responder server does not store data that can be used to identify clients, including IP addresses.
allowCollectingLonghornUsageMetrics: ~
# -- Setting that temporarily prevents all attempts to purge volume snapshots.
disableSnapshotPurge: ~
# -- Maximum snapshot count for a volume. The value should be between 2 to 250
snapshotMaxCount: ~
# -- Applies only to the V2 Data Engine. Specifies the log level for the Storage Performance Development Kit (SPDK) target daemon. Supported values are: Error, Warning, Notice, Info, and Debug. The default is Notice.
dataEngineLogLevel: ~
# -- Applies only to the V2 Data Engine. Specifies the log flags for the Storage Performance Development Kit (SPDK) target daemon.
dataEngineLogFlags: ~
# -- Setting that freezes the filesystem on the root partition before a snapshot is created.
freezeFilesystemForSnapshot: ~
# -- Setting that automatically cleans up the snapshot when the backup is deleted.
autoCleanupSnapshotWhenDeleteBackup: ~
# -- Setting that automatically cleans up the snapshot after the on-demand backup is completed.
autoCleanupSnapshotAfterOnDemandBackupCompleted: ~
# -- Setting that allows Longhorn to detect node failure and immediately migrate affected RWX volumes.
rwxVolumeFastFailover: ~
# -- Enables automatic rebuilding of degraded replicas while the volume is detached. This setting only takes effect if the individual volume setting is set to `ignored` or `enabled`.
offlineReplicaRebuilding: ~
# -- Controls whether Longhorn monitors and records health information for node disks. When disabled, disk health checks and status updates are skipped.
nodeDiskHealthMonitoring: ~
# -- Setting that allows you to update the default backupstore.
defaultBackupStore:
# -- Endpoint used to access the default backupstore. (Options: "NFS", "CIFS", "AWS", "GCP", "AZURE")
backupTarget: "s3://longhorn-backup@us-east-1/"
# -- Name of the Kubernetes secret associated with the default backup target.
backupTargetCredentialSecret: "longhorn-crypto"
# -- Number of seconds that Longhorn waits before checking the default backupstore for new backups. The default value is "300". When the value is "0", polling is disabled.
pollInterval: 300
privateRegistry:
# -- Set to `true` to automatically create a new private registry secret.
createSecret: ~
# -- URL of a private registry. When unspecified, Longhorn uses the default system registry.
registryUrl: ~
# -- User account used for authenticating with a private registry.
registryUser: ~
# -- Password for authenticating with a private registry.
registryPasswd: ~
# -- If create a new private registry secret is true, create a Kubernetes secret with this name; else use the existing secret of this name. Use it to pull images from your private registry.
registrySecret: ~
longhornManager:
log:
# -- Format of Longhorn Manager logs. (Options: "plain", "json")
format: plain
# -- PriorityClass for Longhorn Manager.
priorityClass: *defaultPriorityClassNameRef
# -- Toleration for Longhorn Manager on nodes allowed to run Longhorn components.
tolerations: []
## If you want to set tolerations for Longhorn Manager DaemonSet, delete the `[]` in the line above
## and uncomment this example block
# - key: "key"
# operator: "Equal"
# value: "value"
# effect: "NoSchedule"
# -- Resource requests and limits for Longhorn Manager pods.
resources: ~
# -- Node selector for Longhorn Manager. Specify the nodes allowed to run Longhorn Manager.
nodeSelector: {}
## If you want to set node selector for Longhorn Manager DaemonSet, delete the `{}` in the line above
## and uncomment this example block
# label-key1: "label-value1"
# label-key2: "label-value2"
# -- Annotation for the Longhorn Manager service.
serviceAnnotations: {}
## If you want to set annotations for the Longhorn Manager service, delete the `{}` in the line above
## and uncomment this example block
# annotation-key1: "annotation-value1"
# annotation-key2: "annotation-value2"
serviceLabels: {}
## If you want to set labels for the Longhorn Manager service, delete the `{}` in the line above
## and uncomment this example block
# label-key1: "label-value1"
# label-key2: "label-value2"
## DaemonSet update strategy. Default "100% unavailable" matches the upgrade
## flow (old managers removed before new start); override for rolling updates
## if you prefer that behavior.
updateStrategy:
rollingUpdate:
maxUnavailable: "100%"
longhornDriver:
log:
# -- Format of longhorn-driver logs. (Options: "plain", "json")
format: plain
# -- PriorityClass for Longhorn Driver.
priorityClass: *defaultPriorityClassNameRef
# -- Toleration for Longhorn Driver on nodes allowed to run Longhorn components.
tolerations: []
## If you want to set tolerations for Longhorn Driver Deployer Deployment, delete the `[]` in the line above
## and uncomment this example block
# - key: "key"
# operator: "Equal"
# value: "value"
# effect: "NoSchedule"
# -- Node selector for Longhorn Driver. Specify the nodes allowed to run Longhorn Driver.
nodeSelector: {}
## If you want to set node selector for Longhorn Driver Deployer Deployment, delete the `{}` in the line above
## and uncomment this example block
# label-key1: "label-value1"
# label-key2: "label-value2"
longhornUI:
# -- Replica count for Longhorn UI.
replicas: 2
# -- PriorityClass for Longhorn UI.
priorityClass: *defaultPriorityClassNameRef
# -- Affinity for Longhorn UI pods. Specify the affinity you want to use for Longhorn UI.
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- longhorn-ui
topologyKey: kubernetes.io/hostname
# -- Toleration for Longhorn UI on nodes allowed to run Longhorn components.
tolerations: []
## If you want to set tolerations for Longhorn UI Deployment, delete the `[]` in the line above
## and uncomment this example block
# - key: "key"
# operator: "Equal"
# value: "value"
# effect: "NoSchedule"
# -- Node selector for Longhorn UI. Specify the nodes allowed to run Longhorn UI.
nodeSelector: {}
## If you want to set node selector for Longhorn UI Deployment, delete the `{}` in the line above
## and uncomment this example block
# label-key1: "label-value1"
# label-key2: "label-value2"
ingress:
# -- Setting that allows Longhorn to generate ingress records for the Longhorn UI service.
enabled: false
# -- IngressClass resource that contains ingress configuration, including the name of the Ingress controller.
# ingressClassName can replace the kubernetes.io/ingress.class annotation used in earlier Kubernetes releases.
ingressClassName: ~
# -- Hostname of the Layer 7 load balancer.
host: sslip.io
# -- Extra hostnames for TLS (Subject Alternative Names - SAN). Used when you need multiple FQDNs for the same ingress.
# Example:
# extraHosts:
# - longhorn.example.com
# - longhorn-ui.internal.local
extraHosts: []
# -- Setting that allows you to enable TLS on ingress records.
tls: false
# -- Setting that allows you to enable secure connections to the Longhorn UI service via port 443.
secureBackends: false
# -- TLS secret that contains the private key and certificate to be used for TLS. This setting applies only when TLS is enabled on ingress records.
tlsSecret: longhorn.local-tls
# -- Default ingress path. You can access the Longhorn UI by following the full ingress path {{host}}+{{path}}.
path: /
# -- Ingress path type. To maintain backward compatibility, the default value is "ImplementationSpecific".
pathType: ImplementationSpecific
## If you're using kube-lego, you will want to add:
## kubernetes.io/tls-acme: true
##
## For a full list of possible ingress annotations, please see
## ref: https://github.com/kubernetes/ingress-nginx/blob/master/docs/annotations.md
##
## If tls is set to true, annotation ingress.kubernetes.io/secure-backends: "true" will automatically be set
# -- Ingress annotations in the form of key-value pairs.
annotations:
# kubernetes.io/ingress.class: nginx
# kubernetes.io/tls-acme: true
# -- Secret that contains a TLS private key and certificate. Use secrets if you want to use your own certificates to secure ingresses.
secrets:
## If you're providing your own certificates, please use this to add the certificates as secrets
## key and certificate should start with -----BEGIN CERTIFICATE----- or
## -----BEGIN RSA PRIVATE KEY-----
##
## name should line up with a tlsSecret set further up
## If you're using kube-lego, this is unneeded, as it will create the secret for you if it is not set
##
## It is also possible to create and manage the certificates outside of this helm chart
## Please see README.md for more information
# - name: longhorn.local-tls
# key:
# certificate:
httproute:
# -- Setting that allows Longhorn to generate HTTPRoute records for the Longhorn UI service using Gateway API.
enabled: false
# -- Gateway references for HTTPRoute. Specify which Gateway(s) should handle this route.
parentRefs: []
## Example:
# - name: gateway-name
# namespace: gateway-namespace
# # Optional fields with defaults:
# # group: gateway.networking.k8s.io # default
# # kind: Gateway # default
# # sectionName: https # optional, targets a specific listener
# -- List of hostnames for the HTTPRoute. Multiple hostnames are supported.
hostnames: []
## Example:
# - longhorn.example.com
# - longhorn.example.org
# -- Default path for HTTPRoute. You can access the Longhorn UI by following the full path.
path: /
# -- Path match type for HTTPRoute. (Options: "Exact", "PathPrefix")
pathType: PathPrefix
# -- Annotations for the HTTPRoute resource in the form of key-value pairs.
annotations: {}
## Example:
# annotation-key1: "annotation-value1"
# -- Setting that allows you to enable pod security policies (PSPs) that allow privileged Longhorn pods to start. This setting applies only to clusters running Kubernetes 1.25 and earlier, and with the built-in Pod Security admission controller enabled.
enablePSP: false
# -- Specify override namespace, specifically this is useful for using longhorn as sub-chart and its release namespace is not the `longhorn-system`.
namespaceOverride: ""
# -- Annotation for the Longhorn Manager DaemonSet pods. This setting is optional.
annotations: {}
serviceAccount:
# -- Annotations to add to the service account
annotations: {}
metrics:
serviceMonitor:
# -- Setting that allows the creation of a Prometheus ServiceMonitor resource for Longhorn Manager components.
enabled: false
# -- Additional labels for the Prometheus ServiceMonitor resource.
additionalLabels: {}
# -- Annotations for the Prometheus ServiceMonitor resource.
annotations: {}
# -- Interval at which Prometheus scrapes the metrics from the target.
interval: ""
# -- Timeout after which Prometheus considers the scrape to be failed.
scrapeTimeout: ""
# -- Configures the relabeling rules to apply the targets metadata labels. See the [Prometheus Operator
# documentation](https://prometheus-operator.dev/docs/api-reference/api/#monitoring.coreos.com/v1.Endpoint) for
# formatting details.
relabelings: []
# -- Configures the relabeling rules to apply to the samples before ingestion. See the [Prometheus Operator
# documentation](https://prometheus-operator.dev/docs/api-reference/api/#monitoring.coreos.com/v1.Endpoint) for
# formatting details.
metricRelabelings: []
## openshift settings
openshift:
# -- Setting that allows Longhorn to integrate with OpenShift.
enabled: false
ui:
# -- Route for connections between Longhorn and the OpenShift web console.
route: "longhorn-ui"
# -- Port for accessing the OpenShift web console.
port: 443
# -- Port for proxy that provides access to the OpenShift web console.
proxy: 8443
# -- Setting that allows Longhorn to generate code coverage profiles.
enableGoCoverDir: false
# -- Add extra objects manifests
extraObjects: []

View File

@@ -0,0 +1,60 @@
这是一个非常敏锐的问题。既然 Longhorn 已经通过多副本Replica实现了节点级的高可用为什么还要“多此一举”去对接 S3 呢?
简单来说:**副本是“容灾”,而 S3 是“避难所”。**
以下是 Longhorn 对接 S3 的核心理由:
---
## 1. 应对“全集群覆灭”风险
Longhorn 的默认副本(比如你设定的 3 副本)通常都分布在同一个 K3s 集群的物理节点上。
* **如果不接 S3**如果发生极端情况(如机房断电导致文件系统损坏、人为误删整个 `longhorn-system` 命名空间、或者由于某种 Bug 导致数据卷损坏),你的数据就彻底丢了。
* **对接 S3 后:** S3 是**独立于 K3s 集群之外**的存储。即使你把整个 K3s 集群删掉再重装,只要 S3 里的备份还在,你就能在新集群里瞬间恢复 Redis 或 PG16 的数据。
---
## 2. 异地备份与低成本归档
Longhorn 将数据存在高性能的 SSD 上,价格较贵。而 S3尤其是自建的 MinIO 或云端 S3适合存放不经常访问的长久数据。
* **快照Snapshot**:存在本地,方便快速回滚。
* **备份Backup**:传到 S3。
你可以配置策略:每小时做一个快照,每天将快照同步到 S3 一次。这符合数据备份的 **3-2-1 原则**3 个副本、2 种介质、1 个异地)。
---
## 3. 跨集群迁移与灾备 (DR)
如果你想把业务从“开发环境”迁移到“生产环境”,或者在另一个城市搭建一个备份集群:
1. 在 A 集群中,将 Longhorn 卷备份到 S3。
2. 在 B 集群中,连接同一个 S3。
3. 点击 **Restore**,几分钟内 B 集群就长出了一个一模一样的 Redis 卷。
Longhorn 甚至支持 **DR Volume灾备卷**,它会持续从 S3 拉取更新,处于“热备”状态。
---
## 4. 节省本地磁盘空间
Longhorn 的快照Snapshot如果积攒多了会占用大量的本地磁盘空间。
对接 S3 后,你可以将旧的备份保留在 S3 中而定期清理本地磁盘。S3 的存储空间几乎是无限的。
---
## 5. 你的 Redis 场景如何受益?
对于你的 Redis 独立 Pod
1. **崩溃恢复:** 如果你的 Redis AOF 文件因为某种原因在本地损坏了,你可以从 S3 找回昨天下午 4 点的备份。
2. **版本回滚:** 如果由于代码错误导致 Redis 里的数据被污染了,你可以通过 S3 的备份快速“穿越”回事故发生前。
---
### 配置建议
在你的 `/home/fei/k3s/009-基础设施/004-longhorn/values.yaml` 中,`defaultBackupStore` 下有 `backupTarget`。你应该将其指向你的 S3 桶地址,例如:
`s3://longhorn-backup@us-east-1/`
**由于你已经有了 S3 服务,这等于是“免费”的数据保险。你需要我提供在 Longhorn 中配置 S3 的具体参数格式吗?**
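补充一个参考示例(示意配置,假设使用集群内的 MinIO 作为 S3 服务,端点地址 `http://minio.minio.svc.cluster.local:9000` 需按实际环境调整Secret 名称沿用上文 values.yaml 中的 `longhorn-crypto`Longhorn 读取的键名为 `AWS_ACCESS_KEY_ID`、`AWS_SECRET_ACCESS_KEY`、`AWS_ENDPOINTS`
```bash
# 在 longhorn-system 命名空间创建备份目标凭证 Secret
kubectl create secret generic longhorn-crypto \
  --namespace longhorn-system \
  --from-literal=AWS_ACCESS_KEY_ID='<你的 AccessKey>' \
  --from-literal=AWS_SECRET_ACCESS_KEY='<你的 SecretKey>' \
  --from-literal=AWS_ENDPOINTS='http://minio.minio.svc.cluster.local:9000'
```
创建后在 values.yaml 的 `defaultBackupStore` 中填入 `backupTarget: "s3://longhorn-backup@us-east-1/"` 和 `backupTargetCredentialSecret: "longhorn-crypto"`(与上文一致),执行 `helm upgrade` 后即可在 Longhorn UI 的 Backup 页面看到备份目标。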

View File

@@ -0,0 +1,43 @@
{
email admin@u6.net3w.com
}
# 示例域名配置
test.u6.net3w.com {
reverse_proxy traefik.kube-system.svc.cluster.local:80
}
# MinIO S3 API
s3.u6.net3w.com {
reverse_proxy traefik.kube-system.svc.cluster.local:80
}
# MinIO Console
console.s3.u6.net3w.com {
reverse_proxy traefik.kube-system.svc.cluster.local:80
}
# Longhorn 存储管理
longhorn.u6.net3w.com {
reverse_proxy traefik.kube-system.svc.cluster.local:80
}
# Grafana 监控仪表板
grafana.u6.net3w.com {
reverse_proxy traefik.kube-system.svc.cluster.local:80
}
# Prometheus 监控
prometheus.u6.net3w.com {
reverse_proxy traefik.kube-system.svc.cluster.local:80
}
# Alertmanager 告警管理
alertmanager.u6.net3w.com {
reverse_proxy traefik.kube-system.svc.cluster.local:80
}
# 导航页面
dh.u6.net3w.com {
reverse_proxy traefik.kube-system.svc.cluster.local:80
}

View File

@@ -0,0 +1,16 @@
#!/bin/bash
# 应用 Longhorn Ingress
echo "创建 Longhorn Ingress..."
kubectl apply -f longhorn-ingress.yaml
# 显示 Ingress 状态
echo ""
echo "Ingress 状态:"
kubectl get ingress -n longhorn-system
echo ""
echo "访问 Longhorn UI"
echo "  URL: http://longhorn.u6.net3w.com"
echo "  如果域名尚未解析,需要在 /etc/hosts 中添加:"
echo "  <节点IP> longhorn.u6.net3w.com"

View File

@@ -0,0 +1,19 @@
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: longhorn-ingress
namespace: longhorn-system
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
rules:
- host: longhorn.u6.net3w.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: longhorn-frontend
port:
number: 80

View File

@@ -0,0 +1,202 @@
# Traefik Ingress 控制器配置
## 当前状态
K3s 默认已安装 Traefik 作为 Ingress 控制器。
- **命名空间**: kube-system
- **服务类型**: ClusterIP
- **端口**: 80 (HTTP), 443 (HTTPS)
## Traefik 配置信息
查看 Traefik 配置:
```bash
kubectl get deployment traefik -n kube-system -o yaml
```
查看 Traefik 服务:
```bash
kubectl get svc traefik -n kube-system
```
## 使用 Ingress
### 基本 HTTP Ingress 示例
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: example-ingress
namespace: default
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
rules:
- host: example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: example-service
port:
number: 80
```
### HTTPS Ingress 示例(使用 TLS
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: example-ingress-tls
namespace: default
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/router.tls: "true"
spec:
tls:
- hosts:
- example.com
secretName: example-tls-secret
rules:
- host: example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: example-service
port:
number: 80
```
## 创建 TLS 证书
### 使用 Let's Encrypt (cert-manager)
1. 安装 cert-manager
```bash
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
```
2. 创建 ClusterIssuer
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: your-email@example.com
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: traefik
```
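创建 ClusterIssuer 之后,在 Ingress 上通过 `cert-manager.io/cluster-issuer` 注解引用它cert-manager 会自动签发证书并写入 `secretName` 指定的 Secret。以下为示意域名与 Secret 名称为占位值):
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress-letsencrypt
  namespace: default
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod   # 引用上面创建的 ClusterIssuer
spec:
  tls:
  - hosts:
    - example.com
    secretName: example-com-tls   # 由 cert-manager 自动创建并续期
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example-service
            port:
              number: 80
```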
### 使用自签名证书
```bash
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
-keyout tls.key -out tls.crt \
-subj "/CN=example.com/O=example"
kubectl create secret tls example-tls-secret \
--key tls.key --cert tls.crt -n default
```
## Traefik Dashboard
访问 Traefik Dashboard
```bash
kubectl port-forward -n kube-system $(kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik -o name) 9000:9000
```
然后访问: http://localhost:9000/dashboard/
## 常用注解
### 重定向 HTTP 到 HTTPS
K3s 自带的 Traefik 为 2.x 版本,旧的 `redirect-entry-point` 注解已不再生效,推荐改为引用 redirectScheme 中间件(中间件定义见下方示例):
```yaml
annotations:
  traefik.ingress.kubernetes.io/router.entrypoints: web,websecure
  traefik.ingress.kubernetes.io/router.middlewares: default-redirect-https@kubernetescrd
```
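上面注解引用的 `default-redirect-https` 中间件需要先创建(最小示意,名称与命名空间需和注解中的 `<命名空间>-<名称>@kubernetescrd` 对应):
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: redirect-https
  namespace: default
spec:
  redirectScheme:
    scheme: https
    permanent: true
```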
### 挂载自定义中间件ForwardAuth 示例)
```yaml
annotations:
  traefik.ingress.kubernetes.io/router.middlewares: default-forward-auth@kubernetescrd
```
### 启用 CORS
```yaml
annotations:
traefik.ingress.kubernetes.io/router.middlewares: default-cors@kubernetescrd
```
## 中间件示例
### 创建 ForwardAuth 认证中间件
```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: forward-auth
  namespace: default
spec:
  forwardAuth:
    address: http://auth-service
    trustForwardHeader: true
```
## 监控和日志
查看 Traefik 日志:
```bash
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik -f
```
## 故障排查
### 检查 Ingress 状态
```bash
kubectl get ingress -A
kubectl describe ingress <ingress-name> -n <namespace>
```
### 检查 Traefik 配置
```bash
kubectl get ingressroute -A
kubectl get middleware -A
```
## 外部访问配置
如果需要从外部访问,可以:
1. **使用 NodePort**
```bash
kubectl patch svc traefik -n kube-system -p '{"spec":{"type":"NodePort"}}'
```
2. **使用 LoadBalancer**(需要云环境或 MetalLBK3s 自带的 ServiceLB 也可直接使用)
```bash
kubectl patch svc traefik -n kube-system -p '{"spec":{"type":"LoadBalancer"}}'
```
3. **使用 HostPort**(直接绑定到节点端口 80/443
## 参考资源
- Traefik 官方文档: https://doc.traefik.io/traefik/
- K3s Traefik 配置: https://docs.k3s.io/networking#traefik-ingress-controller

View File

@@ -0,0 +1,34 @@
#!/bin/bash
# 添加 Prometheus 社区 Helm 仓库
echo "添加 Prometheus Helm 仓库..."
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# 创建命名空间
echo "创建 monitoring 命名空间..."
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
# 安装 kube-prometheus-stack (包含 Prometheus, Grafana, Alertmanager)
echo "安装 kube-prometheus-stack..."
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f values.yaml
# 等待部署完成
echo "等待 Prometheus 和 Grafana 启动..."
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=grafana -n monitoring --timeout=300s
# 显示状态
echo ""
echo "监控系统部署完成!"
kubectl get pods -n monitoring
kubectl get svc -n monitoring
echo ""
echo "访问信息:"
echo "  Grafana: http://grafana.u6.net3w.com (需要先应用 ingress.yaml)"
echo " 默认用户名: admin"
echo " 默认密码: prom-operator"
echo ""
echo "  Prometheus: http://prometheus.u6.net3w.com (需要先应用 ingress.yaml)"

View File

@@ -0,0 +1,59 @@
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: grafana-ingress
namespace: monitoring
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
rules:
- host: grafana.u6.net3w.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: kube-prometheus-stack-grafana
port:
number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: prometheus-ingress
namespace: monitoring
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
rules:
- host: prometheus.u6.net3w.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: kube-prometheus-stack-prometheus
port:
number: 9090
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: alertmanager-ingress
namespace: monitoring
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
rules:
- host: alertmanager.u6.net3w.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: kube-prometheus-stack-alertmanager
port:
number: 9093

View File

@@ -0,0 +1,241 @@
# Prometheus + Grafana 监控系统
## 组件说明
### Prometheus
- **功能**: 时间序列数据库,收集和存储指标数据
- **存储**: 20Gi Longhorn 卷
- **数据保留**: 15 天
- **访问**: http://prometheus.u6.net3w.com
### Grafana
- **功能**: 可视化仪表板
- **存储**: 5Gi Longhorn 卷
- **默认用户**: admin
- **默认密码**: prom-operator
- **访问**: http://grafana.u6.net3w.com
### Alertmanager
- **功能**: 告警管理和通知
- **存储**: 5Gi Longhorn 卷
- **访问**: http://alertmanager.u6.net3w.com
### Node Exporter
- **功能**: 收集节点级别的系统指标CPU、内存、磁盘等
### Kube State Metrics
- **功能**: 收集 Kubernetes 资源状态指标
## 部署方式
```bash
bash deploy.sh
```
## 部署后配置
### 1. 应用 Ingress
```bash
kubectl apply -f ingress.yaml
```
### 2. 配置 DNS 或 /etc/hosts
如果域名尚未解析到节点 IP可在本机 /etc/hosts 中添加:
```
<节点IP> grafana.u6.net3w.com
<节点IP> prometheus.u6.net3w.com
<节点IP> alertmanager.u6.net3w.com
```
### 3. 访问 Grafana
1. 打开浏览器访问: http://grafana.u6.net3w.com
2. 使用默认凭证登录:
- 用户名: admin
- 密码: prom-operator
3. 首次登录后建议修改密码
## 预置仪表板
Grafana 已预装多个仪表板:
1. **Kubernetes / Compute Resources / Cluster**
- 集群整体资源使用情况
2. **Kubernetes / Compute Resources / Namespace (Pods)**
- 按命名空间查看 Pod 资源使用
3. **Kubernetes / Compute Resources / Node (Pods)**
- 按节点查看 Pod 资源使用
4. **Kubernetes / Networking / Cluster**
- 集群网络流量统计
5. **Node Exporter / Nodes**
- 节点详细指标CPU、内存、磁盘、网络
## 监控目标
系统会自动监控:
- ✅ Kubernetes API Server
- ✅ Kubelet
- ✅ Node Exporter (节点指标)
- ✅ Kube State Metrics (K8s 资源状态)
- ✅ CoreDNS
- ✅ Prometheus 自身
- ✅ Grafana
## 添加自定义监控
### 监控 Redis
创建 ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: redis-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # kube-prometheus-stack 默认只发现带此 release 标签的 ServiceMonitor
spec:
  selector:
    matchLabels:
      app: redis
  namespaceSelector:
    matchNames:
    - redis
  endpoints:
  - port: metrics    # 指向 redis exporter 暴露指标的端口,端口名以实际 Service 为准
    interval: 30s
```
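注意Redis 本身不暴露 Prometheus 指标,需要先部署 redis_exporter 才有可抓取的 metrics 端口。一个参考做法假设使用社区 chart `prometheus-community/prometheus-redis-exporter`Redis 地址按本集群的 `redis.redis.svc.cluster.local:6379` 填写,具体参数名以该 chart 的 values.yaml 为准):
```bash
helm install redis-exporter prometheus-community/prometheus-redis-exporter \
  --namespace redis \
  --set redisAddress=redis://redis.redis.svc.cluster.local:6379 \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.labels.release=kube-prometheus-stack
```
如果由 exporter 的 chart 直接创建 ServiceMonitor上面手写的 ServiceMonitor 可以省略。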
### 监控 PostgreSQL
需要部署 postgres-exporter
```bash
helm install postgres-exporter prometheus-community/prometheus-postgres-exporter \
--namespace postgresql \
--set config.datasource.host=postgresql-service.postgresql.svc.cluster.local \
--set config.datasource.user=postgres \
--set config.datasource.password=postgres123
```
## 告警配置
### 查看告警规则
```bash
kubectl get prometheusrules -n monitoring
```
### 自定义告警规则
创建 PrometheusRule
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: custom-alerts
namespace: monitoring
spec:
groups:
- name: custom
interval: 30s
rules:
- alert: HighMemoryUsage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "节点内存使用率超过 90%"
description: "节点 {{ $labels.instance }} 内存使用率为 {{ $value | humanizePercentage }}"
```
## 配置告警通知
编辑 Alertmanager 配置:
```bash
kubectl edit secret alertmanager-kube-prometheus-stack-alertmanager -n monitoring
```
添加邮件、Slack、钉钉等通知渠道。
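下面是一个最小的邮件通知示例仅示意SMTP 地址、账号等均为占位值,替换后写回该 Secret 的 `alertmanager.yaml` 键):
```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: '<SMTP 密码>'
route:
  receiver: 'email-default'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
- name: 'email-default'
  email_configs:
  - to: 'admin@u6.net3w.com'
    send_resolved: true
```
kube-prometheus-stack 也支持把这段配置放在 values.yaml 的 `alertmanager.config` 中维护,便于版本管理。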
## 数据持久化
所有数据都存储在 Longhorn 卷上:
- Prometheus 数据: 20Gi
- Grafana 配置: 5Gi
- Alertmanager 数据: 5Gi
可以通过 Longhorn UI 创建快照和备份到 S3。
## 常用操作
### 查看 Prometheus 目标
访问: http://prometheus.u6.net3w.com/targets
### 查看告警
访问: http://alertmanager.u6.net3w.com
### 导入自定义仪表板
1. 访问 Grafana
2. 点击 "+" -> "Import"
3. 输入仪表板 ID 或上传 JSON
推荐仪表板:
- Node Exporter Full: 1860
- Kubernetes Cluster Monitoring: 7249
- Longhorn: 13032
### 查看日志
```bash
# Prometheus 日志
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus -f
# Grafana 日志
kubectl logs -n monitoring -l app.kubernetes.io/name=grafana -f
```
## 性能优化
### 调整数据保留时间
编辑 values.yaml 中的 `retention` 参数,然后:
```bash
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring -f values.yaml
```
### 调整采集间隔
默认采集间隔为 30 秒,可以在 ServiceMonitor 中调整。
## 故障排查
### Prometheus 无法采集数据
```bash
# 检查 ServiceMonitor
kubectl get servicemonitor -A
# 检查 Prometheus 配置
kubectl get prometheus -n monitoring -o yaml
```
### Grafana 无法连接 Prometheus
检查 Grafana 数据源配置:
1. 登录 Grafana
2. Configuration -> Data Sources
3. 确认 Prometheus URL 正确
## 卸载
```bash
helm uninstall kube-prometheus-stack -n monitoring
kubectl delete namespace monitoring
```
## 参考资源
- Prometheus 文档: https://prometheus.io/docs/
- Grafana 文档: https://grafana.com/docs/
- kube-prometheus-stack: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack

View File

@@ -0,0 +1,89 @@
# Prometheus Operator 配置
prometheusOperator:
enabled: true
# Prometheus 配置
prometheus:
enabled: true
prometheusSpec:
retention: 15d
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: longhorn
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 20Gi
resources:
requests:
memory: 512Mi
cpu: 250m
limits:
memory: 2Gi
cpu: 1000m
# Grafana 配置
grafana:
enabled: true
adminPassword: prom-operator
persistence:
enabled: true
storageClassName: longhorn
size: 5Gi
resources:
requests:
memory: 256Mi
cpu: 100m
limits:
memory: 512Mi
cpu: 500m
# Alertmanager 配置
alertmanager:
enabled: true
alertmanagerSpec:
storage:
volumeClaimTemplate:
spec:
storageClassName: longhorn
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 5Gi
# Node Exporter (收集节点指标)
nodeExporter:
enabled: true
# Kube State Metrics (收集 K8s 资源指标)
kubeStateMetrics:
enabled: true
# 默认监控规则
defaultRules:
create: true
rules:
alertmanager: true
etcd: true
configReloaders: true
general: true
k8s: true
kubeApiserverAvailability: true
kubeApiserverSlos: true
kubelet: true
kubeProxy: true
kubePrometheusGeneral: true
kubePrometheusNodeRecording: true
kubernetesApps: true
kubernetesResources: true
kubernetesStorage: true
kubernetesSystem: true
kubeScheduler: true
kubeStateMetrics: true
network: true
node: true
nodeExporterAlerting: true
nodeExporterRecording: true
prometheus: true
prometheusOperator: true

View File

@@ -0,0 +1,40 @@
#!/bin/bash
# KEDA 部署脚本
echo "开始部署 KEDA..."
# 设置 KUBECONFIG
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
# 添加 KEDA Helm 仓库
echo "添加 KEDA Helm 仓库..."
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
# 创建命名空间
echo "创建 keda 命名空间..."
kubectl create namespace keda --dry-run=client -o yaml | kubectl apply -f -
# 安装 KEDA
echo "安装 KEDA..."
helm install keda kedacore/keda \
--namespace keda \
-f values.yaml
# 等待 KEDA 组件就绪
echo "等待 KEDA 组件启动..."
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=keda-operator -n keda --timeout=300s
# 显示状态
echo ""
echo "KEDA 部署完成!"
kubectl get pods -n keda
kubectl get svc -n keda
echo ""
echo "验证 KEDA CRD"
kubectl get crd | grep keda
echo ""
echo "KEDA 已成功部署到命名空间: keda"

View File

@@ -0,0 +1,16 @@
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
name: my-web-app-scaler
spec:
  hosts:
    - my-app.example.com # 你的域名
targetPendingRequests: 100
scaleTargetRef:
name: your-deployment-name # 你想缩放到 0 的应用名
kind: Deployment
apiVersion: apps/v1
service: your-service-name
port: 80
replicas:
min: 0 # 核心:无人访问时缩放为 0
max: 10

View File

@@ -0,0 +1,22 @@
#!/bin/bash
# 安装 KEDA HTTP Add-on
echo "安装 KEDA HTTP Add-on..."
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
# 安装 HTTP Add-on使用默认配置
helm install http-add-on kedacore/keda-add-ons-http \
--namespace keda
echo "等待 HTTP Add-on 组件启动..."
sleep 10
echo ""
echo "HTTP Add-on 部署完成!"
kubectl get pods -n keda | grep http
echo ""
echo "HTTP Add-on 服务:"
kubectl get svc -n keda | grep http

View File

@@ -0,0 +1,458 @@
# KEDA 自动扩缩容
## 功能说明
KEDA (Kubernetes Event Driven Autoscaling) 为 K3s 集群提供基于事件驱动的自动扩缩容能力。
### 核心功能
- **按需启动/停止服务**:空闲时自动缩容到 0节省资源
- **基于指标自动扩缩容**:根据实际负载动态调整副本数
- **多种触发器支持**CPU、内存、Prometheus 指标、数据库连接等
- **与 Prometheus 集成**:利用现有监控数据进行扩缩容决策
## 部署方式
```bash
cd /home/fei/k3s/009-基础设施/007-keda
bash deploy.sh
```
## 已配置的服务
### 1. Navigation 导航服务 ✅
- **最小副本数**: 0空闲时完全停止
- **最大副本数**: 10
- **触发条件**:
- HTTP 请求速率 > 10 req/min
- CPU 使用率 > 60%
- **冷却期**: 3 分钟
**配置文件**: `scalers/navigation-scaler.yaml`
### 2. Redis 缓存服务 ⏳
- **最小副本数**: 0空闲时完全停止
- **最大副本数**: 5
- **触发条件**:
- 有客户端连接
- CPU 使用率 > 70%
- **冷却期**: 5 分钟
**配置文件**: `scalers/redis-scaler.yaml`
**状态**: 待应用(需要先为 Redis 添加 Prometheus exporter
### 3. PostgreSQL 数据库 ❌
**不推荐使用 KEDA 扩展 PostgreSQL**
原因:
- PostgreSQL 是有状态服务,多个副本会导致存储冲突
- 需要配置主从复制才能安全扩展
- 建议使用 PostgreSQL Operator 或 PgBouncer + KEDA
详细说明:`scalers/postgresql-说明.md`
## 应用 ScaledObject
### 部署所有 Scaler
```bash
# 应用 Navigation Scaler
kubectl apply -f scalers/navigation-scaler.yaml
# 应用 Redis Scaler需要先配置 Redis exporter
kubectl apply -f scalers/redis-scaler.yaml
# ⚠️ PostgreSQL 不推荐使用 KEDA 扩展
# 详见: scalers/postgresql-说明.md
```
### 查看 ScaledObject 状态
```bash
# 查看所有 ScaledObject
kubectl get scaledobject -A
# 查看详细信息
kubectl describe scaledobject navigation-scaler -n navigation
kubectl describe scaledobject redis-scaler -n redis
# PostgreSQL 未配置 ScaledObject不适用 KEDA详见 scalers/postgresql-说明.md
```
### 查看自动创建的 HPA
```bash
# KEDA 会自动创建 HorizontalPodAutoscaler
kubectl get hpa -A
```
## 支持的触发器类型
### 1. Prometheus 指标
```yaml
triggers:
- type: prometheus
metadata:
serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
metricName: custom_metric
query: sum(rate(http_requests_total[1m]))
threshold: "100"
```
### 2. CPU/内存使用率
```yaml
triggers:
- type: cpu
  metricType: Utilization
  metadata:
    value: "70"
- type: memory
  metricType: Utilization
  metadata:
    value: "80"
```
### 3. Redis 队列长度
```yaml
triggers:
- type: redis
metadata:
address: redis.redis.svc.cluster.local:6379
listName: mylist
listLength: "5"
```
### 4. PostgreSQL 查询
```yaml
triggers:
- type: postgresql
metadata:
connectionString: postgresql://user:pass@host:5432/db
query: "SELECT COUNT(*) FROM tasks WHERE status='pending'"
targetQueryValue: "10"
```
### 5. Cron 定时触发
```yaml
triggers:
- type: cron
metadata:
timezone: Asia/Shanghai
start: 0 8 * * * # 每天 8:00 扩容
end: 0 18 * * * # 每天 18:00 缩容
desiredReplicas: "3"
```
## 为新服务添加自动扩缩容
### 步骤 1: 确保服务配置正确
服务的 Deployment 必须配置 `resources.requests`
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
spec:
# 不要设置 replicas由 KEDA 管理
template:
spec:
containers:
- name: myapp
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
```
### 步骤 2: 创建 ScaledObject
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: myapp-scaler
namespace: myapp
spec:
scaleTargetRef:
name: myapp
minReplicaCount: 0
maxReplicaCount: 10
pollingInterval: 30
cooldownPeriod: 300
triggers:
- type: prometheus
metadata:
serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
metricName: myapp_requests
query: sum(rate(http_requests_total{app="myapp"}[1m]))
threshold: "50"
```
### 步骤 3: 应用配置
```bash
kubectl apply -f myapp-scaler.yaml
```
## 监控和调试
### 查看 KEDA 日志
```bash
# Operator 日志
kubectl logs -n keda -l app.kubernetes.io/name=keda-operator -f
# Metrics Server 日志
kubectl logs -n keda -l app.kubernetes.io/name=keda-metrics-apiserver -f
```
### 查看扩缩容事件
```bash
# 查看 HPA 事件
kubectl describe hpa -n <namespace>
# 查看 Pod 事件
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
### 在 Prometheus 中查询 KEDA 指标
访问 https://prometheus.u6.net3w.com查询
```promql
# KEDA Scaler 活跃状态
keda_scaler_active
# KEDA Scaler 错误
keda_scaler_errors_total
# 当前指标值
keda_scaler_metrics_value
```
### 在 Grafana 中查看 KEDA 仪表板
1. 访问 https://grafana.u6.net3w.com
2. 导入 KEDA 官方仪表板 ID: **14691**
3. 查看实时扩缩容状态
## 测试自动扩缩容
### 测试 Navigation 服务
**测试缩容到 0**
```bash
# 1. 停止访问导航页面,等待 3 分钟
sleep 180
# 2. 检查副本数
kubectl get deployment navigation -n navigation
# 预期输出READY 0/0
```
**测试从 0 扩容:**
```bash
# 1. 访问导航页面
curl https://dh.u6.net3w.com
# 2. 监控副本数变化
kubectl get deployment navigation -n navigation -w
# 预期:副本数从 0 变为 1约 10-30 秒)
```
### 测试 Redis 服务
**测试基于连接数扩容:**
```bash
# 1. 连接 Redis
kubectl run redis-client --rm -it --image=redis:7-alpine -- redis-cli -h redis.redis.svc.cluster.local
# 2. 在另一个终端监控
kubectl get deployment redis -n redis -w
# 预期:有连接时副本数从 0 变为 1
```
### 测试 PostgreSQL 服务
⚠️ 当前部署未对 PostgreSQL 配置 KEDA有状态服务不能直接扩副本详见 `scalers/postgresql-说明.md`)。如果后续采用 PgBouncer + KEDA 方案,可以用类似方式制造连接压力来观察连接池扩容(以下命令仅作示意,假设连接池 Deployment 名为 pgbouncer
```bash
# 1. 创建多个数据库连接(对连接池施压)
for i in {1..15}; do
  kubectl run pg-client-$i --image=postgres:16-alpine --restart=Never -- \
    psql -h postgresql-service.postgresql.svc.cluster.local -U postgres -c "SELECT pg_sleep(60);" &
done
# 2. 监控由 KEDA 管理的连接池副本数
kubectl get deployment pgbouncer -n postgresql -w
```
## 故障排查
### ScaledObject 未生效
**检查 ScaledObject 状态:**
```bash
kubectl describe scaledobject <name> -n <namespace>
```
**常见问题:**
1. **Deployment 设置了固定 replicas**
- 解决:移除 Deployment 中的 `replicas` 字段
2. **缺少 resources.requests**
- 解决:为容器添加 `resources.requests` 配置
3. **Prometheus 查询错误**
- 解决:在 Prometheus UI 中测试查询语句
### 服务无法缩容到 0
**可能原因:**
1. **仍有活跃连接或请求**
- 检查:查看 Prometheus 指标值
2. **cooldownPeriod 未到**
- 检查:等待冷却期结束
3. **minReplicaCount 设置错误**
- 检查:确认 `minReplicaCount: 0`
### 扩容速度慢
**优化建议:**
1. **减少 pollingInterval**
```yaml
pollingInterval: 15 # 从 30 秒改为 15 秒
```
2. **降低 threshold**
```yaml
threshold: "5" # 降低触发阈值
```
3. **使用多个触发器**
```yaml
triggers:
- type: prometheus
# ...
- type: cpu
# ...
```
## 最佳实践
### 1. 合理设置副本数范围
- **无状态服务**`minReplicaCount: 0`,节省资源
- **有状态服务**`minReplicaCount: 1`,保证可用性
- **关键服务**`minReplicaCount: 2`,保证高可用
### 2. 选择合适的冷却期
- **快速响应服务**`cooldownPeriod: 60-180`1-3 分钟)
- **一般服务**`cooldownPeriod: 300`5 分钟)
- **数据库服务**`cooldownPeriod: 600-900`10-15 分钟)
### 3. 监控扩缩容行为
- 定期查看 Grafana 仪表板
- 设置告警规则
- 分析扩缩容历史
### 4. 测试冷启动时间
- 测量从 0 扩容到可用的时间
- 优化镜像大小和启动脚本
- 考虑使用 `minReplicaCount: 1` 避免冷启动
## 配置参考
### ScaledObject 完整配置示例
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: example-scaler
namespace: example
spec:
scaleTargetRef:
name: example-deployment
kind: Deployment # 可选Deployment, StatefulSet
apiVersion: apps/v1 # 可选
minReplicaCount: 0 # 最小副本数
maxReplicaCount: 10 # 最大副本数
pollingInterval: 30 # 轮询间隔(秒)
cooldownPeriod: 300 # 缩容冷却期(秒)
idleReplicaCount: 0 # 空闲时的副本数
fallback: # 故障回退配置
failureThreshold: 3
replicas: 2
advanced: # 高级配置
restoreToOriginalReplicaCount: false
horizontalPodAutoscalerConfig:
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: custom_metric
query: sum(rate(metric[1m]))
threshold: "100"
```
## 卸载 KEDA
```bash
# 删除所有 ScaledObject
kubectl delete scaledobject --all -A
# 卸载 KEDA
helm uninstall keda -n keda
# 删除命名空间
kubectl delete namespace keda
```
## 参考资源
- KEDA 官方文档: https://keda.sh/docs/
- KEDA Scalers: https://keda.sh/docs/scalers/
- KEDA GitHub: https://github.com/kedacore/keda
- Grafana 仪表板: https://grafana.com/grafana/dashboards/14691
---
**KEDA 让您的 K3s 集群更智能、更高效!** 🚀

View File

@@ -0,0 +1,380 @@
# KEDA HTTP Add-on 自动缩容到 0 配置指南
本指南说明如何使用 KEDA HTTP Add-on 实现应用在无流量时自动缩容到 0有访问时自动启动。
## 前提条件
1. K3s 集群已安装
2. KEDA 已安装
3. KEDA HTTP Add-on 已安装
4. Traefik 作为 Ingress Controller
### 检查 KEDA HTTP Add-on 是否已安装
```bash
kubectl get pods -n keda | grep http
```
应该看到类似输出:
```
keda-add-ons-http-controller-manager-xxx 1/1 Running
keda-add-ons-http-external-scaler-xxx 1/1 Running
keda-add-ons-http-interceptor-xxx 1/1 Running
```
### 如果未安装,执行以下命令安装
```bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install http-add-on kedacore/keda-add-ons-http --namespace keda
```
## 配置步骤
### 1. 准备应用的基础资源
确保你的应用已经有以下资源:
- Deployment
- Service
- Namespace
示例:
```yaml
apiVersion: v1
kind: Namespace
metadata:
name: myapp
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
namespace: myapp
spec:
replicas: 1
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
spec:
containers:
- name: myapp
image: your-image:tag
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: myapp
namespace: myapp
spec:
selector:
app: myapp
ports:
- port: 80
targetPort: 80
```
### 2. 创建 HTTPScaledObject
这是实现自动缩容到 0 的核心配置。
```yaml
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
name: myapp-http-scaler
namespace: myapp # 必须与应用在同一个 namespace
spec:
hosts:
- myapp.example.com # 你的域名
pathPrefixes:
- / # 匹配的路径前缀
scaleTargetRef:
name: myapp # Deployment 名称
kind: Deployment
apiVersion: apps/v1
service: myapp # Service 名称
port: 80 # Service 端口
replicas:
min: 0 # 空闲时缩容到 0
max: 10 # 最多扩容到 10 个副本
scalingMetric:
requestRate:
granularity: 1s
targetValue: 100 # 每秒 100 个请求时扩容
window: 1m
  scaledownPeriod: 300       # 5 分钟300 秒)无流量后缩容到 0
```
**重要参数说明:**
- `hosts`: 你的应用域名
- `scaleTargetRef.name`: 你的 Deployment 名称
- `scaleTargetRef.service`: 你的 Service 名称
- `scaleTargetRef.port`: 你的 Service 端口
- `replicas.min: 0`: 允许缩容到 0
- `scaledownPeriod`: 无流量后多久缩容(秒)
### 3. 创建 Traefik IngressRoute
**重要IngressRoute 必须在 keda namespace 中创建**,因为它需要引用 keda namespace 的拦截器服务。
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: myapp-ingress
namespace: keda # 注意:必须在 keda namespace
spec:
entryPoints:
- web # HTTP 入口
# - websecure # 如果需要 HTTPS添加这个
routes:
- match: Host(`myapp.example.com`) # 你的域名
kind: Rule
services:
- name: keda-add-ons-http-interceptor-proxy
port: 8080
```
**如果需要 HTTPS添加 TLS 配置:**
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: myapp-ingress
namespace: keda
spec:
entryPoints:
- websecure
routes:
- match: Host(`myapp.example.com`)
kind: Rule
services:
- name: keda-add-ons-http-interceptor-proxy
port: 8080
tls:
certResolver: letsencrypt # 你的证书解析器
```
### 4. 完整配置文件模板
将以下内容保存为 `myapp-keda-scaler.yaml`,并根据你的应用修改相应的值:
```yaml
---
# HTTPScaledObject - 实现自动缩容到 0
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
name: myapp-http-scaler
namespace: myapp # 改为你的 namespace
spec:
hosts:
- myapp.example.com # 改为你的域名
pathPrefixes:
- /
scaleTargetRef:
name: myapp # 改为你的 Deployment 名称
kind: Deployment
apiVersion: apps/v1
service: myapp # 改为你的 Service 名称
port: 80 # 改为你的 Service 端口
replicas:
min: 0
max: 10
scalingMetric:
requestRate:
granularity: 1s
targetValue: 100
window: 1m
scaledownPeriod: 300 # 5 分钟无流量后缩容
---
# Traefik IngressRoute - 路由流量到 KEDA 拦截器
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: myapp-ingress
namespace: keda # 必须在 keda namespace
spec:
entryPoints:
- web
routes:
- match: Host(`myapp.example.com`) # 改为你的域名
kind: Rule
services:
- name: keda-add-ons-http-interceptor-proxy
port: 8080
```
### 5. 应用配置
```bash
kubectl apply -f myapp-keda-scaler.yaml
```
### 6. 验证配置
```bash
# 查看 HTTPScaledObject 状态
kubectl get httpscaledobject -n myapp
# 应该看到 READY = True
# NAME TARGETWORKLOAD TARGETSERVICE MINREPLICAS MAXREPLICAS AGE READY
# myapp-http-scaler apps/v1/Deployment/myapp myapp:80 0 10 10s True
# 查看 IngressRoute
kubectl get ingressroute -n keda
# 查看当前 Pod 数量
kubectl get pods -n myapp
```
## 工作原理
1. **有流量时**
- 用户访问 `myapp.example.com`
- Traefik 将流量路由到 KEDA HTTP 拦截器
- 拦截器检测到请求,通知 KEDA 启动 Pod
- Pod 启动后5-10秒拦截器将流量转发到应用
- 用户看到正常响应(首次访问可能有延迟)
2. **无流量时**
   - 5 分钟scaledownPeriod无请求后
- KEDA 自动将 Deployment 缩容到 0
- 不消耗任何计算资源
## 常见问题排查
### 1. 访问返回 404
**检查 IngressRoute 是否在 keda namespace**
```bash
kubectl get ingressroute -n keda | grep myapp
```
如果不在,删除并重新创建:
```bash
kubectl delete ingressroute myapp-ingress -n myapp # 删除错误的
kubectl apply -f myapp-keda-scaler.yaml # 重新创建
```
### 2. HTTPScaledObject READY = False
**查看详细错误信息:**
```bash
kubectl describe httpscaledobject myapp-http-scaler -n myapp
```
**常见错误:**
- `workload already managed by ScaledObject`: 删除旧的 ScaledObject
```bash
kubectl delete scaledobject myapp-scaler -n myapp
```
### 3. Pod 没有自动缩容到 0
**检查是否有旧的 ScaledObject 阻止缩容:**
```bash
kubectl get scaledobject -n myapp
```
如果有,删除它:
```bash
kubectl delete scaledobject <name> -n myapp
```
### 4. 查看 KEDA 拦截器日志
```bash
kubectl logs -n keda -l app.kubernetes.io/name=keda-add-ons-http-interceptor --tail=50
```
### 5. 测试拦截器是否工作
```bash
# 获取拦截器服务 IP
kubectl get svc keda-add-ons-http-interceptor-proxy -n keda
# 直接测试拦截器
curl -H "Host: myapp.example.com" http://<CLUSTER-IP>:8080
```
## 调优建议
### 调整缩容时间
根据你的应用特点调整 `scaledownPeriod`
- **频繁访问的应用**:设置较长时间(如 600 秒 = 10 分钟)
- **偶尔访问的应用**:设置较短时间(如 180 秒 = 3 分钟)
- **演示/测试环境**:可以设置很短(如 60 秒 = 1 分钟)
```yaml
scaledownPeriod: 600 # 10 分钟
```
### 调整扩容阈值
根据应用负载调整 `targetValue`
```yaml
scalingMetric:
requestRate:
targetValue: 50 # 每秒 50 个请求时扩容(更敏感)
```
### 调整最大副本数
```yaml
replicas:
min: 0
max: 20 # 根据你的资源和需求调整
```
## 监控和观察
### 实时监控 Pod 变化
```bash
watch -n 2 'kubectl get pods -n myapp'
```
### 查看 HTTPScaledObject 事件
```bash
kubectl describe httpscaledobject myapp-http-scaler -n myapp
```
### 查看 Deployment 副本数变化
```bash
kubectl get deployment myapp -n myapp -w
```
## 完整示例navigation 应用
参考 `navigation-complete.yaml` 文件,这是一个完整的工作示例。
## 注意事项
1. **首次访问延迟**Pod 从 0 启动需要 5-10 秒,用户首次访问会有延迟
2. **数据库连接**:确保应用能够快速重新建立数据库连接
3. **会话状态**:不要在 Pod 中存储会话状态,使用 Redis 等外部存储
4. **健康检查**:配置合理的 readinessProbe确保 Pod 就绪后才接收流量(示例见下方)
5. **资源限制**:设置合理的 resources limits避免启动过慢
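关于第 4、5 点的配置,下面是一个最小示意(假设应用在 80 端口提供 HTTP 服务,健康检查路径按实际情况调整):
```yaml
containers:
- name: myapp
  image: your-image:tag
  ports:
  - containerPort: 80
  readinessProbe:
    httpGet:
      path: /          # 健康检查路径
      port: 80
    initialDelaySeconds: 3   # 容器启动后等待 3 秒再开始探测
    periodSeconds: 5         # 每 5 秒探测一次
    failureThreshold: 3      # 连续失败 3 次视为未就绪
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi
```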
## 参考资源
- KEDA 官方文档: https://keda.sh/
- KEDA HTTP Add-on: https://github.com/kedacore/http-add-on
- Traefik IngressRoute: https://doc.traefik.io/traefik/routing/providers/kubernetes-crd/

View File

@@ -0,0 +1,45 @@
---
# HTTPScaledObject - 用于实现缩容到 0 的核心配置
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
name: navigation-http-scaler
namespace: navigation
spec:
hosts:
- dh.u6.net3w.com
pathPrefixes:
- /
scaleTargetRef:
name: navigation
kind: Deployment
apiVersion: apps/v1
service: navigation
port: 80
replicas:
min: 0 # 空闲时缩容到 0
max: 10 # 最多 10 个副本
scalingMetric:
requestRate:
granularity: 1s
targetValue: 100 # 每秒 100 个请求时扩容
window: 1m
scaledownPeriod: 300 # 5 分钟无流量后缩容到 0
---
# Traefik IngressRoute - 将流量路由到 KEDA HTTP Add-on 的拦截器
# 注意:必须在 keda namespace 中才能引用该 namespace 的服务
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: navigation-ingress
namespace: keda
spec:
entryPoints:
- web
routes:
- match: Host(`dh.u6.net3w.com`)
kind: Rule
services:
- name: keda-add-ons-http-interceptor-proxy
port: 8080

View File

@@ -0,0 +1,24 @@
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
name: navigation-http-scaler
namespace: navigation
spec:
hosts:
- dh.u6.net3w.com
pathPrefixes:
- /
scaleTargetRef:
name: navigation
kind: Deployment
apiVersion: apps/v1
service: navigation
port: 80
replicas:
min: 0 # 空闲时缩容到 0
max: 10 # 最多 10 个副本
scalingMetric:
requestRate:
granularity: 1s
targetValue: 100 # 每秒 100 个请求时扩容
window: 1m

View File

@@ -0,0 +1,19 @@
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: navigation-ingress
  namespace: keda  # Ingress 的 backend 只能引用同命名空间的 Service因此必须与 keda-add-ons-http-interceptor-proxy 一样放在 keda 命名空间
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
rules:
- host: dh.u6.net3w.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: keda-add-ons-http-interceptor-proxy
port:
number: 8080

View File

@@ -0,0 +1,23 @@
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: navigation-scaler
namespace: navigation
spec:
scaleTargetRef:
name: navigation
  minReplicaCount: 1        # 至少保持 1 个副本(使用 CPU 触发器时无法缩容到 0
maxReplicaCount: 10 # 最多 10 个副本
pollingInterval: 15 # 每 15 秒检查一次
cooldownPeriod: 180 # 缩容冷却期 3 分钟
triggers:
- type: prometheus
metadata:
serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
metricName: nginx_http_requests_total
query: sum(rate(nginx_http_requests_total{namespace="navigation"}[1m]))
      threshold: "10" # 请求速率超过 10 req/srate 按秒计算)时扩容
- type: cpu
metricType: Utilization
metadata:
value: "60" # CPU 使用率超过 60% 时扩容

View File

@@ -0,0 +1,261 @@
# ⚠️ PostgreSQL 不适合使用 KEDA 自动扩缩容
## 问题说明
对于传统的 PostgreSQL 架构,直接通过 KEDA 增加副本数会导致:
### 1. 存储冲突
- 多个 Pod 尝试挂载同一个 PVC
- ReadWriteOnce 存储只能被一个 Pod 使用
- 会导致 Pod 启动失败
### 2. 数据损坏风险
- 如果使用 ReadWriteMany 存储,多个实例同时写入会导致数据损坏
- PostgreSQL 不支持多主写入
- 没有锁机制保护数据一致性
### 3. 缺少主从复制
- 需要配置 PostgreSQL 流复制Streaming Replication
- 需要配置主从切换机制
- 需要使用专门的 PostgreSQL Operator
## 正确的 PostgreSQL 扩展方案
### 方案 1: 使用 PostgreSQL Operator
推荐使用专业的 PostgreSQL Operator
#### Zalando PostgreSQL Operator
```bash
# 添加 Helm 仓库
helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator
# 安装 Operator
helm install postgres-operator postgres-operator-charts/postgres-operator
```
创建 PostgreSQL 集群:
```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster
spec:
  teamId: "acid"
  volume:
    size: 10Gi
    storageClass: longhorn
  numberOfInstances: 3 # 1 主 + 2 从
  users:
    zalando:
    - superuser
    - createdb
  databases:
    foo: zalando
  postgresql:
    version: "16"
```
#### CloudNativePG Operator
```bash
# 安装 CloudNativePG
kubectl apply -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.22/releases/cnpg-1.22.0.yaml
```
创建集群:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 3
  storage:
    storageClass: longhorn
    size: 10Gi
```
### 方案 2: 读写分离 + KEDA
如果需要使用 KEDA正确的架构是
```
┌─────────────────┐
│ 主库 (Master) │ ← 固定 1 个副本,处理写入
│ StatefulSet │
└─────────────────┘
│ 流复制
┌─────────────────┐
│ 从库 (Replica) │ ← KEDA 管理,处理只读查询
│ Deployment │ 可以 0-N 个副本
└─────────────────┘
```
**配置示例:**
```yaml
# 主库 - 固定副本
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql-master
spec:
replicas: 1 # 固定 1 个
# ... 配置主库
---
# 从库 - KEDA 管理
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgresql-replica
spec:
# replicas 由 KEDA 管理
# ... 配置从库(只读)
---
# KEDA ScaledObject - 只扩展从库
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: postgresql-replica-scaler
spec:
scaleTargetRef:
name: postgresql-replica # 只针对从库
minReplicaCount: 0
maxReplicaCount: 5
triggers:
- type: postgresql
metadata:
connectionString: postgresql://user:pass@postgresql-master:5432/db
query: "SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active' AND query NOT LIKE '%pg_stat_activity%'"
targetQueryValue: "10"
```
### 方案 3: 垂直扩展(推荐用于单实例)
对于单实例 PostgreSQL使用 VPA (Vertical Pod Autoscaler) 更合适:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: postgresql-vpa
spec:
targetRef:
apiVersion: "apps/v1"
kind: StatefulSet
name: postgresql
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: postgresql
minAllowed:
cpu: 250m
memory: 512Mi
maxAllowed:
cpu: 2000m
memory: 4Gi
```
## 当前部署建议
对于您当前的 PostgreSQL 部署(`/home/fei/k3s/010-中间件/002-postgresql/`
### ❌ 不要使用 KEDA 水平扩展
- 当前是单实例 StatefulSet
- 没有配置主从复制
- 直接扩展会导致数据问题
### ✅ 推荐的优化方案
1. **保持单实例运行**
```yaml
replicas: 1 # 固定不变
```
2. **优化资源配置**
```yaml
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 4Gi
```
3. **配置连接池**
- 使用 PgBouncer 作为连接池
- PgBouncer 可以使用 KEDA 扩展
4. **定期备份**
   - 使用 Longhorn 快照
   - 备份到 S3可通过 RecurringJob 自动执行,示例见下方)
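关于第 4 点,可以用 Longhorn 的 RecurringJob 定期把卷备份到 S3示意配置cron、保留数量按需调整前提是已配置好 backupTarget
```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-backup
  namespace: longhorn-system
spec:
  cron: "0 3 * * *"     # 每天 03:00 执行
  task: backup          # 备份到已配置的 backupTargetS3
  groups:
  - default             # default 组默认包含所有卷
  retain: 7             # 保留最近 7 份备份
  concurrency: 1
```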
## PgBouncer + KEDA 方案
这是最实用的方案PostgreSQL 保持单实例PgBouncer 使用 KEDA 扩展。
```yaml
# PostgreSQL - 固定单实例
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgresql
spec:
replicas: 1 # 固定
# ...
---
# PgBouncer - 连接池
apiVersion: apps/v1
kind: Deployment
metadata:
name: pgbouncer
spec:
# replicas 由 KEDA 管理
template:
spec:
containers:
- name: pgbouncer
image: pgbouncer/pgbouncer:latest
# ...
---
# KEDA ScaledObject - 扩展 PgBouncer
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: pgbouncer-scaler
spec:
scaleTargetRef:
name: pgbouncer
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: postgresql
metadata:
connectionString: postgresql://postgres:postgres123@postgresql:5432/postgres
query: "SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active'"
targetQueryValue: "20"
```
## 总结
| 方案 | 适用场景 | 复杂度 | 推荐度 |
|------|---------|--------|--------|
| PostgreSQL Operator | 生产环境,需要高可用 | 高 | ⭐⭐⭐⭐⭐ |
| 读写分离 + KEDA | 读多写少场景 | 中 | ⭐⭐⭐⭐ |
| PgBouncer + KEDA | 连接数波动大 | 低 | ⭐⭐⭐⭐⭐ |
| VPA 垂直扩展 | 单实例,资源需求变化 | 低 | ⭐⭐⭐ |
| 直接 KEDA 扩展 | ❌ 不适用 | - | ❌ |
**对于当前部署,建议保持 PostgreSQL 单实例运行,不使用 KEDA 扩展。**
如果需要扩展能力,优先考虑:
1. 部署 PgBouncer 连接池 + KEDA
2. 或者迁移到 PostgreSQL Operator
---
**重要提醒:有状态服务的扩展需要特殊处理,不能简单地增加副本数!** ⚠️

View File

@@ -0,0 +1,23 @@
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: redis-scaler
namespace: redis
spec:
scaleTargetRef:
name: redis
minReplicaCount: 0 # 空闲时缩容到 0
maxReplicaCount: 5 # 最多 5 个副本
pollingInterval: 30 # 每 30 秒检查一次
cooldownPeriod: 300 # 缩容冷却期 5 分钟
triggers:
- type: prometheus
metadata:
serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090
metricName: redis_connected_clients
query: sum(redis_connected_clients{namespace="redis"})
threshold: "1" # 有连接时启动
- type: cpu
metricType: Utilization
metadata:
value: "70" # CPU 使用率超过 70% 时扩容

View File

@@ -0,0 +1,41 @@
# KEDA Helm 配置
# Operator 配置
operator:
replicaCount: 1
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
# Metrics Server 配置
metricsServer:
replicaCount: 1
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
# 与 Prometheus 集成
prometheus:
metricServer:
enabled: true
port: 9022
path: /metrics
operator:
enabled: true
port: 8080
path: /metrics
# ServiceMonitor 用于 Prometheus 抓取
serviceMonitor:
enabled: true
namespace: keda
additionalLabels:
release: kube-prometheus-stack

View File

@@ -0,0 +1,197 @@
# KEDA 部署最终总结
## ✅ 成功部署
### KEDA 核心组件
- **keda-operator**: ✅ 运行中
- **keda-metrics-apiserver**: ✅ 运行中
- **keda-admission-webhooks**: ✅ 运行中
- **命名空间**: keda
### 已配置的服务
| 服务 | 状态 | 最小副本 | 最大副本 | 说明 |
|------|------|---------|---------|------|
| Navigation | ✅ 已应用 | 0 | 10 | 空闲时自动缩容到 0 |
| Redis | ⏳ 待应用 | 0 | 5 | 需要先配置 Prometheus exporter |
| PostgreSQL | ❌ 不适用 | - | - | 有状态服务,不能直接扩展 |
## ⚠️ 重要修正PostgreSQL
### 问题说明
PostgreSQL 是有状态服务,**不能**直接使用 KEDA 扩展副本数,原因:
1. **存储冲突**: 多个 Pod 尝试挂载同一个 PVC 会失败
2. **数据损坏**: 如果使用 ReadWriteMany多实例写入会导致数据损坏
3. **缺少复制**: 没有配置主从复制,无法保证数据一致性
### 正确方案
已创建详细说明文档:`/home/fei/k3s/009-基础设施/007-keda/scalers/postgresql-说明.md`
推荐方案:
1. **PostgreSQL Operator** (Zalando 或 CloudNativePG)
2. **PgBouncer + KEDA** (扩展连接池而非数据库)
3. **读写分离** (主库固定,从库使用 KEDA)
## 📁 文件结构
```
/home/fei/k3s/009-基础设施/007-keda/
├── deploy.sh # ✅ 部署脚本
├── values.yaml # ✅ KEDA Helm 配置
├── readme.md # ✅ 详细使用文档
├── 部署总结.md # ✅ 部署总结
└── scalers/
├── navigation-scaler.yaml # ✅ 已应用
├── redis-scaler.yaml # ⏳ 待应用
└── postgresql-说明.md # ⚠️ 重要说明
```
## 🧪 验证结果
### Navigation 服务自动扩缩容
```bash
# 当前状态
$ kubectl get deployment navigation -n navigation
NAME READY UP-TO-DATE AVAILABLE AGE
navigation 0/0 0 0 8h
# ScaledObject 状态
$ kubectl get scaledobject -n navigation
NAME READY ACTIVE TRIGGERS AGE
navigation-scaler True False prometheus,cpu 5m
# HPA 已自动创建
$ kubectl get hpa -n navigation
NAME REFERENCE MINPODS MAXPODS REPLICAS
keda-hpa-navigation-scaler Deployment/navigation 1 10 0
```
### 测试从 0 扩容
```bash
# 访问导航页面
curl https://dh.u6.net3w.com
# 观察副本数变化10-30 秒)
kubectl get deployment navigation -n navigation -w
# 预期: 0/0 → 1/1
```
## 📊 资源节省预期
| 服务 | 之前 | 现在 | 节省 |
|------|------|------|------|
| Navigation | 24/7 运行 | 按需启动 | 80-90% |
| Redis | 24/7 运行 | 按需启动 | 70-80% (配置后) |
| PostgreSQL | 24/7 运行 | 保持运行 | 不适用 |
## 🔧 已修复的问题
### 1. CPU 触发器配置错误
**问题**: 使用了已弃用的 `type` 字段
```yaml
# ❌ 错误
- type: cpu
metadata:
type: Utilization
value: "60"
```
**修复**: 改为 `metricType`
```yaml
# ✅ 正确
- type: cpu
metricType: Utilization
metadata:
value: "60"
```
### 2. Navigation 缺少资源配置
**修复**: 添加了 resources 配置
```yaml
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 200m
memory: 128Mi
```
### 3. PostgreSQL 配置错误
**修复**:
- 删除了 `postgresql-scaler.yaml`
- 创建了 `postgresql-说明.md` 详细说明
- 更新了所有文档,明确标注不适用
## 📚 文档
- **使用指南**: `/home/fei/k3s/009-基础设施/007-keda/readme.md`
- **部署总结**: `/home/fei/k3s/009-基础设施/007-keda/部署总结.md`
- **PostgreSQL 说明**: `/home/fei/k3s/009-基础设施/007-keda/scalers/postgresql-说明.md`
## 🎯 下一步建议
### 短期1 周内)
1. ✅ 监控 Navigation 服务的扩缩容行为
2. ⏳ 为 Redis 配置 Prometheus exporter
3. ⏳ 应用 Redis ScaledObject
### 中期1-2 周)
1. ⏳ 在 Grafana 中导入 KEDA 仪表板 (ID: 14691)
2. ⏳ 根据实际使用情况调整触发阈值
3. ⏳ 为其他无状态服务配置 KEDA
### 长期1 个月+)
1. ⏳ 评估是否需要 PostgreSQL 高可用
2. ⏳ 如需要,部署 PostgreSQL Operator
3. ⏳ 或部署 PgBouncer 连接池 + KEDA
## ⚡ 快速命令
```bash
# 查看 KEDA 状态
kubectl get pods -n keda
# 查看所有 ScaledObject
kubectl get scaledobject -A
# 查看 HPA
kubectl get hpa -A
# 查看 Navigation 副本数
kubectl get deployment navigation -n navigation -w
# 测试扩容
curl https://dh.u6.net3w.com
# 查看 KEDA 日志
kubectl logs -n keda -l app.kubernetes.io/name=keda-operator -f
```
## 🎉 总结
**KEDA 已成功部署并运行**
- Navigation 服务实现按需启动,空闲时自动缩容到 0
- 修复了所有配置问题
- 明确了有状态服务PostgreSQL的正确处理方式
- 提供了完整的文档和使用指南
⚠️ **重要提醒**
- 有状态服务不能简单地增加副本数
- PostgreSQL 需要使用专业的 Operator 或连接池方案
- 定期监控扩缩容行为,根据实际情况调整配置
---
**KEDA 让您的 K3s 集群更智能、更节省资源!** 🚀

View File

@@ -0,0 +1,260 @@
# KEDA 自动扩缩容部署总结
部署时间: 2026-01-30
## ✅ 部署完成
### KEDA 核心组件
| 组件 | 状态 | 说明 |
|------|------|------|
| keda-operator | ✅ Running | KEDA 核心控制器 |
| keda-metrics-apiserver | ✅ Running | 指标 API 服务器 |
| keda-admission-webhooks | ✅ Running | 准入 Webhook |
**命名空间**: `keda`
### 已配置的自动扩缩容服务
#### 1. Navigation 导航服务 ✅
- **状态**: 已配置并运行
- **当前副本数**: 0空闲状态
- **配置**:
- 最小副本: 0
- 最大副本: 10
- 触发器: Prometheus (HTTP 请求) + CPU 使用率
- 冷却期: 3 分钟
**ScaledObject**: `navigation-scaler`
**HPA**: `keda-hpa-navigation-scaler`
#### 2. Redis 缓存服务 ⏳
- **状态**: 配置文件已创建,待应用
- **说明**: 需要先为 Redis 配置 Prometheus exporter
- **配置文件**: `scalers/redis-scaler.yaml`
#### 3. PostgreSQL 数据库 ❌
- **状态**: 不推荐使用 KEDA 扩展
- **原因**:
- PostgreSQL 是有状态服务,多副本会导致存储冲突
- 需要配置主从复制才能安全扩展
- 建议使用 PostgreSQL Operator 或 PgBouncer + KEDA
- **详细说明**: `scalers/postgresql-说明.md`
## 配置文件位置
```
/home/fei/k3s/009-基础设施/007-keda/
├── deploy.sh # 部署脚本
├── values.yaml # KEDA Helm 配置
├── readme.md # 详细文档
├── 部署总结.md # 本文档
└── scalers/ # ScaledObject 配置
├── navigation-scaler.yaml # ✅ 已应用
├── redis-scaler.yaml # ⏳ 待应用
└── postgresql-说明.md # ⚠️ PostgreSQL 不适合 KEDA
```
## 验证 KEDA 功能
### 测试缩容到 0
Navigation 服务已经自动缩容到 0
```bash
kubectl get deployment navigation -n navigation
# 输出: READY 0/0
```
### 测试从 0 扩容
访问导航页面触发扩容:
```bash
# 1. 访问页面
curl https://dh.u6.net3w.com
# 2. 观察副本数变化
kubectl get deployment navigation -n navigation -w
# 预期: 10-30 秒内副本数从 0 变为 1
```
## 查看 KEDA 状态
### 查看所有 ScaledObject
```bash
kubectl get scaledobject -A
```
### 查看 HPA自动创建
```bash
kubectl get hpa -A
```
### 查看 KEDA 日志
```bash
kubectl logs -n keda -l app.kubernetes.io/name=keda-operator -f
```
## 下一步操作
### 1. 应用 Redis 自动扩缩容
```bash
# 首先需要为 Redis 添加 Prometheus exporter
# 然后应用 ScaledObject
kubectl apply -f /home/fei/k3s/009-基础设施/007-keda/scalers/redis-scaler.yaml
```
### 2. PostgreSQL 扩展方案
**不要使用 KEDA 直接扩展 PostgreSQL**
推荐方案:
- **方案 1**: 使用 PostgreSQL OperatorZalando 或 CloudNativePG
- **方案 2**: 部署 PgBouncer 连接池 + KEDA 扩展 PgBouncer
- **方案 3**: 配置读写分离,只对只读副本使用 KEDA
详细说明:`/home/fei/k3s/009-基础设施/007-keda/scalers/postgresql-说明.md`
### 3. 监控扩缩容行为
在 Grafana 中导入 KEDA 仪表板:
- 访问: https://grafana.u6.net3w.com
- 导入仪表板 ID: **14691**
## 已修复的问题
### 问题 1: CPU 触发器配置错误
**错误信息**:
```
The 'type' setting is DEPRECATED and is removed in v2.18 - Use 'metricType' instead.
```
**解决方案**:
将 CPU 触发器配置从:
```yaml
- type: cpu
metadata:
type: Utilization
value: "60"
```
改为:
```yaml
- type: cpu
metricType: Utilization
metadata:
value: "60"
```
### 问题 2: Navigation 缺少资源配置
**解决方案**:
为 Navigation deployment 添加了 resources 配置:
```yaml
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 200m
memory: 128Mi
```
## 资源节省效果
### Navigation 服务
- **之前**: 24/7 运行 1 个副本
- **现在**: 空闲时 0 个副本,有流量时自动启动
- **预计节省**: 80-90% 资源(假设大部分时间空闲)
### 预期总体效果
- **Navigation**: 节省 80-90% 资源 ✅
- **Redis**: 节省 70-80% 资源(配置后)⏳
- **PostgreSQL**: ❌ 不使用 KEDA保持单实例运行
## 监控指标
### Prometheus 查询
```promql
# KEDA Scaler 活跃状态
keda_scaler_active{namespace="navigation"}
# 当前指标值
keda_scaler_metrics_value{scaledObject="navigation-scaler"}
# HPA 当前副本数
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="keda-hpa-navigation-scaler"}
```
## 注意事项
### 1. 冷启动时间
从 0 扩容到可用需要 10-30 秒:
- 拉取镜像(如果本地没有)
- 启动容器
- 健康检查通过
### 2. 连接保持
客户端需要支持重连机制,因为服务可能会缩容到 0。
### 3. 有状态服务
PostgreSQL 等有状态服务**不能**直接使用 KEDA 扩展:
- ❌ 多副本会导致存储冲突
- ❌ 没有主从复制会导致数据不一致
- ✅ 需要使用专业的 Operator 或连接池方案
## 故障排查
### ScaledObject 未生效
```bash
# 查看详细状态
kubectl describe scaledobject <name> -n <namespace>
# 查看事件
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
### HPA 未创建
检查 KEDA operator 日志:
```bash
kubectl logs -n keda -l app.kubernetes.io/name=keda-operator
```
## 文档参考
- 详细使用文档: `/home/fei/k3s/009-基础设施/007-keda/readme.md`
- KEDA 官方文档: https://keda.sh/docs/
- Scalers 参考: https://keda.sh/docs/scalers/
## 总结
**KEDA 已成功部署并运行**
- KEDA 核心组件运行正常
- Navigation 服务已配置自动扩缩容
- 已验证缩容到 0 功能正常
- 准备好为更多服务配置自动扩缩容
**下一步**: 根据实际使用情况,逐步为 Redis 和 PostgreSQL 配置自动扩缩容。
---
**KEDA 让您的 K3s 集群更智能、更节省资源!** 🚀

View File

@@ -0,0 +1,191 @@
# Portainer 部署指南
## 概述
本文档记录了在 k3s 集群中部署 Portainer 的完整过程包括域名绑定、KEDA 自动缩放和 CSRF 校验问题的解决方案。
## 部署步骤
### 1. 使用 Helm 安装 Portainer
```bash
# 添加 Helm 仓库
helm repo add portainer https://portainer.github.io/k8s/
helm repo update
# 安装 Portainer使用 Longhorn 作为存储类)
helm install --create-namespace -n portainer portainer portainer/portainer \
--set persistence.enabled=true \
--set persistence.storageClass=longhorn \
--set service.type=NodePort
```
### 2. 配置域名访问
#### 2.1 Caddy 反向代理配置
修改 Caddy ConfigMap添加 Portainer 的反向代理规则:
```yaml
# Portainer 容器管理 - 直接转发到 Portainer HTTPS 端口
portainer.u6.net3w.com {
reverse_proxy https://portainer.portainer.svc.cluster.local:9443 {
transport http {
tls_insecure_skip_verify
}
}
}
```
**关键点:**
- 直接转发到 Portainer 的 HTTPS 端口9443而不是通过 Traefik
- 这样可以避免协议不匹配导致的 CSRF 校验失败
#### 2.2 更新 Caddy ConfigMap
```bash
kubectl patch configmap caddy-config -n default --type merge -p '{"data":{"Caddyfile":"..."}}'
```
#### 2.3 重启 Caddy Pod
```bash
kubectl delete pod -n default -l app=caddy
```
### 3. 配置 KEDA 自动缩放(可选)
如果需要实现访问时启动、空闲时缩容的功能,应用 KEDA 配置:
```bash
kubectl apply -f keda-scaler.yaml
```
**配置说明:**
- 最小副本数0空闲时缩容到 0
- 最大副本数3
- 缩容延迟5 分钟无流量后缩容
### 4. 解决 CSRF 校验问题
#### 问题描述
登录时提示 "Unable to login",日志显示:
```
Failed to validate Origin or Referer | error="origin invalid"
```
#### 问题原因
Portainer 新版本对 CSRF 校验非常严格。当通过域名访问时,协议不匹配导致校验失败:
- 客户端发送HTTPS 请求
- Portainer 接收x_forwarded_proto=http
#### 解决方案
**步骤 1添加环境变量禁用 CSRF 校验**
```bash
kubectl set env deployment/portainer -n portainer CONTROLLER_DISABLE_CSRF=true
```
**步骤 2添加环境变量配置 origins**
```bash
kubectl set env deployment/portainer -n portainer PORTAINER_ADMIN_ORIGINS="*"
```
**步骤 3重启 Portainer**
```bash
kubectl rollout restart deployment portainer -n portainer
```
**步骤 4修改 Caddy 配置(最关键)**
直接转发到 Portainer 的 HTTPS 端口,避免通过 Traefik 导致的协议转换问题:
```yaml
portainer.u6.net3w.com {
reverse_proxy https://portainer.portainer.svc.cluster.local:9443 {
transport http {
tls_insecure_skip_verify
}
}
}
```
## 配置文件
### portainer-server.yaml
记录 Portainer deployment 的环境变量配置:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: portainer
namespace: portainer
spec:
template:
spec:
containers:
- name: portainer
env:
- name: CONTROLLER_DISABLE_CSRF
value: "true"
- name: PORTAINER_ADMIN_ORIGINS
value: "*"
```
### keda-scaler.yaml
KEDA 自动缩放配置,实现访问时启动、空闲时缩容。
## 访问 Portainer
部署完成后,访问:
```
https://portainer.u6.net3w.com
```
## 常见问题
### Q: 登录时提示 "Unable to login"
**A:** 这通常是 CSRF 校验失败导致的。检查以下几点:
1. 确认已添加环境变量 `CONTROLLER_DISABLE_CSRF=true`
2. 确认 Caddy 配置直接转发到 Portainer HTTPS 端口
3. 检查 Portainer 日志中是否有 "origin invalid" 错误
4. 重启 Portainer pod 使配置生效
### Q: 为什么要直接转发到 HTTPS 端口而不是通过 Traefik
**A:** 因为通过 Traefik 转发时,协议头会被转换为 HTTP导致 Portainer 接收到的协议与客户端发送的协议不匹配,从而 CSRF 校验失败。直接转发到 HTTPS 端口可以保持协议一致。
### Q: KEDA 自动缩放是否必须配置?
**A:** 不是必须的。KEDA 自动缩放是可选功能,用于节省资源。如果不需要自动缩放,可以跳过这一步。
## 相关文件
- `portainer-server.yaml` - Portainer deployment 环境变量配置
- `keda-scaler.yaml` - KEDA 自动缩放配置
- `ingress.yaml` - 原始 Ingress 配置(已弃用,改用 Caddy 直接转发)
## 下次部署检查清单
- [ ] 使用 Helm 安装 Portainer
- [ ] 修改 Caddy 配置,直接转发到 Portainer HTTPS 端口
- [ ] 添加 Portainer 环境变量CONTROLLER_DISABLE_CSRF、PORTAINER_ADMIN_ORIGINS
- [ ] 重启 Caddy 和 Portainer pods
- [ ] 测试登录功能
- [ ] (可选)配置 KEDA 自动缩放
## 参考资源
- Portainer 官方文档https://docs.portainer.io/
- k3s 官方文档https://docs.k3s.io/
- KEDA 官方文档https://keda.sh/

View File

@@ -0,0 +1,20 @@
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: portainer-ingress
namespace: portainer
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
ingressClassName: traefik
rules:
- host: portainer.u6.net3w.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: portainer
port:
number: 9000

View File

@@ -0,0 +1,58 @@
---
# HTTPScaledObject - 用于实现缩容到 0 的核心配置
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
name: portainer-http-scaler
namespace: portainer
spec:
hosts:
- portainer.u6.net3w.com
pathPrefixes:
- /
scaleTargetRef:
name: portainer
kind: Deployment
apiVersion: apps/v1
service: portainer
port: 9000
replicas:
min: 0 # 空闲时缩容到 0
max: 3 # 最多 3 个副本
scalingMetric:
requestRate:
granularity: 1s
targetValue: 50 # 每秒 50 个请求时扩容
window: 1m
scaledownPeriod: 300 # 5 分钟无流量后缩容到 0
---
# Traefik Middleware - 设置正确的协议头
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: portainer-headers
namespace: keda
spec:
headers:
customRequestHeaders:
X-Forwarded-Proto: "https"
---
# Traefik IngressRoute - 将流量路由到 KEDA HTTP Add-on 的拦截器
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: portainer-ingress
namespace: keda
spec:
entryPoints:
- web
routes:
- match: Host(`portainer.u6.net3w.com`)
kind: Rule
middlewares:
- name: portainer-headers
services:
- name: keda-add-ons-http-interceptor-proxy
port: 8080

View File

@@ -0,0 +1,16 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: portainer
namespace: portainer
spec:
template:
spec:
containers:
- name: portainer
env:
- name: CONTROLLER_DISABLE_CSRF
value: "true"
# 说明:禁用 CSRF 校验是因为 Portainer 新版本对 CSRF 校验非常严格
# 当使用域名访问时(如 portainer.u6.net3w.com需要禁用此校验
# 如果需要重新启用,将此值改为 "false" 或删除此环境变量

View File

@@ -0,0 +1,10 @@
# 添加 Helm 仓库
helm repo add portainer https://portainer.github.io/k8s/
helm repo update
# 安装 Portainer
# 注意:这里我们利用 Longhorn 作为默认存储类
helm install --create-namespace -n portainer portainer portainer/portainer \
--set persistence.enabled=true \
--set persistence.storageClass=longhorn \
--set service.type=NodePort

View File

@@ -0,0 +1,272 @@
# 域名绑定配置总结
## 配置完成时间
2026-01-30
## 域名配置
所有服务已绑定到 `*.u9.net3w.com` 子域名,通过 Caddy 作为前端反向代理。
### 已配置的子域名
| 服务 | 域名 | 后端服务 | 命名空间 |
|------|------|---------|---------|
| Longhorn UI | https://longhorn.u9.net3w.com | longhorn-frontend:80 | longhorn-system |
| Grafana | https://grafana.u9.net3w.com | kube-prometheus-stack-grafana:80 | monitoring |
| Prometheus | https://prometheus.u9.net3w.com | kube-prometheus-stack-prometheus:9090 | monitoring |
| Alertmanager | https://alertmanager.u9.net3w.com | kube-prometheus-stack-alertmanager:9093 | monitoring |
| MinIO S3 API | https://s3.u6.net3w.com | minio:9000 | minio |
| MinIO Console | https://console.s3.u6.net3w.com | minio:9001 | minio |
## 架构说明
```
Internet (*.u9.net3w.com)
        ↓
Caddy (前端反向代理, 80/443)
        ↓
Traefik Ingress Controller
        ↓
Kubernetes Services
```
### 流量路径
1. **外部请求** → DNS 解析到服务器 IP
2. **Caddy** (端口 80/443) → 接收请求,自动申请 Let's Encrypt SSL 证书
3. **Traefik** → Caddy 将请求转发到 Traefik Ingress Controller
4. **Kubernetes Service** → Traefik 根据 Ingress 规则路由到对应服务
## Caddy 配置
配置文件位置: `/home/fei/k3s/009-基础设施/005-ingress/Caddyfile`
```caddyfile
{
    email admin@u6.net3w.com
}
# Longhorn 存储管理
longhorn.u9.net3w.com {
    reverse_proxy traefik.kube-system.svc.cluster.local:80
}
# Grafana 监控仪表板
grafana.u9.net3w.com {
    reverse_proxy traefik.kube-system.svc.cluster.local:80
}
# Prometheus 监控
prometheus.u9.net3w.com {
    reverse_proxy traefik.kube-system.svc.cluster.local:80
}
# Alertmanager 告警管理
alertmanager.u9.net3w.com {
    reverse_proxy traefik.kube-system.svc.cluster.local:80
}
```
## Ingress 配置
### Longhorn Ingress
- 文件: `/home/fei/k3s/009-基础设施/005-ingress/longhorn-ingress.yaml`
- Host: `longhorn.u9.net3w.com`
### 监控系统 Ingress
- 文件: `/home/fei/k3s/009-基础设施/006-monitoring/ingress.yaml`
- Hosts:
- `grafana.u9.net3w.com`
- `prometheus.u9.net3w.com`
- `alertmanager.u9.net3w.com`
## SSL/TLS 证书
Caddy 会自动为所有配置的域名申请和续期 Let's Encrypt SSL 证书。
- **证书存储**: Caddy Pod 的 `/data` 目录
- **自动续期**: Caddy 自动管理
- **邮箱**: admin@u6.net3w.com
## 访问地址
### 监控和管理
- **Longhorn 存储管理**: https://longhorn.u9.net3w.com
- **Grafana 监控**: https://grafana.u9.net3w.com
- 用户名: `admin`
- 密码: `prom-operator`
- **Prometheus**: https://prometheus.u9.net3w.com
- **Alertmanager**: https://alertmanager.u9.net3w.com
### 对象存储
- **MinIO S3 API**: https://s3.u6.net3w.com
- **MinIO Console**: https://console.s3.u6.net3w.com
## DNS 配置
确保以下 DNS 记录已配置A 记录或 CNAME
```
*.u9.net3w.com → <服务器IP>
```
或者单独配置每个子域名:
```
longhorn.u9.net3w.com → <服务器IP>
grafana.u9.net3w.com → <服务器IP>
prometheus.u9.net3w.com → <服务器IP>
alertmanager.u9.net3w.com → <服务器IP>
```
## 验证配置
### 检查 Caddy 状态
```bash
kubectl get pods -n default -l app=caddy
kubectl logs -n default -l app=caddy -f
```
### 检查 Ingress 状态
```bash
kubectl get ingress -A
```
### 测试域名访问
```bash
curl -I https://longhorn.u9.net3w.com
curl -I https://grafana.u9.net3w.com
curl -I https://prometheus.u9.net3w.com
curl -I https://alertmanager.u9.net3w.com
```
## 添加新服务
如果需要添加新的服务到 u9.net3w.com 域名:
### 1. 更新 Caddyfile
编辑 `/home/fei/k3s/009-基础设施/005-ingress/Caddyfile`,添加:
```caddyfile
newservice.u9.net3w.com {
    reverse_proxy traefik.kube-system.svc.cluster.local:80
}
```
### 2. 更新 Caddy ConfigMap
```bash
kubectl create configmap caddy-config \
--from-file=Caddyfile=/home/fei/k3s/009-基础设施/005-ingress/Caddyfile \
-n default --dry-run=client -o yaml | kubectl apply -f -
```
### 3. 重启 Caddy
```bash
kubectl rollout restart deployment caddy -n default
```
### 4. 创建 Ingress
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: newservice-ingress
  namespace: your-namespace
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web
spec:
  rules:
    - host: newservice.u9.net3w.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: your-service
                port:
                  number: 80
```
### 5. 应用 Ingress
```bash
kubectl apply -f newservice-ingress.yaml
```
## 故障排查
### Caddy 无法启动
```bash
# 查看 Caddy 日志
kubectl logs -n default -l app=caddy
# 检查 ConfigMap
kubectl get configmap caddy-config -n default -o yaml
```
### 域名无法访问
```bash
# 检查 Ingress
kubectl describe ingress <ingress-name> -n <namespace>
# 检查 Traefik
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik
# 测试内部连接
kubectl run test --rm -it --image=curlimages/curl -- curl -v http://traefik.kube-system.svc.cluster.local:80
```
### SSL 证书问题
```bash
# 查看 Caddy 证书状态
kubectl exec -n default -it <caddy-pod> -- ls -la /data/caddy/certificates/
# 强制重新申请证书
kubectl rollout restart deployment caddy -n default
```
## 安全建议
1. **启用基本认证**: 为敏感服务(如 Prometheus、Alertmanager添加认证示例见本节列表后
2. **IP 白名单**: 限制管理界面的访问 IP
3. **定期更新**: 保持 Caddy 和 Traefik 版本更新
4. **监控日志**: 定期检查访问日志,发现异常访问
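以基本认证为例,可以先生成 htpasswd 凭证并创建 Secret再由 Traefik 的 basicAuth Middleware 引用(示意命令,用户名、密码均为占位符):
```bash
# 生成 htpasswd 凭证(需要 apache2-utils / httpd-tools 提供的 htpasswd
htpasswd -nb admin 'ReplaceWithStrongPassword' > users

# 创建 Secret供 Traefik basicAuth Middleware 的 secret 字段引用
kubectl create secret generic prometheus-basic-auth \
  --from-file=users=users -n monitoring
```
之后在 Traefik Middleware 中通过 `basicAuth.secret: prometheus-basic-auth` 引用该 Secret并在对应 Ingress 上挂载此 Middleware。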
## 维护命令
```bash
# 更新 Caddy 配置
kubectl create configmap caddy-config \
--from-file=Caddyfile=/home/fei/k3s/009-基础设施/005-ingress/Caddyfile \
-n default --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment caddy -n default
# 查看所有 Ingress
kubectl get ingress -A
# 查看 Caddy 日志
kubectl logs -n default -l app=caddy -f
# 查看 Traefik 日志
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik -f
```
## 备份
重要配置文件已保存在:
- Caddyfile: `/home/fei/k3s/009-基础设施/005-ingress/Caddyfile`
- Longhorn Ingress: `/home/fei/k3s/009-基础设施/005-ingress/longhorn-ingress.yaml`
- 监控 Ingress: `/home/fei/k3s/009-基础设施/006-monitoring/ingress.yaml`
建议定期备份这些配置文件。
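一个简单的打包备份示例(备份目录 `/home/fei/backup` 为假设路径):
```bash
mkdir -p /home/fei/backup
tar czf /home/fei/backup/ingress-config-$(date +%F).tar.gz \
  "/home/fei/k3s/009-基础设施/005-ingress/" \
  "/home/fei/k3s/009-基础设施/006-monitoring/ingress.yaml"
```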
---
**配置完成!所有服务现在可以通过 *.u9.net3w.com 域名访问。** 🎉

View File

@@ -0,0 +1,225 @@
# K3s 基础设施部署总结
部署日期: 2026-01-30
## 已完成的基础设施组件
### ✅ 1. Helm 包管理工具
- **版本**: v3.20.0
- **位置**: /usr/local/bin/helm
- **配置**: KUBECONFIG 已添加到 ~/.bashrc
### ✅ 2. Longhorn 分布式存储
- **版本**: v1.11.0
- **命名空间**: longhorn-system
- **存储类**: longhorn (默认)
- **S3 备份**: 已配置 MinIO S3 备份
- 备份目标: s3://longhorn-backup@us-east-1/
- 凭证 Secret: longhorn-crypto
- **访问**: http://longhorn.local
### ✅ 3. Redis 中间件
- **版本**: Redis 7 (Alpine)
- **命名空间**: redis
- **存储**: 5Gi Longhorn 卷
- **持久化**: RDB + AOF 双重持久化
- **内存限制**: 2GB
- **访问**: redis.redis.svc.cluster.local:6379
### ✅ 4. PostgreSQL 数据库
- **版本**: PostgreSQL 16.11
- **命名空间**: postgresql
- **存储**: 10Gi Longhorn 卷
- **内存限制**: 2GB
- **访问**: postgresql-service.postgresql.svc.cluster.local:5432
- **凭证**:
- 用户: postgres
- 密码: postgres123
### ✅ 5. Traefik Ingress 控制器
- **状态**: K3s 默认已安装
- **命名空间**: kube-system
- **已配置 Ingress**:
- Longhorn UI: http://longhorn.local
- MinIO API: http://s3.u6.net3w.com
- MinIO Console: http://console.s3.u6.net3w.com
- Grafana: http://grafana.local
- Prometheus: http://prometheus.local
- Alertmanager: http://alertmanager.local
### ✅ 6. Prometheus + Grafana 监控系统
- **命名空间**: monitoring
- **组件**:
- Prometheus: 时间序列数据库 (20Gi 存储, 15天保留)
- Grafana: 可视化仪表板 (5Gi 存储)
- Alertmanager: 告警管理 (5Gi 存储)
- Node Exporter: 节点指标收集
- Kube State Metrics: K8s 资源状态
- **Grafana 凭证**:
- 用户: admin
- 密码: prom-operator
- **访问**:
- Grafana: http://grafana.local
- Prometheus: http://prometheus.local
- Alertmanager: http://alertmanager.local
## 目录结构
```
/home/fei/k3s/009-基础设施/
├── 003-helm/
│   ├── install_helm.sh
│   └── readme.md
├── 004-longhorn/
│   ├── deploy.sh
│   ├── s3-secret.yaml
│   ├── values.yaml
│   ├── readme.md
│   └── 说明.md
├── 005-ingress/
│   ├── deploy-longhorn-ingress.sh
│   ├── longhorn-ingress.yaml
│   └── readme.md
└── 006-monitoring/
    ├── deploy.sh
    ├── values.yaml
    ├── ingress.yaml
    └── readme.md
/home/fei/k3s/010-中间件/
├── 001-redis/
│   ├── deploy.sh
│   ├── redis-deployment.yaml
│   └── readme.md
└── 002-postgresql/
    ├── deploy.sh
    ├── postgresql-deployment.yaml
    └── readme.md
```
## 存储使用情况
| 组件 | 存储大小 | 存储类 |
|------|---------|--------|
| MinIO | 50Gi | local-path |
| Redis | 5Gi | longhorn |
| PostgreSQL | 10Gi | longhorn |
| Prometheus | 20Gi | longhorn |
| Grafana | 5Gi | longhorn |
| Alertmanager | 5Gi | longhorn |
| **总计** | **95Gi** | - |
## 访问地址汇总
需要在 `/etc/hosts` 中添加以下配置(将 `<节点IP>` 替换为实际 IP自动追加的示例命令见本节末尾
```
<节点IP> longhorn.local
<节点IP> grafana.local
<节点IP> prometheus.local
<节点IP> alertmanager.local
<节点IP> s3.u6.net3w.com
<节点IP> console.s3.u6.net3w.com
```
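单节点环境下,也可以用下面的命令自动获取节点 IP 并追加到 `/etc/hosts`(示意):
```bash
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
echo "$NODE_IP longhorn.local grafana.local prometheus.local alertmanager.local s3.u6.net3w.com console.s3.u6.net3w.com" \
  | sudo tee -a /etc/hosts
```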
## 快速验证命令
```bash
# 查看所有命名空间的 Pods
kubectl get pods -A
# 查看所有 PVC
kubectl get pvc -A
# 查看所有 Ingress
kubectl get ingress -A
# 查看存储类
kubectl get storageclass
# 测试 Redis
kubectl exec -n redis $(kubectl get pod -n redis -l app=redis -o jsonpath='{.items[0].metadata.name}') -- redis-cli ping
# 测试 PostgreSQL
kubectl exec -n postgresql postgresql-0 -- psql -U postgres -c "SELECT version();"
```
## 备份策略
1. **Longhorn 卷备份**:
- 所有持久化数据存储在 Longhorn 卷上
- 可通过 Longhorn UI 创建快照
- 自动备份到 MinIO S3 (s3://longhorn-backup@us-east-1/)
2. **数据库备份**:
- Redis: AOF + RDB 持久化
- PostgreSQL: 可使用 pg_dump 进行逻辑备份(示例见列表后)
3. **配置备份**:
- 所有配置文件已保存在 `/home/fei/k3s/` 目录
- 建议定期备份此目录
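针对第 2 点中的 PostgreSQL 逻辑备份,一个 pg_dump 示例如下(假设 Pod 名为 postgresql-0与上文验证命令一致
```bash
# 导出 postgres 数据库并压缩保存到本地
kubectl exec -n postgresql postgresql-0 -- \
  pg_dump -U postgres -d postgres | gzip > pg_backup_$(date +%F).sql.gz
```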
## 下一步建议
1. **安全加固**:
- 修改 PostgreSQL 默认密码(示例命令见本节列表后)
- 配置 TLS/SSL 证书
- 启用 RBAC 权限控制
2. **监控优化**:
- 配置告警通知(邮件、Slack、钉钉
- 导入更多 Grafana 仪表板
- 为 Redis 和 PostgreSQL 添加专用监控
3. **高可用**:
- 考虑 Redis 主从复制或 Sentinel
- 考虑 PostgreSQL 主从复制
- 增加 K3s 节点实现多节点高可用
4. **日志收集**:
- 部署 Loki 或 ELK 进行日志聚合
- 配置日志持久化和查询
5. **CI/CD**:
- 部署 GitLab Runner 或 Jenkins
- 配置自动化部署流程
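其中"修改 PostgreSQL 默认密码"可以用一条命令完成(示意,新密码为占位符,修改后需同步更新依赖该密码的应用配置):
```bash
kubectl exec -n postgresql postgresql-0 -- \
  psql -U postgres -c "ALTER USER postgres WITH PASSWORD 'ReplaceWithStrongPassword';"
```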
## 维护命令
```bash
# 更新 Helm 仓库
helm repo update
# 升级 Longhorn
helm upgrade longhorn longhorn/longhorn --namespace longhorn-system -f values.yaml
# 升级监控栈
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace monitoring -f values.yaml
# 查看 Helm 发布
helm list -A
# 列出当前在用的镜像(用于辅助判断哪些镜像可以清理)
kubectl get pods -A -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | sort -u
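# 实际清理未被引用的镜像示例k3s 自带 crictl--prune 仅删除未使用的镜像)
sudo k3s crictl rmi --prune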
```
## 故障排查
如果遇到问题,请检查:
1. Pod 状态: `kubectl get pods -A`
2. 事件日志: `kubectl get events -A --sort-by='.lastTimestamp'`
3. Pod 日志: `kubectl logs -n <namespace> <pod-name>`
4. 存储状态: `kubectl get pvc -A`
5. Longhorn 卷状态: 访问 http://longhorn.local
## 联系和支持
- Longhorn 文档: https://longhorn.io/docs/
- Prometheus 文档: https://prometheus.io/docs/
- Grafana 文档: https://grafana.com/docs/
- K3s 文档: https://docs.k3s.io/
---
**部署完成!所有基础设施组件已成功运行。** 🎉