Production etcd High-Availability Cluster

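etcd is a distributed key-value database that Kubernetes (k8s) uses to store cluster configuration, runtime data and other state. Although kubeadm, the official k8s deployment tool, can bring up an etcd cluster quickly, it has a downside: etcd management becomes coupled to the k8s cluster. In production we prefer the database to be independent of other systems. Deploying a standalone external etcd cluster first and the k8s cluster second makes it easier to build a highly available cluster later and simplifies operations, so it is the more robust production approach.

An external etcd cluster can be deployed very flexibly. It can consist of a single node or several nodes, and it can run either on k8s cluster nodes or on separate machines outside the k8s cluster. Besides manual deployment from the binaries, there is an automation tool, etcdadm, but it is still under development and is not recommended for production.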

Node name    Internal IP
etcd1        172.22.0.12
etcd2        172.22.0.14
etcd3        172.22.0.4
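Why three members: etcd replicates writes with the Raft consensus algorithm, so a write is committed only after a majority (quorum) of members has accepted it. The short sketch below just does that arithmetic, quorum = floor(n/2) + 1 and tolerated failures = n - quorum, for a few cluster sizes; the function names are illustrative.

```shell
#!/usr/bin/env bash
# Raft quorum arithmetic for an etcd cluster of n members.
# Bash integer division truncates, so $(( n / 2 + 1 )) is floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }
# Members that can fail while a quorum is still available.
tolerance() { echo $(( $1 - $(quorum "$1") )); }

for n in 1 2 3 4 5; do
  printf 'members=%d quorum=%d tolerated_failures=%d\n' "$n" "$(quorum "$n")" "$(tolerance "$n")"
done
```

For n = 3 this prints quorum=2 and tolerated_failures=1, so the three-node cluster keeps working with one member down. A fourth member raises the quorum to 3 but still tolerates only one failure, which is why odd member counts are the usual choice.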

1. On every server, add the IP address and hostname of all three servers to the hosts file

We only use the internal network, so the external IPs are not added.

cat >>  /etc/hosts << EOF
172.22.0.12  etcd1
172.22.0.14  etcd2
172.22.0.4   etcd3  
EOF
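Appending with >> adds the lines again every time the step is repeated. A hedged alternative is a small helper that only appends a pair that is missing; add_host and the HOSTS_FILE override are illustrative names, not part of any tool.

```shell
#!/usr/bin/env bash
# Append "IP  hostname" to a hosts file only if that IP/hostname pair is not there yet,
# so running the step twice does not leave duplicate lines.
# HOSTS_FILE is an illustrative override (defaults to /etc/hosts) for trying it on a scratch file.
add_host() {
  local ip="$1" name="$2" file="${HOSTS_FILE:-/etc/hosts}"
  # awk exits 0 only if some line has this IP in field 1 and the hostname in a later field
  if ! awk -v ip="$ip" -v name="$name" '
        $1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }' "$file" 2>/dev/null; then
    printf '%s  %s\n' "$ip" "$name" >> "$file"
  fi
}
```

On each server you would then run add_host 172.22.0.12 etcd1, add_host 172.22.0.14 etcd2 and add_host 172.22.0.4 etcd3.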

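2. Create the storage directories (all nodes)

# -p also creates /data if it does not exist yet
mkdir -p /data/etcd/data /data/etcd/bin /data/etcd/ssl
# etcd recommends that only its owner can access the data directory (mode 700)
chmod 700 /data/etcd/data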

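3. Create the TLS certificates (primary node)

etcd uses certificates for authentication and encryption. They can be created with cfssl or openssl; this article uses cfssl. Install the cfssl toolkit on the primary node etcd1, create the certificates there, and then copy them to the other nodes.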

wget https://github.com/cloudflare/cfssl/releases/download/v1.6.2/cfssl_1.6.2_linux_amd64 -O /data/etcd/bin/cfssl
wget https://github.com/cloudflare/cfssl/releases/download/v1.6.2/cfssljson_1.6.2_linux_amd64 -O /data/etcd/bin/cfssljson
wget https://github.com/cloudflare/cfssl/releases/download/v1.6.2/cfssl-certinfo_1.6.2_linux_amd64 -O /data/etcd/bin/cfssl-certinfo


chmod +x /data/etcd/bin/cfssl*

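4. Create the CA root certificate (primary node)

    1. Generate the CSR file

    2. Create the certificate

CSR stands for Certificate Signing Request. It works like an application form that carries the applicant's basic information.

4.1 Fill in the CA's CSR information in JSON format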

/data/etcd/bin/cfssl print-defaults csr > /data/etcd/ssl/ca-csr.json

vim /data/etcd/ssl/ca-csr.json
{
    "CN": "etcd",
    "key": {
        "algo": "rsa",
        "size": 2048
    },
    "names": [
        {
            "C": "CN",
            "ST": "Guangdong",
            "L": "GuangZhou",
            "O": "etcd"
        }
    ]
}

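4.2 Create the CA root certificate with the prefix ca, saved in /data/etcd/ssl/

ca.pem is the CA certificate (it contains the public key) and ca-key.pem is the private key. With this root certificate, the server acts as a certificate authority and can issue certificates for etcd.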

/data/etcd/bin/cfssl gencert -initca /data/etcd/ssl/ca-csr.json | /data/etcd/bin/cfssljson -bare /data/etcd/ssl/ca


ls /data/etcd/ssl
ca.csr  ca-csr.json  ca-key.pem  ca.pem

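4.3 Configure the signing policy

profiles sets certificate parameters per role. Only one role, etcd, is defined here; add more roles if you need them. The key parameters are the validity period (expiry) and the usages (signing, key encipherment, server auth, client auth).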

/data/etcd/bin/cfssl print-defaults config > /data/etcd/ssl/ca-config.json

vim /data/etcd/ssl/ca-config.json
{
    "signing": {
        "default": {
            "expiry": "87600h"
        },
        "profiles": {
            "etcd": {
                "expiry": "87600h",
                "usages": [
                    "signing",
                    "key encipherment",
                    "server auth",
                    "client auth"
                ]
            }
        }
    }
}
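The expiry of 87600h is ten years expressed in hours. The arithmetic as a tiny sketch (hours_to_days is an illustrative helper):

```shell
#!/usr/bin/env bash
# Convert a cfssl expiry given in hours to whole days.
hours_to_days() { echo $(( $1 / 24 )); }

days=$(hours_to_days 87600)
# Prints: 87600h = 3650 days, about 10 years
echo "87600h = ${days} days, about $(( days / 365 )) years"
```

Once a certificate exists, openssl x509 -in /data/etcd/ssl/etcd.pem -noout -enddate shows the actual expiry date written into it.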

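4.4 Fill in the CSR for the etcd certificate

This CSR has an extra hosts field, which must list the internal and external IP addresses of every etcd node (only internal IPs are used here). If the etcd configuration listens on the loopback address, as ETCD_LISTEN_CLIENT_URLS does in step 5, add 127.0.0.1 as well. If you add etcd nodes later, update the etcd CSR first and regenerate the certificate.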

/data/etcd/bin/cfssl print-defaults csr > /data/etcd/ssl/etcd-csr.json

vim /data/etcd/ssl/etcd-csr.json
{
    "CN": "etcd",
    "hosts": [
        "127.0.0.1",
        "172.22.0.12",
        "172.22.0.14",
        "172.22.0.4"
    ],
    "key": {
        "algo": "rsa",
        "size": 2048
    },
    "names": [
        {
            "C": "CN",
            "ST": "Guangdong",
            "L": "GuangZhou",
            "O": "etcd"
        }
    ]
}
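Both CSR files can also be written without an editor, which helps when the hosts list changes during scale-out. The sketch below is a hypothetical helper, write_csr, that reproduces the two JSON files above; it uses only the subject fields from this article.

```shell
#!/usr/bin/env bash
# Write a cfssl CSR file without an editor.
# Usage: write_csr <output-file> [host ...]
# No hosts -> same shape as ca-csr.json; with hosts -> adds the "hosts" list like etcd-csr.json.
write_csr() {
  local out="$1" h i=0
  shift
  {
    echo '{'
    echo '    "CN": "etcd",'
    if [ "$#" -gt 0 ]; then
      echo '    "hosts": ['
      for h in "$@"; do
        i=$((i + 1))
        # every entry except the last one needs a trailing comma
        if [ "$i" -lt "$#" ]; then printf '        "%s",\n' "$h"; else printf '        "%s"\n' "$h"; fi
      done
      echo '    ],'
    fi
    cat <<'EOF'
    "key": {
        "algo": "rsa",
        "size": 2048
    },
    "names": [
        {
            "C": "CN",
            "ST": "Guangdong",
            "L": "GuangZhou",
            "O": "etcd"
        }
    ]
}
EOF
  } > "$out"
}
```

For example, write_csr /data/etcd/ssl/etcd-csr.json 127.0.0.1 172.22.0.12 172.22.0.14 172.22.0.4 recreates the file above; adding 172.22.0.8 to the argument list prepares the CSR for the new node in section 13.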

4.5 Create the etcd certificate with the prefix etcd, saved in /data/etcd/ssl/

/data/etcd/bin/cfssl gencert \
-ca=/data/etcd/ssl/ca.pem \
-ca-key=/data/etcd/ssl/ca-key.pem \
--config=/data/etcd/ssl/ca-config.json --profile=etcd \
/data/etcd/ssl/etcd-csr.json | /data/etcd/bin/cfssljson -bare /data/etcd/ssl/etcd


ls /data/etcd/ssl/
ca-config.json  ca.csr  ca-csr.json  ca-key.pem  ca.pem  etcd.csr  etcd-csr.json  etcd-key.pem  etcd.pem

4.6 Copy the certificates to the other etcd nodes. The ssl directory must already exist on etcd2 and etcd3 (it is created in step 2).

scp /data/etcd/ssl/*  etcd2:/data/etcd/ssl/
scp /data/etcd/ssl/*  etcd3:/data/etcd/ssl/
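According to etcd.conf in step 5, the other members only read ca.pem, etcd.pem and etcd-key.pem. A more cautious variant leaves ca-key.pem (the CA private key, which can sign new certificates) on etcd1 only; keep a secure backup of it, because it is needed to reissue certificates, for example when scaling out. In this sketch distribute_certs and the DRY_RUN switch are illustrative names; DRY_RUN=1 prints the scp commands instead of running them.

```shell
#!/usr/bin/env bash
# Copy only the files etcd.conf references; ca-key.pem stays on etcd1.
SSL_DIR=/data/etcd/ssl

distribute_certs() {
  local node
  for node in "$@"; do
    local cmd=(scp "$SSL_DIR/ca.pem" "$SSL_DIR/etcd.pem" "$SSL_DIR/etcd-key.pem" "$node:$SSL_DIR/")
    # DRY_RUN=1: show the command; otherwise execute it
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "${cmd[*]}"; else "${cmd[@]}"; fi
  done
}
```

Usage: distribute_certs etcd2 etcd3.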

5. Deploy the etcd cluster (all nodes)

cd /usr/local/src
wget https://github.com/etcd-io/etcd/releases/download/v3.5.4/etcd-v3.5.4-linux-amd64.tar.gz
tar zxvf etcd-v3.5.4-linux-amd64.tar.gz
cp  etcd-v3.5.4-linux-amd64/etcd*  /data/etcd/bin/
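  • Member flags: the URLs here use internally reachable addresses (internal IP and loopback IP).

  • ETCD_NAME: the node's hostname

  • ETCD_LISTEN_PEER_URLS: the URL on which this etcd listens for peer (member-to-member) traffic

  • ETCD_LISTEN_CLIENT_URLS: the URLs on which this etcd listens for client traffic

  • Cluster flags: the URLs can use internal IPs (or external IPs).

  • ETCD_INITIAL_CLUSTER_TOKEN: a token that identifies the cluster during bootstrap; it must be identical on all members

  • ETCD_INITIAL_CLUSTER_STATE: new when bootstrapping a new cluster, existing when a member joins a running cluster (see section 13)

  • ETCD_INITIAL_ADVERTISE_PEER_URLS: the peer URL this member advertises to the rest of the cluster at bootstrap

  • ETCD_INITIAL_CLUSTER: the peer URLs of all etcd members at bootstrap

  • ETCD_ADVERTISE_CLIENT_URLS: the client URL this member advertises to clients

  • Security flags use the certificates created above, and every URL uses https.

  • The configuration below is for etcd1. The other nodes use the same file with ETCD_NAME and every URL-related variable changed to their own values.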

cat > /data/etcd/etcd.conf << EOF
# [Member] member flags
ETCD_NAME="etcd1"
ETCD_DATA_DIR="/data/etcd/data"
ETCD_LISTEN_PEER_URLS="https://172.22.0.12:2380"
ETCD_LISTEN_CLIENT_URLS="https://172.22.0.12:2379,https://127.0.0.1:2379"

# [Cluster] cluster flags
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://172.22.0.12:2380"
ETCD_INITIAL_CLUSTER="etcd1=https://172.22.0.12:2380,etcd2=https://172.22.0.14:2380,etcd3=https://172.22.0.4:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://172.22.0.12:2379"

# [Security] peer TLS
ETCD_PEER_CLIENT_CERT_AUTH="true"
ETCD_PEER_CERT_FILE="/data/etcd/ssl/etcd.pem"
ETCD_PEER_KEY_FILE="/data/etcd/ssl/etcd-key.pem"
ETCD_PEER_TRUSTED_CA_FILE="/data/etcd/ssl/ca.pem"

# [Security] client TLS; clients must present a certificate signed by ca.pem
ETCD_CLIENT_CERT_AUTH="true"
ETCD_CERT_FILE="/data/etcd/ssl/etcd.pem"
ETCD_KEY_FILE="/data/etcd/ssl/etcd-key.pem"
ETCD_TRUSTED_CA_FILE="/data/etcd/ssl/ca.pem"
EOF
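Since only the name and IP differ between members, the file for each node can be generated from one node list. This is a sketch: gen_etcd_conf and NODES are illustrative names, and the output mirrors the etcd1 file above with ETCD_INITIAL_CLUSTER_STATE and ETCD_CLIENT_CERT_AUTH written out.

```shell
#!/usr/bin/env bash
# Generate etcd.conf for one member from a single node list.
# NODES mirrors this article's node table; edit it for your environment.
# Usage: gen_etcd_conf <name> <output-file> [new|existing]
NODES=("etcd1=172.22.0.12" "etcd2=172.22.0.14" "etcd3=172.22.0.4")
SSL_DIR=/data/etcd/ssl

gen_etcd_conf() {
  local name="$1" out="$2" state="${3:-new}" ip="" cluster="" entry
  for entry in "${NODES[@]}"; do
    # entry is "name=ip": ${entry%%=*} is the name, ${entry#*=} is the IP
    cluster+="${cluster:+,}${entry%%=*}=https://${entry#*=}:2380"
    if [ "${entry%%=*}" = "$name" ]; then ip="${entry#*=}"; fi
  done
  if [ -z "$ip" ]; then
    echo "gen_etcd_conf: $name is not in NODES" >&2
    return 1
  fi
  cat > "$out" <<EOF
ETCD_NAME="$name"
ETCD_DATA_DIR="/data/etcd/data"
ETCD_LISTEN_PEER_URLS="https://$ip:2380"
ETCD_LISTEN_CLIENT_URLS="https://$ip:2379,https://127.0.0.1:2379"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="$state"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://$ip:2380"
ETCD_INITIAL_CLUSTER="$cluster"
ETCD_ADVERTISE_CLIENT_URLS="https://$ip:2379"
ETCD_PEER_CLIENT_CERT_AUTH="true"
ETCD_PEER_CERT_FILE="$SSL_DIR/etcd.pem"
ETCD_PEER_KEY_FILE="$SSL_DIR/etcd-key.pem"
ETCD_PEER_TRUSTED_CA_FILE="$SSL_DIR/ca.pem"
ETCD_CLIENT_CERT_AUTH="true"
ETCD_CERT_FILE="$SSL_DIR/etcd.pem"
ETCD_KEY_FILE="$SSL_DIR/etcd-key.pem"
ETCD_TRUSTED_CA_FILE="$SSL_DIR/ca.pem"
EOF
}
```

On etcd2 you would run gen_etcd_conf etcd2 /data/etcd/etcd.conf. For the scale-out in section 13, append "etcd4=172.22.0.8" to NODES and run gen_etcd_conf etcd4 /data/etcd/etcd.conf existing on the new node.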

6. Run etcd as a systemd service (all nodes)

cat > /usr/lib/systemd/system/etcd.service << EOF
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/data/etcd/etcd.conf
WorkingDirectory=/data/etcd/data/
ExecStart=/data/etcd/bin/etcd
Restart=on-failure
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable etcd.service
# With Type=notify, "systemctl start" on the first node waits until enough members are up
# to form a quorum, so start etcd on all three nodes at roughly the same time
systemctl start etcd.service
systemctl status etcd.service # check that etcd is running
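Scripts that continue with etcdctl right after starting the service often need to wait until the cluster is healthy. A generic retry helper, sketched here under the hypothetical name wait_for, can wrap the health check from step 7.

```shell
#!/usr/bin/env bash
# Re-run a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then return 0; fi
    # no sleep after the final failed attempt
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
  done
  return 1
}
```

For example: wait_for 30 2 /data/etcd/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem --endpoints=https://127.0.0.1:2379 endpoint health. This assumes endpoint health exits with a non-zero status while the endpoint is unhealthy.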

7. Check the cluster health from any etcd node

/data/etcd/bin/etcdctl -w table \
--cacert=/data/etcd/ssl/ca.pem \
--cert=/data/etcd/ssl/etcd.pem \
--key=/data/etcd/ssl/etcd-key.pem \
--endpoints=https://172.22.0.12:2379,https://172.22.0.14:2379,https://172.22.0.4:2379 \
endpoint health  # check whether every endpoint is healthy


+--------------------------+--------+-------------+-------+
|         ENDPOINT         | HEALTH |    TOOK     | ERROR |
+--------------------------+--------+-------------+-------+
| https://172.22.0.14:2379 |   true | 12.935264ms |       |
|  https://172.22.0.4:2379 |   true | 13.266081ms |       |
| https://172.22.0.12:2379 |   true | 16.062599ms |       |
+--------------------------+--------+-------------+-------+
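Every etcdctl call in this article repeats the same TLS and endpoint flags. etcdctl v3 also reads each global flag from an ETCDCTL_* environment variable (--cacert becomes ETCDCTL_CACERT, --endpoints becomes ETCDCTL_ENDPOINTS, and so on), so a sketch like the following could go into a shell profile; the file location, for example /etc/profile.d/etcdctl.sh, is only a suggestion.

```shell
#!/usr/bin/env bash
# Default connection settings for etcdctl, read from the environment.
export ETCDCTL_API=3
export ETCDCTL_CACERT=/data/etcd/ssl/ca.pem
export ETCDCTL_CERT=/data/etcd/ssl/etcd.pem
export ETCDCTL_KEY=/data/etcd/ssl/etcd-key.pem
export ETCDCTL_ENDPOINTS=https://172.22.0.12:2379,https://172.22.0.14:2379,https://172.22.0.4:2379
```

After that, /data/etcd/bin/etcdctl -w table endpoint health is enough. For commands that need a single endpoint, such as snapshot save, override the variable for that one command, e.g. ETCDCTL_ENDPOINTS=https://172.22.0.12:2379 /data/etcd/bin/etcdctl snapshot save ...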


/data/etcd/bin/etcdctl -w table \
--cacert=/data/etcd/ssl/ca.pem \
--cert=/data/etcd/ssl/etcd.pem \
--key=/data/etcd/ssl/etcd-key.pem \
--endpoints=https://172.22.0.12:2379,https://172.22.0.14:2379,https://172.22.0.4:2379 \
member list       # STATUS "started" means the member is running normally
+------------------+---------+-------+--------------------------+--------------------------+------------+
|        ID        | STATUS  | NAME  |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
+------------------+---------+-------+--------------------------+--------------------------+------------+
| 9197f28cdea6dd9c | started | etcd2 | https://172.22.0.14:2380 | https://172.22.0.14:2379 |      false |
| adcfd99939b3aee8 | started | etcd1 | https://172.22.0.12:2380 | https://172.22.0.12:2379 |      false |
| c2218527ef41ca14 | started | etcd3 |  https://172.22.0.4:2380 |  https://172.22.0.4:2379 |      false |
+------------------+---------+-------+--------------------------+--------------------------+------------+


/data/etcd/bin/etcdctl -w table \
--cacert=/data/etcd/ssl/ca.pem \
--cert=/data/etcd/ssl/etcd.pem \
--key=/data/etcd/ssl/etcd-key.pem \
--endpoints=https://172.22.0.12:2379,https://172.22.0.14:2379,https://172.22.0.4:2379 \
endpoint status   # shows which member is the leader

+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.22.0.12:2379 | adcfd99939b3aee8 |   3.5.4 |   20 kB |     false |      false |         3 |         23 |                 23 |        |
| https://172.22.0.14:2379 | 9197f28cdea6dd9c |   3.5.4 |   20 kB |     false |      false |         3 |         23 |                 23 |        |
|  https://172.22.0.4:2379 | c2218527ef41ca14 |   3.5.4 |   25 kB |      true |      false |         3 |         23 |                 23 |        |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
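To pick the leader out of that table in a script, the column can be located by its header name. A minimal awk sketch (leader_endpoint is an illustrative name) that reads endpoint status -w table output on stdin; -w json would be sturdier if jq is available.

```shell
#!/usr/bin/env bash
# Print the ENDPOINT whose "IS LEADER" column is true.
leader_endpoint() {
  awk -F'|' '
    # Header row: remember which |-separated column is "IS LEADER"
    /IS LEADER/ {
      for (i = 1; i <= NF; i++) { h = $i; gsub(/ /, "", h); if (h == "ISLEADER") col = i }
      next
    }
    # Data rows: column 2 is ENDPOINT; border lines have no "|" and never match
    col && $col ~ /true/ { ep = $2; gsub(/ /, "", ep); print ep }
  '
}
```

Usage: /data/etcd/bin/etcdctl -w table --cacert=... --endpoints=... endpoint status | leader_endpoint. For the output above it prints https://172.22.0.4:2379.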




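8. Tune system limits

These settings raise the process and open-file limits for login sessions. Services started by systemd do not read /etc/security/limits.conf; the etcd service gets its open-file limit from LimitNOFILE in the unit file from step 6.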

cp /etc/security/limits.conf /etc/security/limits.conf.bak
cat >>/etc/security/limits.conf <<EOF
* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535
EOF
echo "ulimit -SHn 65535" >> /etc/profile
echo "ulimit -SHn 65535" >> /etc/rc.local
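To see which open-file limit a running process actually received, read /proc/<pid>/limits. A small sketch with an illustrative helper name:

```shell
#!/usr/bin/env bash
# Print the soft and hard "Max open files" limits of a running process.
# /proc/<pid>/limits rows look like: "Max open files   <soft>   <hard>   files"
open_files_limit() {
  awk '/^Max open files/ { print $4, $5 }' "/proc/$1/limits"
}
```

For etcd: open_files_limit "$(pidof -s etcd)". With LimitNOFILE=65536 in the unit file, the expected output is 65536 65536.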

9. Synchronize the clock

mv /etc/localtime /etc/localtime.bak
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
ntpdate cn.pool.ntp.org && hwclock -w

echo "10 * * * * root /usr/sbin/ntpdate cn.pool.ntp.org >> /var/log/ntpdate.log" >> /etc/crontab

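10. Reboot the servers one at a time and check the cluster status again, so that at least two of the three members stay up and the cluster keeps its quorum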

reboot

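11. Backing up the etcd cluster

A snapshot is taken from a single member, so snapshot save accepts exactly one endpoint. Create the backup directory first:

mkdir -p /data/backup
/data/etcd/bin/etcdctl  --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem --endpoints=https://172.22.0.12:2379 snapshot save /data/backup/etcd-202205081723.db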
{"level":"info","ts":"2023-10-08T21:36:27.784+0800","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/data/backup/etcd-202205081723.db.part"}
{"level":"info","ts":"2023-10-08T21:36:27.791+0800","logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2023-10-08T21:36:27.791+0800","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://172.22.0.12:2379"}
{"level":"info","ts":"2023-10-08T21:36:27.794+0800","logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2023-10-08T21:36:27.796+0800","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://172.22.0.12:2379","size":"20 kB","took":"now"}
{"level":"info","ts":"2023-10-08T21:36:27.796+0800","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/data/backup/etcd-202205081723.db"}
Snapshot saved at /data/backup/etcd-202205081723.db
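For regular backups, the timestamp in the file name can come from date, and old snapshots can be pruned after each successful run. This is a sketch: BACKUP_DIR, KEEP_DAYS and the function names are illustrative, not values from this article.

```shell
#!/usr/bin/env bash
# Scheduled backup sketch: timestamped snapshot, then delete old snapshots.
BACKUP_DIR="${BACKUP_DIR:-/data/backup}"
KEEP_DAYS="${KEEP_DAYS:-7}"

# find rounds file age down to whole days, so -mtime +7 matches files at least 8 days old
prune_backups() {
  find "$BACKUP_DIR" -maxdepth 1 -type f -name 'etcd-*.db' -mtime +"$KEEP_DAYS" -delete
}

backup_etcd() {
  mkdir -p "$BACKUP_DIR" || return 1
  # snapshot save talks to exactly one endpoint; the local member also listens on 127.0.0.1
  /data/etcd/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem \
    --key=/data/etcd/ssl/etcd-key.pem --endpoints=https://127.0.0.1:2379 \
    snapshot save "$BACKUP_DIR/etcd-$(date +%Y%m%d%H%M).db" && prune_backups
}
```

Pruning only runs after a successful snapshot, so a failing backup never deletes the last good copies. A script that ends with a call to backup_etcd could be scheduled from /etc/crontab, e.g. 0 2 * * * root /usr/local/bin/etcd-backup.sh (hypothetical path).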

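12. Restoring the etcd cluster

    1. Locate the backup etcd data file and copy it to every node.

    2. Stop etcd on all nodes (systemctl stop etcd.service) and move the old data directory aside, for example mv /data/etcd/data /data/etcd/data.bak. A restore writes a fresh data directory and refuses to overwrite an existing one.

    3. On each node, restore the snapshot with that node's own name and peer URL (etcd1 is shown below), then start etcd on all nodes. In etcd v3.5 the offline restore is done with etcdutl, which ships in the same release tarball; etcdctl snapshot restore still works but is deprecated. The restore does not contact the cluster, so no TLS flags or endpoints are needed.

/data/etcd/bin/etcdutl snapshot restore /data/backup/etcd-202205081723.db --name etcd1 --data-dir /data/etcd/data --initial-cluster etcd1=https://172.22.0.12:2380,etcd2=https://172.22.0.14:2380,etcd3=https://172.22.0.4:2380 --initial-cluster-token etcd-cluster --initial-advertise-peer-urls https://172.22.0.12:2380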
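A restore has to run on every member with that member's own --name and --initial-advertise-peer-urls, while --initial-cluster and --initial-cluster-token stay identical. The sketch below prints one command per node from the node list; it uses etcdutl, the v3.5 offline tool shipped next to etcdctl, and NODES, SNAPSHOT and restore_cmds are illustrative names.

```shell
#!/usr/bin/env bash
# Print the per-member restore command; run each printed line on the node it names.
NODES=("etcd1=172.22.0.12" "etcd2=172.22.0.14" "etcd3=172.22.0.4")
SNAPSHOT=/data/backup/etcd-202205081723.db   # the file saved in section 11

restore_cmds() {
  local cluster="" entry
  # Build name=https://ip:2380,... once; it must be identical on every member
  for entry in "${NODES[@]}"; do
    cluster+="${cluster:+,}${entry%%=*}=https://${entry#*=}:2380"
  done
  for entry in "${NODES[@]}"; do
    printf '%s\n' "/data/etcd/bin/etcdutl snapshot restore $SNAPSHOT --name ${entry%%=*} --data-dir /data/etcd/data --initial-cluster $cluster --initial-cluster-token etcd-cluster --initial-advertise-peer-urls https://${entry#*=}:2380"
  done
}

restore_cmds
```

The script only prints commands; each one still has to be run on its own node after etcd is stopped and the old data directory is moved aside.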

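13. Adding an etcd node (scale-out)

Example: the new node's IP is 172.22.0.8.

    1. Add the new node to /etc/hosts on all nodes.

    2. Add the new IP to the hosts list in /data/etcd/ssl/etcd-csr.json and regenerate the certificate.

    3. Copy the certificates to all nodes: scp /data/etcd/ssl/* <node-address>:/data/etcd/ssl/

    4. Install the etcd binaries on the new node.

    5. On an existing member, register the new member with member add (command below). etcdctl prints the ETCD_NAME, ETCD_INITIAL_CLUSTER, ETCD_INITIAL_ADVERTISE_PEER_URLS and ETCD_INITIAL_CLUSTER_STATE values the new member needs.

    6. On the new node, write etcd.conf with ETCD_NAME="etcd4" and its own URLs, set ETCD_INITIAL_CLUSTER to all four members (append etcd4=https://172.22.0.8:2380) and set ETCD_INITIAL_CLUSTER_STATE="existing". Without "existing", etcd4 tries to bootstrap a new cluster and cannot join the running one. The existing members do not need a restart, because ETCD_INITIAL_CLUSTER is only read at bootstrap; you can still update it in their etcd.conf to keep the files consistent.

    7. Start etcd on the new node.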

/data/etcd/bin/etcdctl --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem --endpoints="https://172.22.0.14:2379"  member add etcd4 --peer-urls=https://172.22.0.8:2380

/data/etcd/bin/etcdctl -w table --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem --endpoints=https://172.22.0.12:2379,https://172.22.0.14:2379,https://172.22.0.4:2379 member list
+------------------+-----------+-------+--------------------------+--------------------------+------------+
|        ID        |  STATUS   | NAME  |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
+------------------+-----------+-------+--------------------------+--------------------------+------------+
| 9197f28cdea6dd9c |   started | etcd2 | https://172.22.0.14:2380 | https://172.22.0.14:2379 |      false |
| adcfd99939b3aee8 |   started | etcd1 | https://172.22.0.12:2380 | https://172.22.0.12:2379 |      false |
| c2218527ef41ca14 |   started | etcd3 |  https://172.22.0.4:2380 |  https://172.22.0.4:2379 |      false |
| ee2c6ab6d3897479 | unstarted |       |  https://172.22.0.8:2380 |                          |      false |
+------------------+-----------+-------+--------------------------+--------------------------+------------+

After etcd4 is started (step 7), run member list -w table again; its STATUS should change to started.
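In a script, members that were added but never started can be listed from the default (non-table) member list output, which prints one comma-separated line per member in the order ID, STATUS, NAME, PEER ADDRS, CLIENT ADDRS, IS LEARNER. A minimal sketch assuming that format; unstarted_members is an illustrative name.

```shell
#!/usr/bin/env bash
# Print "ID PEER-URL" for every member whose STATUS is "unstarted".
# Reads `etcdctl member list` default output on stdin; fields are separated by ", ".
unstarted_members() {
  awk -F', ' '$2 == "unstarted" { print $1, $4 }'
}
```

Usage: /data/etcd/bin/etcdctl --cacert=... --endpoints=... member list | unstarted_members. An empty result means every registered member has started.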

14. Removing an etcd node (scale-in)

  1. Get the member ID.

  2. Remove the member by its ID.

/data/etcd/bin/etcdctl -w table --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem --endpoints=https://172.22.0.12:2379,https://172.22.0.14:2379,https://172.22.0.4:2379,https://172.22.0.8:2379 endpoint status


+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.22.0.12:2379 | adcfd99939b3aee8 |   3.5.4 |   20 kB |     false |      false |        15 |        107 |                107 |        |
| https://172.22.0.14:2379 | 9197f28cdea6dd9c |   3.5.4 |   20 kB |      true |      false |        15 |        107 |                107 |        |
|  https://172.22.0.4:2379 | c2218527ef41ca14 |   3.5.4 |   25 kB |     false |      false |        15 |        107 |                107 |        |
|  https://172.22.0.8:2379 | ee2c6ab6d3897479 |   3.5.4 |   20 kB |     false |      false |        15 |        107 |                107 |        |
+--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+


/data/etcd/bin/etcdctl  --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem --endpoints=https://172.22.0.12:2379,https://172.22.0.14:2379,https://172.22.0.4:2379 member remove <ID>
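The ID can also be looked up by member name from the default member list output (ID, STATUS, NAME, ... separated by ", "). A sketch under that assumption, with the illustrative name member_id_by_name. After the removal, stop etcd on the removed node (systemctl disable --now etcd.service) and move its data directory aside before reusing the machine, since that member is no longer part of the cluster.

```shell
#!/usr/bin/env bash
# Print the member ID for a given member name, reading `etcdctl member list` on stdin.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

For example: id=$(/data/etcd/bin/etcdctl --cacert=... --endpoints=... member list | member_id_by_name etcd4), then member remove "$id" as above.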
