Architectural Decisions for Running MySQL on AWS EKS: An In-Depth Look from RDS to a Custom Operator

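Running a stateful core data service like MySQL on Kubernetes, an orchestration platform designed primarily with stateless applications in mind, is an architectural balancing act in itself. The starting question is never "how do we deploy it" but "should we take on responsibility for its full lifecycle, and if so, how". Behind a simple kubectl apply lies a series of complex engineering challenges with no room for error: data persistence, high availability, disaster recovery, upgrades, and monitoring.

Our goal is to define a production-grade MySQL service that meets the following non-functional requirements:

Declarative configuration: infrastructure and application configuration should be version-controlled as code.
High availability: the database service must tolerate the failure of a single node or even an entire Availability Zone (AZ).
Automated operations: routine operations such as backup, restore, scaling, and version upgrades should be automated as far as possible to reduce manual intervention.
Cost efficiency: keep resource costs under control while meeting performance and availability requirements.

The end state of all this might be a YAML file like the following: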

apiVersion: db.my-company.com/v1alpha1
kind: PerconaXtraDBCluster
metadata:
  name: main-cluster
  namespace: database
spec:
  # Cluster size and resource configuration
  replicas: 3
  resources:
    requests:
      memory: "8Gi"
      cpu: "4000m"
    limits:
      memory: "8Gi"
      cpu: "4000m"
  
  # Persistent storage configuration
  storage:
    storageClassName: "gp3-io-optimized"
    size: "500Gi"

  # MySQL version and configuration
  image: "percona/percona-xtradb-cluster:8.0.32"
  configSecretName: "main-cluster-pxc-config"

  # Backup policy
  backup:
    schedule: "0 2 * * *" # daily backup at 02:00
    destination: "s3://my-company-db-backups/main-cluster/"
    credentialsSecret: "aws-s3-backup-creds"

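This PerconaXtraDBCluster resource clearly describes the database state we want. How to get a Kubernetes cluster, starting from scratch, to understand and realize that desired state is the architectural decision process this article explores.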

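Option A: AWS RDS - The Appeal and Cost of a Managed Service

For applications running on AWS EKS, the most direct and least laborious option is AWS RDS (Relational Database Service).

Advantages:

Very low operational burden: AWS takes care of hardware, the operating system, database installation, patching, backups, high availability (Multi-AZ), and read replicas. The SRE team barely needs to think about the operational details of the database itself.
Mature ecosystem integration: it integrates seamlessly with AWS services such as IAM, CloudWatch, and VPC, so the permission and monitoring story is well established.
Reliability: Multi-AZ deployments provide high availability across Availability Zones and have been proven in large-scale production environments.

Drawbacks and trade-offs:

Network latency and boundaries: traffic between EKS Pods and the RDS instance stays inside the same VPC, but it is still a network call to an endpoint outside the cluster rather than direct Pod-to-Pod communication. For latency-sensitive applications this can be a problem. More importantly, it breaks the resource management boundary of the Kubernetes cluster: the database's lifecycle, access control, and network policy all live outside kubectl and the Kubernetes API.
Cost: the convenience of RDS comes at a price. At the same instance size, RDS usually costs noticeably more than deploying the database yourself on EC2 or EKS. For large fleets this overhead is hard to ignore.
Loss of configuration flexibility: RDS offers parameter groups, but control over low-level MySQL configuration (such as specific storage engine parameters or kernel tuning) is far more limited than in a self-managed setup.
Vendor lock-in: deep reliance on RDS makes a future migration to another cloud or an on-premises data center harder and more expensive.

In real projects, RDS is clearly the best choice when the business is young, the team is small, or database operations experience is limited. It trades money for valuable time and stability. But as the business grows, cost pressure rises, and the need for control over the database increases, self-managed options start to look attractive.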

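Option B: Helm + StatefulSet - A First Step on the Cloud-Native Path

Once we are in the EKS world, a natural idea is to deploy MySQL with a mature community Helm chart (such as the charts published by Bitnami or Percona). These solutions are typically built on a StatefulSet.

Advantages:

Cloud-native integration: the database runs as a StatefulSet inside the EKS cluster, with its networking, storage (through PVCs), and configuration (through ConfigMaps and Secrets) all managed by Kubernetes. Applications can use Kubernetes Service DNS for service discovery, and network policies can control access precisely.
Full control: you control the MySQL version, configuration, plugins, and the underlying operating system environment.
Cost savings: it uses the compute of EC2 worker nodes and EBS storage directly, which gives a clear cost advantage over RDS.

Drawbacks and trade-offs:

The catch is that a StatefulSet only solves the most basic problems of a stateful application: stable network identity and dedicated persistent storage. It does not address the core difficulty, which is lifecycle management of the stateful application.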
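
To make "stable network identity" concrete: each Pod in a StatefulSet is named after its ordinal, and a headless Service gives it a predictable DNS record. The sketch below computes those names. The function name, the Service name main-cluster-peers, and the cluster domain cluster.local (the common default) are assumptions for illustration only.

```go
package main

import "fmt"

// podDNSNames returns the stable DNS name of each Pod in a StatefulSet.
// Pods are named <statefulset>-<ordinal>, and the governing headless Service
// gives each one a record of the form
// <pod>.<service>.<namespace>.svc.<clusterDomain>.
func podDNSNames(stsName, headlessSvc, namespace, clusterDomain string, replicas int) []string {
	names := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		names = append(names, fmt.Sprintf("%s-%d.%s.%s.svc.%s",
			stsName, i, headlessSvc, namespace, clusterDomain))
	}
	return names
}
```

Calling podDNSNames("main-cluster", "main-cluster-peers", "database", "cluster.local", 3) yields three addresses that survive Pod restarts. That is all the StatefulSet guarantees. Deciding which of those peers is healthy, or safe to join, is left to whoever operates the database.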

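Day-2 operations become a nightmare:
- Backup and restore: you have to run mysqldump or xtrabackup by hand or through scripts deployed as a CronJob, and upload the backups to S3. Restoring is more involved still: it means recreating PVCs, copying data, and resynchronizing cluster state, and it is very error-prone.
- High availability and failover: Percona XtraDB Cluster (PXC) or Galera Cluster can provide synchronous multi-node replication, but automatic failure detection and primary switchover have to be handled by external components (such as ProxySQL) or by complex scripts. The StatefulSet itself does not perform application-level failover.
- Upgrades: running helm upgrade against a live database cluster is a high-risk operation. Even a small configuration change can lead to split-brain or data inconsistency. The rolling update strategy has to be designed carefully and checked for application-level compatibility.
Lack of declarative management: Helm provides templated deployment, but it is essentially a "one-shot" installation tool. Follow-up operations (such as "take a backup now" or "add a read replica") are mostly imperative rather than declarative. We cannot, as in the YAML at the start of this article, change the backup policy or cluster topology simply by editing a YAML file.

The Helm + StatefulSet approach moves operational complexity from the infrastructure layer to the application management layer. For a team without strong database automation capabilities, that amounts to opening Pandora's box.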

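Final Choice: Building a Custom Operator - Encoding Operational Knowledge

After comparing these options, we decided to take the path with the highest up-front investment but the best long-term return: building a custom Kubernetes Operator for our MySQL cluster (with Percona XtraDB Cluster as the high-availability foundation).

The core idea of the Operator pattern is to take the knowledge and procedures of human operations experts and implement them in code as a continuously running controller that actively reconciles the state of the system.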
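
The observe-diff-act cycle behind that idea can be sketched without any Kubernetes dependency. The types and function names below are invented for this illustration and are much simpler than a real controller, which would read and write through the Kubernetes API.

```go
package main

// ClusterState is a deliberately tiny stand-in for the state a controller
// compares. It is not a Kubernetes type.
type ClusterState struct {
	Replicas int
	Image    string
}

// Action names one corrective step.
type Action string

const (
	ActionNone        Action = "none"
	ActionUpdateImage Action = "update-image"
	ActionScaleUp     Action = "scale-up"
	ActionScaleDown   Action = "scale-down"
)

// nextAction compares desired and observed state and picks a single step.
// It looks only at current state, never at which event triggered it.
func nextAction(desired, observed ClusterState) Action {
	switch {
	case observed.Image != desired.Image:
		return ActionUpdateImage
	case observed.Replicas < desired.Replicas:
		return ActionScaleUp
	case observed.Replicas > desired.Replicas:
		return ActionScaleDown
	default:
		return ActionNone
	}
}

// applyAction mutates the observed state. In a real controller this would
// be calls to the Kubernetes API.
func applyAction(a Action, desired ClusterState, observed *ClusterState) {
	switch a {
	case ActionUpdateImage:
		observed.Image = desired.Image
	case ActionScaleUp:
		observed.Replicas++
	case ActionScaleDown:
		observed.Replicas--
	}
}

// reconcileUntilConverged repeats observe-diff-act until nothing is left to
// do or maxSteps is reached, and returns the number of steps taken.
func reconcileUntilConverged(desired ClusterState, observed *ClusterState, maxSteps int) int {
	steps := 0
	for steps < maxSteps {
		a := nextAction(desired, *observed)
		if a == ActionNone {
			break
		}
		applyAction(a, desired, observed)
		steps++
	}
	return steps
}
```

Because every decision is derived from current state, running the loop again after convergence takes zero steps. That property is what lets a controller recover after a crash or a missed event.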

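Reasons for this choice:

A truly declarative API: we can define our own Custom Resource Definition (CRD), such as PerconaXtraDBCluster, and use it to fully describe the desired state of a database cluster. Operations staff only need to edit this CR object, and the Operator carries out whatever actions are needed to bring the actual state in line with the desired state.
Automated Day-2 operations: complex logic such as backup, restore, failover, and upgrades can be encoded in the Operator's reconciliation loop. For example, when the Operator detects that a PXC node Pod is unhealthy, it can run a series of safety checks, then decide whether to remove the node from the cluster and attempt to rebuild it.
Deep integration with the cloud-native ecosystem: an Operator is itself a first-class citizen of Kubernetes and can use everything the Kubernetes API offers, such as OwnerReferences to manage resource lifecycles, Finalizers to make resource cleanup safe, and integration with the Prometheus Operator to automate monitoring configuration.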
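
Finalizers are the least intuitive of these mechanisms, so here is a minimal model of the usual flow: add the finalizer while the object is live, run cleanup once deletion is requested, and remove the finalizer only after cleanup succeeds. The Object type, the finalizer name, and finalBackup are all hypothetical. In a real Operator the same steps would normally go through controller-runtime's controllerutil helpers (AddFinalizer, ContainsFinalizer, RemoveFinalizer) followed by an Update call.

```go
package main

import "errors"

// backupFinalizer is a hypothetical finalizer name for this sketch.
const backupFinalizer = "db.my-company.com/final-backup"

// Object models only the two metadata fields that matter for finalizers.
type Object struct {
	Finalizers        []string
	DeletionRequested bool // stands in for a non-nil metadata.deletionTimestamp
}

func hasFinalizer(o *Object, name string) bool {
	for _, f := range o.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

func removeFinalizer(o *Object, name string) {
	kept := o.Finalizers[:0]
	for _, f := range o.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	o.Finalizers = kept
}

// finalBackup is a placeholder for real cleanup work, such as taking one
// last backup before the cluster's volumes are released.
func finalBackup(succeeded bool) error {
	if !succeeded {
		return errors.New("final backup failed")
	}
	return nil
}

// reconcileDeletion returns true when the object is being deleted and normal
// reconciliation should be skipped.
func reconcileDeletion(o *Object, cleanup func() error) (bool, error) {
	if !o.DeletionRequested {
		// Live object: make sure the finalizer is present before doing work.
		if !hasFinalizer(o, backupFinalizer) {
			o.Finalizers = append(o.Finalizers, backupFinalizer)
		}
		return false, nil
	}
	if hasFinalizer(o, backupFinalizer) {
		if err := cleanup(); err != nil {
			// The finalizer stays, so deletion is blocked and cleanup is retried.
			return true, err
		}
		removeFinalizer(o, backupFinalizer)
	}
	return true, nil
}
```

The design point is the ordering: the API server will not remove the object while any finalizer remains, so a failed cleanup blocks deletion and is retried on the next reconcile.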

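Core Implementation Overview

We will build the Operator with the Operator SDK (Go-based). The diagram below shows the core components of the architecture and how they interact.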

graph TD
    subgraph User Interaction
        A[Operations engineer] -- kubectl apply --> B(PerconaXtraDBCluster CR);
    end

    subgraph Kubernetes API Server
        B;
        C(CRD Definition);
        D(StatefulSet);
        E(Service);
        F(ConfigMap/Secret);
        G(CronJob for Backup);
        H(Pods);
    end

    subgraph Operator Pod
        I[Controller Manager] -- watches --> B;
        I -- reconciles --> D;
        I -- reconciles --> E;
        I -- reconciles --> F;
        I -- reconciles --> G;
    end

    I -- reads status --> H;
    
    B -- defines --> D & E & F & G;
    D -- creates --> H;
    H -- mounts --> F;
    G -- triggers --> H;
    
    style I fill:#f9f,stroke:#333,stroke-width:2px

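Code Deep Dive 1: CRD Type Definitions

The first step in building an Operator is defining the CRD's data structures. In Go this is done with structs and special code annotations.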

api/v1alpha1/perconaxtradbcluster_types.go:

package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PerconaXtraDBClusterSpec defines the desired state of PerconaXtraDBCluster
type PerconaXtraDBClusterSpec struct {
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Required
	Replicas int32 `json:"replicas"`

	// Image is the Docker image to use for the PXC nodes.
	// +kubebuilder:validation:Required
	Image string `json:"image"`

	// Resources defines the compute resource requirements.
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`

	// Storage defines the persistent storage configuration.
	// +kubebuilder:validation:Required
	Storage StorageSpec `json:"storage"`
	
	// Name of the Secret containing MySQL configuration (my.cnf).
	ConfigSecretName string `json:"configSecretName,omitempty"`

	// Backup defines the automated backup strategy.
	Backup *BackupSpec `json:"backup,omitempty"`
}

type StorageSpec struct {
	// Storage class to use for PersistentVolumeClaims.
	StorageClassName string `json:"storageClassName"`
	// Size of the persistent volume.
	Size string `json:"size"`
}

type BackupSpec struct {
	// Cron schedule for backups.
	// +kubebuilder:validation:Pattern=`^(@(annually|yearly|monthly|weekly|daily|hourly|reboot))|(@every (\d+(ns|us|µs|ms|s|m|h))+)|((((\d+,)+\d+|(\d+(\/|-)\d+)|\d+|\*) ?){5,7})$`
	Schedule string `json:"schedule"`
	// S3 destination URL for backups.
	Destination string `json:"destination"`
	// Name of the Secret containing credentials for the backup destination.
	CredentialsSecret string `json:"credentialsSecret"`
}


// PerconaXtraDBClusterStatus defines the observed state of PerconaXtraDBCluster
type PerconaXtraDBClusterStatus struct {
	// ReadyReplicas is the number of PXC nodes that are ready and part of the cluster.
	ReadyReplicas int32 `json:"readyReplicas"`
	// State represents the current state of the cluster (e.g., Initializing, Ready, Failed).
	State string `json:"state"`
	// Conditions store the status conditions of the cluster.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
//+kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".spec.replicas"
//+kubebuilder:printcolumn:name="Ready",type="integer",JSONPath=".status.readyReplicas"
//+kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.state"
//+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"

// PerconaXtraDBCluster is the Schema for the perconaxtradbclusters API
type PerconaXtraDBCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PerconaXtraDBClusterSpec   `json:"spec,omitempty"`
	Status PerconaXtraDBClusterStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

// PerconaXtraDBClusterList contains a list of PerconaXtraDBCluster
type PerconaXtraDBClusterList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []PerconaXtraDBCluster `json:"items"`
}

func init() {
	SchemeBuilder.Register(&PerconaXtraDBCluster{}, &PerconaXtraDBClusterList{})
}

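Comments are the key: comments of the form // +kubebuilder:... are called "markers". The Operator SDK tooling (controller-gen) reads them to generate the CRD YAML, including its OpenAPI validation schema, and the RBAC manifests.
Spec vs Status: Spec is the user-defined "desired state" and is the Operator's input. Status is the "actual state" observed by the Operator and is its output. This separation is essential.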
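
The Pattern marker on Schedule ends up as a regular expression in the generated CRD schema, and it is worth checking what that expression really accepts. The sketch below copies the pattern verbatim into Go's regexp package. It assumes the pattern is applied as an unanchored search, and the helper name scheduleAccepted is made up for this example.

```go
package main

import "regexp"

// schedulePattern is copied verbatim from the Pattern marker on
// BackupSpec.Schedule.
var schedulePattern = regexp.MustCompile(`^(@(annually|yearly|monthly|weekly|daily|hourly|reboot))|(@every (\d+(ns|us|µs|ms|s|m|h))+)|((((\d+,)+\d+|(\d+(\/|-)\d+)|\d+|\*) ?){5,7})$`)

// scheduleAccepted reports whether the pattern finds a match anywhere in s,
// which is how an unanchored pattern check behaves.
func scheduleAccepted(s string) bool {
	return schedulePattern.MatchString(s)
}
```

With this helper, "0 2 * * *" and "@daily" are accepted and "hello" is rejected, as expected. Less obviously, "x @every 5m y" is accepted too: the ^ and $ anchors each bind to only one branch of the top-level alternation, so the middle branch can match anywhere in the string. The pattern is therefore a coarse filter, and the Operator should still treat the schedule as untrusted input when it builds the CronJob.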

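Code Deep Dive 2: The Reconciliation Loop

This is the brain of the Operator. This function is called whenever a PerconaXtraDBCluster resource changes, or whenever a child resource it owns and the controller watches (such as the StatefulSet) changes.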

internal/controller/perconaxtradbcluster_controller.go:

package controller

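import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"

	// dbv1alpha1 is this project's own API package (api/v1alpha1). Its import
	// path depends on the Go module name, which is not given here.
)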

// Reconcile is part of the main kubernetes reconciliation loop which aims to
// move the current state of the cluster closer to the desired state.
func (r *PerconaXtraDBClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	// 1. Fetch the PerconaXtraDBCluster instance
	var pxcCluster dbv1alpha1.PerconaXtraDBCluster
	if err := r.Get(ctx, req.NamespacedName, &pxcCluster); err != nil {
		if errors.IsNotFound(err) {
			// Request object not found, could have been deleted after reconcile request.
			// Owned objects are automatically garbage collected. For additional cleanup logic use finalizers.
			log.Info("PerconaXtraDBCluster resource not found. Ignoring since object must be deleted")
			return ctrl.Result{}, nil
		}
		log.Error(err, "Failed to get PerconaXtraDBCluster")
		return ctrl.Result{}, err
	}

	// 2. Reconcile the StatefulSet for PXC nodes
	// This is the core logic for managing the database nodes themselves.
	sts := &appsv1.StatefulSet{}
	err := r.Get(ctx, types.NamespacedName{Name: pxcCluster.Name, Namespace: pxcCluster.Namespace}, sts)
	if err != nil && errors.IsNotFound(err) {
		// Define a new StatefulSet
		desiredSts := r.statefulSetForPXC(&pxcCluster)
		log.Info("Creating a new StatefulSet", "StatefulSet.Namespace", desiredSts.Namespace, "StatefulSet.Name", desiredSts.Name)
		if err = r.Create(ctx, desiredSts); err != nil {
			log.Error(err, "Failed to create new StatefulSet", "StatefulSet.Namespace", desiredSts.Namespace, "StatefulSet.Name", desiredSts.Name)
			return ctrl.Result{}, err
		}
		// StatefulSet created successfully - return and requeue
		return ctrl.Result{Requeue: true}, nil
	} else if err != nil {
		log.Error(err, "Failed to get StatefulSet")
		return ctrl.Result{}, err
	}
	
	// Here you would add logic to check if the existing StatefulSet matches the desired state
	// (e.g., replica count, image tag, resource requests) and update it if necessary.
	// A common mistake is to always update, which can cause unnecessary pod restarts.
	// A deep comparison of the specs is required.

	// 3. Reconcile the client-facing Service
	// ... similar logic for creating/updating a Kubernetes Service ...
	
	// 4. Reconcile the headless Service for peer discovery
	// ... similar logic for creating a headless service used by PXC nodes to find each other ...

	// 5. Reconcile the Backup CronJob
	if pxcCluster.Spec.Backup != nil {
		// Logic to create or update a CronJob based on the backup spec.
		// If spec.Backup is removed, ensure the CronJob is deleted.
		// This demonstrates how the operator handles optional features declaratively.
	}


	// 6. Update the Status
	// After all reconciliation steps, update the status based on the real world state.
	readyReplicas := sts.Status.ReadyReplicas
	if pxcCluster.Status.ReadyReplicas != readyReplicas {
		pxcCluster.Status.ReadyReplicas = readyReplicas
		// Logic to determine the overall cluster state based on ready replicas.
		if readyReplicas == pxcCluster.Spec.Replicas {
			pxcCluster.Status.State = "Ready"
		} else {
			pxcCluster.Status.State = "Reconciling"
		}
		
		if err := r.Status().Update(ctx, &pxcCluster); err != nil {
			log.Error(err, "Failed to update PerconaXtraDBCluster status")
			return ctrl.Result{}, err
		}
	}

	return ctrl.Result{}, nil
}

// statefulSetForPXC returns a PXC StatefulSet object
func (r *PerconaXtraDBClusterReconciler) statefulSetForPXC(pxc *dbv1alpha1.PerconaXtraDBCluster) *appsv1.StatefulSet {
	// ... Here you construct the entire StatefulSet object in Go code.
	// This includes setting labels, replicas, pod template, container spec,
	// volume mounts, and the crucial PersistentVolumeClaimTemplate.
	
	// Key considerations in this function:
	// - Owner References: Must set the pxc object as the owner so that when the CR is deleted,
	//   the StatefulSet and its Pods are garbage collected by Kubernetes.
	// - Pod Anti-Affinity: To ensure high availability across AZs, you must configure
	//   pod anti-affinity rules to spread pods across different nodes/zones.
	// - Liveness and Readiness Probes: Define probes that accurately reflect PXC node health.
	//   A simple `mysql -e "SELECT 1"` is not enough. It should check the cluster status (`wsrep_local_state_comment`).
	// - Init Containers: Use init containers to perform bootstrapping logic, like preparing the
	//   data directory or waiting for the first node to be ready before others join.

	// Example of setting owner reference:
	// ctrl.SetControllerReference(pxc, sts, r.Scheme)

	// This function becomes the single source of truth for the StatefulSet's configuration.
	// A change here, a rebuild of the operator, and all managed clusters will be updated on the next reconcile.
	return &appsv1.StatefulSet{ /* ... full definition ... */ }
}
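
The probe comment above says that SELECT 1 is not enough. A minimal sketch of a stricter readiness decision follows, assuming the probe has already collected the node's Galera status variables (for example from SHOW GLOBAL STATUS LIKE 'wsrep_%') into a map. The function name and the choice of exactly these three variables are assumptions of this sketch, not a fixed rule.

```go
package main

// nodeReady decides readiness from Galera status variables rather than from
// a bare connectivity check. A node is considered ready only when it is
// fully synced, belongs to the primary component, and accepts queries.
func nodeReady(status map[string]string) bool {
	return status["wsrep_local_state_comment"] == "Synced" &&
		status["wsrep_cluster_status"] == "Primary" &&
		status["wsrep_ready"] == "ON"
}
```

This check is intentionally strict: a node in the Donor/Desynced state, for instance, is reported as not ready and drops out of the client-facing Service while it serves a state transfer. Whether that is the right trade-off depends on the workload. A liveness probe should be far more lenient, because restarting a node that is merely joining or donating would make recovery slower.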

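**Idempotency**: the reconciliation loop must be idempotent. No matter how many times it runs, the same input (the CR spec) should produce the same output (the state of the child resources). Check whether a resource exists before calling Create, and check whether anything has changed before calling Update.
Error handling and retries: returning an error causes controller-runtime to requeue the request and retry it with exponential backoff. Returning ctrl.Result{Requeue: true} also requeues the request through the same rate-limited work queue, so the next reconcile pass follows shortly even though no error occurred.
**Ownership**: setting the ownership relationship with ctrl.SetControllerReference enables cascading deletion, which is central to how Kubernetes manages resources.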
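
The "check before Create, diff before Update" rule can be shown with an in-memory stand-in for the API server. Everything here (Store, ChildSpec, ensure) is hypothetical. A real controller would typically use controller-runtime's controllerutil.CreateOrUpdate or an explicit Get, compare, and Update sequence.

```go
package main

import "reflect"

// ChildSpec is a simplified stand-in for the parts of a child resource the
// Operator manages.
type ChildSpec struct {
	Replicas int32
	Image    string
	Labels   map[string]string
}

// Store stands in for the API server: resource name -> current spec.
type Store map[string]ChildSpec

// Result reports what ensure actually did.
type Result string

const (
	Created   Result = "created"
	Updated   Result = "updated"
	Unchanged Result = "unchanged"
)

// ensure makes the stored spec equal to desired and writes only when needed.
// Calling it repeatedly with the same input is a no-op after the first call.
func ensure(store Store, name string, desired ChildSpec) Result {
	current, found := store[name]
	if !found {
		store[name] = desired
		return Created
	}
	if reflect.DeepEqual(current, desired) {
		return Unchanged // no write, so no unnecessary Pod restarts
	}
	store[name] = desired
	return Updated
}
```

One caveat when carrying this over to real objects: the API server fills in defaults, so comparing a freshly built StatefulSet against the live one with a plain deep-equal tends to report differences that are not real. Comparing only the fields the Operator owns avoids that.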

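Extensibility and Limitations of the Architecture

This custom Operator approach is no silver bullet. Its operational cost shifts from manual procedures to developing and maintaining Go code.

Limitations of the current approach:

Complexity: building a production-grade Operator requires deep knowledge of the Kubernetes API and Go. Any bug in its logic can cause problems for every database cluster it manages.
Version upgrade logic: the reconciliation loop shown in this article mainly handles resource creation and basic updates. A full database version upgrade (for example from PXC 8.0.32 to 8.0.33) needs more elaborate coordination, such as updating one node at a time and verifying cluster health after each node, which is fairly complex to implement in an Operator.
**Point-in-Time Recovery (PITR)**: the current backup design only implements scheduled full backups. PITR requires the Operator to manage and apply binary logs (binlogs), which adds considerable complexity.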
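
As a rough illustration of the upgrade coordination described above, the sketch below upgrades one node at a time and stops as soon as a health check fails. Node, rollingUpgrade, and the clusterHealthy callback are invented for this example. In a real Operator, the update would go through the StatefulSet (for instance by stepping its rolling-update partition) and the health check would query the cluster's wsrep status.

```go
package main

import "fmt"

// Node is a simplified stand-in for one PXC Pod.
type Node struct {
	Name  string
	Image string
}

// rollingUpgrade moves nodes to targetImage one at a time, highest ordinal
// first, and pauses as soon as the cluster reports unhealthy. It returns the
// number of nodes upgraded in this call.
func rollingUpgrade(nodes []Node, targetImage string, clusterHealthy func([]Node) bool) (int, error) {
	upgraded := 0
	for i := len(nodes) - 1; i >= 0; i-- {
		if nodes[i].Image == targetImage {
			continue // already upgraded; makes a resumed run idempotent
		}
		if !clusterHealthy(nodes) {
			return upgraded, fmt.Errorf("cluster unhealthy before upgrading %s", nodes[i].Name)
		}
		nodes[i].Image = targetImage
		upgraded++
		if !clusterHealthy(nodes) {
			return upgraded, fmt.Errorf("cluster unhealthy after upgrading %s", nodes[i].Name)
		}
	}
	return upgraded, nil
}
```

Because nodes already on the target image are skipped, the function can be called again after an operator has fixed whatever caused the pause, and it resumes where it left off.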

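Future iterations:

Implement a Restore CRD: introduce a new CRD, PerconaXtraDBClusterRestore, that users create to trigger a restore from a specified backup. The Operator watches this resource and orchestrates the whole restore process.
Monitoring integration: integrate with the Prometheus Operator. When a PerconaXtraDBCluster is created, automatically create a ServiceMonitor resource so that Prometheus can discover and scrape MySQL metrics.
Smarter failover: extend the reconciliation logic to handle complex failure scenarios such as network partitions (split-brain) more intelligently, not just Pod-level failures.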
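
For the split-brain case, the basic safety rule any smarter failover logic would need is a quorum check: only a partition holding a strict majority may keep accepting writes. The sketch below assumes equally weighted members, which is a simplification, and the function names are hypothetical.

```go
package main

// hasQuorum reports whether a partition that can reach `reachable` out of
// `total` equally weighted members holds a strict majority. Exactly half is
// not enough, which is why even-sized clusters tolerate no more failures
// than the next smaller odd size.
func hasQuorum(reachable, total int) bool {
	if total <= 0 || reachable < 0 || reachable > total {
		return false
	}
	return reachable*2 > total
}

// maxTolerableFailures returns how many members can be lost while the
// remaining ones still hold a strict majority.
func maxTolerableFailures(total int) int {
	if total <= 0 {
		return 0
	}
	return (total - 1) / 2
}
```

For the three-node cluster in this article, hasQuorum(2, 3) is true and hasQuorum(1, 3) is false, so the Operator should never rebuild or re-bootstrap nodes on the strength of what a single isolated member reports.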

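Choosing to build your own Operator is, in essence, choosing to turn database operations domain knowledge into a core internal infrastructure capability. It is a strategic investment, and for organizations that need to deploy data services at scale, in a standardized way, and at low cost, the long-term return is substantial.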
