Splitting and scaling a Redis database without downtime

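At the start of a service's development it is hard to estimate how large the data will eventually become. Planning for the maximum capacity up front makes the early project more complex and wastes resources.

So very often all the data is put into a single Redis instance, and it is split (partitioning) only later, once the service has grown and a single instance can no longer handle the expected traffic.

Redis has many data migration tools, such as redis-copy, redis-copy.py, and migrate, but migrating a large amount of data takes a long time and affects the stability of the service.

The only truly reliable way to migrate is probably Redis replication.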

Quoted from Partitioning: how to split data among multiple Redis instances. – Redis

Using Redis replication you will likely be able to do the move with minimal or no downtime for your users:

  • Start empty instances in your new server.
  • Move data configuring these new instances as slaves for your source instances.
  • Stop your clients.
  • Update the configuration of the moved instances with the new server IP address.
  • Send the SLAVEOF NO ONE command to the slaves in the new server.
  • Restart your clients with the new updated configuration.
  • Finally shut down the no longer used instances in the old server.

Quoted from Redis Administration – Redis

Upgrading or restarting a Redis instance without downtime

Redis is designed to be a very long running process in your server. For instance many configuration options can be modified without any kind of restart using the CONFIG SET command.

Starting from Redis 2.2 it is even possible to switch from AOF to RDB snapshots persistence or the other way around without restarting Redis. Check the output of the CONFIG GET * command for more information.

However from time to time a restart is mandatory, for instance in order to upgrade the Redis process to a newer version, or when you need to modify some configuration parameter that is currently not supported by the CONFIG command.

The following steps provide a very commonly used way in order to avoid any downtime.

  • Setup your new Redis instance as a slave for your current Redis instance. In order to do so you need a different server, or a server that has enough RAM to keep two instances of Redis running at the same time.
  • If you use a single server, make sure that the slave is started in a different port than the master instance, otherwise the slave will not be able to start at all.
  • Wait for the replication initial synchronization to complete (check the slave log file).
  • Make sure using INFO that there are the same number of keys in the master and in the slave. Check with redis-cli that the slave is working as you wish and is replying to your commands.
  • Allow writes to the slave using CONFIG SET slave-read-only no
  • Configure all your clients in order to use the new instance (that is, the slave).
  • Once you are sure that the master is no longer receiving any query (you can check this with the MONITOR command), elect the slave to master using the SLAVEOF NO ONE command, and shut down your master.
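
The two checks in the list above (the initial synchronization has finished, and master and slave hold the same number of keys) could be scripted roughly as follows. This is a minimal sketch and not part of the quoted documentation: `parse_info`, `key_count`, and `replica_ready` are hypothetical helper names, the input is assumed to be the raw text reply of the INFO command obtained with whatever client you use, and only db0 is compared by default.

```python
def parse_info(text):
    """Parse the "key:value" lines of a Redis INFO reply into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def key_count(info, db="db0"):
    """Read the key count from a keyspace line such as "db0:keys=42,expires=0"."""
    fields = dict(part.split("=", 1) for part in info.get(db, "keys=0").split(","))
    return int(fields["keys"])


def replica_ready(master_info, replica_info, tolerance=0):
    """True when the replica link is up, no sync is running, and key counts match."""
    link_up = replica_info.get("master_link_status") == "up"
    syncing = replica_info.get("master_sync_in_progress") != "0"
    drift = abs(key_count(master_info) - key_count(replica_info))
    return link_up and not syncing and drift <= tolerance
```

On a master that is still receiving writes the two key counts can differ for a moment, so a small non-zero `tolerance` may be more practical than an exact match.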

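The following steps can be repeated until the data is split to a very fine granularity. Note that this method can only split part of the data off into a brand-new Redis instance.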

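  • Create a new Redis instance as a slave of the old Redis instance
  • Have the service connect to both the old and the new Redis instance

    The migration requires a code update and a service restart, so the service must support graceful restart: its processes restart one after another so that users do not notice any interruption.

    Because the service connects to both instances in advance, the migration steps that follow need no further restart and can be completed in an instant with a single action.

    After the migration, clearing the stale data out of the old and new Redis instances may take a long time. Business logic that walks the data with SCAN must therefore tolerate stale data: when a scanned key does not belong to the current Redis instance according to the hash rule, ignore it so that stale data is never used.

    Once the slave's replication has caught up, go on to the next step.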

  • Make the new Redis instance writable

    config set slave-read-only no
    

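    The new Redis instance now accepts writes as well, and writes sent to the old Redis are still replicated to it, so essentially no data is lost during the migration.

    This requires both instances to be stable: a full resynchronization replaces the slave's dataset with the master's, so any data written directly to the new instance would be lost.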

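  • Have the service access the migrated data through the new Redis instance

    This can be done by broadcasting a message to all service nodes, which quickly switches their Redis access over to the new instance.

    Normally the migrated data in the old Redis should no longer be read or written. If it is, something has probably not been fully migrated: find the source of the access immediately and either stop it or migrate it.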

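  • Break the master-slave relationship between the new Redis instance and the old one

    Change the role of the new Redis instance to master and set the slave-read-only option back to yes.

    The new Redis instance can then be monitored with Redis Sentinel for high availability.

  • Delete the extra data that was copied to the new Redis instance
  • Delete the already-migrated data from the old Redis instance

    Once all of its data has been migrated away, the old instance can be shut down.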
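
The routing switch and the SCAN filtering described in the steps above might look like the sketch below. Everything in it is an assumption made for illustration: the hash rule (CRC32 of the key modulo the shard count), the names `owner`, `Router`, and `owned_keys`, and the labels "old" and "new", which stand in for the two Redis connections the service keeps open.

```python
import zlib


def owner(key, shards):
    """Assumed hash rule: CRC32 of the key modulo the number of shards."""
    return zlib.crc32(key.encode("utf-8")) % shards


class Router:
    """Decides which instance serves a key; switch() is the one-step cut-over."""

    def __init__(self, moved_shards, shards):
        self.moved_shards = set(moved_shards)  # shards that move to the new instance
        self.shards = shards
        self.switched = False

    def switch(self):
        # Called when the broadcast message arrives; no restart is needed
        # because both connections are already open.
        self.switched = True

    def instance_for(self, key):
        if self.switched and owner(key, self.shards) in self.moved_shards:
            return "new"
        return "old"


def owned_keys(scanned_keys, instance, router):
    """Drop stale keys that a SCAN returned but that belong to the other instance."""
    return [key for key in scanned_keys if router.instance_for(key) == instance]
```

Until the stale copies are deleted, each instance still holds keys it no longer owns, and `owned_keys` is what keeps a scan from acting on them.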

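I recently read the article "Online Data Migration Experience: How to Replace the Engine of an Airplane in Flight" and found that the Redis procedure above closely matches the online data migration steps it describes:

  • Before the migration

    Write old, read old

  • Deploy dual writes

    Write new, write old, read old

  • Move the historical data

    Write new, write old, read old

  • Switch reads

    Write new, write old, read new

  • Clean up

    Write new, read new.

This is a general pattern for data migration.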
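
The phases can be condensed into a small state machine. The following is only a sketch under stated assumptions: two plain Python dicts stand in for the old and new stores, and `Phase` and `MigratingStore` are hypothetical names, not taken from either article.

```python
from enum import Enum


class Phase(Enum):
    BEFORE = 1      # write old, read old
    DUAL_WRITE = 2  # write new and old, read old (also covers the backfill)
    READ_NEW = 3    # write new and old, read new
    CLEANUP = 4     # write new, read new


class MigratingStore:
    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.phase = Phase.BEFORE

    def write(self, key, value):
        if self.phase is not Phase.CLEANUP:
            self.old[key] = value
        if self.phase is not Phase.BEFORE:
            self.new[key] = value

    def read(self, key):
        use_new = self.phase in (Phase.READ_NEW, Phase.CLEANUP)
        return (self.new if use_new else self.old).get(key)

    def backfill(self):
        # Copy historical data without overwriting values that
        # dual writes have already placed in the new store.
        for key, value in self.old.items():
            self.new.setdefault(key, value)
```

Each transition changes only one thing, either the write targets or the read source, so every phase stays compatible with the one before it and can be rolled back.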

Another article, "Database migrations done right" by Michael Brunton-Spall, states a basic principle:

Every change you make must be backward compatible with the rest of the system

