Efficient Data Replication with ZFS

This year, I have been working with one of our clients on a typical research-oriented server setup: a few compute servers mounting a single shared storage volume over NFS, which is a common and well-tested configuration. The main difference for this project was the size of the storage. When our team became engaged with this client, they were using ten assorted storage servers based on Linux and FreeNAS. Data was replicated between these servers with rsync, and an elaborate scheme was in place to make sure that each dataset was housed on at least two different servers. All of the storage servers were outdated and out of warranty, so the client agreed to procure new hardware and build a new setup from scratch.

Following the example of the Research Computing team, Ubuntu was selected as the base operating system for both compute and storage servers, deployed on commodity Supermicro hardware resold by Colfax. Cost-effectiveness of the deployment was deemed a decisive factor by the client. To achieve maximum storage density, the client opted for a single 60-drive primary storage server running the ZFS file system. ZFS brings with it all the advantages of a copy-on-write file system: instant copies, snapshots, flexible volume management, built-in NFS sharing, and error resilience and correction.

Below I am going to discuss the SEND/RECEIVE feature of ZFS, which makes it easy to replicate large volumes of data efficiently.

The Scope of the Task

Two identical storage servers are being utilized, each with 400TB of usable space. Needless to say, multiple drives in these 60-bay storage servers were used to ensure redundancy. The drives were split into multiple RAIDZ2 groups, organized into storage pools sized to accommodate the data from each volume on the old storage servers without splitting it across multiple pools. At the time, the total amount of data to migrate was a bit over 200TB.
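
For illustration, a pool built from multiple RAIDZ2 groups can be created roughly like this (the pool name, device names and number of drives per group are assumptions for this sketch, not the actual layout used):

# two 10-drive RAIDZ2 groups combined into a single pool named tank
zpool create tank \
  raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj \
  raidz2 sdk sdl sdm sdn sdo sdp sdq sdr sds sdt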

Migration of the data over 1Gbps network connections from the old storage servers ultimately completed in approximately one month. The transfer speed was heavily impacted by file system overhead due to the large quantity of small files, ranging from a few kilobytes to a few megabytes.

The two new storage servers were placed in two different data centers across campus and connected over redundant 10GbE network links. Replication of the 200TB of data between them keeps the storage in sync, if not in real time, then frequently enough to ensure minimal data loss in case the primary storage box fails and a switchover to the secondary storage is forced.

Initial Replication

Due to a delay in obtaining one of the two storage servers, users were working with the primary storage for a few weeks before the secondary was racked. Data replication with the built-in ZFS facility had been selected from the outset, so this did not present a problem: ZFS SEND/RECEIVE replicates the data along with all of its historical snapshots. Typically the replication is done over ssh like this:

zfs send tank/volume@snapshot | ssh user@receiver.domain.com zfs receive tank/new_volume

zfs send streams a snapshot over ssh to the secondary box, where a zfs receive command started at the same time writes it to the file system. This workflow, while really simple and straightforward, would unfortunately have required another few weeks to complete. Even though the network connection between the servers was fast enough, the throughput left much to be desired. Although ZFS SEND/RECEIVE deals with data blocks rather than individual files, it does not send and receive the data at the same time and rate, which causes the process to stall frequently for short intervals while the sender or the receiver waits for its counterpart.

Local in-memory caching of the data was used to compensate for the “burstiness” of the transfer, utilizing the excellent data stream buffering utility mbuffer:

On the receiver:

mbuffer -s <block_size> -m <buffer_size> -I <port> | zfs receive tank/new_volume

On the sender:

zfs send tank/volume@snapshot | mbuffer -s <block_size> -m <buffer_size> -O receiver.domain.com:<port>

This complicates the transfer just a little, since zfs receive has to be started on the secondary box first, followed by zfs send on the primary, but this was not really a problem since the volumes were large and few in number.
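
For illustration, with a 128k block size, a 4G buffer and an arbitrary port (values assumed here for the sketch, not necessarily the ones used), the pair of commands would look like this:

# started first, on the receiver
mbuffer -s 128k -m 4G -I 9090 | zfs receive tank/new_volume

# started second, on the sender
zfs send tank/volume@snapshot | mbuffer -s 128k -m 4G -O receiver.domain.com:9090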

With large multi-gigabyte buffers on the sending and receiving sides, the entire 200TB of data was replicated in 2.5 days.

Snapshot Replication

The ZFS SEND/RECEIVE facility works by replicating snapshots. It is extremely efficient, sending only the blocks that changed since the previous snapshot over the wire. This makes it possible to ship snapshots between the storage boxes frequently, which is exactly what was needed.
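
For example, passing the previous snapshot with the -i flag produces an incremental stream containing only the blocks changed between the two snapshots (the snapshot names here are hypothetical):

# ship only the delta between the two daily snapshots
zfs send -i tank/volume@20240101 tank/volume@20240102 | ssh user@receiver.domain.com zfs receive tank/volume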

Snapshot creation, scheduling and replication were handled by an application with the quirky name “ZnapZend”. Everything related to ZFS definitely must have at least one “Z” in its name, but more is better. The software is free and open source, which helps with budgetary constraints. The documentation and code are available on GitHub.

The application provides two commands: znapzend, which handles the replication jobs, and znapzendzetup, which configures the replication schedule (and increases the total number of “Z”s as well).

Znapzendzetup is very flexible, allowing you to:

  • Configure any desirable schedule of snapshots per zfs volume.
  • Set a retention policy for each of the scheduled snapshots in the format “7d=>1hr”, “1y=>1w”, etc., meaning keep hourly snapshots for a week and weekly snapshots for a year. You can define any number of such retention policies for a volume.
  • Delay sending of a snapshot by a configurable amount of time, so that snapshots from different volumes can be staggered and do not overwhelm the network bandwidth.
  • Separately configure additional, non-replicated snapshots on the source box if needed.
  • Define any format for the names of the snapshots. I prefer them in YYYYMMDD format.
  • Configure recursive snapshots, i.e. snapshots of nested filesystems.
  • Invoke any command or script prior to taking a snapshot and after it has been taken.
  • Use mbuffer to provide data stream buffering as described above.
  • Store all of this configuration within the ZFS file system as properties. This is very handy if you need to rebuild a server or migrate the disks to a different box.
  • Export and import the configuration created for one volume to a file or to a different volume. This saves a lot of time on repetitive setups for multiple volumes.

Examples can be found on the project's GitHub page, which is also very well documented.
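
As a rough sketch of what such a configuration might look like, following the documented syntax (the dataset names, host and schedule here are made up, not the client's actual configuration):

# hourly snapshots kept for a week locally; hourly for a week plus weekly
# for a year on the destination, streamed through mbuffer
znapzendzetup create --recursive --mbuffer=/usr/bin/mbuffer \
  --tsformat='%Y%m%d-%H%M%S' \
  SRC '7d=>1h' tank/volume \
  DST:a '7d=>1h,1y=>1w' root@receiver.domain.com:tank/volume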

Znapzend itself runs as a daemon. All that is required is to copy the systemd unit provided in the repository to /etc/systemd/system, then enable and start it. Once running, znapzend follows the schedule, creating snapshots on the primary storage and shipping them to the secondary, ensuring that the state of the data on both is identical. It also removes old snapshots according to the retention policy on both primary and secondary storage. Of course, it is recommended to make the volumes on the secondary system read-only using the corresponding ZFS property to ensure that no writes take place that could break the replication.
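
On Ubuntu with systemd this boils down to a few commands (the unit file location within the repository and the dataset name below are assumptions for this sketch):

# install and start the daemon on the primary, using the unit file shipped in the repository
cp init/znapzend.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now znapzend.service

# on the secondary, protect the replicated volume from stray writes
zfs set readonly=on tank/new_volume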

Snapshot replication on the system proved efficient enough that the replication interval is set to every 15 minutes, using ssh without any buffering. The frequency could be increased further thanks to this efficiency, but the current interval meets the needs of the client.

Conclusion

Setting up replication using the ZFS SEND/RECEIVE feature proved to be a straightforward and nearly effortless process thanks to the znapzend software and the efficiency of ZFS at sending snapshots. It makes it easy to achieve backup RTO and RPO of a few minutes, which is enough for the majority of uses in research clusters.