Arbiter volumes and quorum options in gluster

The arbiter volume is special subset of replica volumes that is aimed at preventing split-brains and providing the same consistency guarantees as a normal replica 3 volume without consuming 3x space.

Arbiter configuration

The syntax for creating the volume is:

# gluster volume create <VOLNAME>  replica 3 arbiter 1 <NEW-BRICK> ...

Note: Volumes using the arbiter feature can only be replica 3 arbiter 1

For example:

# gluster volume create testvol replica 3 arbiter 1  server{1..6}:/bricks/brick
volume create: testvol: success: please start the volume to access data

This means that for every 3 bricks listed, 1 of them is an arbiter. We have created 6 bricks. With a replica count of three, each 3 bricks in series will be a replica subvolume. Since we have two sets of 3, this created a distribute subvolume made of up two replica subvolumes.

Each replica subvolume is defined to have 1 arbiter out of the 3 bricks. The arbiter bricks are taken from the end of each replica subvolume.

# gluster volume info
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: ae6c4162-38c2-4368-ae5d-6bad141a4119
Status: Created
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/brick
Brick2: server2:/bricks/brick
Brick3: server3:/bricks/brick (arbiter)
Brick4: server4:/bricks/brick
Brick5: server5:/bricks/brick
Brick6: server6:/bricks/brick (arbiter)
Options Reconfigured  :
transport.address-family: inet
performance.readdir-ahead: on  `

The arbiter brick will store only the file/directory names (i.e. the tree structure) and extended attributes (metadata) but not any data. i.e. the file size (as shown by ls -l) will be zero bytes. It will also store other gluster metadata like the .glusterfs folder and its contents.

Note: Enabling the arbiter feature automatically configures client-quourm to 'auto'. This setting is not to be changed.

Arbiter brick(s) sizing

Since the arbiter brick does not store file data, its disk usage will be considerably less than the other bricks of the replica. The sizing of the brick will depend on how many files you plan to store in the volume. A good estimate will be 4kb times the number of files in the replica.

Why Arbiter?

Split-brains in replica volumes

When a file is in split-brain, there is an inconsistency in either data or metadata (permissions, uid/gid, extended attributes etc.) of the file amongst the bricks of a replica and we do not have enough information to authoritatively pick a copy as being pristine and heal to the bad copies, despite all bricks being up and online. For directories, there is also an entry-split brain where a file inside it has different gfids/ file-type (say one is a file and another is a directory of the same name) across the bricks of a replica.

This document describes how to resolve files that are in split-brain using gluster cli or the mount point. Almost always, split-brains occur due to network disconnects (where a client temporarily loses connection to the bricks) and very rarely due to the gluster brick processes going down or returning an error.

Server-quorum and some pitfalls

This document provides a detailed description of this feature. The volume options for server-quorum are:

Option:cluster.server-quorum-ratio Value Description: 0 to 100

Option:cluster.server-quorum-type Value Description: none | server If set to server, this option enables the specified volume to participate in the server-side quorum. If set to none, that volume alone is not considered for volume checks.

The cluster.server-quorum-ratio is a percentage figure and is cluster wide- i.e. you cannot have a different ratio for different volumes in the same trusted pool.

For a two-node trusted storage pool it is important to set this value to be greater than 50% so that two nodes separated from each other do not both believe they have quorum simultaneously. For a 2 node plain replica volume, this would mean both nodes need to be up and running. So there is no notion of HA/failover.

There are users who create a replica 2 volume from 2 nodes and peer-probe a 'dummy' node without bricks and enable server quorum with a ratio of 51%. This does not prevent files from getting into split-brain. For example, if B1 and B2 are the bricks/nodes of the replica and B3 is the dummy node, we can still end up in split-brain like so:

  1. B1 goes down, B2 and B3 are up. server-quorum is still. File is modified by the client.
  2. B2 goes down, B1 comes back up. Server-quorum is met. Same file is modified by the client.
  3. We now have different contents for the file in B1 and B2 ==>split-brain.

In the author’s opinion, server quorum is useful if you want to avoid split-brains to the volume(s) configuration across the nodes and not in the I/O path. Unlike in client-quorum where the volume becomes read-only when quorum is lost, loss of server-quorum in a particular node makes glusterd kill the brick processes on that node (for the participating volumes) making even reads impossible.

Client Quorum

Client-quorum is a feature implemented in AFR to prevent split-brains in the I/O path for replicate/distributed-replicate volumes. By default, if the client-quorum is not met for a particular replica subvol, it becomes read-only. The other subvols (in a dist-rep volume) will still have R/W access.

The following volume set options are used to configure it:

Option: cluster.quorum-type Default Value: none Value Description: none|auto|fixed If set to "fixed", this option allows writes to a file only if the number of active bricks in that replica set (to which the file belongs) is greater than or equal to the count specified in the 'quorum-count' option. If set to "auto", this option allows writes to the file only if number of bricks that are up >= ceil (of the total number of bricks that constitute that replica/2). If the number of replicas is even, then there is a further check: If the number of up bricks is exactly equal to n/2, then the first brick must be one of the bricks that is up. If it is more than n/2 then it is not necessary that the first brick is one of the up bricks.

Option: cluster.quorum-count Value Description: The number of bricks that must be active in a replica-set to allow writes. This option is used in conjunction with cluster.quorum-type =fixed option to specify the number of bricks to be active to participate in quorum. If the quorum-type is auto then this option has no significance.

Option: cluster.quorum-reads Default Value: no Value Description: yes|no If quorum-reads is set to 'yes' (or 'true' or 'on') then even reads will be allowed only if quorum is met, without which the read (and writes) will return ENOTCONN. If set to 'no' (or 'false' or 'off'), then reads will be served even when quorum is not met, but writes will fail with EROFS.

Replica 2 and Replica 3 volumes

From the above descriptions, it is clear that client-quorum cannot really be applied to a replica 2 volume:(without costing HA). If the quorum-type is set to auto, then by the description given earlier, the first brick must always be up, irrespective of the status of the second brick. IOW, if only the second brick is up, the subvol becomes EROFS, i.e. no HA. If quorum-type is set to fixed, the the quorum-count has to be two to prevent split-brains. (otherwise a write can succeed in brick1, another in brick2 =>split-brain). So for all practical purposes, if you want high availabilty in a replica 2 volume, it is recommended not to enable client-quorum.

In a replica 3 volume, client-quorum is enabled by default and set to 'auto'. This means 2 bricks need to be up for the writes to succeed. Here is how this configuration prevents files from ending up in split-brain:

Say B1, B2 and B3 are the bricks: 1. B3 is down, quorum is met, write happens on the file on B1 and B2 2. B3 comes up, B2 is down, quorum is again met, a write happens on B1 and B3. 3. B2 comes up, B1 goes down, quorum is met. Now when a write is issued, AFR sees that B2 and B3's pending xattrs blame each other and therefore the write is not allowed and is failed with an EIO.

There is a corner case even with replica 3 volumes where the file can end up in a split-brain. AFR usually takes range locks for the {offset, length} of the write. If 3 writes happen on the same file at non-overlapping {offset, length} and each write fails on (only) one different brick, then we have AFR xattrs of the file blaming each other.

How Arbiter works

There are 2 components to the arbiter volume. One is the arbiter xlator that is loaded in the brick process of every 3rd (i.e. the arbiter) brick. The other is the arbitration logic itself that is present in AFR (the replicate xlator) loaded on the clients.

The former acts as a sort of 'filter' translator for the FOPS- i.e. it allows entry operations to hit posix, blocks certain inode operations like read (unwinds the call with ENOTCONN) and unwinds other inode operations like write, truncate etc. with success without winding it down to posix.

The latter. i.e. the arbitration logic present in AFR does the following:

  • Takes full file locks when writing to a file as opposed to range locks in a normal replicate volume. This prevents the corner-case split-brain described earlier for 3 way replicas.

The behaviour of arbiter volumes in allowing/failing write FOPS in conjunction with client-quorum can be summarized in the below steps:

  • If all 3 bricks are up (happy case), then there is no issue and the FOPs are allowed.

  • If 2 bricks are up and if one of them is the arbiter (i.e. the 3rd brick) and it blames the other up brick for a given file, then all write FOPS will fail with ENOTCONN. This is because in this scenario, the only true copy is on the brick that is down. Hence we cannot allow writes until that brick is also up. If the arbiter doesn't blame the other brick, FOPS will be allowed to proceed. 'Blaming' here is w.r.t the values of AFR changelog extended attributes.

  • If 2 bricks are up and the arbiter is down, then FOPS will be allowed. When the arbiter comes up, the entry/metadata heals to it happen. Of course data heals are not needed.

  • If only one brick is up, then client-quorum is not met and the volume becomes EROFS.

  • In all cases, if there is only one source before the FOP is initiated (even if all bricks are up) and if the FOP fails on that source, the application will receive ENOTCONN. For example, assume that a write failed on B2 and B3, i.e. B1 is the only source. Now if for some reason, the second write failed on B1 (before there was a chance for selfheal to complete despite all brick being up), the application would receive failure (ENOTCONN) for that write.

The bricks being up or down described above does not necessarily mean the brick process is offline. It can also mean the mount lost the connection to the brick due to network disconnects etc.