MySQL Group Replication – Default response to network partitions has changed

MySQL Group Replication allows you to create a highly available replication group of servers with minimum effort. It provides automated mechanisms to detect and respond to failures in the members of the group. The response depends on the characteristics of each failure and is configurable. To reduce the need for manual intervention whenever there is a temporary network partition or a server slowdown, in MySQL 8.0.21 we have changed the default values of two system variables:

  • group_replication_member_expel_timeout
  • group_replication_autorejoin_tries

In this blog post we will review MySQL Group Replication’s mechanisms to detect and respond to failures, how they can be configured and why we changed the default configuration in MySQL 8.0.21.

Introduction

MySQL Group Replication provides consistent database replication across a group of servers. It is continuously available and can tolerate failures as long as the majority of the members are working properly and are able to communicate between themselves. It has an automated failure detection mechanism to identify group members that are no longer communicating and expel them when it is likely that they have failed.

Here is a brief overview of the failure detection mechanism. Each member monitors the messages it exchanges with the other members. If it doesn’t receive any message from another member for some time, it creates a suspicion. If the majority of the group members (more than half) agree with this suspicion, the suspected member is expelled and, as a result, excluded from the group.
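The majority rule above can be sketched in a few lines of Python. This is only an illustration of the "more than half" condition, not how Group Replication implements it; the function name is_majority is made up for this example.

```python
def is_majority(agreeing_members: int, group_size: int) -> bool:
    """Return True when strictly more than half of the group agrees.

    Illustrative helper only; the name and signature are assumptions
    made for this example, not part of MySQL.
    """
    if group_size <= 0 or not 0 <= agreeing_members <= group_size:
        raise ValueError("invalid member counts")
    return 2 * agreeing_members > group_size


# In a 3-member group, 2 members agreeing on a suspicion form a majority.
print(is_majority(2, 3))  # True
print(is_majority(1, 3))  # False
# Exactly half of the group is not a majority.
print(is_majority(2, 4))  # False
```

Note that an even-sized group split exactly in half leaves neither side with a majority.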

In an asynchronous distributed system it is impossible to build a perfect failure detector [Chandra&Toueg(1996)], since there is no bound on the transmission delay of messages or on the processing time of code. It is thus possible that a member suspected of failure is still alive and working correctly but, due to a transient machine slowdown or temporary network failure, takes longer than expected to respond to the other members.

Group Replication provides two features to increase the likelihood of keeping the member in the group during a transient communication failure:

  • defer the expulsion of the member
  • enable member auto-rejoin

In MySQL 8.0.21 we have changed the default configuration of these features so that the member stays in the group longer and, if it does leave, tries to rejoin automatically.

The group waits longer before expelling unreachable members

In MySQL 8.0.13 we introduced the system variable group_replication_member_expel_timeout. It configures the additional time interval, in seconds, between the creation of the suspicion of member failure and the member’s expulsion from the group. It is described in detail in the “Coping with unreliable failure detection” blog post.

Up to and including MySQL 8.0.20, the value of this system variable defaults to 0. Since there is a 5-second waiting period before the suspicion is created, the member is actually expelled 5 seconds after communication with the group is interrupted. In MySQL 8.0.21 we have increased the default value to 5 seconds, which means that the group waits 10 seconds before expelling the unreachable member. The member could potentially survive for a few more seconds after this timeout because the check for expired suspicions is carried out periodically.
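The arithmetic behind these numbers can be sketched in Python. This is a simplified model under the figures stated in this post (a 5-second wait before the suspicion is created, plus the configured expel timeout); the function names are made up for this example, and the model ignores the extra few seconds that the periodic suspicion check can add.

```python
SUSPICION_DELAY_S = 5  # waiting period before a suspicion is created (from this post)


def earliest_expulsion_s(member_expel_timeout_s: int) -> int:
    """Earliest time, in seconds after communication is lost, at which
    the group may expel the member. Hypothetical helper for illustration."""
    if member_expel_timeout_s < 0:
        raise ValueError("timeout must be non-negative")
    return SUSPICION_DELAY_S + member_expel_timeout_s


def may_be_expelled(partition_s: float, member_expel_timeout_s: int) -> bool:
    """True when the partition outlasts the earliest expulsion time."""
    return partition_s > earliest_expulsion_s(member_expel_timeout_s)


print(earliest_expulsion_s(0))  # 5  (default up to MySQL 8.0.20)
print(earliest_expulsion_s(5))  # 10 (default from MySQL 8.0.21)
print(may_be_expelled(10, 5))   # False
print(may_be_expelled(30, 5))   # True
```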

Let’s observe what happens in a replication group with 3 members after the network connection between the third node and the rest of the group is dropped.

The network partition is kept for 10 seconds. Since the partition does not last longer than 10 seconds, the third member is not expelled from the group. As expected, each partition sees the rest of the group as UNREACHABLE.

Then the network connectivity is restored.

Shortly afterwards, the third member becomes ONLINE.

A member still waits forever to reach the majority of the group

Whenever there is a network partition, the group can decide to expel an unreachable member. A member can also decide by itself to voluntarily leave the group if it cannot contact the majority of its members. The duration of the waiting period before leaving the group is set using the system variable group_replication_unreachable_majority_timeout. By default this variable is set to 0, which means that members that find themselves in a minority due to a network partition wait indefinitely and never leave the group on their own. This value has not been changed in MySQL 8.0.21. You should keep the default value to avoid having to manually rejoin members that left the group because this timeout expired during a network partition.

An expelled member rejoins automatically

Once a member has either been expelled or has given up waiting to reach the group, and it is able to restore communication with the group, there is a way for it to rejoin automatically without user intervention: the auto-rejoin feature. This feature was introduced in MySQL 8.0.16 and is configurable with the system variable group_replication_autorejoin_tries. See the “Enabling member auto-rejoin in group replication” blog post for details.

Up to and including MySQL 8.0.20, its value defaults to 0, that is, the member does not try to rejoin the group. In MySQL 8.0.21 we have changed the default value to 3, meaning that the member makes three automatic attempts to rejoin the group, with a 5-minute interval between attempts.
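As a rough illustration of the retry schedule, the following Python sketch lists when each attempt would start, measured from the moment auto-rejoin begins. It assumes the first attempt starts immediately, which this post does not state explicitly; the function name is made up for this example.

```python
from typing import List

REJOIN_INTERVAL_S = 5 * 60  # 5-minute interval between attempts (from this post)


def rejoin_attempt_offsets(autorejoin_tries: int) -> List[int]:
    """Start time, in seconds, of each rejoin attempt.

    Assumption: the first attempt starts at offset 0 and every later
    attempt starts one interval after the previous one.
    """
    if autorejoin_tries < 0:
        raise ValueError("tries must be non-negative")
    return [i * REJOIN_INTERVAL_S for i in range(autorejoin_tries)]


print(rejoin_attempt_offsets(3))  # [0, 300, 600]
print(rejoin_attempt_offsets(0))  # [] (auto-rejoin disabled)
```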

It’s important to note that, in the presence of a network partition, auto-rejoin kicks in as soon as the partition is resolved, regardless of how long the partition lasted. An expelled member needs to resume communication with the group in order to learn that it was expelled. Only after the network partition is resolved does it enter ERROR state and try to rejoin automatically. And since group_replication_unreachable_majority_timeout defaults to 0, a member waits indefinitely to reach the group majority.

Let’s observe what happens in a replication group with 3 members when the network connection between the third node and the rest of the group is dropped for a duration longer than 10 seconds.

The network partition is kept for 30 seconds. Since the elapsed time is longer than 10 seconds, the third member is expelled by the rest of the group. But since it doesn’t receive any message from the group, it still considers itself ONLINE.

Let’s restore the network connection between the third member and the rest of the group.

A few seconds later the third member receives a message with the expulsion and enters ERROR state. Since auto-rejoin is enabled, it tries to rejoin the group automatically.

After a few seconds the third member successfully rejoins the group and becomes ONLINE.

If the member cannot rejoin the group after 3 attempts, it proceeds to the action specified by the variable group_replication_exit_state_action. The behavior of this variable was explained in the “Fail early fail fast preventing stale reads in group replication” blog post, so it is not discussed here. At this point, in order to rejoin a member you need to add it manually or have a script do it automatically.
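The outcome for an expelled member can be summarized with a small Python model. It is a deliberately simplified sketch of the flow described in this section; the function name and the "EXIT_STATE_ACTION" label are made up for this example and are not MySQL member states.

```python
from typing import Optional


def outcome_for_expelled_member(
    autorejoin_tries: int, first_successful_attempt: Optional[int]
) -> str:
    """Model what happens once an expelled member is in ERROR state.

    first_successful_attempt is the 1-based attempt that would succeed,
    or None if no attempt would ever succeed (hypothetical input).
    """
    for attempt in range(1, autorejoin_tries + 1):
        if first_successful_attempt == attempt:
            return "ONLINE"  # rejoined within the allowed attempts
    # All attempts used up (or auto-rejoin disabled): the member follows
    # whatever group_replication_exit_state_action specifies.
    return "EXIT_STATE_ACTION"


print(outcome_for_expelled_member(3, 1))     # ONLINE
print(outcome_for_expelled_member(3, None))  # EXIT_STATE_ACTION
print(outcome_for_expelled_member(0, 1))     # EXIT_STATE_ACTION
```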

Trade-offs of these changes

Increasing the value of group_replication_member_expel_timeout has its trade-offs, since a member might stay in state UNREACHABLE for longer. While there is at least one unreachable member, the group has the following limitations:

  • Group membership re-configurations aren’t allowed (members cannot be added or removed)
  • For single-primary mode a new group primary cannot be elected
  • Although the unreachable member does not accept writes, reads can still be made as long as it is able to communicate with its clients, which increases the likelihood of stale reads
  • If the Group Replication consistency level is AFTER or BEFORE_AND_AFTER, a write transaction must wait for the unreachable member to become online and apply it

If you cannot afford these limitations and want to prioritize data consistency over minimum user intervention then set the group_replication_member_expel_timeout system variable to 0.

If an unreachable member resumes communication with the group, it uses the XCom message cache to retrieve the messages that the group exchanged while it was away. The cache size limit is set using the system variable group_replication_message_cache_size. If you previously tuned the size of the XCom message cache with reference to the expected volume of messages during the previous default waiting period before member expulsion (5 seconds), you might start to see warning messages from GCS on active group members stating that a message that is likely to be needed for recovery by an unreachable member has been removed from the message cache. If this situation occurs, you should increase your group_replication_message_cache_size setting to account for the new expel timeout.
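A back-of-the-envelope way to reason about this is to multiply the expected message volume by the time a member can remain unreachable before expulsion. The Python sketch below does exactly that. It is not an official sizing formula; the function name and the 20 MiB/s traffic figure are assumptions chosen for illustration.

```python
import math

MIB = 1024 * 1024
SUSPICION_DELAY_S = 5  # waiting period before a suspicion is created (from this post)


def min_cache_bytes(message_bytes_per_s: float, member_expel_timeout_s: int) -> int:
    """Rough lower bound on the message cache size, in bytes.

    Assumption: the cache should hold at least the messages exchanged
    while a member can stay unreachable without being expelled.
    """
    window_s = SUSPICION_DELAY_S + member_expel_timeout_s
    return math.ceil(message_bytes_per_s * window_s)


rate = 20 * MIB  # hypothetical group traffic: 20 MiB of messages per second
print(min_cache_bytes(rate, 0) // MIB)  # 100 (old default, 5-second window)
print(min_cache_bytes(rate, 5) // MIB)  # 200 (new default, 10-second window)
```

Under this simple model, moving from the old default to the new one doubles the volume of messages the cache needs to retain.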

If the member is expelled and then resumes communication with the group, auto-rejoin starts. While auto-rejoin is in progress the group can function normally: members can be added and removed, and a new primary can be elected. The constraints on transaction concurrency when the consistency level is AFTER or BEFORE_AND_AFTER are also absent. However, clients can still connect to the expelled member and read stale data. So if you want to avoid stale reads for any period of time, set the group_replication_autorejoin_tries system variable to 0.

Summary

To summarize, in MySQL 8.0.21 we have changed the default response of Group Replication to temporary network partitions and server slowdowns so that:

  • the replication group waits at least 10 seconds before expelling a member to which it cannot connect
  • a member which left the group makes 3 attempts to automatically rejoin

With this change you won’t need to manually rejoin members that were expelled from the group due to transient failures.


About Pedro Ribeiro

Quality Assurance and Test Automation Engineer in MySQL Components-Quality Engineering Team. Responsible for verification of MySQL Replication and Group Replication worklogs and automation of test suites. Before joining Oracle worked as system tester of Optical Networks. Has a PhD in Physics from Universidade Técnica de Lisboa.
