Recovery Enhancements for Group Replication

One of the key features of MySQL Group Replication is its distributed recovery mechanism. Whenever a new member joins a server group, it uses this plugin component to contact a suitable donor and fetch the data it is missing, up to the point where it is declared online.

Such a critical component of the plugin needs to be user friendly and, more importantly, fault tolerant. With this in mind, we introduced several enhancements in this new release.

Random donor selection

For simplicity, up until this release, a server joining a group would always pick the server that came first in the list of servers reported to be in the group. While this is reasonable for fairly static scenarios, it carries the risk of the same member being selected over and over and serving more than one joiner at the same time, which should be avoided.

Without adding new metrics to the system, a fairly straightforward change was to select a random donor from the existing online members of the group. This way there is a better chance that the same server is not selected more than once when multiple members enter the group.
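As an illustration only, the idea can be sketched in a few lines of Python. This is not the plugin's code: the `pick_donor` function, the member tuples, and the state strings are hypothetical names chosen for the example.

```python
import random

def pick_donor(members, rng=random):
    """Return a random ONLINE member name, or None if there is none.

    `members` is a list of (name, state) tuples. This layout is an
    assumption made for the sketch, not the plugin's data structure.
    """
    online = [name for name, state in members if state == "ONLINE"]
    return rng.choice(online) if online else None

members = [
    ("server1", "ONLINE"),
    ("server2", "RECOVERING"),  # not a valid donor: it is not online yet
    ("server3", "ONLINE"),
]
donor = pick_donor(members)
print(donor)  # either server1 or server3
```

With several joiners each calling `pick_donor` independently, the load tends to spread across the online members instead of always landing on the first one in the list.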

Enhanced automatic donor switchover

Another main point of concern when improving recovery as a whole was to enhance its error detection mechanisms. In past versions, when reaching out to a donor, recovery could only detect connection errors caused by authentication issues or some other problem. In response, a donor switchover would happen, and a new connection attempt was made to a different member.

From this version on, we extended this behavior to other error scenarios, so now recovery will also react to:

  • Purged data scenarios: If the selected donor has purged some data
    that is needed for the recovery process, an error will occur.
    Recovery detects this error and a new donor is selected.
  • Duplicated data: If a joining member already contains some data
    that conflicts with the data coming from the selected donor
    during recovery, an error will occur. This may be due to some
    errant transactions present in the joiner.
    One could argue that recovery should fail instead of switching over
    to another donor, but in heterogeneous groups some members may
    share these conflicting transactions while others don’t. For that
    reason, upon error, recovery will select another donor from the
    group.
  • Other errors: Any other error that makes the recovery SQL or IO
    thread fail. Any stop of either of these threads will result in
    the choice of another donor.
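The switchover behavior described above can be pictured with a small Python sketch. It is a simplified model, not the plugin's implementation: the exception classes, `recover_from`, and `fetch` are hypothetical names introduced only for this example.

```python
class RecoveryError(Exception):
    """Base class for the error scenarios in this sketch."""

class ConnectionFailed(RecoveryError):
    """Could not connect to the donor (for example, bad credentials)."""

class PurgedData(RecoveryError):
    """The donor no longer has data the joiner needs."""

class DuplicatedData(RecoveryError):
    """The joiner holds data that conflicts with the donor's."""

class ThreadStopped(RecoveryError):
    """The recovery SQL or IO thread stopped for some other reason."""

def recover_from(donors, fetch):
    """Try each donor in turn; any RecoveryError triggers a switchover.

    Returns (donor_used, failures), with donor_used set to None when
    every donor failed.
    """
    failures = []
    for donor in donors:
        try:
            fetch(donor)
            return donor, failures
        except RecoveryError as err:
            failures.append((donor, type(err).__name__))
    return None, failures

# Simulated group: two donors fail in different ways, the third works.
faults = {
    "server1": PurgedData("needed data was purged"),
    "server2": DuplicatedData("joiner has conflicting transactions"),
}

def fetch(donor):
    if donor in faults:
        raise faults[donor]

donor, failures = recover_from(["server1", "server2", "server3"], fetch)
print(donor)     # server3
print(failures)  # [('server1', 'PurgedData'), ('server2', 'DuplicatedData')]
```

Note that the duplicated data case is treated like any other failure: the joiner simply moves on, since a different donor may not carry the conflicting transactions.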

Better timeout and sleep routines

Given that the actual recovery data transfer relies on the binary log and the existing MySQL replication framework, transient errors can make the IO or SQL thread error out. For such cases we also reworked the donor switchover process in this new version. It now has a more familiar look and feel when compared to standard replication.

Number of attempts

First of all, we reconfigured the number of attempts a joiner makes when trying to connect to a donor. Until now, the default number of attempts made by a joining member to find a suitable donor was equal to the number of online members when it joined the group. The logic behind this was that if we had tried all possible donors and none was suitable, it would be pointless to continue.

We now set the default value to a more familiar number, as the old default was too small when compared with its replication slave counterpart: 86400 tries.

Note that this is the global number of attempts the joiner makes across all the suitable donors, not a limit per donor.

Sleep routines

Related to the number of attempts, we also reworked the sleep routines associated with the donor switchover process in error cases.

For this we introduced a new plugin variable:

You can check its value and update it if desired.

This variable, with a default of 60 seconds (the same as for a replication slave), determines how long recovery sleeps between attempts.

Note, however, that recovery will not sleep after every donor connection attempt. Since we are connecting to different servers and not to the same one over and over again, we can assume that a problem that affects server A may not affect server B.

As such, recovery will suspend only when it has gone through all the possible donors. When the joiner has tried to connect to all the suitable donors in the group and none remains, recovery will sleep for the number of seconds configured in the reconnect interval variable.
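To make the interplay between the attempt limit and the sleep interval concrete, here is a rough Python sketch of the loop described above. It is a model under stated assumptions, not the plugin's code: `run_recovery`, `try_donor`, `retry_count`, and `reconnect_interval` are hypothetical names for this example and do not refer to actual plugin variables.

```python
import time

def run_recovery(donors, try_donor, retry_count, reconnect_interval,
                 sleep=time.sleep):
    """Cycle through the donors until one works or attempts run out.

    `retry_count` is a global budget shared by all donors. The joiner
    sleeps only after a full pass over the donors has failed, never
    between two different donors.
    Returns (donor_used_or_None, attempts_made).
    """
    if not donors:
        return None, 0
    attempts = 0
    while True:
        for donor in donors:
            if attempts == retry_count:
                return None, attempts
            attempts += 1
            if try_donor(donor):
                return donor, attempts
        if attempts == retry_count:
            return None, attempts
        # Every donor was tried in this pass: wait before starting over.
        sleep(reconnect_interval)

# Simulation: the first four attempts fail and the fifth succeeds.
sleeps = []                      # records sleep calls instead of waiting
outcomes = iter([False, False, False, False, True])
donor, attempts = run_recovery(
    ["A", "B", "C"],
    lambda d: next(outcomes),
    retry_count=10,
    reconnect_interval=60,
    sleep=sleeps.append,
)
print(donor, attempts, sleeps)  # B 5 [60]
```

In the simulation the joiner fails on A, B, and C, sleeps once for 60 seconds, fails on A again, and then recovers from B on the fifth attempt overall.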

Conclusion

This new version of Group Replication brings a sturdier recovery process that also offers a more familiar look and feel to the user. To try it, go to the labs releases and try the new preview release of the MySQL Group Replication plugin, following the instructions in Getting started with MySQL Group Replication, and send us your feedback.

Note that this is not the GA release yet, so don’t use it in production, and expect bugs here and there. If you do experience bugs, we are happy to fix them. All you have to do is file a bug in the bugs DB.

About Pedro Gomes

Who am I? I'm a replication developer @ MySQL since 2013, and a fan of all things distributed so it's hard not to love my job. Raised on the distributed lab of Minho's University, home of great academic research on the field, I joined Oracle following this same passion and here I am!
