Gone missing: the day a few group members went silent!

Group Replication introduces a new way to do replication in MySQL. With great features such as multi-master replication it brings a range of exciting deployment scenarios where some difficult problems become much easier to solve.

Because it relies on a consensus protocol, Paxos, to achieve its goal, we can sometimes get into situations where we lose a much-needed majority. This blog presents the cases where that might happen and how we can solve them with our new functionality.

Why do we need a majority

In a consensus algorithm, a majority of nodes must agree upon a proposed value in order to assign an ordinal number to a certain message sent to the group. In our specific case, we need the messages sent by Group Replication to be delivered to all members in exactly the same order.

Since version 0.6.0 of Group Replication, we use an in-house implementation of a Paxos variant, in which a majority of participating members is needed for each message sent to the group. In the general case, in a group with N members, we need, in each round, to have (N/2)+1 members alive (using integer division).

The above is one of the reasons why we recommend that your system be designed from scratch with HA scenarios in mind. Be reminded that in classical crash-failure scenarios the formula to calculate the ideal number of members is: ha_members = 2*(number_of_failures)+1. In practice, a setup of at least 3 nodes is recommended, since it tolerates one failure without losing majority.
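As a quick illustration of the two formulas above (the member counts here are just example numbers), both quantities can be computed directly in MySQL:

```sql
-- Majority needed in a group of N=5 members: (5 DIV 2) + 1 = 3
SELECT (5 DIV 2) + 1 AS majority_size;

-- Members needed to tolerate 1 crash failure: 2*1 + 1 = 3
SELECT 2*1 + 1 AS ha_members;
```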

When can it happen to you?

Let's keep in mind that these types of occurrences usually amount to a catastrophic scenario. Under normal conditions, our group communication implementation will reconfigure the group whenever it detects that some members have failed or are unreachable. The problem here lies in the simultaneous failure of several members.

Looking at a typical HA installation, the most common circumstances where it may happen are:

  1. A severe crash of a majority of nodes, for instance, a power surge in a server rack;
  2. A network split in a group with an even number of members, for instance, caused by a broken network switch.
    • A group with [A,B,C,D,E,F] becomes two groups, [A,B,C] and [D,E,F], neither of them holding a majority.

In case #1, we end up with a group that cannot proceed, since a majority of its members is no longer there. In case #2, we end up with two partitions, neither of which can proceed, since neither holds the majority needed to reconfigure the group and carry on.

The visible effects of such situation are:

  • Whenever you try to execute a transaction, it will block, and the command will only return once the transaction timeout is reached;
  • If you check the replication_group_members P_S table, you will see the nodes that are outside your partition, or that crashed, with our new status: UNREACHABLE.

UNREACHABLE is a new state derived from each member's perception of the connections it has established with all other nodes. When a server stops receiving heartbeats from a node, it marks that node with the new state. A DBA must always double-check the actual state of the servers declared UNREACHABLE, or use a monitoring tool to obtain that information.
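For instance, from any member that is still running, you can check how it currently perceives its peers by querying the Performance Schema table mentioned above:

```sql
-- How does this member currently see the rest of the group?
SELECT MEMBER_ID, MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
  FROM performance_schema.replication_group_members;
```

Members outside your partition, or that crashed, will show MEMBER_STATE = 'UNREACHABLE'.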

How can we get out of it

With such an event unfolding in your HA environment, your number one priority should be to restore service, and only then try to find out what caused the problem. Let's just say it is not easy to replace an entire rack or get a new network switch!

For that, we created a new variable: group_replication_force_members. This variable takes a comma-separated list of the "ip:port" addresses of members that are alive and that the DBA decides should proceed as a group.

A practical example:

  • A group is running with 5 nodes
    • 192.168.1.1:12345
    • 192.168.1.2:12345
    • 192.168.1.3:12345
    • 192.168.1.4:12345
    • 192.168.1.5:12345
  • There is a major crash and we end up with 2 nodes
    • 192.168.1.1:12345
    • 192.168.1.2:12345
  • The DBA can connect to any of the surviving nodes and, on only one of them, execute:
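Using the addresses of the two surviving members from the example above, the statement would be along these lines:

```sql
-- Force a new group membership containing only the two surviving members
SET GLOBAL group_replication_force_members =
    "192.168.1.1:12345,192.168.1.2:12345";
```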

This will cause that member to forward a special message to all the members in the list, stating: "Our new configuration is {192.168.1.1:12345, 192.168.1.2:12345}. Please override your current configuration with this new one". The message is special in that it does not require a majority in order to be agreed upon by all members.

What happens next is that the nodes on that list will resume their normal work, unblock everything that was blocked and proceed.

Given the unique type of scenarios where this mechanism is used, we must take some special care with it. Before setting this variable, one must make sure that only the nodes in that list are alive; otherwise we might end up with garbage coming from old connections of nodes that are still reachable. So it is recommended that, in the case of a network split, you shut down all the servers that do not belong to the new configuration.
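Along the same lines (a practical precaution, not something that the mechanism enforces), it is good practice to empty the variable once the group is stable again, so the forced configuration does not interfere with later membership changes:

```sql
-- Clear the forced configuration after the group has recovered
SET GLOBAL group_replication_force_members = "";
```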

Conclusion

This new feature is part of an ongoing effort to improve the user experience and to quickly resolve a situation that can happen in the daily operations of a DBA.

If you do encounter any issues, please file a bug under the ‘MySQL Server: Group Replication’ category so that we may investigate further. Give it a go and provide us with valuable input from your field experience on how to improve this new functionality!
