Automatic member fencing with OFFLINE_MODE in Group Replication

Group Replication enables you to create fault-tolerant systems with redundancy by replicating the system state to a set of servers. Even if some of the servers subsequently fail, the system remains available as long as the failed servers are not all of them or a majority.
This blog post focuses on what happens to the failed servers, that is, how the group can be configured so that failed servers that clients can still reach do not accept client requests.

A group member unintentionally leaves the group:

  1. after encountering an applier error;
  2. after encountering a recovery error;
  3. in the case of a loss of majority (if group_replication_unreachable_majority_timeout is set to a value other than 0);
  4. when another member of the group expels it due to a suspicion timing out;
  5. after an error on coordinated group changes;
  6. after a primary election error;
  7. when automatic rejoin is enabled, after all of its rejoin attempts have failed.
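
Several of these exits depend on settings the DBA controls. As a rough sketch, the query below reads the related system variables on a member. Only group_replication_unreachable_majority_timeout is named in the list above; group_replication_member_expel_timeout and group_replication_autorejoin_tries are assumed here to be the variables behind items 4 and 7, and all three require the Group Replication plugin to be installed.

```sql
-- Settings that influence when a member leaves the group unintentionally.
-- Assumption: the Group Replication plugin is installed on this server.
SELECT
  @@GLOBAL.group_replication_unreachable_majority_timeout AS unreachable_majority_timeout,
  @@GLOBAL.group_replication_member_expel_timeout         AS member_expel_timeout,
  @@GLOBAL.group_replication_autorejoin_tries             AS autorejoin_tries;
```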

The behaviour of the failed member after leaving the group is controlled by the option group_replication_exit_state_action.

Until 8.0.17, this behaviour could be:

  • READ_ONLY: disables writes on the server (the default value);
  • ABORT_SERVER: shuts down the server.

In 8.0.18 we added:

  • OFFLINE_MODE: closes all connections and disallows new ones from users who do not have the CONNECTION_ADMIN or SUPER privilege. This mode includes READ_ONLY, otherwise a user with the CONNECTION_ADMIN or SUPER privilege would be able to make changes that would never reach the group.

These three behaviours allow the DBA to customise how a failed server behaves and, in more severe situations, how the whole system behaves. For instance, if all members become unreachable due to an internal network failure, all members will follow the configured behaviour.

The DBA can block only writes with READ_ONLY, block all operations from regular users with OFFLINE_MODE, or even stop the server completely with ABORT_SERVER.
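
As a minimal sketch of how the option might be set, the statements below select OFFLINE_MODE on a member. Using SET PERSIST (rather than a configuration file entry) is an assumption made for illustration; the option has to be configured on every member that should be fenced this way.

```sql
-- Assumption: run on each group member by a user allowed to change
-- global system variables, with the Group Replication plugin installed.
SET PERSIST group_replication_exit_state_action = 'OFFLINE_MODE';

-- Confirm the value that will apply the next time the member leaves unintentionally.
SELECT @@GLOBAL.group_replication_exit_state_action;
```

The other two values, READ_ONLY and ABORT_SERVER, are set the same way.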

When a failed server, configured with group_replication_exit_state_action=OFFLINE_MODE, leaves the group, we can see its ERROR state in the performance_schema.replication_group_members table.
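
A query along these lines, run on the failed member, shows that state. This is a sketch, not the original listing; because the server is in offline mode, it assumes a session opened by a user with the CONNECTION_ADMIN or SUPER privilege.

```sql
-- Run on the failed member; MEMBER_STATE is the column of interest.
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
  FROM performance_schema.replication_group_members;
```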

The offline mode can be checked by reading the offline_mode system variable.
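
One way to read it is sketched below. Including super_read_only in the same query is an assumption, added because OFFLINE_MODE also includes the READ_ONLY behaviour.

```sql
-- offline_mode is expected to be enabled on the fenced member.
-- super_read_only is included on the assumption that it reflects the
-- READ_ONLY part of the OFFLINE_MODE behaviour.
SELECT @@GLOBAL.offline_mode, @@GLOBAL.super_read_only;
```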

After fixing the failure that caused the member to leave unintentionally, the DBA needs to unset offline_mode, in addition to rejoining the member to the group.
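
The two steps can be sketched as follows, assuming the underlying failure has already been fixed and that the session belongs to a user allowed to change global system variables and to start Group Replication.

```sql
-- Step 1: allow regular client connections again.
SET GLOBAL offline_mode = OFF;

-- Step 2: rejoin the member to the group.
START GROUP_REPLICATION;
```

Once the member has rejoined, its state in performance_schema.replication_group_members can be checked again to confirm that it is no longer ERROR.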

Conclusion

I hope this new fencing mode will help you improve and better configure the HA properties of your systems, allowing you to focus on your applications!


About Nuno Carvalho

Nuno Carvalho is a Principal Software Engineer and MySQL Replication Service Team lead at Oracle, the team in charge of the MySQL Group Replication plugin. His research interests include replication technologies, dependable systems and high availability. Before joining the MySQL team, he was a post-graduate student and a researcher at the University of Minho, Portugal, where he designed and implemented techniques to improve distributed systems scalability.
