Group Replication introduces a new way to do replication in MySQL. With great features such as multi-master replication it brings a range of exciting deployment scenarios where some difficult problems become much easier to solve. Group Replication also brings a new set of options that may need to be configured to extract the highest performance from the underlying computing resources.
In this blog post we will explain the components that specifically affect the performance of Group Replication and how they can be optimized. For more information about group replication please check a few recent blog posts here, here, and here.
1. Group Replication from a performance perspective
Group Replication’s basic guarantee is that a transaction will only commit after a majority of the nodes in a group have received it and agreed on the relative order between all transactions that were sent concurrently. It uses a best-effort approach that makes no assumptions regarding when transactions will be applied or even touch storage on the group members.
Group Replication works with MySQL to:
- intercept the transactions on the client-facing server just before they are committed and build a representation of the transaction that uniquely identifies its changes – the write-set;
- send the write-sets to the network and agree on an order in which the competing transactions must be applied;
- prepare to apply the transactions that don’t intersect in the agreed upon order and cancelling those that do intersect (but the first one to commit succeeds).
It explores three main components – the group communication layer, the certification and the binary log applier – which will be described in the following sections.
1.1 The group communication layer
Group Replication implements a group communication layer based on Paxos in order to allow several servers to agree on a same transaction order. With it, servers constitute a group and messages sent to it are received in the same order in every working group member. The protocol is complex and explaining the details is beyond the scope of this post. But, to understand the performance of the variant implemented in GR, let’s just say that to process a typical message one needs to:
- send a message with the user payload from the source to every other group member (accept);
- send a vote message from each receiver to the sender;
<after a majority is reached at this point the client sees the message as delivered>
- send a learn message from the sender to every other group member.
The critical factors for the performance of Group Replication are thus the network throughput (how many parallel messages we can fit in the pipeline) and the node-to-node latency (how long does it take to reach consensus on the sender).
Group Replication implementation of Paxos includes many optimizations that make it achieve great performance – it fully explores the network potential, handling many messages in parallel and packing multiple messages in a single one to send to each node whenever possible. From the user perspective the outcome is that:
- one can send as many transactions as they fit in the network bandwidth, from the sender to all other nodes – although the outgoing bandwidth on the node is divided by the number of group members other than itself;
- each transaction will be delayed, after it’s ready to commit, at least one median network round-trip-time from the sender to the receivers.
Performance can be further helped by two advanced user options:
This option instructs Group Replication to use LZ4 to compress messages larger than n bytes, which increases the available bandwidth.
While this is particularly effective with large write-sets on network-constrained environments, it should be noted that even on fast networks it has shown some benefits, as compression/decompression is very fast and the binary logs can be very compressible – Sysbench RW messages show 3x-4x compression ratio, but we have seen more than 10x on large inserts.
This option controls how many times the send/receive thread tries to receive pending messages before going to sleep – i.e., it controls how greedy that thread is.
The best value depends on the network infrastructure used, but careful configuration can lower the transaction latency considerably (we have seen more than 50% gains in latency and also some in throughput). It provides the most benefit in infrastructures that don’t receive many messages in parallel, where the send/receive threads would sleep more often, and where there is unused CPU time available.
In network bandwidth constrained deployments this may be useful:
1.2 The certification
Once a transaction message is received from the group communication layer it is put on a queue for certification, which is the algorithm that runs in each node to decide what to do with transaction. On positive certification the transaction is either considered finished on the client-facing server and queued on the relay log for posterior execution on the remaining group members. On negative certification the transaction is discarded or rolled-back as needed. This is done independently on each node, no further network synchronization activity is needed after receiving the transaction message properly ordered.
The certification algorithm runs in a dedicated thread that follows the transaction order established by the group communication layer, writing the transactions to the relay log if they are remote. This makes certification a possible contention point one must be aware of, as certifying each transaction can take some time.
At present the major factor affecting the certification loop throughput is the relay log bandwidth. If the storage sub-system used for the local relay logs – which is where certified transactions are queued for the applier thread – cannot keep up with the write requests, then the certification loop may start to lag behind the writer members and its certification queue may start to grow.
You can see the approximate certification queue size with:
On multi-master environments there is another factor that can affect the certification performance, which is the gtid_executed set complexity. As multiple members write to the group at the same time, the large number of gtid_executed sets on the certification database can become harder to handle as a lot of small holes appear in them and the inherent range compression cannot take place. This can also be seen as the throughput balance between member nodes becomes reduced.
On those deployments, particularly on systems that can handle many thousands of transactions per second, one can make use of another user option:
This options allows nodes to acquire ranges of gtids to assign to its transactions instead of assigning them globally following the certification order. With it, the gtid_executed sets become simpler and have only a few holes, but also implies that the numeric part of gtid_executed set no longer follows a strictly increasing order.
In multi-master scenarios this will be useful:
Also, when memory is available the relay log can be also be put in RAM disk for improved throughput, but that comes at the cost of being lost when a node fails.
1.3 Remote binary log applier
Once the transactions are written to the relay log they become ready to be executed by the replication binary log applier, just like what happens with asynchronous and semi-synchronous replication.
There is however a nuance here in application that one should be aware. Once a transactions is certified in a member, the certifier marks the rows used by as changed, but the transaction may stay in the relay log for some time waiting application. A local transaction that tries to change any of those rows during this period is doomed to be rolled-back.
Please note that minimal binary log images provide slightly higher throughput on the applier than full images. This also reduces the needed bandwidth, so it’s a good companion for the optimizations in section 1.1 above.
One can further improve throughput by avoiding expensive sync operations when durability over the network is enough:
But please note that the consequences of reducing the durability settings have to be properly evaluated to avoid unpleasant surprises. Also, with modern hardware the performance benefit is not as pronounced as it used to be.
Group Replication depends on three main components: group communication layer, certification and binary log applier and each of these components has possible contention points to watch out for:
- the lack of network bandwidth will lower the number of in-transit messages and high network round-trip-time will increase the user transaction latency proportionally;
- the lack of relay-log storage bandwidth will delay certification, so watch out for growing certification queue;
- lag on the binary log applier may increase transaction aborts on multi-master scenarios.
Please let us know if you have any doubts or suggestions.