Fine Tuning the Group Communication Thread¶
There is a procedure that runs in a loop while the plugin is loaded. This procedure is executed within the context of a thread; let's call it the group communication thread (GCT). The GCT receives messages from the group and from the plugin, handles quorum- and failure-detection-related tasks, sends out some keep-alive messages, and also handles the incoming and outgoing transactions from/to the server/group. Ultimately, this is a traditional producer-consumer subsystem: the server and the group are the producers, and the GCT is the consumer of their messages.
The GCT waits for incoming messages on a queue. When there are no messages, the GCT will obviously wait. However, it can be configured to do an active wait for a little while before actually going to sleep. In some cases this active wait proves beneficial, because the alternative is for the operating system to switch the GCT out of the processor and perform a context switch.
To force the GCT to do an active wait, there is an option called group_replication_poll_spin_loops, which makes the GCT loop, doing nothing relevant for a while, before actually polling the queue for the next message. Here is the formal description of the option:
group_replication_poll_spin_loops - the number of times the group communication thread spins waiting before polling for incoming messages. Default: 0.
An example of its usage is:
mysql> SET GLOBAL group_replication_poll_spin_loops= 10000;
Message Compression¶
When network bandwidth is a bottleneck, message compression can provide up to a 30-40% throughput improvement at the group communication level. This is especially important for large groups of servers under load.
The TCP peer-to-peer nature of the interconnections between N participants in the group makes the sender transmit the same amount of data N times. Furthermore, binary logs are likely to exhibit a high compression ratio (see table above). This makes compression a compelling feature.
Therefore, the typical user story for this feature is: as a MySQL Group Replication user whose workload contains large transactions, I may want compression enabled when sending messages to the group, so that network bandwidth consumption is restricted.
Compression happens at the group communication engine level, before the data is handed over to the group communication thread, so it happens within the context of the MySQL user session thread.
Transaction payloads may be compressed before being sent out to the group and decompressed when received. Compression is conditional and depends on a configured threshold. By default compression is disabled.
In addition, there is no requirement that all servers in the group have compression enabled to be able to work together. Upon receiving a message, the member will check the message envelope to verify whether it is compressed or not. If needed, then the member decompresses the transaction, before delivering it to the upper layer.
The compression algorithm used is LZ4.
The compression threshold, in bytes, may be set to a value larger than 0. In that case, compression is activated, and only transactions with a payload larger than the threshold will be compressed. Below is an example of how to enable compression.
STOP GROUP_REPLICATION;
-- The following statement sets the compression threshold to 1KB.
-- Should a transaction generate a replication message with a payload
-- larger than 1KB, i.e. a binary log transaction entry larger than
-- 1KB, then it will be compressed.
SET GLOBAL group_replication_compression_threshold= 1024;
START GROUP_REPLICATION;
Flow Control¶
Group Replication’s guarantee is that a transaction will only commit after a majority of the members in the group have received it and agreed on the relative order of all transactions that were sent concurrently.
This approach works well if the total number of writes to the group does not exceed the write capacity of any member in the group. If it does, and some members have less write throughput than others, particularly less than the writer members, those members will start lagging behind the writers.
Having some members lagging behind the group brings problematic consequences: in particular, reads on such members may externalize very old data. Depending on why a member is lagging behind, the other members in the group may have to save more or less replication context in order to fulfil potential data transfer requests from the slow member.
There is, however, a mechanism in the replication protocol to avoid having too much distance, in terms of transactions applied, between fast and slow members. This is known as the flow control mechanism. It tries to address several goals:
- to keep the members close enough to make buffering and de-synchronization between members a small problem;
- to adapt quickly to changing conditions like different workloads or more writers in the group;
- to give each member a fair share of the available write capacity;
- to not reduce throughput more than strictly necessary to avoid wasting resources.
Given the design of Group Replication, the decision whether or not to throttle takes into account two work queues: (i) the certification queue; and (ii) the binary log applier queue. Whenever the size of either of these queues exceeds a user-defined threshold, the throttling mechanism is triggered. So a user only has to configure: (i) whether to do flow control at the certifier level, at the applier level, or both; and (ii) the threshold for each queue.
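For illustration, a sketch of such a configuration using the flow control system variables is shown below. The threshold values are assumptions chosen for the example; tune them to your workload.

```sql
-- Enable quota-based flow control (the QUOTA mode); DISABLED turns it off.
SET GLOBAL group_replication_flow_control_mode= 'QUOTA';
-- Trigger throttling when the certification queue holds more than
-- 25000 pending transactions.
SET GLOBAL group_replication_flow_control_certifier_threshold= 25000;
-- Trigger throttling when the binary log applier queue holds more than
-- 25000 pending transactions.
SET GLOBAL group_replication_flow_control_applier_threshold= 25000;
```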
The flow control depends on two basic mechanisms:
- the monitoring of members to collect some statistics on throughput and queue sizes of all group members to make educated guesses on what is the maximum write pressure each member should be subjected to;
- the throttling of members that are trying to write beyond their fair share of the available capacity at each moment in time.
Probes and Statistics¶
The monitoring mechanism works by having each member deploy a set of probes to collect information about its work queues and throughput. It then periodically propagates that information to the group, sharing that data with the other members.
Such probes are scattered throughout the plugin stack and allow one to establish metrics, such as:
- the certifier queue size;
- the replication applier queue size;
- the total number of transactions certified;
- the total number of remote transactions applied in the member;
- the total number of local transactions.
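Some of these metrics are externalized through the Performance Schema, so one way to inspect a member's queue sizes and certification counters is a query along these lines:

```sql
-- Certifier queue size and total transactions checked, per member.
SELECT MEMBER_ID,
       COUNT_TRANSACTIONS_IN_QUEUE,
       COUNT_TRANSACTIONS_CHECKED
FROM performance_schema.replication_group_member_stats;
```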
Once a member receives a message with statistics from another member, it calculates additional metrics regarding how many transactions were certified, applied, and locally executed in the last monitoring period.
Monitoring data is shared with the others in the group periodically. The monitoring period must be short enough that the other members can react to the current write demands, but long enough that it has minimal impact on group bandwidth. The information is shared every second, and this period is sufficient to address both concerns.
Based on the metrics gathered across all servers in the group, a throttling mechanism kicks in and decides whether to limit the rate a member is able to execute/commit new transactions.
Therefore, metrics acquired from all members are the basis for calculating the capacity of each member: if a member has a large queue (on certification or on the applier), then its capacity to execute new transactions should be close to the number certified or applied in the last period.
The lowest capacity of all the members in the group determines the real capacity of the group, while the number of local transactions determines how many members are writing to it and, consequently, among how many members that available capacity should be shared.
This means that every member has an established write quota based on the available capacity, i.e., a number of transactions it can safely issue in the next period. The writer quota is enforced by the throttling mechanism whenever the queue size of the certifier or of the binary log applier exceeds the user-defined threshold.
The quota is reduced by the number of transactions that were delayed in the last period, and further reduced by 10% to allow the queue that triggered the problem to shrink. In order to avoid large “lumps” in throughput once the queue size goes beyond the threshold, the throughput is only allowed to grow by the same 10% per period after that.
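As a simplified worked example, under assumed figures (not from any real deployment): suppose the slowest member certified and applied 600 transactions in the last period, two members are writing, and no transactions were delayed. Each writer's fair share would then be 600 / 2 = 300 transactions, and the extra 10% reduction leaves a quota of roughly 270 transactions for the next period:

```sql
-- Hypothetical arithmetic only: 600 group capacity, 2 writers, 10% cut.
SELECT FLOOR(600 / 2 * 0.9) AS writer_quota;
```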
The current throttling mechanism does not penalize transactions below the quota, but delays finishing those transactions that exceed it until the end of the monitoring period. As a consequence, if the quota is very small relative to the write requests issued, some transactions may have latencies close to the monitoring period.