STOP SLAVE Improvements for Multi-Threaded Slaves

When using a multi-threaded slave, the STOP SLAVE command could take a long time to return because the slave waited for every worker to finish processing its queue. This blog post describes the improvements made in MySQL 5.6.26 and later to ensure that STOP SLAVE returns quickly when using a multi-threaded slave (MTS).

Background

Multi-threaded slave here refers to inter-transactional parallelization while applying transactions to the slave’s databases. This makes replication more scalable, especially on multi-core architectures. In MTS mode a receiver thread (IO thread) receives events/transactions from the master and queues them in the local relay log; an applier thread (SQL coordinator) reads these events and schedules them to workers for execution. The following image depicts how the workers’ queues look in the case of two workers.

Image1

Stopping workers abruptly (as KILL does) might generate gaps in the slave execution stream. If the workers are in the state shown in Image1 above and KILL is issued against both the coordinator and the workers, the transactions currently being executed are completed and the slave then stops. In this scenario transaction T5 is not executed, which leaves a gap in the MTS execution sequence. GTIDs handle the filling of these gaps efficiently. The auto-positioning replication protocol retrieves events starting from the first transaction that MTS did not apply, but it also retrieves all of the transactions after that one, even those that have already been applied. Fortunately, when GTIDs are enabled there is no problem in receiving a transaction twice: the second attempt changes nothing and is simply skipped. If GTIDs are not enabled, the transactions are reapplied, which leads to data inconsistencies.
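To make the notion of a gap concrete, here is a minimal Python sketch. It is not MySQL code: the function name `find_gaps` and the transaction numbers are assumptions chosen only to mirror the "T5 was skipped" situation described above. It reports the highest transaction up to which everything has been applied, plus any missing transactions below the highest applied one.

```python
def find_gaps(applied):
    """Return (low_water_mark, gaps) for transactions numbered from 1.

    low_water_mark is the highest N such that every transaction 1..N
    has been applied; gaps are the unapplied transactions that sit
    below the highest applied transaction.
    """
    applied = set(applied)
    low_water_mark = 0
    while low_water_mark + 1 in applied:
        low_water_mark += 1
    highest = max(applied, default=0)
    gaps = [t for t in range(low_water_mark + 1, highest) if t not in applied]
    return low_water_mark, gaps


# Hypothetical state after an abrupt stop: T5 was never executed,
# although the later transactions T6 and T7 were.
print(find_gaps({1, 2, 3, 4, 6, 7}))  # (4, [5])
```

A gap-less state is one where the second value is empty, which is the state STOP SLAVE is meant to guarantee.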

Image2

Executing the STOP SLAVE command in MTS always guarantees that the slave stops at a gap-less state. To achieve this state prior to MySQL 5.6.26, the STOP SLAVE command waited until all workers had completed their assigned tasks. Let us consider the Image1 scenario. Executing the STOP SLAVE command in this scenario results in the state shown in Image3 below.

Image3

The STOP SLAVE command waited until transactions T8 and T9 had been executed. This guarantees a gap-less stopping point, but STOP SLAVE cannot return until every queued transaction has been applied, so the larger the worker queues, the longer the command takes to return.

In MySQL 5.6.26 and later, however, the STOP SLAVE command does not wait for the workers’ entire queues to be consumed. Instead, it identifies the nearest gap-less state and stops at that point. The following image (Image4) depicts the state of the workers’ queues after this improvement, where the STOP SLAVE command did not wait for transactions T8 and T9 to complete.

Image4

How the improved STOP SLAVE algorithm works

When the MTS coordinator thread receives a STOP command, it will notify all workers to STOP by setting their current running_status to STOP. Once this STOP command is received, all workers will try to identify a maximum group index which has been already executed (or is being executed). A group index is a counter that is incremented with each new group/transaction being assigned to a worker. If a worker has events which still need to be executed, then the current group index is used, and if the worker has an empty queue then the last executed group’s index is used for calculation of max_group_index.

Once the max_group_index value is determined, the workers whose group index is less than or equal to max_group_index apply their groups. Workers whose group index is above max_group_index, and workers with an empty queue, simply exit. Note that each worker contributes only once to max_group_index. This process ensures that the slave stops at the nearest gap-less state. The following example demonstrates how the improved STOP SLAVE algorithm works.
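The steps above can be approximated with a small Python simulation. This is a simplified sketch, not the server implementation: the `Worker` class, the `stop_workers` function, and the queue contents are all hypothetical, and the real workers compute the value concurrently rather than in a single pass.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    # Group indexes assigned by the coordinator, oldest first;
    # queue[0] is the group currently being executed.
    queue: list = field(default_factory=list)
    last_executed: int = 0  # index of the last group this worker finished

    def contribution(self) -> int:
        # Busy worker: current group index. Idle worker: last executed index.
        return self.queue[0] if self.queue else self.last_executed


def stop_workers(workers):
    """Return (max_group_index, groups each worker applies before exiting)."""
    # Each worker contributes exactly once to the maximum.
    max_group_index = max(w.contribution() for w in workers)
    applied = {}
    for w in workers:
        # Groups at or below the maximum are applied; the rest stay queued.
        applied[w.name] = [g for g in w.queue if g <= max_group_index]
        w.queue = [g for g in w.queue if g > max_group_index]
        if applied[w.name]:
            w.last_executed = applied[w.name][-1]
    return max_group_index, applied


# Hypothetical queues for two busy workers and one idle worker.
workers = [
    Worker("W1", queue=[4, 6, 8], last_executed=2),
    Worker("W2", queue=[5, 7, 9], last_executed=3),
    Worker("W3", queue=[], last_executed=1),
]
max_idx, applied = stop_workers(workers)
print(max_idx)  # 5
print(applied)  # {'W1': [4], 'W2': [5], 'W3': []}
```

In this made-up run, groups 1 through 5 end up applied with nothing missing in between, while groups 6 through 9 remain queued, so the stop does not have to wait for them.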

An example

We have two databases, d1 and d2, and the following tables in these databases. The transactions and their respective group index values (g_idx) on the slave:

On the slave we begin a transaction and directly insert two rows:

On the master we execute the following transactions, which then get replicated to the slave. The slave will then block transactions T3 and T6 because we have a primary key constraint on d1.t and d2.t.

Now let’s assume that we have executed the above sequence of operations on the master and a multi-threaded slave. We’ll then wait for some time, so that the coordinator on the slave populates all of the above transactions into its workers’ queues. Now we’ll execute the STOP SLAVE command and then release the hold on T3 and T6. Workers W1 and W2 will proceed and participate in the calculation of the max_group_index value. W1 will provide 10 and W2 will provide 7, because W1 is currently executing transaction T6 (g_idx:10) and W2 is executing T3 (g_idx:7). From these values max_group_index is set to 10. With the improved algorithm the slave then stops at the first gap-less state (group index 10).

Suppose instead that workers were allowed to continue and contribute again. If W2 is still busy and W1, after completing T6, picks up T9 and contributes once more to max_group_index, the gap-less window is extended up to T9. The slave could then no longer guarantee that it stops at the nearest gap-less point, and the STOP SLAVE command would become slow again. For this reason each worker participates only once in the calculation of max_group_index. The new algorithm ensures that the slave will stop at #1.
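The effect of the contribute-once rule can be sketched in a few lines of Python. The function `stop_index` is hypothetical, and while the values 10 and 7 come from the example above, the group index 13 for T9 is an assumption made only for illustration.

```python
def stop_index(first_contributions, later_contributions=()):
    """Return the group index at which the slave would stop.

    first_contributions: one group index per worker (the rule described above).
    later_contributions: extra indexes a worker would add if it were
    allowed to contribute again after picking up more work.
    """
    max_group_index = max(first_contributions)
    for g in later_contributions:
        max_group_index = max(max_group_index, g)
    return max_group_index


# W1 contributes 10 (T6) and W2 contributes 7 (T3), once each.
once = stop_index([10, 7])
# If W1 could contribute again for T9 (assumed g_idx 13), the window grows.
repeated = stop_index([10, 7], later_contributions=[13])
print(once, repeated)  # 10 13
```

Every additional contribution can only raise the stop point, which is why limiting each worker to a single contribution keeps the stop at the nearest gap-less state.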

The above changes were made as part of Bug#75525 in MySQL 5.6.26.

A quick demonstration

Enable MTS with slave_parallel_workers=2.
Step 1: On the Master – Create two databases and two tables.

Step 2: On the Slave – Create a trigger with a sleep to introduce some delay on the slave for each INSERT.

Step 3: On the Master – Execute a small procedure which INSERTs 1000 tuples into two databases: d1 and d2.

Step 4: On the Slave – Immediately after the completion of step #3 on the master, execute STOP SLAVE on the Slave.

Before the fix:

After the fix:

Summary

Prior to the fix for Bug#75525, issuing a STOP SLAVE command on a multi-threaded slave would always wait for all workers to finish their queues and then STOP. This meant the STOP SLAVE command could take a long time to complete. Now after the Bug#75525 fix in MySQL 5.6.26, the STOP SLAVE command will identify the nearest gap-less state and it will stop at that point. Please try it out and let us know your feedback!
