Extending replication instrumentation: an insight on transaction retries

MySQL 8.0.13 improves replication lag monitoring by extending the instrumentation for transaction transient errors. These temporary errors, which include lock timeouts caused by client transactions executing concurrently as the slave is replicating, do not stop the applier thread: instead, they cause a transaction to retry.

In this new release, we add eight new columns to the Performance Schema table replication_applier_status_by_worker reporting for both the last applied and the currently applying transaction:
– the number of times the transaction was retried by the worker after the first attempt (if the transaction was applied at the first attempt, this number is zero);
– the error number of the last transient error that caused the transaction to be retried;
– the message text for the last transient error that caused the transaction to be retried;
– the timestamp, in ‘YYYY-MM-DD HH:MM:SS[.fraction]’ format, for the last transient error that caused the transaction to be retried.

Until MySQL 8.0.12, in addition to checking the error log, we could monitor the number of retries that took place on each replication channel in the COUNT_TRANSACTIONS_RETRIES column in the Performance Schema table replication_applier_status, but this approach was not very effective for a multi-threaded slave. Additionally, existing LAST_ERROR_* columns in the performance schema table replication_applier_status_by_worker only display information about errors which lead the applier thread to stop. Now, we can easily monitor both transient and non-transient errors for the currently applying and last applied transactions directly from the Performance Schema table replication_applier_status_by_worker.

Let’s see this in action!

Considering that a master-slave replication topology has already been setup and the variable slave_transaction_retries set to a high number, for instance 10, let’s create a table on the master.

After the table is replicated on the slave, we start a transaction on the slave to lock the table

and go back to the master to execute and replicate a transaction that writes the exact same row, so that the slave will be forced to retry applying this transaction.

We can now check the new fields in performance schema table replication_applier_status_by_worker, which will tell us that the applying transaction has already been retried twice because of error 1205 and when the last retry was started.

Now let’s rollback the slave local transaction, so that the one being replicated from the master can commit.

After the replicated transaction is finally applied, we can see that it was retried six times, the last transient error it experienced and when the last retry started. Also note that since the error in the applying transaction was transient, the LAST_ERROR fields are still clear.

The information in START_APPLY_TIMESTAMP fields was also refined in this release. Prior to 8.0.13, this timestamp was refreshed every time the worker retried to apply a transaction due to a transient error. Now, START_APPLY_TIMESTAMP is set to when the worker started its first attempt to apply the transaction, as you can see in the examples above. This means the time elapsed between START_APPLY_TIMESTAMP and END_APPLY_TIMESTAMP matches the total time the worker spent applying a transaction, including retries.

This sums up the improvements in monitoring replication lag in MySQL 8.0.13. We look forward to hearing your feedback!

11,670 total views, 15 views today

Leave a Reply