“Look: I/O thread is waiting for disk space!”

MySQL 8.0.1 introduced a work with replication threads mutexes in order to improve performance. In MySQL 8.0.2 the same work was extended, focusing in usability, and revamped how replication deals with disk-full conditions, improving the responsiveness of both monitoring commands and administrative commands such as KILL, as well as making status messages much more precise and helpful.

Before this work, a troubleshooting session may look like:

The slave is not answering the replication delay monitoring. Is it okay?


Operator 1

Let me check!


Operator 2

Operator 2 connects to the MySQL instance.

Connected to it! MySQL instance is up!


Operator 2

Operator 2 queries performance_schema.replication_connection_status table.

Queried P_S replication_connection_status, but query is blocked. It doesn’t answer.


Operator 2

Can you check replication threads states?


Operator 1

Operator 2 connects a new client session and queries performance_schema.threads table.

I/O thread state is ‘Queuing master event to the relay log’, but for quite some time.
SQL thread is ‘Reading event from the relay log’ from quite some time too.
Seems I/O thread is not progressing.




Operator 2

Querying SHOW SLAVE STATUS blocked too. Dammit!


Operator 2

Hum… What about the server error log?


Operator 1

“Checking…”


Operator 2

Operator 2 connects to the server host to check the MySQL instance error log.

Ohhh! There is a ‘Disk is full’ error writing to a relay log file!
This must be the I/O thread!



Operator 2

Ack! Let’s try to free some disk space there to see if this fixes the issue.


Operator 1



The above could happen because when the replication receiver thread gets disk full while writing new events into the relay log, the server goes to a loop where it polls the disk every 60 seconds to see if enough space has been freed. It used to hold a mutex while doing so, and that prevented queries into performance_schema.replication_connection_* tables (and other monitoring statements like SHOW SLAVE STATUS) from executing.

Checking replication threads statuses also gave almost no clue about what was happening:

The only effective hint about the disk space issue was a set of messages into the server error log like:

Trying to kill the receiver thread would make the server send the kill signal to the process, but the signal would only be checked by the thread when it was actually checking for disk space. So, as it slept for 60 seconds between attempts, the receiver thread sometimes waited up to 1 minute before being actually killed.

Also, when killed, the receiver thread left any event’s partial content already written into the relay log. Trying to show the contents of the relay log (using mysqlbinlog client program) would lead to an error reading the last (incomplete) event in the relay log file. In 5.7 (since 8.0.1 this should not happen), this partial content would lead to a SQL thread error:

Improved usability

In MySQL 8.0.2 we introduced an improvement where the receiver thread no longer holds the mutex required for monitoring activities while writing into the relay log.

This means that, when blocked waiting for disk space, the receiver thread no longer blocks monitoring threads.

In order to allow operators to know what is happening without the need of inspecting server error log, the receiver thread will now state that it is waiting for disk space in any place showing replication thread states.

We also changed the wait for disk space procedure to check for the kill signal every second (freed disk space is still checked every minute). So, a KILL of the receiver thread waiting for disk space will make it stop almost promptly (no need to wait for up to a minute anymore). Also, when killed, the receiver thread will now truncate the relay log file in the end of the last completely received event (there will be no trash in the relay log file anymore).

Conclusion

When you don’t know exactly what is happening with replication threads, it takes longer to identify the effective actions to fix the issue, which leads to further replication delays or incorrect actions that don’t deal with the problem. We have now improved the usability of commands to monitor replication threads, allowing a receiver thread to report that it is waiting for disk space as a thread state, and also allowing replication monitoring commands to not be blocked while the receiver thread is waiting for disk space. Time to update monitoring procedures!

Another day, in an operations room with MySQL 8…

Slave’s delay is increasing. Is it okay?


Operator 1

Let me check!


Operator 2

Operator 2 connects to the MySQL instance and checks replication threads statuses.

Look: I/O thread is waiting for disk space!


Operator 2

Ack! Let’s try to free some disk space there to see if this fixes the issue.


Operator 1

About João Gramacho

João Gramacho is a Software Developer for the MySQL Replication Core team at Oracle. Before joining Oracle, he worked for more than ten years with IT infrastructure support (servers hardware, operating systems, networks, management and security) and did a PhD in high-performance computing.

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter. * Time limit is exhausted. Please reload CAPTCHA.