Taking advantage of new transaction length metadata

MySQL 8.0.2 introduced a small yet powerful piece of per-transaction metadata in the binary log: the transaction length.

MySQL binary logs are used for many things besides MySQL replication and backup/recovery: replicating to Hadoop; replicating to other databases, such as Oracle; change data capture (CDC) and extract-transform-load (ETL); recording change notifications for cache invalidation; change tracking for differential backups; and more.

A quick search for “binlog” on GitHub lists projects that parse MySQL binary log event streams in many programming languages: Ruby, PHP, Python, NodeJS, Java, Golang, etc.

Identifying that the stream reached the end of a transaction

Parsing transactions from a binary log event stream (either from a file or directly from a replication connection to a MySQL server) requires good knowledge of how transactions can be represented in order to correctly identify transaction boundaries. This matters for tools that want to acknowledge each full transaction they receive (and process): the tool is at the beginning of a transaction in the stream and must take an action when it reaches the end of that transaction.

A transaction can be represented in many ways in a binary log event stream.

A DDL will start with a GTID event, followed by zero or more pre-statement events, followed by a single QUERY event. Pre-statement events are INTVAR, RAND and USER_VAR events.

A DML will start with a GTID event, followed by a QUERY(BEGIN) event, followed by sets of either QUERY events (with their own pre-statement events) or TABLE_MAP and ROWS events, followed by a QUERY(COMMIT|ROLLBACK) or an XID event.

An XA DML will start with a GTID event, followed by a QUERY(XA START) event, followed by sets of either QUERY events (with their own pre-statement events) or TABLE_MAP and ROWS events, followed by a QUERY(XA END) event, followed by an XA_PREPARE event.

Without enumerating every case here, the problem with this logic is that we need a parser to detect that a given event was the last one of a transaction.

MySQL 5.7 introduced a Transaction_boundary_parser to perform this parsing as accurately as possible. It solved the problem, but it requires processing (at least the header of) every event belonging to the transaction.
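To make the problem concrete, here is a much-simplified sketch of the kind of state machine such a boundary parser needs. The event-type names are illustrative labels, not MySQL type codes, and the real Transaction_boundary_parser in the server handles many more cases:

```python
def transaction_boundaries(event_types):
    """Yield (event_type, is_last_of_transaction) pairs.

    Simplified sketch: QUERY(BEGIN)/QUERY(XA START) open a multi-event
    transaction; QUERY(COMMIT|ROLLBACK), XID or XA_PREPARE close it;
    a bare QUERY outside a transaction is a DDL, which is a
    transaction on its own.
    """
    in_dml = False
    for t in event_types:
        if t in ("QUERY_BEGIN", "QUERY_XA_START"):
            in_dml = True
            yield t, False
        elif t in ("QUERY_COMMIT", "QUERY_ROLLBACK", "XID", "XA_PREPARE"):
            in_dml = False
            yield t, True
        elif t == "QUERY" and not in_dml:
            yield t, True          # a DDL statement ends its own transaction
        else:
            yield t, False         # GTID, TABLE_MAP, ROWS, pre-statement, ...

# Example: a row-based DML followed by a DDL.
events = ["GTID", "QUERY_BEGIN", "TABLE_MAP", "WRITE_ROWS", "XID",
          "GTID", "QUERY"]
ends = [t for t, last in transaction_boundaries(events) if last]
# → ["XID", "QUERY"]
```

Even this toy version must look at every event in the stream, which is exactly the cost the transaction_length field removes.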

A tool could try to be “smart” by assuming that, once it receives a new GTID event, the previous event was the last one of the previous transaction. But this ties acknowledging a transaction to receiving more than the transaction content itself (how would the tool handle the last transaction of a stream?).

Since MySQL 8.0.2, the GTID event carries a new field: transaction_length. Its value is the full transaction size in bytes, from the beginning of the GTID event itself up to the end of the last event of the transaction.

The transaction_length field takes 1 to 9 bytes to store, depending on the transaction size. See the table below:

Transaction size (in bytes)      transaction_length field size (bytes)
< 251                            1
>= 251 and < 65536               3
>= 65536 and < 16777216          4
>= 16777216                      9
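This is MySQL's standard length-encoded integer format (also used in the client/server protocol): a one-byte value below 251 stands for itself, while marker bytes 0xFC, 0xFD and 0xFE announce a 2-, 3- or 8-byte little-endian payload. A minimal decoder could look like this:

```python
import struct

def decode_packed_length(buf, pos=0):
    """Decode a MySQL length-encoded integer (the encoding used for
    transaction_length) starting at buf[pos].
    Returns (value, bytes_consumed)."""
    first = buf[pos]
    if first < 251:          # 1-byte form: the byte is the value
        return first, 1
    if first == 0xFC:        # 3-byte form: 2-byte little-endian payload
        return struct.unpack_from("<H", buf, pos + 1)[0], 3
    if first == 0xFD:        # 4-byte form: 3-byte little-endian payload
        return int.from_bytes(buf[pos + 1:pos + 4], "little"), 4
    if first == 0xFE:        # 9-byte form: 8-byte little-endian payload
        return struct.unpack_from("<Q", buf, pos + 1)[0], 9
    raise ValueError("invalid length prefix: 0x%02x" % first)

decode_packed_length(bytes([0x1D]))    # → (29, 1)
decode_packed_length(b"\xfc\x1d\x01")  # → (285, 3)
```

The second example decodes 285, the transaction length shown for transaction ‘:3’ later in this post.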

With this information, a tool reading from a binary log event stream can rely on the GTID event's transaction_length field to determine the number of bytes it should receive before reaching the end of the transaction.
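A stream consumer can then group events into transactions by simple byte counting, with no event-level state machine. The sketch below assumes raw v4-format events (type code at offset 4 of the common header) and a hypothetical get_transaction_length helper that extracts the field from the GTID event body:

```python
GTID_LOG_EVENT = 0x21  # GTID event type code (33)

def consume_transactions(stream, get_transaction_length):
    """Group raw binlog events into whole transactions using the GTID
    event's transaction_length (MySQL >= 8.0.2).

    `stream` yields raw event byte strings; `get_transaction_length`
    is a hypothetical helper that parses the transaction_length field
    out of a GTID event."""
    in_txn = False
    remaining = 0
    txn = []
    for event in stream:
        if event[4] == GTID_LOG_EVENT:      # type code at header offset 4
            in_txn = True
            remaining = get_transaction_length(event)
            txn = []
        if not in_txn:
            continue                        # e.g. FORMAT_DESCRIPTION, ROTATE
        txn.append(event)
        remaining -= len(event)             # just count bytes, no parsing
        if remaining <= 0:
            yield txn                       # a complete transaction
            in_txn = False
```

Acknowledging a transaction becomes trivial: act whenever the generator yields, including for the last transaction of the stream.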

Extracting a given transaction

Searching a file for a specific transaction in order to extract it would also require good knowledge of how transactions can be represented, and would also require iterating over the file, event by event, to determine the transaction boundaries: a tool would first find the beginning of the transaction and then follow the same pattern described in “Identifying that the stream reached the end of a transaction”.

Until MySQL 8.0.2, the logic to extract a given transaction from a binary log file had to iterate over all binary log events until finding the GTID of the desired transaction, and then continue iterating, now parsing the events, until finding the end of the transaction.

Since MySQL 8.0.2, the logic is more straightforward (and more efficient, too). A tool can jump from GTID to GTID without reading the intermediate events by using the GTID event's transaction_length field. Once it reaches the desired GTID, the tool can copy transaction_length bytes from the beginning of the transaction to extract the whole transaction.
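The jump-and-copy approach can be sketched as follows. It assumes a v4-format binlog file (4-byte magic, 19-byte common header with the event size at offset 9) and two hypothetical helpers, get_transaction_length and gtid_matches, that parse the GTID event body:

```python
import struct

GTID_LOG_EVENT = 0x21
BINLOG_MAGIC = b"\xfebin"      # 4-byte binlog file magic

def extract_transaction(path, wanted, get_transaction_length, gtid_matches):
    """Extract one whole transaction by seeking GTID-to-GTID using
    transaction_length (MySQL >= 8.0.2 binlog file). Returns the raw
    transaction bytes, or None if the GTID is not found."""
    with open(path, "rb") as f:
        assert f.read(4) == BINLOG_MAGIC
        while True:
            header = f.read(19)                    # v4 common event header
            if len(header) < 19:
                return None                        # end of file, not found
            event_size = struct.unpack_from("<I", header, 9)[0]
            body = f.read(event_size - 19)
            if header[4] != GTID_LOG_EVENT:
                continue                           # FORMAT_DESCRIPTION, etc.
            event = header + body
            txn_len = get_transaction_length(event)
            if gtid_matches(event, wanted):
                # copy the rest of the transaction in a single read
                return event + f.read(txn_len - event_size)
            # not the one we want: seek straight past this transaction
            f.seek(txn_len - event_size, 1)
```

Every seek skips a whole transaction, so the cost is proportional to the number of transactions before the target, not to the number of events.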

mysqlbinlog output in MySQL 8.0

Let’s take a look at a binary log dump of MySQL 8.0.2:

We notice that the GTID event has more information than in 5.7. Two new fields (the original and immediate commit timestamps) were introduced by the work described in “New monitoring replication features and more!”. We also notice the transaction_length field (285 bytes for transaction ‘:3’). Yes! This is the new GTID event transaction_length field!

Note: the MySQL 8 mysqlbinlog client tool will display 0 as the transaction length for GTID events without this information (i.e., from MySQL 5.7 binary log files).

Conclusion

Tools working with MySQL 8.0 binary log event streams can rely on the new GTID transaction_length field to accomplish their work more efficiently, without the labor of parsing binary log events just to find the proper transaction boundaries.

Using MySQL 8 binary logs, a tool that needs to find transaction boundaries can now just read the length from the GTID event. Before, the tool had to read each event of the transaction and maintain a state machine to parse the transaction structure.

Using MySQL 8 binary logs, a tool that seeks to a GTID now just needs to read each GTID event, and can quickly jump directly to the next GTID event using the length information. Before, the tool had to read the header of each event within the transaction.


About João Gramacho

João Gramacho is a Principal Software Developer for the MySQL Replication Core team at Oracle. Before joining Oracle, he worked for more than ten years with IT infrastructure support (servers hardware, operating systems, networks, management and security) and did a PhD in high-performance computing.
