MySQL Group Replication – Transaction life cycle explained

The multi master plugin for MySQL is here. MySQL Group Replication provides virtually synchronous updates on any member in a group of MySQL servers, with conflict handling and automatic group membership management and failure detection.

For a better understanding on how things work, we go under the hood in this post and will analyse the transaction life cycle on multi master and which components does it interact with. But before that we need to understand first what a group is.

Group Communication Toolkit

The multi master plugin is powered by a group communication toolkit. This is what decides which servers belong to the group, performs failure detection and orders server messages. Being the ordered messaging the magic thing that allows the data to be consistent across all members. You can check the details of the group communication toolkit at Group communication behind the scenes post.

1

These strong properties, together with the same initial state of each server that joins the group, allowed us to implement a database state machine replication (DBSM)[1].

On the context of group communication system other of the base concepts is the view. When a server joins (joiner) the group, a new view is installed, being this view a logical marker that defines which servers belong to the group. A view is also installed when a server leaves the group (either voluntarily or involuntarily). This is provided by a built-in
dynamic group membership service.

Before start applying transactions, joiners will request from the group its current state, in order to catch up missing transactions, thus synchronizing to an up-to-date replica. This is called Distributed Recovery. You can check more details at Distributed Recovery behind the scenes post.

Transaction life cycle

Having the group set up and the same initial state on all group members, now we need that all group members agree on transactions outcome, that is, if any transaction is committed on any member it must also be committed on all the others non-faulty group members.
This strong requirement is only needed for write operations, read-only transactions are executed locally only.

Group communication toolkit provides us a total order broadcast primitive, that is, all messages are delivered everywhere on the same order even in the presence of failures. This means that we have a global order of messages exchanged on the group, combined
together with deterministic transaction outcome (commit or rollback) decisions can ensure that all servers eventually will reach the same state.

So whenever a client executes a write transaction, it is executed optimistically on the local server right until before being actually committed.

2

At this stage the transaction is broadcast to the group so that its outcome is decided.

3

This message is ordered by group communication toolkit and delivered to all non-faulty group members in same order.

4
Now we need to decide locally if this particular transaction should be committed or rollback. Every server keeps for each row update an associated version. This will allow us
to know that the version of the row that serves as base for this update is a older version and so these where executed in parallel in different members. This information about the
version at which a given transaction was executed, is included in transaction write set.

Certification

This decision is taken by the certification module. Lets see an example of it.

5

On the above diagram we have a group composed by three servers, on which a client will execute a write transaction on server 1 (s1). The transaction executes until the before commit stage and then it broadcasts write set and data to the group.

The transaction write set is composed by the primary keys of each updated table row and the database version at which transaction was executed. Database version is provided by GTID_EXECUTED, more precisely its last number without gaps from group UUID.
Example:  GTID_EXECUTED: UUID:1-10, UUID:12, version is 10.
Database version is implicitly increased with transactions commit.

After delivery, the transaction is certified, that is, each server will compare locally the versions included on the write set with the local versions. If the version included on write set is smaller than the one on certification database then transaction rollbacks. No version at certification module for a particular row means that there was not a update for it yet and that this particular row won’t cause any transaction rollback.

Database version during execution was 1 (dbv: 1) and current version at certification module is also 1 (cv: 1), which means that this transaction does not conflict with any other ongoing transaction. So transaction will be allowed to commit, and row version on
certification module updated to 2 (cv: 2). On the local server (s1) this means proceed to commit and return success to client. On the remote servers (s2 and s3) the transaction will be queued to be applied by applier module.
Applier module is the responsible to apply positively certified transactions that arrived from the group, like server 2 and 3 on the above figure.

6

Transaction identifier (GTID) is controlled by the certification module, in order to a given transaction has the same identifier on all group members.
All members of the group share the same UUID, so group behaves like a single server where all transactions have the same source UUID, which is the group name.
More details at Getting started with MySQL Group Replication post.

To achieve this, the certification module state is also transferred during distributed recovery when a server joins the group, again fulfilling the DBSM requirement that all servers on group must have the same initial state.

Lets see a example on which we have conflicts between transactions.

7

On the above diagram we have again a group with three servers, on which both server 1 (s1) and server 2 (s2) execute a transaction (T1 and T2) that updates the same row concurrently. Both transactions are executed locally until before commit stage on database version 1 (dbv: 1), and then the write set and write data are broadcast to the group.
Group communication toolkit, on this example, ordered T1 before T2, so since certification module version for the update row is 1 (cv: 1), transaction will be positively certified, certification module row version will be updated to 2 (cv: 2), and the transaction will continue to commit.
Then it is certification turn for T2, database version is again 1 (dbv: 1) but certification module row version is 2. This means that the changed row by this transaction was already updated by a transaction that happened after execution on server 2 and before certification (on all servers), so T2 must be negatively certified. Server 2 will rollback the transaction and return a error to the client, remote servers will discard the transaction.

One thing that you may be thinking is, will not certification module data increase forever? Without more implementation details, yes! But we have that covered, periodically all group members exchange their GTID_EXECUTED to compute the intersection of the transactions that all servers already committed – the Stable Set. Any group member cannot update rows on version lower than that executed set. Then a garbage collector procedure is executed to remove all row versions that belong to the stable set.

Certification stats are available at performance_schema replication_group_member_stats table.

You can check more details about stats and monitoring at Group Replication Monitoring post.

Conclusion

On his first steps, MySQL Group Replication is still in development. On this blog post we explain the transaction life cycle on this new replication technology. Feel free to try it and get back at us so we can make it even better for the community!

References

[1] – Fernando Pedone, Rachid Guerraoui, and Andre Schiper. The database state machine approach, 1999.

About Nuno Carvalho

Nuno Carvalho is a Principal Software Engineer and MySQL Replication Service Team lead at Oracle, the team in charge of MySQL Group Replication plugin. His research interests include replication technologies, dependable systems and high availability. Before joining the MySQL team, he was a post-graduate student and a researcher at the University of Minho, Portugal, where he designed and implemented techniques to improve distributed systems scalability.

4 thoughts on “MySQL Group Replication – Transaction life cycle explained

  1. I am confused by this claim: “that is, if any transaction is committed on any member it must also be committed on all the others non-faulty group members.”

    In the description when commit is agreed upon the replicas have the transaction queued in their relay logs. So the guarantee is that the will commit, not that they have committed the change. Maybe I am missing something but this is important for read-after-write semantics when a client commits to one replica, and then reads from another replica. How does a client guarantee that for replicas that 1) were in the majority and 2) were not in the majority?

    I will keep on reading.

  2. Mark

    I obviously have no hands on experience with MySQL group replication, but since they are based on the same underlying algorithm, for the purpose of this explanation I’m going to assume it is simply a Galera clone.

    First of all, just like you say, the “must commit” part simply means that it will commit. In particular, it means that from a certain point forward – once a transaction is assigned a global ordering id (which implies acked by the other nodes) – the remaining steps are deterministic and can therefore be completed locally on each node without cluster communication. When a node certifies the transaction to not conflict with locks held by other queued transactions, it knows that other nodes will arrive at the same conclusion. This is good for performance, and in fact galera / group replication require less communication round trips between nodes than for example 2-phase commit.

    What is a commit?
    Therefore, when someone explains group replication and says “it is committed”, this is from a cluster point of view (which, after all, is the relevant point of view). The transaction is committed, because it has passed the point of no return. It is true though, that on all but one node, the transaction hasn’t even begun to be applied to the storage engine on the node, let alone being committed (e.g. to InnoDB).

    Durability
    …is it’s own playground of arguing about what durability even means. The Postgres people will tell you that a transaction is “safe” when it is committed to disk (even just on a single node). I’m with Heikki Tuuri in believing that durability is a relativistic concept. If you believe the Postgres people, then a galera / group replication trx would be durably committed, if the queue of incoming transactions is atomically flushed to disk on each commit. Personally I think it is durable by way of having been redundantly replicated.

    Transaction isolation
    Interestingly then, galera / group replication is a synchronous replication protocol that still manages to exhibit eventually consistent behavior. At least galera implements a client option to ensure *causal reads*. With this option, the read will block until committed (i.e. queued) transactions are applied on the node, and then read a state which is equal or later to what was committed when the read was issued. From a performance point of view, the cost of consistency is paid by the reader, not the commit.

    1. Oh, as for #2, I’ve forgotten (and in any case things may have changed) how this works by default and what options are offered, but on a purely theoretical level it would be easy to implement that a node that is not part of the majority partition will refuse all reads until it is again reunited and in sync with the cluster. Arguably, this should at least be the behavior of the casual reads option.

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter. * Time limit is exhausted. Please reload CAPTCHA.