Trying out MySQL Push-Down-Join (SPJ) preview

At the 2010 MySQL User Conference, Jonas Oreland presented on the work he’s been doing on improving the performance of joins when using MySQL Cluster – the slides are available for download. While not ready for production systems, a preview version is available for you to try out. The purpose of this blog is to step through  testing an example query as well as presenting the results (SPOILER: In one configuration, I got a 50x speedup!).

SPJ is by no means complete and there are a number of constraints as to which queries benefit (and I’ll give an example of one that didn’t). For details of the current (April 2010) software and limitations, check out Jonas’s slides and then keep up to date by following his blog.

We’re anxious to get feedback – please feel free to post results as comments to this blog but also make sure that you send them to spj-feedback@sun.com – describing your schema, the query or queries you tested, the output from EXPLAIN and your before and after timings.

Joins in MySQL Cluster are implemented as nested-loop joins within the MySQL Server; this can be inefficient as it results in many trips to the data nodes to fetch the required data. SPJ works by pushing the join (actually a spec of the needed data) down into the data nodes where the data can be collected and sent back up to the MySQL Server much more efficiently.

For my tests, I used 2 different configurations. In both cases there are 2 data nodes running on 2 physical hosts. In the first configuration the MySQL Server resides on one of those 2 hosts. In the second configuration, the MySQL Server is moved to a virtual machine running on a 3rd host.

Setting up the Cluster

On each of the 3 hosts, I downloaded the software from ftp://ftp.mysql.com/pub/mysql/download/cluster_telco/mysql-5.1.44-ndb-7.1.3-spj-preview/ and then compiled and installed it. If you’re not comfortable with that then you can find instructions in this earlier blog or if you’re used to using the tools from severalnines then check out the SPJ instructions on Johan’s blog.

Create the schema

The 3 tables I used can be created with these commands from the mysql client:

Each of these tables is then populated with 100,000 rows (the files can be downloaded from here).

Once extracted, the data should be loaded into the database:

Running the tests (Config 1 – local mysqld)

To get a baseline, ensure that SPJ is turned off:

and then get the output from EXPLAIN:

and then execute the query:

Now to see the benefits of SPJ, turn it on:

Check the output from EXPLAIN again:

and then re-run the query:

In this test, the query ran almost 12x faster!

Running the tests (Config 1 – separate mysqld)

The test was then repeated with the MySQL Server running within a VM on a 3rd host – the purpose of this is to represent the more normal configuration where the MySQL servers must communicate over the network to the data nodes. As the purpose of SPJ is to reduce the messaging between the MySQL Server and the data nodes, it’s reasonable to expect the benefits from SPJ to be more pronounced with this configuration.

Again, to get a baseline, ensure that SPJ is turned off:

and then get the output from EXPLAIN:

and then execute the query:

Now to see the benefits of SPJ, turn it back on:

Check the output from EXPLAIN again:

and then re-run the query:

In this test, the query ran almost 50x faster!

Do all queries benefit from SPJ

No and that’s why it’s especially important to get feedback from real users with representative schemas so that SPJ can be extended to cover as many of the significant use cases as possible.

As an example, using the following query I saw no speedup at all (using the local mysqld configuration):

Leave a Reply

Your email address will not be published. Required fields are marked *

Please enter. * Time limit is exhausted. Please reload CAPTCHA.