This page was last modified on 10 May 2016, at 15:19.
License: GNU Affero General Public License
TokuMX is an open source, high-performance drop-in replacement for MongoDB® that dramatically improves performance and operational efficiency. Using advanced data compression and Fractal Tree® Indexing technology, Percona TokuMX improves the performance of MongoDB applications while significantly reducing storage costs.
TokuMX is a fork of MongoDB 2.2 with many 2.4 features added. Tokutek claims that its database server compresses the typical MongoDB database by 4x or more. TokuMX also supports clustered indexes to minimize I/O when scanning through related documents.
Most TokuMX source files are made available under the terms of the GNU Affero General Public License (AGPL). The TokuKV Fractal Tree Indexing library is made available under the terms of the GNU General Public License (GPL) version 2, with an additional grant of a patent license.
Fractal Tree® Indexing
One of the innovations of TokuMX is that it eliminates a long-held rule of databases: to get good write performance, the working set of your indexes should fit in memory. The standard reasoning goes along the lines of: if your indexes’ working set does not fit in memory, then your writes will induce I/O, you will become I/O bound, and performance will suffer. So, either make sure your indexes fit in memory, or make sure your indexes have an insertion pattern that keeps the working set small, like right-most insertions.
With TokuMX, THIS SIMPLY ISN’T TRUE. The innovation of Fractal Tree indexes is that as your working set grows larger than main memory, write performance stays consistent. This innovation is why Fractal Tree indexes perform so well on write-heavy benchmarks (for both MongoDB and MySQL).
So how does TokuMX achieve this write performance where many other databases struggle? By replacing B-trees, the predominant storage data structure in many databases (MongoDB, MySQL, BerkeleyDB, etc.), with Fractal Tree indexes, a write-optimized data structure.
What do we mean by a write-optimized data structure?
To understand what we mean, we first need to understand why a B-tree struggles when indexes no longer fit in memory. Consider the structure of a B-tree.
A B-tree is a simple (and elegant) data structure. The internal nodes store pivots and pointers, and the leaf nodes store all the data. To insert into a B-tree, one must traverse to the leaf node where the data belongs and place the data into that leaf. If all of the data fits in memory, this is fast. But if most of the data does not fit in memory (say only the internal nodes and a few leaf nodes fit), then retrieving that leaf node requires an I/O. In fact, nearly all insertions will incur an I/O. This is where the I/O bottleneck comes from, and this is where the struggling write performance comes from. If your hard disk can do on the order of a few hundred I/Os per second, then your B-tree can handle at most a few hundred insertions per second. This is why MongoDB and MySQL struggle with iiBench, and why users are justifiably told to “keep the working set of indexes in memory”.
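The back-of-the-envelope arithmetic behind this bottleneck can be sketched in a few lines. The disk numbers below are illustrative assumptions, not measurements from any benchmark discussed here:

```python
# Rough model of B-tree insert throughput when leaves don't fit in memory.
# Assumption: every insert must fetch one uncached leaf node, costing one
# random read, and a spinning disk serves a few hundred random I/Os per second.

def btree_max_inserts_per_sec(disk_iops: int, ios_per_insert: float = 1.0) -> float:
    """If each insert needs ~1 random I/O, inserts/sec is capped by disk IOPS."""
    return disk_iops / ios_per_insert

# A typical 7200 RPM disk: ~200 random I/Os per second (illustrative figure).
print(btree_max_inserts_per_sec(200))  # -> 200.0 inserts/sec, the I/O ceiling
```

No matter how fast the CPU is, the insert rate is pinned to the disk's random-I/O rate once nearly every insert misses the cache.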
So why are Fractal Tree indexes so much better? In short, they drastically reduce the I/O. Here is how.
The key difference between Fractal Tree indexes and B-Trees that explains the difference in write performance can be found in the internal nodes:
- with B-trees, internal nodes store just pivots and pointers for each child
- with Fractal Tree indexes, internal nodes store pivots, pointers, and buffers for each child
In a Fractal Tree index's internal node, each child pointer has an associated buffer.
The buffers batch up write operations, so that a write works as follows:
- in the root node, determine which child the write should traverse down to
- serialize the pending operation into that child's buffer
- if the buffer has space, return; if the buffer is full, flush its pending operations down one level, thereby making space for future writes
The flush of a buffer in the root node may cause a cascading flush. That is, the flush in the root node may flood the child with enough data such that now the child’s buffers are full, and the child needs to flush. This keeps happening until data eventually flushes all the way down to leaves.
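The buffered-insert scheme described above can be sketched as a toy Python model. Everything here (the node layout, the buffer capacity of 4, counting each flush as one I/O) is a simplifying assumption for illustration; real Fractal Tree indexes also handle pivot maintenance, node splits, serialization, and crash safety, which this sketch ignores:

```python
# Toy sketch of buffered (Fractal-Tree-style) inserts with cascading flushes.
class Node:
    def __init__(self, pivots=None, children=None):
        self.pivots = pivots or []                   # split keys between children
        self.children = children or []               # child Nodes; [] means leaf
        self.leaf_keys = []                          # data stored at a leaf
        self.buffers = [[] for _ in self.children]   # one pending-op buffer per child

def child_index(node, key):
    # which child a key should traverse down to
    i = 0
    while i < len(node.pivots) and key >= node.pivots[i]:
        i += 1
    return i

FLUSHES = 0  # count buffer flushes as a proxy for I/Os

def insert(node, key, capacity=4):
    global FLUSHES
    if not node.children:              # leaf: store the key directly
        node.leaf_keys.append(key)
        return
    i = child_index(node, key)
    node.buffers[i].append(key)        # serialize the op into the child's buffer
    if len(node.buffers[i]) > capacity:
        FLUSHES += 1                   # one "I/O" moves a whole buffer down a level
        pending, node.buffers[i] = node.buffers[i], []
        for k in pending:              # may cascade: child buffers can overflow too
            insert(node.children[i], k, capacity)

# Two leaves under one root, pivot 100:
root = Node(pivots=[100], children=[Node(), Node()])
for k in range(200):
    insert(root, k)
print(FLUSHES)  # -> 40 flushes instead of 200 individual leaf writes
```

With this tiny buffer, 200 inserts trigger only 40 flushes, each moving five pending operations at once; larger (realistic) buffers amortize far more aggressively, which is the source of the write-performance advantage.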
So why does this algorithm result in such better performance? The short answer is reduced I/O (really, it’s ALL about the I/O). With I/Os being so expensive, if we must do an I/O we want the benefit we receive to be worth it. With B-trees, a write does an I/O to insert one measly document, row, or key/value pair. With Fractal Tree indexes, because the root node is assumed to always be in memory, we know that when we perform an I/O on a write, we do it to flush a whole buffer’s worth of data, which may contain many documents (or rows, etc.). With each I/O servicing many more writes, Fractal Tree indexes reduce the amount of I/O done by a LARGE factor, thereby eliminating the I/O bottleneck that B-trees have.
Because of this I/O reduction, Fractal Tree indexes don’t require indexes to fit in memory, and TokuMX is able to achieve such high sustained write performance on data that does not fit in memory.
Another interesting thing to note about these algorithmic properties is that if the data resides in memory, then Fractal Tree indexes are not providing any algorithmic advantage over B-Trees for write performance. If everything fits in memory, then algorithmically, both data structures are fast.
Comparison with MongoDB
TokuMX has a few downsides when compared to MongoDB:
- No 2d/2dsphere indexes. We haven't done this yet because we don't see many people using it. If someone needs it, we can certainly add it, but it's missing right now.
- No text indexes. Same reason.
- Slower count() queries. MongoDB performs all modifications under a write lock, so it can simply maintain counters in internal nodes. TokuMX supports concurrent writers and uses MVCC to give readers a consistent snapshot of the data, which means a count query must examine every document to decide whether, according to the MVCC algorithm, that document belongs in the count. For an implementation-fair comparison, you can simulate what TokuMX is doing by running itcount() on a cursor in vanilla MongoDB; we're often faster there because we're faster at range queries.
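A toy model makes that cost concrete. This is not TokuMX's actual MVCC implementation; the field names and visibility rules below are illustrative assumptions, chosen only to show why a snapshot-consistent count must touch every document:

```python
# Toy MVCC visibility check: each document records the transaction id that
# created it and (optionally) the one that deleted it. A reader holding
# snapshot `snapshot_txn` only counts documents visible at that snapshot.

docs = [
    {"_id": 1, "created_txn": 5,  "deleted_txn": None},
    {"_id": 2, "created_txn": 8,  "deleted_txn": 12},    # deleted later
    {"_id": 3, "created_txn": 10, "deleted_txn": None},
    {"_id": 4, "created_txn": 15, "deleted_txn": None},  # created after snapshot 10
]

def mvcc_count(documents, snapshot_txn):
    """Count documents visible at a snapshot: must inspect every document."""
    visible = 0
    for d in documents:
        created_before = d["created_txn"] <= snapshot_txn
        not_yet_deleted = d["deleted_txn"] is None or d["deleted_txn"] > snapshot_txn
        if created_before and not_yet_deleted:
            visible += 1
    return visible

print(mvcc_count(docs, snapshot_txn=10))  # -> 3 (doc 4 is not yet visible)
print(mvcc_count(docs, snapshot_txn=13))  # -> 2 (doc 2 now deleted, doc 4 still too new)
```

No single maintained counter can answer this, because different readers with different snapshots get different counts from the same collection.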
- Future features. There is always the danger that 10gen will develop a feature for MongoDB that we will not implement in TokuMX. From where I'm sitting, I think this is unlikely, but it's a risk you should evaluate, and you can talk to us more if you're worried.
Unless you need 2d, 2dsphere, or text indexes, or you really need to count() frequently, TokuMX will be faster, smaller, more concurrent, and more reliable than MongoDB.
Importing Large Collections
We frequently need to migrate data by exporting and importing collections between replica sets. However, this process can be painful because sometimes the migration rate is ridiculously slow, especially for collections with a lot of small entries and/or complicated indexes. To test importing and exporting, we performed an import/export on two representative large collections with very different object counts.
- Collection1: 143 GB collection with ~300 million small objects
- Collection2: 147 GB collection with ~500 thousand large objects
Both collections were exported from our existing MongoDB collections; collection1 took 6 days to export and collection2 took 6 hours. We used the mongoimport command to import the collections into MongoDB and TokuMX instances. Benchmark results for importing collection1, with a large number of small objects: TokuMX is about 4x faster to import.
# Collection1: exported from MongoDB for 6 days
Database   Import Time
---------------------------------
MongoDB    58 hours 37 minutes
TokuMX     14 hours 28 minutes
Benchmark results for importing collection2, with a small number of large objects: TokuMX and MongoDB are roughly at parity.
# Collection2: exported from MongoDB for 6 hours
Database   Import Time
---------------------------------
MongoDB    48 minutes
TokuMX     53 minutes
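The speedups implied by the two tables above can be checked with simple arithmetic (no assumptions beyond the numbers shown):

```python
# Computing the import speedups from the two benchmark tables.

def to_minutes(hours, minutes=0):
    return hours * 60 + minutes

# Collection1 (many small objects)
mongodb_c1 = to_minutes(58, 37)
tokumx_c1 = to_minutes(14, 28)
print(round(mongodb_c1 / tokumx_c1, 2))  # -> 4.05: TokuMX imports ~4x faster

# Collection2 (few large objects)
mongodb_c2 = to_minutes(0, 48)
tokumx_c2 = to_minutes(0, 53)
print(round(mongodb_c2 / tokumx_c2, 2))  # -> 0.91: roughly at parity
```

The pattern matches the write-optimization story: many small keyed inserts benefit from buffering, while a few huge objects are dominated by raw data transfer.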
Handling Heavy Write Loads
One of our sample write-intensive apps issues a heavy volume of “update” requests with large object sizes. Since TokuMX is a write-optimized database, we decided to benchmark this workload against both MongoDB and TokuMX. We recorded 10 hours of sample traffic and replayed it against both replica sets. In the benchmark results, TokuMX performed nearly 3x faster for this app, with much smaller latencies at every histogram percentile.
# MongoDB Benchmark Results
- Ops/sec: 1695.81
- Update Latencies:
  P50: 5.96ms
  P70: 6.59ms
  P90: 11.57ms
  P95: 18.40ms
  P99: 44.37ms
  Max: 102.52ms
# TokuMX Benchmark Results
- Ops/sec: 4590.97
- Update Latencies:
  P50: 3.98ms
  P70: 4.49ms
  P90: 6.33ms
  P95: 7.61ms
  P99: 12.04ms
  Max: 16.63ms
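The throughput and tail-latency improvements can be read straight off the two benchmark summaries:

```python
# Speedup implied by the update-replay benchmark numbers above.
mongodb_ops = 1695.81
tokumx_ops = 4590.97
print(round(tokumx_ops / mongodb_ops, 2))  # -> 2.71, i.e., nearly 3x throughput

# Latency improvement at the 99th percentile:
mongodb_p99_ms = 44.37
tokumx_p99_ms = 12.04
print(round(mongodb_p99_ms / tokumx_p99_ms, 1))  # -> 3.7x lower P99 latency
```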
Efficiently Using Disk Space
Space efficiency is another big selling point for TokuMX. How much can TokuMX save in terms of disk utilization? To find out, we exported the data from one of our sharded replica sets (2.4 TB of data in total) and imported it into TokuMX instances. The result was stunning: TokuMX used only 379 GB of disk space, about 15% of the original size.
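The claimed savings are easy to verify (assuming 1 TB = 1024 GB):

```python
# Checking the space-savings figures quoted above.
original_gb = 2.4 * 1024  # 2.4 TB of MongoDB data
tokumx_gb = 379           # same data imported into TokuMX
ratio = tokumx_gb / original_gb
print(round(ratio * 100, 1))              # -> 15.4 (% of the original size)
print(round(original_gb / tokumx_gb, 1))  # -> 6.5 (effective compression factor)
```

An effective 6.5x reduction is in line with the "4x or more" compression claim made earlier, since the original MongoDB files also carry fragmentation and preallocation overhead that the compressed import sheds.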