Lessons learned about partitioning – p0
My favorite MySQL book covers a lot of the interesting use cases for partitioning and some of the pitfalls. In the past few months a few more pitfalls have been discovered. First let me start with the common use cases for partitioning and why they exist.
Some data becomes less useful over time. Prior to built in partitioning there weren’t very efficient way to get rid of this data without deleting it one record at a time. It was common cron up something like
delete from foo where time_created < unix_timestamp(now() – interval 30 day);
Depending on the rate at which data is inserted the cron may need to run every few minutes so the delete queries don’t clog up replication. This has a very high cost of forcing the database to do a bunch of work to mark records deleted, maintain indexes, and later purge old data. Back in that day purge thread lag was a problem and forcing it to clean up a bunch of deleted records made it worse.
Then came poor man’s partitioning which has been invented several times all around the world. The idea here is to create a table that maps to several specially named partition tables. Then tables can be dropped instead of records being deleted. This saves the purge thread workload but has a high cost on the application developer to write logic to query the correct partitions and merge the results. They also need the logic of which partition to insert into. While this usually made things faster the developer overhead was a nightmare.
Enter MySQL built in partitioning. It supports different partitioning functions, pruning, and simple syntax for adding and dropping partitions. It works well enough for the cases described in High Performance MySQL. By far the most common use case in my world is dropping old data.
Consider this very simple test case of a summary table for access logs. It is partitioned by day for the previous week of when I was writing this post.
CREATE TABLE `access_logs` (
`page` varchar(25) NOT NULL,
`ts` int(10) unsigned NOT NULL,
`page_views` int(10) unsigned NOT NULL,
PRIMARY KEY (`page`,`ts`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (ts)
(PARTITION p0 VALUES LESS THAN (1405839600) ENGINE = InnoDB,
PARTITION p1 VALUES LESS THAN (1405926000) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (1406012400) ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN (1406098800) ENGINE = InnoDB,
PARTITION p4 VALUES LESS THAN (1406185200) ENGINE = InnoDB,
PARTITION p5 VALUES LESS THAN (1406271600) ENGINE = InnoDB,
PARTITION p6 VALUES LESS THAN (1406358000) ENGINE = InnoDB,
PARTITION p7 VALUES LESS THAN (1406444400) ENGINE = InnoDB) */
It is very common to create new partitions ahead of the current time and drop partitions as they age out. This table can effectively hold a weeks worth of page views that can be accessed by page, timestamp or just page. This is generally a more efficient way to expire old data than deleting rows because dropping an old partition is basically the same thing as dropping a table. When using innodb_file_per_table and xfs this is pretty quick.
For tables with a few partitions this works well enough. When the number of partitions increases it can start to introduce inefficiencies into the system. In some cases partitions are treated like tables. This means that doing an operation on the table will perform the same operation for every underlying partition. If I want to get all of the logs for a page mysql has to look inside every partition to see if there are any logs for that page.
mysql> explain partitions select * from access_logs where page=’foo';
+—-+————-+————-+————————-+——+—————+———+———+——-+——+————-+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+————-+————————-+——+—————+———+———+——-+——+————-+
| 1 | SIMPLE | access_logs | p0,p1,p2,p3,p4,p5,p6,p7 | ref | PRIMARY | PRIMARY | 27 | const | 8 | Using where |
+—-+————-+————-+————————-+——+—————+———+———+——-+——+————-+
1 row in set (0.00 sec)
As the number of partitions increases this can slow down significantly. If most of the partitions contain the page being searched for this work would have been done anyway. If the page being searched for is in few partitions the overhead of looking inside of each partition can be significant. When the data set size exceeds the buffer pool mysql may end up doing several disk reads per partition only to discover that there are no relevant rows contained within that partition. If you extrapolate this out to one partition per day for an entire year it gets very costly.
In order for queries to be efficient on this table they must contain a timestamp. You may end up storing the timestamp of the first time a page was seen in another table to provide hints for partitioning in this table. If your table has a fairly large row size you may end up creating secondary indexes to make the dives into each partition smaller and faster.
Remember before when I said that adding and dropping partitions is usually fast enough? It is fast enough when queries on a table are short lived. It is important to understand when a partitioned table is treated as a whole table and when partitions are like their own tables. When using alter table mysql will create a metadata lock on the entire table. This occurs even when adding a partition. What this means is seemingly unrelated queries can interact with each other. I’ve inserted a bunch of fake data into the table.
mysql> select count(*) from access_logs;
+———-+
| count(*) |
+———-+
| 4096 |
+———-+
1 row in set (0.00 sec)
Now I am going to run a slow query on the first partition of the table. This may happen during backups using mysqldump or by any other plain select query that needs to read a lot of data. I am pinning this query to the first partition to make a point. The result is the same. The slow query reading from partition p0 will be:
select sleep(1) from access_logs partition(p0);
This simulates a slow query reading every row from partition p0. Now I want to add a partition to the end of the table but the alter table will hang until lock_wait_timeout is exceeded which defaults to a year. I have the default set much lower:
mysql> alter table access_logs add partition (partition p8 values less than (1406530800));
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
So what happened? Taking a look at show processlist reveals that alter table was unable to run because the select query holds the metadata lock on the table.
mysql> select state, info from information_schema.processlist;
+———————————+————————————————————————————+
| state | info |
+———————————+————————————————————————————+
| User sleep | select sleep(1) from access_logs partition(p0) |
| Waiting for table metadata lock | alter table access_logs add partition (partition p8 values less than (1406530800)) |
+———————————+————————————————————————————+
3 rows in set (0.00 sec)
In MySQL 5.6 there is a single metadata lock covering all alter table operations. This means that in order to add or drop a partition from a table all other queries must finish executing so the alter table statement can acquire the metadata lock. The simple act of adding a partition can stall every query accessing a table for quite some time if lock_wait_timeout isn’t set correctly.
I would really like to see better handling of locking for adding new partitions. I think the case can be made for acquiring a MDL for dropping a partition but adding a new one should be seamless. Manual partition management is cumbersome at best. Having the ability to automatically drop and create new partitions without fighting with MDL or writing a bunch of scripts would be very nice.
