Wednesday, May 25, 2011

Doing Partition Elimination

Let us assume, that we want to move old data to secondary filegroups. In this case, we would then partition on the timestamp of a row. However, our applications querys a lot on the primary key column "ID", so we prefer to make the ID instead of the timestamp the partition column.

The table therefore partitioned like this:

Partition 1: ID < 99 (before 01-February-2011)
Partition 2: ID 100..199 (February 2011)
Partition 3: ID 200..299 (March 2011)
Partition 4: ID 300..399 (April 2011)
Partition 5: ID > 400 (after 01-May-2011)

At the same time, there are a lot of queries with timely conditions such as
select * from table where timestamp > '2011-04-01' and error is not null

The SQL Server query optimzer would now search the whole table for entries that are younger then the 1st of April 2011 (and have the error column filled). Since the timestamp is not part of the partition column, the query optimizer doesn't know that all data from April 2011 and younger is in partition 4 and 5 and, hence, it can't do partition elimination but has to go through all partitions.

How can we help the query optimizer that he only has to look up the concrete partition.

Until now, I found three answers:

  • You can add a condition to the partition to the query:
    select * from table where timestamp > '2011-04-01' and error is not null and $partition.function(ID)>=4
    This way, the SQL Server goes through all rows with an clustered index seek and returns rows from partition 4 and greater. It's fast, but you have to change the query.

  • You can add a condition to the ID to the query:
    You can also limit the query to the boundary value of the partition - adding and ID>=300, and thus, the SQL Server uses a clustered index scan, which is faster than a clustered index seek (see Scans vs. Seeks from Freedman), and furthermore, only reads through the partitions 4 and 5.

  • You could add a non-clustered index on the timestamp:
    Thus, the ID is automatically included because it's in the clustered index. Then, you don't have to change your queries, and the SQL Server can use the index to find the IDs quite fast. This is the most convenient option because you don't change the query; however, the SQL Server needs more steps to find the result.
If you have more ideas, please let me know in a comment. :-)

No comments: