There are times when the expressive power of the SQL Server Windowing or Analytical Ranking functions is almost breathtaking. Itzik Ben-Gan observed: “The concept the Over clause represents is profound, and in my eyes this clause is the single most powerful feature in the standard SQL language.”
The other day, I solved a potentially complex query problem in an elegant manner using some of the newer SQL Server functions, including row_number, dense_rank, sum, and lag. Before basking in the glow of this remarkable query, let’s take a brief tour of some of these functions. What I hope to instill here is some familiarity with these functions and some potentially unexpected uses for them, so that when you are developing a query, you may find situations where they provide a simpler and more performant result.
Perhaps the simplest to understand, and certainly one of the most frequently used windowing functions, is row_number. It first appeared in SQL Server 2005, and I end up finding a use for it in almost all my non-trivial queries. Conceptually it does two things. First, it partitions the result set based on the values in zero or more columns defined in the query. Second, it returns a sequence number for each row within each partition, based on ordering criteria defined in the query.
If no partitioning clause is present, the entire result set is treated as a single partition. But the power of the function really shows when multiple partitions are needed. My most frequent use of sequence numbers in multiple partitions is to find the last item on a list. Frequently the ordering criterion is time, and the query finds the most recent item within a partition. Examples of this are getting an employee’s current position and getting the most recent shipment for an order.
The partition definition is given with a list of columns, and semantically the values in the columns are ‘and-ed’ together or combined using a set intersection operation—that is, a partition consists of the subset of query rows where all partition columns contain the same value.
The ordering criteria typically consist of columns that are not part of the partition criteria. The ordering for each column can be defined as ascending or descending. If the values in the ordering columns, taken together, are not unique across all rows within a partition, row numbers are assigned arbitrarily among the rows with duplicate values. To make the results deterministic, that is, to yield the same result for each query execution, it is necessary to include additional columns in the ordering clause to ensure uniqueness. Such extra columns are referred to as ‘tie-breakers’. One reliable ‘uniqueifier’ is an identity column, if the table has one. In the example below, I show an imaginary employee database and create row numbers that show both the first and last position per employee.
As in the example, I often generate the row number within a Common Table Expression (CTE), and refer to it in subsequent queries.
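A minimal sketch of that pattern follows. It is not the original example: the `employee_position` table, its columns, and its rows are hypothetical, and the query is wrapped in Python's `sqlite3` module only so that it runs without a server. The window syntax itself is standard and should carry over to SQL Server 2005 and later. Note the `position_id` tie-breaker in both orderings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_position (
    position_id INTEGER PRIMARY KEY,  -- stands in for an identity column (tie-breaker)
    employee    TEXT NOT NULL,
    title       TEXT NOT NULL,
    start_date  TEXT NOT NULL
);
INSERT INTO employee_position (employee, title, start_date) VALUES
    ('Ann', 'Analyst',        '2015-01-05'),
    ('Ann', 'Senior Analyst', '2017-06-01'),
    ('Ann', 'Manager',        '2020-03-15'),
    ('Bob', 'Clerk',          '2018-09-10'),
    ('Bob', 'Supervisor',     '2021-02-01');
""")

query = """
WITH numbered AS (
    SELECT employee, title,
           -- 1 marks the earliest position per employee
           ROW_NUMBER() OVER (PARTITION BY employee
                              ORDER BY start_date ASC, position_id ASC)   AS first_rn,
           -- 1 marks the most recent position per employee
           ROW_NUMBER() OVER (PARTITION BY employee
                              ORDER BY start_date DESC, position_id DESC) AS last_rn
    FROM employee_position
)
SELECT employee,
       MAX(CASE WHEN first_rn = 1 THEN title END) AS first_title,
       MAX(CASE WHEN last_rn  = 1 THEN title END) AS current_title
FROM numbered
GROUP BY employee
ORDER BY employee;
"""

first_and_last = conn.execute(query).fetchall()
for row in first_and_last:
    print(row)
```

Filtering the CTE on `last_rn = 1` alone would return just the current position per employee, which is the "most recent item within a partition" case described earlier.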
Among the ranking functions, second in frequency of use when I am query-writing is the dense_rank function (although rank could be used as well). I used to think that if I wasn’t writing queries for a school calculating class rank, I had no use for the ranking functions. The general power of this function became apparent to me when I began to see other query problems in terms of ranking, for instance, as a means of assigning numbers to partitions of a set, and then using those numbers as unique identifiers for each partition.
I will note that using the result of an arithmetic function as an identifier is not an immediately intuitive concept, but it can really generalize the power of the windowing functions.
Rank is defined as the cardinality of the set of lesser values plus one. Dense rank is the cardinality of the set of distinct lesser values plus one. When using these values as identifiers, either function will work; I prefer dense rank for perhaps no reason other than the aesthetic value of seeing the values increase sequentially. While these definitions are mathematically precise, I believe looking at an example query result will make the difference between the functions intuitively clear.
I found the syntax of the ranking functions confusing initially because I was using the rank to logically partition query results, but the partitioning criteria for this go in the order by clause rather than a partition clause. The ranking functions do provide a partition by clause, as with row_number, whereby the ranking would be within each defined partition.
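To make the difference between the two functions concrete, here is a small sketch. The `step` table and its values are invented for illustration, and Python's `sqlite3` is used only as a convenient way to run standard SQL. Notice that the column that defines the logical groups appears in the order by clause, not in a partition clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE step (work_center TEXT NOT NULL);  -- hypothetical sample data
INSERT INTO step (work_center) VALUES
    ('A'), ('A'), ('B'), ('C'), ('C'), ('C'), ('D');
""")

query = """
SELECT work_center,
       RANK()       OVER (ORDER BY work_center) AS rnk,        -- lesser values + 1
       DENSE_RANK() OVER (ORDER BY work_center) AS dense_rnk   -- distinct lesser values + 1
FROM step
ORDER BY work_center;
"""

ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)
```

Both columns give every row with the same `work_center` the same number, so either can serve as a group identifier. Rank leaves gaps after ties, while dense rank counts up without gaps.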
Analogous to creating sequential row numbers within a partition is the ability to add an Over clause, with Partition By and Order By, to the Sum aggregate, creating a running total. In fact, summing the constant value 1 will yield a result identical to row_number, provided the ordering is unique within each partition. This capability is essential to the query problem solved in the second example. Though not a part of this query, when a partition clause is used for Sum, but not an ordering clause, each row of the result set contains a total for the partition, which is useful for calculating percent of total for each row.
Without getting into details, the SQL Server Development Team implemented these functions such that they are generally far more performant than alternative ways of getting the same result, which often involve correlated sub-queries. I view them, in some respects, as ‘in line subqueries’.
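The three uses of Sum described above can be sketched side by side. The `sale` table is hypothetical, and the query runs through Python's `sqlite3` purely for convenience. The explicit `ROWS` frame is my own choice here: it makes the running total advance one row at a time even if the ordering column had ties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (              -- hypothetical sample data
    sale_id INTEGER PRIMARY KEY,
    region  TEXT    NOT NULL,
    amount  INTEGER NOT NULL
);
INSERT INTO sale (sale_id, region, amount) VALUES
    (1, 'East', 10), (2, 'East', 30), (3, 'East', 60),
    (4, 'West', 25), (5, 'West', 75);
""")

query = """
SELECT region, sale_id,
       -- running total within each region
       SUM(amount) OVER (PARTITION BY region ORDER BY sale_id
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
       -- summing the constant 1 reproduces row_number
       SUM(1)      OVER (PARTITION BY region ORDER BY sale_id
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_count,
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY sale_id)            AS rn,
       -- no ORDER BY: every row sees the partition total
       100.0 * amount / SUM(amount) OVER (PARTITION BY region)             AS pct_of_region
FROM sale
ORDER BY region, sale_id;
"""

totals = conn.execute(query).fetchall()
for row in totals:
    print(row)
```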
A short example demonstrating these functions is shown below. Let’s talk about the data for the example. We have a table containing manufacturing steps for product orders. A given order is specified uniquely by the 3-tuple of order number, sequence number, and division.
Each order in this table lists the manufacturing steps involved in preparing the order for sale. Each step is uniquely specified within the order by an operation number, an arbitrary number whose sequence matches the order in which the manufacturing operations are to be performed. I have included an operation description for each operation simply to give an idea of what said operations would be like in this fictitious company. In the example, I used some coloring to visually indicate how the sample data is partitioned based on a combination of column values.
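As a rough stand-in for that example, the sketch below builds a table shaped like the one just described and applies dense_rank and row_number to it. The table name, column names, and rows are all assumptions of mine, and Python's `sqlite3` is used only to make the standard SQL runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_operation (      -- hypothetical layout and data
    order_no     INTEGER NOT NULL,
    seq_no       INTEGER NOT NULL,
    division     TEXT    NOT NULL,
    operation_no INTEGER NOT NULL,
    description  TEXT    NOT NULL,
    PRIMARY KEY (order_no, seq_no, division, operation_no)
);
INSERT INTO order_operation VALUES
    (1001, 1, 'N', 10, 'Cut'),
    (1001, 1, 'N', 20, 'Weld'),
    (1001, 1, 'N', 30, 'Paint'),
    (1001, 2, 'N', 10, 'Cut'),
    (1001, 2, 'N', 20, 'Pack'),
    (1002, 1, 'S', 10, 'Mold');
""")

query = """
SELECT order_no, seq_no, division, operation_no,
       -- one identifier per distinct (order_no, seq_no, division) 3-tuple
       DENSE_RANK() OVER (ORDER BY order_no, seq_no, division) AS order_id,
       -- position of each step within its order
       ROW_NUMBER() OVER (PARTITION BY order_no, seq_no, division
                          ORDER BY operation_no)               AS step_no
FROM order_operation
ORDER BY order_no, seq_no, division, operation_no;
"""

steps = conn.execute(query).fetchall()
for row in steps:
    print(row)
```

Here dense_rank plays the role of the coloring: every row of the same order receives the same `order_id`.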
Given data organized as above, there is a request to partition the processing steps for an order such that all operations sequentially performed at a work center are grouped together. Said groupings will be referred to as Operation Sequences. To better demonstrate boundary conditions, I have added a bit more data to the table for the second example.
One potential use for such Operation Sequences would be to sum up the time an order spends at each workstation.
The first step in this approach is to identify which Operations involve the work-in-progress arriving at a new workstation. In the unlikely event that one order ends at a given workstation and the next order starts at that same one, we need to identify changes in Order Id as well. To do this, the Lag function, introduced in SQL Server 2012, provides a compact approach.
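A sketch of this first step, under assumed names: the `order_operation` table, its columns, and its rows are hypothetical, and the query is run through Python's `sqlite3` for convenience (Lag uses the same syntax in SQL Server 2012 and later). The sample data deliberately has order 1 end at the same work center where order 2 begins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_operation (      -- hypothetical layout and data
    order_id     INTEGER NOT NULL,
    operation_no INTEGER NOT NULL,
    work_center  TEXT    NOT NULL,
    PRIMARY KEY (order_id, operation_no)
);
INSERT INTO order_operation VALUES
    (1, 10, 'WC-A'), (1, 20, 'WC-A'), (1, 30, 'WC-B'), (1, 40, 'WC-A'),
    (2, 10, 'WC-A'), (2, 20, 'WC-C'), (2, 30, 'WC-C');
""")

query = """
SELECT order_id, operation_no, work_center,
       -- 0 only when the previous row has the same order AND the same work center;
       -- the first row compares against NULL and therefore falls through to 1
       CASE WHEN LAG(order_id)    OVER (ORDER BY order_id, operation_no) = order_id
             AND LAG(work_center) OVER (ORDER BY order_id, operation_no) = work_center
            THEN 0 ELSE 1 END AS is_new_sequence
FROM order_operation
ORDER BY order_id, operation_no;
"""

flagged = conn.execute(query).fetchall()
for row in flagged:
    print(row)
```

The first operation of order 2 is flagged even though its work center matches the previous row, because the order changed.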
If the query emits a one for each changed row, a running total of those values, computed using the Sum function with the over clause, yields a unique identifier for each Operation Sequence.
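Putting the two steps together might look like the sketch below. It is an illustration under assumptions rather than the original query: the table, its columns, and in particular the `hours` column are hypothetical, and Python's `sqlite3` is used only to run the standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_operation (      -- hypothetical layout and data
    order_id     INTEGER NOT NULL,
    operation_no INTEGER NOT NULL,
    work_center  TEXT    NOT NULL,
    hours        REAL    NOT NULL,  -- assumed column, used for the per-sequence total
    PRIMARY KEY (order_id, operation_no)
);
INSERT INTO order_operation VALUES
    (1, 10, 'WC-A', 1.0), (1, 20, 'WC-A', 2.0), (1, 30, 'WC-B', 0.5), (1, 40, 'WC-A', 1.5),
    (2, 10, 'WC-A', 1.0), (2, 20, 'WC-C', 2.0), (2, 30, 'WC-C', 3.0);
""")

ctes = """
WITH flagged AS (
    SELECT order_id, operation_no, work_center, hours,
           CASE WHEN LAG(order_id)    OVER (ORDER BY order_id, operation_no) = order_id
                 AND LAG(work_center) OVER (ORDER BY order_id, operation_no) = work_center
                THEN 0 ELSE 1 END AS is_new_sequence
    FROM order_operation
),
sequenced AS (
    SELECT order_id, operation_no, work_center, hours,
           -- running total of the change flags = Operation Sequence identifier
           SUM(is_new_sequence) OVER (ORDER BY order_id, operation_no
                                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS op_seq_id
    FROM flagged
)
"""

per_row = conn.execute(
    ctes + "SELECT order_id, operation_no, op_seq_id FROM sequenced "
           "ORDER BY order_id, operation_no;"
).fetchall()

per_sequence = conn.execute(
    ctes + "SELECT op_seq_id, order_id, work_center, SUM(hours) AS total_hours "
           "FROM sequenced GROUP BY op_seq_id, order_id, work_center "
           "ORDER BY op_seq_id;"
).fetchall()

for row in per_sequence:
    print(row)
```

Order 1 visits WC-A twice with WC-B in between, so those visits receive different identifiers and are totaled separately, which a plain `GROUP BY work_center` could not do.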
For a fuller treatment of the Ranking/Windowing functions, I recommend Itzik Ben-Gan’s book Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions. If you want to shorten your queries and speed them up, I recommend you get comfortable with the Ranking/Windowing functions, and begin to tap their enormous potential.