SQL Server Analytical Ranking Functions

There are times when the expressive power of the SQL Server Windowing or Analytical Ranking functions is almost breathtaking. Itzik Ben-Gan observed: “The concept the Over clause represents is profound, and in my eyes this clause is the single most powerful feature in the standard SQL language.”

The other day, I solved a potentially complex query problem in an elegant manner using some of the newer SQL Server functions including row_number, dense_rank, sum, and lag. Before basking in the glow of this remarkable query, let’s take a brief tour of the some of these functions. What I hope to instill here is some familiarity with these functions and some potentially unexpected uses for them, so that when you are developing a query, you may find situations where they provide a simpler and more performant result.

Perhaps the simplest to understand, and certainly one of the most frequently used windowing functions is row_number. It first appeared in SQL Server 2005, and I end up finding a use for it in almost all my non-trivial queries. Conceptually it does two things. First, it partitions the result set based on the values in zero to many columns defined in the query. Second, it returns a numeric sequence number for each row within each subset from the partition, based on ordering criterial defined in the query.

If no partitioning clause is present, the entire result set is treated as a single partition. But the power of the function really shows when multiple partitions are needed. My most frequent use of sequence numbers in multiple partitions is to find the last item on a list, frequently the ordering criteria is time and the query finds the most recent item within a partition. Examples of this are: getting an employee’s current position, and getting the most recent shipment for an order.

The partition definition is given with a list of columns, and semantically the values in the columns are ‘and-ed’ together or combined using a set intersection operation—that is, a partition consists of the subset of query rows where all partition columns contain the same value.

The ordering criteria consists of columns that are not part of the partition criteria. The ordering for each column can be defined as ascending or descending. If the values in the columns defined for the ordering criteria, when ‘and-ed’ together do not yield unique values for all rows within a partition, row numbers are assigned arbitrarily to rows with duplicate values. To make the results deterministic, that is, yield the same result for each query execution, it is necessary to include additional columns in the ordering clause to ensure uniqueness. Such extra columns are referred to as ‘tie-breakers’. One reliable ‘uniqueifier’ is an identity column, if the table has one. In the example below, I show an imaginary employee database and create row numbers that show both the first and last position per employee.

As in the example, I often generate the row number within a Common Table Expression (CTE), and refer to it in subsequent queries.

Among the ranking functions, second in frequency of use when I am query-writing is the dense_rank function (although rank could be used as well). I used to think that if I wasn’t writing queries for a school calculating class rank, I had no use for the ranking functions. The general power of this function became apparent to me when I began to see other query problems in terms of ranking. For instance, as a means of assigning numbers to partitions of a set, and then using those numbers as unique identifiers for each partition.

I will note that using the result of an arithmetic function as an identifier is a not immediately intuitive concept that can really generalize the power of the windowing functions.

Rank is defined as the cardinality of the set of lesser values plus one. Dense rank is the cardinality of the set of distinct lesser values. When using these values as identifiers, either function will work—I prefer dense rank for perhaps no reason other than the aesthetic value of seeing the values increase sequentially. While these definitions are mathematically precise, I believe looking at an example query result will make the difference between the functions intuitively clear.

I found the syntax of the ranking functions confusing initially because I was using the rank to logically partition query results, but the partitioning criteria for this in the order by clause rather than a partition clause. The ranking functions do provide a partition by clause, as with row_number, whereby the ranking would be within each defined partition.

Analogous to creating sequential row numbers within a partition is the ability add a Partition by and Over By clause to the Sum aggregate, creating a running total. In fact, summing the constant value 1 for will yield a result identical to row_number. This capability is essential to solving the query problem solved in the second example. Though not a part of this query, when a partition clause is used for Sum, but not an ordering clause, each row of the result set contains a total for the partition which is useful for calculating percent of total for each row.

Without getting into details, the SQL Server Development Team implemented these functions such that they are generally far more performant than alternate ways of getting the same result using, which often involves correlated sub-queries. I view them, in some respects, as ‘in line subqueries’.

A short example demonstrating these functions is shown below. Let’s talk about the data for the example. We have a table containing manufacturing steps for product orders. A given order is specified uniquely by the 3-tuple of order number, sequence number, and division.

Each order in this table lists manufacturing steps involved in preparing the order for sale. Each step is uniquely specified within the order with an op­eration number, an arbitrary number, the sequence of which matches the order the manufacturing operations are to be performed. I have included an operation description for each operation simply to give an idea of what said operations would be like in this fictitious company. In the example, I used some coloring to visually indicate how the sample data is partitioned based on a combination of column values.

RankingFunctionsExample1

Given data organized as above, there is a re­quest to partition the processing steps for an order such that all operations sequentially performed at a work center are grouped together. Said groupings will be referred to as Operation Sequences. To better demonstrate boundary conditions, I have added a bit more data to the table for the second example.

One potential use for such Operation Sequences would be to sum up the time an order spends at each workstation.

The first step in this approach is to identify which Operations involve the work-in-progress arriving at a new workstation. In the unlikely event that one order ends at a given workstation and the next order starts at that same one, we need to identify changes in Order Id as well. To do this, the Lag function, introduced in SQL Server 2012, provides a compact approach.

By emitting a one for each changed row, a running total, using the Sum function with the over clause, yields a unique identifier for each Operation Se­quence.

RankingFunctionsExample2

For a fuller treatment of the Ranking/Windowing functions, I recommend Itzik Ben-Gan’s book SQL Server 2012 T-SQL using Windowing Functions. If you want to shorten your queries and speed them up, I recommend you get comfortable with the Ranking/Windowing functions, and begin to tap their enormous potential.

Remembering Lesley Gore

A nod to Lesley Gore who recently departed the mortal realm. At the age of 17, in 1963, she hit the charts with the iconic “Its my party and I’ll cry if I want to”, and the defiant “You don’t own me.” To give an idea of the enormous shadow he casts over popular music, Quincy Jones discovered here and produced these early hits. The attached youtube performance may actually be a live performance of Lesley, as she’s a bit more defiant than the hit version. Enjoy, and bless her soul.

Of Trojan Horses, chatty devices, and Wireshark, part 1

One of the things that bugs me is wondering if at night some trojan horse on my mac is sending all my data to some Eastern Bloc hacker while I am sleeping. It strikes me that the best tool to look into this is the free network sniffer program, Wireshark. I also got a managed gigabit switch with port mirroring ($130) so I can see that the other computers in the home network are sending out into the world when we aren’t looking.

My router/firewall is configured for stealth mode, which means it will not respond to pings from outside, and it has stateful packet inspection. I am pretty confident that nobody can get in from the internet, except maybe for some back-door that the federal government secretly forced vendors to install on all routers… The only internet traffic allowed in the firewall is that which is part of an exchange initiated from the local network.

So I bring up Wireshark. First thing I notice is that the router is constantly broadcasting arps for non-existent IP addresses on the local network. I am wondering if this is the result of some internet traffic getting in, or maybe a bug in the router program. Also I wonder what sort of arp cache it has as it arps existing addresses every second or so as well. There is no router setting about how long the arp cache entries are kept, so nothing to be done.

I get a new router, and this seems a bit less chatty, at least in the arp department. Then I notice the router is emitting STP broadcasts. How cute, my 4 port switch on the back of the router wants to have a spanning tree protocol root bridge election. Just for the fun of it, I log into the managed switch and enable STP. By the time I get back to the Wireshark window, the STP broadcasts have stopped, so the switches must have already had their election.

Now the mac (mavericks) is constantly emitting broadcasts for things like network printers and other discovery protocols. As is too often the case with OSX, there is no user interface to shut these things off, and one has to fire up VI and turn off some daemons. Not a priority right now and I am not eager to delve into FreeBSD network configuration.

At this point, I am looking at a well behaved mac, chatty with its broadcasts, but nothing leaving the lan. Then I bring up Safari… Now I am seeing all sorts of TCP misbehavior with the outside world. TCP conversations initiated on the mac and going to ip addresses located in Ireland and various soviet countries. Packets are highlighted in red indicating protocol misbehavior such as out of sequence acks, and uncompleted TCP handshakes. Wireshark has no way of knowing what processes running on the mac are initiating network traffic, but I notice it goes away when Safari closes. I look at Safari a bit and see some sort of uninvited Safari extension has been installed.

I purchased some backup software online, and apparently the vendor was in the hacker’s paradise in Russia and Eastern Bloc countries–I learned after I made payment. I guess installing the program, which of course required administrator privileges to install, put its own little trojan horse in my browsers. I make a note to look for other inappropriate things this install may have slipped under the covers.

Once I deleted the Safari extension, the interior of the mac appeared to transmit nothing but the usual chatty local discovery broadcasts I mentioned before. Not a bad catch for the first couple of times I cast the net for Trojan Horses…

😉

Inflation in ancient Rome

My view of human nature, and behavior in groups and governments, is that it has changed little over the last few thousand years. Perhaps less barbaric in the last few hundred years, but the dynamics are basically the same. The advance of technology makes dramatic changes in the appearance of things, and the fallacious view currently held that human progress inevitably advances can distract from what I consider underlying common threads.

Machiavelli’s Prince, published in 1532, provides political advice to a prince in a monarchical setting. Yet the advice he gives, which some decry as horribly cynical, is in most respects the same sorts of things things that help a politician get elected in 21st century America, a nominally democratic form of government. For instance Machiavelli advises the prince to attend church in a public manner to assure the populous that he shares their beliefs and is devout.

Swift published Gulliver’s Travels in 1726, and his wry and dark satire portrays human vanities and vices that we can easily recognize in our contemporary world, with some political commentary that again translates easily from a monarchy to our own democracy. In one journey he even goes so far as to satirize academic types who are proud to have developed things that are patently useless such as hairless sheep.

Given the notion that human and governmental behavior has evolved little, if at all, over the centuries, it follows that a familiarity with history can prove instructive when one takes today’s world and attempts to predict future developments. One such category of prediction is the notion, advanced by adherents to the Austrian School of economics, of which I am one, is that when governments control currency, they inevitably enrich themselves by debasing the currency–often in order to finance foreign wars. In older times, when currency consisted of precious metal coinage, governments would direct their mints to dilute the precious metal content of the coins, or reduce their diameter while maintaining the face value of said coins. When the public recognized this and hoarded the older more valuable coins, the government would attempt to confiscate said coins and make it illegal for the public to possess them.

Fast forward a few hundred years, and President Franklin Roosevelt, in 1933, criminalized the possession of gold coin. In 1971, President Nixon formally severed any link between the value of the dollar and underlying stores of gold to back up its value. This set the stage for the inflation of the 1970s which continues unabated.

Given these thoughts, I found the following article of interest. There is no consensus view that the causes of the decline and fall of the Roman Empire are accurately known, but the currency side of things has not been much discussed, so I commend you to this interesting discussion:

http://www.zerohedge.com/news/2014-08-31/currency-reform-ancient-rome