CLR Based Histogram Functions and SQL Server 2005
One of the enhancements to SQL Server 2005 is the ability to create your own aggregates, or User Defined Aggregate as they are called. They are pretty easy to make and I'll go over the basics later in this article. But the real question is why would you want to make your own UDA? To answer that question we have to look some of the details of aggregates in general.
The code for the examples in this article can be found at code samples.
For example SQL Server has only a few built-in aggregates compared to Oracle and you might like to add some of the ones found in Oracle. Oracle has an aggregate named COVAR_POP that you might like to be able to use in SQL Server. However most aggregates that are basically some kind of accumulation of arithmetic calculations can be composed out of the aggregates that already exist in SQL Server. For example:
SELECT (SUM(col1 * col2) - SUM(col2) * SUM(col1) / count(*)) / count(*) FROM myTable
WHERE col1 IS NOT NULL and col2 IS NOT NULL
produces the same result as the Oracle aggregate. So for most arithmetic aggregates you need not create a CLR UDA. In general you can use a composition of the built-in aggregates in SQL Server to create most aggregates that accumlate arithmetic calculations.
A couple of kinds of aggregates that you can't compose out of the aggregate functions built into SQL Server are a median and string concatenation aggragates. I've talked about median calculations last year. You can find the blog articles at Psuedo Median Aggregrate and Yet Another Median Calculation.
The string concatenation aggregate is at least a candidate for a UDA, but a UDA has a limitation that requires its intermediate state to be serializable in 8000 bytes. You can make a UDA that concatenates strings but it seems like it would be a bit scary to use in practice. It could be working fine in production for months and then suddenly fail because it happens produce a string that is too long. I would call this brittle and avoid it, but YMMV. Also, if you don't need a aggregate function, but just an aggregate calculation, there are other ways to do this in SQL Server.
There is another kind of "aggregate" that is useful, a histogram aggregate. I put the term aggregate in quotes because all of SQL Server's built-in aggregate functions and UDA's return scalars and a histogram is a table. But we are going to look at a way to use a UDA to create a histogram and, with a little slight of hand, make it return a table. We will call this a UDA/Histogram.
There are at least two kinds of UDA/histograms we could look at. One is a histogram of the letter usage in words. For example the letter usage for the the pharse "Attention all aardvarks." would show that the letter A was used five times, assuming we ignore case. This UDA/histogram could be used against a table of customer names to find out the of letter usage all customer names.
Another historgram we could look at is more of a classic histogram, it would find the usage of numbers in a set of ranges, maybe as the count of numbers in the range 0 though 1, 1 through 2, and so on up to 9 through 10.
There are a number of ways to implement a histogram in SQL Server 2005. One is via the UDA/histogram that will will be looking at shortly, one is by JOINing with a range table, and the third is to use a CURSOR. In this article we will look at the letter histogram. In later blog articles we will look at using CLR UDA to implement more "classic" histograms.
We will look at the letter usage UDA/histogram now. First of all a UDA has four methods that you must implement: Init, Accumulate, Merge and Terminate. If you use Visual Studio and make a language/database project it will skeleton UDA for you, you will just have to fill in these methods.
This implementation has 26 Int32 fields, one for each letter in the English alphabet, to accumulate the letter counts. It might seem better to use an array to do this but if you do it is a bit more complicated to implement. A fragement of the accumlators and Init method are shown below:
public struct LetterHistogramAgg
{
Int32 As;
Int32 Bs;
Int32 Cs;
...
public void Init()
{
As = 0;
Bs = 0;
...
The Accumulate method converts the incoming string to upper case, the checks out each of its letters to increment the approprate accumlator. Here is a fragement of that:
public void Accumulate(SqlString Value)
{
Char[] letters = Value.Value.ToUpper().ToCharArray();
foreach (Char letter in letters)
{
if (letter == 'A')
{
As = As + 1;
continue;
}
The Merge method just has to add together the accumlators, you can see that in the example code.
The terminate method has to return the histogram. Because this is a UDA that must be a scalar, so what it returns is a ";" separated string with the counts from each letter accumulator, in order. Here is a fragment of that method:
public SqlString Terminate()
{
StringBuilder sb = new StringBuilder();
sb.Append(As.ToString());
sb.Append(";");
sb.Append(Bs.ToString());
sb.Append(";");
At first glance it might seem less than useful to return a histogram as a string, but shortly we will see it is very easy to make a table valued user defined function that converts it to a table.
One of the pieces of test data I have are tables of male and female first names and last names that were used to fill out United States census forms. You can get the raw data for these tables from http://www.census.gov/genealogy/names/dist.all.last, http://www.census.gov/genealogy/names/dist.female.first, and http://www.census.gov/genealogy/names/dist.male.last.
I've named the tables MaleFirstName, FemaleFirstNames, and LastNames. Using the LetterHistogramAgg on the MaleFirstName table produces:
SELECT dbo.LetterHistogramAgg(Name) from MaleFirstNames)
-------------------------------------------------------------------------
664;135;234;333;785;86;117;191;448;104;94;507;229;597;561;52;13;677;286;304;168;73;83;10;205;18;
Next lets look at making a CLR table valued user defined function that can turn this into something more useful. In order to create a table valued function using the CLR, the function must return an object with an IEnumerable interface. This can be a bit tedious to implement, but in C# the yield keyword will in fact do most of the implementation for you. Below is the code for a UDF that converts the string produced by the LetterHistorgramAgg into a table. The name of the function it implements is LetterHistogramTable.
public partial class UserDefinedFunctions
{
public static void GetHistRow(Object obj, out Char c, out Int32 count)
{
HistSample hs = obj as HistSample;
c = hs.letter;
count = hs.count;
}
class HistSample
{
internal Char letter;
internal Int32 count;
internal HistSample(Char letter, Int32 count)
{
this.count = count;
this.letter = letter;
}
}
[Microsoft.SqlServer.Server.SqlFunction(
FillRowMethodName="GetHistRow",
IsPrecise=true,
IsDeterministic=true,
TableDefinition="Letter NCHAR(1), [Count] INT")]
public static IEnumerable LetterHistogramTable(SqlString histogram)
{
if (!histogram.IsNull)
{
String[] counts = histogram.Value.Split(new char[] { ';' });
String alphabet = "ABCDEFGHIJKLMNOPQUSTUVWXYZ";
for (Int32 index = 0; index < 26; index++)
{
yield return new HistSample(alphabet[index], Int32.Parse(counts[index]));
}
}
}
};
Once we have deployed the LetterHistogramTable UDF we can retry making the histograms of the letters in the MaleFirstNames table. We use the LetterHistogramAgg to make the histogram string, then use the LetterHistogramTable to turn the string into a table.
SELECT * FROM dbo.LetterHistogramTable((
SELECT dbo.LetterHistogramAgg(Name) from MaleFirstNames))
Letter Count
------ -----------
A 664
B 135
C 234
D 333
E 785
F 86
G 117
H 191
I 448
J 104
K 94
L 507
M 229
N 597
O 561
P 52
Q 13
U 677
S 286
T 304
U 168
V 73
W 83
X 10
Y 205
Z 18
Let's do the same thing with the FemaleFirstNames table:
SELECT * FROM dbo.LetterHistogramTable((
SELECT dbo.LetterHistogramAgg(Name) from FemaleFirstNames))
Letter Count
------ -----------
A 4146
B 327
C 717
D 811
E 3370
F 142
G 346
H 796
I 2350
J 311
K 405
L 2064
M 775
N 2330
O 986
P 147
Q 43
U 1793
S 1097
T 1217
U 423
V 262
W 100
X 29
Y 686
Z 94
There is another way to implement the letter historgram and that is to use a CURSOR. You create a table to hold the accumulations of letter counts and then use the CURSOR to fill that table. Here is an example:
CREATE TABLE #letterCounts
(
letter CHAR(1) PRIMARY KEY,
count INT
)
INSERT INTO #letterCounts SELECT 'A', 0
INSERT INTO #letterCounts SELECT 'B', 0
.....
UPDATE #letterCounts SET Count=0
DECLARE Name_Cursor CURSOR FOR SELECT Name FROM MaleFirstNames
DECLARE @name VARCHAR(MAX)
OPEN Name_Cursor;
FETCH NEXT FROM Name_Cursor INTO @name;
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @index INT
SET @index = LEN(@name)
WHILE @index > 0
BEGIN
-- go through each letter in string
UPDATE #letterCounts SET count = count + 1 WHERE SUBSTRING(@name, @index, 1) = letter
SET @index = @index - 1
END
FETCH NEXT FROM Name_Cursor INTO @name
END;
CLOSE Name_Cursor;
DEALLOCATE name_Cursor;
SELECT * FROM #letterCounts
letter count
------ -----------
A 664
B 135
C 234
D 333
E 785
F 86
G 117
H 191
I 448
J 104
K 94
L 507
M 229
N 597
O 561
P 52
Q 13
R 677
S 286
T 304
U 168
V 73
W 83
X 10
Y 205
Z 18
In general any aggregate can be implemented using a CURSOR. The difference between using a CURSOR to implement the aggregate and a CLR user defined aggregate is speed. In general a CLR user defined aggregate will be more than two orders of magnitude, that is 100 times, faster than the same aggregate implemented using a CURSOR. On my little portable running SQL Server Developer Edition the Cursor based implementation takes about 50 seconds to run and the user defined aggregate one runs in well under one second. Of course this is just a generalization to show that CLR UDA's in general run faster than CURSOR based one... always check you data in your usage patterns before making any kind of decision about performance
This isn't the whole store on user defined aggregates but it is a start. In later blog articles we will look at "classic" aggregates and also look at some more trade-off between the implementations of aggregates.
Dan
Posted
Aug 26 2006, 11:43 AM
by
dan-sullivan