Managed Code @ SQL Server – Part 3: User-defined Aggregates

In the previous post of this series I covered the definition of custom types to SQL Server by using the CLR. This allows us to have complex types that can give better meaning and organization to our data. At the end of that post, a question came out: what about the aggregate functions? As you might expect, if one can define its custom type, one can also define its custom aggregation.

First of all there’s something to know about the way SQL Server processes these aggregates. Depending on its size, the data to be aggregated might be divided into chunks. The values on each chunk are aggregated and generate a “subtotal” (since aggregations are cumulative, this can be done to benefit from parallel processing). Then, some “subtotals” are aggregated and generate another “subtotal”. This goes on until we have the final value.

In user-defined aggregates, the behavior described above is accomplished by the methods Accumulate, Merge and Terminate. These methods belong to a contract that all the custom aggregates must fulfill, and have the following semantics:

  • Accumulate – processes a value (of a user-defined type or not). The value can be received as a SqlString or an instance of the aggregate “target type”, if a conversion is available.
  • Merge – combine the current values of two instances of the custom aggregate. The current instance should accumulate the value of the instance passed as parameter.
  • Terminate – return the final value of the aggregation.

That said, lets define a MaxRectArea aggregate that returns the area of the biggest Rectangle (from the previous post).

[Serializable]
[SqlUserDefinedAggregate(Format.Native)]
public struct MaxRectAreaAggregate
{
    private int maxArea;

    public void Accumulate(Rectangle r)
    {
        if (r.IsNull) return;
        maxArea = Math.Max(maxArea, r.Height * r.Width);
    }

    public void Merge(MaxRectAreaAggregate agg)
    {
        maxArea = Math.Max(maxArea, agg.maxArea);
    }

    public SqlInt32 Terminate()
    {
        return maxArea == -1 ? SqlInt32.Null : maxArea;
    }

    public void Init()
    {
        maxArea = -1;
    }
}

The logic is pretty simple: we have a field that stores the biggest area found in the data, with “-1” representing the absence of Rectangles in the aggregation. Each time we have a new value or we merge two aggregate instances, we keep the higher value. There are two more details that you may have noticed. First, the class is serializable. This is necessary because, during the “chunking” process described before, SQL Server may need to save intermediate values on disk (this is a simple aggregate, but others might use more memory that needs to be managed). Second, there’s the Init method. When combining two subtotals, one of the instances of the aggregate becomes unnecessary. In this case, it can be reused, but only after a type-specific “reset” (we don’t want to mix old values with the ones in the new chunk of data).

As for custom types, you can deploy the aggregate from Visual Studio or using the statements below. The usage is as any other aggregate function, except that you have to specify the schema.

CREATE ASSEMBLY CLRSQLSample FROM '…\CLR_SQL.dll'

CREATE TYPE Rectangle external name CLRSQLSample.Rectangle

CREATE AGGREGATE MaxRectArea(@r Rectangle) RETURNS int
EXTERNAL NAME CLRSQLSample.MaxRectAreaAggregate

select dbo.MaxRectArea(r)
from rectanglesTable

Finally, the SqlUserDefinedAggregate attribute has a few properties that allow us to provide information about the aggregate behavior, as shown below.

[Serializable]
[SqlUserDefinedAggregate(Format.Native, 
    Name="MaxRectArea",
    IsInvariantToDuplicates = true,
    IsInvariantToNulls = true,
    IsNullIfEmpty = true,
    IsInvariantToOrder = true)]
public struct MaxRectAreaAggregate
{…}

This information can be used by SQL Server to decide how to manage the data when using our aggregate (should it be ordered first? should it have all items or can be a set without duplicates?). In the specific case of IsNullIfEmpty set to true, I expected that if the source set is empty, SQL Server wouldn’t invoke the aggregation, just returning NULL. I run a few tests and the aggregate code was invoked anyway. The same for IsInvariantToNulls, so we need to consider that situation in the Accumulate method (I did that before, as you may have noticed).

This post ends this series on using the CLR within SQL Server. In the future I might run some performance tests on this; if so, I’ll post the results. Hope you find this helpful when you need some complex data representation in a sort of “higher level” fashion (our rectangle could be made of two separate fields in a row…) and when some complex calculations or processing aren’t satisfied in SQL Server.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s