

SQL Server Tips by Gama and Naughter

Calculating
Median with a SP
There are many ways to calculate the median from a table, each one
being more effective on a particular situation. The algorithm is
minimal and linear but there are many different possible
implementations.
Knowing that a decision based on the parity of the number of
elements will retrieve one central element or the average of two
central elements an IFELSE statement sounds perfect:
CREATE PROCEDURE myMedianRoyalty2
AS
Calculate the median of the royalty field from table roysched
using MAX and TOP
DECLARE @totalRec int
SELECT @totalRec=COUNT(royalty) FROM roysched
SEE CODE DEPOT
if even then get the average of both central values
SELECT AVG(royalty*1.0) as Median FROM (get average
SELECT MAX(royalty) as royalty FROM (SELECT TOP 50 PERCENT royalty
FROM roysched ORDER BY royalty)tmpCount
UNIONunion of both central values
SELECT MIN(royalty) as royalty FROM (SELECT TOP 50 PERCENT royalty
FROM roysched ORDER BY royalty desc)tmpCount
) tmp2
This code works but it would be much better with only one SELECT
statement and using the CASE statement within the SELECT:
CREATE PROCEDURE myMedianRoyalty
AS
Calculate the median of the royalty field from table roysched
using MAX and TOP
SELECT CASE (SELECT COUNT(royalty) FROM roysched)
WHEN 1 THEN
(SELECT MAX(royalty) FROM (SELECT TOP 50 PERCENT
royalty FROM roysched ORDER BY royalty)tmpCount)
ELSE
(SELECT AVG(royalty*1.0) as Median FROM (get
average
SELECT MAX(royalty) as royalty FROM (SELECT
TOP 50 PERCENT royalty FROM roysched ORDER BY royalty)tmpCount
UNIONunion of both central
values
SELECT MIN(royalty) as royalty FROM
(SELECT TOP 50 PERCENT royalty FROM roysched ORDER BY royalty
desc)tmpCount
) tmp2)
END AS Median
When possible CASE should be used instead of IFELSE, particularly
if there is more than one condition.
Another solution that might be the best for a huge number of records
is using a cursor. Although cursors are very often not recommended,
because of the performance degradation and record locking, this is a
different situation because there will be only one or two record
retrievals.
CREATE PROCEDURE myMedianRoyalty3
AS
Calculate the median of the royalty field from table roysched
using a Cursor
DECLARE @tot1 float, @tot2 float, @midRec int, @totalRec int,
@median float
SELECT @totalRec=COUNT(royalty) FROM royschedget total number of
records
SET @midRec=@totalRec/2calculate the central point
DECLARE cur_med SCROLL CURSOR declare the cursor
FOR SELECT royalty FROM roysched ORDER BY royalty
OPEN cur_med
SEE CODE DEPOT
ELSE if even then get the average of both central
values
BEGIN
SET @midRec=@midRec
FETCH ABSOLUTE @midRec FROM cur_med INTO @tot1get 1st
central value
FETCH NEXT FROM cur_med INTO @tot2get 2nd central value
SET @median=(@tot1+@tot2)/2.0calculate average
END
CLOSE cur_med
DEALLOCATE cur_med
SELECT @median as Median
The next solution uses a temporary table. The best application for a
temporary table is to store intermediate data or data with very
intensive computations, such as complex calculated fields.
CREATE PROCEDURE myMedianRoyalty4
AS
Calculate the median of the royalty field from table roysched
using a temporary table
DECLARE @totalRec int, @midRec int
SELECT @totalRec=COUNT(royalty) FROM royschedget total number of
records
SET @midRec=@totalRec/2calculate the central point
print @midRec
SELECT IDENTITY(int, 1, 1) as Rid, royalty create temporary table
from roysched and with an identity field
INTO #RoySchedTmp
FROM roysched ORDER BY royalty
SEE CODE DEPOT
ELSE if even then get the average of both central
values
BEGIN
SELECT AVG(royalty*1.0) as Median calculate average
FROM #RoySchedTmp WHERE (Rid=@midRec) OR
(Rid=@midRec+1)
END
DROP TABLE #RoySchedTmp
A classic and very elaborate solution is the one created by David
Rozenshtein, Anatoly Abramovich and Eugene Berger in their article
“Computing the median”. Their solution consists of a selfjoin plus
characteristic functions (This is a functions that emulates
conditional clauses). The following code was adapted from the
article:
CREATE PROCEDURE myMedianRoyalty5
AS
Calculate the median of the royalty field from table
roysched using a self join
SELECT
CASE WHEN COUNT(*)%2=1
THEN x.royalty
ELSE
(x.royalty+
MIN(CASE WHEN y.royalty>x.royalty
THEN y.royalty
END))/2.0
END median
FROM roysched x, roysched y
GROUP BY x.royalty
HAVING
SUM(CASE WHEN y.royalty <= x.royalty
THEN 1 ELSE 0 END)>=(count(*)+1)/2
AND
SUM(CASE WHEN y.royalty >= x.royalty
THEN 1 ELSE 0 END)>=(count(*)/2)+1 
CREATE PROCEDURE myMedianRoyalty6
AS
Calculate the median of the royalty field from table
roysched using a self join
SELECT
1.0*(SIGN((1SIGN(SIGN(SUM(SIGN(1SIGN(y.royaltyx.royalty)))
+(count(*)+1)/2
)))+count(*)%2))
*x.royalty
+
1.0*(SIGN((1SIGN(SIGN(SUM(SIGN(1SIGN(x.royaltyy.royalty)))
+(count(*))/2+1
)))+1count(*)%2))
*(
x.royalty +
MIN(((SIGN(1SIGN(y.royaltyx.royalty)))
*y.royaltythis is to increase the smaller numbers
+1)
*y.royalty )
)/2.0
as median
FROM roysched x, roysched y
GROUP BY x.royalty
HAVING
(SUM(SIGN(1SIGN(y.royaltyx.royalty)))>=(count(*)+1)/2)
and
(SUM(SIGN(1SIGN(x.royaltyy.royalty)))>=count(*)/2+1) 
The left side takes advantage of the TSQL specific statement CASE
that returns values according to a condition. The right side is very
similar to the original code from the article. Both are equivalent
and work fine for tables with unique values. Selfjoins are rare in
development code because they are too complex and consume too many
resources from the system. There are simpler and faster
alternatives.
The above book excerpt is from:
Super SQL
Server Systems
Turbocharge Database Performance with C++ External Procedures
ISBN:
0976157322
Joseph Gama, P. J. Naughter
http://www.rampantbooks.com/book_2005_2_sql_server_external_procedures.htm
