||SQL Server Tips by Gama and Naughter
SQL Server Collation
A collation is a particular set of rules for using characters for a
language or alphabet. Although the same language might be spoken by
several nations, there are national or cultural variations, not to
mention nations with several languages. Microsoft Windows created
the Language ID Reference Number (LCID), a 32 bit code for nearly
200 languages. The LCID is more than a number that identifies a
national language, it contains information encoded in its bits:
The first ten bits are the primary language ID in the range 0x200 to
0x3FF. Bits ten to fifteen are the sub-language ID in the range 0x20
to 0x3F, for the same language from different regions. Bits sixteen
to nineteen are the sort ID and the remaining twelve bits are
reserved and should be zero.
Collations are responsible for determining the correct characters
and how they are sorted or compared. In SQL Server 2000, different
collations can coexist down to the level of columns.
Each SQL Server collation determines:
* The sort order for Unicode.
* The sort order for ASCII.
* The code page used for ASCII.
ASCII characters in multilingual databases are problematic because
ASCII has different character sets, called code pages. Converting
between code pages is difficult because identical characters from
different code pages might have different ASCII codes and some
characters have no equivalent. With characters in Unicode there is
no such problem.
The Sort ID field determines the sort order, which is very important
when comparing or sorting data. There are five considerations about
the sort order:
* Ascending or descending? (is a>b?)
* Case-sensitive? (is a>A?)
* Accent-sensitive? (is a=á?)
* Character width? (is a>aa?)
* Kana character types? (is ?=??)
SQL Server 2000 has two types of collations:
* Windows collations use the Windows locale.
* SQL collations - provided for backwards compatibility with sort
A SQL collation name consists of four components:
* SortRules name of the alphabet or language.
* Pref - uppercase preference. (optional)
* CodePage - code page. (optional)
* CaseSensitivity + AccentSensitivity or BIN
CaseSensitivity can be either CI (case insensitive) or CS (case
AccentSensitivity can be either AI (accent insensitive) or AS
BIN means that binary sort order is used, instead of text order.
For example SQL_Latin1_General_Pref_CP437_CI_AS is
Latin1_General(alphabet), Pref(uppercase preference), CP437(code
page 437), CI(case insensitive) and AS(accent sensitive).
The above book excerpt is from:
Turbocharge Database Performance with C++ External Procedures
Joseph Gama, P. J. Naughter