Monday 29 April 2013

Sorting a String as a Number with T-SQL and SSIS

I was working on a Cube in Analysis Services for a client recently and needed to sort on a field that was a varchar but contained numeric data. I wanted to sort as if it was numeric. I could not just convert this code to a number and sort on that, because the codes had multiple decimals as seen in this image below.

image

Notice the numbers are sorted as a string and not numeric. You want the number 1.1.2 to come before 1.1.10. Instead you can see it is lower in the order due to the string order. You will also notice 2.2.0 should be before 2.10.0. This is happening because a string is evaluated as alphabetical when sorted. So the number 10 would come before the number 2. This is because 1 is less than 2. The zero in ten is not even checked because is it alphabetizing when ordering these fields.

To fix this issue you need get the individual numbers between the decimals in the code as integers and sort on each one. To get the first number we simply use a char index to get up to the first decimal with this SQL code. This code it getting the location (CharIndex) of the decimal and then getting the substring up the character before the decimal. Then is it converting it to an Integer.

convert(int,SUBSTRING(code,1, CHARINDEX(‘.’, code) – 1)) as Num1,

Then we need to get the number after the first decimal but stop at the second decimal. This is a little harder as you can tell by the below code. First we get the location of the first decimal plus one to use as the starting point in our substring. The length of the substring takes a little math. It is the length of the code minus the first decimal location minus the index of the last decimal.

convert(int,SUBSTRING(code, CHARINDEX(‘.’, code) + 1, LEN(Code) – CHARINDEX(‘.’, code) – Charindex(‘.’,REVERSE(code)))) as Num2,

Then we need to get the number after the last decimal. The location of the start of the substring is calculated by subtracting the length of the code minus the location of the last decimal place plus 2. The length of the code is used as the length of the substring, this is too long but since it is the last digit it works fine.

Convert(int,SUBSTRING(code, len(code) – Charindex(‘.’,REVERSE(code)) + 2, len(code))) as Num3

The complete Query will be:

Select Code,

convert(int,SUBSTRING(code,1, CHARINDEX(‘.’, code) – 1)) as Num1,

convert(int,SUBSTRING(code, CHARINDEX(‘.’, code) + 1, LEN(Code) – CHARINDEX(‘.’, code) – Charindex(‘.’,REVERSE(code)))) as Num2,

Convert(int,SUBSTRING(code, len(code) – Charindex(‘.’,REVERSE(code)) + 2, len(code))) as Num3

from CodeOrder

Order by Num1, Num2, num3

And the results of this query are:

image

You can see that the codes are in the numerical order we wanted. Your SQL code may need to be adjusted for the number of decimals in your field.

You can add these new columns to your dimension and use them as the key to your attributes and order by the key. You can change the dimension table in the DSV to a Named Query and add these new number fields.

If you prefer, you can go back to the SSIS package loading this dimension and create these new number columns on the dimension table. In SSIS the derived column transform could be used to do the same conversion that I am doing in the Named Query to get the numeric fields.

Which is better, SSIS or SQL? Should you do this in derived columns in your package or should you use SQL statements like the one above to perform this work?

Maintainability and Performance are the two items to consider when making this decision. The SQL Query will perform much better than the derived columns but the query could confuse others that may need to maintain this after you. The SSIS derived column tends to be a little easier to understand and managing one derived column in an SSIS package could be considered easier. This is debatable and I know hard core T-SQL Gurus are going to disagree.

Here is the derived column that does the same thing as the SQL statement above.

image

(DT_I4)SUBSTRING(Code,1,FINDSTRING(Code,”.”,1) – 1)

(DT_I4)SUBSTRING(Code,FINDSTRING(Code,”.”,1) + 1,FINDSTRING(Code,”.”,2) – FINDSTRING(Code,”.”,1) – 1)

(DT_I4)SUBSTRING(Code,FINDSTRING(Code,”.”,2) + 1,LEN(Code))

The FINDSTRING function allows you to select the occurrence you want to find so there is no need for the reverse and the extra subtraction that was needed in the SQL query.

The results are the same and these columns can now be added to the dimension table and be used to sort. The SQL statement did perform 20% faster than the derived column. But the Derived column could be considered easier to maintain depending on your level of T-SQL and your level of SSIS.

Also, as pointed out in the comments, you can use ParseName too.

SELECT code, 
Cast(PARSENAME(code, 3)as int) as Num1, 
cast(PARSENAME(code, 2) as int) as Num2, 
cast(PARSENAME(code, 1) as int) as Num3 
FROM CodeOrder
Order by Num1, Num2, num3

No comments:

Post a Comment