Monday, 29 April 2013

Handling Flat File Headers and Errors in SSIS

Using SSIS to load a flat file into a database is a common use of the tool. This works great in SSIS and is very fast due to the dataflow doing batch updates. This is easy to set up when the flat file has no headers and footers. I am not referring to the column headers, those can be skipped. I am referring to the headers that are not a part of the data. Below is an example of data from a comma delimited flat file. There are several columns of data in the flat file separated by commas. The headers and footers that appear around the data make it difficult to get to the data.


image


One of the options you have in this situation is to use the error output from the source in the data flow. The problem with this approach is data could still be written from the header if the header does not cause an error.
Here is a data flow example of the flat file loading into the table with the error rows being sent to another file that can be checked later manually for data.

image
After running the dataflow with the above file we can see that the headers were moved to the bad file, but the first data row was moved also. Here is the bad rows file:

image
The rest of the rows were written to the table.

image

If the header is very small then you may get some of the header in the database. Here is an example file with a small header.

image
Loading this using the above data flow causes no rows to be written to the bad rows file and places the header into the table as seen below. This occurred because the header and the first row in the data did not exceed the column width on the table and therefore was treated as the data.

image
You can see that this bad row option does not always work for loading data. You may end up with missing rows or junk data. We need away to remove the header rows. Fortunately this can be done with a script task in SSIS.
The first thing we need to do is create some package variables to be used by our script task. This way if the file format changes we can update the variables and the script task will still run successfully. This script is made to work with delimited files only. If the file is fixed width or ragged right, this approach will not work.

intDelimCount = number of delimiters that should be on each line that contains data
strAppend = Appended to the new file name created from the script task
strDelimiter = delimiter used in the flat file
strDestFolder = destination folder for the new files created by the scrip task
strFileExt = extension of the flat file
strFileName = name of the flat file
strSourceFolder = folder where the flat file exist

image
Now for the script task, this script task will open the file and parse through the rows looking for any rows that have the right amount of delimiters. When it finds a row that has the right amount of delimiters it writes this row to a new file in the destination folder. It adds the append variable to the file name also. If the row does not contain the correct number of variables it is skipped over. Once the script task is complete you will have the original flat file unchanged, and you will have a new file with only the data rows. Below is the code in VB, you can convert it to C# with websites like this one: http://www.developerfusion.com/tools/convert/csharp-to-vb/.


    Public Sub Main()

        'save the package variables as script variables
        Dim strDestFolder As String = Dts.Variables("strDestFolder").Value
        Dim strAppend As String = Dts.Variables("strAppend").Value
        Dim delim As String = Dts.Variables("strDelimiter").Value
        Dim intDelimCount As Integer = Dts.Variables("intDelimCount").Value
        Dim strFileName As String = Dts.Variables("strFileName").Value
        Dim strSourceFolder As String = Dts.Variables("strSourceFolder").Value
        Dim strFileExt As String = Dts.Variables("strFileExt").Value

        'combine the variables to get the file names
        Dim inputFileName As String = strSourceFolder + strFileName + strFileExt
        Dim outputFileName As String = strDestFolder + strFileName + strAppend + strFileExt

        'intitilize the line count
        Dim intLineCount As Integer = 0

        ' try to write the proper lines to a file
        Try

            Dim inputFile As New StreamReader(inputFileName) ' create a streamreader to read the flatfile
            FileOpen(1, outputFileName, OpenMode.Output, OpenAccess.Write) 'create and open the new file to write the data into

            'create variables for the reading and writing loop
            Dim strFullLine As String = Nothing
            Dim strLine As String = Nothing
            Dim intDelimPos As Integer = 0
            Dim intCount As Integer = 0

            While Not (inputFile.EndOfStream) ' read until we reach the end of the flat file

                strFullLine = inputFile.ReadLine ' read one line and save so we can write it to the new file later

                strLine = strFullLine ' save the line in another variable that will be broken up to count delimiters

                intDelimPos = strFullLine.IndexOf(delim) ' get first delim position

                intCount = 0 'reset the count of delimeters

                While intDelimPos <> -1 'loop until there are no delimters found

                    strLine = strLine.Substring(intDelimPos + 1) 'drop everything before the current delimeter

                    intDelimPos = strLine.IndexOf(delim) 'get position of the next delimiter

                    intCount += 1 ' increment count of the number of delimiters found

                End While

                If intCount = intDelimCount Then 'if the right amount of delimeters were counted then write the line to the fixed file

                    PrintLine(1, strFullLine) ' write the data line to the new file

                    intLineCount += 1 'increment the line count showing the number of lines written

                End If

            End While

            If intLineCount > 0 Then ' if any rows were written fire information so we can see this in the progress tab

                Dts.Events.FireInformation(0, "subComponent", CStr(intLineCount) + " Lines Written to:" + outputFileName, String.Empty, 0, False)

            End If

            'close the two files
            FileClose(1)
            inputFile.Close()

        Catch exDTS As DtsException 'catch any errors and fire the error event on the package

            Dts.Events.FireError(CInt(exDTS.ErrorCode), Dts.Variables("System::TaskName").Value.ToString, exDTS.Message.ToString, String.Empty, 0)

        Catch ex As Exception 'catch any script errors and fire the error event on the package

            Dts.Events.FireError(0, Dts.Variables("System::TaskName").Value.ToString, ex.Message.ToString, String.Empty, 0)

        End Try

        Dts.TaskResult = ScriptResults.Success

    End Sub

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

After running this script on the following file:

image
Here is the new file created by the scrip task:

image
The comments in the code should explain what is taking place line by line. This script task can be placed in a For Each Loop and run on a series of files. As long as they all have the same number and type of delimiters.
This is not a perfect system. If there is a header row that contains the right amount of delimiters then it will be written to the new file. This might cause an error during the data flow, or the header may get written to the table. I think a header having the exact same number of delimiters as the data would be rare. If the flat file has column headers, they will be copied to the new file. You will need to check the header rows option in the connection manager for the flat file. Also if the delimiter appears in the data it will not have the correct delimiter count.
If you have any questions about this article, or if you have another flat file situation that is strange, let me know.

Download the Zip file here:

http://www.bidn.com/Assets/Uploaded-CMS-Files/16f06a8f-8828-4331-abb0-4903562e4e06ReadFileHeaders.zip

No comments:

Post a Comment