Using SSIS to load a flat file into a database is a common use of the
tool. This works great in SSIS and is very fast due to the dataflow
doing batch updates. This is easy to set up when the flat file has no
headers and footers. I am not referring to the column headers, those can
be skipped. I am referring to the headers that are not a part of the
data. Below is an example of data from a comma delimited flat file.
There are several columns of data in the flat file separated by commas.
The headers and footers that appear around the data make it difficult to
get to the data.
One of the options you have in this situation is to use the error output from the source in the data flow. The problem with this approach is data could still be written from the header if the header does not cause an error.
Here is a data flow example of the flat file loading into the table with the error rows being sent to another file that can be checked later manually for data.
After running the dataflow with the above file we can see that the headers were moved to the bad file, but the first data row was moved also. Here is the bad rows file:
The rest of the rows were written to the table.
If the header is very small then you may get some of the header in the database. Here is an example file with a small header.
Loading this using the above data flow causes no rows to be written to the bad rows file and places the header into the table as seen below. This occurred because the header and the first row in the data did not exceed the column width on the table and therefore was treated as the data.
You can see that this bad row option does not always work for loading data. You may end up with missing rows or junk data. We need away to remove the header rows. Fortunately this can be done with a script task in SSIS.
The first thing we need to do is create some package variables to be used by our script task. This way if the file format changes we can update the variables and the script task will still run successfully. This script is made to work with delimited files only. If the file is fixed width or ragged right, this approach will not work.
intDelimCount = number of delimiters that should be on each line that contains data
strAppend = Appended to the new file name created from the script task
strDelimiter = delimiter used in the flat file
strDestFolder = destination folder for the new files created by the scrip task
strFileExt = extension of the flat file
strFileName = name of the flat file
strSourceFolder = folder where the flat file exist
Now for the script task, this script task will open the file and parse through the rows looking for any rows that have the right amount of delimiters. When it finds a row that has the right amount of delimiters it writes this row to a new file in the destination folder. It adds the append variable to the file name also. If the row does not contain the correct number of variables it is skipped over. Once the script task is complete you will have the original flat file unchanged, and you will have a new file with only the data rows. Below is the code in VB, you can convert it to C# with websites like this one: http://www.developerfusion.com/tools/convert/csharp-to-vb/.
.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }
After running this script on the following file:
Here is the new file created by the scrip task:
The comments in the code should explain what is taking place line by line. This script task can be placed in a For Each Loop and run on a series of files. As long as they all have the same number and type of delimiters.
This is not a perfect system. If there is a header row that contains the right amount of delimiters then it will be written to the new file. This might cause an error during the data flow, or the header may get written to the table. I think a header having the exact same number of delimiters as the data would be rare. If the flat file has column headers, they will be copied to the new file. You will need to check the header rows option in the connection manager for the flat file. Also if the delimiter appears in the data it will not have the correct delimiter count.
If you have any questions about this article, or if you have another flat file situation that is strange, let me know.
Download the Zip file here:
http://www.bidn.com/Assets/Uploaded-CMS-Files/16f06a8f-8828-4331-abb0-4903562e4e06ReadFileHeaders.zip
One of the options you have in this situation is to use the error output from the source in the data flow. The problem with this approach is data could still be written from the header if the header does not cause an error.
Here is a data flow example of the flat file loading into the table with the error rows being sent to another file that can be checked later manually for data.
After running the dataflow with the above file we can see that the headers were moved to the bad file, but the first data row was moved also. Here is the bad rows file:
The rest of the rows were written to the table.
If the header is very small then you may get some of the header in the database. Here is an example file with a small header.
Loading this using the above data flow causes no rows to be written to the bad rows file and places the header into the table as seen below. This occurred because the header and the first row in the data did not exceed the column width on the table and therefore was treated as the data.
You can see that this bad row option does not always work for loading data. You may end up with missing rows or junk data. We need away to remove the header rows. Fortunately this can be done with a script task in SSIS.
The first thing we need to do is create some package variables to be used by our script task. This way if the file format changes we can update the variables and the script task will still run successfully. This script is made to work with delimited files only. If the file is fixed width or ragged right, this approach will not work.
intDelimCount = number of delimiters that should be on each line that contains data
strAppend = Appended to the new file name created from the script task
strDelimiter = delimiter used in the flat file
strDestFolder = destination folder for the new files created by the scrip task
strFileExt = extension of the flat file
strFileName = name of the flat file
strSourceFolder = folder where the flat file exist
Now for the script task, this script task will open the file and parse through the rows looking for any rows that have the right amount of delimiters. When it finds a row that has the right amount of delimiters it writes this row to a new file in the destination folder. It adds the append variable to the file name also. If the row does not contain the correct number of variables it is skipped over. Once the script task is complete you will have the original flat file unchanged, and you will have a new file with only the data rows. Below is the code in VB, you can convert it to C# with websites like this one: http://www.developerfusion.com/tools/convert/csharp-to-vb/.
Public Sub Main() 'save the package variables as script variables Dim strDestFolder As String = Dts.Variables("strDestFolder").Value Dim strAppend As String = Dts.Variables("strAppend").Value Dim delim As String = Dts.Variables("strDelimiter").Value Dim intDelimCount As Integer = Dts.Variables("intDelimCount").Value Dim strFileName As String = Dts.Variables("strFileName").Value Dim strSourceFolder As String = Dts.Variables("strSourceFolder").Value Dim strFileExt As String = Dts.Variables("strFileExt").Value 'combine the variables to get the file names Dim inputFileName As String = strSourceFolder + strFileName + strFileExt Dim outputFileName As String = strDestFolder + strFileName + strAppend + strFileExt 'intitilize the line count Dim intLineCount As Integer = 0 ' try to write the proper lines to a file Try Dim inputFile As New StreamReader(inputFileName) ' create a streamreader to read the flatfile FileOpen(1, outputFileName, OpenMode.Output, OpenAccess.Write) 'create and open the new file to write the data into 'create variables for the reading and writing loop Dim strFullLine As String = Nothing Dim strLine As String = Nothing Dim intDelimPos As Integer = 0 Dim intCount As Integer = 0 While Not (inputFile.EndOfStream) ' read until we reach the end of the flat file strFullLine = inputFile.ReadLine ' read one line and save so we can write it to the new file later strLine = strFullLine ' save the line in another variable that will be broken up to count delimiters intDelimPos = strFullLine.IndexOf(delim) ' get first delim position intCount = 0 'reset the count of delimeters While intDelimPos <> -1 'loop until there are no delimters found strLine = strLine.Substring(intDelimPos + 1) 'drop everything before the current delimeter intDelimPos = strLine.IndexOf(delim) 'get position of the next delimiter intCount += 1 ' increment count of the number of delimiters found End While If intCount = intDelimCount Then 'if the right amount of delimeters were counted then write the line to the fixed file PrintLine(1, strFullLine) ' write the data line to the new file intLineCount += 1 'increment the line count showing the number of lines written End If End While If intLineCount > 0 Then ' if any rows were written fire information so we can see this in the progress tab Dts.Events.FireInformation(0, "subComponent", CStr(intLineCount) + " Lines Written to:" + outputFileName, String.Empty, 0, False) End If 'close the two files FileClose(1) inputFile.Close() Catch exDTS As DtsException 'catch any errors and fire the error event on the package Dts.Events.FireError(CInt(exDTS.ErrorCode), Dts.Variables("System::TaskName").Value.ToString, exDTS.Message.ToString, String.Empty, 0) Catch ex As Exception 'catch any script errors and fire the error event on the package Dts.Events.FireError(0, Dts.Variables("System::TaskName").Value.ToString, ex.Message.ToString, String.Empty, 0) End Try Dts.TaskResult = ScriptResults.Success End Sub
.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }
After running this script on the following file:
Here is the new file created by the scrip task:
The comments in the code should explain what is taking place line by line. This script task can be placed in a For Each Loop and run on a series of files. As long as they all have the same number and type of delimiters.
This is not a perfect system. If there is a header row that contains the right amount of delimiters then it will be written to the new file. This might cause an error during the data flow, or the header may get written to the table. I think a header having the exact same number of delimiters as the data would be rare. If the flat file has column headers, they will be copied to the new file. You will need to check the header rows option in the connection manager for the flat file. Also if the delimiter appears in the data it will not have the correct delimiter count.
If you have any questions about this article, or if you have another flat file situation that is strange, let me know.
Download the Zip file here:
http://www.bidn.com/Assets/Uploaded-CMS-Files/16f06a8f-8828-4331-abb0-4903562e4e06ReadFileHeaders.zip
No comments:
Post a Comment