Technology Musings
It seems that SSIS can’t handle embedded text qualifiers when importing from a flat file. What is an embedded text qualifier? Let’s say you have a CSV file with a few fields. To thwart problems with commas inside your text fields causing confusion with your column delimiters, you implement a text qualifiers – typically double quotes. It makes a row look something like this:
"12036","Company Name, Inc.","555-555-1234","3.14159"
That’s fine, but what happens when the field also contains quotes, such as this:
"52665","Best "Kept" Secret Storage Facility","555-555-9876","2.71828"
Note: This post originally assumed you were working with SQL Server 2008 because the included code was only provided in C#. The post has been updated to include a VB .NET code sample as well.
Now, depending on how the parser interprets this line, it could see the double-quote before “Kept” and expect a field terminator (a comma in our case). Well, it so happens, that for whatever reason, Integration Services exhibits this type of behavior and will fail on this type of line. It’s a well-documented bug and oft-requested fix, but even as of SQL Server 2008, the issue is still present (Note: This apparently should be a supported configuration according to RFC 4180). Ideally, you can simply import a different format – using a different text qualifier is probably the easiest change (ever tried the thorn?). If this is not feasible, a variety of other solutions have been suggested, the most flexible being, writing your own Script Transformation to custom-parse the rows.
The setup is simple: configure the Flat File Source to import the whole record as a single field. Pipe that output to your script component, and then connect that to the destination database or rest of your process. Here’s what my Data Flow Task looks like:
And here are the settings I used for my Flat File Connection:
And the steps to get started:
Now, a quick explanation on what all this was for: Because we are only importing the whole row, we want to configure the Script Component to output the individual columns. We did this part when we completed steps 5-7. Next, we configured an additional output for rows that we are not able to successfully parse. This is why we defined the Error Rows output in steps 8-10. Be aware, that if you’re supremely confident that you can parse every row, you could optionally skip the creation of the Error Rows output, but I would advise against it.
Now, we will create the actual script to parse the rows manually.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 | /* Microsoft SQL Server Integration Services Script Component * Write scripts using Microsoft Visual C# 2008. * ScriptMain is the entry point class of the script.*/ using System; using System.Data; using Microsoft.SqlServer.Dts.Pipeline.Wrapper; using Microsoft.SqlServer.Dts.Runtime.Wrapper; [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute] public class ScriptMain : UserComponent { //Declare LineNum to keep track of the number of rows int LineNum; public override void PreExecute() { base.PreExecute(); //Initalize LineNum LineNum = 0; } public override void PostExecute() { base.PostExecute(); } public override void Input0_ProcessInputRow(Input0Buffer Row) { //Increment LineNum to keep track of the number of rows LineNum += 1; //split the input column. This will be highly dependent on the format and will probably need adjustment string[] columns = System.Text.RegularExpressions.Regex.Split(Row.line, "(?<="),(?=")"); //If the number of elements is not expected, we assume there was a problem if (columns.Length != 4) { Row.ErrorLine = Row.line; Row.ErrorLineNum = LineNum; Row.DirectRowToErrorRows(); } //Everything looks good, let's move forward else { Row.MatchLineNum = LineNum; Row.ID = StripQualifier(columns[0], """); Row.Company = StripQualifier(columns[1], """); Row.PhoneNumber = StripQualifier(columns[2], """); Row.FavoriteIrrationalNumber = StripQualifier(columns[3], """); Row.DirectRowToMatchRows(); } } public static string StripQualifier(string InputString, string Qualifier) { //This is a helper function only to remove surrounding text qualifiers string OutputString; if( InputString.Substring(0, Qualifier.Length) == Qualifier && InputString.Substring(InputString.Length - Qualifier.Length, Qualifier.Length) == Qualifier ) OutputString = InputString.Substring(Qualifier.Length, InputString.Length - (2 * Qualifier.Length)); else OutputString = InputString; return OutputString; } } |
Here is a similar script in Visual Basic .NET that will work for SQL Server 2005:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 | ' Microsoft SQL Server Integration Services user script component ' This is your new script component in Microsoft Visual Basic .NET ' ScriptMain is the entrypoint class for script components Imports System Imports System.Data Imports System.Math Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper Imports Microsoft.SqlServer.Dts.Runtime.Wrapper Public Class ScriptMain Inherits UserComponent 'Declare and initialize LineNum to keep track of the number of rows Dim LineNum As Int32 = 0 Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer) 'Increment LineNum to keep track of the number of rows LineNum += 1 'split the input column. This will be highly dependent on the format and will probably need adjustment Dim columns As String() = System.Text.RegularExpressions.Regex.Split(Row.line, "(?<=""),(?="")") 'If the number of elements is not expected, we assume there was a problem If columns.Length <> 4 Then Row.ErrorLine = Row.line Row.ErrorLineNum = LineNum Row.DirectRowToErrorRows() 'Everything looks good, let's move forward Else Row.MatchLineNum = LineNum Row.ID = StripQualifier(columns(0), """") Row.Company = StripQualifier(columns(1), """") Row.PhoneNumber = StripQualifier(columns(2), """") Row.FavoriteIrrationalNumber = StripQualifier(columns(3), """") Row.DirectRowToMatchRows() End If End Sub Public Function StripQualifier(ByRef InputString As String, ByRef Qualifier As String) As String 'This is a helper function only to remove surrounding text qualifiers Dim OutputString As String If InputString.Substring(0, Qualifier.Length) = Qualifier And InputString.Substring(InputString.Length - Qualifier.Length, Qualifier.Length) = Qualifier Then OutputString = InputString.Substring(Qualifier.Length, InputString.Length - (2 * Qualifier.Length)) Else OutputString = InputString End If Return OutputString End Function End Class |
Be sure and build the script (Ctrl+Shift+B or use the Build menu) before closing out. If not, you may receive a validation error. After a successful build, close and click “OK” on the Script Transformation Editor dialog. You can now connect the outputs from your Script Component to any destination, be it an OLE DB Destination or another flat file. The important thing to note is that you connect the Error Rows output to analyze any potential problems. You have access to both the record number and the original line read in. This should provide you with sufficient information to start analyzing the problem, and hopefully fix the parsing routine in the Script Component to deal with it.
Note, that performing this extra Script Component severely affected the processing rate versus SSIS natively supporting the import. Again, the best solution would be to correct the source, if possible.
12 Responses to Handling Embedded Text Qualifiers
Dennis
February 3rd, 2009 at 8:38 am
How to run the C# script? Mine only takes VB.
Taylor Gerring
February 3rd, 2009 at 10:28 am
@Dennis
I have added a VB.NET code snippet to this post as well. I hope that helps you!
Handling Embedded Text Qualifiers in SSIS 2005 | Idea Excursion
February 3rd, 2009 at 10:32 am
[...] a quick note advising that I’ve updated my Handling Embedded Text Qualifiers post to also include a Visual Basic example, making the information also relevant to SQL Server [...]
Dennis
February 3rd, 2009 at 1:31 pm
Oh, great. I will give the VB.NET a try.
Dennis
February 3rd, 2009 at 3:16 pm
I’m getting “Error 30456: ErrorLine is not a member of Script Component”. Do you know why?
Taylor Gerring
February 3rd, 2009 at 3:22 pm
Did you follow the steps outlined above? It sounds like it can’t find the column for some reason.
Dennis
February 3rd, 2009 at 4:24 pm
Oh, yes. It works now. Thanks.
Orlando
April 27th, 2009 at 9:42 pm
As described in RFC 4180 setion 2 item 7 (http://tools.ietf.org/html/rfc4180#section-2) any characters may appear between text-qualifiers…including line breaks. Your solution will suffice for rows that exist on one line, however csv files can have lines such as this:
1,”Hello, this field
is a “”real”" pain!”,”4/27/2009″
Yes, that’s one row where:
Field 1 = 1
Field 2 (represented on one line with line break escaped for readability) = Hello, this field \r\nis a “real” pain!
Field 3 = 4/27/2009
The destination table is:
create table dbo.LogInfo
(
RecordID int,
LogInfo varchar(500),
LogDateTime datetime
)
I have looked into reading the file where each line equates to a single column and parsing from there as you suggested however the embedded line break prevents me from using that method.
Any further pointers on how to import csv files using SSIS would be much appreciated. Against some long-standing personal bias I am actually considering recommending using DTS, a 10+ year old technology, to solve the issue since is does a capable job of parsing and importing csv files and SSIS does not provide an easy path to process what many would consider a most common file format.
Thanks for reading.
Taylor Gerring
April 28th, 2009 at 12:21 pm
@Orlando
Yes, thanks for pointing out that a line break as data would not work for this solution due to parsing each line as a single record.
And you’re right insofar as using DTS. This is what infuriates many people, is that this very old tool more correctly parses CSV than SSIS, despite the longstanding issue. At this point, it’s clear this is not a priority for Microsoft, so unless someone develops a custom Data Flow Source, all we can do is wait and hope for a patch or fix in the next version of SQL Server.
Andy Galbraith
May 4th, 2009 at 2:57 pm
The biggest problem in the file I am fighting is that everything does not have a qualifier around it:
“Clark, Bob”,”123 Main St”,1234.00,45,”Something Else”,….
…so I am having trouble writing the appropriate parsing function because I cannot parse on just comma or on quote-comma-quote!
Thanks for listening to my frustration…{-:
Marco
June 2nd, 2009 at 12:08 pm
Andy, I found a way to solve your problem:
——————————————————-
This has also the case where there are emtpy fields:
For example
“Hi”,123,,”my name is Marco the “”programmer”"”,”other text”
Will produce the output:
Hi
123
my name is Marco the “programmer”
other text
SQL Lion
April 4th, 2010 at 1:13 pm
To get the workaround and Step by Step description for developing SSIS package in order to overcome the issue with SSIS while importing text files with Flat File Connection Manager and Flat File Source where the “Row Delimiter” property does not work properly for rows having NULL or empty values, follow the below link:
http://www.sqllion.com/2010/04/ssis-vs-text-file-importing-1/
Thanks,
SQL Lion