Handling Embedded Text Qualifiers

In: SQL Server|SSIS


12 Nov 2008

It seems that SSIS can’t handle embedded text qualifiers when importing from a flat file. What is an embedded text qualifier? Let’s say you have a CSV file with a few fields. To thwart problems with commas inside your text fields causing confusion with your column delimiters, you implement a text qualifiers – typically double quotes. It makes a row look something like this:

"12036","Company Name, Inc.","555-555-1234","3.14159"

That’s fine, but what happens when the field also contains quotes, such as this:

"52665","Best "Kept" Secret Storage Facility","555-555-9876","2.71828"


Note: This post originally assumed you were working with SQL Server 2008 because the included code was only provided in C#. The post has been updated to include a VB .NET code sample as well.

Now, depending on how the parser interprets this line, it could see the double-quote before  “Kept” and expect a field terminator (a comma in our case). Well, it so happens, that for whatever reason, Integration Services exhibits this type of behavior and will fail on this type of line. It’s a well-documented bug and oft-requested fix, but even as of SQL Server 2008, the issue is still present (Note: This apparently should be a supported configuration according to RFC 4180). Ideally, you can simply import a different format – using a different text qualifier is probably the easiest change (ever tried the thorn?). If this is not feasible, a variety of other solutions have been suggested, the most flexible being, writing your own Script Transformation to custom-parse the rows.

The setup is simple: configure the Flat File Source to import the whole record as a single field. Pipe that output to your script component, and then connect that to the destination database or rest of your process. Here’s what my Data Flow Task looks like:

Data Task Overview

Data Flow Task Overview

And here are the settings I used for my Flat File Connection:

  • General
    • Text qualifier: <none>
    • Header row delimiter: {CR}{LF}
  • Columns
    • Row delimiter: {CR}{LF}
    • Column delimiter: [blank]
  • Advanced
    • Name your column. My example user the unimaginative name, “line”.
    • DataType: string [DT_STR]
    • OutputColumnWidth: 8000

And the steps to get started:

  1. Go ahead and add a Flat File Source to your Data Flow Task and configure it to use this Flat File Connection
  2. Now, place a Script Component on your Data Flow and select “Transformation” when prompted
  3. Drag the green output arrow from Flat File Source to Script Component
  4. Edit the Script Component and switch over to the “Inputs and Outputs” page
  5. Rename the output to “Match Rows”, or anything else of your choosing.
  6. Set ExclusionGroup = 1
  7. Add the correct output columns, with necessary names. If you set the types to string [DT_STR], you can convert them in a data convesion transformation later. Or, if you prefer to set the correct types now, you’ll need to case them specifically in the script component. Add an extra column for the line number with type of “four-byte signed integer [DT_I4]“. I called mine MatchLineNum.
  8. Add another output and call it “Error Rows” or anything else of your choosing.
  9. Set ExclusionGroup = 1
  10. Set “Synchronous InputID” to your input. In my case, it is named “Input 0″
  11. Add only two columns, ErrorLine (string [DT_STR] 8000) and ErrorLineNum (four-byte signed integer [DT_I4]).

Now, a quick explanation on what all this was for: Because we are only importing the whole row, we want to configure the Script Component to output the individual columns. We did this part when we completed steps 5-7. Next, we configured an additional output for rows that we are not able to successfully parse. This is why we defined the Error Rows output in steps 8-10. Be aware, that if you’re supremely confident that you can parse every row, you could optionally skip the creation of the Error Rows output, but I would advise against it.

Now, we will create the actual script to parse the rows manually.

  1. Switch back to the Script page and click the “Edit Script…” button. This will open up a script editor that looks like a stripped-down Visual Studio.
  2. Take note of the code below, and I will explain it below:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
/* Microsoft SQL Server Integration Services Script Component
*  Write scripts using Microsoft Visual C# 2008.
*  ScriptMain is the entry point class of the script.*/
 
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
 
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    //Declare LineNum to keep track of the number of rows
    int LineNum;
 
    public override void PreExecute()
    {
        base.PreExecute();
 
        //Initalize LineNum
        LineNum = 0;
    }
 
    public override void PostExecute()
    {
        base.PostExecute();
    }
 
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        //Increment LineNum  to keep track of the number of rows
        LineNum += 1;
 
        //split the input column. This will be highly dependent on the format and will probably need adjustment
        string[] columns = System.Text.RegularExpressions.Regex.Split(Row.line, "(?<="),(?=")");
        //If the number of elements is not expected, we assume there was a problem
        if (columns.Length != 4)
        {
            Row.ErrorLine = Row.line;
            Row.ErrorLineNum = LineNum;
            Row.DirectRowToErrorRows();
        }
        //Everything looks good, let's move forward
        else
        {
            Row.MatchLineNum = LineNum;
            Row.ID = StripQualifier(columns[0], """);
            Row.Company = StripQualifier(columns[1], """);
            Row.PhoneNumber = StripQualifier(columns[2], """);
            Row.FavoriteIrrationalNumber = StripQualifier(columns[3], """);
            Row.DirectRowToMatchRows();
        }
 
    }
 
    public static string StripQualifier(string InputString, string Qualifier)
    {
        //This is a helper function only to remove surrounding text qualifiers
 
        string OutputString;
 
        if(
            InputString.Substring(0, Qualifier.Length) == Qualifier
            && InputString.Substring(InputString.Length - Qualifier.Length, Qualifier.Length) == Qualifier
          )
            OutputString = InputString.Substring(Qualifier.Length, InputString.Length - (2 * Qualifier.Length));
        else
            OutputString = InputString;
 
        return OutputString;
    }
 
}
  • Line 14: Declare LineNum for counting our position in the file.
  • Line 21: Initialize LineNum to 0. In the above script, we increment before parsing, so the first record will have a LineNum of 1. If you happen to have a header row and would prefer it to be 2 to coincide with the actual lines in your flat file, feel free to change the initalization to -1.
  • Line 31: Increment LineNum for keeping track of our position in the file
  • Line 35: This is the real nugget to the whole post. It utilizes regular expressions to correctly parse out the row, even with an embedded text qualifier. This is a  fairly simplistic implementation. If you need to tweak the parsing routine to your situation do it here!
  • Line 37: In my example I am expecting 4 fields, so I check that my string array contains four entries. If not, I assume it’s just a bad row and will inspect later.
  • Line 39-40: Set the appropriate output columns.
  • Line 41: Force the rows to the “Error Rows” output. Remember that extra output we created? It’s specifically for this situation, so we can redirect the “bad” rows to a different location.
  • Line 44: If we have 4 elements in our array, assume the best! You probably want to perform additional sanity checks here, including wrapping the whole thing in a try-catch block, but for our example, I’m keeping it simple.
  • Line 47-50: Set the appropriate output columns
  • Line 51: Redirect output to “Match Rows”
  • Line 56-71: This is simply a helper method to strip the text qualifiers off the text. Discussing it is outside the scope of this post, but I’ve tested it and it works fine.

Here is a similar script in Visual Basic .NET that will work for SQL Server 2005:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
' Microsoft SQL Server Integration Services user script component
' This is your new script component in Microsoft Visual Basic .NET
' ScriptMain is the entrypoint class for script components
 
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
 
Public Class ScriptMain
    Inherits UserComponent
 
    'Declare and initialize LineNum to keep track of the number of rows
    Dim LineNum As Int32 = 0
 
 
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        'Increment LineNum  to keep track of the number of rows
        LineNum += 1
 
        'split the input column. This will be highly dependent on the format and will probably need adjustment
        Dim columns As String() = System.Text.RegularExpressions.Regex.Split(Row.line, "(?<=""),(?="")")
 
        'If the number of elements is not expected, we assume there was a problem
        If columns.Length <> 4 Then
            Row.ErrorLine = Row.line
            Row.ErrorLineNum = LineNum
            Row.DirectRowToErrorRows()
            'Everything looks good, let's move forward
        Else
            Row.MatchLineNum = LineNum
            Row.ID = StripQualifier(columns(0), """")
            Row.Company = StripQualifier(columns(1), """")
            Row.PhoneNumber = StripQualifier(columns(2), """")
            Row.FavoriteIrrationalNumber = StripQualifier(columns(3), """")
            Row.DirectRowToMatchRows()
        End If
 
    End Sub
 
    Public Function StripQualifier(ByRef InputString As String, ByRef Qualifier As String) As String
        'This is a helper function only to remove surrounding text qualifiers
 
        Dim OutputString As String
 
        If InputString.Substring(0, Qualifier.Length) = Qualifier And InputString.Substring(InputString.Length - Qualifier.Length, Qualifier.Length) = Qualifier Then
            OutputString = InputString.Substring(Qualifier.Length, InputString.Length - (2 * Qualifier.Length))
        Else
            OutputString = InputString
        End If
 
        Return OutputString
    End Function
 
 
End Class

Be sure and build the script (Ctrl+Shift+B or use the Build menu) before closing out. If not, you may receive a validation error. After a successful build, close and click “OK” on the Script Transformation Editor dialog. You can now connect the outputs from your Script Component to any destination, be it an OLE DB Destination or another flat file. The important thing to note is that you connect the Error Rows output to analyze any potential problems. You have access to both the record number and the original line read in. This should provide you with sufficient information to start analyzing the problem, and hopefully fix the parsing routine in the Script Component to deal with it.

Note, that performing this extra Script Component severely affected the processing rate versus SSIS natively supporting the import. Again, the best solution would be to correct the source, if possible.

  • Share/Bookmark

12 Responses to Handling Embedded Text Qualifiers

Avatar

Dennis

February 3rd, 2009 at 8:38 am

How to run the C# script? Mine only takes VB.

Avatar

Taylor Gerring

February 3rd, 2009 at 10:28 am

@Dennis

I have added a VB.NET code snippet to this post as well. I hope that helps you!

Avatar

Handling Embedded Text Qualifiers in SSIS 2005 | Idea Excursion

February 3rd, 2009 at 10:32 am

[...] a quick note advising that I’ve updated my Handling Embedded Text Qualifiers post to also include a Visual Basic example, making the information also relevant to SQL Server [...]

Avatar

Dennis

February 3rd, 2009 at 1:31 pm

Oh, great. I will give the VB.NET a try.

Avatar

Dennis

February 3rd, 2009 at 3:16 pm

I’m getting “Error 30456: ErrorLine is not a member of Script Component”. Do you know why?

Avatar

Taylor Gerring

February 3rd, 2009 at 3:22 pm

Did you follow the steps outlined above? It sounds like it can’t find the column for some reason.

8. Add another output and call it “Error Rows” or anything else of your choosing.
9. Set ExclusionGroup = 1
10. Set “Synchronous InputID” to your input. In my case, it is named “Input 0”
11. Add only two columns, ErrorLine (string [DT_STR] 8000) and ErrorLineNum (four-byte signed integer [DT_I4]).

Avatar

Dennis

February 3rd, 2009 at 4:24 pm

Oh, yes. It works now. Thanks.

Avatar

Orlando

April 27th, 2009 at 9:42 pm

As described in RFC 4180 setion 2 item 7 (http://tools.ietf.org/html/rfc4180#section-2) any characters may appear between text-qualifiers…including line breaks. Your solution will suffice for rows that exist on one line, however csv files can have lines such as this:

1,”Hello, this field
is a “”real”" pain!”,”4/27/2009″

Yes, that’s one row where:

Field 1 = 1
Field 2 (represented on one line with line break escaped for readability) = Hello, this field \r\nis a “real” pain!
Field 3 = 4/27/2009

The destination table is:

create table dbo.LogInfo
(
RecordID int,
LogInfo varchar(500),
LogDateTime datetime
)

I have looked into reading the file where each line equates to a single column and parsing from there as you suggested however the embedded line break prevents me from using that method.

Any further pointers on how to import csv files using SSIS would be much appreciated. Against some long-standing personal bias I am actually considering recommending using DTS, a 10+ year old technology, to solve the issue since is does a capable job of parsing and importing csv files and SSIS does not provide an easy path to process what many would consider a most common file format.

Thanks for reading.

Avatar

Taylor Gerring

April 28th, 2009 at 12:21 pm

@Orlando

Yes, thanks for pointing out that a line break as data would not work for this solution due to parsing each line as a single record.

And you’re right insofar as using DTS. This is what infuriates many people, is that this very old tool more correctly parses CSV than SSIS, despite the longstanding issue. At this point, it’s clear this is not a priority for Microsoft, so unless someone develops a custom Data Flow Source, all we can do is wait and hope for a patch or fix in the next version of SQL Server.

Avatar

Andy Galbraith

May 4th, 2009 at 2:57 pm

The biggest problem in the file I am fighting is that everything does not have a qualifier around it:

“Clark, Bob”,”123 Main St”,1234.00,45,”Something Else”,….

…so I am having trouble writing the appropriate parsing function because I cannot parse on just comma or on quote-comma-quote!

Thanks for listening to my frustration…{-:

Avatar

Marco

June 2nd, 2009 at 12:08 pm

Andy, I found a way to solve your problem:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
        Regex rCSV = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
        Regex rQout = new Regex("\"\"");
 
        string[] fields = rCSV.Split(Row.line);
 
 
        if (fields.Length == EXPECTED_FIELDS)
        {
                // Case 1: Quoted field with "" embedded text qualifiers
                Row.FIELD1 = Format(field[0], rQout);
 
                // Case 2: Non-quoted field
                Row.FIELD2 = Convert.ToDESIRED_TYPE(field[1]);
        }
}
 
public static string Format(string input, Regex rQout)
{
        return (!String.IsNullOrEmpty(input)) ? rQout.Replace(input.Substring(1, input.Length - 2), "\"") : "";
}

——————————————————-

This has also the case where there are emtpy fields:

For example

“Hi”,123,,”my name is Marco the “”programmer”"”,”other text”

Will produce the output:

Hi
123

my name is Marco the “programmer”
other text

Avatar

SQL Lion

April 4th, 2010 at 1:13 pm

To get the workaround and Step by Step description for developing SSIS package in order to overcome the issue with SSIS while importing text files with Flat File Connection Manager and Flat File Source where the “Row Delimiter” property does not work properly for rows having NULL or empty values, follow the below link:

http://www.sqllion.com/2010/04/ssis-vs-text-file-importing-1/
Thanks,
SQL Lion

Comment Form

  • shiva ramani: good instruction. How to push data to MySQL from SQL 2005 [...]
  • Winkey: I've created a linked server for mysql successfully, and also can do Insert, Select. But when I try [...]
  • RaghuRam: Thanks a lot buddy it helped me to change the root password when I got the error trying to c [...]
  • Taylor Gerring: If you're on a 64-bit OS, did you run the 32-bit version of ODBC applet? Run: odbcad32 [...]
  • Taylor Gerring: If the issue is the guest sync'ing to the host, can't you just force time sync the host if you can't [...]


This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 United States.