Welcome to EMC Consulting Blogs Sign in | Join | Help

SSIS Junkie

SSIS Nugget: Splitting a file into multiple files

The following question was posed on the SSIS forum recently:

I have a table with 25 Million records. I want to transfer the records from the table to flat files. One flat file can contain only 1 Million record. So I need to generate 25 flat files with 1 Million in each file. Whats the best way to accomplish this task using SSIS.

On the surface this sounds fairly simple but if you look into it in a bit more detail it becomes an interesting little problem. Firstly, I can inteprete the question in one of two ways:

  1. Always give me 25 files with the same number of records in each file, or,
  2. Split the data into as many files as are required so that there are no more than 1 million rows in each file

#1 here is quite easy to accomplish so lets deal with that first. We simply have to split the data into the required number of files (which in the question was 25, but I've simpified it to 5). Here's the data-flow that does it. You will notice that I have used Konesans' Data Generator Source Adapter and 5 instances of the Trash Destination Adapter which are invaluable tools for scenarios like these. The data generator is used here to synthesize some random string data and the trash destination is used to simulate some flat files into which the data flows.

20051204DF1.JPG

As you can see from the diagram we use a script component to generate a surrogate key value (this technique is described here: http://www.sqlis.com/default.aspx?37). We use that value in the conditional spilt transform along with the modulus (%) operator to split the data in a round-robin technique. Here is the conditional split editor showing this:

20051204CondSplit.JPG

And that's it really. Replacing the trash destination with flat file destination adapters will give you 5 files each with the same number of records in them.

 

So onto the slightly harder one - producing as many files as are required with a maximum number of records (lets call this number X) in each. This is harder than the previous example because the number of files that will be produced is not known at design-time hence we cannot accomplish this with a single data-flow.

The basic premise to my chosen solution was to output the top X records from the dataset into a file and flow the rest into a raw file. This raw file could then be used as the source for the same data-flow which again takes the top X records.

This concept gives rise to the notion of iterating over the dataset until there are no more records left to output - we can discover the number of records left to process using the rowcount transformation.

Here's the control-flow that achieves all this:

20051204CF2.JPG

The first bit is simple - create a raw file with all our data in it. I again used the the Data Generator for this part for demo purposes. Here's that data-flow:

20051204LoadRawFile.JPG

We need to capture the rowcount so that we know whether to iterate over the data set or not. I store the rowcount in a variable called NumRowsLeft which we use in the For loop:

20051204ForLoop.JPG

I'm using a variable called "counter" to count the number of iterations. This will be included in the names of the raw files and output csv files to distinguish them between each loop iteration.

And now here's the clever bit - a data-flow that takes the top X records and places them into the output csv flat file whereas the rest of the records flow into a raw file to be processed in the next iteration:

20051204Splitdataset.JPG

You can see that again we're capturing the number of rows to be processed using the rowcount component. We are using a script component to send the top X records to a csv flat file and the rest to a raw file.

N.B. See this month's SQL Server Standard in which I have an article explaining in much greater detail how to get the Top X records from a data set.

Lastly, we use the FileSystem task to delete the raw files at the end of each iteration as they won't be needed again. Here's the resultant files on disk:

20051204Files.JPG

 

And that's all there is to it. A fairly tricky problem solved easily with the flexibility and features inherent in SSIS. I've heard some criticism of SSIS recently where people were actually yearning for the previous incumbent, DTS. Well, can you imagine trying to what we have done here in DTS? It would be an infinitely harder task to accomplish. Here I managed to do it in about 20 minutes without having to write any SQL and only having to rely on the tasks and components that you get as standard.

Showing you how this works is alot more useful than telling you about it so to that end I've made the package containing everything I've explained available here: http://blogs.conchango.com/Admin/ImageGallery/blogs.conchango.com/jamie.thomson/20051204SplittingData.zip

Feel free to download it and have a play - understand how it works for yourself. And let me have feedback as well. I know I say that every time I post something and nobody ever responds but I'm gonna keep saying it anyway :)

 

-Jamie

 

 

Published Sunday, December 04, 2005 9:51 PM by jamie.thomson

Comments

 

Nick Barclay said:

Yet another *very* useful nugget, Jamie - thanks.

Lack of comments could be viewed as a good thing i.e. we're all too pre-occupied with running / pulling apart the sample .dtsx to leave any.

Really like the use of Expressions to dynamically set the DestinationRawFileLocation, OutputFileName and SourceRawFileLocation variables. I may well have to 'borrow' that technique :)

Cheers,
Nick
December 5, 2005 5:46 AM
 

jamie.thomson said:

Nick,
Yeah, that was an "invisible" tip I was trying to put in there. It really helps for debugging purposes because you can examine a variable in a watch window.

-Jamie
December 5, 2005 9:18 AM
 

Jamie Thomson - Life, the universe and SSIS! said:

I have extolled the virtues of the expression language in SSIS more times than I care to remember and...
December 5, 2005 8:46 PM
 

Jamie Thomson - Life, the universe and SSIS! said:

I have extolled the virtues of the expression language in SSIS more times than I care to remember and...
December 6, 2005 11:18 AM
 

Professional Association for SQL Server (PASS) SIG said:

December 7, 2005 5:19 PM
 

Jamie Thomson - Life, the universe and SSIS! said:

I have extolled the virtues of the expression language in SSIS more times than I care to remember and...
January 11, 2006 12:53 PM
 

Thomson Fred said:

Hello

I've been using a combination of SQL Scripts and SQL 2000 DTS packages for  

data transfer work for the past 18 months.

I would like to start using SSIS instead but am not really sure where to  

start. I have managed to do a simple import in the SSIS environment but I  

would really like to do is........

Develop a package which can take a configuration file as input which  

stipulates sets of transfers for the overall package. The configuration file  

would be used to define the following for each transfer in the package:-

A Data source (e.g. CSV files, SQL DB source with a SQL query)

Different SQL Server Destinations Tables

Column mapping between the source and destination

Defaults for columns

Conversion rules for columns e.g. if columnA value = '45' set ColumnA = '105'

(The config file needs to be something like an XML file or an excel sheet  

which I can take out of the office to configure)

February 15, 2007 10:36 AM
 

Tools - SQL Server said:

Until the Office 2003 version in Excel application there is no feature to store the data more than 65536

March 13, 2007 4:09 AM
 

SQL Server tools said:

Until the Office 2003 version in Excel application there is no feature to store the data more than 65536

March 17, 2007 4:42 PM
 

Eric Bouchard said:

I downloaded your example and I can't figure something.

In the second example, as I understand, every iteration is incremented by one (counter) in the For Loop. So DestinationRawFileLocation is next to SourceRawFileLocation.

Where do you use the counter variable for the filenames?  Where (and when) is DestinationRawFile is becoming SourceRawFile?

Thanks

May 4, 2007 7:46 PM
 

jamie.thomson said:

Eric,

I can't actually remember (and I don't have time to look) but I'm guessing that there are property expressions on the ConnectionString properties of the connection managers. They are very hidden away and I happen to think that is a bad thing in the UI at the moment.

-Jamie

May 4, 2007 7:57 PM
 

Eric Bouchard said:

I finally found!

There were expressions assigned to variables (OutputFileName, DestinationRawFileLocation, SourceRawLocation).

You have to select a variable and look at Property window.

May 4, 2007 8:59 PM
 

Mike Brooks said:

I'm having issues getting the Raw File Source working property.  The file does not exist on my laptop so I'm getting "The System cannot find the file specified" error when I attempt to configure the component.  If I create the file I get an error when configuring the component stating that the file is invalid.  What is the trick to using the Raw File Source?  I have set DelayValidation to "True" on the Data Flow.  What else do I need to do to get this working?

Thanks,

Mike

October 5, 2007 3:29 PM
 

jamie.thomson said:

Mike,

When you "create the file" as you put it, how exactly are you creating it?

-Jamie

October 5, 2007 3:45 PM
 

Mike Brooks said:

Sorry, I'm creating the file from Notepad.  

October 5, 2007 4:07 PM
 

jamie.thomson said:

That won't work. The designer has to interrogate the file in order to discover the metadata it needs to build the output for the Raw File Source component. Hence, the file needs to be created by a Raw File Destination component.

-Jamie

October 5, 2007 4:14 PM
New Comments to this post are disabled

This Blog

Syndication

Powered by Community Server (Personal Edition), by Telligent Systems