The following question was posed on the SSIS forum recently:
I have a table with 25 Million records. I want to transfer the records from the table to flat files. One flat file can contain only 1 Million records, so I need to generate 25 flat files with 1 Million records in each file. What's the best way to accomplish this task using SSIS?
On the surface this sounds fairly simple, but if you look into it in a bit more detail it becomes an interesting little problem. Firstly, I can interpret the question in one of two ways:
- Always give me 25 files with the same number of records in each file, or,
- Split the data into as many files as are required so that there are no more than 1 million rows in each file
#1 here is quite easy to accomplish so let's deal with that first. We simply have to split the data into the required number of files (which in the question was 25, but I've simplified it to 5). Here's the data-flow that does it. You will notice that I have used Konesans' Data Generator Source Adapter and 5 instances of the Trash Destination Adapter, which are invaluable tools for scenarios like these. The data generator is used here to synthesize some random string data and the trash destination is used to simulate some flat files into which the data flows.
As you can see from the diagram we use a script component to generate a surrogate key value (this technique is described here: http://www.sqlis.com/default.aspx?37). We use that value in the conditional split transform along with the modulus (%) operator to split the data in round-robin fashion. Here is the conditional split editor showing this:
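If you want to see the round-robin logic in isolation, here's a minimal Python sketch of it. The rows, the number of outputs, and the use of a plain loop index as the surrogate key are all illustrative assumptions for the sketch; in the package itself the key comes from the script component and the split is done by the conditional split expressions (RowNumber % 5 == 0, == 1, and so on):

```python
# Round-robin split: assign each row to one of NUM_OUTPUTS buckets using a
# surrogate key and the modulus operator, mirroring the conditional split
# transform's expressions.
NUM_OUTPUTS = 5  # the demo uses 5 files; the original question asked for 25

rows = [f"row_{i}" for i in range(17)]  # illustrative synthetic data

outputs = {n: [] for n in range(NUM_OUTPUTS)}
for row_number, row in enumerate(rows):  # row_number plays the surrogate-key role
    outputs[row_number % NUM_OUTPUTS].append(row)

for n, bucket in outputs.items():
    print(f"Output {n}: {len(bucket)} rows")
```

Because the key increments by one per row, the bucket sizes can never differ by more than a single row, which is exactly the "same number of records in each file" behaviour interpretation #1 asks for.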
And that's it really. Replacing the trash destination with flat file destination adapters will give you 5 files each with the same number of records in them.
So onto the slightly harder one - producing as many files as are required with a maximum number of records (let's call this number X) in each. This is harder than the previous example because the number of files that will be produced is not known at design time; hence we cannot accomplish this with a single data-flow.
The basic premise to my chosen solution was to output the top X records from the dataset into a file and flow the rest into a raw file. This raw file could then be used as the source for the same data-flow which again takes the top X records.
This concept gives rise to the notion of iterating over the dataset until there are no more records left to output - we can discover the number of records left to process using the rowcount transformation.
Here's the control-flow that achieves all this:
The first bit is simple - create a raw file with all our data in it. I again used the Data Generator for this part for demo purposes. Here's that data-flow:
We need to capture the rowcount so that we know whether to iterate over the data set or not. I store the rowcount in a variable called NumRowsLeft which we use in the For loop:
I'm using a variable called "counter" to count the number of iterations. This will be included in the names of the raw files and output csv files so that the files produced by each loop iteration can be distinguished from one another.
And now here's the clever bit - a data-flow that takes the top X records and places them into the output csv flat file whereas the rest of the records flow into a raw file to be processed in the next iteration:
You can see that again we're capturing the number of rows to be processed using the rowcount component. We are using a script component to send the top X records to a csv flat file and the rest to a raw file.
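The shape of the whole loop can be sketched in a few lines of Python. This is only an illustration of the control-flow logic, not SSIS code: the chunk size, the row data, and the in-memory list standing in for the raw file are all assumptions for the sketch, and the variable names mirror the package's "counter" and NumRowsLeft variables:

```python
# Iteratively peel off the top X rows into an output "flat file" and carry
# the remainder forward (the "raw file"), stopping when no rows are left --
# the same shape as the For loop driven by the NumRowsLeft variable.
X = 1000  # maximum rows per output file

def split_into_files(rows, x):
    """Return a list of chunks, each holding at most x rows."""
    files = []
    counter = 0                # mirrors the "counter" variable in the package
    num_rows_left = len(rows)  # mirrors NumRowsLeft, set by the rowcount step
    while num_rows_left > 0:
        top_x, rows = rows[:x], rows[x:]  # top X to the csv, rest to the raw file
        files.append(top_x)               # stands in for one output csv file
        counter += 1
        num_rows_left = len(rows)         # re-captured on each iteration
    return files

chunks = split_into_files(list(range(2500)), X)
print(len(chunks), [len(c) for c in chunks])  # → 3 [1000, 1000, 500]
```

Note that only the final file can hold fewer than X rows, and a dataset with no rows at all produces no files, because the loop condition is checked before any output is written.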
N.B. See this month's SQL Server Standard in which I have an article explaining in much greater detail how to get the Top X records from a data set.
Lastly, we use the FileSystem task to delete the raw files at the end of each iteration as they won't be needed again. Here's the resultant files on disk:
And that's all there is to it. A fairly tricky problem solved easily with the flexibility and features inherent in SSIS. I've heard some criticism of SSIS recently where people were actually yearning for the previous incumbent, DTS. Well, can you imagine trying to do what we have done here in DTS? It would be an infinitely harder task to accomplish. Here I managed to do it in about 20 minutes without having to write any SQL and only having to rely on the tasks and components that you get as standard.
Showing you how this works is a lot more useful than telling you about it, so to that end I've made the package containing everything I've explained available here: http://blogs.conchango.com/Admin/ImageGallery/blogs.conchango.com/jamie.thomson/20051204SplittingData.zip
Feel free to download it and have a play - understand how it works for yourself. And let me have feedback as well. I know I say that every time I post something and nobody ever responds but I'm gonna keep saying it anyway :)