EMC Consulting Blogs

SSIS Junkie

SSIS: Deploy to file system or SQL Server

A decision that all SSIS architects need to make at some point is whether to deploy their packages to SQL Server or the File System.

Kirk has a useful post here that discusses some of the pros and cons of both approaches: http://www.sqljunkies.com/WebLog/knight_reign/archive/2005/05/05/13523.aspx

 

I've wrestled with this one at length myself and have made the decision to always deploy to the file system. Allow me to explain why.

The SSIS systems that I have deployed up to now have been multi-package solutions that use the Execute Package Task extensively. When you're building your packages in BIDS they all exist as files on disk, so you need a File Connection Manager to point to them. If you deploy to SQL Server, however, your packages all require an OLE DB connection manager as well. So as I see it, deploying to SQL Server creates four problems:

  1. Your packages contain an OLE DB Connection Manager that doesn't get used in your development environment
  2. Your packages contain a File Connection Manager that doesn't get used in your Test/QA/Production environments
  3. You need to tell your packages which environment they are running in
  4. You have extra development effort to ensure your Execute Package Tasks use the correct connection manager

Deploying to the file system alleviates these issues.

For these reasons I choose to always deploy to the file system. I'm not recommending it as an approach for anyone else but I'd urge you to consider these issues when making this decision.

-Jamie

UPDATE 7th Jan, 2008: It would be wrong of me not to highlight the negative impacts of using File System storage as well. Sam Loud has uncovered an issue (documented here) that explains how, under certain circumstances, many connection managers may be needed. This is because when using File System storage the location and name of the package is defined in the connection manager; when using SQL Server storage the connection manager defines the instance on which the packages reside, and then the name of the package is defined in the Execute Package Task. Hence, when using SQL Server storage you have the advantage that a single connection manager can be used for many packages.

As always, evaluate the right choice for YOUR situation. The things stated here are just things to be aware of.

 

Published Monday, February 20, 2006 4:56 PM by jamie.thomson

Comments

 

Jamie Thomson - SSIS Jibber Jabber said:

For reasons that I'll save until another post, I always deploy my packages as files rather than to SQL...
August 18, 2006 11:31 AM
 

 

Singanan said:

Good, informative

August 3, 2007 11:24 AM
 

Jim Pletscher said:

Jamie - Thanks for this info.

One point I'd like to clarify, which wasn't clear to me initially (if I understand your rationale)...

I didn't understand the need for points #3 and #4 initially, but then I discovered that if you choose to store your packages in SQL Server, then when you execute the 'Master' package in development, it will call the 'child' that is in SQL Server - NOT the version that is local. What's more, it doesn't load it into BIDS, which means you can't observe the child executing. (That has been my experience, anyway.)

If I understand you correctly, it is precisely for this reason that you state the need for a mechanism to switch between the two, and hence, the recommendation of just sticking with the file system. Is that correct?

August 6, 2007 6:23 PM
 

jamie.thomson said:

Jim,

Yes, you got it just right.

Of course, if you WANT it to call the one in SQL Server then it's perfectly satisfactory.

-Jamie

August 6, 2007 6:46 PM
 

Neil said:

If you deploy to the file system, can you change the location of the SSIS package store to a SAN-attached drive?

Reason I ask is for DR - if you accept the defaults when you deploy a package to the file system, then in a disaster recovery scenario you'll require the backup of the C: drive of said server as well as the dbs (which in our case are SAN-replicated) in order to recover your packages?!

November 8, 2007 1:25 PM
 

jamie.thomson said:

Neil,

I guess the answer is "if you can see it in Windows Explorer, you can deploy to it".

-Jamie

November 8, 2007 2:04 PM
 

Neil said:

Don't think so - if I deploy anywhere other than C:\Program Files\Microsoft SQL Server\90\DTS\Packages, then after deployment the package doesn't appear under Stored Packages\File System.

November 8, 2007 3:34 PM
 

Nadav N said:

The thing is, you can always define a SQL job that runs it (no matter where the folder is), and usually you need the packages to be scheduled anyway, so it's not that big an overhead.

Hope this helps.

November 11, 2007 3:07 PM
 

Neil said:

Agreed, but it's extra administration over simply deploying to the SQL Server - which would be as simple as restoring the msdb.

I guess it all depends on your environment.

November 21, 2007 4:17 PM
 

Sam Loud said:

"..many connection managers may be needed. This is because when using File System storage the location and name of the package is defined in the connection manager; when using SQL Server storage the connection manager defines the instance on which the packages reside, and then the name of the package is defined in the Execute Package Task. Hence, when using SQL Server storage you have the advantage that a single connection manager can be used for many packages....."

After this last issue, I went back to storing my packages in SQL Server. However, there are just as many annoying issues with doing that, so on a recent project I thought I'd have another go at storing them in the file system.

I really, really didn't want to have 100 connection managers cluttering up my Master package, so I did the following simple workaround:

Create a Connection Manager called CHILD_PACKAGE. You're gonna use this one Connection Manager to connect to all your child packages.

Create a variable called s_package_path. Fill it with the path to your packages. In my case, that is C:\SSIS\Projectname\Packages\  (of course, you'll be doing this with a config, right?)

Create a variable called s_child_package_name. Fill it with the name of the child package you want to execute. You'll most likely be doing this programmatically in your parent package. In my case that was DIM_EMPLOYEE.

Use an expression to populate the Connection String Property of the CHILD_PACKAGE Connection Manager. In my case, the expression was

@[User::s_package_path] + "\\" + @[User::s_child_package_name] + ".dtsx"

This evaluates to

C:\SSIS\Unite\Packages\DIM_EMPLOYEE.dtsx

So, you've dynamically created the connection to the child package, and you'll only really need the one connection manager.
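For comparison, the same string-building logic can be sketched in Python. This is only an illustration of the expression above: the function name and the trailing-backslash safeguard are mine, not part of Sam's original expression.

```python
# Hypothetical sketch of the connection-string expression above: build
# the child package path from a folder variable and a package name.
def child_package_connection_string(package_path: str, package_name: str) -> str:
    # Mirrors the SSIS expression:
    #   @[User::s_package_path] + "\\" + @[User::s_child_package_name] + ".dtsx"
    # rstrip() tolerates a configured path that already ends in "\".
    return package_path.rstrip("\\") + "\\" + package_name + ".dtsx"
```

For example, `child_package_connection_string(r"C:\SSIS\Unite\Packages", "DIM_EMPLOYEE")` evaluates to `C:\SSIS\Unite\Packages\DIM_EMPLOYEE.dtsx`, matching the result above.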

Make sense?

July 15, 2008 11:25 AM
 

jamie.thomson said:

Sam,

It makes complete sense, thanks for posting. That's certainly a great workaround for the majority of cases.

My only caveat would be that you need to watch out for packages executing concurrently. If you have concurrent Execute Package Tasks then there is no guarantee that this will work.

cheers

Jamie

July 15, 2008 11:34 AM
 

Sam Loud said:

^^ Yeah, it's a workaround, but it's a lame one, and I'm not very happy with it. The whole 'dynamically setting connection strings to child packages' thing has become a bit of a pain for me.

I wonder if anyone has any ideas on how I could get around this (Jamie, I'm going a bit OT here, so you may want to move this; it's only tangentially related to file system v SQL storage)

I'm in a situation where I am executing many child packages. The easy way of doing that is to have many execute package tasks, and order them with precedence constraints. I'm not doing it the easy way.

I'm querying a RUN_ORDER table, which has the child package names and an integer representing the order in which the packages are to run, i.e. 'SELECT run_order, package_name FROM RUN_ORDER ORDER BY run_order' returns

1000     DIM_Employees

10000   FACT_Employees

100000 PROCESS_OLAP_dB

DIM_Employees has a run order of 1000, so it executes before FACT_Employees, which has a run order of 10000 etc.

I return the recordset into an object, and loop through it, building the connection string to the package as I describe above.

Works great, but it's totally serial. I lose any ability to execute packages concurrently. This is driving me nuts. I can't be the only person out there dynamically deciding which child packages to run. How are other people addressing this?
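The serial loop described above can be sketched as follows. This is a hypothetical illustration, not Sam's actual package: the table rows and path are the example values from this thread, and the child packages are launched via dtexec's /File option.

```python
# Hypothetical sketch of the serial run-order loop: read the
# (run_order, package_name) rows and execute each child package
# one at a time via dtexec. Rows and path below are illustrative.
import subprocess

run_order_rows = [
    (1000, "DIM_Employees"),
    (10000, "FACT_Employees"),
    (100000, "PROCESS_OLAP_dB"),
]

def run_serially(rows, package_path=r"C:\SSIS\Packages", execute=subprocess.run):
    # Each package only starts once the previous call returns, so
    # execution is entirely serial - exactly the limitation described.
    for _, package_name in sorted(rows):
        execute(["dtexec", "/File", f"{package_path}\\{package_name}.dtsx"],
                check=True)
```

The `execute` parameter is injected purely so the loop can be exercised without a real dtexec on the machine.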

Cheers

Sam

July 21, 2008 5:36 PM
 

Brian Filppu said:

I had a similar situation and what I did was add a ThreadID column to the RUN_ORDER table. I then had 4 starting points for my main dimension package, so I ran 4 sets of packages concurrently. I still had the run_order column because within certain starting points (or threads) I needed to have some things execute serially. 4 threads was a number I picked randomly; I could obviously scale it out to 10 or 12, but 4 worked well for me. Each starting point selected the packages it needed to run, e.g. SELECT * FROM RUN_ORDER WHERE ThreadID = 1 ORDER BY run_order, then the same thing for threads 2, 3 and 4, which executed concurrently.

July 21, 2008 9:39 PM
 

jamie.thomson said:

Sam,

Yup, I was going to suggest exactly the same as Brian. Use threads. (Not "threads" in the commonly understood sense of the term, just X concurrent execution paths.)

And no, I don't mind at all that the conversation is changing. I love conversations happening on here.

-Jamie

July 22, 2008 10:15 AM
 

Sam Loud said:

Well, that's a relief, because last night, after posting that, I went down exactly that 'threads' path.

I've done it slightly differently to Brian's suggestion, but it's effectively the same thing.

My RUN_Order table has 3 fields for each thread: PackageName, PackageType and DestinationTable. You could think of a row like this:

Precedence, thread1packagename, thread1packagetype, thread1destinationtable,thread2packagename,thread2packagetype,thread2destinationtable etc (I've decided on 4 threads too)

I return the first row with a select on RUN_Order, then I use the field index 1,2,3 to populate the variables for thread1, field index 4,5,6 to populate the variables for thread 2 etc.

It's good to know that I seem to be on the same page as you guys. Thanks for your suggestions.

July 22, 2008 11:22 AM
 

Sam Loud said:

^^ Epic fail by me. An awful idea. I'm currently reworking this to use Brian's approach. I'll explain why later.

July 22, 2008 3:07 PM
