Anyone Else Have Problems With Sort/Merge In For Each Loop?
Aug 1, 2006
Just wondering if I am an isolated case or just doing stuff other people don't. Ive created around a dozen packages for importing various types of data into a datawarehouse and some of the packages have random crashes. Over time I have been trying to tie together what the crashing ones have in common and now believe I have completely isolated it: Sort + Merge inside a dataflow in a for each loop. The more sorts and merges in the flow the more likely the problem is to occur.
This week I created 2 identical packages to import some XML... one used some gnarly complicated substrings and such to extract some data and add it as derived columns while the other used the XML correctly that produced 2 streams (one being meta data - other being detail data) and sorted + merged them. The gnarly substring package could run all night (processes about a file every 2 or 3 seconds so you can do the math) while the merged data flows crashes somewhere between 1 and 4 hrs of running - usually right around 3 hrs (also processes about 1 file every 2 seconds). The file it crashes on wil be completly random and not related to file size - I've had crashes on files with as little as 1 row of data and 1 row of metadata and as big as 1 row of metadata and 100,000+ rows of data. Another more complicated package I created has 4 merges (and thus 8 sorts) and it will crash anywhere between 5 minutes and 20 minutes of running.
The crash is usually one where the whole process just shuts down with a memory dump but occasionally I have gotten errors about out of memory or warnings about threads leaking buffers.
I am trying to do some "research" to see if others have this problem. I've spent 1.5 months writing all ETL jobs in SSIS and its gotten to the point management has asked me to explore other platforms. I do have a case open with Microsoft on the package with 4 merges - but we've spent weeks now just trying to get debugging tools to get them some useful info when the process crashes. I am running SP1 and kb91822.
Does this look correct for MERGE SYNTAX?I want to INSERT INTO the target table,new service contracts sold with the start date,from historical table by month, only if they don't exist in the target table, doing nothing if matched. . . but the source table is indexed Fiscal Period descending and I have multiple renewal versions of the same contract number in following years. . .
MERGE dbo.SERVICE_CONTRACTS_NEW AS Target USING (SELECT r.Customer_Acct, r.[Contract], r.Start_Period FROM dbo.ACTIVE CONTRACTS r ORDER BY r.FiscalPeriod ASC) AS Source ON Target.Account_Nbr = Source.Customer_Acct AND Target.[Contract] = Source.[Contract] WHEN NOT MATCHED BY TARGET THEN INSERT (Account_Nbr, [Contract], FiscalPeriod_Start) VALUES ( Source.Customer_Acct ,Source.[Contract] ,Source.Start_YYYYMM );
When select this data with an order by like: select test from table_2 order by test The result will be:
Code Snippet
test 1-1.00.00.00 1000 2000 If you sort the data by the SORT block of the SSIS the result will be:
Code Snippet
test 1000 1-1.00.00.00 2000This is annoying and dangerous, because it causes the next bug.
2/ Two datasources sorted by ORDER BY clause can give problems in a Merge Join.
If you have 2 data sources both correctly sorted by an order by in the query. When you join these 2 datasources with a Merge Join, you can lose some records in the dataflow. This happens with larger datasets than examples above.
I need to run an Insert query which pulls data from a table located on server A database AA Table AAA conditional on (or JOINED with) Table BBB in database BB sever B. In SQL 2000 it could be done as:
From Server A: sp_addlinkedserver B INSERT dbo.ResultsTable SELECT SourceTable.* FROM B.BB.dbo.BBB SourceTable INNER JOIN A.AA.dbo.AAA ConditionTable ON SourceTable.RecID = ConditionTable.RecID sp_dropserver B
In SSIS one of the possible solutions is to use a package which does the following: OPEN A + OPEN B-> SORT A + SORT B->MERGE JOIN A and B->OUTPUT RESULT
The problem with this approach is that it's extremely slow for large datafiles (50M records each)
Questions:
1) In the procedure above could the SORT step be avoided? 2) Is there another approach to run cross-servers JOIN in SSIS?
In the first image as can be seens i have 2 different data sources and then they are being joined using "Merge Inner Join". The "sort" is on BusinessEntityID column of Person table and "Sort1" is on "PersonID" of Customer table. The merge join of these 2 result in 19,119 rows.
On the other hand, if i use single data source and use a query with inner join on  tables  used in the first image (ie. 2 tables being used in 2 different data sources) as depicted in second image. Also,  since merge cannot operate without SortKey i have defined TerritoryID as sort key in the advanced editor. The number of rows i get after this is "10,274". My select query was :
SELECT P.BusinessEntityID, P.PersonType, P.Title, P.FirstName, P.MiddleName, P.LastName, P.Suffix, C.TerritoryID FROM stg.Person AS P INNER JOIN stg.Customer AS C ON C.CustomerID = P.BusinessEntityID ORDER BY C.TerritoryID;
According to me, it should have been the same as in first case i am using merge inner join and in second case i am using SELECT query with inner join. Upon drilling down i found that in the first case , my sort keys are BusinessEntityID  and PersonID, if i modify this to CustomerID  and BusinessEntityID as this is my join condition (in ithe inner join query shown above), i get the desired output. What i was wondering was, how  the sort order change the Join Condition?
I am trying to set sorting up on a DataGrid in ASP.NET 2.0. I have it working so that when you click on the column header, it sorts by that column, what I would like to do is set it up so that when you click the column header again it sorts on that field again, but in the opposite direction. I have it working using the following code in the stored procedure: CASE WHEN @SortColumn = 'Field1' AND @SortOrder = 'DESC' THEN Convert(sql_variant, FileName) end DESC, case when @SortColumn = 'Field1' AND @SortOrder = 'ASC' then Convert(sql_variant, FileName) end ASC, case WHEN @SortColumn = 'Field2' and @SortOrder = 'DESC' THEN CONVERT(sql_variant, Convert(varchar(8000), FileDesc)) end DESC, case when @SortColumn = 'Field2' and @SortOrder = 'ASC' then convert(sql_variant, convert(varchar(8000), FileDesc)) end ASC, case when @SortColumn = 'VersionNotes' and @SortOrder = 'DESC' then convert(sql_variant, convert(varchar(8000), VersionNotes)) end DESC, case when @SortColumn = 'VersionNotes' and @SortOrder = 'ASC' then convert(sql_variant, convert(varchar(8000), VersionNotes)) end ASC, case WHEN @SortColumn = 'FileDataID' and @SortOrder = 'DESC' THEN CONVERT(sql_variant, FileDataID) end DESC, case WHEN @SortColumn = 'FileDataID' and @SortOrder = 'ASC' THEN CONVERT(sql_variant, FileDataID) end ASC And I gotta tell you, that is ugly code, in my opinion. What I am trying to do is something like this: case when @SortColumn = 'Field1' then FileName end, case when @SortColumn = 'FileDataID' then FileDataID end, case when @SortColumn = 'Field2' then FileDesc when @SortColumn = 'VersionNotes' then VersionNotes end
case when @SortOrder = 'DESC' then DESC when @SortOrder = 'ASC' then ASC end and it's not working at all, i get an error saying: Incorrect syntax near the keyword 'case' when i put a comma after the end on line 5 i get: Incorrect syntax near the keyword 'DESC' What am I missing here? Thanks in advance for any help -Madrak
I have a table called Tbltimes in an access database that consists of the following fields:
empnum, empname, Tin, Tout, Thrs
what I would like to do is populate a grid view the a select statement that does the following.
display each empname and empnum in a gridview returning only unique values. this part is easy enough. in addition to these values i would also like to count up all the Thrs for each empname and display that sum in the gridview as well. Below is a little better picture of what I€™m trying to accomplish.
Tbltimes
|empnum | empname | Tin | Tout | Thrs |
| 1 | john | 2:00PM | 3:00PM |1hr |
| 1 | john | 2:00PM | 3:00PM | 1hr |
| 2 | joe | 1:00PM | 6:00PM | 5hr |
GridView1
| 1 | John | 2hrs |
| 2 | Joe | 5hrs |
im using VWD 2005 for this project and im at a loss as to how to accomplish these results. if someone could just point me in the right direction i could find some material and do the reading.
I have database on SQL Server 2000 set up with a merge publication.This publication is configured with a number of dynamic filters toreduce the amount of data sent to each client. Each client has ananonymous pull subscription. The merge process can be triggered by thewindows sync manager and my application.To improve performance I have created some helper tables to hold themapping between user login and primary keys of selected entities.For the replicated data to be correct the contents of the helper tablesneeds to be up to date.I need to fire off a stored procedure on the publisher beforereplication starts to verify that this data is up to date. I can notsee any documented way of doing this however I have been experimentingwith some unorthodox systems.Firstly has anyone any ideas?I have been considering adding a trigger to some of the tables used bythe Microsoft replication code - yes I know this is very nasty.My problems arise because executing this stored procedure will causesome data to be updated. In updating data we could create a newgeneration in the database. I must therefore run my stored procedurebefore any the Microsoft code makes any generation checks / updates.Anyone done anything similar, Anyone have any better ideas?Any comments would be gratefully received.
I'm using merge replication to maintain a backup copy of my main (publisher)MSDE database. A push subscription periodically (1 per minute) updates the backup DB. It's intended that if the main db goes down then the backup (subscription) db can be configured as a publisher. This must all be performed via scripting. The initial configuration of the main publisher and subscription is controlled via scripting, which works fine. The problems occur when I try to configure the subsciber to become a publisher. A script is executed on the subscriber but fails at the point when it's configuring the publisher detail. The error is something like "unable to configure a publication for a database setup as an anonymous subscription". I'm guessing that there are subscritpion artifacts added to the database which need to be removed before it can be configured as a new publisher.
We are using a modeling technique called Anchor Modeling in our data warehouses. You can read more about the technique itself at our homepage http://www.intellibis.se, where we have published a fact sheet and a recently held presentation (TDWI European conference). One of the features with this technique is its simple way to historize data. This is done by having a fromDate column which together with the surrogate key will yield a unique combination. On the tables that has this kind of historization we add a primary key, which in turn will create a clustered index, with the following specification (surrogateKey asc, fromDate desc). This will physically order data on the storage media according to the specificed columns and ordering. Now I move on to create a "latest view" of this table which does a subselect to find the latest version for every surrogateKey using max(fromDate). Should not the optimizer now figure out that data is ordered so that the latest version always comes first for every surrogateKey, hence any sorting would be unneccessary? If I look at the actual execution plan after running a query that uses the view there is a sort in the plan, but the cost is always 0%. Does this mean that it did not sort the data, or that it did call a sorting routine, but it actually took very little time to do the sorting? If so, is there a reason that is has to do the sorting or could it have been left out by an even smarter optimizer?
I would also like to applaud the people behind the optimizer, since it will figure out which tables are in fact necessary to query and eliminate others, even if I have left joined them into the view I am using. This speeds up performance and makes anchor modeling feasible. Unfortunately optimizers from other vendors seem to have trouble doing this...
I've been racking my brain all day and I finally decided to ask for help. I've got two tables with rows from the first that need to be sorted by the second. The problem is that the rows don't always exist in the second table. I've tried various forms of INNER, LEFT, RIGHT, OUTER, LEFT OUTER, CROSS, etc., etc., etc. and nothing (oh yeah UNION too). Every time I get close, I lose the records that don't have matches.
Something close-
SELECT A.IDDoc, B.First FROM A LEFT JOIN B ON A.IDDoc = B.IDDoc WHERE B.Dept = 'A' ORDER BY B.First
SELECT LEFT(CONVERT(CHAR(11),convert(datetime,task_date),109),3) + ' ' + RIGHT(CONVERT(CHAR(11),convert(datetime,task_date),109),4) as Date,SUM(CASE a.status_id WHEN 1000 THEN b.act_point ELSE 0 END) as Programming,SUM(CASE a.status_id WHEN 1016 THEN b.act_point ELSE 0 END) as Design,SUM(CASE a.status_id WHEN 1752 THEN b.act_point ELSE 0 END) as Upload,SUM(CASE a.status_id WHEN 1032 THEN b.act_point ELSE 0 END) as Testing,SUM(CASE a.status_id WHEN 1128 THEN b.act_point ELSE 0 END) as Meeting,SUM(CASE a.status_id WHEN 1172 THEN b.act_point ELSE 0 END) as OthersFrom task_table a,act_table b where a.status_id=b.act_id and a.user_id=(select user_id from user_table where user_name='Raghu') and a.task_date like '%/%/2006' GROUP BYLEFT(CONVERT(CHAR(11),convert(datetime,task_date),109),3) + ' ' + RIGHT(CONVERT(CHAR(11),convert(datetime,task_date),109),4)Output :Aug 2006 294 0 0 80 0 0 Jan 2006 14 0 0 0 0 0 Oct 2006 336 0 0 0 0 0 Sep 2006 3262 20 24 8 16 0 How to sort the date in ascending Order ?Jan 2006Aug 2006Sep 2006Oct 2006
I have: 4 tables and 1 table variable. CCenters (ID, Name) Campaigns (ID, Name) Rel (ID, CCenterID, CampaignID) - [many to many] and @SCampaigns (ID, CampaignID) - represents the selected campaigns by the user
performing the commands below I would get the centers associated with the campaigns selected.SELECT CCenterID FROM Rel INNER JOIN @Campaigns ON @SCampaigns.CampaignID = Rel.CampaignID But what I really want are the common centers to the selected campaigns. Thanks
I am trying to select a record from a table where it has the smallest priority how would you go about doing this is there a cool sort command or is there a select command syntax that can do this thanks
I've made this example and it loads a picture into a database. (MsSql )Take a look at the code, it works just fine however it leaves a process in sleeping mode "avaiting command" in Enterprise manager under "Management/current Activity/Process Info"Is it supposed to be like this or is it supposed to be reemoved after .net is finished??Code snip_______________________________________________________ Dim conn As New SqlConnection("Data Source = (local);Initial Catalog = " & "test;User ID = NAME; Password=PASSWORD;") Dim cmd As New SqlCommand("Select * from tab_bild", cnn) Try conn.Open() Dim myDatareader As SqlDataReader myDatareader = cmd.ExecuteReader(CommandBehavior.CloseConnection) Do While (myDatareader.Read()) Response.ContentType = myDatareader.Item("PersonImageType") Response.BinaryWrite(myDatareader.Item("PersonImage")) Loop conn.Close() Response.Write("Picture info succesfully retrieved") Catch SQLexc As SqlException Response.Write("Read failed, Reason: " & SQLexc.ToString()) End Try End Sub________________________________________________________________Please can someone explain this for me or sort this out for me.All help is welcome even if its only points me too a direction.RegardsTombola
I have a table that most of the data has the same value, but there are only a few that do not match that value. I want to populate a listbox with all values from the table, but I'd like to have the majority listed first, followed by the others (the few that don't matach). What's the best way to approach this with SQL?
I'm trying to setup a duplicate of an old SQL Server 4.2 server to put in place while we upgrade the server, but I can't get the sort-order right. I know the existing server uses sort order id 40, but I can't find which sort-order that corresponds to during the install process. If anyone can give me a system table that lists all the sort orders names and id's, or can tell me what the text name for sort order 40 is, I would be very grateful.
I have a table with 3 columns. Product, Location and Value. The data looks like this:
NULL NULL 100 Atlanta NULL 50 Atlanta Cookie1 30 Atlanta Cookie2 20 Dallas NULL 120 Dallas Cookie1 80 Dallas Cookie2 40
This table gets filled with a Groupby with Rollup option. The NULLS show subtotals/total. Is there a way to build a query that returns the results with NULLs at the bottom of each section like:
Atlanta Cookie1 30 Atlanta Cookie2 20 Atlanta NULL 50 Dallas Cookie1 80 Dallas Cookie2 40 Dallas NULL 120 NULL NULL 100
I need to copy the structure and data of an existing SQL 6.5 server to one with a different sort order. Normally, I would use the transfer tool to accomplish this, but the servers are on different networks. My question is, is BCP the answer? In other words, will the data copied via BCP from the sending server be able to be copied on the recieiving server. Also, is there a way to automatically generate the BCP statements for all tables? What I would really like is to be able to get at the scripts and data files created by the transfer tool.