SQL Server 2008 :: Retrofitting Partitioning To Existing (large) Tables
Feb 9, 2015
We have an existing BI/DW process that adds large chunks of data daily (~10M rows) to an existing table, as well as using Deletes to remove stale data. This scenario seems to beg for partitioning to support switching in/out data.
After lots of reading on this, I have figured out the mechanics of the switching, bit I still have some unknowns about the indexes needed to support this.
The table currently has several non-clustered indexes, including one on the partitioning column - let's call that column snapshotdate. Fortunately there are no FKs involved, and no constraints.
Most of the partitioning material I see focuses on creating a clustered PK to assist with switching. Not sure if this is actually necessary, but assume I create one using an Identity column (currently missing) plus snapshotdate.
For the other non-clustered, non-unique indexes, can I just add the snapshotdate to the end of the index? i.e. will that satisfy the switching requirement?
I have some huge tables (think 200+GB for a single table) which are excellent candidates for sparse columns. The tables have many columns which are defined with decimal datatypes (13,2) with a large percentage of them (over 50% in most cases- some as much as 99%) being 0.00. Since this is very expensive in terms of storage my idea is to set all the 0.00 values equal to NULL then set them as sparse. Across 100 or so identical databases, I have 5 such tables, with 20-40 columns in each table.
1.) three steps for each column in each table in each db.
Step 1: update table to allow for nulls
Step 2: update tabe set column=null where column =0.00
Step 3 update table set sparse columns
2.)
Step 1: Create entirely new table with sparse column definitions
Step 2: copy entire table, transforming 0.00 to null for affected columns via SSIS
Step 3: drop original table, rename new table to original name
Normally it is recommended to leave an empty partition on both the front and back ends of a table to avoid data movement when merging/splitting. But I have some questions based on my scenario, which is a table partitioned by a load date, so all records in a partition contain the same date, not a range of dates.
If I use range left, once I switch out the first partition it would become empty, so would there be data movement when I merge it into the next partition? The real issue though is that we will not just be removing the first partition, but "random" partitions throughout the table. Will this work?
If I use range right, when I split the last partition to create a new one it doesn't seem there would be any data movement there either. Am I missing something?
Basically I'm wondering if I should use range left or right. Most recommend using right, but then the boundary value is not the value in the partition. This could potentially result in someone deleting the wrong data if they are not careful. So is there any reason not to use left in my scenario?
I did a test of removing a partition in the middle and it worked just fine; this was using range right. I have about 6 million rows per partition. I also tested splitting at the end and it worked fine. I'll rebuild it with range left and test.
I've yet to use partitioning in a production environment, and pretty much last ran any partitioning related code a few years back when looking at certification; so I'm definitely not an expert on the matter and only loosely clued up on the concepts.
I've recently started with a new employer, and they have just implemented a new system for sms messaging. The database tables tracking the sms messages being sent are going to get big and so they have created decided to implement partitioning on some of the tables using a partition scheme on the CreatedDate column; the DBA involved in designing the partitioning has left and I'm picking this up.
There are some issues with the above that I will be addressing seperately (e.g. the clustered index should be unique as it contians the unique key, and the fillfactors are daft), but my concerns for this post are below.
1) How to define the Primary Key and enforce it's uniqueness whilst trying to ensure it's aligned with the partition in order to be able to switch out old data once an as yet undefined retention period has passed. In books online it states:- "If it is not possible for the partitioning column to be included in the unique key, you must use a DML trigger instead to enforce uniqueness. " Books online - Special Guidelines for Partitioned Indexes. However, I'm not sure what this means, nor how I create the primary key to use the partition function seeing as it doesn't have the CreatedDate in the unique key?
2) The original partition function was envisaged as the following:-
CREATE PARTITION FUNCTION [DateFunction](datetime) AS RANGE LEFT FOR VALUES (N'2014-01-01T00:00:00.000' , N'2014-04-01T00:00:00.000' , N'2014-07-01T00:00:00.000' , N'2014-10-01T00:00:00.000' , N'2015-01-01T00:00:00.000' , N'2015-04-01T00:00:00.000' , N'2015-04-02T00:00:00.000' , N'2015-04-03T00:00:00.000' , N'2015-04-04T00:00:00.000' , N'2015-04-05T00:00:00.000') GO
There is a procedure that has been created and scheduled daily that will create a new partition for each day, and then merge these together at the end of the quarter. My understanding of partitioning is that this is a bad idea, as it will result in merging several populated partitions together. Is my understanding correct? If so, I'm planning on removing the day partitions at the end of the function, and simply adding quarterly partitions, maintaining a spare empty partition at the end of the table. Would this make more sense?
we planning to create partitioning on existing tables. The partitioning is on date column, there should be one partition for each year.
Creating of new partitions should be automated, and also we dont have any plans of archiving old data, all we want is that new partition creation should be automated.
Hi folks! I'm looking for advice on partitioning a large table. In the DDL below I've changed names to protect the guilty.
My table has this schema:
CREATE TABLE [dbo].[BigTable] ( [TimeKey] [int] NOT NULL, [SegmentID] [int] NOT NULL, [MyVal] [tinyint] NOT NULL ) ON [BigTablePS1] (TimeKey) -- see below for partition scheme
-- will evaluate whether this one is needed, my thinking is yes -- based on the expected select queries. create index NCI_SegmentID on BigTable(SegmentID asc)
The TimeKey column is sort of like a unix time. It's the number of minutes since 2001/01/01, but always floored to a 5 minute boundary. so only multiples of 5 are allowed.
Now, this table will be rather big. There are about 20k possible SegmentIDs. For every TimeKey from 2008/01/01 to 2009/01/01 (12 months), I'll have on the order of 20000 rows, one for each SegmentID.
For the 12 month period, there are 365*24*60/5=105120 possible TimeKey values. So the total rowcount is over 2 billion. (20k * 105120)
Select queries are expected to be something like this:
-- fetch just one particular row... select MyVal from BigTable where TimeKey=5555 and SegmentID=234234
--fetch for a certain set of SegmentID and a particular time... select b.SegmentID ,b.MyVal from BigTable b join OtherTable t on t.SegmentID=b.SegmentID where b.TimeKey=5555 and t.SomeColumn='SomeValue'
Besides selects, also I need to be able to efficiently issue update statements against the table with new values in the MyVal column based on a range of TimeKey values (a contiguous span of a few days) and sets of about 1000 SegmentID. updates would always look like this:
update t set t.MyVal=p.MyVal from BigTable t join #myTempTable p on t.TimeKey=p.TimeKey and t.SegmentId=p.SegmentId
where #myTempTable would have order of 1000*24*60 rows in it, all with contiguous TimeKey values, and about 1000 different SegmentID values. #myTempTable also has a clustered pk on (timekey asc, SegmentId asc).
After the table is loaded, it would never get any inserts or deletes. only selects and updates.
Given the size, and the nature of the select and update queries, this table seems like a good candidate for partitioning. I'm thinking it makes sense to partition on TimeKey.
So my question is, is it stupid to create a separate partition for each day in the year long span of TimeKeys this table covers? That would mean 365 partitions in the partition function and partition scheme. Something like this:
CREATE PARTITION FUNCTION [BigTableRangePF1] (int) AS RANGE LEFT FOR VALUES ( 3680640 + 0*1440, -- 3680640 is the number of minutes between 2001/01/01 and 2008/01/01 3680640 + 1*1440, 3680640 + 2*1440, 3680640 + 3*1440, ...snip... 3680640 + 363*1440, 3680640 + 364*1440, 3680640 + 365*1440 ); GO
CREATE PARTITION SCHEME [BigTablePS1] AS PARTITION [BigTableRangePF1] TO ( [PRIMARY],[PRIMARY],[PRIMARY], ...snip... [PRIMARY],[PRIMARY],[PRIMARY] ); GO
does anyone have any experience with partitioned tables with so many partitions? Is a few hundred partitions too many? From my understanding of partitions, seems like having so many will be ok. Is it somehow worse than having hundreds of tables in a database?
Even with one partition for each day, I'll still have 24*60*20000/5 ~ 5m rows in each one.
The SQL CAT team's recommendation is to avoid partitioning dimension tables: URL.....I have inherited a dimension table that has almost 3 billion rows and is 1TB and been asked to look at partitioning and putting maintenance in place, etc.I'm not a DW expert so was wondering what are the reasons to not partition dimensions?
I have a very large table that I need to partition. Ideally the table will write to three filegroups. I have defined the Partition function and scheme as follows.
CREATE PARTITION FUNCTION vm_Visits_PM(datetime) AS RANGE RIGHT FOR VALUES ('2012-07-01', '2013-06-01') CREATE PARTITION SCHEME vm_Visits_PS AS PARTITION vm_Visits_PM TO (vm_Visits_Data_Archive2, vm_Visits_Data_Archive, vm_Visits_Data)
This should create three partitions of the vm_Visits table. I am having a few issues, the first has to do with adding a new clustered index Primary Key to the existing table. The main issue here is that the closed column is nullable (It is a datetime by the way). So running the following makes SQL Server upset:
ALTER TABLE dbo.vm_Visits ADD CONSTRAINT [PK_vm_Visits] PRIMARY KEY CLUSTERED ( VisitID ASC, Closed ) ON [vm_Visits_PS](Closed)
I need to define a primary key on the VisitId column, but I need to include the Closed column in order to partition on it.how I would move data between partitions on a monthly basis. Would I simply update the Partition function, or have to to some sort of merge, split, or switch function?
I have 6 tables which are very huge in row count and need to be partitioned for better manageability.
Little info: Every day, 300 Million records are inserted and 300 million records are deleted in below 7 tables. we maintain only 8 days worth of data in below tables which is the reason records which are older than 8 days are continuously deleted.
Master table which has [ID],[Timestamp] Table Name: Sample - 2,578,106
Child tables: Foreign key [ID] is common for all the tables. There is no timestamp column in child table. dbo.ConnectionDB - 1,147,578,048 dbo.ConnectionSS - 876,458,321 dbo.ConnectionRT - 118,133,857 dbo.ConnectionSample - 100,038,535 dbo.Command - 100,032,235
I would like to partition the above child tables based on the IDs that are inserted every 4 hours. Meaning, All IDs that are inserted in 4 hours window should be in a partition.
* SQL Server 2008 R2 * Database was created from a third party product. The product writes to the 3 tables that I need to make changes to 24/7 and downtime is not an option. All changes must be done live. * Database overall size is ~200 GB * The 3 tables I must update make up ~190 GB of that space. * Tables have no primary key or ID columns. Therefore, the data is highly fragmented. * Of the ~190 GB of space allocated for the tables, there is roughly 70 GB of actual data. * Rows of the table are not guaranteed to be unique. In fact, on one of the tables, tests were ran with a small sample of data and duplicates were very much evident.
What I'm trying to accomplish here is to get an ID column added to the 3 tables and set that ID field as the primary key. Doing so will force the data to become much less fragmented than it is currently and with purging and new inserts, eventually fragmentation will be nearly non-existent.
Problem: Making table changes on tables this large while data is constantly being added poses many risks and can cause data loss. This was tried on a smaller table than these three and the entire table was lost in the process. Restore from backup was needed to get back to most recent log backup point.
Original Solution: My original plan was to create a backup of each table and run the script below to migrate the majority of the data temporarily into the new table. I could then update the original table (which now would contain much less data) and then migrate the data back.
Original Solution Problem: The problem with the solution above is that it calls the DELETE function on the original table using the values from the temporary table. When there are duplicate rows, which have not all been inserted into the backup table yet, they will all be removed from the original table because there is nothing unique to separate them out. In my testing, I had 10,000 rows in the original table and ended up with 9,959 rows in the backup table.
Question 1: Is my approach to making these table changes reasonable? Question 2a: If so, how can I make sure I don't lose data as part of this temporary migration of the data to my backup tables? Question 2b: If not, what would be a better approach that isn't going to cause disruption to the application that INSERTs data 24/7 and won't have any risk of data loss?
Can we create the Partition on Existing Table?e.g Create table t ( col1 number(10,0), Col2 Varchar(10)) ;After the table Creation can we alter the table to partition the table.
I have a large table containing about 800 million rows with an average row length of about 1K. The columns in the table are char columns. I need to move the contents of this table into a similar table where the target columns are varchar. The original table column definitions are compatible with the target table but the reverse is not necessarily true. For example, one column is being changed from int to bigint. The table is partitioned.
So, what is the fastest way to migrate the data. I was thinking to unload each partition into a flat file and load the target table running multiple load streams? Is this a good way?
I have a scenario where I need to add a blank column to a table that is a publisher. This table contains over 100 million records. What is the best way to add the column? In the past where I had to make an update, it breaks replication because the update would take forever as jobs are continuously updating the table so replication can't catch up.
If I alter a table and add a column, would this column automatically get picked up in replication?
Let's say I have a function named MyCustomFunction and I want to ensure that it exists in the database. Let's say I have a create script for the function. I want the script to be runnable multiple times if needed. A common way to do this is to check for an object_id at the top of the function like this:
IF OBJECT_ID('MyCustomFunction') IS NULL DROP FUNCTION MyCustomFunction GO CREATE FUNCTION MyCustomFunction...
But is there a more elegant way to do this? For example, instead of dropping and recreating the function, is there a way to simply exit from the script and do nothing if the function already exists? Something like this:
IF OBJECT_ID('MyCustomFunction') IS NULL RETURN GO CREATE FUNCTION MyCustomFunction...
We have a SQL 2008 Instance existing on active/passive cluster with 2 nodes running on Windows server 2008 R2 Ent. Edn.
Now we need to install another SQL instance on this cluster. So what are the prerequisites apart from new IP address? Do we need shared disks or can the existing disks can be utilized?
I've got two databases on the same server and replicate some tables from one database to another.The replication is configured so not to drop the table if it exists, but to delete the data based on the filter if one exists. There are two tables on the subscriber that have some extra columns.
I get "field size too large" error when trying to replicate them. Is there a workaround without having to make the publisher and the subscriber tables identical by schema?
I have the default trace on a SQL Server 2008R2 instance enabled and found today that there is a gap of nearly 4 minutes in the trace during a time of the day when there most certainly is not going to be a 4 minute window of nothing.
What if anything could cause the default trace to have a gap like this? The SQL Server Instance (against my preferences) is hosted on VMware however it has its on HOST and so its resources are not being shared with any other server. The data & log files reside on different parts of the SANS. Our IT & Network admins are looking into the issue on their end but when I looked and found a near 4 minute gap in the default trace it hit me that this could be something above/outside of SQL Server.
I have a query below which filters detail field in the #TempLogins table. The details field is a text field which contains many types of text strings, some containing urls that have parts like "ResultID=5" which is what is contained in the ResultIDSearch and ResultSetIDSearch fields. The records with entries like "ResultID=5" are the ones I'm trying to filter for.
The problem I have is that the query takes way too long to run. The TempLogin table has around 200 K records and the TempSearch table has around 80 K records.
select * from #TempLogins a where exists (select 1 from #TempSearch t1 where a.detail like '%' + t1.ResultIDSearch + '%' or a.detail like '%' + t1.ResultSetIDSearch + '%')
How can I find calls which do not exist in stored procedures and functions?We have many stored procedures, sometimes a stored procedure or function which is called does not exist. Is there a query/script or something that I can identify which stored procedures do not 'work' and which procedure/ function they are calling?I am searching for stored procedures and functions which are still called, but do not exist in the current database.
I have a well-structured but also very large binary data-set that is generated by a C++ application every five minutes. The data needs to be accessed by SQL applications. Since data is generated every five minutes, performance is key, both for write and read. The data set is about 500MB.If data is written to the file system, the write performance doesn't involve SQL server. For reading it, I have a CLR to read the portions of the data that I need based on offset and length. That works and is very fast. The problem is that data is stored in the file system, so it is not self-contained within the database.
A second option that I haven't explored yet, is to write the data into a table as VARBINARY(MAX). I would read the data using SUBSTRING with appropriate offset and length. Performance of SQL write/read of binary data of this size, and whether there is a third option I haven't thought off. I'm using SQL Server 2014.
Currently we has a database of size about 300G. Because our backup system failed some time past we were left with a transaction log file which grew to about 160G. However our backups are working again and everything is working fine. My understanding is that now the transaction log file is practically empty but the capacity remains at 160G.
When you delete records the deleted transactions are going to get logged to the transaction file. My understanding is when a backup is done these transactions get discarded out of the transaction file.
could I make use of this relatively large transaction file and start deleting transactions without out actually adding to the transaction file size.
The plan is to delete records from logging tables that are not referenced to by any other table without this increasing the transaction log file.For example over a period of a few weeks we can delete a chunk of records from a table. Then after it has completed a backup we can delete another chunk of records out of this table until we have got the table down to the records that we now need.Will this work?
I need to recover some data in a table but i'm not 100% sure the right way to do this safely.
I'll need to query the two tables to compare the before and after but how do i go about restoring/attaching the backup database to SQL without causing conflicts?
If I restore, I assume this would just overwrite which is obviously the worst thing that can happen. if i attach the backup, how does this affect the current live DB? how do i make sure that it's not getting accessed and mistaken for the live DB?
We have a database. It is enabled for mirroring. We need to delete the old records. That is around 500k records from a table. But it has foreign key relation. How to do in Production servers these kind of deletes?
I want to partion a table based on the client specific id and I want these values (client id's) to be passed dynamically to the create partition function. I am not familiar with partitioning so it will be great if someone guides me (I am also reading some articles on partitioning, but it will be easier with some help)
The table I am trying to partition has like 80 million rows with four client's data as of now and will be more once we implement new clients.
I also think Partition will help, because before we load a client's data, we remove the data that is already out there (we flush previous qtr data before we insert this qtr data)
Our monitoring tool shows that our production system periodically experiencing large rate - up to 800 memory pages/sec. How to find out which particular queries, S.P., processes that initiate this?
I am trying to use table partition feature from Sql Server 2005 enterprise edition.
I have Names table with columns FNAME, LNAME and DISPLAYNAME (concatenation of FNAME and LNAME) which I partitioned across 2 drives and 4 file groups based on the below criteria.
CREATE PARTITION FUNCTION pfNameRange(varchar(200)) AS RANGE RIGHT FOR VALUES ('F', 'I', 'S');
Currently there are 5 mill rows in this partitioned tables - partitioned table has clustered index on ID (identity property) and LNAME.
I also created another table with the same data without partition on the table.
When I run the following query I get the same response time of 10secs from both tables.
Names - partitoned table with clustered index on ID and Lname NameSEARCH - with no partition and no index
select * from names where lname = 'smith' select * from namesearch where lname = 'smith'
Is it safe to assume that if the data files are on San it doesnt give any advantage of table paritioning?
How can paritioning be made effective with data files on San
I have about 45000 records in a CSV file, which I am using as HTTP request parameters to query a website and store some results in a database. This is the kind of application which runs 24/7, so database grows really quickly. Every insert fires up a trigger, which has to look for some old records based on some criteria and modify the last inserted record. My client is crazy about performance on this one and suggested to move the old records into another table, which has exactly the same structure, but would serve as a historical table only (used to generate reports, statistics, etc.), whilst the original table would store only the latest rows (so no more than 45k at a given time, whereas the historical table may grow to millions of records). Is this a good idea? Having the performance in mind and the fact that there's that trigger - it has to run as quickly as possible - I might second that idea. Is it good or bad? What do you think?
I read a similar post here, which mentioned SQL Server 2005 partitioning, I might as well try this, although I never used it before.