The Importance of Archiving SQL Server Data: Balancing Performance and Retention Needs
Introduction
As a SQL Server DBA, one of the biggest challenges I’ve faced is dealing with rapidly growing history tables. I’ve seen them balloon from a few gigabytes to terabytes in size, causing all sorts of performance headaches. But at the same time, the application teams are often hesitant to archive old data due to potential requests from other teams who may need it. In this article, I’ll share my experiences and advice on how to strike the right balance between optimizing database performance and meeting data retention requirements.
Why Archiving Data is Crucial
When history tables grow unchecked, it can lead to some major problems:
- Slower queries — As tables get bigger, queries take longer to scan through all that data. This can cause unacceptable delays for end users.
- Increased storage costs — Terabytes of data means you need a lot more disk space, which gets expensive fast.
- Harder maintenance — Key maintenance tasks like index rebuilds, statistics updates, and backups all take much longer on huge tables.
- Compliance issues — Some data may need to be retained for a certain time period for regulatory or legal reasons. Archiving lets you keep that data accessible without impacting production performance.
I once worked on a system where a key table had grown to over 5 billion rows. Queries that used to take seconds were timing out after 10 minutes. It was a nightmare! We had to take the application offline for hours to move old data out and rebuild the indexes. Lesson learned — don’t let it get to that point.
Strategies for Smart Archiving
So what’s the solution? The key is to implement an archiving strategy early, before the problem gets out of hand. Here are some best practices I recommend:
- Define clear data retention policies with the business. Know exactly what data needs to be kept and for how long.
- Archive data to separate history tables on a regular schedule, like monthly or quarterly. This keeps the active tables manageable.
- Consider moving archived data to cheaper, slower storage since it’s accessed less frequently. Options include table/index partitioning, filegroups on slower disks, or even Azure SQL Database elastic pools.
- Make sure your archiving process is fully automated using scripts or tools (an Agent job sketch follows the archiving script below). You don’t want to be doing this manually!
- Have a plan for providing archived data to other teams in a reasonable timeframe. Could you restore a backup to a separate server for them to query? (A restore sketch follows this list.)
- For super old data, consider aggregating it before archiving so you still have the important summary info without all the details. For example, sales data could be aggregated from individual transactions to daily or monthly totals.
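For the restore option, the T-SQL is straightforward. A minimal sketch, assuming hypothetical database, path, and logical file names that you would replace with your own:
-- Restore a recent full backup under a new name on a separate reporting server
RESTORE DATABASE SalesArchive
FROM DISK = 'E:\Backups\Sales_Full.bak'
WITH MOVE 'Sales' TO 'D:\Data\SalesArchive.mdf',
     MOVE 'Sales_log' TO 'L:\Log\SalesArchive_log.ldf',
     RECOVERY;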
Here’s an example of how you could implement a basic archiving process using T-SQL:
First, let’s say we have a Sales table that keeps growing over time:
CREATE TABLE Sales (
    SalesID int IDENTITY(1,1) PRIMARY KEY,
    CustomerID int NOT NULL,
    SalesDate datetime NOT NULL,
    Amount decimal(10,2) NOT NULL
);
We can create a separate SalesHistory table with the same structure to hold the archived data:
CREATE TABLE SalesHistory (
    SalesID int NOT NULL,
    CustomerID int NOT NULL,
    SalesDate datetime NOT NULL,
    Amount decimal(10,2) NOT NULL
);
Then, we can use a T-SQL script to periodically move old data from Sales to SalesHistory. For example, to archive sales older than 1 year:
-- Use one cutoff value so the INSERT and DELETE move exactly the same rows
DECLARE @Cutoff datetime = DATEADD(year, -1, GETDATE());
BEGIN TRANSACTION;
INSERT INTO SalesHistory
SELECT * FROM Sales
WHERE SalesDate < @Cutoff;
DELETE FROM Sales
WHERE SalesDate < @Cutoff;
COMMIT TRANSACTION;
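To automate this on a schedule, a SQL Server Agent job is the usual tool. Here’s a minimal sketch, assuming the script above has been wrapped in a stored procedure I’m calling dbo.usp_ArchiveSales in a database I’m calling SalesDB (both hypothetical names):
-- Create the job, a single T-SQL step, and a monthly 2 AM schedule
EXEC msdb.dbo.sp_add_job @job_name = N'Archive Sales';
EXEC msdb.dbo.sp_add_jobstep
    @job_name = N'Archive Sales',
    @step_name = N'Move old rows to SalesHistory',
    @subsystem = N'TSQL',
    @database_name = N'SalesDB',              -- assumed database name
    @command = N'EXEC dbo.usp_ArchiveSales;'; -- assumed wrapper procedure
EXEC msdb.dbo.sp_add_jobschedule
    @job_name = N'Archive Sales',
    @name = N'Monthly archive',
    @freq_type = 16,             -- monthly
    @freq_interval = 1,          -- on day 1 of the month
    @freq_recurrence_factor = 1, -- every month
    @active_start_time = 020000; -- 2:00 AM
EXEC msdb.dbo.sp_add_jobserver @job_name = N'Archive Sales';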
If you need to aggregate old data before archiving to save space, you could use queries like:
-- Assumes a SalesHistoryAggregated table with matching (SalesDate, TotalAmount) columns
INSERT INTO SalesHistoryAggregated (SalesDate, TotalAmount)
SELECT
    CAST(SalesDate AS date) AS SalesDate,
    SUM(Amount) AS TotalAmount
FROM SalesHistory
GROUP BY CAST(SalesDate AS date);
TRUNCATE TABLE SalesHistory;
This aggregates the individual sales transactions into daily totals before archiving.
Performance Impact
When you run the INSERT INTO ... SELECT query to move old data from the Sales table to the SalesHistory table, there are several potential performance impacts to consider:
1. Transaction Log Growth:
- The INSERT operation will generate transaction log records for each row inserted into the SalesHistory table.
- If the Sales table contains a large number of rows matching the WHERE clause condition, it can result in significant transaction log growth.
- Ensure that your transaction log has sufficient space to accommodate the log records generated by the INSERT operation.
2. Locking and Blocking:
- During the execution of the INSERT query, locks will be acquired on the Sales and SalesHistory tables.
- If there are concurrent transactions accessing the Sales table, they may be blocked by the INSERT query until it completes (a quick way to spot this is shown after this list).
- This can lead to increased waiting times and reduced concurrency for other transactions.
3. I/O Operations:
- The INSERT query will perform I/O operations to read data from the Sales table and write it to the SalesHistory table.
- The amount of I/O generated depends on the number of rows being inserted and the size of the data.
- If the Sales table is large and the WHERE clause matches a significant portion of the table, it can result in substantial I/O activity, impacting overall system performance.
4. Indexes on the Target Table:
- If the SalesHistory table has indexes defined on it, the INSERT operation will need to update those indexes for each inserted row.
- The more indexes present on the SalesHistory table, the more overhead there will be for index maintenance during the INSERT operation.
- This can lead to increased CPU and I/O usage.
5. Transaction Duration:
- The INSERT query will run within a transaction, and the duration of the transaction depends on the number of rows being inserted.
- If the transaction takes a long time to complete, it can hold locks on the Sales and SalesHistory tables for an extended period, affecting other concurrent transactions.
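While the archive statement is running, you can check whether it is blocking other sessions with a quick DMV query (a minimal sketch using standard DMVs):
-- List sessions that are currently blocked, who is blocking them, and what they are running
SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time AS wait_time_ms,
    t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;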
To mitigate these performance impacts, consider the following strategies:
- Break the archiving process into smaller batches to reduce transaction duration and lock contention (see the batching sketch after this list).
- Perform the archiving during off-peak hours or maintenance windows to minimize impact on concurrent transactions.
- Ensure that the necessary indexes are in place on the Sales table to optimize the WHERE clause filtering.
- Consider using partition switching (covered later in this article) for faster data movement if the tables are partitioned.
- Monitor the transaction log growth and ensure sufficient log space is available.
- Test the archiving process in a non-production environment to assess its performance impact and optimize accordingly.
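Here’s what the batching approach might look like against the Sales and SalesHistory tables from earlier. A sketch, with an arbitrary 10,000-row batch size you would tune for your own workload:
DECLARE @Cutoff datetime = DATEADD(year, -1, GETDATE());
DECLARE @BatchSize int = 10000; -- tune per workload
DECLARE @RowsMoved int = 1;

WHILE @RowsMoved > 0
BEGIN
    -- DELETE ... OUTPUT INTO moves each batch atomically: the deleted rows
    -- are captured and inserted into the history table in a single statement
    DELETE TOP (@BatchSize) FROM Sales
    OUTPUT deleted.SalesID, deleted.CustomerID, deleted.SalesDate, deleted.Amount
        INTO SalesHistory (SalesID, CustomerID, SalesDate, Amount)
    WHERE SalesDate < @Cutoff;

    SET @RowsMoved = @@ROWCOUNT;
END
Each iteration is its own short transaction, so locks are released between batches and the log can be reused or backed up as the loop runs.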
Impact on tempdb Space
When you execute the INSERT INTO ... SELECT query to move data from the Sales table to the SalesHistory table, SQL Server may utilize the tempdb system database for various operations. The tempdb database is used for temporary storage and is heavily utilized by SQL Server for tasks such as temporary tables, work tables, sort operations, and row versioning.
Here are a few scenarios where the archiving process can impact tempdb space:
1. Sort Operations:
- If the INSERT INTO ... SELECT query involves a GROUP BY or ORDER BY clause, SQL Server may need to perform sort operations.
- These sort operations often utilize tempdb to store intermediate results.
- If the amount of data being sorted is large, it can lead to significant tempdb space usage.
2. Temporary Tables:
- If the archiving process involves complex transformations or staging of data using temporary tables, those temporary tables will be created in tempdb.
- The size of the temporary tables and the amount of data processed can impact tempdb space utilization.
3. Row Versioning:
- If the database has snapshot isolation or read committed snapshot isolation enabled, row versioning will be used for concurrency control.
- When rows are modified or deleted during the archiving process, SQL Server generates row versions in tempdb to maintain the consistency of concurrent transactions.
- If there are long-running transactions or a high number of concurrent transactions accessing the Sales table during the archiving process, it can lead to increased tempdb space usage.
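If you suspect row versioning is the main consumer, you can see how much tempdb space the version store holds per database (sys.dm_tran_version_store_space_usage is available from SQL Server 2017 onward):
-- Version store space consumed in tempdb, largest consumers first
SELECT
    DB_NAME(database_id) AS DatabaseName,
    reserved_space_kb / 1024 AS VersionStoreMB
FROM sys.dm_tran_version_store_space_usage
ORDER BY reserved_space_kb DESC;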
To mitigate the impact on tempdb space, consider the following strategies:
1. Monitor tempdb Space Usage:
- Regularly monitor the space utilization of tempdb during the archiving process.
- Use SQL Server Dynamic Management Views (DMVs) such as sys.dm_db_file_space_usage to track the space consumed by tempdb.
- Ensure that tempdb has sufficient disk space to accommodate the temporary storage requirements.
2. Optimize Query Performance:
- Optimize the INSERT INTO ... SELECT query to minimize the need for sort operations or temporary tables.
- Use appropriate indexes on the Sales table to efficiently filter the rows to be archived.
- Consider breaking the archiving process into smaller batches to reduce the amount of data processed at once, thereby reducing the temporary storage requirements.
3. Adjust tempdb Configuration:
- Ensure that tempdb is configured with sufficient size and growth settings to handle the expected space requirements during the archiving process (a configuration sketch follows this list).
- Consider using multiple tempdb data files to distribute the I/O load and improve performance.
- Place tempdb on a separate, fast storage device to minimize contention with other database files.
4. Schedule Archiving During Off-Peak Hours:
- If possible, schedule the archiving process during off-peak hours when the system has lower concurrent activity.
- This can help reduce the impact on tempdb space and minimize conflicts with other transactions.
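For point 3, the configuration changes look something like this. A hedged sketch: tempdev is the default logical name of the primary tempdb data file, but the sizes, growth increments, and file path are illustrative assumptions:
-- Pre-size the primary data file and use a fixed growth increment
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 8192MB, FILEGROWTH = 512MB);

-- Add a second data file to spread allocation contention (path is an assumption)
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb\tempdev2.ndf', SIZE = 8192MB, FILEGROWTH = 512MB);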
By monitoring and managing tempdb space usage, optimizing query performance, and properly configuring tempdb, you can mitigate the impact of the archiving process on tempdb and ensure smooth operation of your SQL Server environment.
To check the database log size with free space and the tempdb size with free space, you can use the following T-SQL queries:
-- Current database: transaction log size and used space
SELECT
    DB_NAME() AS DatabaseName,
    CAST(total_log_size_in_bytes / 1024.0 / 1024.0 AS DECIMAL(10,2)) AS TotalLogSizeMB,
    CAST(used_log_space_in_bytes / 1024.0 / 1024.0 AS DECIMAL(10,2)) AS UsedLogSpaceMB,
    CAST(used_log_space_in_percent AS DECIMAL(5,2)) AS UsedLogSpacePercent
FROM sys.dm_db_log_space_usage;

-- tempdb: file sizes by type (size is in 8 KB pages; type 0 = data, 1 = log)
SELECT
    'tempdb' AS DatabaseName,
    CAST(SUM(size) * 8.0 / 1024 AS DECIMAL(10,2)) AS TotalSizeMB,
    CAST(SUM(CASE WHEN type = 0 THEN size ELSE 0 END) * 8.0 / 1024 AS DECIMAL(10,2)) AS DataSizeMB,
    CAST(SUM(CASE WHEN type = 1 THEN size ELSE 0 END) * 8.0 / 1024 AS DECIMAL(10,2)) AS LogSizeMB
FROM tempdb.sys.database_files;

-- tempdb: free space in the data files
SELECT
    CAST(SUM(unallocated_extent_page_count) * 8.0 / 1024 AS DECIMAL(10,2)) AS FreeDataSpaceMB
FROM tempdb.sys.dm_db_file_space_usage;
Table Partitioning
For even better performance on large tables, we could use SQL Server’s partitioning feature.
Let’s say we partition the Sales table by SalesDate:
CREATE PARTITION FUNCTION pf_SalesDate (datetime)
AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01');

CREATE PARTITION SCHEME ps_SalesDate
AS PARTITION pf_SalesDate ALL TO ([PRIMARY]);

CREATE TABLE Sales (
    SalesID int IDENTITY(1,1),
    CustomerID int NOT NULL,
    SalesDate datetime NOT NULL,
    Amount decimal(10,2) NOT NULL
) ON ps_SalesDate(SalesDate);
Now, archiving data is as simple as switching out the old partition:
ALTER TABLE Sales SWITCH PARTITION 1 TO SalesHistory;
This is much faster than moving rows individually.
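For that one-line switch to work, the target has to be prepared: it must match the Sales schema exactly, sit on the same filegroup as the partition being switched out (PRIMARY here), and be empty before each switch. A sketch, assuming SalesHistory has been dropped and recreated to match the partitioned table:
-- Target table for the switched-out partition
CREATE TABLE SalesHistory (
    SalesID int IDENTITY(1,1),
    CustomerID int NOT NULL,
    SalesDate datetime NOT NULL,
    Amount decimal(10,2) NOT NULL
) ON [PRIMARY];

-- After the switch, drop the now-empty boundary so partition
-- numbers stay aligned with the remaining data
ALTER PARTITION FUNCTION pf_SalesDate()
MERGE RANGE ('2022-01-01');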
Partition Performance Considerations
When you run the ALTER TABLE ... SWITCH PARTITION statement to switch a partition from the Sales table to the SalesHistory table, there are several performance considerations:
1. Metadata Operation:
- The ALTER TABLE ... SWITCH PARTITION statement is a metadata-only operation, meaning it does not physically move data rows.
- Instead, it updates the metadata of the tables involved to reflect the change in partition ownership.
- As a result, the actual execution of the ALTER TABLE ... SWITCH PARTITION statement is generally fast and has minimal impact on performance.
2. Locking and Blocking:
- During the execution of the ALTER TABLE ... SWITCH PARTITION statement, exclusive locks are acquired on the source and target tables.
- These locks prevent other transactions from accessing or modifying the tables while the partition switch is in progress.
- If there are concurrent transactions attempting to access the Sales or SalesHistory tables, they may be blocked until the partition switch operation completes.
- The duration of the locks depends on the time it takes to update the metadata, which is usually brief.
3. Partition Alignment:
- For the ALTER TABLE ... SWITCH PARTITION statement to succeed, the source partition and the target must be aligned: a partitioned target needs the same partition function, and a non-partitioned target (like our SalesHistory table) must sit on the same filegroup as the partition being switched out.
- If the tables are not properly aligned, the partition switch operation will fail.
- Ensuring proper partition alignment is crucial for the performance and success of the partition switch.
4. Indexing and Constraints:
- The source and target tables must have the same indexes and constraints for the partition switch to be allowed.
- If the indexes or constraints are different between the tables, the partition switch will fail.
- Maintaining identical indexes and constraints across the tables ensures seamless partition switching and avoids performance issues.
5. Partition Function and Scheme Design:
- The performance of the partition switch operation is influenced by the design of the partition function and partition scheme.
- A well-designed partition function that evenly distributes data across partitions can facilitate efficient partition switching.
- Properly aligning the partition boundaries with the data distribution and query patterns can optimize the performance of partition-related operations (a sliding-window sketch follows this list).
6. Storage and I/O:
- Although the ALTER TABLE ... SWITCH PARTITION statement itself is a metadata operation, it's important to consider the storage and I/O implications.
- Switching a partition from one table to another may affect the physical storage layout and data distribution on disk.
- If the switched partition is large or if there are frequent partition switch operations, it can impact I/O performance and storage utilization.
- Proper storage planning, such as using appropriate filegroups and storage configurations, can help mitigate any potential performance impacts.
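On point 5, the common design is a sliding window: before each new period arrives you split a new boundary at the leading edge, and after switching out the oldest partition you merge its trailing boundary (as in the earlier MERGE RANGE example). The dates here are illustrative:
-- Tell the scheme which filegroup the next new partition should use
ALTER PARTITION SCHEME ps_SalesDate NEXT USED [PRIMARY];

-- Add a boundary for next year's data before it arrives
ALTER PARTITION FUNCTION pf_SalesDate()
SPLIT RANGE ('2024-01-01');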
It’s important to note that the actual performance impact of the ALTER TABLE ... SWITCH PARTITION statement may vary depending on factors such as the size of the tables, the number of partitions, the concurrent workload, and the overall system configuration.
Conclusion
Archiving SQL Server data is all about striking the right balance. You need to keep your active tables and indexes lean and fast, while still meeting the business requirements for data retention and accessibility. It takes some planning and effort to set up a good archiving strategy, but it’s so worth it to avoid those 2am emergency downtimes due to an overgrown history table bringing your application to its knees. Trust me on that one!
The key is to be proactive, have clear retention policies defined upfront, automate the archival process, and work closely with your app teams and business users to make sure everyone’s needs are met. Do that and you’ll be able to enjoy the benefits of a high-performance database for years to come.