I have talked with a lot of storage and backup administrators over the years. While everyone has challenges that are just as unique as their data sets and organizations, there is a common theme. Users typically say, "My backups take too long," "I need faster hardware," "I take too many tapes offsite" or "Even with deduplication, I can't afford to replicate."
Constant data growth and shrinking backup windows are the one-two punch that makes it very difficult for us to successfully protect all of our data. For many organizations, backup can be the highest-bandwidth application on the network. After all, we are trying to move a copy of all of our data over the network every weekend. This touches the primary storage system the data resides on, the clients that access it, the entire network infrastructure and, of course, the backup servers and storage. No wonder troubleshooting backup performance problems can drive even a teetotaler to drink.
The traditional way to deal with backup performance issues is to deploy more, faster gear. We buy faster networks, install more connections and procure larger, faster storage systems, be they disk or tape. I know several companies whose first 10 GbE network clients were the actual backup servers. We deploy deduplication to try to shrink the footprint of backup.
As we go through all of these architecture changes, we never address the root of the problem: there is too much data to back up. If we had less data to protect, the stress on the systems would be much lower. Of course, we can't just delete a bunch of data because it would make our lives easier. So, how do we reduce the data we have to manage? We can archive it.
It is easy to forget that most of the data stored by a company is static and rarely used. After all, we use data every day. But in truth, the data we manage daily is typically only a very small percentage of the total data a company stores. Let's consider a 100 TB environment:
For this example, we will assume that every terabyte (TB) of data in production ultimately creates 25 TB of data in backup across all copies. This is based on four daily backups (each 10% the size of a full); a weekly full backup saved for four weeks; end-of-month backups saved for one year; and finally, end-of-year backups saved for seven years. So, a 100 TB data set ends up driving 2.5 petabytes (PB) of backup data. Combine the 100 TB of production data with the 2.5 PB of backups and the organization has to manage a total of 2.6 PB. That is pretty significant. If you want to complete a full backup of the 100 TB in 24 hours, you need roughly 1,215 MB/sec of nonstop backup performance.
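The arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope model using the retention policy described in the example; the exact 25X figure is a rounded assumption, and the throughput figure depends on whether you count a TB as 1,024 GB (as here) or 1,000 GB.

```python
# Back-of-the-envelope model of the backup data multiplier in the example above.
production_tb = 100

# Copies retained per TB of production data, per the stated retention policy:
dailies = 4 * 0.10      # four daily incrementals, each 10% the size of a full
weeklies = 4 * 1.0      # weekly fulls kept for four weeks
monthlies = 12 * 1.0    # end-of-month fulls kept for one year
yearlies = 7 * 1.0      # end-of-year fulls kept for seven years

# Sums to ~23.4X, which the article rounds up to a 25X assumption.
multiplier = dailies + weeklies + monthlies + yearlies

backup_tb = production_tb * 25              # 2,500 TB = 2.5 PB of backup data
total_tb = production_tb + backup_tb        # 2,600 TB = 2.6 PB under management

# Throughput for a 24-hour full backup of 100 TB (1 TB = 1,048,576 MB here);
# works out to ~1,214 MB/sec, which the article rounds to 1,215.
mb_per_sec = production_tb * 1024 * 1024 / (24 * 3600)

print(f"multiplier ~{multiplier:.1f}X, total {total_tb:.0f} TB, {mb_per_sec:.0f} MB/sec")
```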
So, let's archive 80% of the data in our example. For most organizations, it seems reasonable that 80% of their data can be archived.
The remaining 20 TB of "production" data still drives a 25X expansion in backup data, but now requires only 500 TB of backup storage and 243 MB/sec of throughput to protect in 24 hours. The 80 TB that gets archived in this example has three copies made for redundancy; one of those copies is stored off site for disaster recovery. The archive consumes an additional 240 TB of storage. All told, instead of 2.6 PB of data in the environment, the same data now consumes 760 TB.
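Extending the same back-of-the-envelope model to the archive scenario shows where the savings come from. The 80% archive ratio, the three archive copies and the 25X backup expansion are all the example's stated assumptions, not universal figures.

```python
# Sketch of the archive scenario above: 80% of the 100 TB is archived with
# three copies, while the remaining 20% still sees the 25X backup expansion.
production_tb = 100
archived_tb = production_tb * 0.80          # 80 TB moved into the archive
active_tb = production_tb - archived_tb     # 20 TB still in the backup rotation

backup_tb = active_tb * 25                  # 500 TB of backup data
archive_tb = archived_tb * 3                # 240 TB across three archive copies

total_tb = active_tb + backup_tb + archive_tb   # 760 TB, versus 2,600 TB before

# Throughput for a 24-hour full backup of the remaining 20 TB: ~243 MB/sec.
mb_per_sec = active_tb * 1024 * 1024 / (24 * 3600)

print(f"total {total_tb:.0f} TB, {mb_per_sec:.0f} MB/sec")
```

The comparison makes the point of the article concrete: archiving cuts the data under management by more than two thirds and the required backup throughput by a factor of five, without buying any faster gear.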
Instead of increasing the performance of your servers, software, storage and network to meet backup windows, a solid archive strategy lets you effectively reduce the total amount of data to be backed up.
What are your best practices for archiving?
Follow me at www.Twitter.com/3pedal