Backup, Archive, HSM – What’s the Difference Anyway?

Part One of Two

One of the interesting things I have discovered while talking with so many HPC customers is that the term “backup” is seldom used. If they aren’t doing traditional backups, you might ask, why would we, a backup solutions provider, want to talk to them? To answer that, you first need to understand the difference between backup and archive. Archive is a word you will hear more often in HPC and M&E environments, especially where data volumes reach the petabyte range or beyond and large files are accessed infrequently but need to be kept indefinitely.

In this blog, the first of a two-part series, I will provide some fundamental information that can help you differentiate backup from archive. In part two, we will pull back the covers on a process that differs from both backup and archive and resembles traditional HSM (Hierarchical Storage Management). This information should prove valuable for HPC and other data-intensive customers who may claim that they don’t do backups. Stay tuned for more on this subject later.

The differences between backup and archive:

Backup: A backup is simply a copy of data that is stored somewhere so it can be restored if the original version is compromised in some way. We evangelize the concept of backups because we know, and most customers realize, that data can be accidentally deleted or corrupted or, even worse, a natural disaster could wipe out the entire data center.

Backup is simply safeguarding the data that is in use by duplicating it. This is usually done in a rotating cycle of scheduled jobs, for example: daily incrementals kept for seven days, a weekly full kept for a month, a monthly full kept for a year and a yearly full kept for seven years. Although this process has proven effective and most backup applications on the market today handle it well, problems arise when multiple copies of the same data consume far more hardware than necessary, not to mention the associated costs of running and managing that hardware.
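To make the rotation concrete, here is a minimal Python sketch of the retention schedule described above. The names (RETENTION_DAYS, expiry_date, is_expired) are my own illustrations and do not come from any particular backup product, and I am approximating a month as 30 days and a year as 365 days.

```python
from datetime import date, timedelta

# Retention periods from the rotation described above.
# Assumption: a month is treated as 30 days and a year as 365 days.
RETENTION_DAYS = {
    "daily_incremental": 7,
    "weekly_full": 30,
    "monthly_full": 365,
    "yearly_full": 7 * 365,
}


def expiry_date(kind: str, taken_on: date) -> date:
    """Return the date on which a backup of this kind leaves the rotation."""
    return taken_on + timedelta(days=RETENTION_DAYS[kind])


def is_expired(kind: str, taken_on: date, today: date) -> bool:
    """True once the backup has been kept for its full retention period."""
    return today >= expiry_date(kind, taken_on)
```

Under these assumptions, a daily incremental taken on January 1 would be eligible for reuse on January 8, while a yearly full taken the same day stays in the rotation for seven more years.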

With backup – think business continuity

One of the key differences between backup strategies and archiving is how difficult a backup makes it to single out select files for long-term retention. Everything in the backup gets lumped into a large full backup at the end of the year, or of seven years, and called an “archive”. It may in fact be called an archive, but recovering from it works like any other backup recovery, which can be very costly and time consuming. Backup strategies are meant for business continuity and are not necessarily suited to long-term archiving.

With archive – think long-term retention

Archive: The main difference between an archive and a backup is that an archive is a single collection of records or data designated for long-term retention. When data is moved from the production environment to the archive environment, it is tagged or indexed with metadata that helps locate a particular file or chunk of data quickly through a search mechanism. This process, and the sophisticated software that performs it, makes locating a single file much more efficient than it would be in a traditional backup.

An archive generally lives in a common file system structure, and determining where a file is located is a function of the file system. The file system may place archived data on several different storage devices based on attributes such as size, type and last access time. This storage could be a combination of expensive disk, such as Fibre Channel, less expensive disk, such as SATA or SAS, and tape. The key is how the data is “structured.” In most cases the data may never be accessed again, but it must be kept for historical purposes, regulatory compliance or an unplanned event. The goal in creating an archive is to keep it separate from the backup rotation cycle. It is recommended that a copy of the archived data be made and kept in a separate location, so there are at least two copies of the final archive.
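The metadata tagging described above is what makes a single file easy to find. Here is a rough Python sketch of the idea: a small catalog that indexes archived files by their tags and answers searches from that index. The ArchiveEntry and ArchiveCatalog names, the tag keys and the file paths are all hypothetical, and real archive software will do this very differently under the covers.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ArchiveEntry:
    """One archived file plus the metadata it was tagged with (hypothetical)."""
    path: str
    size_bytes: int
    archived_on: date
    tags: dict = field(default_factory=dict)  # e.g. {"project": "climate"}


class ArchiveCatalog:
    """Toy metadata index: maps each (tag, value) pair to matching entries."""

    def __init__(self):
        self._entries = []
        self._by_tag = {}

    def add(self, entry: ArchiveEntry) -> None:
        # Index the entry under every tag it carries so searches avoid
        # scanning the whole archive.
        self._entries.append(entry)
        for item in entry.tags.items():
            self._by_tag.setdefault(item, []).append(entry)

    def find(self, **criteria) -> list:
        """Return entries whose tags match every key=value pair given."""
        if not criteria:
            return list(self._entries)
        matches = [self._by_tag.get(item, []) for item in criteria.items()]
        first, rest = matches[0], matches[1:]
        return [e for e in first if all(e in other for other in rest)]


catalog = ArchiveCatalog()
catalog.add(ArchiveEntry("/archive/sim/run_0042.h5", 5 * 10**12,
                         date(2010, 3, 1),
                         {"project": "climate", "type": "simulation"}))
catalog.add(ArchiveEntry("/archive/video/master.mov", 2 * 10**12,
                         date(2010, 3, 2),
                         {"project": "film", "type": "video"}))
hits = catalog.find(project="climate", type="simulation")
```

Compare that lookup with a traditional backup, where finding the same file typically means working out which backup set holds it and then restoring from that set.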

Many environments will include both backup and archive. With the sophisticated software features available today, customers can establish policies based on the type, size, age and last access time of stored data, along with remaining disk space and other characteristics, to automate the decision of whether to keep data in the backup cycle or move it to the archive pool.
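A policy like the one described above might look something like the following Python sketch. Every threshold here (180 idle days, 1 GiB, 10% free space) is a number I made up for illustration, and the FileInfo and ArchivePolicy names are hypothetical; actual products expose their own policy settings.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FileInfo:
    path: str
    size_bytes: int
    last_accessed: date


@dataclass
class ArchivePolicy:
    # All thresholds are illustrative assumptions, not vendor defaults.
    min_idle_days: int = 180              # untouched this long -> archive candidate
    min_size_bytes: int = 1024**3         # only move files of 1 GiB or more
    low_space_fraction: float = 0.10      # below 10% free, archive more eagerly
    low_space_idle_days: int = 30

    def decide(self, info: FileInfo, today: date, free_fraction: float) -> str:
        """Return "archive" or "backup" for one file."""
        idle_days = (today - info.last_accessed).days
        # When production disk is nearly full, relax the idle-time threshold.
        if free_fraction < self.low_space_fraction:
            threshold = self.low_space_idle_days
        else:
            threshold = self.min_idle_days
        if idle_days >= threshold and info.size_bytes >= self.min_size_bytes:
            return "archive"
        return "backup"
```

The point of the sketch is that the decision is driven entirely by attributes of the data and the state of the disk, so it can run unattended once the thresholds are agreed on.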

These two functions can be performed within a single library in separate partitions. The software can then provide notification of which tapes need to be exported based on the function that was performed on those tapes, backup or archive. I have seen estimates as high as 80% for the share of data that is duplicated within a storage infrastructure because the differences between backup and archive aren’t fully understood. At the end of the day, knowing the differences and benefits of backup and archive technologies, when to use them and how to balance the two functions in an environment can drastically reduce redundancy, complexity and storage operating costs.

In Part Two of this discussion, we will look at how archives that contain production data, no matter how old or infrequently accessed, can still be retrieved online using high-density, high-speed tape systems and secondary disk systems. Stay tuned for my next post, which will look at enduring access to data.

Want to talk more? I’ll be in Dearborn, Michigan at the IDC HPC User Forum and DICE Alliance 2010 events next week. Contact me at