Understanding the Nuances of Data Movement in Today’s Complex Systems


It’s 1985. You’re the last one to leave the office that afternoon. You go into the server room, put the 8-inch floppy disk in the drive, go back to your office, type a backup command via the “terminal” on your desk, and now you’re free to go home. Your data is being backed up and is safe. One last thing… don’t forget to take yesterday’s backup floppy with you in case the building burns down overnight. OK, maybe you worked for a slightly larger company than I did in 1985, but it worked pretty much the same. There was storage, and there was backup for that storage. We didn’t refer to “primary” storage. Disk was storage, and floppies, CDs and tape were the backup for that storage.

As storage and access to storage have evolved, so have the methods to protect data and place it in an appropriate storage tier for cost and access. Hence the discussion of Backup, Archive, Hierarchical Storage Management (HSM) and Migration. While the terms are often used interchangeably, there are key differences between them, and those differences can be quite significant. Understanding the variances is key to creating a fail-safe data protection scheme as well as an efficient and affordable storage infrastructure.

Backup

Many backup options have been introduced over the years – snapshots, fulls/incrementals, incremental-incrementals, disk-to-disk, etc. But backup, inherently, is a simple concept. Data is created or captured on some form of storage medium. If it’s the only copy in existence, it’s vulnerable to accidental deletion, storage medium failure, natural disaster or (assuming it’s still online) some type of cyberattack. So a duplicate copy is made on a storage medium that can be taken offline and offsite for protection. The original data is left where it was, and a second copy is stored somewhere else. Moreover, data continues to be backed up on a regular basis, so users have multiple copies to turn to in case one copy becomes corrupt or is otherwise inaccessible. This procedure covers the “big four” threats mentioned above.
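
To make the concept concrete, here’s a minimal sketch in Python of a backup pass. The paths, the timestamped folder naming and the `backup` function are all illustrative assumptions, not any particular product’s behavior; the point is simply that the original stays put and a new generation is written to a separate target.

```python
# A minimal sketch of a backup pass: copy the data to a separate
# target and keep multiple dated generations. The original is left
# exactly where it was. In practice the target would be media that
# can be taken offline and offsite.
import shutil
from datetime import datetime
from pathlib import Path

def backup(source_dir: str, backup_root: str) -> Path:
    """Copy source_dir into a new timestamped folder under backup_root."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"backup-{stamp}"
    shutil.copytree(source_dir, dest)  # duplicate; the original stays put
    return dest

# Each run creates another generation, so if one copy is corrupt,
# an older one can be restored instead.
# backup("/data/projects", "/mnt/offline_media")
```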

The 13th annual Cost of a Data Breach Report found that the average cost of a data breach is $3.86 million, a 6.4% increase year over year.

Here’s a simple way to look at backup when comparing it to the other data management approaches: the backup is a copy of the original, and it should be stored offline and offsite. If it isn’t, the data may be protected in some other way, but it’s not “backed up”.

Archive

Archive is very similar to backup. The main difference is that the original data no longer resides in its original location. While that may sound too simple to mention, there are significant implications. If data is deleted from its original location after being copied, users will need a way to find it when it’s needed. Once the data is moved and the original deleted, the original path or file system no longer sees it. A separate database must be referenced to find the data, and that database’s format and features vary from application to application.
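
Here’s a similarly rough sketch of what an archive operation involves, assuming a plain JSON file stands in for the archive application’s catalog (real products use their own databases and formats). The `archive` and `locate` functions and the paths are hypothetical; the point is the move-and-catalog pattern: the original is deleted, so the catalog becomes the only way back to the data.

```python
# A rough sketch of an archive operation. A JSON file stands in for
# the archive application's catalog; all paths are hypothetical.
import json
import shutil
from pathlib import Path

CATALOG = Path("archive_catalog.json")

def archive(original: str, archive_root: str, project: str) -> None:
    """Move a file to the archive and record where it went."""
    src = Path(original)
    dest = Path(archive_root) / project / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))  # move, not copy: the original is gone
    entries = json.loads(CATALOG.read_text()) if CATALOG.exists() else {}
    entries[str(src)] = {"location": str(dest), "project": project}
    CATALOG.write_text(json.dumps(entries, indent=2))

def locate(original_path: str) -> str:
    """The file system no longer sees the data; only the catalog does."""
    entries = json.loads(CATALOG.read_text())
    return entries[original_path]["location"]
```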

Spectra’s family of tape and disk products for backup and archiving

Archives are great for large amounts of infrequently accessed or fixed data that can be associated with a project or grouped in some way. Fixed data includes content such as last year’s final financials; a completed movie, sports event or news event; or a large data download or output from research. Numerous data sets qualify, and they are relatively simple to identify for recall by their larger grouping rather than by searching for a single file. Once data has been safely archived, it’s no longer backed up. Archiving data is a great way to decrease the amount of primary or active data that needs to be backed up on a regular basis and, at its core, is a data management process that enables cost efficiency.

If the data is going to be accessed semi-frequently, HSM or Migration is a better approach.

Hierarchical Storage Management (HSM)

HSM is a concept that allows organizations to tier their data, keeping the most business-critical, frequently accessed data on the most responsive (and expensive) tier of storage, and moving less critical data to more affordable storage, including disk or tape. Here’s the big differentiator for HSM: when the data is moved from its original location, a “stub” file is typically left in its place. The stub file contains some of the data that’s been moved. This seems like an ideal way to move data because the application or user can go to the original location to retrieve the data even though it’s been moved. When the user or application requests the data, its return starts immediately from that stub file, while the remainder of the data is recalled from the new location. Depending on the type of storage target and its recall capabilities, users may notice a slight delay, or the HSM application may time out.
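
The stub mechanism can be sketched in a few lines of Python, with invented sizes and paths; real HSMs work below the file system and are far more sophisticated. The sketch shows the essential trade: the stub lets a recall start returning bytes immediately, but the bulk of the data still has to come back from the slower tier.

```python
# A toy illustration of HSM stub files. STUB_BYTES and the paths are
# invented; real HSMs operate below the file system.
import shutil
from pathlib import Path

STUB_BYTES = 4096  # assumed amount of data the stub retains

def demote(path: str, tier2_root: str) -> None:
    """Move a file to Tier 2, leaving a stub with its first bytes."""
    src = Path(path)
    shutil.copy2(src, Path(tier2_root) / src.name)  # full copy on Tier 2
    with open(src, "rb") as f:
        head = f.read(STUB_BYTES)
    src.write_bytes(head)  # truncate the original down to a stub

def recall(path: str, tier2_root: str):
    """Yield the stub's bytes at once, then stream the rest from Tier 2."""
    src = Path(path)
    yield src.read_bytes()  # immediate: served from the stub
    with open(Path(tier2_root) / src.name, "rb") as full:
        full.seek(STUB_BYTES)  # skip what the stub already returned
        while chunk := full.read(65536):  # slower: recalled from Tier 2
            yield chunk
```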

HSMs can be complex. In addition to the risk of time-out errors, an HSM is also the only method of retrieving the moved data: its proprietary format means that if the HSM goes down, so does access to the data. And if a tremendously large file has been moved this way, a recall may fail because primary storage no longer has the capacity to hold it. Many HSM solutions have come and gone, but the HSM applications that have stood the test of time show up most often in the high-end High Performance Computing (HPC) world and are capable of integrating tape as a storage tier accessible by users or applications.

Migration

Migration holds some of the most interesting possibilities for truly opening up the world of storage options, from flash to tiered disk to tape to cloud. Most migration applications use symbolic links instead of stub files. When data is moved to Tier 2 disk or highly responsive cloud storage, a symbolic link is left in the data’s original location to redirect the application to the new destination, where the data can be accessed directly. Cloud can even play a part in semi-active data recall if users have contracted for an appropriate data access speed. Other migration applications function as a second file system and sit directly in the path of the data. While that approach is more complex to implement, it can make data recall easier for both users and applications. All migration applications still have to deal with time-out issues if the data has been moved to a slower-response tier such as long-term cloud storage or tape. That is where having object storage on the back end can be very helpful, a topic we’ll touch on in my next blog.
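
Link-based migration is easy to illustrate on a POSIX file system. The paths below are made up, but the mechanism, a standard symbolic link, is what makes this approach transparent: the original path keeps resolving after the move.

```python
# A minimal sketch of link-based migration on a POSIX file system.
# Paths are hypothetical; the symbolic link is the real mechanism.
import os
import shutil
from pathlib import Path

def migrate(path: str, tier2_root: str) -> None:
    """Move a file to another tier, leaving a symlink at the old path."""
    src = Path(path)
    dest = Path(tier2_root) / src.name
    shutil.move(str(src), str(dest))  # data now lives on the cheaper tier
    os.symlink(dest, src)             # old path transparently redirects

# After migrate("/data/render.mov", "/mnt/tier2"), an application that
# opens /data/render.mov follows the link and reads the file directly
# from its new location: no stub, no partial recall.
```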

Today’s “data mover” applications allow a mix of storage media and approaches to be implemented. IT professionals now have many more options than the 8-inch floppy disk, but each of those options comes with caveats that must be examined. By determining how much data can be archived, IT professionals can significantly decrease the amount of active data they have to deal with on a daily basis. By implementing a migration approach for the remaining data, the cost and performance of each storage tier can be matched with business needs. And as a final thought: no matter how much archiving and/or migration we implement, backing up active, mission-critical data to an offline medium stored offsite is still the best way to avoid a system shutdown due to cyberattack, ransomware or natural disaster.