CTO Insight: Big Data: Why Tape?

 By Matt Starr, Spectra Logic’s CTO

I have watched the tape market grow over the last two years, and that growth seems to be driven mostly by the increasing number of archive installations. With much larger system implementations projected through 2014, this growth will continue for the foreseeable future. Military low-altitude and high-altitude video surveillance in countries like Afghanistan, the media and entertainment industry’s drive to 4K file data, and the growth in PACS data are just a few of the many market segments driving the implementation of large archives.

These are areas where dedupe, and disk in general, fall down precisely because of the raw quantity of data involved: the disk resources required would be enormous and would consume enormous quantities of power, and the delays needed to deduplicate and then reduplicate the data are unacceptable.

EMC’s recent “Big Data” news splash did not mention tape, which kind of shocked me! (It’s only kind of shocking, as EMC is tape-hostile.) Tape is Big Data: 80% of the world’s data is stored on tape [1], and tape is the only medium that can scale to exabytes and still be cost-effective. In fact, tape is the only cost-effective method of storing Big Data. Tape storage is denser than disk storage, costs less up front and is ten times less expensive to operate over time than a disk-based solution. I am not implying that disk has no part to play in the Big Data world; it is just not well suited to be the “meat” of a storage environment.

So, where does disk belong in this Big Data world? First, disk works very well as the cache system that interacts directly with the user via a file system, WebDAV, FTP or other front-end system. Second, disk is the right platform for metadata storage. For far too long, users have been saving data as file names and not as objects with metadata. As archives grow, object storage and metadata will take the front seat in how data is stored. Lastly, disk has an important role in helping to make stored data searchable: why would you store data if you cannot get it when you need it? In my opinion, roughly 10% of the total archive space should be dedicated to metadata and search. Add another 10% of the total archive as disk space for cache, and the picture starts to come together. Roughly 20% of your total archive should be disk, with the other 80% consisting of long-lived, reliable, cost-effective tape.
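To make the 10/10/80 rule of thumb concrete, here is a minimal Python sketch that splits a total archive capacity into the three tiers described above. The function name, the field names and the 1,000 TB example are illustrative assumptions, not part of any Spectra Logic tool; only the percentages come from the text.

```python
def archive_split(total_tb, metadata_frac=0.10, cache_frac=0.10):
    """Split a total archive capacity (in TB) into disk and tape tiers.

    The default fractions follow the rule of thumb in the text:
    ~10% disk for metadata and search, ~10% disk for cache, the rest tape.
    The function and key names are hypothetical, chosen for illustration.
    """
    if total_tb < 0:
        raise ValueError("total_tb must be non-negative")
    if metadata_frac < 0 or cache_frac < 0 or metadata_frac + cache_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")

    metadata_tb = total_tb * metadata_frac
    cache_tb = total_tb * cache_frac
    tape_tb = total_tb - metadata_tb - cache_tb  # whatever is left goes to tape

    return {
        "metadata_search_disk_tb": metadata_tb,
        "cache_disk_tb": cache_tb,
        "tape_tb": tape_tb,
    }


if __name__ == "__main__":
    # Hypothetical 1,000 TB (1 PB) archive.
    for tier, capacity_tb in archive_split(1000).items():
        print(f"{tier}: {capacity_tb:.0f} TB")
```

For the hypothetical 1 PB archive, this works out to about 100 TB of disk for metadata and search, 100 TB of disk for cache and 800 TB of tape. The fractions are parameters because the split is an opinion-based starting point, not a fixed requirement.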

Reliable? Yes. The facts are absolute and irrefutable: tape is extremely reliable, and more reliable than disk. With error correction, tape’s bit error rate is on the order of 10^-17 to 10^-19, which blows disk’s reliability statistics [2] out of the water. Additionally, modern tape libraries have features like Spectra Logic’s Media Lifecycle Management (MLM), which predictively informs the user about the health status of the tape as it is being used. Features like this layer on reliability even beyond tape’s already high reliability. Through MLM and other features (stay tuned for a few upcoming announcements this spring), Spectra’s TSeries libraries ensure that the data on the tape is intact and recoverable from the archive.
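As a rough illustration of what error rates of this size mean at archive scale, the sketch below estimates the expected number of uncorrected bit errors when reading a given amount of data. It assumes the figures above are uncorrected bit error rates (errors per bit read) and that errors are independent; both are simplifying assumptions, and the function name and the 1 PB example are hypothetical.

```python
PETABYTE = 10**15  # bytes, decimal petabyte


def expected_bit_errors(bytes_read, bit_error_rate):
    """Expected number of uncorrected bit errors for a given read volume.

    Assumes independent errors at a constant rate per bit read, which is
    a simplification of how real media behave.
    """
    if bytes_read < 0 or bit_error_rate < 0:
        raise ValueError("inputs must be non-negative")
    return bytes_read * 8 * bit_error_rate


if __name__ == "__main__":
    # The two tape error rates quoted in the text.
    for rate in (1e-17, 1e-19):
        errors = expected_bit_errors(PETABYTE, rate)
        print(f"BER {rate:g}: about {errors:g} expected bit errors per PB read")
```

Under these assumptions, reading a full petabyte at a rate of 10^-17 gives an expectation of roughly 0.08 bit errors, and at 10^-19 roughly 0.0008. The same function can be run with a disk error rate from a vendor data sheet to compare the two.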

The architects and developers of data archives will continue to build systems based on disk and tape, not just disk. When a Big Data archive is based on disk alone, one or more of the following is true of the organization behind it: (1) it is not a Big Data environment, but wants to be (or thinks it is); (2) it is wasting money and should be answering to its shareholders or voters; (3) it has been mis-educated on tape. In the end, tape is far from dead and will continue to prove itself as the ideal medium in the Big Data world.

