Q&A: How CERN Archives 1 EB of Experimental Data from the LHC on Tape

By Jenna Blumenfeld

Vladimir Bahyl, senior data storage engineer at CERN, shown with a Spectra TFinity tape library featuring custom high-energy-physics visualization graphics

Buried roughly 100 meters beneath the idyllic countryside along the Franco-Swiss border lies an engineering marvel that has led to some of the most significant scientific discoveries of our time: the Large Hadron Collider (LHC).

Designed, built, and operated by CERN (the European Organization for Nuclear Research), this 27-kilometer ring of superconducting magnets accelerates and collides beams of particles, offering insights into the deepest mysteries of the universe — from the formation of galaxies to the fundamental laws of physics.

Unsurprisingly, this research generates massive volumes of data. To preserve this openly shared data for the long term, CERN relies on a robust storage infrastructure that includes tape-based archives from Spectra Logic. CERN currently manages the largest scientific data archive in the High Energy Physics (HEP) domain and continuously innovates in data storage.

Recently, CERN’s data storage team reached a major milestone: one exabyte of experimental data from the LHC archived on tape.

For over two decades, Vladimir Bahyl, Senior Tape Technology Data Storage Specialist, has overseen the management and archiving of CERN’s invaluable data. Here, he shares how his team achieved this landmark — and why tape remains essential for safeguarding the world’s most critical scientific breakthroughs.

Spectra: Why does CERN depend on tape technology for long-term data retention?

Vladimir Bahyl: The primary reason we rely on tape technology to archive physics data is because it is the most cost-effective medium to store large quantities of data for the long term. It’s also very secure and, from our perspective, it’s relatively easy to manage once you reach a certain scale.

The LHC produces up to two petabytes per day. What are some of the biggest workflow challenges your team encounters while managing such volumes?

The biggest challenge we face is balancing the ever-increasing requirement to store the huge quantities of data being generated on one side of the workflow against the evolution of tape technology on the other.

In detail, our workflow transforms many fast, parallel data streams from random-access devices into writes to sequential storage devices. We must find the right balance so that the buffers in between do not overflow and the tape cartridges are mounted optimally.
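To make that balance concrete, here is a minimal Python sketch of the idea; it is not CERN's actual software, and all names and sizes are hypothetical. Parallel streams feed a bounded staging buffer that a single sequential tape writer drains, so when producers outrun the writer, the buffer blocks them rather than overflowing:

```python
# Minimal sketch of buffered staging between parallel producers and a
# sequential tape writer. All names and sizes are hypothetical; this
# illustrates backpressure, not CERN's production software.
import queue
import threading

staging_buffer = queue.Queue(maxsize=64)  # bounded: a full buffer blocks producers

def detector_stream(stream_id: int, n_chunks: int) -> None:
    """Producer: a fast, random-access data source writing into the buffer."""
    for i in range(n_chunks):
        chunk = f"stream-{stream_id}-chunk-{i}"
        staging_buffer.put(chunk)  # blocks when the buffer is full (backpressure)
    staging_buffer.put(None)       # sentinel: this stream is done

def tape_writer(n_streams: int) -> None:
    """Consumer: drains the buffer and writes sequentially, one chunk at a time."""
    finished = 0
    while finished < n_streams:
        chunk = staging_buffer.get()
        if chunk is None:
            finished += 1
            continue
        # A real system would append to the currently mounted cartridge here.
        print(f"writing {chunk} to tape")

streams = [threading.Thread(target=detector_stream, args=(s, 5)) for s in range(3)]
writer = threading.Thread(target=tape_writer, args=(3,))
for t in streams:
    t.start()
writer.start()
for t in streams:
    t.join()
writer.join()
```

The bounded queue is the key design point: backpressure propagates to the fast producers instead of data ever being dropped.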

There are different sets of challenges when you archive the data and retrieve the data, too. For archiving, we have an optimized, streamlined process where the data flows through SSDs so we have control over how the data is placed on tape cartridges.

When retrieving data from tape cartridges, the situation is more complex because we don’t have control over what files researchers request or when.

Another important part of the challenge is to correctly catalog those millions of files and report the state of each transfer (i.e., whether the file is safely on tape) back to the client.
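As a hedged illustration of that bookkeeping (the schema, states, and cartridge label below are invented for this example, not CERN's actual catalog), each file can carry an explicit transfer state, and the client is told a file is safe only once its copy on tape is confirmed:

```python
# Hypothetical file-catalog sketch: track each file's transfer state and
# report back to the client only once the copy on tape is verified.
from dataclasses import dataclass
from enum import Enum, auto

class TransferState(Enum):
    QUEUED = auto()   # accepted, waiting in the SSD staging buffer
    WRITING = auto()  # being written to a mounted cartridge
    ON_TAPE = auto()  # write completed and verified; safe to report

@dataclass
class CatalogEntry:
    path: str
    size_bytes: int
    cartridge_id: str | None = None
    state: TransferState = TransferState.QUEUED

catalog: dict[str, CatalogEntry] = {}

def register(path: str, size_bytes: int) -> None:
    catalog[path] = CatalogEntry(path, size_bytes)

def mark_on_tape(path: str, cartridge_id: str) -> None:
    entry = catalog[path]
    entry.cartridge_id = cartridge_id
    entry.state = TransferState.ON_TAPE
    # Only now would the client be notified that the file is safely on tape.
    print(f"{path}: safely archived on cartridge {cartridge_id}")

register("/experiment/run3/event-001.raw", 2_000_000_000)
mark_on_tape("/experiment/run3/event-001.raw", "VR1234")  # hypothetical label
```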

Managing data at the exabyte scale is a significant undertaking. What advice would you offer to other institutions approaching similar volumes?

To store massive data quantities, institutions might want to choose tape technology as their primary archival hardware because it is the most cost-effective storage strategy. If so, my advice is to seek storage devices from a reliable supplier with a track record of storing data at similar volumes.

Institutions will also need customizable management software to orchestrate this data. One advantage of working at CERN is that we have many decades of experience managing data on tape. Given that we have developed our own data storage software, we have full control over how our data is stored.

Lastly, managing large datasets requires a knowledgeable storage team to integrate these systems — and keep the data workflow running seamlessly to minimize archiving roadblocks.

With these building blocks established, it can be relatively straightforward to manage an exabyte of data.

How many copies of the data do you store at CERN?

The data model of the Large Hadron Collider is based on collaboration with many institutions. At CERN, we only keep one copy of the data on tape. The second copy of the data is placed with a collaborating institute of a given experiment.

For example, CERN’s CMS (Compact Muon Solenoid) experiment has a copy at Fermilab near Chicago; CC-IN2P3 (CNRS) in France also stores CERN data in its tape archives.

Could you speak to how your team stores this data for long-term preservation?

Long-term preservation of scientific data is an enormous challenge and a complex task because it’s not just about preserving the bytes — you also need to preserve the meaning of the data. You need to understand in which format the data is written and what the data actually means. You need to have a framework or additional software to analyze this data.

There is actually a separate team at CERN that collaborates with the long-term preservation community. This team focuses on archiving, duplicating, safeguarding, and translating data into modern formats to support the discoveries of tomorrow.

From my perspective, we basically store this data forever on tapes. We don’t own the data — we are just guardians of this knowledge.

Could you share any details on the power required to store such quantities?

At CERN, we also manage around one exabyte of raw space on spinning disk. With comparable capacities on disk and on tape, I’d say tape consumes about one-tenth the electricity of spinning disk.
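As a back-of-the-envelope illustration of what a tenfold difference means at this scale (the per-terabyte wattages below are hypothetical, not measured CERN figures):

```python
# Back-of-the-envelope comparison; the per-terabyte wattages are
# hypothetical illustrations, not measured CERN figures.
EXABYTE_TB = 1_000_000     # 1 EB expressed in terabytes

disk_watts_per_tb = 0.5    # hypothetical: spinning disk stays powered
tape_watts_per_tb = 0.05   # hypothetical: ~10x less; idle cartridges draw nothing

disk_kw = EXABYTE_TB * disk_watts_per_tb / 1000
tape_kw = EXABYTE_TB * tape_watts_per_tb / 1000
print(f"disk: {disk_kw:.0f} kW, tape: {tape_kw:.0f} kW")  # disk: 500 kW, tape: 50 kW
```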

CERN is completing LHC Run 3 and then will prepare for Run 4, which will extend years into the future. How is your team planning to scale to multi-exabytes of storage capacity?

This is a big challenge. We’re counting on LTO technology to evolve in our favor, but unfortunately, we alone do not have enough leverage over the respective vendors. As in other domains, the development of data storage products is dominated by the requirements of hyperscalers, and we must adapt.

Depending on the LTO roadmap, to meet our growth requirements we may need more tape libraries with more slots for tape cartridges. From the technical perspective, tape technology has a solid roadmap, and I believe it can eventually be capable of storing over 400 terabytes of data on a single cartridge. We shall see how the market evolves.
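For a rough sense of the capacity planning involved, the sketch below estimates cartridge counts (and hence library slots) for a hypothetical 3 EB archive: 18 TB matches LTO-9 native capacity, and 400 TB is the projected future figure Bahyl mentions above.

```python
# Rough capacity planning: cartridges (and hence library slots) needed for a
# multi-exabyte archive. The 3 EB archive size is a hypothetical example.
import math

def cartridges_needed(archive_eb: float, cartridge_tb: float) -> int:
    return math.ceil(archive_eb * 1_000_000 / cartridge_tb)

for cartridge_tb in (18, 400):
    n = cartridges_needed(archive_eb=3, cartridge_tb=cartridge_tb)
    print(f"3 EB at {cartridge_tb} TB/cartridge: {n:,} cartridges")
# 3 EB at 18 TB/cartridge: 166,667 cartridges
# 3 EB at 400 TB/cartridge: 7,500 cartridges
```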

Why were Spectra TFinity tape libraries selected to be a core part of your storage strategy?

Spectra Logic libraries comprise about a third of our tape technology. A few years ago, CERN was looking to diversify its suppliers of tape libraries. Spectra was selected for its high-quality, large tape libraries — and when I say large, I mean over 15,000 slots.

One (often overlooked) differentiating factor was that Spectra TFinity libraries have an air filter and create positive pressure inside them. Because one of our data storage rooms is not fully climate controlled, those air filters provide an additional level of environmental protection against airborne particles or the occasional insect.

In what ways does Spectra alleviate some of your team’s data management challenges?

Spectra Logic TFinity libraries and robotics are well integrated into our workflows, providing reliable access to the data stored on the cartridges inside them. The access speeds and mount rates are completely sufficient for our use cases.

Finally, Spectra Logic has a dedicated and knowledgeable management team. Their insight into the overall evolution of the market and tape technology is invaluable and we greatly appreciate this collaboration.

Learn more about exascale archives by scheduling a meeting with Spectra.