The Impact of Data Management on HPC Workloads: Accelerating Outcomes While Preserving Data for Future Examination


By Matt Starr
Chief Technology Officer, Spectra Logic

Effective data and storage management is crucial for efficient HPC workflows: it can accelerate research while supporting reproducibility and preserving data for future reference. Groundbreaking discoveries often come with their own unique data management challenges. The datasets associated with innovative scientific, medical and technological research initiatives keep growing and are increasingly retained indefinitely. Arriving at the right data management strategy becomes paramount as the size of HPC datasets continues its inexorable march towards zettabytes.

Addressing Common HPC Data Management Challenges

Whether in academia, government, or industry, HPC and analytics communities face the challenge of managing and preserving large quantities of data. While storage technologies have continued to increase in capacity to accommodate explosive data growth, research computing has become increasingly diverse. As a result, organizations must contend with new data management challenges, because vast amounts of data need to be readily accessible for sharing and collaboration.

Common HPC data management woes include siloed data that becomes unfindable and inaccessible, unprotected data that can be jeopardized by ransomware, and primary storage overloaded with inactive datasets, which can double or triple storage costs. Unknown, stale, and unclassified data clutter systems and can grow unbounded, making it difficult at best to find the right data and get it to the right place at the right time. Data orchestration, classification and protection therefore become key components of the next generation of storage management. Embedded metadata and custom tagging make today’s datasets more easily searchable, but that metadata has rarely been harvested and stored in a searchable format. Along with preservation, HPC storage administrators must ensure that scratch space remains available and that currently needed data is rapidly accessible, while keeping data that is not currently needed off fast storage.
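To make the idea of harvesting metadata into a searchable format concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular product works: the function names (harvest_metadata, search), the table layout, and the choice of SQLite are all assumptions made for this example, and it records only basic filesystem attributes rather than embedded or custom tags.

```python
import os
import sqlite3


def harvest_metadata(root, db_path=":memory:"):
    """Walk `root` and record basic file metadata in a SQLite table.

    Hypothetical helper for illustration; a real catalog would also
    capture embedded metadata and user-defined tags.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size_bytes INTEGER, "
        "mtime REAL, extension TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                st = os.stat(full_path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            extension = os.path.splitext(name)[1].lower()
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (full_path, st.st_size, st.st_mtime, extension),
            )
    conn.commit()
    return conn


def search(conn, extension=None, min_size=0):
    """Return (path, size_bytes) rows matching simple criteria."""
    sql = "SELECT path, size_bytes FROM files WHERE size_bytes >= ?"
    params = [min_size]
    if extension is not None:
        sql += " AND extension = ?"
        params.append(extension.lower())
    sql += " ORDER BY path"
    return conn.execute(sql, params).fetchall()
```

Once the catalog exists, a question such as "which files of a given type are larger than a given size?" becomes a single query instead of a fresh crawl of the filesystem.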

What best practices do organizations employ when optimizing data management in support of breakthrough research? The impact of data management on HPC workflows is best seen in real-world use cases, such as the following case studies.

CERN Advances the Boundaries of Human Knowledge
The CERN Data Centre processes on average one petabyte (one million gigabytes) of data per day. Experiments at the Large Hadron Collider (LHC), the world’s largest and most powerful particle accelerator, produce more than 90 petabytes of data per year, and other experiments at CERN produce an additional 30 petabytes per year. To manage the archival storage of physics data, CERN uses the CERN Tape Archive (CTA) software, a high-performance archival storage system developed at CERN in 2020. Learn more about their use case here.


Hutchison/MRC Research Centre at the University of Cambridge Reduces Storage Infrastructure Costs by Up to 60%
A cancer research facility located on the Cambridge Biomedical Campus, the Hutchison/MRC Research Centre had developed a complex set of needs, which included running applications whose databases access files multiple times per second, potentially opening and closing those files every time; securely sharing research data with collaborators; and limiting access to specific pieces of data in particular locations within the folder hierarchies. The Hutchison/MRC Research Centre deployed Arcitecta’s Mediaflux® platform, backed by Spectra’s BlackPearl®. Mediaflux’s policy-based virtualization leverages metadata to combine dispersed data silos into a single global namespace. The new solution provides tiered storage capacity that is up to 60% less expensive, improves data discovery using metadata, and gives the centre the ability to leverage scalable storage on demand. Get the details in this case study.


Korea’s Basic Science Institute Advances the Frontiers of Research and Knowledge
The Basic Science Institute specializes in long-term projects that require large groups of researchers. It recently purchased a new cryo-electron microscope (cryo-EM), which creates high-resolution images of molecules and generates around two terabytes of data per day. The institute deployed Spectra’s StorCycle® Storage Lifecycle Management software, a Spectra BlackPearl® Platform and a Spectra T950 Tape Library with LTO-8 tape drives. StorCycle enables the institute to identify data and migrate it off primary storage onto a Perpetual Tier of storage that can include cloud, NAS, object storage disk and object storage tape. Read the case study here.
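The general idea of identifying inactive data on primary storage can be sketched in a few lines of Python. This is not StorCycle’s implementation or API; the function names, the 180-day threshold, and the use of file access and modification times as the inactivity signal are assumptions chosen for illustration. Access times in particular may be unreliable on filesystems mounted with options that suppress atime updates.

```python
import os
import time

SECONDS_PER_DAY = 86400


def find_inactive_files(root, max_idle_days=180, now=None):
    """Return sorted (path, size_bytes) pairs for files idle too long.

    Hypothetical helper: a file counts as inactive when both its last
    access and last modification are older than the cutoff.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                st = os.stat(full_path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            last_used = max(st.st_atime, st.st_mtime)
            if last_used < cutoff:
                candidates.append((full_path, st.st_size))
    return sorted(candidates)


def summarize(candidates):
    """Return (file_count, total_bytes) for a list of candidates."""
    total_bytes = sum(size for _path, size in candidates)
    return len(candidates), total_bytes
```

A dry run like this only reports what could be moved and how much fast-tier capacity it would free; the migration itself, and the choice of target tier, would be governed by site policy.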


Effective data management in high performance computing and research environments can accelerate outcomes, while preserving data for future use and examination. Click here to see how Spectra Logic helps organizations push the boundaries of operational objectives to meet performance, growth, and environmental needs.

Attending Supercomputing 2022 in Dallas? Stop by Booth 2806 to chat with experts from Spectra Logic about data management and data storage for HPC, and how our solutions can help accelerate discovery in supercomputing environments. More Spectra events here.