Webinar Q&A: Strategies for Managing the Growth of Machine-Generated Data in Life Sciences


Welcome to Spectra’s webinar Q&A roundup. In this blog series we will pick relevant questions from our recent webinars and publish the responses here. 

————————————————————————–

Spectra Logic recently hosted a virtual presentation with Spectra CTO Matt Starr on how to apply data preservation strategies to the explosive growth of data in the Life Sciences industry. During the webinar, Matt reviewed how other industries, such as media and entertainment, are tackling the challenges of preserving voluminous data sets that reside on costly primary storage. The following questions and answers recap highlights from the webinar.

Question:

As original research data is manipulated for analysis, peer review and publication, is it fair to say that those processes are creating more data?

Answer:

Those types of data sets are often referred to as ‘intermediate results’. Some in the Life Sciences industry believe that, as long as an output can be recomputed, it is sufficient to keep only the algorithm that performed the computation. However, it’s worth considering that, in some instances, such as genome research, analog parts of that workflow may give slightly different results each time they are run.

At Spectra, we typically advise organizations to keep all original data for the long term, including any intermediate results that cannot be replicated. The value of that practice becomes clear as new applications for data are developed over time. In a recent article, NASA announced that a team of transatlantic scientists, using reanalyzed early data from NASA’s Kepler space telescope, discovered an Earth-size, habitable-zone planet called Kepler-1649c.

Question:

As additional project files are generated, how can these new data sets be kept together with the original data collection? How is that maintained when a data mover is used to migrate data into an archive?

Answer:

In media and entertainment, directory structures are typically used to keep data sets associated with a particular project. Media asset management and digital asset management applications are used to keep those structures in place, and these applications can also create project names and import assets under a given project when moving files to an archive. There are applications and data movers for the Life Sciences industry that operate in a similar manner, such as StrongLink from StrongBox Data Solutions and Arcitecta’s MediaFlux, giving users a common view into the archive with familiar access to data sets regardless of where they are actually located in the storage ecosystem.

Spectra’s StorCycle® Storage Lifecycle Management software automates the storing, accessing and preserving of data by allowing users to control when data moves, what storage target it moves to, and how long it exists. With StorCycle, research data can be moved off of primary storage to an accessible tier of storage after a determined amount of time. Multiple copies can be made to ensure data is available and shareable if needed. Finally, familiar data access is preserved to enable transparent, easy recovery of data.
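To make the idea of moving data "after a determined amount of time" concrete, here is a minimal Python sketch of an age-based selection rule. It is a generic illustration only, not StorCycle's implementation or API: the function name find_archive_candidates, the use of last-modified time as the age signal, and the 90-day threshold in the usage line are all assumptions made for this example.

```python
import time
from pathlib import Path


def find_archive_candidates(root, min_age_days, now=None):
    """Return files under `root` last modified more than `min_age_days` ago.

    Hypothetical helper for illustration; a real lifecycle tool may use
    other signals (last access time, project status, file type).
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # 86400 seconds per day
    candidates = []
    for path in Path(root).rglob("*"):
        # Only regular files are candidates; directories are skipped.
        if path.is_file() and path.stat().st_mtime < cutoff:
            candidates.append(path)
    return sorted(candidates)
```

A call such as find_archive_candidates("/data/projects", 90) would list files untouched for roughly three months; the path and threshold here are placeholders, and what happens to the selected files (copying, verifying, removing from primary storage) is left to the archiving tool.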

Question:

Can cloud be used as a storage target with StorCycle?

Answer:

Yes, cloud is available as a storage target with StorCycle. As StorCycle migrates files off of a cluster file system, each file’s name becomes its object’s name in the cloud after the data is moved, enabling transparent access and recovery. The data movement is transparent and familiar to the point where a user could employ a completely separate tool to access that bucket in the cloud, pull the object down, and put it onto a file system. It would be the same file that was archived by StorCycle.
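For readers who want to confirm that an object pulled down with a separate tool really is the same file, one common approach is to compare checksums. The Python sketch below assumes a copy of the original file (or its recorded checksum) is still available for comparison; the function names sha256_of and is_same_file are hypothetical and are not part of StorCycle or any cloud provider's tooling.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large research files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_same_file(original_path, restored_path):
    """Return True when both files have identical contents (by SHA-256)."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

In practice the digest of the original would typically be recorded before the file leaves primary storage, so that the comparison can still be made after the original copy is gone.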

Question:

What are the core takeaways on how to solve the problem of managing the growth of machine-generated data in life sciences?

Answer:

Stop backing up raw image, unprocessed instrument and payload data at full resolution (level 0 data). Archive it and make two to three copies for data protection purposes. Store each step that cannot be recomputed, and store original content for the life of the project. Use a project-based archiving tool like Spectra’s StorCycle to keep files associated with their original data and source projects. This will enable organizations to keep high-speed storage as work space, rather than as a bulk storage repository.

The “Strategies for Managing the Growth of Machine-Generated Data in Life Sciences” webinar is part of SpectraLIVE, Spectra’s virtual conference program. Watch it on demand here.