Spectra Vail Use Case: Government Research Organization

Research Organization Combines Public and Private Cloud Storage to Increase the Pace of Understanding Our Planet
This government research organization provides information and predictions about the natural hazards that threaten lives and livelihoods; the water, energy, minerals, and other natural resources we rely on; the health of our ecosystems and environment; and the impacts of climate and land-use change. Its scientists develop new methods and tools to supply timely, relevant, and useful information about the Earth and its processes. A separate department manages the results and publication catalog, which gives scientists across the nation seamless access to the organization’s research and monitoring data. The scientific community can search, browse, and submit data to this catalog from every natural science discipline related to our planet.
Data as vast as the planet – the mission to deliver it to scientists everywhere
This organization collects data from around the world for current and future scientific study. Its methods for collecting, transmitting, and storing data have evolved from practices that predate digital storage. Disaster preparedness and response agencies depend on timely access to data and results from this organization and its supercomputers to plan and execute effective services. Near real-time access to data and results during natural disasters can affect the survival of the people involved.

Additionally, researchers use the data for longer-term studies that affect how we live. This organization delivers to its community of scientists a data sharing service that rivals current industry offerings in both the public and private sectors, while managing costs and making data available anywhere regardless of its physical location.

The journey to find a solution for their challenges

Prior to using Spectra’s Vail solution, this research organization relied on a collection of tools that had accumulated over time, ranging from FTP, to transmission over VPN, to uploading to the cloud with egress back to a central repository. The problem with this approach was that many of the remote collection sites did not have enough VPN bandwidth to complete daily transfers. Data was left siloed and inaccessible until it was transferred to the locations where it was needed.

Another option evaluated was to use the public cloud to supplement the existing VPN connection, but the costs of using the public cloud began to explode and made the workflow much too expensive to continue. The organization also explored passing the cost of the cloud along to the scientists, but a solution that did not require this was more desirable.

Many of the core challenges revolve around the daily volume of data from a single collection point, which can exceed 10TB, paired with diverse collection methods and the remoteness of many of the collection sites. The main challenge became how to create a workflow that decreased the friction of collecting and delivering data within its community while protecting all data for long-term retention in a central repository. The goal was to transfer and collect large data sets into a central repository, but the existing workflow was unable to break down the data silos and create a single workflow for all sites and departments. A primary objective was therefore to find a single solution to these challenges, one that frees up scientists’ time to focus on science rather than on sending and storing data.
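To put the 10TB daily volume in perspective, the short Python sketch below estimates how long one day's data would take to move over a single link. The link speeds and the function names are illustrative assumptions, not figures from this organization; the only number taken from the text above is the 10TB volume (treated here as decimal terabytes).

```python
def transfer_hours(terabytes: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move `terabytes` (decimal TB) over a link of `link_mbps` megabits/s.

    `efficiency` is a hypothetical factor (0-1] for protocol overhead and link sharing.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be > 0 and efficiency must be in (0, 1]")
    bits = terabytes * 1e12 * 8                      # TB -> bytes -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)  # bits / (bits per second)
    return seconds / 3600


def fits_in_a_day(terabytes: float, link_mbps: float, efficiency: float = 1.0) -> bool:
    """True if one day's data can be moved before the next day's data arrives."""
    return transfer_hours(terabytes, link_mbps, efficiency) <= 24


if __name__ == "__main__":
    # Assumed link speeds, chosen only for illustration.
    for mbps in (100, 1_000, 10_000):
        hours = transfer_hours(10, mbps)
        print(f"10 TB over {mbps} Mbit/s: {hours:.1f} h, fits in a day: {fits_in_a_day(10, mbps)}")
```

Under these assumptions, a fully dedicated 100 Mbit/s link would need roughly nine days to move a single day's data, which is consistent with the transfer backlog described above.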

Using On-Premise and Cloud Storage

With both on-premise storage and the public cloud available, the timing was perfect for this research organization to decide how to use one or both of these methods most effectively. In a position to benefit from the lessons learned by other science-focused agencies that share data, the organization was drawn to Vail’s ability to combine public cloud storage services and private cloud storage infrastructure into a single storage solution that can use cloud services and infrastructure from any vendor.

What this research organization learned from its scientific peer organizations was that public cloud offers highly desirable flexibility and agility; however, using the public cloud to serve petabytes of data per month to the public comes with significant cloud access charges that need to be balanced with the benefits of the cloud. Spectra showed how Vail could combine the benefit of the agility afforded by the public cloud with the economics of using its existing facilities and storage equipment to offer scientific data delivery as a public service to the world.
The Spectra Vail solution was a perfect fit for the desired workflow because it was able to build a true hybrid cloud solution that highlighted all the strengths of the cloud paired with the benefits of an on-premise solution. Vail’s ability to bring all data into a single management view where all data can be accessed regardless of its physical location was at the center of the decision to select it as the central data management solution. By connecting all sites into a single storage sphere, including tying into the AWS public cloud, Vail creates a seamless hybrid cloud workflow for collection, distribution, and storage.

Implementing the right solution

With Vail, this research organization was able to implement an end-to-end data management and delivery service that includes the following:

  • Allow data collection from instruments and contributors all over the world using a combination of Amazon Web Services (AWS) S3 and remote office servers, depending on conditions at any collection point.
  • Store data in AWS and on premise according to a policy to meet service levels that match the value of the data over time.
  • Allow government agencies and the public access to data via a combination of AWS and on-premise storage according to access policies while managing access costs to meet the needs of the data consumers and their budget resources.
  • Keep historic data in a multi-site, durable, accessible, and affordable datastore from which any dataset can be accessed in minutes for no incremental cost beyond storing it.

With Vail, this organization is able to leverage any combination of public clouds and on-premise datastores to match the value of data at any point in time while managing it based on the elasticity and reliability of AWS cloud services.

Vail / Research Organization Workflow

How data moves from site to site

Data is the lifeblood of research organizations because almost everything they do revolves around the data they collect. It comes from remote collection sites, remote offices, satellites, and drones, to name a few sources. These locations can be anywhere in the world, and they are where all the data originates that is later analyzed and developed into scientific findings and papers for future use.

It all begins in the field, where scientists gather information in places such as the side of a volcano or the surface of a glacier, or by recording scans of the earth. At the end of each day, the scientist returns to their hotel and has to transfer the data to the main office and central repository. To do this, the scientist uploads the day’s findings to a bucket in the AWS public cloud. The upload triggers a Vail lifecycle rule, so that all data uploaded into the AWS bucket is synced to a local on-site storage bucket in one of the main offices and then replicated to the second main office for disaster recovery (DR) purposes. With this workflow, scientists can focus on their research during the day and simply transfer all data collected that day into an AWS bucket with one of the thousands of S3-compatible tools available today. This organization uses the Cyberduck application or another S3-compatible application to manually transfer each day’s data into an AWS bucket, from which it is synced to local storage. Data that is synced to the main office is subject to AWS download and egress charges, but once the data is at the main offices no further egress fees are incurred. This is a true hybrid cloud storage workflow that ties in directly with the research organization’s private cloud.
Remote offices with a secure and reliable connection can host a Vail node that transfers data directly to the main offices based on the lifecycle policy attached to each bucket at the remote office. This method bypasses the public cloud and the download charges that the remote collection sites incur when centralizing data at the main offices. At the main offices, data is collected and consolidated so that analysis can be performed to produce the results needed around the world to help predict future volcanic eruptions, hurricanes, earthquakes, and other natural disasters. The organization then leverages its existing supercomputers to run analytics on the data that has been collected and transferred back to the main offices, producing detailed findings and results.
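The two ingest paths described above can be illustrated with a small in-memory Python model. This is only a sketch of the data flow, not Vail's API or configuration format: the `Bucket` and `HybridStore` classes, the bucket names, and the object keys are all hypothetical. It shows a cloud upload being synced to one main office and replicated to the other, with egress counted only when data leaves the public cloud, while the remote office path incurs none.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """Hypothetical stand-in for a storage bucket at one location."""
    name: str
    in_public_cloud: bool
    objects: dict = field(default_factory=dict)  # object key -> bytes


class HybridStore:
    """Toy model of sync rules between buckets (not the Vail API)."""

    def __init__(self):
        self.rules = []        # list of (source bucket, target bucket) pairs
        self.egress_bytes = 0  # bytes copied out of the public cloud

    def add_sync_rule(self, source: Bucket, target: Bucket) -> None:
        self.rules.append((source, target))

    def put(self, bucket: Bucket, key: str, data: bytes) -> None:
        """Store an object, then follow every rule whose source is this bucket."""
        bucket.objects[key] = data
        for source, target in self.rules:
            if source is bucket and target.objects.get(key) != data:
                if source.in_public_cloud and not target.in_public_cloud:
                    self.egress_bytes += len(data)  # charged once, on the way out
                self.put(target, key, data)


def build_demo():
    store = HybridStore()
    cloud = Bucket("field-uploads", in_public_cloud=True)     # remote collection sites
    remote = Bucket("remote-office", in_public_cloud=False)   # office with a local node
    main_a = Bucket("main-office-a", in_public_cloud=False)
    main_b = Bucket("main-office-b", in_public_cloud=False)   # DR copy
    store.add_sync_rule(cloud, main_a)
    store.add_sync_rule(remote, main_a)
    store.add_sync_rule(main_a, main_b)
    store.put(cloud, "volcano/day1.dat", b"x" * 1000)
    store.put(remote, "gauge/day1.dat", b"y" * 500)
    return store, cloud, remote, main_a, main_b


if __name__ == "__main__":
    store, cloud, remote, main_a, main_b = build_demo()
    print(sorted(main_b.objects), store.egress_bytes)
```

In this model both objects end up at both main offices, and only the object uploaded through the public cloud contributes to the egress counter.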

After data collection and analysis are completed, all data must be kept for long-term retention and preservation. The organization has a policy of not deleting data, because it can always be used for further validation of scientific findings; for that reason, it is important that all data is kept and protected so that it is always available. Vail helps achieve this by implementing an on-premise “glacier”, or cold storage, repository where all data can be accessed, kept, and protected for long-term storage. This tier of storage acts as a local glacier but offers free retrieval of data, and it is not hindered by the lengthy restore times associated with cold cloud storage. Local glacier tiers can be accessed immediately, and restoration can begin within minutes rather than the hours required with the public cloud.

With this data flow, this research organization is able to effectively create its own hybrid cloud that acts as a single storage platform for all the data within the organization, all while leveraging the value and flexibility of the public cloud.

Science Catalog Workflow

After the research side of the organization has collected and analyzed its data, and the results are ready to be published and distributed to the scientific community around the world, the results are passed to the organization’s Science Catalog department. This department is responsible for the distribution and preservation of all the papers and research studies that are published by the research organization. Vail acts as the data management and transfer application that moves the finished data from the research data center to the Science Catalog distribution platform. As data gets older and is used less often, it is moved out of the distribution platform and kept for long-term retention in the local glacier archival repository.
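The age-based movement described above can be sketched as a simple placement rule in Python. The 90-day threshold, the tier names, and the function names are assumptions made for illustration; the source does not state the actual policy values. Note that nothing is ever deleted, in keeping with the retention policy: data only moves from the distribution tier to the local glacier tier.

```python
from datetime import date


def choose_tier(last_access: date, today: date, hot_days: int = 90) -> str:
    """Pick a tier from how long a dataset has been idle (threshold is hypothetical)."""
    idle_days = (today - last_access).days
    if idle_days < 0:
        raise ValueError("last_access is in the future")
    return "distribution" if idle_days <= hot_days else "local_glacier"


def plan_moves(catalog: dict, today: date, hot_days: int = 90) -> list:
    """Return the names of datasets that should move to the local glacier tier.

    `catalog` maps a dataset name to a (current_tier, last_access_date) tuple.
    """
    return sorted(
        name
        for name, (tier, last_access) in catalog.items()
        if tier == "distribution"
        and choose_tier(last_access, today, hot_days) == "local_glacier"
    )


if __name__ == "__main__":
    # Hypothetical catalog entries.
    catalog = {
        "eruption-forecast": ("distribution", date(2024, 5, 20)),
        "glacier-survey": ("distribution", date(2023, 6, 1)),
        "seismic-archive": ("local_glacier", date(2020, 1, 1)),
    }
    print(plan_moves(catalog, today=date(2024, 6, 1)))
```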

List of sites and components at each site:

Main Offices (2)

  • Each site has the following:
    • Supercomputer
    • BlackPearl with Vail Endpoint Node
    • TFinity Tape Library

Remote Offices – Many spread out throughout the United States

  • These sites are located in physical office buildings
  • Connected to the central offices with a decent bandwidth connection
  • Each site has the following:
    • 1 Vail Node in the form of a single VM instance
    • Disk storage – local VM storage

Remote Collection Sites – Field sites where data is collected and sent back to the main offices for analysis and storage

  • These sites are located in a remote location in the field
  • Data is collected in the field – side of a volcano, on a glacier
  • Return to a hotel room or basic internet connection and upload collected data into AWS cloud via an S3 application.
*Amazon Glacier is a registered trademark of Amazon Technologies, Inc.
