1. Introduction
This paper is divided into four sections. First, we compare vendor cloud solutions with inhouse and open cloud solutions on specific points. We then review the background and historical uses of REDDnet and look into what we believe is an exciting expansion of REDDnet beyond its initial design as a network primarily for physicists. In the third section we report on our testing with REDDnet and the tools available for it. We end by discussing the strategies and goals we developed for our continued involvement in REDDnet.
We also touch in a general way upon cost issues when using clouds, and point to sources for more specific evaluation methods. In the coming year, REDDnet will undergo a number of changes intended to assist the physics community. Those changes will also be available to us and will benefit our library community. We anticipate that a future article will address our progress in digital archival preservation.
We do not address Digital Access Management Systems, techniques of digitization or metadata creation. We focus entirely on the archival preservation of digital data in the cloud.
2. Key responsibilities for cloud systems: vendor, inhouse and open clouds
At least for the near future, current technology for digital curation leaves the owner of data with a task that is never finished. Information professionals such as Reagan Moore have referred to the dilemma of this task as "communication with the future" (2008), referring to those inhabitants thousands of years from now who will see us as part of their ancient history. Surrounding this communications channel to the future are numerous perils. We foresee the dangers of nature (weather damage and flooding, seismic events), criminal acts (terrorism and vandalism, viruses), and our everyday media, which is itself unstable and can wear out or change spontaneously. Beyond these difficult concerns are complex issues such as cost savings from switching to new storage devices, and conversion tasks. Over time we generally add to our existing data, so the collections never grow smaller.
Security of library data takes on a new aspect with an ever-increasing concern: electronic theses and dissertations, which can contain patent information or information restricted to a sponsoring organization, are now the object of exploits to get hold of the data.
New file formats are also spreading quickly: note in particular the eReader marketplace with its abundance of new formats. On top of everything else, digital objects can be controlled by Digital Rights Management (DRM) software, which means that ownership of the files is not necessarily the same as access. For the most part cloud solutions do not address these threats and issues. There are many more issues, some of which are often exposed and some of which go unseen. Figure 1 below shows a number of these types of issues and summarizes those that prompted us to look at open clouds.
Comparison of vendor cloud systems to inhouse systems.
| Issue | Vendor | Inhouse | Open | Comment |
|---|---|---|---|---|
| Responsibility for data loss | May still be the user's responsibility | User's responsibility | User's responsibility | From Amazon's Web Services agreement: WE AND OUR AFFILIATES OR LICENSORS WILL NOT BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR EXEMPLARY DAMAGES (INCLUDING DAMAGES FOR LOSS OF PROFITS, GOODWILL, USE, OR DATA), EVEN IF A PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. [1] |
| Make archival copies | May often be the user's responsibility | User's responsibility | User's responsibility | Amazon recommends Cloud customers make frequent archives of data. |
| Speed of repair of system damage | Not in the user's hands | User's responsibility | Not in the user's hands | |
| Data storage methods | Not in the user's hands | User's responsibility | User's responsibility | |
| Day-to-day running of the Cloud | Vendor's responsibility | User's responsibility | Infrastructure hardware: cloud owner; cloud owner's software: cloud owner; user hardware: user; user's software: user | |
| Reports on health of the system | Vendor's responsibility | User's responsibility | User's responsibility | |
| Periodic data recovery fire drills | What does the vendor say on this issue? | User's responsibility | User's responsibility | |
| Data security | Does the vendor claim responsibility? | User's responsibility | User's responsibility | Some vendors do not claim responsibility for this. |
| Disaster recovery | Does the vendor claim responsibility? What about your archival copies? | User's responsibility | User's responsibility | |
Figure 1. Data handling issues for vendor, inhouse, or open cloud systems. Issues that may be overlooked. (Amazon web services, 2011).
Let's take a look at a practical example. We have seen an explosion of CODECs (coding/decoding software and hardware) which can compress or encrypt data. Our audio and video digital objects make significant use of them. An archival preservation system needs to know of the existence and location of each file, its file format, and its video or audio CODECs (video files usually have a video and an audio CODEC, in some cases multiple CODECs in a single file). With these files located in the preservation system and their formats and CODECs recorded, it will be easier to convert them if necessary. At some point in the more distant future, a conversion will certainly be needed. Users who need to access a digital object will know what types of CODECs, if any, are required, and how to obtain them.
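The record-keeping described above can be illustrated with a minimal sketch. It assumes a simple in-memory inventory; the class name `ArchivedFile`, the function `files_by_codec`, and the sample paths and CODEC names are all hypothetical and are not part of REDDnet, L-STORE, or our production systems.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivedFile:
    """One inventory entry: where a file lives, its format, and its CODECs."""
    location: str        # path or storage identifier (hypothetical)
    file_format: str     # container or file format, e.g. "mp4"
    codecs: tuple = ()   # zero or more CODEC names used inside the file


def files_by_codec(inventory):
    """Index the inventory by CODEC so conversion candidates are easy to find."""
    index = defaultdict(list)
    for entry in inventory:
        for codec in entry.codecs:
            index[codec].append(entry.location)
    return dict(index)


# Example entries; the paths and CODEC names are made up for illustration.
inventory = [
    ArchivedFile("oral-history/interview01.mp4", "mp4", ("h264", "aac")),
    ArchivedFile("oral-history/interview02.avi", "avi", ("mpeg4", "mp3")),
    ArchivedFile("music/recital.mp3", "mp3", ("mp3",)),
    ArchivedFile("theses/smith2010.pdf", "pdf"),
]

by_codec = files_by_codec(inventory)
print(by_codec["mp3"])
```

With an index of this kind, a future decision to retire a CODEC becomes a lookup of the affected files rather than a search through the whole archive.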
Institutions change repository systems over time in response to perceived needs that can be met by other software, or to take advantage of new directions in software development. A successful change depends upon having complete access to your data in a timely way. At least for this issue, discussions with the cloud vendor need to look into the future and question the availability of your data. How would the vendor of a cloud system return your data to you if you wished to move to a new system? By having archival preservation copies of your source electronic documents and your metadata files, you are in a much stronger position to undergo change. This concept became the most important one for us as we examined cloud options.
3. Focus on REDDnet: history, structure, and choosing this architecture
At Texas Tech University Libraries we have been using DSpace and CONTENTdm for a number of years. Our modeling for an archival system was challenged by our experience with these two products. Separation of documents and their metadata has been an issue for us as we began to look at alternate strategies of working with digital files. Long-term maintenance and flexibility with our data is hampered when the data are represented in system-specific formats which require the whole system in production to unload an ingested document. If the actual document is ingested and stored separately from its metadata, it becomes difficult to reunite the two, as might be needed in a system reload or migration. We see that as a growing limitation of our two current production systems for digital content management.
From our perspective, the best systems offer modular design where parts can be put together and switched out with ease. Tasks such as ingest, display, edit/maintenance, and backup are entirely separate and independent modules, allowing any building block to be replaced. In particular, we focused in our planning on archiving, which should not depend on another program's design, vendor method, or traditional institutional practice.
Factors for success in integrating into an open cloud system can range over a wide set of issues, but having experienced practitioners locally on campus will be a significant advantage. We were fortunate to be able to cooperate with our High Performance Computing Center (HPCC) in broad, general discussions about REDDnet before we started. We were introduced to REDDnet's principal architects over the phone and had a chance to explore topics with them. We had a technical contact addressing our questions and dealing with concerns whenever we had them. This face-to-face mode continued all the way through our setup of the REDDnet depots (servers) and gave us an opportunity to work directly with the hardware in our hands. Our environment supported discussion of "what if" questions and working through the answers, so that our designs could be weighed before we began. HPCC had worked with other cloud systems before REDDnet and therefore was able to give us a much broader view. Whether you are working with a single-institution academic, multi-institution academic, or commercial system, prior experience and support for design and concept can speed your progress.
The TTU Libraries' relationship with HPCC allowed us access to two 20 TB REDDnet depots (40 TB total). Eventually, as we move into construction mode, we plan to purchase one-half petabyte, consisting of five depots of 106 TB each. In the initial phase, however, space is not an immediate consideration for our investigations, since 40 TB will carry us far. When we first contacted REDDnet, we saw an unexpected usage in comparison to our library practice: most data was temporary in nature, to be discarded after analysis. For physicists, REDDnet's role in the past had been described as follows:
“[REDDnet’s]... mission ...[was] to provide “working storage” to help manage the logistics of moving and staging large amounts of data in the wide area network, e.g. among collaborating researchers who are either trying to move data from one collaborator (person or institution) to another or who want share large data sets for limited periods of time (ranging from a few hours to a few months) while they work on it. REDDnet is not designed or intended to be a replacement for reliable archival or long term personal storage. Although the REDDnet software stack does support reliable long-term archival storage to both disk and tape.” (Tackett, 2011)
In other words, REDDnet was not providing traditional data center backups for its users; it was their responsibility to archive their own data. By the time we were invited by HPCC and the Vanderbilt REDDnet team (J. Brewer personal communication, October, 2010), the thinking on REDDnet had changed. Other library users and other projects were also invited to consider using REDDnet for permanent digital storage. This reverses, in part, the original goal for REDDnet, which was to provide short-term parking (e.g., a few months) for large, temporary data sets originating from collider data. The network originally allowed a number of researchers, primarily working in physics, to participate jointly in analyzing large data sets. When the analysis was completed, the data was discarded to make way for a new dataset, a policy which has since been revised.
TTU goals for the archival system we are building are the following:
- Work to avoid dependence on a single storage method (e.g. tape backup only).[2]
- Achieve modularity so that we can swap out or upgrade components freely as other technologies become available; we do not wish to be tied to a particular component as we move forward.
- Allow access from anywhere for appropriate users.
- Have automatic reporting on the state of the system as well as reporting on demand.
- Protect born digital files and scanned document files as well.
- As file problems are detected, provide automated error correction and reporting.
To gain a foothold in working with REDDnet, we first ran two types of file transfer tests:
- Perform a large amount of data transfers with larger files (20 TB).
- Perform a high number of smaller file transfers (12,000 files) in multiple subdirectories of varying sizes with deep nesting levels.
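A test set of the second kind can be generated and verified with a short script. The following is a sketch under stated assumptions, not the procedure we actually used: the helper names `build_test_tree` and `manifest` are hypothetical, the file count is scaled down from 12,000, and the upload and download steps are left as a comment because they depend on the transfer client.

```python
import hashlib
import os
import tempfile


def build_test_tree(root, n_files, max_depth, base_size=256):
    """Create n_files small files of varying size in nested subdirectories."""
    paths = []
    for i in range(n_files):
        depth = i % max_depth + 1                  # nesting varies from 1 to max_depth
        subdir = os.path.join(root, *[f"level{d}" for d in range(depth)])
        os.makedirs(subdir, exist_ok=True)
        path = os.path.join(subdir, f"file{i:05d}.bin")
        with open(path, "wb") as fh:
            fh.write(os.urandom(base_size * (i % 8 + 1)))   # sizes vary per file
        paths.append(path)
    return paths


def manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digests[os.path.relpath(full, root)] = hashlib.sha256(fh.read()).hexdigest()
    return digests


with tempfile.TemporaryDirectory() as source:
    build_test_tree(source, n_files=120, max_depth=6)   # scaled down from 12,000
    before = manifest(source)
    # ... upload `source`, then download it again into another directory ...
    after = manifest(source)   # stand-in only: re-reads the same tree
    mismatched = [p for p in before if after.get(p) != before[p]]
    print(len(before), "files,", len(mismatched), "mismatches")
```

In a real test, `after` would be computed over the downloaded copy, so that any file lost or altered in transit shows up in `mismatched`.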
Our primary tool was L-STORE (Logistical Storage) (Lstore, 2010), which runs as a Java-based Linux client. For testing purposes a basic knowledge of Linux works well. We also installed the L-STORE Windows web client, which offers a quick way to access REDDnet and provides a graphical view of the network. Figure 2 below shows the Linux console program. Figure 3 follows and shows the Windows web interface.
L-STORE console image
Figure 2. L-STORE commands (Lstore, 2010).
REDDnet depots by nature are not visible on the Internet; particular depots are known to specific L-STORE servers. It is through L-STORE that depot data can be accessed, and this provides in its design a significantly high degree of security at the outset. We note that this is the only issue about REDDnet security we will address now in this paper, since it is a substantial topic on its own.
One feature of the design of REDDnet is multiserver striping of data across depots (i.e., a portion of the data from a single file sits on multiple servers; the file gets written faster because slices of it go to multiple servers at the same time). Just as this feature can provide great efficiency on a local system, it is able to improve throughput for large data sets on REDDnet. A 2009 presentation on “REDDNET for Emergency Response Data Distribution” presented some of the main features of multiserver striping (Moore, 2011a):
- A Depot can be one storage device or a collection located in one physical location.
- When you upload a file using the IBP protocol (Internet Backplane Protocol), it is first split into “slices” of a fixed size that you can specify.
- Slices can all be put on one depot or can be spread out across several/all (can be user specified or policy driven).
- Slices are stored with an expiration date set by you or the depot (depot sets max).
This system has allowed transfer rates of 3.3 GBytes/sec (Lstore, 2010), with higher rates projected for the future. For a more visual sense of REDDnet structure, see Figure 4, which shows the REDDNET Americas Map and European nodes.
Figure 4. REDDNET Americas Map with additional section showing CERN in Europe (Moore, 2011a).
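The slicing and placement described in the striping list above can be sketched in a few lines. This is an illustration of the general idea only: it does not use the IBP protocol or any L-STORE call, it ignores expiration dates, and the function names and depot names are hypothetical. Round-robin placement is one possible policy chosen for the example.

```python
def split_into_slices(data, slice_size):
    """Cut a byte string into consecutive slices of at most slice_size bytes."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [data[i:i + slice_size] for i in range(0, len(data), slice_size)]


def assign_round_robin(slices, depots):
    """Spread slices across depots in turn; returns (slice_index, depot) pairs."""
    return [(i, depots[i % len(depots)]) for i in range(len(slices))]


def reassemble(slices):
    """Rebuild the original bytes from the slices, taken in order."""
    return b"".join(slices)


data = bytes(range(256)) * 10                      # 2,560 bytes of sample data
slices = split_into_slices(data, slice_size=1000)
placement = assign_round_robin(slices, ["depot-a", "depot-b"])  # hypothetical names

print([len(s) for s in slices])                    # [1000, 1000, 560]
print(placement)
```

Because each slice can be sent to a different depot, the slices of one file can be written in parallel, which is where the throughput gain comes from.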
4. Weighing REDDnet for suitability in library digital archiving
As an industry, Information Technology (IT) is not always successful in disaster recovery (DR). In Symantec's October 2010 DR study, which was based on interviews with IT decision makers at 1,700 large enterprises, the failure rate on recovery tests was 30 percent (Fegreus, 2011). According to this study, reasons for the failure have to do with untried or untested data recovery and systems "too complex" for reliable data restoration. If nothing else, this information suggests that actual checks for successful data backup and file restoration must be undertaken at regular intervals. Does the cloud vendor guarantee doing that? Or on your inhouse system, do you take care of that? Development of disaster recovery methods in some systems takes a back seat to other demands.
With these goals in mind, we began investigating strategies to provide robust archival backup, which would include multiple copies, error checking and comparison, and reporting tools which could verify file status at any time. Since we would include many born digital objects whose persistence over time requires archival backup, our data needs are clearly different from those of the main users of REDDnet. Because "backup" is not always a clear and distinct term and may imply a variety of different strategies depending on the environment and purpose, it is important to address varying types of backup.
We note the following backup strategies, which can be used together or alone.
- RAID technology. Provides redundant backup through hardware. Multiple levels are defined. Recommended on all disk systems, but other strategies are needed along with RAID.
- Administrative Backup. Concerned with preserving data so that it can be restored. In some systems all files are backed up, in others only database files and changed files are backed up.
- Business Continuity. Concerned with preventing the disruption of data.
- Archival Backup. An Archival Backup is concerned with protecting source files and their metadata. Archival Backup needs to be able to reload a system that has been migrated to new hardware or replaced. In our REDDnet planning we are looking at both Business Continuity and Archival Backup.
Some aspects of Archival Backup are:
- Restoration of the full, existing working system in a new location on new hardware. This is in support of Business Continuity.
- A resource that preserves the original documents and metadata that were ingested, so that they could be easily moved over to another system or another copy of the system. In the case of scanned documents, some determination needs to be made about level of resolution, since scanners can exceed the capabilities of today's displays. But that level may be needed for the future.
- Data that may be ingested into a new system, such as a data warehouse, that supplements the production system by storing historical information over a long period of time.
Our approach can be perceived in effect as two systems, one providing file management system backup and a related system that stores and protects the original document files and their metadata files. By implementing five REDDnet depots on the network, we will achieve high redundancy of data. The impact increases as we expect these depots to reside in separate locations. The geographical dispersal of REDDnet allows that as a choice.
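One way redundancy can support error checking and comparison is to compare checksums of the copies of a file and treat the majority as correct. The sketch below assumes that each location holds a complete copy of the file, which is an assumption made for illustration and not a statement about how REDDnet places data; the function names and depot names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data):
    """SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies):
    """Compare copies of one file held at several locations.

    copies maps a location name to the bytes stored there. Returns the
    majority digest (or None when no digest is held by more than half of
    the copies) and the sorted list of locations that disagree with it.
    """
    if not copies:
        raise ValueError("no copies to audit")
    digests = {loc: digest(data) for loc, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        return None, sorted(digests)       # no majority: every copy is suspect
    return winner, sorted(loc for loc, d in digests.items() if d != winner)


def repair(copies, damaged, winner):
    """Replace damaged copies with the bytes of a copy matching the winner."""
    good = next(data for data in copies.values() if digest(data) == winner)
    return {loc: (good if loc in damaged else data) for loc, data in copies.items()}


copies = {f"depot-{n}": b"thesis text" for n in range(1, 6)}   # hypothetical depots
copies["depot-3"] = b"thesis tex7"                              # simulated corruption

winner, damaged = audit_copies(copies)
repaired = repair(copies, damaged, winner)
print("damaged:", damaged)
```

The list of damaged locations is also the raw material for the error reports named in our goals.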
A strong attraction to REDDnet comes about from its relatively long history in the cloud world. REDDnet was launched in 2006 (ACCRE, 2011a), and from that time forward it has been expanding and attracting new participants.
"The Research and Education Data Depot network (REDDnet) team at Vanderbilt has been selected as a 2010 Internet2 IDEA award winner. REDDnet was selected based on its innovative and important solution, including the Data Logistics Toolkit, for large distributed storage facilities for data intensive collaboration among the nation's researchers and educators in a wide variety of application areas. REDDnet is an NSF-funded infrastructure project that provides a large distributed storage facility for data-intensive collaboration among the nation's researchers and educators in a wide variety of application areas including Vanderbilt's involvement in the LSST telescope project." (Stassun, 2011).
For a more complete overview, see "A Strategy for Campus Bridging for Data Logistics" (Moore, 2011b) and "REDDnet: Enabling Data Intensive Science in the Wide Area" (REDDnet, 2009).
5. Using REDDnet: tools, methods, distant objectives
The REDDnet system is looking to deploy more than 1.2 PB of distributed storage and 200 Terabytes of tape (Lstore, 2007) this year, and an additional 1.4 PB of storage used in a closely related project. Organizations such as the CMS-HI already mentioned, CERN (European Laboratory for Particle Physics), Large Synoptic Survey Telescope (LSST), and Oak Ridge National Laboratory (ORNL) are playing a role in the use and development of REDDnet capabilities. Vanderbilt University is supplying significant project direction with NSF funding and funding from the Vanderbilt Center for the Americas. Principal collaborators are Vanderbilt University, University of Tennessee, Stephen F. Austin State University, Nevoa Networks, North Carolina State University, University of Delaware, Universidade de São Paulo, Universidade do Estado do Rio de Janeiro, University of Michigan, University of Florida, Fermilab, Caltech, and AMPATH (Pathway of the Americas) (ACCRE, 2011b). This is not a complete list of participants or collaborators.
ACCRE (Advanced Computing Center for Education & Research) at Vanderbilt University has developed L-STORE (Logistical Storage) (ACCRE, 2011c), which is used as a client for accessing REDDnet depots. From the L-STORE Wiki:
"L-Store provides a flexible logistical storage framework for distributed, scalable, and secure access to data for a wide spectrum of users. L-Store is planned to be used on the REDDnet infrastructure. It is designed to provide: virtually unlimited scalability in both raw storage and associated file system metadata; a decentralized management system; security; fault tolerant metadata support; user controlled replication and striping of data on a file and directory level; scalable performance in both raw data movement and metadata queries; a virtual file system interface in both a web and command line form; and support for the concept of geographical locations for data migration to facilitate quicker access (ACCRE, 2011c)."
L-STORE user commands are available from a Linux console or users can have a Windows web interface. A Java library of commands is available. Readers should note that this software is still in the development phase. Currently it provides functions such as:
- List contents of remote locations
- Upload files/directories
- Download remote files to local destination
- Delete remote files/directories
- Create directories
- Run data integrity checks
As noted earlier, a web interface is available that provides access similar to these commands.
Goals for our development include all of the following areas:
- Long-term archival preservation
- Statistics on system performance and usage
- Reports to cover all error conditions detected
- Development of complete documentation for our components
By providing our own archival backup services, we avoid a number of risk situations by tackling them ourselves. The issues we avoid are:
- What happens when the cloud vendor goes out of business or is taken over by another vendor?
- Upgrade scheduling and anything that might affect our services.
- Understanding who, down to the individual staff level, is providing our services when questions need resolution.
- Contracts and the possible need to leave a vendor with an early exit.
We instead inherit the following responsibilities as a result:
- Monitoring the overall preservation system that we build.
- Maintaining any software we write.
- Installing updates to our open source systems (Fedora Linux, PostgreSQL, L-STORE).
- Planning for growth of the data.
In our discussions with campus HPCC we spoke about a number of concerns that come about when working with a cloud vendor. These are:
- Consider your responsibilities when working with a vendor.
- Know what you have.
- Know how to get it out once it is in a cloud.
- Vendors may not fully disclose the details about their infrastructures and policies.
- Make sure you get the information needed to verify that your data is being handled properly.
- What you don't know can hurt your data.
HPCC recommends the Three E's:
- Have an Exit strategy.
- Exercise that exit strategy.
- Decide what you are going to do in the event of an Emergency before it happens (J. Brewer Personal communication with Dr. Alan Sill of the HPCC at Texas Tech University March 2011).
Many users operate under the assumption that cloud computing will be more cost effective in the long run. We will need to uncover solid evidence for our particular case, especially since new studies and research on this topic are appearing as it matures. At the recent Usenix HotCloud 2011 Workshop on Hot Topics in Cloud Computing (Jackson, 2011), Bryan Chul Tak et al. presented a paper (Chul Tak et al. 2011) which looked at costs for customers using Amazon EC2 and Microsoft Azure. One of the findings showed that
"For small workloads, the servers procured for in-house provisioning end up having significantly more capacity than needed (and they remain under-utilized) since they are the lowest granularity servers available in market today. On the other hand, cloud can offer instances matching the small workload needs (due to the statistical multiplexing and virtualization it employs). For medium workload intensity, cloud-based options are cost-effective only if the application needs to be supported for 2-3 years [emphasis added], and become expensive for longer lasting scenarios [emphasis added]. These workload intensities are able to utilize well provisioned servers making in-house procurement cost-effective." (Chul Tak et al. 2011)
Further, according to Bryan Chul Tak et al.
"Even if we assume the performance/$ offered by the cloud improves with time (say, an instance of given capacity becomes cheaper over time), cloud-based provisioning still remains expensive in the long run since data capacity and transfer costs contribute to the costs more significantly than in-house." [emphasis added] (Chul Tak et al. 2011)
Our interpretation of these points is that a cost analysis needs to be done to support the move to the cloud. One final point in this article is that "using the cloud need not preclude a continued use of in-house infrastructure. The most cost-effective approach for an organization might, in fact, involve a combination of cloud and in-house resources rather than choosing one over the other" (Chul Tak et al. 2011). Applying these ideas to our scenario, the "application" is the archival preservation system.
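A first step in such a cost analysis is to compare cumulative costs over the expected life of the system. The sketch below is a deliberately simple model and is not the method used by Chul Tak et al.: it assumes a one-time in-house purchase plus a flat yearly cost, and a flat yearly cloud cost, and it ignores data growth and transfer charges. The function names and all figures are placeholders, not measured or quoted prices.

```python
def cumulative_costs(years, inhouse_upfront, inhouse_yearly, cloud_yearly):
    """Return (year, in-house total, cloud total) for each year 1..years."""
    rows = []
    for year in range(1, years + 1):
        inhouse = inhouse_upfront + inhouse_yearly * year
        cloud = cloud_yearly * year
        rows.append((year, inhouse, cloud))
    return rows


def break_even_year(rows):
    """First year in which cumulative cloud cost exceeds in-house cost, or None."""
    for year, inhouse, cloud in rows:
        if cloud > inhouse:
            return year
    return None


# Placeholder figures in arbitrary currency units; not taken from any study.
rows = cumulative_costs(years=6, inhouse_upfront=90, inhouse_yearly=10, cloud_yearly=40)

for year, inhouse, cloud in rows:
    print(f"year {year}: in-house {inhouse}, cloud {cloud}")
print("cloud becomes more expensive in year", break_even_year(rows))
```

For an archival system the planning horizon is long, so the year at which the two curves cross matters more than the first-year cost.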
Standards for digital preservation and cloud standards need to emerge as well. Since so many instances of cloud involve a vendor, and a vendor's continued presence cannot be guaranteed, this area takes on a significance beyond the normal details of preservation work. Approximately 75% of last year's cloud vendors are out of business (J. Brewer Personal communication with Dr. Alan Sill of the HPCC at Texas Tech University March 2011). Libraries have always had concerns about major vendors, such as those of online catalog software. Without backup elsewhere to capture the work libraries have put into maintaining their MARC data, there is real cause for worry. Are the same conditions and details being taken into account when libraries approach cloud vendors? This may be one area where inhouse solutions show their strength beyond commercial/semi-commercial efforts.
In presenting this paper the goal is to review cloud factors that are present whether you are working with a cloud vendor, inhouse, or with an open cloud. In most cases some of the same work needs to be done. Responsibilities for archiving exist on both sides, and costs need to be decided based on the mix of tasks you are implementing. We at TTU firmly believe we will gain sufficient flexibility from our decisions to justify the direction of proceeding in part on our own with a mix of services from an open cloud.
6. Bibliography
REDDnet Related:
- REDDnet (2009) "REDDNET: Enabling Data Intensive Science in the Wide Area". Vanderbilt University, last updated April 23, 2009. http://www.reddnet.org/mwiki/index.php/Main_Page
- L-Store (2007) "L-Store (Logistical Storage)". ACCRE, Vanderbilt University, last updated November 30, 2010. http://www.lstore.org/pwiki/pmwiki.php?n=Main.HomePage
Cloud Computing Related:
- Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R., Konwinski, A., Lee, G., Patterson, D. Rabkin, A., Stoica, I., Zaharia, M. (2010) "A View of Cloud Computing". Communications of the ACM, Vol. 53. No. 4., April, 50-58 http://portal.acm.org/citation.cfm?id=1721672
- Ghosh, A., Arce, I. (2010) "Guest Editors' Introduction: In Cloud Computing We Trust - But Should We?" Security & Privacy, IEEE, Vol. 8, No. 6, Nov-Dec, 14-16 http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5655238&isnumber=5655229
This three-page introduction touches on themes expanded upon in the "Security & Privacy" special issue on Cloud Computing security, including risks and benefits.
- Kushida, K.E., Murray, J., Zysman, J. (2011) "Diffusing the Cloud: Cloud Computing and Implications for Public Policy". Journal of Industry, Competition and Trade, Vol. 11, No. 3, Sept., 209-237 http://www.scopus.com/inward/record.url?eid=2-s2.0-79960175942&partnerID=40&md5=e1f1c03f205f308a0a95487e89b3d13a
- Li, A., Yang, X., Kandula, S., Zhang, M. (2010) "CloudCmp: Shopping for a Cloud Made Easy". Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud'10). USENIX Association, Berkeley, CA, 5-5 portal.acm.org/citation.cfm?id=1863108
- Marsh, C. (2011) "Tape in the Cloud". Computer Technology Review, posted April 11, 2011. http://www.wwpi.com/index.php?option=com_content&view=article&id=12633:tape-in-the-cloud&catid=99:cover-story&Itemid=2701018
- Marsh, C. (2011b) "Data Integrity in the Cloud". Computer Technology Review, posted May 24, 2011. http://www.wwpi.com/index.php?option=com_content&view=article&id=12800:data-integrity-in-the-cloud&catid=99:cover-story&Itemid=2701018
- Marsh, C. (2011c) "The Practicality of Archive Data in the Cloud". Computer Technology Review, posted July 15, 2011. http://www.wwpi.com/index.php?option=com_content&view=article&id=12993:the-practicality-of-archive-data-in-the-cloud&catid=99:cover-story&Itemid=2701018
- Mell, P., Grance, T. (2011) "The NIST Definition of Cloud Computing (Draft): Recommendations of the National Institute of Standards and Technology". National Institute of Standards and Technology (NIST). Draft updated January, 2011 http://csrc.nist.gov/publications/drafts/800-145/Draft-SP-800-145_cloud-definition.pdf
- Walker, E., Brisken, W., Romney, J. (2010) "To Lease or Not To Lease from Storage Clouds". Computer, Vol. 43, No. 4, 44-50 http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=5445153&arnumber=5445166&tag=1
Digital Preservation Related:
- Jordan, C., McDonald, R.H., Minor, D., Kozbial, A. (2008) "Cyberinfrastructure Collaboration for Distributed Digital Preservation". Fourth IEEE International Conference on eScience, Dec. 7-12, 2008. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4736820
This short paper highlights four academic libraries' approaches to collaboration and partnership to address digital preservation issues.
- Maniatis, P., Roussopoulos, M., Giuli, T. J., Rosenthal, D. H., & Baker, M. (2005) "The LOCKSS Peer-to-Peer Digital Preservation System". ACM Transactions on Computer Systems, Vol. 23, No. 1, 2-50 www.hpl.hp.com/research/ssp/papers/p2-maniatis.pdf
The LOCKSS program is an example of a geographically distributed, peer-to-peer network. This paper provides detail about the project.