Volunteer data archival
Volunteer data archival means using disk space on volunteered home computers to store large data files. This document describes the design of a system to provide volunteer data archival on BOINC. We assume the goals include:
- Storing large (e.g. petabyte) files. Files may be thousands of times larger than the amount of space available on individual computers.
- Storing files for long periods.
- Reducing the probability of data loss to arbitrarily small levels.
Properties of the volunteer host population include:
- A host may be sporadically available because it is turned off, or because the user has suspended network activity. Unavailable periods may range from minutes to several days.
- The upload and download speeds of hosts vary widely, and can be fairly low (e.g. 1 Mbps) in some cases.
- The amount of disk space available to a project on a given host may fluctuate over time, because of the user's own disk usage or disk usage by other BOINC projects to which the host is attached.
- The population is dynamic: hosts are constantly arriving and leaving. The mean lifetime of a host may be fairly short (on the order of 100 days).
- Many hosts are behind firewalls. We assume that all communication is initiated by the BOINC client, and involves HTTP requests to trusted project servers. We don't consider direct client-to-client communication.
There are two basic techniques for achieving reliable storage using unreliable resources:
- Replication: a file
- Coding: with Reed-Solomon coding, a file is divided into N 'packets', and an additional K checksum packets are generated. The original data can be reconstructed from any N of these N+K packets.
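The "any N of N+K" property of coding can be illustrated with a small sketch. This is a toy example, not part of the BOINC design. It assumes each packet is a single byte-sized symbol and works over the prime field GF(257), whereas practical Reed-Solomon implementations usually work over GF(2^8) and code many symbols per packet. The function names `encode` and `decode` are hypothetical.

```python
from itertools import combinations

# Toy Reed-Solomon-style erasure code (illustrative assumption, not BOINC code).
P = 257  # prime modulus; every byte value 0..255 is a field element


def _interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) points, with arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, k):
    """Return N+K packets as (index, value) pairs.
    The first N packets carry the data unchanged; the K checksum packets
    are further evaluations of the same polynomial (values may reach 256)."""
    n = len(data)
    if n + k > P:
        raise ValueError("N + K must not exceed the field size")
    points = list(enumerate(data))
    checks = [(x, _interpolate_at(points, x)) for x in range(n, n + k)]
    return points + checks


def decode(packets, n):
    """Rebuild the N data values from any N surviving (index, value) packets."""
    if len(packets) < n:
        raise ValueError("need at least N packets")
    chosen = packets[:n]
    return [_interpolate_at(chosen, x) for x in range(n)]


data = [72, 101, 108, 108, 111]   # N = 5 data symbols
packets = encode(data, 3)          # K = 3 checksum symbols

# Every possible set of N surviving packets reconstructs the original data.
assert all(decode(list(subset), len(data)) == data
           for subset in combinations(packets, len(data)))
```

In this sketch the storage overhead is K/N (here 60%), and up to K packets can be lost without losing data.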