| 1 | = Volunteer data archival = |
| 2 | |
| 3 | '''Volunteer data archival''' means using disk space on volunteered home computers |
| 4 | to store large data files. |
| 5 | This document describes the design of a system to |
| 6 | provide volunteer data archival on BOINC. |
| 7 | We assume the goals include: |
| 8 | * Storing large (e.g. petabyte) files. |
| 9 | Files may be thousands of times larger than the |
| 10 | amount of space available on individual computers. |
| 11 | * Storing files for long periods. |
| 12 | * Reducing the probability of data loss |
| 13 | to arbitrarily small levels. |
| 14 | |
| 15 | Properties of the volunteer host population include: |
| 16 | |
| 17 | * A host may be sporadically available because |
| 18 | it is turned off, or because the user has suspended network activity. |
| 19 | Unavailable periods may range from minutes to several days. |
| 20 | * The upload and download speeds of hosts vary widely, |
| 21 | and can be fairly low (e.g. 1 Mbps) in some cases. |
| 22 | * The amount of disk space available to a project on a given host |
| 23 | may fluctuate over time, because of the user's own disk usage |
| 24 | or disk usage by other BOINC projects to which the host is attached. |
| 25 | * The population is dynamic: hosts are constantly arriving and leaving. |
| 26 | The mean lifetime of a host may be fairly short |
| 27 | (on the order of 100 days). |
| 28 | * Many hosts are behind firewalls. |
| 29 | We assume that all communication is initiated by the BOINC client, |
| 30 | and involves HTTP requests to trusted project servers. |
| 31 | We don't consider direct client-to-client communication. |
| 32 | |
| 33 | There are two basic techniques for achieving reliable storage using |
| 34 | unreliable resources: |
| 35 | |
| 36 | * '''Replication''': a file |
| 37 | |
| 38 | * '''Coding''': with Reed-Solomon coding, a file is divided into N 'packets', |
| 39 | and an additional K checksum packets are generated. |
| 40 | The original data can be reconstructed from any N of these N+K packets. |
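
The "any N of N+K" property of coding can be illustrated with a small sketch. This is not the BOINC implementation. It assumes a Reed-Solomon-style erasure code built from polynomial interpolation over the prime field GF(257), where production codes usually work in GF(2^8). The names `encode`, `decode` and `_interpolate` are hypothetical. Each 'packet' here is a single symbol; a real system would apply the same arithmetic at every byte position of much larger packets.

```python
# Illustrative sketch only: a systematic Reed-Solomon-style erasure code over
# the prime field GF(257). All names are hypothetical, not BOINC APIs.
P = 257  # prime modulus; every byte value 0..255 is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, k):
    """Return N+K packets as (index, value) pairs; the first N carry the data.

    Requires len(data) + k <= 257 so that all packet indices are distinct mod P.
    Checksum values lie in 0..256, so they may not fit in a single byte.
    """
    n = len(data)
    points = list(enumerate(data))
    checks = [(x, _interpolate(points, x)) for x in range(n, n + k)]
    return points + checks


def decode(packets, n):
    """Rebuild the N data values from any N surviving packets."""
    if len(packets) < n:
        raise ValueError("need at least N packets")
    subset = packets[:n]
    return [_interpolate(subset, x) for x in range(n)]


# Usage: N = 5 data symbols, K = 3 checksum symbols, three packets lost.
data = [ord(c) for c in "BOINC"]
packets = encode(data, k=3)
survivors = [packets[1], packets[3], packets[5], packets[6], packets[7]]
assert decode(survivors, n=5) == data
```

The N data values define a polynomial of degree less than N, and every packet is one point on that polynomial. Any N distinct points determine it uniquely, so the loss of up to K packets can be tolerated.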