= Volunteer data archival =

'''Volunteer data archival''' means using disk space on volunteered home computers
to store large data files.
This document describes the design of a system to
provide volunteer data archival on BOINC.
We assume the goals include:
 * Store large (e.g. petabyte) files.
   Files may be thousands of times larger than the
   amount of space available on individual computers.
 * Store files for long periods.
 * Reduce the probability of data loss
   to arbitrarily small levels.

Properties of the volunteer host population include:

 * A host may be sporadically available because
   it is turned off, or because the user has suspended network activity.
   Unavailable periods may range from minutes to several days.
 * The upload and download speeds of hosts vary widely,
   and can be fairly low (e.g. 1 Mbps) in some cases.
 * The amount of disk space available to a project on a given host
   may fluctuate over time, because of the user's own disk usage
   or disk usage by other BOINC projects to which the host is attached.
 * The population is dynamic: hosts are constantly arriving and leaving.
   The mean lifetime of a host may be fairly short
   (on the order of 100 days).
 * Many hosts are behind firewalls.
   We assume that all communication is initiated by the BOINC client,
   and involves HTTP requests to trusted project servers.
   We don't consider direct client-to-client communication.
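
To give a rough feel for the bandwidth figure above, the following Python sketch estimates how long one transfer takes on a slow host. The 1 GB chunk size and the helper name `transfer_seconds` are illustrative assumptions, not part of the design; only the 1 Mbps rate comes from the list above.

```python
def transfer_seconds(size_bytes: int, link_mbps: float) -> float:
    """Seconds needed to move size_bytes over a link of link_mbps megabits/s.

    Ignores protocol overhead and interruptions, so it is a lower bound.
    """
    return size_bytes * 8 / (link_mbps * 1_000_000)


# Hypothetical 1 GB chunk on a 1 Mbps link: 8000 seconds, a bit over 2 hours.
chunk_bytes = 10**9
hours = transfer_seconds(chunk_bytes, 1.0) / 3600
print(f"{hours:.1f} hours")
```

Because hosts may also be unavailable for days at a time, the real elapsed time for such a transfer can be much longer than this estimate.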

There are two basic techniques for achieving reliable storage using
unreliable resources:

 * '''Replication''': a file

 * '''Coding''': with Reed-Solomon coding, a file is divided into N 'packets',
   and an additional K checksum packets are generated.
   The original data can be reconstructed from any N of these N+K packets.
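
The "any N of N+K" property can be illustrated with a minimal Python sketch. It is a toy Reed-Solomon-style erasure code, not the scheme the system would necessarily use. It assumes arithmetic modulo the prime 257 rather than the GF(2^8) field that production codecs typically use, it requires N+K to be at most 257, and it needs Python 3.8 or later for the modular inverse via `pow`. The names `encode`, `decode` and `lagrange_eval` are hypothetical. Each symbol position across the N data packets defines a polynomial of degree less than N; checksum packets are further evaluations of that polynomial, so any N packets determine it.

```python
from itertools import combinations

P = 257  # smallest prime above 255, so every byte value is a field element


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) points, working modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data_packets, k):
    """Return the N data packets followed by K checksum packets.

    Packet i holds the polynomial values at x = i, one per symbol position.
    """
    n = len(data_packets)
    length = len(data_packets[0])
    checks = [
        [lagrange_eval([(i, data_packets[i][pos]) for i in range(n)], x)
         for pos in range(length)]
        for x in range(n, n + k)
    ]
    return [list(p) for p in data_packets] + checks


def decode(survivors, n):
    """Rebuild the N data packets from any N (index, packet) pairs."""
    chosen = survivors[:n]
    length = len(chosen[0][1])
    return [
        [lagrange_eval([(idx, pkt[pos]) for idx, pkt in chosen], i)
         for pos in range(length)]
        for i in range(n)
    ]


# Demo: N = 3 data packets, K = 2 checksum packets.
data = [list(b"volu"), list(b"ntee"), list(b"r da")]
packets = encode(data, k=2)
# Every choice of 3 surviving packets out of 5 recovers the original data.
for keep in combinations(range(5), 3):
    assert decode([(i, packets[i]) for i in keep], n=3) == data
```

Note that checksum symbols in this toy field can take the value 256, so they do not fit in a single byte; that is one reason real implementations work in GF(2^8) instead.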