
== Data-intensive volunteer computing ==

Currently, most BOINC projects work as follows (sketched below):
* Data are stored on the server.
* Pieces of data (input files) are sent to clients, and jobs are run against them.
When done, the files are deleted from the client.
* Output files are sent back to the server.

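As a rough illustration, here is a minimal, self-contained Python simulation of that flow. All of the names (`Job`, `Client`, `run_job`) are invented for the sketch and are not real BOINC APIs.

{{{
#!python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    input_files: list[str]


@dataclass
class Client:
    files: dict[str, bytes] = field(default_factory=dict)


def run_job(server_files: dict[str, bytes], client: Client, job: Job) -> None:
    # 1. The job's input files are downloaded from the server.
    for f in job.input_files:
        client.files[f] = server_files[f]
    # 2. The job runs against the local copies (real computation elided).
    output = b"result of " + job.name.encode()
    # 3. The output file is uploaded back to the server ...
    server_files[job.name + ".out"] = output
    # 4. ... and the inputs are deleted from the client, so a later job
    #    needing the same data must download it all over again.
    for f in job.input_files:
        del client.files[f]


server = {"dataset_part_1": b"x" * 10_000}
run_job(server, Client(), Job("job_1", ["dataset_part_1"]))
print(sorted(server))  # ['dataset_part_1', 'job_1.out']
}}}
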
This architecture doesn't scale well for data-intensive computing:
each input file is transferred to a client, used by one job, and then deleted,
so a large dataset is re-sent for every job that needs it.
There are various alternatives:

* Workflows: DAGs of tasks connected by intermediate temporary files.
Schedule the tasks so that temp files remain local to a client most of the time
(see the first sketch after this list).
* Stream computing: e.g., IBM !InfoSphere Streams.
* Models that involve computing against a large static dataset:
e.g. !MapReduce, or Amazon's scheme in which they host common
scientific datasets and you can use EC2 to compute against them
(see the second sketch after this list).
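
A sketch of the workflow idea, assuming tasks arrive already topologically sorted and using an invented `pick_client` heuristic (none of this is existing BOINC code): each task goes to the client that already holds the most of its input files, so intermediate files usually stay where they were produced.

{{{
#!python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    inputs: list[str]   # temp files this task consumes
    output: str         # temp file this task produces


@dataclass
class Client:
    name: str
    files: set[str] = field(default_factory=set)


def pick_client(task: Task, clients: list[Client]) -> Client:
    # Prefer the client holding the most of this task's inputs;
    # with no inputs (or a tie) this falls back to the first client.
    return max(clients, key=lambda c: len(c.files & set(task.inputs)))


def run_dag(tasks: list[Task], clients: list[Client]) -> None:
    # `tasks` is assumed to be in topological order.
    for task in tasks:
        client = pick_client(task, clients)
        client.files.add(task.output)   # the output stays local
        print(f"{task.name} -> {client.name}")


run_dag(
    [Task("t1", [], "tmp1"), Task("t2", ["tmp1"], "tmp2")],
    [Client("client_a"), Client("client_b")],
)
# t1 -> client_a (arbitrary), then t2 -> client_a, where tmp1 already sits.
}}}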
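
In the same spirit, a toy of the static-dataset model in !MapReduce style: dataset chunks stay resident on clients, map jobs compute against the locally held chunk, and only small partial results come back to be reduced on the server. Again, every name here is hypothetical.

{{{
#!python
from collections import Counter

# Dataset chunks pinned to clients ahead of time (think: sticky files).
chunks = {
    "client_a": "the quick brown fox",
    "client_b": "the lazy dog",
}


def map_chunk(text: str) -> Counter:
    # Runs on a client, against the chunk it already holds.
    return Counter(text.split())


def reduce_counts(partials: list[Counter]) -> Counter:
    # Runs on the server, combining small partial results.
    total = Counter()
    for p in partials:
        total += p
    return total


partials = [map_chunk(text) for text in chunks.values()]
print(reduce_counts(partials).most_common(2))  # [('the', 2), ('quick', 1)]
}}}
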
BOINC has some features that may be useful in these scenarios:
e.g., locality scheduling and sticky files.
It lacks some features that may be needed:
e.g., awareness of client proximity
or the ability to transfer files directly between clients.
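
For concreteness, here is one way the locality scheduling / sticky files idea could be reduced to a toy: the server remembers which sticky files each host retains and, when a host requests work, prefers jobs whose inputs are already there. This is a sketch of the concept, not the actual BOINC scheduler.

{{{
#!python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    input_file: str


def choose_job(host_files: set[str], queue: list[Job]) -> Job | None:
    # First, any job whose sticky input is already on the host ...
    for job in queue:
        if job.input_file in host_files:
            return job
    # ... otherwise any job at all, paying the full download cost.
    return queue[0] if queue else None


queue = [Job("j1", "file_x"), Job("j2", "file_y")]
print(choose_job({"file_y"}, queue))  # Job(name='j2', input_file='file_y')
}}}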