
Research projects involving BOINC

Possible research projects involving BOINC and volunteer computing, appropriate for senior-level class projects or Master's theses. If you're interested, please contact David Anderson.

Data-intensive volunteer computing

Currently, most BOINC projects work as follows:

  • Data are stored on the server
  • Pieces of data (input files) are sent to clients, and jobs are run against them; when a job finishes, its input files are deleted from the client.
  • Output files are sent back to the server.
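
A minimal sketch of this per-job lifecycle, purely for orientation (none of the names below are real BOINC APIs):

{{{
# Hypothetical sketch of the current per-job lifecycle on a client;
# none of these names are real BOINC calls.

def process_job(server, job, run_job):
    inputs = [server.download(f) for f in job.input_files]  # fetch inputs
    outputs = run_job(job, inputs)                          # compute locally
    for f in outputs:
        server.upload(f)                                    # return output files
    for f in inputs:
        f.delete()                                          # inputs are not kept
}}}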

This architecture doesn't scale well for data-intensive computing. There are various alternatives:

  • Workflows: DAGs of tasks connected by intermediate temporary files. Schedule them so that temp files remain local to the client most of the time (see the sketch after this list).
  • Stream computing: e.g., IBM InfoSphere Streams
  • Models that involve computing against a large static dataset: e.g. MapReduce, or Amazon's scheme in which they host common scientific datasets, and you can use EC2 to compute against them.
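
To make the workflow alternative concrete, here is a minimal scheduling sketch that prefers a host already holding all of a task's inputs, so intermediate files stay local instead of round-tripping through the server; the data structures are assumptions for illustration:

{{{
# Hypothetical workflow-scheduling sketch: prefer a host that already
# holds all of a task's inputs (its predecessors' temp files).

def pick_host(task, hosts, files_on_host):
    # files_on_host: dict mapping host -> set of file names present there
    local = [h for h in hosts
             if set(task.inputs) <= files_on_host.get(h, set())]
    return local[0] if local else hosts[0]  # fall back: transfer inputs
}}}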

BOINC has some features that may be useful in these scenarios: e.g., locality scheduling and sticky files. It lacks some features that may be needed: e.g., awareness of client proximity, or the ability to transfer files directly between clients.
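
As a rough illustration of locality scheduling, a server might prefer to send a host jobs whose sticky input files it already holds; the sketch below assumes hypothetical data structures:

{{{
# Hypothetical sketch of locality scheduling: when a host asks for work,
# prefer jobs whose (sticky) input file the host already has.

def choose_job(host, pending_jobs, files_on_host):
    for job in pending_jobs:
        if job.input_file in files_on_host.get(host, set()):
            return job                       # no new file transfer needed
    return pending_jobs[0] if pending_jobs else None  # else: send a file too
}}}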

Virtualizing volunteer computing

The volunteer computing host population is highly heterogeneous in terms of software environment (operating system type and version, system libraries, installed packages). Projects are faced with the difficult task of building application versions for all these different environments; this is a significant barrier to the use of volunteer computing.

This problem can be mitigated using virtual machine technology. In this approach, a hypervisor such as VMware or VirtualBox is installed (manually or automatically) on volunteer hosts. An application consists of a virtual machine image containing the application proper together with the required libraries and packages. A "wrapper" program provides an interface between the BOINC client and the hypervisor so that, for example, the application can be suspended and resumed in accordance with user preferences.
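
For instance, a wrapper might translate the client's suspend/resume requests into hypervisor commands. The sketch below uses VirtualBox's real VBoxManage controlvm subcommands, but the surrounding control logic and names are assumptions:

{{{
import subprocess

def pause_vm(vm_name):
    # 'VBoxManage controlvm <vm> pause' is a real VirtualBox command;
    # it freezes the guest in RAM.
    subprocess.run(["VBoxManage", "controlvm", vm_name, "pause"], check=True)

def resume_vm(vm_name):
    subprocess.run(["VBoxManage", "controlvm", vm_name, "resume"], check=True)

def handle_client_request(vm_name, request):
    # 'request' stands in for a suspend/resume signal from the BOINC
    # client; how the wrapper receives it is left abstract here.
    if request == "suspend":
        pause_vm(vm_name)
    elif request == "resume":
        resume_vm(vm_name)
}}}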

Some of this has already been done; see http://boinc.berkeley.edu/trac/wiki/VmApps.

A related goal is to minimize VM image sizes; people at CERN are working on this.

A higher level goal is to implement a "volunteer cloud" model, and to develop tools to facilitate its use by computational scientists. A collaboration with CERN and INRIA has been formed in this area.

Analyze and improve adaptive replication

Because volunteer hosts may be error-prone or malicious, volunteer computing requires result validation. The general way to do this is by replication: run each job on 2 computers and make sure the results agree.

To reduce the 50% overhead of two-fold replication, BOINC has a mechanism called "adaptive replication" that runs jobs with no replication on hosts with low error rates, while continuing to randomly intersperse replicated jobs.
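
A sketch of the decision rule, where the threshold and spot-check fraction are illustrative parameters rather than BOINC's actual values:

{{{
import random

ERROR_RATE_THRESHOLD = 0.01   # illustrative, not BOINC's actual value
SPOT_CHECK_FRACTION = 0.1     # fraction of jobs still replicated

def replicas_needed(host_error_rate):
    # Hosts with a low estimated error rate usually run jobs unreplicated,
    # but a random fraction is still replicated to keep the estimate honest.
    if (host_error_rate < ERROR_RATE_THRESHOLD
            and random.random() >= SPOT_CHECK_FRACTION):
        return 1   # accept the single result without validation
    return 2       # replicate and compare results
}}}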

The project is to identify possible counter-strategies against adaptive replication, to establish bounds on its overall effectiveness, and to identify refinements that increase its effectiveness.

Extend and refine the BOINC credit system

The idea of "credit" - a numerical measure of work done - is essential to volunteer computing, as it provides volunteers an incentive and a basis for competition. Currently, BOINC's credit mechanism is based on the number of floating-point operations performed.

The project is to design a new credit system in which a) credit can be given for resources other than computing (e.g., storage, network bandwidth), and b) the credit given per FLOP can depend on factors such as RAM size and job turnaround time. Ideally, the system would admit a game-theoretic proof that it leads to an optimal allocation of resources.
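
A sketch of what such a generalized credit function might look like; every weight and factor here is an illustrative assumption, not a proposed design:

{{{
def credit(flops, gb_stored_days, gb_transferred, ram_gb, turnaround_hours):
    # Hypothetical generalized credit: reward computing, storage, and
    # network, and scale the FLOP term by host properties.
    flop_term = 1e-12 * flops
    flop_term *= 1.0 + 0.01 * ram_gb                 # reward large-RAM hosts
    flop_term *= 10.0 / max(10.0, turnaround_hours)  # reward fast turnaround
    return flop_term + 0.1 * gb_stored_days + 0.01 * gb_transferred
}}}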

Latency-oriented volunteer computing

The early volunteer computing projects (SETI@home, Climateprediction.net) are "throughput oriented": they want to maximize the number of jobs completed per day, not minimize the turnaround time of individual jobs. BOINC's scheduling mechanisms reflect this; for example, they try to assign multiple jobs at a time so that client/server interactions are minimized.

More recent volunteer computing projects are "latency-oriented": they want to minimize the makespan of batches of jobs. The project is to redesign BOINC's scheduling mechanisms so that they can support latency-oriented computation, and to validate the new mechanisms via simulation (using an existing simulator).
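
As an illustration of latency-oriented dispatch, a greedy heuristic can assign each job of a batch to the host with the earliest estimated finish time; all names here are assumptions:

{{{
def assign_batch(jobs, hosts, est_runtime):
    # Greedy LPT-style heuristic: give each job (longest first, using one
    # host's runtime as a size proxy) to the host that would finish it
    # earliest, approximately minimizing the batch makespan.
    finish = {h: 0.0 for h in hosts}
    assignment = {}
    for job in sorted(jobs, key=lambda j: -est_runtime(j, hosts[0])):
        best = min(hosts, key=lambda h: finish[h] + est_runtime(job, h))
        assignment[job] = best
        finish[best] += est_runtime(job, best)
    return assignment, max(finish.values())   # (job -> host, makespan)
}}}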

Volunteer data archival

While BOINC is currently used for computation, it also provides primitives for distributed data storage: file transfers, queries, and deletion. The project is to use these primitives to build a distributed data archival system that uses replication to achieve target levels of reliability and availability.
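
For example, if each replica lives on a host that is independently available with probability a, then n replicas give availability 1 - (1 - a)^n, which can be solved for the replication level needed to hit a target:

{{{
import math

def replicas_for_availability(a, target):
    # Smallest n with 1 - (1 - a)**n >= target, assuming hosts fail
    # independently and each is available with probability a.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - a))

# Example: hosts available 80% of the time, 99.99% target availability:
# replicas_for_availability(0.8, 0.9999) == 6
}}}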

Invisible GPU computing

BOINC has recently added support for GPU computing, and several projects now offer applications for NVIDIA and ATI GPUs. One problem with this is that GPU usage is not prioritized, so when a science application is running, the performance of user-visible applications is noticeably degraded. As a result, BOINC's default behavior is that science applications are not run while the computer is in use (i.e., while there has been recent mouse or keyboard activity).
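
A sketch of that default policy, assuming an idle-time query; the threshold shown is illustrative:

{{{
IDLE_THRESHOLD_SECONDS = 180   # illustrative; the real value is configurable

def gpu_apps_allowed(seconds_since_last_input):
    # Sketch of the default policy: run GPU science applications only
    # when there has been no recent mouse or keyboard activity.
    return seconds_since_last_input >= IDLE_THRESHOLD_SECONDS
}}}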

The project (in collaboration with NVIDIA and possibly AMD/ATI) is to make changes to BOINC and to the GPU drivers so that the GPU can be used as much as possible, even while the computer is in use, without impacting the performance of user-visible applications.

Estimating job completion times in a heterogeneous environment

Accurate job completion time estimates are essential to BOINC: underestimates can waste computation, and overestimates can cause resource idleness. BOINC's current mechanisms for estimating job completion times have various shortcomings:

  • They require projects to estimate job FLOPS requirements in advance, and to estimate the FLOPS performance of applications on particular hosts; most projects lack the ability or willingness to provide accurate estimates.
  • They are based on peak hardware performance (e.g., benchmark values), while actual application performance can be wildly different, especially for multi-threaded and GPU applications.

The project is to design, implement, and study a system that automatically estimates job completion times on heterogeneous hosts, and that provides estimates of the actual number of FLOPS performed by a given job.
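
One possible direction (an assumption, not BOINC's current mechanism) is to replace benchmark-based estimates with observed per-(application, host) throughput, e.g., an exponentially weighted moving average of measured FLOPS:

{{{
ALPHA = 0.1   # smoothing factor; illustrative

class ThroughputEstimator:
    # Tracks observed FLOPS per (app, host) pair instead of relying on
    # benchmark values; all names and structure here are illustrative.

    def __init__(self):
        self.rate = {}   # (app, host) -> smoothed observed FLOPS/second

    def observe(self, app, host, job_flops, elapsed_seconds):
        r = job_flops / elapsed_seconds
        old = self.rate.get((app, host), r)
        self.rate[(app, host)] = (1 - ALPHA) * old + ALPHA * r

    def estimate_runtime(self, app, host, job_flops):
        r = self.rate.get((app, host))
        return job_flops / r if r else None   # None: no history yet
}}}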