BOINC support for science portals
A science portal is a web site serving a group of scientists. Science portals typically offer a computing service in which scientists can submit, via a web interface, sets of jobs to be run against standard applications. Current science portals run these jobs on clusters or Grids.
This document proposes two related additions to BOINC to facilitate using a BOINC project as a computing resource for science portals:
- A mechanism for allocating computing resources among scientists
- A mechanism for handling batches of jobs
Computing share
Demand for a portal's computing power may exceed supply, in which case we need a mechanism for dividing the supply among scientists. Assume that every job is associated with a particular BOINC user, representing either a scientist or some other organizational entity, and that each such user has a numeric computing share representing the fraction of the computing resource they should get.
The way in which computing shares are determined is outside our scope, but some possibilities are:
- Each scientist attaches their own PCs to the BOINC project, and their computing share is the recent average credit of these hosts.
- Volunteers attached to the project can assign shares of their resource to scientists, and a scientist's overall share is the sum of these contributions, weighted by each volunteer's recent average credit.
- A scientist's share is increased if they contribute to the portal, e.g. by participating in the message boards.
The set of users may be large (thousands) and dynamic. Typically only a few of them have jobs at any given point.
What are the precise semantics of computing share? Ideally the mechanism should accommodate two classes of scientists:
- Throughput-oriented: those who have an infinite stream of jobs, want maximum throughput, and don't care about the turnaround times of individual jobs.
- Latency-oriented: those who occasionally have large batches of jobs, and who want the entire batch to be finished fast. Such a user, for example, might not submit any jobs for a month, then submit a batch that takes a day to complete using the entire resource.
To accommodate both extremes, we propose a system in which each user has an associated debt, i.e. how much computation the system owes to that user.
A user's debt continually increases at a rate equal to their computing share. Debt is capped at a value corresponding to, say, use of the entire computing resource for 1 week.
When a job completes (successfully or not), its user's debt is decremented by the amount of computing done (more precisely, by the amount of computing that would have been done by the job's resources running at peak speed for the job's elapsed time).
In this scheme, latency-oriented users can build up a debt that allows their batches (if they are sufficiently infrequent) to get the full computing resource and finish as fast as possible.
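A minimal sketch of this accounting, assuming debt is measured in FLOPs and picking arbitrary values for the aggregate project speed and the one-week cap (the names below are illustrative, not BOINC code):

    # Sketch of per-user debt accounting (illustrative names, not BOINC code).
    # Debt is measured in FLOPs owed to the user.

    TOTAL_PROJECT_FLOPS_PER_SEC = 1e15   # assumed aggregate peak speed of all hosts
    DEBT_CAP = TOTAL_PROJECT_FLOPS_PER_SEC * 7 * 24 * 3600  # entire resource for 1 week

    class User:
        def __init__(self, share):
            self.share = share   # fraction of the resource this user should get
            self.debt = 0.0      # FLOPs the system currently owes this user

        def accrue(self, dt_seconds):
            """Debt grows at a rate equal to the user's computing share, up to the cap."""
            self.debt = min(self.debt + self.share * TOTAL_PROJECT_FLOPS_PER_SEC * dt_seconds,
                            DEBT_CAP)

        def job_completed(self, peak_flops_per_sec, elapsed_seconds):
            """On completion (success or failure), charge the peak FLOPs the job's
            resources could have done in its elapsed time."""
            self.debt -= peak_flops_per_sec * elapsed_seconds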
Batch scheduling
The second requirement of science portals is support for batches: groups of jobs that have value only when all of the jobs have been completed. The interval between a batch's submission and its completion is called its makespan.
Minimizing makespan
BOINC's scheduler should be modified to reduce makespan. The scheduler has control over two things:
- which job instances to send to which hosts (and when to create new instances)
- what deadline to assign to each job
A variety of techniques are possible:
- Use slower hosts for the first jobs of a batch, with the expectation that such a host can finish one job within the duration of the batch (by extension, very slow hosts may not get any jobs at all for some batches).
- Use faster, more reliable hosts for the last jobs of a batch.
- Use increased replication for the last jobs of a batch to reduce the expected minimum turnaround.
These techniques can be combined into policies. To study the relative performance of these policies, we can develop a simulator that takes as input a description of a real volunteer host population (e.g., SETI@home) and the parameters of a random batch arrival process, and computes the statistics of makespan, and perhaps the difference between predicted and actual makespan (see below).
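As a concrete (and purely illustrative) example, one such policy might look like the following sketch; the thresholds, host attributes, and the 80% cutoff are assumptions, not measured values:

    from dataclasses import dataclass

    # Illustrative policy combining the techniques above (assumed thresholds).
    MIN_EFFECTIVE_SPEED = 1e9    # FLOPS: minimum useful effective speed early in a batch
    FAST_SPEED = 1e10            # FLOPS: "fast host" threshold for the tail of a batch
    MAX_ERROR_RATE = 0.05        # maximum acceptable host error rate for tail jobs

    @dataclass
    class Host:
        flops_per_sec: float
        availability: float      # fraction of time the host is available
        error_rate: float        # fraction of jobs this host has failed

    def pick_policy(batch_fraction_done: float, host: Host):
        """Return (send_to_this_host, replica_count) for a job of the batch."""
        if batch_fraction_done < 0.8:
            # Early jobs: slower hosts are acceptable, as long as they can
            # plausibly finish one job within the batch's lifetime.
            send = host.flops_per_sec * host.availability >= MIN_EFFECTIVE_SPEED
            replicas = 1
        else:
            # Last jobs: prefer fast, reliable hosts and add replication to
            # reduce the expected turnaround of the stragglers.
            send = host.flops_per_sec >= FAST_SPEED and host.error_rate <= MAX_ERROR_RATE
            replicas = 2
        return send, replicas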
Estimating makespan
Given the description of a possible new batch, the system should be able to provide a reasonably accurate estimate of its makespan (perhaps with error bars). The computation of this estimate must take into account the following (see the sketch after this list):
- The computing resource (i.e. the distributions of speed and availability of the volunteer hosts, and their number).
- The existing workload (batches currently in progress, and their state)
- The parameters of the batch
- The computing share and debt of the requesting user
- The batch scheduling policy
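An estimator built on these inputs might have an interface like the following sketch; the naive formula in the body is only a placeholder (all names and the treatment of debt are assumptions), and a real estimator would simulate the scheduling policy against the host population and the existing workload:

    # Sketch of a makespan-estimator interface taking the inputs listed above.
    # All names and the naive formula inside are illustrative assumptions.

    def estimate_makespan(batch_flops,        # total FLOPs in the proposed batch
                          hosts,              # Host records: speed, availability
                          current_batches,    # batches in progress and their state
                          user_share,         # requesting user's computing share
                          user_debt,          # requesting user's current debt (FLOPs)
                          policy):            # the batch-scheduling policy in force
        """Return (estimated_makespan_seconds, error_bar_seconds)."""
        # Naive placeholder: assume the user gets their share of the aggregate
        # effective speed, plus whatever their accumulated debt entitles them to
        # up front.  A real estimator would also use current_batches and policy.
        total_speed = max(sum(h.flops_per_sec * h.availability for h in hosts), 1.0)
        user_speed = max(total_speed * user_share, 1.0)
        head_start = min(user_debt, batch_flops) / total_speed
        estimate = head_start + max(batch_flops - user_debt, 0.0) / user_speed
        error_bar = 0.5 * estimate   # crude; a simulator would give a real distribution
        return estimate, error_bar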
In addition, the system should provide users with a web-based interface for viewing batches in progress, showing their fraction done, an updated estimate of their completion time, information about errors, and a button for aborting the batch.
Job DAGs
Batches are a special case of Directed Acyclic Graphs (DAGs) of jobs. The nodes in such a graph represent jobs; there is an edge from job A to job B if A must be completed before B can be started.
A batch can be viewed as a graph consisting of a "start node" fanning out to N job nodes, then fanning back in to a "finish node".
Science portals may offer the ability to submit collections of work that can be described as DAGs but not as batches: for example, a collection of units, each of which consists of running program A and using its output as input to program B.
(Note: BOINC's wrapper mechanism, which allows multiple applications to be run in sequence as part of a BOINC job, supports some simple cases, but it is not general because it requires dependent jobs to run on a single host.)
As an extension of the work described above, we could support DAGs rather than just batches.
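For illustration, a job DAG and the batch special case could be represented along these lines (a sketch with assumed names, not an existing BOINC structure):

    # Illustrative representation of a job DAG; a batch is the special case of
    # a start node fanning out to N jobs and fanning back in to a finish node.

    class JobNode:
        def __init__(self, name):
            self.name = name
            self.children = []   # jobs that depend on this one

        def then(self, child):
            """Add an edge: this job must complete before `child` can start."""
            self.children.append(child)
            return child

    def make_batch(n):
        """Build the DAG start -> job_1 .. job_n -> finish."""
        start, finish = JobNode("start"), JobNode("finish")
        for i in range(n):
            job = JobNode(f"job_{i}")
            start.then(job)
            job.then(finish)
        return start, finish

    # A non-batch example: each unit runs program A, then feeds its output to B.
    a = JobNode("run_A")
    b = JobNode("run_B")
    a.then(b)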
Combining multiple scheduling types
BOINC potentially supports the following types of work, each with its own scheduling mechanism:
- Job-stream: the input is a stream of jobs; the goal is to maximize throughput and eventually finish all jobs.
- Batch: described above.
- Locality: a variant of job-stream in which jobs are assigned according to the files resident on the client.
- Co-scheduling: like batch, except all jobs in each batch must run simultaneously (e.g., MPI applications).
A single project (such as a portal) may have work of some or all types.
How should we handle multiple scheduling types - in particular, how can we maintain resource shares in their presence?
One simple approach is to maintain, for each scheduling type T, the maximum debt D(T) of any user with an unsent job of type T. Then use this top-level policy (a sketch follows the list below):

    for each scheduling type T, in order of descending D(T):
        send jobs of type T
How to maintain D(T):
- Job-stream: maintain in the feeder (consider only jobs in the cache)
- Locality: constraint: only 1 user can use locality scheduling
- Batch: maintain in batch module
- Co-scheduling: maintain in co-scheduling module
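A sketch of that top-level policy, assuming each per-type scheduling module can report its D(T) and send its jobs, and that the scheduler request object can report when it is full (the names max_unsent_debt, send_jobs, and is_full are illustrative, not BOINC scheduler interfaces):

    # Sketch of the top-level dispatch policy (illustrative names).

    def dispatch(modules, request):
        """modules: dict mapping scheduling type name -> scheduling module."""
        # Visit scheduling types in order of descending D(T).
        for _name, module in sorted(modules.items(),
                                    key=lambda kv: kv[1].max_unsent_debt(),
                                    reverse=True):
            if request.is_full():
                break
            module.send_jobs(request)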
Maintaining debts
A user's debt is adjusted each time a job is dispatched, based on the estimated FLOPS of the job. This quantity is recorded in the job record.
When a job finishes or times out (at which point the actual computing done is known), the user's debt is adjusted accordingly.
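As an illustration, a sketch of this two-step adjustment, assuming the estimated charge is stored in a field of the job record (field and function names are illustrative):

    # Sketch of dispatch-time charging and later reconciliation (illustrative names).

    def on_dispatch(user, job):
        # Charge the user's debt using the job's estimated FLOPs, and record
        # the charge in the job record so it can be reconciled later.
        job.charged_flops = job.estimated_flops
        user.debt -= job.charged_flops

    def on_completion_or_timeout(user, job, actual_flops):
        # Replace the estimated charge with the computing actually done.
        user.debt += job.charged_flops - actual_flops
        job.charged_flops = actual_flops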
An idea for prioritizing batches
Goals:
- Give short batches priority over long batches
- But don't let a long stream of short batches starve long ones
- Enforce quotas over long term
For each user U, we maintain a "logical start time" LST(U). LST(U) is always at least the current time.
When U submits a batch B, LST(U) is incremented by an amount
R / share(U)
where R is the expected runtime of B given the project's entire resource.
Each batch B has a "logical end time" LET(B). This is set when the batch is submitted as
LST(U) + R
Priority is given to the batch for which LET(B) is least.
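A sketch of this scheme; the text above does not say whether LET(B) uses LST(U) from before or after the increment, so the sketch assumes the pre-increment value, and all names are illustrative:

    import time

    # Sketch of logical-start-time batch prioritization (illustrative names).

    class User:
        def __init__(self, share):
            self.share = share   # computing share, a fraction of the whole resource
            self.lst = 0.0       # logical start time LST(U)

    def submit_batch(user, R, batches):
        """R is the expected runtime of the batch given the project's entire resource."""
        user.lst = max(user.lst, time.time())   # LST(U) is always at least the current time
        let = user.lst + R                      # LET(B) = LST(U) + R (pre-increment LST assumed)
        user.lst += R / user.share              # increment LST(U) by R / share(U)
        batches.append({"user": user, "let": let})
        return let

    def next_batch(batches):
        """Priority goes to the batch whose logical end time is least."""
        return min(batches, key=lambda b: b["let"])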