=== Minimizing makespan ===

BOINC's scheduler should be modified to reduce makespan.
The scheduler has control over two things:

* what hosts to send which job instances to (and when to create new instances)
* what deadline to assign to each job

A variety of techniques are possible:

* Use slower hosts for the first jobs of a batch,
with the expectation that each such host will finish one job
within the duration of the batch
(and, by extension, very slow hosts should possibly
not get any jobs for some batches).
* Use faster, more reliable hosts for the last jobs of a batch.
* Use increased replication for the last jobs of a batch
to reduce the expected minimum turnaround.

These techniques can be combined into policies.
To study the relative performance of these policies,
we can develop a simulator that takes as input a description
of a real volunteer host population (e.g., SETI@home)
and the parameters of a random batch arrival process,
and computes statistics of makespan,
and perhaps of the difference between predicted and actual makespan (see below).

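To make the first technique concrete, here is a minimal simulation sketch in Python. All names and the dispatch model are illustrative, not BOINC code: each host runs its assigned jobs sequentially, and we compare a speed-blind even split of the batch against a speed-proportional split.

```python
def makespan(assignment, speeds, job_flops):
    # assignment[i] = number of jobs sent to host i; each host runs
    # its jobs sequentially, so its finish time is count * flops / speed
    return max(n * job_flops / s for n, s in zip(assignment, speeds))

def equal_split(n_jobs, n_hosts):
    # ignore host speed: spread jobs as evenly as possible
    base, extra = divmod(n_jobs, n_hosts)
    return [base + (1 if i < extra else 0) for i in range(n_hosts)]

def speed_proportional(n_jobs, speeds):
    # give each host a job count proportional to its speed;
    # leftover jobs go to the fastest hosts
    total = sum(speeds)
    alloc = [int(n_jobs * s // total) for s in speeds]
    leftover = n_jobs - sum(alloc)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:leftover]:
        alloc[i] += 1
    return alloc
```

With hosts of speed 1, 2, and 4 and seven 4-FLOP jobs, the even split yields a makespan of 12 time units while the proportional split yields 4 — which is why very slow hosts should get few (or no) jobs of a batch.
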
=== Estimating makespan ===

Given the description of a possible new batch,
the system should be able to provide a reasonably accurate estimate
of its makespan (perhaps with error bars).
The computation of this estimate must take into account:

* the computing resource (i.e., the distributions of speed and availability
of the volunteer hosts, and their number)
* the existing workload (batches currently in progress, and their state)
* the parameters of the batch
* the computing share and debt of the requesting user
* the batch scheduling policy

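One way to produce an estimate with error bars is Monte Carlo sampling over the host population. The sketch below makes strong simplifying assumptions (each host is either up for the whole batch or absent, and work divides fluidly by speed); all names are hypothetical:

```python
import random
import statistics

def estimate_makespan(speeds, availabilities, n_jobs, job_flops, trials=1000):
    # speeds[i]: host i's speed in FLOPS; availabilities[i]: probability
    # that host i is up for the whole batch (a crude availability model)
    samples = []
    for _ in range(trials):
        up = [s for s, a in zip(speeds, availabilities) if random.random() < a]
        if not up:
            continue
        # fluid lower bound: total work over total available speed,
        # but no batch finishes faster than one job on the fastest host
        samples.append(max(n_jobs * job_flops / sum(up), job_flops / max(up)))
    return statistics.mean(samples), statistics.stdev(samples)
```

The returned standard deviation serves as the error bar; a richer model would also sample the existing workload and the batch scheduling policy listed above.
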
In addition, the system should provide users with a web-based interface
for viewing batches in progress,
showing their fraction done,
an updated estimate of their completion time,
information about errors,
and a button for aborting the batch.

=== Job DAGs ===

Batches are a special case of Directed Acyclic Graphs (DAGs) of jobs.
The nodes in such a graph represent jobs;
there is an edge from job A to job B if A must be completed
before B can be started.

A batch can be viewed as a graph consisting of
a "start node" fanning out to N job nodes,
then fanning back in to a "finish node".

Science portals may offer the ability to submit collections of work
that can be described as DAGs but not as batches:
for example, a collection of units, each of which consists of
running program A and using its output as input to program B.

(Note: BOINC's wrapper mechanism, which allows multiple applications
to be run in sequence as part of a BOINC job,
supports some simple cases, but it is not general because it
requires dependent jobs to run on a single host.)

As an extension of the work described above,
we could support DAGs rather than just batches.

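To illustrate the batch-as-DAG view, here is a sketch (a hypothetical representation, not BOINC's) of the fan-out/fan-in graph, plus a function that computes which jobs are dispatchable given the set of completed jobs:

```python
from collections import defaultdict

def batch_dag(n):
    # a batch of n jobs: "start" fans out to the job nodes,
    # which fan back in to "finish"
    edges = [("start", f"job{i}") for i in range(n)]
    edges += [(f"job{i}", "finish") for i in range(n)]
    return edges

def ready_jobs(edges, completed):
    # a job is ready when all of its predecessors have completed
    # and it has not itself completed yet
    preds = defaultdict(set)
    nodes = set()
    for a, b in edges:
        preds[b].add(a)
        nodes.update((a, b))
    return {n for n in nodes if n not in completed and preds[n] <= completed}
```

Dispatching repeatedly from ready_jobs until it is empty handles any DAG, not just batches; the run-A-then-B unit from the portal example is simply the two-node chain [("A", "B")].
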
== Combining multiple scheduling types ==

BOINC potentially supports the following types of work,
each with its own scheduling mechanism:

* Job-stream: the input is a stream of jobs;
the goal is to maximize throughput and eventually finish all jobs.
* Batch: described above.
* Locality: a variant of job-stream in which
jobs are assigned according to the files resident on the client.
* Co-scheduling: like batch, except all jobs in each batch
must run simultaneously (e.g., MPI applications).

A single project (such as a portal) may have work of some or all types.

How should we handle multiple scheduling types?
In particular, how can we maintain resource shares in their presence?

One simple approach is to maintain, for each scheduling type T,
the maximum debt D(T) of any user with an unsent job of type T.
Then use this top-level policy:
{{{
for each scheduling type T in order of descending D(T)
    send jobs of type T
}}}
How to maintain D(T):

* Job-stream: maintain in the feeder (consider only jobs in the cache)
* Locality: constraint: only one user can use locality scheduling
* Batch: maintain in the batch module
* Co-scheduling: maintain in the co-scheduling module

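The top-level policy above can be sketched directly (the data structures are hypothetical; send_jobs_of_type stands in for each type's own dispatch mechanism):

```python
def dispatch(max_debt, send_jobs_of_type):
    # max_debt[T] = D(T): the maximum debt of any user with an
    # unsent job of scheduling type T; visit types in descending D(T)
    for sched_type in sorted(max_debt, key=max_debt.get, reverse=True):
        send_jobs_of_type(sched_type)
```

For example, with D(batch) = 7, D(job-stream) = 3, and D(locality) = 1, batch jobs are sent first, so the most under-served user's work type is always favored.
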
== Maintaining debts ==

A user's debt is adjusted each time a job is dispatched,
based on the estimated FLOPS of the job.
This quantity is recorded in the job record.

When a job finishes or times out
(at which point the actual computing done is known),
the user's debt is adjusted accordingly.

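A minimal sketch of this accounting (field and function names are illustrative, and the sign convention — dispatching work consumes a user's debt — is an assumption): charge the estimate at dispatch, then correct the charge once the actual FLOPS are known.

```python
def on_dispatch(debts, user, job, est_flops):
    # charge the user's debt by the job's estimated FLOPS
    # and record the estimate in the job record
    debts[user] -= est_flops
    job["est_flops"] = est_flops

def on_finish(debts, user, job, actual_flops):
    # job finished or timed out: the actual computing done is known,
    # so refund the recorded estimate and charge the actual amount
    debts[user] += job["est_flops"] - actual_flops
```

Because the estimate is stored in the job record, the correction works even if the estimation formula changes between dispatch and completion.
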
== An idea for prioritizing batches ==

The scheme doesn't use the idea of
accumulated credit proposed above.
This is replaced by LST(U).