Changes between Version 24 and Version 25 of GpuWorkFetch

Timestamp:: Jan 26, 2009, 5:03:57 PM (16 years ago)
Author:: davea
Comment:: --

GpuWorkFetch

-                      v24
+                      v25
 = Work fetch and GPUs =
+== Problems with the current work fetch policy ==
+The current work-fetch policy is essentially:
+This document describes changes to BOINC's work fetch mechanism,
+in the 6.7 client and the scheduler as of [17024].
+== Problems with the old work fetch policy ==
+The old work-fetch policy is essentially:
  * Do a weighted round-robin simulation, computing the CPU shortfall (i.e., the idle CPU time we expect during the work-buffering period).
  * If there's a CPU shortfall, request work from the project with highest long-term debt (LTD).
 …
  * LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between GPU and CPU projects.
-This document proposes a modification to the work-fetch system that solves these problems.
 == Example ==
 …
 == Terminology ==
+New abstraction: '''processing resource type''' or PRT.
+CPU and each coprocessor type are PRTs.
+New abstraction: '''processing resource type''', or just "resource type".
+Examples:
+ * CPU
+ * A type of GPU
+ * the SPE processors in a Cell
 A job sent to a client is associated with an app version,
 …
 which is the max of the req_seconds.
 The semantics are: a scheduler should send jobs for a resource type
+The semantics: a scheduler should send jobs for a resource type
 only if the request for that type is nonzero.
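
The request semantics above can be illustrated with a small Python sketch. The dictionary shape and the resource names `"cpu"` and `"gpu"` are assumptions made for illustration; they are not the actual scheduler request format.

```python
def resources_to_serve(req_seconds):
    """Return the resource types a scheduler may send jobs for.

    req_seconds: dict mapping resource type -> requested seconds
    (hypothetical representation of the per-type request).
    """
    # Jobs for a resource type are sent only if its request is nonzero.
    return [r for r, secs in req_seconds.items() if secs > 0]


# Example: the client wants GPU work only.
request = {"cpu": 0, "gpu": 3600}
allowed = resources_to_serve(request)
```

Here `allowed` contains only `"gpu"`, so a scheduler following this rule would not send CPU jobs in reply.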
 == Client ==
+=== Per-resource-type backoff ===
+We need to handle the situation where e.g. there's a GPU shortfall
+but no projects are supplying GPU work
+(for either permanent or transient reasons).
+We don't want an overall work-fetch backoff from those projects.
+Instead, we maintain a separate backoff timer per (project, resource type).
+The backoff interval is doubled up to a limit whenever we ask for work of that type and don't get any work;
+it's cleared whenever we get a job of that type.
+There is still an overall backoff timer for each project.
+This is triggered by:
+ * requests from the project
+ * RPC failures
+ * job errors
+and so on.
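
A minimal Python sketch of the per-(project, resource type) backoff timer described above. The class name, the 60-second starting interval, and the one-day cap are illustrative assumptions; the text only says the interval is "doubled up to a limit".

```python
from dataclasses import dataclass

# Hypothetical constants, not taken from the BOINC client.
MIN_BACKOFF = 60.0      # seconds, assumed starting interval
MAX_BACKOFF = 86400.0   # seconds, assumed upper limit


@dataclass
class ResourceBackoff:
    """Backoff state for one (project, resource type) pair."""
    interval: float = 0.0  # current interval; 0 means not backed off
    until: float = 0.0     # time before which we don't ask for this type

    def request_failed(self, now: float) -> None:
        # We asked for work of this type and got none:
        # double the interval, up to the limit.
        if self.interval == 0.0:
            self.interval = MIN_BACKOFF
        else:
            self.interval = min(self.interval * 2, MAX_BACKOFF)
        self.until = now + self.interval

    def got_job(self) -> None:
        # We received a job of this type: clear the backoff.
        self.interval = 0.0
        self.until = 0.0

    def backed_off(self, now: float) -> bool:
        return now < self.until


# One timer per (project, resource type); the keys are placeholders.
backoffs = {
    ("project_a", "cpu"): ResourceBackoff(),
    ("project_a", "gpu"): ResourceBackoff(),
}
```

Keeping the state per pair means a project that never supplies GPU work is backed off for GPU requests only, while CPU requests to the same project proceed normally. The overall per-project timer would be a separate object and is not modeled here.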
+*** Question:  If we need to contact a project for tasks of two different types, and one of the backoffs is satisfied, do we ask for both types?
 === Long-term debt ===
 …
 but this would lose information.
 The current plan is:
+In the new model:
  * There is a separate LTD for each resource type
 …
 It's clear how it decreases; the question is, how is it increased?
 We need to avoid situations where LTD increases without bound.
+We propose the following:
+ * For each project P and resource R there is a boolean flag D(P, R) indicating whether P should accumulate debt for R.  The idea is that if D(P,R) is true, then it's likely that P would supply a job for R if we asked it.
+ * D(P, R) is initially false.
+ * If P supplies a job for R, D(P,R) is set to true.
+ * If we send P a request that doesn't return any jobs, then for each resource R for which req_seconds(R)>0, D(P,R) is set to false.
+     *** Proposed change.  This could be too sensitive to temporary outages. Why not have the project report whether it supports the resource type at all, separately from whether it currently has work for it? This would mean a three-state response: "Work is returned", "No work, but the resource is supported", and "The resource is not supported".
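
The D(P, R) rules above can be sketched in Python as follows. This is an illustration only: the function name and the dictionary shapes for the request and reply are assumptions, and the proposed three-state response is not modeled.

```python
from collections import defaultdict

# D[(project, resource)] -> bool. Missing entries default to False,
# matching "D(P, R) is initially false".
D = defaultdict(bool)


def handle_reply(project, req_seconds, jobs_by_resource):
    """Update D after one scheduler reply.

    req_seconds:      dict resource -> seconds requested (assumed shape)
    jobs_by_resource: dict resource -> number of jobs returned (assumed shape)
    """
    if sum(jobs_by_resource.values()) == 0:
        # The request returned no jobs at all: clear the flag for
        # every resource we actually asked for.
        for resource, secs in req_seconds.items():
            if secs > 0:
                D[(project, resource)] = False
        return
    # Otherwise, set the flag for each resource P supplied a job for.
    for resource, count in jobs_by_resource.items():
        if count > 0:
            D[(project, resource)] = True
```

Note that a reply containing CPU jobs but no GPU jobs leaves D(P, GPU) unchanged under these rules, since the flag is cleared only when the request returns no jobs at all.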
+=== Per-resource-type backoff ===
+We need to handle the situation where e.g. there's a GPU shortfall
+but no projects are supplying GPU work
+(for either permanent or transient reasons).
+We don't want an overall work-fetch backoff from those projects.
+Instead, we maintain a separate backoff timer per (project, PRSC).
+The backoff interval is doubled up to a limit whenever we ask for work of that type and don't get any work;
+it's cleared whenever we get a job of that type.
+*** Proposed clarification - the overall contact backoff would be the minimum of the backoff for each resource type.
+*** Question:  If the project asks for a communications backoff, and one of the resource-type backoffs would expire within the project-requested backoff, how do we handle that?
+*** Question:  If we need to contact a project for tasks of two different types, and one of the backoffs is satisfied, do we ask for both types?
+The design is as follows.
+A project P accumulates debt for a resource when:
+ * P is not backed off for that resource, and the backoff interval is not at the max.
+ * P is not suspended via GUI, and "no more tasks" is not set
+The rate at which P accumulates debt is its resource share relative
+to all the projects satisfying the above.
+When an application has used N instances of a resource for a time T,
+its debt decreases by an amount proportional to N*T.
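
A rough Python sketch of this accounting for a single resource. The text fixes only the proportions, so the scale factors here are assumptions: eligible projects share an accrual of `ninstances * dt` according to resource share, and usage is charged at exactly N*T. The function and parameter names are hypothetical.

```python
def update_debts(debts, shares, eligible, used, ninstances, dt):
    """Advance per-project debt for one resource over an interval dt.

    debts:      dict project -> current debt (instance-seconds)
    shares:     dict project -> resource share
    eligible:   set of projects allowed to accumulate debt (not backed
                off, not suspended, "no more tasks" not set)
    used:       dict project -> instance-seconds consumed during dt (N*T)
    ninstances: number of instances of this resource (assumed scale)
    """
    total_share = sum(shares[p] for p in eligible)
    for p in debts:
        if p in eligible and total_share > 0:
            # Accumulate at a rate given by the share relative to
            # all eligible projects.
            debts[p] += ninstances * dt * shares[p] / total_share
        # Usage lowers debt in proportion to N*T (constant assumed 1).
        debts[p] -= used.get(p, 0.0)


debts = {"a": 0.0, "b": 0.0, "c": 0.0}
shares = {"a": 100, "b": 300, "c": 100}
# Project c is not eligible (e.g. backed off), so it accrues nothing.
update_debts(debts, shares, {"a", "b"}, {"a": 10.0}, ninstances=1, dt=10.0)
```

After this step `debts` is `{"a": -7.5, "b": 7.5, "c": 0.0}`: project a accrued 2.5 and was charged 10, while project b accrued 7.5 and used nothing.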
 === Work-fetch state ===
 Each PRSC has its own set of data related to work fetch.
+Each resource has its own set of data related to work fetch.
 This is stored in an object of class PRSC_WORK_FETCH.
 …
 === debt accounting ===
 {{{
 for each resource type R
    for each project P