Changes between Version 6 and Version 7 of ClientSchedOctTen
Timestamp: Oct 28, 2010, 2:22:09 PM
…
and typically work is requested for all processor types.

=== Resource shares not enforced ===

These policies may fail to enforce resource shares.
Here are two scenarios that illustrate the underlying problems:

==== Example 1 ====

A host has a fast GPU and a slow CPU.
…
and project A gets 100% of the GPU.

==== Example 2 ====

Same host.
…
All information about the relative debt of B and C is lost.

=== Too many scheduler requests ===

The work fetch mechanism has two buffer sizes: min and max.
The current work fetch policy requests work whenever the buffer goes below max.
This can cause the client to issue frequent small work requests.

The original intention of min and max is that they are hysteresis limits:
when the buffer goes below min, the client tries to fetch enough work
to increase the buffer to max, and then issues no further requests
until the buffer falls below min again.
In this way, the interval between scheduler RPCs to a given project
is at least (max - min).

== Proposal: credit-driven scheduling ==

The idea is to make resource share apply to overall credit,
not to individual resource types.
If two projects have the same resource share, they should have the same RAC.
Scheduling decisions should give preference to projects
…
 * Jobs may fail to get credit, e.g. because they don't validate.

Hence we will use a surrogate called '''estimated credit''',
maintained by the client.
If projects grant credit fairly, and if all jobs validate,
then estimated credit is roughly equal to granted credit over the long term.

Note: there is a potential advantage to using granted credit:
it penalizes projects that grant inflated credit.
The more credit a project grants, the less work a given host
will do for it, assuming the host is attached to multiple projects.
…
with an averaging half-life of, say, a month.

The '''scheduling priority''' of a project P is then
{{{
SP(P) = share(P) - REC(P)
}}}
where REC is normalized so that it sums to 1.

In Example 1 above, SP(A) will always be negative
and SP(B) will always be positive.
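
To make the bookkeeping concrete, here is a minimal sketch of REC
maintenance and priority computation. It is illustrative C++, not the
actual client code; the PROJECT fields, the update function, and the
30-day half-life constant are assumptions for the example.

{{{
#include <cmath>

// Illustrative sketch only -- not the actual client code.
// REC decays exponentially with (here) a one-month half-life;
// credit estimated for completed work is added on top.

const double REC_HALF_LIFE = 30*86400;    // seconds ("say, a month")

struct PROJECT {
    double resource_share;    // user-set share (arbitrary units)
    double rec;               // recent estimated credit
    double rec_time;          // when rec was last updated
};

// Decay REC to time "now", then add credit estimated since the last update.
void update_rec(PROJECT& p, double now, double estimated_credit) {
    double dt = now - p.rec_time;
    p.rec *= exp(-dt*log(2.0)/REC_HALF_LIFE);
    p.rec += estimated_credit;
    p.rec_time = now;
}

// SP(P) = share(P) - REC(P), with shares and REC each normalized
// across projects so that they sum to 1.
double sched_priority(const PROJECT& p, double share_sum, double rec_sum) {
    double share = p.resource_share/share_sum;
    double rec = rec_sum ? p.rec/rec_sum : 0;
    return share - rec;
}
}}}

Because both terms are normalized, the priorities sum to zero across
projects: a project with positive SP(P) has received less credit than
its share calls for, and should be favored.
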
== Job scheduling ==

As the job scheduling policy picks jobs to run (e.g. on a multiprocessor),
it needs to take into account the jobs already scheduled,
so that it doesn't always schedule multiple jobs from the same project.
To accomplish this, as each job is scheduled we increment a copy of REC(P)
as if the job had run for one scheduling period (sketched below).

== Work fetch ==

The proposed work fetch policy:

 * Work fetch for a given processor type is initiated whenever the saturated period is less than min.
 * The client asks the fetchable project with the greatest SP(P) for "shortfall" seconds of work.
 * Whenever a scheduler RPC to project P is done anyway (e.g. to report results) and SP(P) is greatest among fetchable projects for a given processor type, the client also requests "shortfall" seconds of that type (see the second sketch below).

Notes:

 * This will tend to fetch large (max - min) clumps of work from a single project, and job variety will be lower than under the current policy.
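
The two policies can be sketched in code. First, the job-scheduling
pass described above: a hypothetical C++ sketch, not the client's
actual data structures; rec_delta and the greedy loop are invented
here to stand in for "increment a copy of REC(P) as if the job had
run for one scheduling period".

{{{
#include <cstdio>
#include <vector>
using std::vector;

// Hypothetical sketch of the job-scheduling pass (not the client's code).
// Each scheduled job charges a working copy of the project's REC as if
// the job had run for one scheduling period, lowering that project's
// priority for the next pick, so one project doesn't fill every CPU.

struct PROJECT {
    const char* name;
    double share;         // resource share, normalized to sum to 1
    double rec_temp;      // working copy of normalized REC for this pass
    double rec_delta;     // normalized REC one scheduling period is worth
    int runnable_jobs;    // count of runnable jobs (details elided)
};

double sched_priority(const PROJECT& p) { return p.share - p.rec_temp; }

// Pick a job for each of ncpus processors, highest priority first.
void schedule_cpus(vector<PROJECT>& projects, int ncpus) {
    for (int i = 0; i < ncpus; i++) {
        PROJECT* best = nullptr;
        for (auto& p : projects) {
            if (!p.runnable_jobs) continue;
            if (!best || sched_priority(p) > sched_priority(*best)) best = &p;
        }
        if (!best) break;                    // nothing left to run
        printf("CPU %d: job from %s\n", i, best->name);
        best->runnable_jobs--;
        best->rec_temp += best->rec_delta;   // charge one scheduling period
    }
}
}}}
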
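And a matching sketch of the proposed work fetch rule, with min and max
acting as hysteresis limits. Again hypothetical: the fetchable flag
stands in for the client's various reasons not to ask a project
(suspended, "no new tasks", backed off, and so on).

{{{
#include <vector>
using std::vector;

// Hypothetical sketch of work fetch for one processor type (not the
// client's code). Nothing is requested while the saturated period is
// at or above min; once it drops below min, the fetchable project with
// the greatest SP(P) is asked for enough work to refill the buffer to max.

struct PROJECT {
    double share;      // normalized resource share
    double rec;        // normalized REC
    bool fetchable;    // false if suspended, "no new tasks", backed off, ...
};

double sched_priority(const PROJECT& p) { return p.share - p.rec; }

// If a fetch is due, return the project to ask and set shortfall to the
// number of seconds of work to request; otherwise return nullptr.
PROJECT* choose_fetch(vector<PROJECT>& projects,
                      double saturated_period,   // seconds the device stays busy
                      double min_buf, double max_buf,
                      double& shortfall) {
    if (saturated_period >= min_buf) return nullptr;   // hysteresis: wait for min
    shortfall = max_buf - saturated_period;            // refill all the way to max
    PROJECT* best = nullptr;
    for (auto& p : projects) {
        if (!p.fetchable) continue;
        if (!best || sched_priority(p) > sched_priority(*best)) best = &p;
    }
    return best;
}
}}}

Because each successful fetch refills the buffer to max, the interval
between scheduler RPCs to a given project is at least (max - min),
which is what removes the frequent small requests described above.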