Changes between Version 6 and Version 7 of ClientSchedOctTen

Timestamp: Oct 28, 2010, 2:22:09 PM
Author: davea
and typically work is requested for all processor types.

=== Resource shares not enforced ===

These policies may fail to enforce resource shares.
Here are two scenarios that illustrate the underlying problems:

==== Example 1 ====

A host has a fast GPU and a slow CPU.

…

and project A gets 100% of the GPU.

==== Example 2 ====

Same host.

…

All information about the relative debt of B and C is lost.

=== Too many scheduler requests ===

The work fetch mechanism has two buffer sizes: min and max.
The current work fetch policy requests work whenever
the buffer goes below max.
This can cause the client to issue frequent small work requests.

The original intention of min and max is that they are hysteresis limits:
when the buffer goes below min,
the client tries to fetch enough work to increase the buffer to max,
and then issues no further requests until the buffer falls below min again.
In this way, the interval between scheduler RPCs to a given project
is at least (max-min).
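The hysteresis behavior described here can be sketched as follows; this is a minimal illustration with assumed names, not actual BOINC client code, and it assumes buffer quantities are measured in seconds of queued work:

```python
def work_to_request(buffered, min_buf, max_buf):
    """Return the seconds of work to request for one processor type.

    A request is issued only when the buffer falls below min;
    it then asks for enough to refill the buffer to max, so the
    next request cannot occur until at least (max - min) seconds
    of queued work have been consumed.
    """
    if buffered < min_buf:
        return max_buf - buffered
    return 0.0
```

For example, with min = 600 and max = 3600, a buffer of 300 seconds triggers a request for 3300 seconds of work, and no further request is made until the buffer falls below 600 again.
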

== Proposal: credit-driven scheduling ==

The idea is to make resource share apply to overall credit,
not to individual resource types.
If two projects have the same resource share, they should have the same RAC.
Scheduling decisions should give preference to projects

…

 * Jobs may fail to get credit, e.g. because they don't validate.

Hence we will use a surrogate called '''estimated credit''',
maintained by the client.
If projects grant credit fairly, and if all jobs validate,
then estimated credit is roughly equal to granted credit over the long term.

Note: there is a potential advantage to using granted credit:
doing so penalizes projects that grant inflated credit;
the more credit a project grants, the less work a given host
will do for it, assuming the host is attached to multiple projects.
     
…

with an averaging half-life of, say, a month.

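One plausible way to maintain such an average is exponential decay with the stated half-life. The sketch below is an assumption about the mechanism, not the actual client code, and the one-month constant is only the suggested value:

```python
import math

HALF_LIFE = 30 * 86400  # one month in seconds (assumed value)

def update_rec(rec, new_credit, dt, half_life=HALF_LIFE):
    """Decay recent estimated credit over dt seconds, then add new credit.

    With no new credit arriving, rec halves after each half-life,
    so long-idle projects gradually lose their accumulated weight.
    """
    return rec * math.pow(0.5, dt / half_life) + new_credit
```
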
The '''scheduling priority''' of a project P is then
{{{
SP(P) = share(P) - REC(P)
}}}
where REC is normalized so that it sums to 1.

In Example 1 above,
SP(A) will always be negative and
SP(B) will always be positive.

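A sketch of the priority computation, with both shares and REC normalized to sum to 1 (the dict-based structure is illustrative, not from the BOINC source):

```python
def scheduling_priorities(shares, recs):
    """Compute SP(P) = share(P) - REC(P) for each project P.

    shares and recs map project names to raw values; each is
    normalized so its values sum to 1 before subtracting.
    """
    share_total = sum(shares.values()) or 1.0
    rec_total = sum(recs.values()) or 1.0
    return {p: shares[p] / share_total - recs[p] / rec_total for p in shares}
```

With equal shares but project A holding 90% of recent estimated credit, `scheduling_priorities({"A": 1, "B": 1}, {"A": 9, "B": 1})` gives A a negative priority (-0.4) and B a positive one (0.4), matching the behavior described for Example 1.
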
== Job scheduling ==

As the job scheduling policy picks jobs to run (e.g. on a multiprocessor)
it needs to take into account the jobs already scheduled,
so that it doesn't always schedule multiple jobs from the same project.
To accomplish this, as each job is scheduled we increment
a copy of REC(P) as if the job had run for one scheduling period.

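The loop described above might look like the following sketch; the data structures and the `period_credit` argument (the credit a job is expected to earn in one scheduling period) are assumptions for illustration:

```python
def pick_jobs(runnable, shares, recs, slots, period_credit):
    """Pick up to `slots` jobs, preferring projects with the highest
    SP(P) = share(P) - REC(P).

    A copy of REC is incremented as each job is scheduled, so a
    project's priority drops and other projects get a turn.
    shares is assumed to be already normalized to sum to 1.
    """
    rec_copy = dict(recs)
    scheduled = []
    for _ in range(slots):
        candidates = [p for p in runnable if runnable[p]]
        if not candidates:
            break
        total = sum(rec_copy.values()) or 1.0
        best = max(candidates, key=lambda p: shares[p] - rec_copy[p] / total)
        scheduled.append(runnable[best].pop(0))
        rec_copy[best] += period_credit  # as if the job ran one period
    return scheduled
```

With two equal-share projects, scheduling a job from one immediately lowers that project's priority, so the next slot goes to the other project rather than to a second job from the same one.
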
== Work fetch ==

The proposed work fetch policy:

 * Work fetch for a given processor type
   is initiated whenever the saturated period is less than min.
 * Ask the fetchable project with greatest SP(P) for "shortfall" seconds of work.
 * Whenever a scheduler RPC to project P is done
   (e.g. to report results)
   and SP(P) is greatest among fetchable projects for a given processor type,
   request "shortfall" seconds of that type.

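The first two bullets could be sketched as follows; the names are illustrative, and `saturated` is assumed to be the number of seconds for which the processor type already has queued work:

```python
def choose_fetch(sp, fetchable, saturated, min_buf, shortfall):
    """Decide work fetch for one processor type.

    Fetch is initiated only when the saturated period is below min;
    the fetchable project with the greatest SP(P) is then asked for
    the full shortfall, in seconds of work.
    """
    if saturated >= min_buf or not fetchable:
        return None
    best = max(fetchable, key=lambda p: sp[p])
    return (best, shortfall)
```
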
Notes:

 * This will tend to get large (max-min) clumps of work for
   a single project,
   and variety will be lower than under the current policy.