Changes between Version 6 and Version 7 of ClientSched


Ignore:
Timestamp:
Nov 21, 2011, 12:21:42 AM (13 years ago)
Author:
davea
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • ClientSched

    v6 v7  
    11= Client scheduling policies =
    22
    3 NOTE: this document is outdated because of development in the 6.4 client
    4 to handle GPU and multithread apps.
    5 Changes include:
     3The client embodies two related scheduling policies:
    64
    7  1. app versions now include avg_ncpus, coprocessor usage, and
    8  a FLOPS estimate (this defaults to the CPU benchmark,
    9  but for GPU and multithread apps it will be different).
    10  This info is sent from the server.
    11  1. Estimating the duration of unstarted jobs:
    12  jobs are now associated with specific app versions.
    13  The estimated duration of an unstarted job is the WU's
    14  FLOP estimate divided by the app version's FLOPS,
    15  scaled by the duration correction factor.
    16  1. Duration correction factor: this is now based on elapsed time
    17  (i.e. wall time during which the job has been running) rather than CPU time.
    18  Yes, this is affected by non-BOINC CPU load; that's as it should be.
    19  1. "CPU efficiency" is no longer maintained; it's subsumed in DCF.
    20  1. Estimating the duration of running jobs:
    21  this is a weighted average of static and dynamic estimates.
    22  The dynamic estimate is now based on elapsed time rather than CPU time.
    23  So if a GPU job has been running for 5 min, is 25% done,
    24  and has used 1 min of CPU, its dynamic estimate is 20 min (not 4 min).
    25  1. Round-robin simulation: this was modified to reflect multi-thread
    26  and coproc apps (e.g., if the host has 1 GPU, only one coproc app
    27  can run at a time).
    28  If CPUs are idle because coprocs are in use,
    29  don't count it towards CPU shortfall.
    30  1. scheduler_cpus() and enforce_schedule() take coprocs and
    31  avg_ncpus into account.  They try to keep GPUs busy if possible.
     5 '''Task scheduling policy'''::
     6        Of the tasks that are runnable, which ones to execute?
     7 '''Work-fetch policy'''::
     8        When to ask a project for more work,
     9        which project to ask,
     10        and how much work of each processor type to ask for?
    3211
    33 ---------------
    34 
    35 This document describes three related parts of the BOINC core client:
    36 
    37  '''CPU scheduling policy'''::
    38         Of the results that are runnable, which ones to execute? BOINC will generally execute NCPUS results at once, where NCPUS is the minimum of the physical number of CPUs (counting hyperthreading) and the user's 'max_cpus' general preference.
    39  '''CPU scheduling enforcement'''::
    40         When to actually enforce the schedule (i.e. by preempting and starting tasks)? Sometimes it's preferable to delay the preemption of an application until it checkpoints.
    41  '''Work-fetch policy'''::
    42         When should the core client ask a project for more work, which project should it ask, and how much work should it ask for?
     12Note: a '''processor type''' is either CPU or a GPU vendor.
     13There may be multiple '''instances'' of each processor type.
    4314
    4415The goals of these policies are (in descending priority):
    4516
    46  1. Results should be completed and reported by their deadline (because results reported after their deadline may not have any value to the project and may not be granted credit).
    47  1. NCPUS processors should be kept busy.
    48  1. At any given point, a computer should have enough work so that NCPUS processors will be busy for at least min_queue days (min_queue is a user preference).
     17 1. Tasks should be completed and reported by their deadline
     18  (results reported after their deadline may not have any value to the project
     19  and may not be granted credit).
     20 1. All processors should be kept busy.
     21 1. At any given point, the computer should have enough work so that its processors
     22   will be busy for at least min_buf days and not much more than max_buf days.
    4923 1. Project resource shares should be honored over the long term.
    50  1. If a computer is attached to multiple projects, execution should rotate among projects on a frequent basis (as defined by the user's 'CPU scheduling period' preference).
    51  1. Execution should not switch between projects much more frequently than the scheduling period, Otherwise, if the 'remove processes from memory' preference is set, and some applications take a long time to resume from a checkpoint, lot of CPU time will be wasted.
     24 1. If a computer is attached to multiple projects,
     25   execution should rotate among projects on a frequent basis
     26   to improve the volunteer experience.
    5227
    53 In previous versions of BOINC, the core client attempted to maintain at least one result for each attached project, and would do weighted round-robin CPU scheduling among all projects. In some scenarios (any combination of slow computer, lots of projects, and tight deadlines) a computer could miss the deadlines of all its results. The new policies solve this problem as follows:
     28== Task scheduling simulation ==
    5429
    55  * Work fetch is limited to ensure that deadlines can be met. A computer attached to 10 projects might have work for only a few (perhaps only one) at a given time.
    56  * If deadlines are threatened, the CPU scheduling policy optimizes the likelihood of meeting deadlines, at the expense of variety.
     30The baseline task scheduling policy is '''weighted round-robin''' (WRR):
     31one-hour time slices are given to projects in turn,
     32with the number of time slices in proportion to the project's resource share.
    5733
    58 == Concepts and terms ==
     34Both task scheduling and work-fetch policies involve
     35first simulating the execution of existing tasks in FIFO order
     36under the WRR policy.
     37This simulation produces several outputs:
    5938
    60 === Wall CPU time ===
    61 '''Wall CPU time''' is the amount of wall-clock time a process has been runnable at the OS level. The actual CPU time may be less than this, e.g. if the process does a lot of paging, or if other (non-BOINC) processing jobs run at the same time.
    62 
    63 BOINC uses wall CPU time as the measure of CPU resource usage. Wall CPU time is more fair than actual CPU time in the case of paging apps. In addition, the measurement of actual CPU time depends on apps to report it correctly, and they may not do this.
    64 
    65 === Normalized CPU time === #NormalizedCPUTime
    66 The '''normalized CPU time''' of a result is an estimate of the wall time it will take to complete, taking into account
    67 
    68  * the fraction of time BOINC runs ('on-fraction')
    69  * the fraction of time computation is enabled ('active-fraction')
    70  * CPU efficiency (the ratio of actual CPU to wall CPU)
    71 
    72 but ''Not'' taking into account the Resource Share of Projects.
    73 
    74 === Project-normalized CPU time ===
    75 
    76 The '''project-normalized CPU time''' of a result is an estimate of the wall time it will take to complete, taking into account the above factors plus the project's resource share relative to other potentially runnable projects.
    77 
    78 The 'work_req' element of a scheduler RPC request is in units of project-normalized CPU time. In deciding how much work to send, the scheduler must take into account the project's resource share fraction, and the host's on-fraction and active-fraction.
    79 
    80 For example, suppose a host has 1 GFLOP/sec CPUs, the project's resource share fraction is 0.5, the host's on-fraction is 0.8 and the host's active-fraction is 0.9. Then the expected processing rate per CPU is
    81 
    82 {{{
    83 (1 GFLOP/sec)*0.5*0.8*0.9 = 0.36 GFLOP/sec
    84 }}}
    85 
    86 If the host requests 1000 project-normalized CPU seconds of work, the scheduler should send it at least 360 GFLOPs of work.
    87 
    88 === Result states ===
    89 
    90 R is '''runnable''' if
    91  * Neither R nor R.project is suspended, and
    92  * R's input files have been downloaded, and
    93  * R hasn't finished computing
    94 
    95 R is '''nearly runnable''' if
    96  * Neither R nor R.project is suspended, and
    97  * None of R's input files is in a 'download deferred' state.
    98  * R hasn't finished computing
    99 
    100 === Project states ===
    101 
    102 P is '''runnable''' if
    103  * P has at least one runnable result (this implies that P is not suspended).
    104 
    105 P is '''downloading''' if
    106  * P is not suspended, and
    107  * P has at least one result whose files are being downloaded and none of the downloads is deferred.
    108 
    109 P is '''fetchable''' (i.e. the work-fetch policy allows work to be fetched from it) if
    110  * P is not suspended, and
    111  * P is not deferred (i.e. its minimum RPC time is in the past), and
    112  * P's no-new-work flag is not set, and
    113  * P is not overworked (see definition below), and
    114  * a fetch of P's master file is not pending
    115 
    116 P is '''latency-limited''' if
    117  * The client's last scheduler RPC to P returned a 'no work because of deadlines' flag, and
    118  * the RPC reply's delay request has not yet elapsed.
    119 
    120 This means that P has work available, but didn't send any because the work's deadlines couldn't be met given the existing work queue. P is '''potentially runnable''' if
    121 
    122  * P is either runnable, downloading, fetchable, overworked, or latency-limited.
    123 
    124 This means that, to the best of the client's knowledge, it could do work for P if it wanted to.
    125 
    126 === Debt ===
    127 
    128 Intuitively, a project's 'debt' is how much work is owed to it, relative to other projects. BOINC uses two types of debt; each is defined for a set S of projects. In each case, the debt is recalculated periodically as follows:
    129  * A = the wall CPU time used by projects in S during this period
    130  * R = sum of resource shares of projects in S
    131  * For each project P in S:
    132   * F = P.resource_share / R (i.e., P's fractional resource share)
    133   * W = A*F (i.e., how much wall CPU time P should have gotten)
    134   * P.debt += W - P.wall_cpu_time (i.e. what P should have gotten            minus what it got).
    135  * P.debt is normalized so that the mean or minimum is zero.
    136 
    137 '''Short-term debt''' is used by the CPU scheduler. It is adjusted over the set of runnable projects. It is normalized so that minimum short-term debt is zero, and maximum short-term debt is no greater than 86,400 (i.e. one day).
    138 
    139 '''Long-term debt''' is used by the work-fetch policy. It is defined for all projects, and adjusted over the set of potentially runnable projects. It is normalized so that average long-term debt, over all project, is zero.
    140 
    141 == Round-robin simulation ==
    142 
    143 The CPU scheduling and work fetch policies use the results of a simulation of weighted round-robin scheduling applied to the set of nearly runnable results. The simulation takes into account on-fraction and active-fraction. It produces the following outputs:
    144 
    145  * deadline_missed(R): whether result R misses its deadline.
    146  * deadlines_missed(P): the number of results R of P for which deadline_missed(R).
    147  * total_shortfall: the additional normalized CPU time needed to keep all CPUs busy for the next min_queue seconds.
    148  * shortfall(P): the additional normalized CPU time needed for project P to keep it from running out of work in the next min_queue seconds.
    149 
    150 In the example below, projects A and B have resource shares 2 and 1 respectively. A has results A1 and A2, and B has result B1. The computer has two CPUs. From time 0 to 4 all three results run with equal weighting. At time 4 result A2 finishes. From time 4 to 8, project A gets only a 0.5 share because it has only one result. At time 8, result A1 finishes.
    151 
    152 In this case, shortfall(A) is 4, shortfall(B``) is 0, and total_shortfall is 2.
     39 * For each processor type, the '''shortfall''': that is, the number
     40   additional instance-seconds needed to keep all instances busy until
     41   max_buffer.  In the example below, the shortfall (the gray areas) is 13.
    15342
    15443[[Image(http://boinc.berkeley.edu/rr_sim.png)]]
    15544
    156 == CPU scheduling policy ==
     45 * For each processor type, the number of currently idle instances.
    15746
    158 The CPU scheduler uses an earliest-deadline-first (EDF) policy for results that are in danger of missing their deadline, and weighted round-robin among other projects if additional CPUs exist. This allows the client to meet deadlines that would otherwise be missed, while honoring resource shares over the long term. The scheduling policy is:
     47 * For each processor type, the '''saturated''': the amount of time
     48   that all instances are busy.
    15949
    160  1. Set the 'anticipated debt' of each project to its short-term debt
    161  1. Let P be the project with the earliest-deadline runnable result among projects with deadlines_missed(P)>0. Let R be P's earliest-deadline runnable result not scheduled yet. Tiebreaker: least index in result array.
    162  1. If such an R exists, schedule R, decrement P's anticipated debt, and decrement deadlines_missed(P).
    163  1. If there are more CPUs, and projects with deadlines_missed(P)>0, go to 1.
    164  1. If all CPUs are scheduled, stop.
    165  1. If there is a result R that is currently running, and has been running for less than the CPU scheduling period, schedule R and go to 5.
    166  1. Find the project P with the greatest anticipated debt, select one of P's runnable results (picking one that is already running, if possible, else the one received first from the project) and schedule that result.
    167  1. Decrement P's anticipated debt by the 'expected payoff' (the scheduling period divided by NCPUS).
    168  1. Go to 5.
     50 * For each task T, a flag T.deadline_miss indicating whether the task
     51   missed its deadline in the simulation.
    16952
    170 The CPU scheduler runs when a result is completed, when the end of the user-specified scheduling period is reached, when new results become runnable, or when the user performs a UI interaction (e.g. suspending or resuming a project or result).
     53After doing the WRR simulation, a second simulation is done in
     54which tasks that missed their deadline are executed in EDF order.
     55For each resource type, this yields a '''busy time''',
     56which is the period during which all instances are busy in this simulation.
     57This is passed to the scheduler; its significance is that no new jobs
     58can possibly be started before this time.
    17159
     60== Project scheduling priority ==
    17261
    173 == CPU schedule enforcement ==
     62Both scheduling policies involve a notion of '''project scheduling priority''',
     63a dynamic quantity that reflects how much processing has been done
     64recently by the project's tasks relative to its resource share.
    17465
    175 The CPU scheduler decides what results should run, but it doesn't enforce this decision. This enforcement is done by a separate '''scheduler enforcement function''', which is called by the CPU scheduler at its conclusion. Let X be the set of scheduled results that are not currently running, let Y be the set of running results that are not scheduled, and let T be the time the scheduler last ran. The enforcement policy is as follows:
    176 
    177  1. If deadline_missed(R) for some R in X, then preempt a result in Y, and run R (preempt the result with the least CPU wall time since checkpoint). Repeat as needed.
    178  1. If there is a result R in Y that checkpointed more recently than T, then preempt R and run a result in X.
    179 
    180 == Work-fetch policy ==
    181 A project P is '''overworked''' if
    182 
    183  * P.long_term_debt < -sched_period
    184 
    185 This condition occurs if P's results run in EDF mode (and in extreme cases, when a project with large negative LTD is detached). The work-fetch policy avoids getting work from overworked projects. This prevents a situation where a project with short deadlines gets more than its share of CPU time.
    186 
    187 The work-fetch policy uses the functions
     66The '''scheduling priority''' of a project P is computed as
    18867
    18968{{{
    190 frs(project P)
     69SP(P) = - REC(P)/resource_share(P)
    19170}}}
     71where REC(P) is the amount of processing done recently by P.
     72REC and resource_share are normalized to sum to 1 over all projects.
    19273
    193 P's fractional resource share among fetchable projects.
     74== Job scheduling ==
    19475
    195 The work-fetch policy function is called every few minutes (or as needed) by the scheduler RPC polling function. It sets the variable '''P.work_request_size''' for each project P, which is the number of seconds of work to request if we do a scheduler RPC to P. This is computed as follows:
     76The job scheduling policy is roughly as follows:
    19677
    197 {{{
    198 for each project P
    199     if P is suspended, deferred, overworked, or no-new-work
    200         P.work_request_size = 0
    201     else
    202         P.work_request_size = shortfall(P)
     78 * For each GPU type, schedule deadline-miss tasks in order of increasing deadline,
     79   until all instances are used.
     80 * Same for CPU.
     81 * For each GPU type, while there are still available instances,
     82   schedule non-deadline-miss tasks in order of project scheduling priority.
     83 * Same for CPU.
    20384
    204 if total_shortfall > 0
    205     if P.work_request_size==0 for all P
    206         for each project P
    207             if P is suspended, deferred, overworked, or no-new-work
    208                 continue
    209             P.work_request_size = 1
     85As each job is scheduled, we increment a copy of REC(P)
     86as if the job had run for one scheduling period.
     87This encourages the scheduler to pick jobs from the same project.
    21088
    211     if P.work_request_size==0 for all P
    212         for each project P
    213             if P is suspended, deferred, or no-new-work
    214                 continue
    215             P.work_request_size = 1
     89== Work fetch ==
    21690
    217     if P.work_request_size>0 for some P
    218         Normalize P.work_request_size so that they sum to total_shortfall
    219         and are proportional to P.resource_share
    220 }}}
     91The work fetch policy:
    22192
    222 For non-CPU-intensive projects, P.work_request_size is set to 1 if P has no nearly-runnable result, otherwise 0.
     93 * Work fetch for a given processor type is initiated
     94  whenever the saturated period is less than min_buffer.
     95 * Adjust SP(P) based on the amount of work currently queued
     96 * Ask the fetchable project with greatest SP(P) for "shortfall" seconds of work.
     97 * Whenever a scheduler RPC to project P is done
     98   (e.g. to report results) and SP(P) is greatest among fetchable projects
     99   for a given processor type, request "shortfall" seconds of that type.
    223100
    224 The scheduler RPC mechanism may select a project to contact because of a user request, an outstanding trickle-up message, or a result that is overdue for reporting. If it does so, it will also request work from that project. Otherwise, the RPC mechanism chooses the project P for which
    225 
    226 {{{
    227 P.work_request_size>0 and
    228 P.long_term_debt + shortfall(P) is greatest
    229 }}}
    230 
    231 and requests work from that project. Note: P.work_request_size is in units of normalized CPU time, so the actual work request (which is in units of project-normalized CPU time) is P.work_request_size divided by P's resource share fraction relative to potentially runnable projects.
    232 
    233 ----
    234 
    235 == Scheduler work-send policy ==
    236 
    237 NOTE: the following has not been implemented, and is independent of the above policies.
    238 
    239 The scheduler should avoid sending results whose deadlines are likely to be missed, or which are likely to cause existing results to miss their deadlines. This will be accomplished as follows:
    240  * Scheduler requests includes connection period, list of queued result (with estimated time remaining and deadline) and project resource fractions.
    241  * The scheduler won't send results whose deadlines are less than now + min_queue.
    242  * The scheduler does an EDF simulation of the initial workload to determine by how much each result misses its deadline. For each result R being considered for sending, the scheduler does an EDF simulation. If R meets its deadline and no result misses its deadline by more than it did previously, R is sent.
    243  * If the scheduler has work but doesn't send any because of deadline misses, it returns a 'no work because of deadlines' flag. If the last RPC to a project returned this flag, it is marked as latency-limited and accumulates LTD.
    244 
    245 ----
    246 
    247 == Describing scenarios ==
    248 
    249 We encourage the use of the following notation for describing scheduling scenarios (times are given in hours):
    250 
    251 P(C, D, R)
    252 
    253 This describes a project with
    254 
    255   * C = CPU time per task
    256   * D = delay bound
    257   * R = fractional resource share
    258 
    259 A scenario is described by a list of project, plus the following optional parameters:
    260   * NCPUS: number of CPUS (default 1)
    261   * min_queue
    262   * leave_in_memory
    263   * cpu_scheduling_period
    264 
    265 An example scenario description is:
    266 {{{
    267 P1(1000, 2000, .5)
    268 P2(1, 10, .5)
    269 NCPUS=4
    270 }}}
    271 
    272 == Scenarios ==
    273 
    274 === Scenario 1 ===
    275 
    276 {{{
    277 P1(0.1, 1, .5)
    278 P2(1, 24, .25)
    279 P3(1, 24, .25)
    280 NCPUS = 2
    281 leave_in_memory = false
    282 cpu_scheduling_period = 1
    283 }}}
    284 
    285 Typically one CPU will process 6-minute tasks for P1, and the other CPU will alternate between P2 and P3. It's critical that the scheduler run each task of P2 and P3 for the full CPU scheduling period. If we went strictly by debt, we'd end up switching between them every 6 minutes, and both P2 and P3 would have to resume from a checkpoint each time. For some apps (e.g. Einstein@home) resuming from a checking takes several minutes. So we'd end up wasting most of the time on one CPU.