Changes between Version 24 and Version 25 of GpuWorkFetch


Timestamp:
Jan 26, 2009, 5:03:57 PM
Author:
davea
Comment:

--

  • GpuWorkFetch

    v24 v25  
    11= Work fetch and GPUs =
    22
    3 == Problems with the current work fetch policy ==
    4 
    5 The current work-fetch policy is essentially:
     3This document describes changes to BOINC's work fetch mechanism,
     4in the 6.7 client and the scheduler as of [17024].
     5
     6== Problems with the old work fetch policy ==
     7
     8The old work-fetch policy is essentially:
    69 * Do a weighted round-robin simulation, computing the CPU shortfall (i.e., the idle CPU time we expect during the work-buffering period).
    710 * If there's a CPU shortfall, request work from the project with highest long-term debt (LTD).
     
    2124 * LTD is computed solely on the basis of CPU time used, so it doesn't provide a meaningful comparison between projects that use only GPUs, or between GPU and CPU projects.
    2225
    23 This document proposes a modification to the work-fetch system that solves these problems.
    24 
    2526== Example ==
    2627
     
    3839== Terminology ==
    3940
    40 New abstraction: '''processing resource type''' or PRT.
    41 CPU and each coprocessor type are PRTs.
     41New abstraction: '''processing resource type''' or just "resource type".
     42Examples:
     43 * CPU
     44 * A type of GPU
     45 * the SPE processors in a Cell
    4246
    4347A job sent to a client is associated with an app version,
     
    6569which is the max of the req_seconds.
    6670
    67 The semantics are: a scheduler should send jobs for a resource type
     71The semantics: a scheduler should send jobs for a resource type
    6872only if the request for that type is nonzero.
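
The rule above can be illustrated with a minimal sketch. The function name `jobs_to_send`, the dictionary layout, and the job tuples are hypothetical and are not taken from the BOINC scheduler; only the `req_seconds` idea comes from the text.

```python
def jobs_to_send(req_seconds, candidate_jobs):
    """Keep only jobs whose resource type was actually requested.

    req_seconds:    dict mapping resource type -> requested seconds
    candidate_jobs: list of (job_name, resource_type) pairs
    Both structures are illustrative assumptions.
    """
    return [
        job
        for job, rtype in candidate_jobs
        if req_seconds.get(rtype, 0) > 0   # nonzero request for this type
    ]


# Example request: no CPU work wanted, one hour of GPU work wanted.
request = {"cpu": 0.0, "gpu": 3600.0}
candidates = [("j1", "cpu"), ("j2", "gpu"), ("j3", "gpu")]
selected = jobs_to_send(request, candidates)
```

In this example only the two GPU jobs are selected, because the CPU request is zero.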
    6973
    7074== Client ==
     75
     76=== Per-resource-type backoff ===
     77
     78We need to handle the situation where e.g. there's a GPU shortfall
     79but no projects are supplying GPU work
     80(for either permanent or transient reasons).
     81We don't want an overall work-fetch backoff from those projects.
     82
     83Instead, we maintain a separate backoff timer per (project, resource type).
     84The backoff interval is doubled up to a limit whenever we ask for work of that type and don't get any work;
     85it's cleared whenever we get a job of that type.
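
A minimal sketch of such a per-(project, resource type) timer follows. The class name `ResourceBackoff`, the starting interval, and the cap are assumptions made for illustration; the text only says the interval doubles "up to a limit" and is cleared when a job of that type arrives.

```python
from dataclasses import dataclass

# Assumed values; the design text does not specify them.
MIN_BACKOFF = 60.0       # seconds, hypothetical first interval
MAX_BACKOFF = 86400.0    # seconds, hypothetical cap


@dataclass
class ResourceBackoff:
    """Backoff state for one (project, resource type) pair."""
    interval: float = 0.0   # current interval; 0 means not backed off
    until: float = 0.0      # time before which we don't ask for this type

    def request_failed(self, now: float) -> None:
        # We asked for work of this type and got none: double, up to the cap.
        if self.interval == 0:
            self.interval = MIN_BACKOFF
        else:
            self.interval = min(self.interval * 2, MAX_BACKOFF)
        self.until = now + self.interval

    def got_job(self) -> None:
        # Receiving a job of this type clears the backoff.
        self.interval = 0.0
        self.until = 0.0

    def backed_off(self, now: float) -> bool:
        return now < self.until


# One timer per (project, resource type); the keys are illustrative.
backoffs = {
    ("projA", "cpu"): ResourceBackoff(),
    ("projA", "gpu"): ResourceBackoff(),
}
```

Keeping the timers in a map keyed by (project, resource type) means a GPU shortfall at one project does not delay CPU requests to that same project.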
     86
     87There is still an overall backoff timer for each project.
     88This is triggered by:
     89 * requests from the project
     90 * RPC failures
     91 * job errors
     92and so on.
     93
     94*** Question:  If we need to contact a project for tasks of two different types, and one of the backoffs is satisfied, do we ask for both types?
    7195
    7296=== Long-term debt ===
     
    92116but this would lose information.
    93117
    94 The current plan is:
     118In the new model:
    95119
    96120 * There is a separate LTD for each resource type
     
    100124It's clear how it decreases; the question is, how is it increased?
    101125We need to avoid situations where LTD increases without bound.
    102 We propose the following:
    103 
    104  * For each project P and resource R there is a boolean flag D(P, R) indicating whether P should accumulate debt for R.  The idea is that if D(P,R) is true, then it's likely that P would supply a job for R if we asked it.
    105  * D(P, R) is initially false.
    106  * If P supplies a job for R, D(P,R) is set to true.
    107  * If we send P a request that doesn't return any jobs, then for each resource R for which req_seconds(R)>0, D(P,R) is set to false.
    108      *** Proposed change.  This could be too sensitive to temporary outages - why not have the project respond with information about whether the resource type is currently supported in some other form than the project currently has work.  This would mean a 3 state response - "Work is returned", "No work, but the resource is supported", and "The resource is not supported".
    109 
    110 === Per-resource-type backoff ===
    111 
    112 We need to handle the situation where e.g. there's a GPU shortfall
    113 but no projects are supplying GPU work
    114 (for either permanent or transient reasons).
    115 We don't want an overall work-fetch backoff from those projects.
    116 
    117 Instead, we maintain a separate backoff timer per (project, PRSC).
    118 The backoff interval is doubled up to a limit whenever we ask for work of that type and don't get any work;
    119 it's cleared whenever we get a job of that type.
    120 
    121 *** Proposed clarification - the overall contact backoff would be the minimum of the backoff for each resource type. 
    122 *** Question:  If the project asks for a communications backoff, and one of the resource type backoffs would expire within the project requested backoff, how do we handle that?
    123 *** Question:  If we need to contact a project for tasks of two different types, and one of the backoffs is satisfied, do we ask for both types?
     126
     127The design is as follows.
     128A project P accumulates debt for a resource when:
     129 * P is not backed off for that resource, and the backoff interval is not at the max.
     130 * P is not suspended via GUI, and "no more tasks" is not set
     131
     132The rate at which P accumulates debt is its resource share relative
     133to all the projects satisfying the above.
     134
     135When an application has used N instances of a resource for a time T,
     136its debt decreases by an amount proportional to N*T.
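
The accumulation rules above could be sketched as follows for a single resource type. All names (`ProjectState`, `update_debts`, the field names) are hypothetical. The proportionality constants are not given in the text, so this sketch assumes one possible choice: eligible projects are credited with their share of the instance-seconds actually used, and each project is debited the N*T it consumed.

```python
from dataclasses import dataclass


@dataclass
class ProjectState:
    """Per-project state for one resource type (field names are assumptions)."""
    resource_share: float
    backed_off: bool = False          # backed off for this resource
    backoff_at_max: bool = False      # backoff interval has reached its limit
    suspended_via_gui: bool = False
    no_more_tasks: bool = False
    debt: float = 0.0

    def accumulates_debt(self) -> bool:
        # Both listed conditions must hold for debt to accumulate.
        return not (
            self.backed_off
            or self.backoff_at_max
            or self.suspended_via_gui
            or self.no_more_tasks
        )


def update_debts(projects, used):
    """Update debts for one resource type.

    projects: dict name -> ProjectState
    used:     dict name -> N*T (instance-seconds consumed in this period)
    """
    total_used = sum(used.values())
    eligible = [p for p in projects.values() if p.accumulates_debt()]
    total_share = sum(p.resource_share for p in eligible)

    # Increase: each eligible project's share relative to all eligible projects.
    if total_share > 0:
        for p in eligible:
            p.debt += total_used * p.resource_share / total_share

    # Decrease: proportional to N*T (constant of 1 assumed here).
    for name, amount in used.items():
        projects[name].debt -= amount
```

For example, with shares A=100 and B=300 eligible and a third project marked "no more tasks", 400 instance-seconds used by A leave A at -300, B at +300, and the third project unchanged.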
    124137
    125138=== Work-fetch state ===
    126139
    127 Each PRSC has its own set of data related to work fetch.
     140Each resource has its own set of data related to work fetch.
    128141This is stored in an object of class PRSC_WORK_FETCH.
    129142
     
    182195=== debt accounting ===
    183196{{{
     197
    184198for each resource type R
    185199   for each project P