Changes between Version 7 and Version 8 of GpuWorkFetch


Timestamp:
Dec 26, 2008, 11:19:22 AM (15 years ago)
Author:
davea

  • GpuWorkFetch

    v7 v8  
    77 * If there's a shortfall, request work from the project with highest LTD
    88
    9 The scheduler request has a single number "work_req_seconds"
     9The scheduler request has a scalar "work_req_seconds"
    1010indicating the total duration of jobs being requested.
    1111
    1212This policy has various problems.
    1313
    14  * There's no way for the client to say "I have N idle CPUs, so send me enough jobs to use them all".
     14 * There's no way for the client to say "I have N idle CPUs; send me enough jobs to use them all".
    1515
    1616And many problems related to GPUs:
     
    5757There are two processing resource types: CPU and CUDA.
    5858
    59 Each PRSC has its own
    60 
    61 ----------------------
    62 RESOURCE_WORK_FETCH
    63         base class for the work-fetch policy of a resource
    64         derived classes include all RR sim - related data
    65 
    66         clear()
    67                 called before RR sim
    68 
    69         prepare()
    70                 called before exists_fetchable_project()
    71                 sees if there's project to req from for this resource,
    72                 and caches that
    73 
    74         bool exists_fetchable_project()
    75                 there's a project we can ask for work for this resource
    76 
    77         select_project(priority, char buf)
    78                 if the importance of getting work for this resource is P,
    79                 chooses and returns a PROJECT to request work from,
    80                 and a string to put in the request message
    81                 Choose the project for which LTD + expected payoff is largest
    82 
    83         values for priority:
    84                 DONT_NEED:
    85                         no shortfalls
    86                 NEED: a shortfall, but no idle devices right now
    87                 NEED_NOW: idle devices right not
    88 
    89         runnable_resource_share()
    90 
    91         get_priority()
    92 
    93         bool count_towards_share(PROJECT p)
    94                 whether to count p's resource share in the total for this rsc
    95                 == whether we've got a job of this type in last 30 days
    96 
    97         add_shortfall(PROJECT, dt)
    98                 add x to this project's shortfall,
    99                 where x = dt*(share - instances used)
    100 
    101         double total_share()
    102                 total resource share of projects we're counting
    103 
    104         accumulate_debt(dt)
    105                 for each project p
    106                         x = insts of this device used by P's running jobs
    107                         y = P's share of this device
    108                         update P's LTD
    109 
    110 The following defined in base class:
    111         accumulate_shortfall(dt, i, n)
    112                 i = instances in use, n = total instances
    113                 nidle = n - i
    114                 max_nidle max= nidle
    115                 shortfall += dt*(nidle)
    116                 for each project p for which count_towards_share(p)
    117                         add_proj_shortfall(p, dt)
    118 
    119         data members:
    120                 double shortfall
    121                 double max_nidle
    122 
    123         data per project: (* means save in state file)
    124                 double shortfall
    125                 int last_job*
    126                         last time we had a job from this proj using this rsc
    127                         if the time is within last N days (30?)
    128                         we assume that the project may possibly have jobs of that type
    129                 bool runnable
    130                 max deficit
    131                 backoff timer*
    132                         how long to wait until ask project for work only for this rsc
    133                         double this any time we ask only for work for this rsc and get none
    134                         (maximum 24 hours)
    135                         clear it when we have a job that uses the rsc
    136                 double share
    137                         # of instances this project should get based on RS
    138                 double long_term_debt*
    139 
    140 derived classes:
    141         CPU_WORK_FETCH
    142         CUDA_WORK_FETCH
    143                 we could eventually subclass this from COPROC_WORK_FETCH
    144 ---------------------
    145 debt accounting
    146         for each resource type
    147                 R.accumulate_debt(dt)
    148 ---------------------
    149 RR sim
    150 
     59Each PRSC has its own set of data related to work fetch.
     60This is stored in an object of class PRSC_WORK_FETCH.
     61
     62Its data members are:
     63
     64'''double shortfall''': shortfall for this resource
     65'''double max_nidle''': maximum number of idle instances seen during the simulation
     66
     67Its member functions are:
     68
     69'''clear()''': called at the start of RR simulation
     70
     71'''prepare()''': called before exists_fetchable_project().
     72Checks whether there is a project to request work from for this resource, and caches the result.
     73
     74'''bool exists_fetchable_project()''':
     75returns true if there is a project we can ask for work for this resource
     76
     77'''select_project(priority, char buf)''':
     78given the priority of getting work for this resource,
     79chooses and returns a PROJECT to request work from,
     80and a string (buf) to put in the request message.
     81Chooses the project for which LTD + expected payoff is largest.
     82
     83Values for priority:
     84 * DONT_NEED: no shortfalls
     85 * NEED: a shortfall, but no idle devices right now
     86 * NEED_NOW: idle devices right now
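
The three priority values above can be read as a function of two quantities: the accumulated shortfall and the number of instances idle right now. The following is only a sketch under that reading. The enum name `WorkFetchPriority`, the helper `priority_from`, and its parameter `nidle_now` are hypothetical and do not appear in the design text.

```cpp
#include <cassert>

// Priority levels named in the text above.
enum WorkFetchPriority { DONT_NEED, NEED, NEED_NOW };

// Hypothetical helper (not part of the design text).
// shortfall: accumulated shortfall for this resource, in instance-seconds.
// nidle_now: number of instances of this resource idle at the present moment
//            (an assumed input; the text only says "idle devices right now").
inline WorkFetchPriority priority_from(double shortfall, double nidle_now) {
    assert(shortfall >= 0 && nidle_now >= 0);
    if (nidle_now > 0) return NEED_NOW;   // idle devices right now
    if (shortfall > 0) return NEED;       // a shortfall, but nothing idle yet
    return DONT_NEED;                     // no shortfall
}
```

For example, `priority_from(10.0, 0.0)` yields `NEED`, while any positive `nidle_now` yields `NEED_NOW`.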
     87
     88'''runnable_resource_share()''': total resource share of projects with
     89runnable jobs for this resource.
     90
     91'''get_priority()'''
     92
     93'''bool count_towards_share(PROJECT p)''':
     94whether to count p's resource share in the total for this resource,
     95i.e. whether we've had a job of this type from p in the last 30 days
     96
     97'''add_shortfall(PROJECT, dt)''':
     98add x to this project's shortfall,
     99where x = dt*(share - instances used)
     100
     101'''double total_share()''':
     102total resource share of projects we're counting
     103
     104'''accumulate_debt(dt)''':
     105for each project p:
     106{{{
     107x = insts of this device used by P's running jobs
     108y = P's share of this device
     109update P's LTD
     110}}}
     111
     112
     113'''accumulate_shortfall(dt, i, n)''':
     114{{{
     115i = instances in use, n = total instances
     116nidle = n - i
     117max_nidle max= nidle
     118shortfall += dt*(nidle)
     119for each project p for which count_towards_share(p)
     120    add_proj_shortfall(p, dt)
     121}}}
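
The pseudocode above, combined with the per-project rule x = dt*(share - instances used), might look roughly like this in C++. This is a sketch, not the client's actual code: the struct names `ProjectData` and `ResourceWorkFetch`, and the fields `instances_used` and `counts_towards_share`, are stand-ins invented for illustration. The per-project term is applied exactly as written, with no clamping of negative values, since the text does not specify any.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-project data of one resource.
struct ProjectData {
    double share = 0;                 // instances this project should get
    double instances_used = 0;        // instances its jobs use in this interval (assumed field)
    bool counts_towards_share = true; // stands in for count_towards_share(p)
    double shortfall = 0;
};

// Hypothetical stand-in for the per-resource work-fetch object.
struct ResourceWorkFetch {
    double shortfall = 0;
    double max_nidle = 0;
    std::vector<ProjectData> projects;

    // Called at the start of RR simulation.
    void clear() {
        shortfall = 0;
        max_nidle = 0;
        for (auto& p : projects) p.shortfall = 0;
    }

    // dt = interval length, i = instances in use, n = total instances.
    void accumulate_shortfall(double dt, double i, double n) {
        assert(dt >= 0 && i >= 0 && i <= n);
        double nidle = n - i;
        max_nidle = std::max(max_nidle, nidle);  // "max_nidle max= nidle"
        shortfall += dt * nidle;
        for (auto& p : projects) {
            if (!p.counts_towards_share) continue;
            // x = dt*(share - instances used), added without clamping
            p.shortfall += dt * (p.share - p.instances_used);
        }
    }
};
```

In this sketch, two intervals of 10 s with one of two instances busy, then 5 s with both idle, give a resource shortfall of 10*1 + 5*2 = 20 instance-seconds and a `max_nidle` of 2.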
     122
     123
     124Each PRSC also needs to have some per-project data.
     125This is stored in an object of class PRSC_PROJECT_DATA.
     126Its members include (* means save in state file):
     127
     128'''double shortfall'''
     129
     130'''int last_job'''*: the last time we had a job from this project that used this resource.
     131If that time is within the last N days (30?),
     132we assume the project may have jobs of that type.
     133
     134'''bool runnable'''
     135
     136'''max deficit'''
     137
     138'''backoff timer'''*: how long to wait before asking the project for work for this resource only.
     139Double it each time we ask for work for this resource only and get none
     140(maximum 24 hours).
     141Clear it when we have a job that uses the resource.
     142
     143'''double share''': # of instances this project should get based on RS
     144
     145'''double long_term_debt*'''
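
The backoff timer described in the member list above could be sketched as follows. The doubling rule, the 24-hour cap, and the clear-on-job rule come from the text. The starting interval of 60 seconds is purely an assumption, since the text gives no initial value, and the names `ResourceBackoff`, `request_got_no_work`, and `got_job` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

constexpr double kMaxBackoff = 24 * 3600.0;  // "maximum 24 hours", from the text
constexpr double kInitialBackoff = 60.0;     // ASSUMPTION: not given in the text

// Hypothetical per-project, per-resource backoff state.
struct ResourceBackoff {
    double interval = 0;  // seconds to wait; 0 means no backoff in effect

    // We asked this project for work for this resource only and got none.
    void request_got_no_work() {
        if (interval > 0) {
            interval = std::min(2 * interval, kMaxBackoff);  // double, capped
        } else {
            interval = kInitialBackoff;                      // first failure
        }
        assert(interval <= kMaxBackoff);
    }

    // We have a job that uses the resource: clear the backoff.
    void got_job() { interval = 0; }
};
```

With these assumed constants, repeated failures give 60 s, 120 s, 240 s, and so on, until the interval reaches the 24-hour cap.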
     146
     147
     148
     149=== debt accounting ===
     150{{{
     151for each resource type
     152   R.accumulate_debt(dt)
     153}}}
     154
     155=== RR simulation ===
     156
     157{{{
    151158do simulation as current
    152159on completion of an interval dt
    153160        cpu_work_fetch.accumulate_shortfall(dt)
    154161        cuda_work_fetch.accumulate_shortfall(dt)
    155 
     162}}}
    156163
    157164--------------------