Changes between Version 5 and Version 6 of GpuWorkFetch


Timestamp:
Dec 24, 2008, 3:14:39 PM
Author:
davea
Comment:

--

  • GpuWorkFetch

    v5 v6  
    2121== New policy ==
    2222
    23 Notion of "processor type": CPU is 1 type, each coproc is another.
     23{{{
     24A CPU job is one that uses only CPU time
     25A CUDA job is one that uses CUDA (and may use CPU as well)
     26----------------------
     27RESOURCE_WORK_FETCH
     28        base class for the work-fetch policy of a resource
     29        derived classes include all RR sim - related data
    2430
    25 Keep track of which projects can use which processor type.
    26 Data structure, per (project, processor type):
    27  * LTD
    28  * last time project sent a job that used this type
    29  * shortfall
     31        clear()
     32                called before RR sim
    3033
    31 Round-robin simulator computes:
    32  * for each proc type:
    33   * overall shortfall
    34   * for each project
    35    * shortfall (determines work req)
    36    * max idle instances
     34        prepare()
     35                called before exists_fetchable_project()
     36                checks whether there is a project to request work from for this resource,
     37                and caches the result
    3738
    38 Scheduler request includes:
    39  * for each proc type, # idle, and # of seconds to fill
    40  * still includes work_req_seconds (for backwards compat)
     39        bool exists_fetchable_project()
     40                there's a project we can ask for work for this resource
    4141
    42 Work fetch:
    43  * do RR sim
    44  * for each proc type T (start with coprocs)
    45   * if shortfall
    46    * P = project with recent jobs for T and largest LTD(P, T)
    47    * send P a request with
    48     * T.idle
    49     * T.shortfall
    50     * if T is CPU, work_req_seconds = T.shortfall
     42        select_project(priority, char buf)
     43                given the priority of getting work for this resource,
     44                chooses and returns a PROJECT to request work from,
     45                and fills buf with a string to put in the request message
     46                Chooses the project for which LTD + expected payoff is largest
    5147
    52 CPU sched policy
     48        values for priority:
     49                DONT_NEED:
     50                        no shortfalls
     51                NEED: a shortfall, but no idle devices right now
     52                NEED_NOW: idle devices right now
    5353
    54 Server job send
     54        runnable_resource_share()
    5555
     56        get_priority()
     57
     58        bool count_towards_share(PROJECT p)
     59                whether to count p's resource share in the total for this rsc
     60                == whether we've got a job of this type in last 30 days
     61
     62        add_shortfall(PROJECT, dt)
     63                add x to this project's shortfall,
     64                where x = dt*(share - instances used)
     65
     66        double total_share()
     67                total resource share of projects we're counting
     68
     69        accumulate_debt(dt)
     70                for each project p
     71                        x = insts of this device used by p's running jobs
     72                        y = p's share of this device
     73                        update p's LTD
     74
     75The following defined in base class:
     76        accumulate_shortfall(dt, i, n)
     77                i = instances in use, n = total instances
     78                nidle = n - i
     79                max_nidle max= nidle
     80                shortfall += dt*(nidle)
     81                for each project p for which count_towards_share(p)
     82                        add_shortfall(p, dt)
     83
     84        data members:
     85                double shortfall
     86                double max_nidle
     87
     88        data per project: (* means save in state file)
     89                double shortfall
     90                int last_job*
     91                        last time we had a job from this proj using this rsc
     92                        if the time is within last N days (30?)
     93                        we assume that the project may possibly have jobs of that type
     94                bool runnable
     95                max deficit
     96                backoff timer*
     97                        how long to wait before asking the project for work only for this rsc
     98                        double this any time we ask only for work for this rsc and get none
     99                        (maximum 24 hours)
     100                        clear it when we have a job that uses the rsc
     101                double share
     102                        # of instances this project should get based on RS
     103                double long_term_debt*
     104
     105derived classes:
     106        CPU_WORK_FETCH
     107        CUDA_WORK_FETCH
     108                we could eventually subclass this from COPROC_WORK_FETCH
     109---------------------
     110debt accounting
     111        for each resource type
     112                R.accumulate_debt(dt)
     113---------------------
     114RR sim
     115
     116do simulation as current
     117on completion of an interval dt
     118        cpu_work_fetch.accumulate_shortfall(dt)
     119        cuda_work_fetch.accumulate_shortfall(dt)
     120
     121
     122--------------------
     123scheduler request msg
     124double work_req_seconds
     125double cuda_req_seconds
     126bool send_only_cpu
     127bool send_only_cuda
     128double ninstances_cpu
     129double ninstances_cuda
     130
     131--------------------
     132work fetch
     133
     134We need to deal with the situation where there's a GPU shortfall
     135        but no projects are supplying GPU work.
     136        We don't want an overall backoff from those projects.
     137        Solution: maintain separate backoff timer per resource
     138
     139send_req(p)
     140        switch cpu_work_fetch.priority
     141                case DONT_NEED
     142                        set no_cpu in req message
     143                case NEED, NEED_NOW:
     144                        work_req_sec = p.cpu_shortfall
     145                        ncpus_idle = p.max_idle_cpus
     146        switch cuda_work_fetch.priority
     147                case DONT_NEED
     148                        set no_cuda in the req message
     149                case NEED, NEED_NOW:
     150
     151for prior = NEED_NOW, NEED
     152        for each coproc C (in decreasing order of importance)
     153                p = C.work_fetch.select_project(prior, msg)
     154                if p
     155                        put msg in req message
     156                        send_req(p)
     157                        return
     158        else (no coproc project found)
     159                p = cpu_work_fetch.select_project(prior, msg)
     160                if p
     161                        send_req(p)
     162                        return
     163
     164--------------------
     165On receiving a scheduler reply
     166        if request.
     167--------------------
     168scheduler
     169}}}
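
The shortfall accounting, priority values, and per-resource backoff described in the notes above can be sketched in C++ roughly as follows. This is only an illustration under stated assumptions, not the client's actual code: the struct and field names (`ProjectRscState`, `instances_used`, `counts_towards_share`), the `idle_now` argument to `get_priority()`, the helper names `request_failed()` and `got_job()`, and the 60-second initial backoff are all hypothetical. The notes specify only the doubling rule, the 24-hour cap, and the formulas for the shortfalls.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Hypothetical per-(project, resource) record; names loosely follow the notes.
struct ProjectRscState {
    double share = 0;                 // # of instances the project should get, from resource share
    double instances_used = 0;        // instances its jobs occupy in the current interval (assumed field)
    double shortfall = 0;
    bool counts_towards_share = true; // stands in for count_towards_share(p)
    double backoff_interval = 0;      // seconds; 0 means no backoff in effect
};

enum Priority { DONT_NEED, NEED, NEED_NOW };

struct ResourceWorkFetch {
    double shortfall = 0;
    double max_nidle = 0;
    std::map<std::string, ProjectRscState> projects;

    // Called before the RR simulation.
    void clear() {
        shortfall = 0;
        max_nidle = 0;
        for (auto& kv : projects) kv.second.shortfall = 0;
    }

    // x = dt*(share - instances used), as in the notes.
    // No clamping is applied because the notes do not mention any.
    void add_shortfall(ProjectRscState& p, double dt) {
        p.shortfall += dt * (p.share - p.instances_used);
    }

    // i = instances in use, n = total instances, for an interval of length dt.
    void accumulate_shortfall(double dt, double i, double n) {
        assert(i <= n);
        double nidle = n - i;
        max_nidle = std::max(max_nidle, nidle);
        shortfall += dt * nidle;
        for (auto& kv : projects) {
            if (kv.second.counts_towards_share) add_shortfall(kv.second, dt);
        }
    }

    // Assumption: the caller knows how many instances are idle right now.
    Priority get_priority(double idle_now) const {
        if (idle_now > 0) return NEED_NOW;  // idle devices right now
        if (shortfall > 0) return NEED;     // a shortfall, but nothing idle yet
        return DONT_NEED;                   // no shortfalls
    }
};

const double MAX_BACKOFF = 24 * 3600.0; // maximum 24 hours, per the notes
const double MIN_BACKOFF = 60.0;        // assumed starting value, not in the notes

// We asked only for work for this resource and got none: double the backoff.
void request_failed(ProjectRscState& p) {
    p.backoff_interval = p.backoff_interval > 0
        ? std::min(2 * p.backoff_interval, MAX_BACKOFF)
        : MIN_BACKOFF;
}

// We have a job that uses the resource: clear the backoff.
void got_job(ProjectRscState& p) {
    p.backoff_interval = 0;
}
```

In this sketch the RR simulator would call `accumulate_shortfall()` once per completed interval for each resource, and the work-fetch loop would then consult `get_priority()` and skip any project whose backoff for that resource has not yet expired. Keeping the backoff in the per-resource record rather than in the project itself is what prevents a lack of GPU work from suppressing CPU requests to the same project.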