Changes between Version 5 and Version 6 of GpuWorkFetch

Timestamp:: Dec 24, 2008, 3:14:39 PM
Author:: davea
Comment:: --

GpuWorkFetch

-                      v5
+                      v6
 == New policy ==
+Notion of "processor type": CPU is 1 type, each coproc is another.
+{{{
+A CPU job is one that uses only CPU time
+A CUDA job is one that uses CUDA (and may use CPU as well)
+----------------------
+RESOURCE_WORK_FETCH
+        base class for the work-fetch policy of a resource
+        derived classes include all RR-sim-related data
+Keep track of which projects can use which processor type.
+Data structure, per (project, processor type):
+ * LTD
+ * last time project sent a job that used this type
+ * shortfall
+        clear()
+                called before RR sim
+Round-robin simulator computes:
+ * for each proc type:
+  * overall shortfall
+  * for each project
+   * shortfall (determines work req)
+   * max idle instances
+        prepare()
+                called before exists_fetchable_project()
+                sees if there's a project to request from for this resource,
+                and caches that
+Scheduler request includes:
+ * for each proc type, # idle, and # of seconds to fill
+ * still includes work_req_seconds (for backwards compat)
+        bool exists_fetchable_project()
+                there's a project we can ask for work for this resource
+Work fetch:
+ * do RR sim
+ * for each proc type T (start with coprocs)
+  * if shortfall
+   * P = project with recent jobs for T and largest LTD(P, T)
+   * send P a request with
+    * T.idle
+    * T.shortfall
+    * if T is CPU, work_req_seconds = T.shortfall
+        select_project(priority, char buf)
+                if the importance of getting work for this resource is "priority",
+                chooses and returns a PROJECT to request work from,
+                and a string to put in the request message
+                Choose the project for which LTD + expected payoff is largest
+CPU sched policy
+        values for priority:
+                DONT_NEED:
+                        no shortfalls
+                NEED: a shortfall, but no idle devices right now
+                NEED_NOW: idle devices right now
+Server job send
+        runnable_resource_share()
+        get_priority()
+        bool count_towards_share(PROJECT p)
+                whether to count p's resource share in the total for this rsc
+                == whether we've got a job of this type in last 30 days
+        add_shortfall(PROJECT, dt)
+                add x to this project's shortfall,
+                where x = dt*(share - instances used)
+        double total_share()
+                total resource share of projects we're counting
+        accumulate_debt(dt)
+                for each project p
+                        x = insts of this device used by p's running jobs
+                        y = p's share of this device
+                        update p's LTD
+The following defined in base class:
+        accumulate_shortfall(dt, i, n)
+                i = instances in use, n = total instances
+                nidle = n - i
+                max_nidle max= nidle
+                shortfall += dt*(nidle)
+                for each project p for which count_towards_share(p)
+                        add_proj_shortfall(p, dt)
+        data members:
+                double shortfall
+                double max_nidle
+        data per project: (* means save in state file)
+                double shortfall
+                int last_job*
+                        last time we had a job from this proj using this rsc
+                        if the time is within last N days (30?)
+                        we assume that the project may possibly have jobs of that type
+                bool runnable
+                max deficit
+                backoff timer*
+                        how long to wait before asking the project for work only for this rsc
+                        double this any time we ask only for work for this rsc and get none
+                        (maximum 24 hours)
+                        clear it when we have a job that uses the rsc
+                double share
+                        # of instances this project should get based on RS
+                double long_term_debt*
+derived classes:
+        CPU_WORK_FETCH
+        CUDA_WORK_FETCH
+                we could eventually subclass this from COPROC_WORK_FETCH
+---------------------
+debt accounting
+        for each resource type
+                R.accumulate_debt(dt)
+---------------------
+RR sim
+do simulation as current
+on completion of an interval dt
+        cpu_work_fetch.accumulate_shortfall(dt)
+        cuda_work_fetch.accumulate_shortfall(dt)
+--------------------
+scheduler request msg
+double work_req_seconds
+double cuda_req_seconds
+bool send_only_cpu
+bool send_only_cuda
+double ninstances_cpu
+double ninstances_cuda
+--------------------
+work fetch
+We need to deal with the situation where there's GPU shortfall
+        but no projects are supplying GPU work.
+        We don't want an overall backoff from those projects.
+        Solution: maintain separate backoff timer per resource
+send_req(p)
+        switch cpu_work_fetch.priority
+                case DONT_NEED
+                        set no_cpu in req message
+                case NEED, NEED_NOW:
+                        work_req_sec = p.cpu_shortfall
+                        ncpus_idle = p.max_idle_cpus
+        switch cuda_work_fetch.priority
+                case DONT_NEED
+                        set no_cuda in the req message
+                case NEED, NEED_NOW:
+for prior = NEED_NOW, NEED
+        for each coproc C (in decreasing order of importance)
+                p = C.work_fetch.select_proj(prior, msg)
+                if p
+                        put msg in req message
+                        send_req(p)
+                        return
+        p = cpu_work_fetch(prior)
+        if p
+                send_req(p)
+                return
+--------------------
+When we get a scheduler reply
+        if request.
+--------------------
+scheduler
+}}}
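
A rough C++ sketch of three pieces of the notes above: accumulate_shortfall(), the DONT_NEED / NEED / NEED_NOW priority, and the per-resource backoff timer. This is an illustration under assumptions, not the client's actual code. The struct names, the `instances_used` field, the `nidle_now` argument, and the 60-second starting backoff are all hypothetical; the notes only fix the arithmetic (`shortfall += dt*nidle`, `x = dt*(share - instances used)`, doubling with a 24-hour maximum).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// All names here are illustrative assumptions, not BOINC identifiers.
enum class FetchPriority { DONT_NEED, NEED, NEED_NOW };

constexpr double kMaxBackoff = 24 * 3600.0;  // "maximum 24 hours" (from the notes)
constexpr double kMinBackoff = 60.0;         // starting value: an assumption

// Per-(project, resource) data.
struct ProjectRscState {
    double share = 0;            // # of instances the project should get, based on RS
    double instances_used = 0;   // assumed: instances its jobs use in this interval
    double shortfall = 0;
    bool counts_towards_share = true;  // had a job of this type in the last 30 days
    double backoff_interval = 0;       // seconds; 0 means no backoff in effect
};

struct ResourceWorkFetch {
    double shortfall = 0;
    double max_nidle = 0;
    std::vector<ProjectRscState> projects;

    // Called before the RR simulation.
    void clear() {
        shortfall = 0;
        max_nidle = 0;
        for (auto& p : projects) p.shortfall = 0;
    }

    // dt = length of the simulated interval,
    // i = instances in use, n = total instances.
    void accumulate_shortfall(double dt, double i, double n) {
        assert(n >= i);
        const double nidle = n - i;
        max_nidle = std::max(max_nidle, nidle);
        shortfall += dt * nidle;
        for (auto& p : projects) {
            if (!p.counts_towards_share) continue;
            // x = dt*(share - instances used); not clamped, as in the notes.
            p.shortfall += dt * (p.share - p.instances_used);
        }
    }

    // nidle_now = instances idle at the present moment (assumed input).
    FetchPriority priority(double nidle_now) const {
        if (nidle_now > 0) return FetchPriority::NEED_NOW;
        if (shortfall > 0) return FetchPriority::NEED;
        return FetchPriority::DONT_NEED;
    }
};

// We asked only for work for this resource and got none: double the backoff.
inline void request_got_nothing(ProjectRscState& p) {
    p.backoff_interval = (p.backoff_interval <= 0)
        ? kMinBackoff
        : std::min(2 * p.backoff_interval, kMaxBackoff);
}

// We have a job that uses the resource: clear the backoff.
inline void got_job_for_resource(ProjectRscState& p) {
    p.backoff_interval = 0;
}
```

Keeping the backoff in the per-(project, resource) record rather than on the project as a whole is what lets a project that never supplies GPU work be backed off for GPU requests while still being asked for CPU work.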