Work fetch:
* do RR sim
* for each proc type T (start with coprocs)
  * if shortfall
    * P = project with recent jobs for T and largest LTD(P, T)
    * send P a request with
      * T.idle
      * T.shortfall
      * if T is CPU, work_req_seconds = T.shortfall
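
The per-type request logic above can be sketched as follows. This is a minimal Python sketch, not the client's actual code; the class and field names (`RscType`, `Project.recent`, etc.) are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class RscType:
    name: str
    shortfall: float = 0.0   # filled in by the RR simulation
    idle: float = 0.0        # filled in by the RR simulation

@dataclass
class Project:
    name: str
    ltd: dict = field(default_factory=dict)   # per-resource long-term debt
    recent: set = field(default_factory=set)  # resource types with recent jobs

def build_requests(rsc_types, projects):
    """For each processor type with a shortfall, pick the project with
    recent jobs of that type and the largest LTD, and build a request."""
    requests = []
    for T in rsc_types:          # coprocessor types first, CPU last
        if T.shortfall <= 0:
            continue
        cands = [p for p in projects if T.name in p.recent]
        if not cands:
            continue
        P = max(cands, key=lambda p: p.ltd.get(T.name, 0.0))
        req = {"project": P.name, "idle": T.idle, "shortfall": T.shortfall}
        if T.name == "cpu":
            req["work_req_seconds"] = T.shortfall
        requests.append(req)
    return requests
```
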
select_project(priority, char buf)
    given the importance (priority) of getting work for this resource,
    chooses and returns a PROJECT to request work from,
    and a string to put in the request message.
    Chooses the project for which LTD + expected payoff is largest.
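
The selection rule above can be sketched as follows; the dict-based project records and the `expected_payoff` callable are illustrative stand-ins (only the `runnable` flag and per-resource LTD come from the data listed later in this document):

```python
def select_project(projects, rsc, expected_payoff):
    """Return the runnable project with the largest
    LTD + expected payoff for this resource, or None."""
    best, best_score = None, None
    for p in projects:
        if not p.get("runnable", False):
            continue
        score = p["ltd"].get(rsc, 0.0) + expected_payoff(p, rsc)
        if best is None or score > best_score:
            best, best_score = p, score
    return best
```
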
get_priority()

bool count_towards_share(PROJECT p)
    whether to count p's resource share in the total for this resource,
    i.e., whether we've gotten a job of this type in the last 30 days

add_shortfall(PROJECT, dt)
    add x to this project's shortfall,
    where x = dt*(share - instances used)

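The arithmetic above, as a one-function sketch (the dict-based project record is illustrative):

```python
def add_shortfall(project, dt, share, instances_used):
    """Accumulate per-project shortfall over an interval dt:
    the project fell short by (share - instances used) instances
    for dt seconds."""
    project["shortfall"] += dt * (share - instances_used)
```
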
double total_share()
    total resource share of the projects we're counting

accumulate_debt(dt)
    for each project p:
        x = instances of this device used by p's running jobs
        y = p's share of this device
        update p's LTD by dt*(y - x)

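A sketch of this update, assuming the natural debt rule LTD += dt*(y - x) (debt rises when a project gets less than its share, falls when it gets more) and ignoring any normalization step; the callables are illustrative:

```python
def accumulate_debt(projects, dt, instances_used, share_of):
    """Per-interval long-term debt update for one device type.
    Assumed rule: LTD += dt * (share - instances used)."""
    for p in projects:
        x = instances_used(p)  # instances of this device used by p's running jobs
        y = share_of(p)        # p's share of this device
        p["ltd"] += dt * (y - x)
```
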
The following are defined in the base class:
accumulate_shortfall(dt, i, n)
    i = instances in use, n = total instances
    nidle = n - i
    max_nidle max= nidle
    shortfall += dt*nidle
    for each project p for which count_towards_share(p):
        add_shortfall(p, dt)

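The base-class accounting above can be sketched as a small class. This is illustrative, not the client's code; the `recent_job` and `in_use` fields stand in for the per-project state listed below:

```python
class RscWorkFetch:
    """Per-resource shortfall accounting, following the pseudocode above."""
    def __init__(self, projects):
        self.shortfall = 0.0
        self.max_nidle = 0.0
        self.projects = projects   # per-project records for this resource

    def count_towards_share(self, p):
        # had a job of this type within the last 30 days
        return p.get("recent_job", False)

    def add_shortfall(self, p, dt):
        # per-project shortfall: dt * (share - instances used)
        p["shortfall"] += dt * (p["share"] - p.get("in_use", 0.0))

    def accumulate_shortfall(self, dt, i, n):
        """i = instances in use, n = total instances."""
        nidle = n - i
        self.max_nidle = max(self.max_nidle, nidle)   # "max_nidle max= nidle"
        self.shortfall += dt * nidle
        for p in self.projects:
            if self.count_towards_share(p):
                self.add_shortfall(p, dt)
```
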
data members:
    double shortfall
    double max_nidle

data per project: (* means saved in the state file)
    double shortfall
    int last_job*
        last time we got a job from this project using this resource.
        If that time is within the last N days (30?),
        we assume the project may have jobs of this type.
    bool runnable
    max deficit
    backoff timer*
        how long to wait before asking this project for work for this resource only.
        Double it any time we ask for work only for this resource and get none
        (maximum 24 hours).
        Clear it when we get a job that uses this resource.
    double share
        # of instances this project should get, based on resource share
    double long_term_debt*

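The per-project state above, sketched as a record with the backoff rules attached. The 1-hour starting backoff is an assumption (the text specifies only doubling and the 24-hour cap); field and method names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RscProjectData:
    """Per-(project, resource) state; * fields are saved in the state file."""
    shortfall: float = 0.0
    last_job: int = 0            # * last time we got a job using this resource
    runnable: bool = False
    max_deficit: float = 0.0
    backoff: float = 0.0         # * current backoff interval, seconds
    share: float = 0.0           # instances this project should get
    long_term_debt: float = 0.0  # *

    def backoff_after_empty_reply(self):
        """Double the backoff (assumed 1-hour start), capped at 24 hours."""
        self.backoff = min(max(self.backoff * 2, 3600.0), 86400.0)

    def clear_backoff(self):
        """We got a job that uses this resource."""
        self.backoff = 0.0
```
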
derived classes:
    CPU_WORK_FETCH
    CUDA_WORK_FETCH
        we could eventually subclass this from COPROC_WORK_FETCH
---------------------
debt accounting
    for each resource type R:
        R.accumulate_debt(dt)
---------------------
RR sim

do the simulation as now;
on completion of each interval dt:
    cpu_work_fetch.accumulate_shortfall(dt)
    cuda_work_fetch.accumulate_shortfall(dt)

--------------------
scheduler request msg:
    double work_req_seconds
    double cuda_req_seconds
    bool send_only_cpu
    bool send_only_cuda
    double ninstances_cpu
    double ninstances_cuda

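The message fields above, as a plain record sketch (illustrative; the real message is serialized XML in the client):

```python
from dataclasses import dataclass

@dataclass
class SchedRequest:
    """Work-fetch fields of the scheduler request message."""
    work_req_seconds: float = 0.0   # CPU seconds requested
    cuda_req_seconds: float = 0.0
    send_only_cpu: bool = False
    send_only_cuda: bool = False
    ninstances_cpu: float = 0.0
    ninstances_cuda: float = 0.0
```
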
--------------------
work fetch

We need to deal with the situation where there's a GPU shortfall
but no projects are supplying GPU work.
We don't want an overall backoff from those projects.
Solution: maintain a separate backoff timer per resource.

send_req(p)
    switch cpu_work_fetch.priority:
        case DONT_NEED:
            set no_cpu in the request message
        case NEED, NEED_NOW:
            work_req_sec = p.cpu_shortfall
            ncpus_idle = p.max_idle_cpus
    switch cuda_work_fetch.priority:
        case DONT_NEED:
            set no_cuda in the request message
        case NEED, NEED_NOW:

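A sketch of send_req following the pseudocode above. The CUDA NEED branch is left open in the text, so its body here (setting cuda_req_seconds from an assumed per-project cuda_shortfall) is a guess mirroring the CPU branch; `p` and `req` are plain dicts for illustration:

```python
DONT_NEED, NEED, NEED_NOW = range(3)

def send_req(p, cpu_priority, cuda_priority, req):
    """Fill in the work-fetch fields of the request message for project p."""
    if cpu_priority == DONT_NEED:
        req["no_cpu"] = True
    else:  # NEED or NEED_NOW
        req["work_req_seconds"] = p["cpu_shortfall"]
        req["ncpus_idle"] = p["max_idle_cpus"]
    if cuda_priority == DONT_NEED:
        req["no_cuda"] = True
    else:  # NEED or NEED_NOW: assumed, by analogy with the CPU branch
        req["cuda_req_seconds"] = p.get("cuda_shortfall", 0.0)
    return req
```
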
for prior in (NEED_NOW, NEED):
    for each coproc C (in decreasing order of importance):
        p = C.work_fetch.select_project(prior, msg)
        if p:
            put msg in the request message
            send_req(p)
            return
    p = cpu_work_fetch.select_project(prior, msg)
    if p:
        send_req(p)
        return

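The priority-ordered selection loop above can be sketched as follows; the callables and stub interfaces are illustrative stand-ins for the work-fetch objects, and at most one request is sent per pass:

```python
def choose_and_send(priorities, coprocs, cpu_work_fetch, send_req):
    """Walk priorities from most to least urgent; within each, try
    coprocessors in decreasing order of importance, then the CPU."""
    for prior in priorities:              # e.g. (NEED_NOW, NEED)
        for C in coprocs:                 # decreasing order of importance
            msg = []
            p = C.select_project(prior, msg)
            if p:
                send_req(p, msg)          # put msg in the request message
                return p
        msg = []
        p = cpu_work_fetch.select_project(prior, msg)
        if p:
            send_req(p, msg)
            return p
    return None                           # nothing to ask for
```

Note the outer loop is over priorities: a CPU request at NEED_NOW beats a coprocessor request at NEED, matching the pseudocode's ordering.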
--------------------
When we get a scheduler reply
if request.
--------------------
scheduler
}}}