Work fetch:
    * do RR sim
    * for each proc type T (start with coprocs)
        * if shortfall
            * P = project with recent jobs for T and largest LTD(P, T)
            * send P a request with
                * T.idle
                * T.shortfall
                * if T is CPU, work_req_seconds = T.shortfall
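As a C++ sketch (illustrative only: the PROJECT fields and choose_project() are stand-ins, not actual client identifiers):

{{{
#include <vector>

// Illustrative types; the real client has its own PROJECT/resource structs.
struct PROJECT {
    double ltd;            // long-term debt for this resource type
    bool has_recent_jobs;  // sent us jobs of this type recently
};

// After the RR simulation has found a shortfall for a resource type T,
// pick the project with recent jobs for T and the largest LTD;
// the request to it carries T.idle and T.shortfall.
PROJECT* choose_project(std::vector<PROJECT*>& projects) {
    PROJECT* best = nullptr;
    for (PROJECT* p : projects) {
        if (!p->has_recent_jobs) continue;
        if (!best || p->ltd > best->ltd) best = p;
    }
    return best;
}
}}}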
                      
select_project(priority, char* buf)
    given the priority (the importance of getting work for this resource),
    choose and return a PROJECT to request work from,
    and a string to put in the request message.
    Choose the project for which LTD + expected payoff is largest.
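A sketch of how select_project() might look; the long_term_debt field name and the expected_payoff() helper are assumptions, not identifiers from the source:

{{{
#include <cstdio>
#include <vector>

struct PROJECT { char name[256]; double long_term_debt; };

// Hypothetical: expected payoff of asking p for work for this resource.
static double expected_payoff(PROJECT* p) { return 0; }

// Choose the project for which LTD + expected payoff is largest,
// and write a line for the request message into buf.
PROJECT* select_project(int priority, char* buf, int len,
                        std::vector<PROJECT*>& projects) {
    PROJECT* best = nullptr;
    double best_score = 0;
    for (PROJECT* p : projects) {
        double score = p->long_term_debt + expected_payoff(p);
        if (!best || score > best_score) { best = p; best_score = score; }
    }
    if (best) snprintf(buf, len, "requesting work from %s", best->name);
    return best;
}
}}}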
            
                  
get_priority()

bool count_towards_share(PROJECT p)
    whether to count p's resource share in the total for this rsc
    == whether we've had a job of this type in the last 30 days

add_shortfall(PROJECT, dt)
    add x to this project's shortfall,
    where x = dt*(share - instances used)

double total_share()
    total resource share of the projects we're counting
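These three fit together as follows (C++ sketch; struct and field names are illustrative):

{{{
#include <vector>

static const double RECENT_DAYS = 30 * 86400.;  // the "last 30 days" window

struct PROJECT {
    double resource_share;
    double last_job;        // last time this project sent a job of this type
    double shortfall;
    double share;           // instances owed, derived from resource shares
    double instances_used;
};

struct WORK_FETCH_SKETCH {
    std::vector<PROJECT*> projects;
    double now = 0;

    // count p only if it has sent jobs of this type recently
    bool count_towards_share(PROJECT* p) {
        return now - p->last_job < RECENT_DAYS;
    }
    double total_share() {
        double sum = 0;
        for (PROJECT* p : projects)
            if (count_towards_share(p)) sum += p->resource_share;
        return sum;
    }
    // shortfall grows when p gets fewer instances than its share
    void add_shortfall(PROJECT* p, double dt) {
        p->shortfall += dt * (p->share - p->instances_used);
    }
};
}}}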

accumulate_debt(dt)
    for each project p
        x = insts of this device used by p's running jobs
        y = p's share of this device
        update p's LTD: LTD += dt*(y - x)
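The notes don't spell out the update rule; one plausible reading, under the usual debt semantics (debt grows when a project receives less than its share), is:

{{{
#include <vector>

struct PROJECT {
    double long_term_debt;
    double share;            // p's share of this device (y)
    double instances_used;   // insts used by p's running jobs (x)
};

// Sketch: each project's LTD moves by (owed - received) over the interval.
// Any normalization of debts across projects is omitted here.
void accumulate_debt(std::vector<PROJECT*>& projects, double dt) {
    for (PROJECT* p : projects) {
        double x = p->instances_used;
        double y = p->share;
        p->long_term_debt += dt * (y - x);
    }
}
}}}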

The following is defined in the base class:
accumulate_shortfall(dt, i, n)
    i = instances in use, n = total instances
    nidle = n - i
    max_nidle = max(max_nidle, nidle)
    shortfall += dt*nidle
    for each project p for which count_towards_share(p)
        add_shortfall(p, dt)
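In C++, the base-class bookkeeping could look like this (sketch; the derived classes would supply the per-resource instance counts):

{{{
#include <algorithm>
#include <vector>

struct PROJECT { double shortfall = 0, share = 0, instances_used = 0; };

struct WORK_FETCH_BASE {
    double shortfall = 0;
    double max_nidle = 0;
    std::vector<PROJECT*> projects;

    // stub: the real test is the 30-day window shown earlier
    bool count_towards_share(PROJECT*) { return true; }
    void add_shortfall(PROJECT* p, double dt) {
        p->shortfall += dt * (p->share - p->instances_used);
    }

    // i = instances in use, n = total instances
    void accumulate_shortfall(double dt, double i, double n) {
        double nidle = n - i;
        max_nidle = std::max(max_nidle, nidle);  // the notes' "max_nidle max= nidle"
        shortfall += dt * nidle;
        for (PROJECT* p : projects) {
            if (count_towards_share(p)) add_shortfall(p, dt);
        }
    }
};
}}}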

data members:
    double shortfall
    double max_nidle

data per project: (* means saved in the state file)
    double shortfall
    int last_job*
        last time we had a job from this project using this rsc;
        if that time is within the last N days (30?),
        we assume the project may have jobs of this type
    bool runnable
    max deficit
    backoff timer*
        how long to wait before asking this project for work for this rsc only;
        double it any time we ask for work for this rsc only and get none
        (maximum 24 hours);
        clear it when we get a job that uses the rsc
    double share
        # of instances this project should get based on resource share
    double long_term_debt*
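Gathered into a struct (sketch; the struct name is made up, and [*] marks the fields the notes say to save in the state file):

{{{
// Per-(project, resource) work fetch state, as listed above.
struct RSC_PROJECT_STATE {
    double shortfall;
    int    last_job;        // [*] last time we had a job from this
                            //     project using this rsc
    bool   runnable;
    double max_deficit;
    double backoff_timer;   // [*] wait this long before the next
                            //     rsc-only request; doubled on each empty
                            //     reply (max 24h), cleared when a job
                            //     using the rsc arrives
    double share;           // instances owed, from resource shares
    double long_term_debt;  // [*]
};
}}}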

derived classes:
    CPU_WORK_FETCH
    CUDA_WORK_FETCH
        we could eventually subclass this from COPROC_WORK_FETCH
---------------------
debt accounting
for each resource type R
    R.accumulate_debt(dt)
---------------------
RR sim

do the simulation as currently;
on completion of an interval dt:
    cpu_work_fetch.accumulate_shortfall(dt)
    cuda_work_fetch.accumulate_shortfall(dt)
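Sketch of that hook in C++; the instance counts are assumed to come from the simulator's per-interval bookkeeping (the notes leave them implicit in the accumulate_shortfall(dt) calls):

{{{
// One work-fetch object per resource type; the RR simulation calls
// accumulate_shortfall() when it completes each simulated interval.
struct WORK_FETCH {                      // minimal stand-in for the class above
    double shortfall = 0;
    void accumulate_shortfall(double dt, double i, double n) {
        shortfall += dt * (n - i);       // per-project bookkeeping omitted
    }
};

WORK_FETCH cpu_work_fetch, cuda_work_fetch;

// called at the end of one simulated interval of length dt
void on_sim_interval(double dt, double cpus_used, double ncpus,
                     double cudas_used, double ncudas) {
    cpu_work_fetch.accumulate_shortfall(dt, cpus_used, ncpus);
    cuda_work_fetch.accumulate_shortfall(dt, cudas_used, ncudas);
}
}}}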

--------------------
scheduler request msg
    double work_req_seconds
    double cuda_req_seconds
    bool send_only_cpu
    bool send_only_cuda
    double ninstances_cpu
    double ninstances_cuda
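As a struct (sketch; in the real client these travel as XML fields in the scheduler RPC):

{{{
// New per-resource fields in the scheduler request message.
struct SCHEDULER_REQUEST_FIELDS {
    double work_req_seconds;   // CPU seconds of work requested
    double cuda_req_seconds;   // CUDA seconds of work requested
    bool   send_only_cpu;      // client wants CPU jobs only
    bool   send_only_cuda;     // client wants CUDA jobs only
    double ninstances_cpu;     // CPU instance count (cf. T.idle above)
    double ninstances_cuda;    // CUDA instance count
};
}}}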

--------------------
work fetch

We need to deal with the situation where there's a GPU shortfall
but no projects are supplying GPU work.
We don't want an overall backoff from those projects.
Solution: maintain a separate backoff timer per resource.
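A sketch of that backoff rule; the initial 60-second value is a guess, since the notes give only the doubling and the 24-hour cap:

{{{
#include <algorithm>

// Per-(project, resource) backoff, kept separately from any
// project-wide backoff so a CPU-only project doesn't block GPU fetch.
struct RSC_BACKOFF {
    double backoff_timer = 0;   // seconds until the next rsc-only request

    // asked for work for this rsc only and got none: back off, up to 24h
    void on_empty_reply() {
        backoff_timer = std::max(backoff_timer * 2, 60.);  // 60s start is a guess
        backoff_timer = std::min(backoff_timer, 24 * 3600.);
    }
    // got a job that uses this rsc: clear the backoff
    void on_job_received() { backoff_timer = 0; }
};
}}}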

send_req(p)
    switch cpu_work_fetch.priority
        case DONT_NEED:
            set no_cpu in req message
        case NEED, NEED_NOW:
            work_req_sec = p.cpu_shortfall
            ncpus_idle = p.max_idle_cpus
    switch cuda_work_fetch.priority
        case DONT_NEED:
            set no_cuda in the req message
        case NEED, NEED_NOW:
for prior in (NEED_NOW, NEED)
    for each coproc C (in decreasing order of importance)
        p = C.work_fetch.select_project(prior, msg);
        if p
            put msg in the req message
            send_req(p)
            return
    p = cpu_work_fetch.select_project(prior, msg)
    if p
        send_req(p)
        return
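Both pieces together in C++ (sketch). Note the cuda NEED/NEED_NOW case is left blank in the notes above; below it is filled in by analogy with the CPU case, which is an assumption, as are the cuda_shortfall/max_idle_cudas field names:

{{{
#include <initializer_list>
#include <vector>

enum FETCH_PRIORITY { DONT_NEED, NEED, NEED_NOW };

struct PROJECT {
    double cpu_shortfall = 0, max_idle_cpus = 0;
    double cuda_shortfall = 0, max_idle_cudas = 0;  // assumed, by CPU analogy
};

struct REQUEST {
    bool no_cpu = false, no_cuda = false;
    double work_req_sec = 0, cuda_req_sec = 0;
    double ncpus_idle = 0, ncudas_idle = 0;
};

struct WORK_FETCH {
    FETCH_PRIORITY priority = DONT_NEED;
    // stub; the real selection is select_project() above
    PROJECT* select_project(FETCH_PRIORITY, char*) { return nullptr; }
};

WORK_FETCH cpu_work_fetch, cuda_work_fetch;

REQUEST send_req(PROJECT* p) {
    REQUEST req;
    switch (cpu_work_fetch.priority) {
    case DONT_NEED:
        req.no_cpu = true;
        break;
    case NEED: case NEED_NOW:
        req.work_req_sec = p->cpu_shortfall;
        req.ncpus_idle = p->max_idle_cpus;
        break;
    }
    switch (cuda_work_fetch.priority) {
    case DONT_NEED:
        req.no_cuda = true;
        break;
    case NEED: case NEED_NOW:   // blank in the notes; CPU analogy assumed
        req.cuda_req_sec = p->cuda_shortfall;
        req.ncudas_idle = p->max_idle_cudas;
        break;
    }
    return req;                 // then issue the actual RPC
}

// Try coprocs first at each priority level, then the CPU.
void work_fetch_pass(std::vector<WORK_FETCH*>& coprocs, char* msg) {
    for (FETCH_PRIORITY prior : {NEED_NOW, NEED}) {
        for (WORK_FETCH* c : coprocs) {   // in decreasing order of importance
            if (PROJECT* p = c->select_project(prior, msg)) {
                send_req(p);              // msg goes into the request message
                return;
            }
        }
        if (PROJECT* p = cpu_work_fetch.select_project(prior, msg)) {
            send_req(p);
            return;
        }
    }
}
}}}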

--------------------
When we get a scheduler reply
if request.
--------------------
scheduler