| 1 | = Scheduler job matchmaking = |
| 2 | |
| 3 | Currently the scheduler's work-sending algorithm is: |
| 4 | * if host is reliable, scan entire job array, looking for retries |
| 5 | * if using HR, scan entire job array, looking for jobs committed to this HR class |
| 6 | * if still need work, scan entire job array with no restrictions |
| 7 | |
| 8 | This is bad for several reasons: |
| 9 | * inefficient: repeated array scans |
| 10 | * inflexible policy: why should reliable be more important than HR? |
| 11 | * hard to add new policies, like sending "hard" jobs to fasts hosts |
| 12 | |
| 13 | To solve these problems we're going to change this part |
| 14 | of the scheduler. |
| 15 | Basic idea: given a job J and a host H, |
| 16 | there is a function V(J, H) that represents the value of sending J to H. |
| 17 | V(J, H) might reflect various factors: |
| 18 | |
| 19 | * the computational "hardness" of J |
| 20 | * CPU/deadline ratio |
| 21 | * RAM or disk requirements |
| 22 | * H already has files required by J(or jobs already planned for sending have such files) |
| 23 | * J is a retry and H is a fast/reliable host |
| 24 | * J has already been committed to H's HR class |
| 25 | |
| 26 | BOINC includes a default value function. |
| 27 | Projects can tweak its weights, or define their own value function. |
| 28 | |
| 29 | == Details == |
| 30 | |
| 31 | functions: |
| 32 | fast_filter():: feasibility checks that can be done with no DB access |
| 33 | slow_filter():: feasibility checks that need DB access |
| 34 | |
| 35 | When we scan a slot, the possibilities are: |
| 36 | * slot is empty |
| 37 | * slot is locked by another scheduler |
| 38 | * job fails fast filter |
| 39 | * job fails slow filter |
| 40 | * none of the above |
| 41 | |
| 42 | Parameters: |
| 43 | N:: scan at least this many slots |
| 44 | (if scan N slots and have enough jobs to send, stop) |
| 45 | M:: scan at most this many slots |
| 46 | (even if don't have enough jobs to send yet) |
| 47 | L:: if scan this many locked slots, print warning msg |
| 48 | (should increase shmem size) |
| 49 | |
| 50 | logic: |
| 51 | {{{ |
| 52 | acquire semaphore |
| 53 | i = random index in shmem |
| 54 | x = ordered list of jobs to send (empty) |
| 55 | slots_scanned = 0 |
| 56 | slots_locked = 0 |
| 57 | loop |
| 58 | i = i+1 % array_size |
| 59 | slots_scanned++ |
| 60 | if slots_scanned > M |
| 61 | break |
| 62 | if shmem[i] is empty |
| 63 | continue |
| 64 | if shmem[i] is locked |
| 65 | slots_locked++ |
| 66 | continue |
| 67 | j = shmem[i] |
| 68 | if !fast_filter(j) continue; |
| 69 | v = v(h, j) |
| 70 | if v > lowest score in x |
| 71 | lock j |
| 72 | release semaphore |
| 73 | if !slow_filter(j) |
| 74 | acquire semaphore |
| 75 | unlock j |
| 76 | continue |
| 77 | acquire semaphore |
| 78 | if x satisfies client's work request |
| 79 | unlock and remove lowest-score element of x |
| 80 | add j to x |
| 81 | if x satisfies client's work request and slots_scanned >= N |
| 82 | break; |
| 83 | for each job j in x |
| 84 | mark slot j as empty |
| 85 | release semaphore |
| 86 | if slots_locked > L |
| 87 | print "need bigger array" message |
| 88 | }}} |