* Support the QoS features.
* Avoid assigning "tight" deadlines unnecessarily, because:
 * Doing so may make it impossible to assign tight deadlines
 to jobs that actually need them.
 * Tight deadlines often force the client to preempt other jobs,
 which irritates some volunteers.
* Avoid long delays between the completion of a job instance
and its validation.
These delays irritate volunteers and increase server disk usage.
* Minimize server configuration.
We need a way to identify hosts that can turn around jobs quickly
and reliably.
Notes:
* This is a property of (host, app version), not of the host alone.
* This is not the same as processor speed.
A host may have a high turnaround time for various reasons:
 * Large min work buffer size.
 * Attached to lots of other projects.
 * Long periods of unavailability or network disconnection.
We propose the following.
For each app A:

* For each (host, app version), let X be the percentile
of its turnaround time.
* For each (host, app version), let Y be the percentile
of its "consecutive valid results" count (or +infinity if > 10),
taken over all active hosts and all current app versions.
* Let P(H, AV) = min(X, Y).

This will be computed periodically (say, every 24 hours)
by a utility program.
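
A minimal sketch of what that utility might compute, assuming in-memory
(host, app version) records; the struct and field names here are
hypothetical, not the actual BOINC schema:

{{{
// Sketch: assign P(H, AV) = min(X, Y) to each (host, app version).
#include <algorithm>
#include <vector>

struct HostAppVersion {
    double avg_turnaround;   // mean turnaround time, seconds
    int consecutive_valid;   // current run of consecutive valid results
    double percentile;       // output: P(H, AV) in [0, 100]
};

// Percentage of values in "sorted" that are strictly less than v.
static double percentile_rank(const std::vector<double>& sorted, double v) {
    size_t below = std::lower_bound(sorted.begin(), sorted.end(), v)
        - sorted.begin();
    return 100.0 * below / sorted.size();
}

void compute_percentiles(std::vector<HostAppVersion>& havs) {
    if (havs.empty()) return;
    std::vector<double> speed, valid;
    for (const auto& h : havs) {
        speed.push_back(-h.avg_turnaround);   // negated: faster is better
        valid.push_back(h.consecutive_valid);
    }
    std::sort(speed.begin(), speed.end());
    std::sort(valid.begin(), valid.end());
    for (auto& h : havs) {
        double x = percentile_rank(speed, -h.avg_turnaround);
        // "+infinity if > 10": a long valid streak maxes out Y
        double y = h.consecutive_valid > 10
            ? 100.0
            : percentile_rank(valid, (double)h.consecutive_valid);
        h.percentile = std::min(x, y);
    }
}
}}}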

Notes:
* When a new app version is deployed,
the host_app_version records for the previous version should be copied,
on the assumption that hosts reliable for one version
will be reliable for the next (see the sketch below).
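
A sketch of that copy step, with hypothetical in-memory records
(the real records live in the host_app_version table):

{{{
#include <vector>

// Hypothetical in-memory form of a host_app_version record.
struct HostAppVersionRec {
    int host_id;
    int app_version_id;
    double percentile;    // P(H, AV) as computed above
};

// Seed records for a newly deployed app version from the previous one,
// so hosts keep their reliability standing across the upgrade.
void copy_for_new_version(
    std::vector<HostAppVersionRec>& recs, int old_av_id, int new_av_id
) {
    size_t n = recs.size();   // fix size: we append while scanning
    for (size_t i = 0; i < n; i++) {
        if (recs[i].app_version_id == old_av_id) {
            HostAppVersionRec r = recs[i];
            r.app_version_id = new_av_id;
            recs.push_back(r);
        }
    }
}
}}}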

== Batch completion estimation ==

The proposed policies require estimates C(B) of batch completion.
I'm not sure exactly how to compute these, but
(one possibility is sketched after this list):
* They should be based on completed and validated jobs rather than
a priori FLOPs estimates.
* They should reflect (host, app version) information
(e.g. turnaround and elapsed time statistics)
for the hosts that have completed jobs,
and for the host population as a whole.
* They should be computed by a daemon process,
triggered by the passage of time and
by the validation of jobs in the batch.
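
As one possibility (this is an illustration, not a settled design;
the struct and the formula are hypothetical), a crude estimator could
extrapolate from validated-job throughput:

{{{
#include <ctime>

struct BatchStats {
    time_t create_time;       // when the batch was submitted
    int njobs;                // total jobs in the batch
    int nvalidated;           // jobs completed and validated so far
    double mean_turnaround;   // mean turnaround of relevant hosts, seconds
};

// Estimate C(B): extrapolate linearly from the validation rate so far,
// then add one mean turnaround to cover the in-flight tail.
time_t estimate_completion(const BatchStats& b, time_t now) {
    if (b.nvalidated == 0) {
        // no data yet: fall back to a population-level guess
        return now + (time_t)(b.njobs * b.mean_turnaround);
    }
    double elapsed = difftime(now, b.create_time);
    double rate = b.nvalidated / elapsed;     // validated jobs per second
    double remaining = b.njobs - b.nvalidated;
    return now + (time_t)(remaining / rate + b.mean_turnaround);
}
}}}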

Notes:
* C(B) is different from the "logical end time" of the batch
used in batch scheduling.
* For long-deadline batches, C(B) should probably be at least
the original delay bound plus the greatest dispatch time of the first
instance of any job.
I.e., if it takes a long time to dispatch the first instances,
adjust the deadline accordingly to avoid creating a deadline crunch.

== Proposed feeder policy ==

Proposed enumeration order:
{{{
(LET(J) asc, nretries desc)
}}}
where LET(J) is the logical end time of the job's batch.
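
In C++ terms this ordering corresponds to a comparator like the following
sketch (the feeder actually enumerates jobs via a database query; the
struct and field names are illustrative):

{{{
#include <algorithm>
#include <vector>

struct Job {
    double logical_end_time;   // LET(J): logical end time of J's batch
    int nretries;              // how many times J has been retried
};

// LET ascending, then retries descending: urgent batches come first,
// and retries jump the queue within a batch.
bool enum_order(const Job& a, const Job& b) {
    if (a.logical_end_time != b.logical_end_time) {
        return a.logical_end_time < b.logical_end_time;
    }
    return a.nretries > b.nretries;
}

// usage: std::sort(jobs.begin(), jobs.end(), enum_order);
}}}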

== Proposed scheduler policy ==

For each processor type T (CPU and GPU) we have a "busy time" BT:
the time already committed to high-priority jobs.
For a given job J we can compute the estimated runtime R.
The earliest we can expect to finish J is then BT + R,
so that's the earliest deadline we can assign.
Call this MD(J, T).

For each app A and processor type T, compute the best app version
BAV(A, T) at the start of handling each request.

The rough policy then is:
{{{
for each job J in the array, belonging to batch B
    for each usable app version AV of type T
        if B is AFAP and there's no estimate yet
            if P(H, AV) > 50%
                send J using AV, with deadline BT + R
        else
            x = MD(J, T)
            if x < C(B)
                send J using AV, with deadline C(B)
            else if P(H, AV) > 90%
                send J using AV, with deadline x
}}}
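
A C++ rendering of this decision step might look like the sketch below;
the types, thresholds, and helper names are hypothetical stand-ins, and
it folds in the MD(J, T) = BT + R computation from above:

{{{
// Sketch of the per-(job, app version) decision; not the actual
// BOINC scheduler code.
struct Batch {
    bool afap;               // "as fast as possible" batch
    bool has_estimate;       // do we have C(B) yet?
    double completion_est;   // C(B), as Unix time
};

struct Candidate {
    double percentile;       // P(H, AV) for this host and app version
    double busy_time;        // BT for the version's processor type
    double est_runtime;      // R: estimated runtime of J with AV
};

// Returns the deadline to assign, or 0 to skip this (job, version) pair.
double choose_deadline(const Batch& b, const Candidate& c, double now) {
    if (b.afap && !b.has_estimate) {
        // no estimate yet: only trust above-median hosts
        if (c.percentile > 50) return now + c.busy_time + c.est_runtime;
        return 0;
    }
    double md = now + c.busy_time + c.est_runtime;   // MD(J, T)
    if (md < b.completion_est) {
        // host can finish within the batch estimate: loose deadline is fine
        return b.completion_est;
    }
    if (c.percentile > 90) {
        // tight deadline: reserve for top-decile (host, app version)
        return md;
    }
    return 0;
}
}}}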

Make an initial pass through the array
sending only jobs that have a percentile requirement.

Notes:
* The 50% and 90% thresholds can be parameterized.
* Retries are not handled differently at this level,
although we could add a restriction like sending
them only to top-50% hosts.
* In the startup case (e.g. a new app), no hosts will be high-percentile.
How do we avoid starvation?
* I think that score-based scheduling is now deprecated.
The feasibility and/or desirability of a job may depend
on what other jobs we're sending,
so it doesn't make sense to assign it a score in isolation.
It's simpler to scan jobs and make a final decision for each one.
There are a few properties we need to give priority to:
 * Limited locality scheduling
 * Beta jobs
We can handle these in separate passes, as we're doing now.