= Server scheduling improvements =

By default, the BOINC scheduler dispatches jobs in the order returned by
a database select, which is roughly FIFO.

This is suboptimal in the following situations:

 * If a job fails or times out, a '''retry job''' is created.
   If there are lots of sendable jobs already in the DB,
   it may be days or weeks before the retry job is dispatched.
   During this period, the completed replicas of that job
   remain uncredited and take up disk space.
 * Jobs at the tail end of a batch should be completed faster,
   since the batch is not finished until its last job is done.

To improve these situations, there are two policies we can adjust:

 * The order in which the feeder enumerates jobs from the DB.
 * Preferentially sending particular jobs to fast/reliable hosts.

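As a rough illustration of the first policy, enumeration order can be expressed as a sort key. This is only a sketch: the `Job` fields (`is_retry`, `batch_remaining`) and the key itself are hypothetical and are not taken from the BOINC schema or the existing feeder.

```python
from dataclasses import dataclass


@dataclass
class Job:
    id: int                 # hypothetical: lower id means created earlier
    is_retry: bool          # hypothetical: True if this replaces a failed or timed-out job
    batch_remaining: float  # hypothetical: fraction of the job's batch still unfinished (0..1)


def enumeration_key(job: Job):
    """Retry jobs first, then jobs from nearly finished batches, then FIFO by id."""
    return (not job.is_retry, job.batch_remaining, job.id)


jobs = [
    Job(id=1, is_retry=False, batch_remaining=0.90),
    Job(id=2, is_retry=False, batch_remaining=0.05),
    Job(id=3, is_retry=True, batch_remaining=0.50),
]
ordered = sorted(jobs, key=enumeration_key)
print([j.id for j in ordered])
```

With no retries and no batches in play, the key falls back to plain FIFO order.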
BOINC has mechanisms in the [BackendPrograms#feeder feeder]
and [ProjectOptions#Acceleratingretries scheduler] that address these issues
to some extent.
However, these mechanisms are out of date.
This page proposes revisions to them.

(to be completed)

== Notes ==

 * We should eliminate as much config as possible.
   There should be no thresholds for turnaround time
   (especially not a project-wide one; this should be per app).
 * The notion of "reliable host" need not be binary.
   Maybe we should define it in terms of order statistics:
   50th-percentile hosts, 90th-percentile hosts, etc.
   Note: this is on a per (host, app version) basis.
 * We need to think about how this interacts with homogeneous redundancy (HR).

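A minimal sketch of the order-statistics idea, assuming that the only ranking input is average turnaround time per (host, app version). The dictionary layout and the function name `percentile_ranks` are made up for illustration; a real ranking would probably also account for error rates.

```python
from bisect import bisect_right


def percentile_ranks(avg_turnaround):
    """Map each (host, app version) key to the percentage of all entries
    that are strictly slower. Lower turnaround is better, so the fastest
    entry gets the highest rank."""
    ordered = sorted(avg_turnaround.values())
    n = len(ordered)
    ranks = {}
    for key, t in avg_turnaround.items():
        slower = n - bisect_right(ordered, t)  # entries with a larger turnaround
        ranks[key] = 100.0 * slower / n
    return ranks


# Hypothetical data: (host id, app version id) -> average turnaround in hours.
turnaround = {
    (1, 10): 2.0,
    (2, 10): 8.0,
    (3, 10): 4.0,
    (1, 11): 24.0,
}
ranks = percentile_ranks(turnaround)
for key in sorted(ranks):
    print(key, ranks[key])
```

Note that host 1 gets a different rank for each of its two app versions, which is the point of ranking per (host, app version) rather than per host.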
We need to think carefully about the dispatch model.
In general, the job cache holds some "special" jobs,
and some of the RPCs we get come from "special" hosts.
Two extreme policies:

 * Send special jobs only to special hosts.
   The danger: a special job may sit in the cache
   for a long time, maybe forever.
 * If we get a request from a non-special host,
   and we can't satisfy it with non-special jobs,
   send it special jobs too.
   The danger: special jobs may be sent to a slow or unreliable host.

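The two extremes can be written as a single decision function. This is a sketch only; the policy names `"strict"` and `"fallback"` and the function `may_send` are invented here, and it assumes that non-special jobs may go to any host.

```python
def may_send(job_is_special, host_is_special, nonspecial_available, policy):
    """Return True if the job may be sent to the requesting host.

    policy "strict":   special jobs go only to special hosts.
    policy "fallback": a non-special host also gets special jobs, but only
                       when no non-special job can satisfy its request.
    """
    if policy not in ("strict", "fallback"):
        raise ValueError("unknown policy: %r" % (policy,))
    if not job_is_special or host_is_special:
        return True  # assumption: no restriction in these cases
    if policy == "strict":
        return False
    return not nonspecial_available


# A special job requested by a non-special host, with no other work available:
print(may_send(True, False, False, "strict"))    # job stays in the cache
print(may_send(True, False, False, "fallback"))  # job goes to a possibly slow host
```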
Compromises are possible;
e.g. we could associate a "min percentile" with each job in cache,
and send a job only to a (host, app version) of that percentile or greater.
The min percentile could be decayed over time
so that every job would eventually get sent.
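
One possible shape for the decay, as a sketch: a linear decay that reaches zero, so that any host eventually qualifies. The decay form, the rate and the function names are assumptions, not part of the proposal.

```python
def min_percentile(initial, age_hours, decay_per_hour):
    """Threshold for a job that has been in the cache for age_hours.
    Linear decay, clamped at 0 so every host eventually qualifies."""
    return max(0.0, initial - decay_per_hour * age_hours)


def can_send(host_percentile, initial, age_hours, decay_per_hour):
    """True if this (host, app version) percentile meets the job's current threshold."""
    return host_percentile >= min_percentile(initial, age_hours, decay_per_hour)


# Hypothetical job that starts out requiring 90th-percentile hosts
# and loses 2 percentile points per hour in the cache.
for age in (0.0, 10.0, 20.0, 100.0):
    print(age, min_percentile(90.0, age, 2.0), can_send(50.0, 90.0, age, 2.0))
```

A linear decay is used here rather than an exponential one because an exponential threshold never reaches zero, so the lowest-ranked hosts would never qualify.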