| | 48 | |
| | 49 | === The bottom line === |
| | 50 | |
| | 51 | The current approach - extending the STD/LTD model to multiple resource types - |
| | 52 | seemed good at the time but turns out to be the wrong way. |
| | 53 | |
| | 54 | == Proposal: credit-driven scheduling == |
| | 55 | |
| | 56 | The idea is to make resource share apply to credit. |
| | 57 | If two projects have the same resource share, |
| | 58 | they should have the same RAC. |
| | 59 | This suggests the following principle, |
| | 60 | which can apply to both work fetch and job scheduling: |
| | 61 | |
| | 62 | * Normalize RAC and resource share so that each one sums to 1 across projects. |
| | 63 | * For a project P, let G(P) = share(P) - RAC(P). |
| | 64 | * Give priority to projects for which G(P) is highest, |
| | 65 | i.e. that aren't getting as much credit as they should. |
| | 66 | |
| | 67 | This does 2 things: |
| | 68 | |
| | 69 | * It's the correct semantics for resource share: |
| | 70 | they now control something that volunteers can actually see, |
| | 71 | namely credit. |
| | 72 | * It penalizes projects that grant inflated credit: |
| | 73 | the more credit a project grants, the less work a given host |
| | 74 | will do for it, assuming the host is attached to multiple projects. |
| | 75 | (The converse is potentially true - a project would get more work done |
| | 76 | by granting less credit. This is minimized by a mechanism described below.) |
| | 77 | |
| | 78 | Note: I've glossed over the issue of the time scale over which RAC is averaged. |
| | 79 | The RAC reported by servers has a half-life of a week. |
| | 80 | For purposes of scheduling a different (probably longer) period would be better. |
| | 81 | The client could potentially compute its own RAC |
| | 82 | based on changes in total credit. |
| | 83 | However, it's probably OK to just use the server-reported RAC. |
| | 84 | |
| | 85 | === Recent average FLOPS === |
| | 86 | |
| | 87 | There are some problems with credit-driven scheduling: |
| | 88 | |
| | 89 | * There may be a long and variable delay between completing a job |
| | 90 | and getting credit for it. |
| | 91 | * Jobs may fail to get credit, e.g. because they don't validate. |
| | 92 | |
| | 93 | To deal with these issues, |
| | 94 | I propose using not just RAC by itself, |
| | 95 | but the combination of RAC and '''recent average FLOPS''' (RAF) per project. |
| | 96 | This is intended to address the above 2 issues, |
| | 97 | and the issue of projects that grant too little credit. |
| | 98 | |
| | 99 | === Work fetch === |
| | 100 | |
| | 101 | In addition to G(P), the work fetch policy also needs to take into account |
| | 102 | the amount of work currently queued for a project, |
| | 103 | so that it doesn't keep getting work from the same project. |
| | 104 | To accomplish this, we define Q(P) as the number of FLOPS currently queued for P, |
| | 105 | normalized so that the sum over projects is 1. |
| | 106 | |
| | 107 | We then define the work-fetch priority of a project as |
| | 108 | {{{ |
| | 109 | WFP(P) = share(P) - (RAC(P) + A*RAF(P) + B*Q(P))/(1+A+B) |
| | 110 | }}} |
| | 111 | |
| | 112 | Where A and B are parameters, probably around 1. |
| | 113 | |
| | 114 | === Job scheduling === |
| | 115 | |
| | 116 | As the job scheduling policy picks jobs to run (e.g. on a multiprocessor) |
| | 117 | it needs to take into account the jobs already scheduled, |
| | 118 | so that it doesn't always schedule multiple jobs from the same project. |
| | 119 | To accomplish this, as each job is scheduled we update |
| | 120 | RAF(P) as if the job had run for one scheduling period. |