| 48 | |
| 49 | === The bottom line === |
| 50 | |
| 51 | The current approach - extending the STD/LTD model to multiple resource types - |
| 52 | seemed good at the time but turns out to be the wrong way. |
| 53 | |
| 54 | == Proposal: credit-driven scheduling == |
| 55 | |
| 56 | The idea is to make resource share apply to credit. |
| 57 | If two projects have the same resource share, |
| 58 | they should have the same RAC. |
| 59 | This suggests the following principle, |
| 60 | which can apply to both work fetch and job scheduling: |
| 61 | |
| 62 | * Normalize RAC and resource share so that each one sums to 1 across projects. |
| 63 | * For a project P, let G(P) = share(P) - RAC(P). |
| 64 | * Give priority to projects for which G(P) is highest, |
| 65 | i.e. that aren't getting as much credit as they should. |
| 66 | |
| 67 | This does 2 things: |
| 68 | |
| 69 | * It's the correct semantics for resource share: |
| 70 | they now control something that volunteers can actually see, |
| 71 | namely credit. |
| 72 | * It penalizes projects that grant inflated credit: |
| 73 | the more credit a project grants, the less work a given host |
| 74 | will do for it, assuming the host is attached to multiple projects. |
| 75 | (The converse is potentially true - a project would get more work done |
| 76 | by granting less credit. This is minimized by a mechanism described below.) |
| 77 | |
| 78 | Note: I've glossed over the issue of the time scale over which RAC is averaged. |
| 79 | The RAC reported by servers has a half-life of a week. |
| 80 | For purposes of scheduling a different (probably longer) period would be better. |
| 81 | The client could potentially compute its own RAC |
| 82 | based on changes in total credit. |
| 83 | However, it's probably OK to just use the server-reported RAC. |
| 84 | |
| 85 | === Recent average FLOPS === |
| 86 | |
| 87 | There are some problems with credit-driven scheduling: |
| 88 | |
| 89 | * There may be a long and variable delay between completing a job |
| 90 | and getting credit for it. |
| 91 | * Jobs may fail to get credit, e.g. because they don't validate. |
| 92 | |
| 93 | To deal with these issues, |
| 94 | I propose using not just RAC by itself, |
| 95 | but the combination of RAC and '''recent average FLOPS''' (RAF) per project. |
| 96 | This is intended to address the above 2 issues, |
| 97 | and the issue of projects that grant too little credit. |
| 98 | |
| 99 | === Work fetch === |
| 100 | |
| 101 | In addition to G(P), the work fetch policy also needs to take into account |
| 102 | the amount of work currently queued for a project, |
| 103 | so that it doesn't keep getting work from the same project. |
| 104 | To accomplish this, we define Q(P) as the number of FLOPS currently queued for P, |
| 105 | normalized so that the sum over projects is 1. |
| 106 | |
| 107 | We then define the work-fetch priority of a project as |
| 108 | {{{ |
| 109 | WFP(P) = share(P) - (RAC(P) + A*RAF(P) + B*Q(P))/(1+A+B) |
| 110 | }}} |
| 111 | |
| 112 | Where A and B are parameters, probably around 1. |
| 113 | |
| 114 | === Job scheduling === |
| 115 | |
| 116 | As the job scheduling policy picks jobs to run (e.g. on a multiprocessor) |
| 117 | it needs to take into account the jobs already scheduled, |
| 118 | so that it doesn't always schedule multiple jobs from the same project. |
| 119 | To accomplish this, as each job is scheduled we update |
| 120 | RAF(P) as if the job had run for one scheduling period. |