= Client scheduling changes =

Design document for changes to the client work fetch and job scheduling policies,
started Oct 2010.

This supersedes the following design docs:
* GpuWorkFetch
* GpuSched
* ClientSched

== Problems with current system ==

The current policies, described [GpuWorkFetch here],
maintain a long-term debt (LTD) and a short-term debt (STD) for each
(project, resource type) pair.
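
For concreteness, here is a minimal sketch of the per-(project, resource type) debt state this implies.
The names and types are illustrative, not the client's actual data structures:

{{{
#include <string>
#include <vector>

// Illustrative only; the real client keeps debts in its own
// per-project, per-resource-type work fetch structures.
struct ResourceDebt {
    double short_term_debt;   // STD: drives job scheduling
    double long_term_debt;    // LTD: drives work fetch
};

struct Project {
    std::string name;
    double resource_share;
    std::vector<ResourceDebt> debt;   // one entry per resource type (CPU, GPU, ...)
};
}}}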

Job scheduling for a given resource type is based on STD.
Projects with greater STD for the resource are given priority.
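
As a rough sketch (building on the illustrative structs above, and omitting deadlines, ties,
and the other rules of the real scheduler), the STD-based priority rule amounts to:

{{{
// Pick the project with the greatest short-term debt for a given
// resource type. Illustrative only.
Project* highest_std_project(std::vector<Project*>& projects, int rsc_type) {
    Project* best = nullptr;
    for (Project* p : projects) {
        if (!best || p->debt[rsc_type].short_term_debt >
                     best->debt[rsc_type].short_term_debt) {
            best = p;
        }
    }
    return best;
}
}}}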

Work fetch is based on a weighted sum of LTDs.
Work is typically fetched from the project for which this sum is greatest,
and typically work is requested for all resource types.
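
The work-fetch priority can be pictured as follows; the weight vector is a placeholder,
since the actual weighting is defined in the client's work fetch code rather than here:

{{{
// Work-fetch priority of a project: weighted sum of its LTDs
// across resource types. Illustrative only.
double work_fetch_priority(const Project& p, const std::vector<double>& weight) {
    double sum = 0;
    for (size_t i = 0; i < p.debt.size(); i++) {
        sum += weight[i] * p.debt[i].long_term_debt;
    }
    return sum;
}
}}}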

These policies fail to meet their goals in many cases.
Here are two scenarios that illustrate the underlying problems:

=== Example 1 ===

A host has a fast GPU and a slow CPU.
Project A has apps for both GPU and CPU.
Project B has apps only for CPU.
The projects have equal resource shares.

In the current system each project will get 50% of the CPU.
The target behavior, which matches resource shares better,
is that project B gets 100% of the CPU
and project A gets 100% of the GPU.

=== Example 2 ===

Same host as in Example 1.
An additional project C has only CPU apps.

In this case A's CPU LTD will stay around zero,
while the CPU LTDs of B and C go unboundedly negative
and get clamped at the cutoff.
All information about the relative debt of B and C is lost.
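
A small numeric illustration of this information loss; the cutoff value is hypothetical,
not the client's actual constant:

{{{
#include <algorithm>
#include <cstdio>

int main() {
    const double DEBT_CUTOFF = -86400.0;   // hypothetical clamp value
    double ltd_b = -2e6, ltd_c = -9e6;     // B and C have very different LTDs
    ltd_b = std::max(ltd_b, DEBT_CUTOFF);
    ltd_c = std::max(ltd_c, DEBT_CUTOFF);
    // Both are now equal to DEBT_CUTOFF; the relative debt of B and C
    // can no longer be recovered.
    printf("B: %.0f  C: %.0f\n", ltd_b, ltd_c);
    return 0;
}
}}}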