= Client scheduling changes =

Design document for changes to the client work fetch and job scheduling policies,
started Oct 2010.

This supersedes the following design docs:
* GpuWorkFetch
* GpuSched
* ClientSched

== Problems with current system ==

The current policies, described [GpuWorkFetch here],
maintain a long-term debt (LTD) and a short-term debt (STD) for each
(project, resource type) pair.

Job scheduling for a given resource type is based on STD:
projects with greater STD for that resource are given priority.
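
A rough sketch of this rule is shown below; the struct and field names are invented for illustration and are not the actual client code:
{{{
#!cpp
// Illustrative sketch only: pick the runnable project with the greatest
// short-term debt (STD) for a given resource type.
#include <vector>

struct PROJECT {
    double std_per_rsc[2];   // STD for each resource type (0 = CPU, 1 = GPU)
    bool runnable[2];        // does the project have a runnable job for the resource?
};

// Return the project to schedule next on the given resource,
// or nullptr if no project has a runnable job for it.
PROJECT* choose_project(std::vector<PROJECT*>& projects, int rsc_type) {
    PROJECT* best = nullptr;
    for (size_t i = 0; i < projects.size(); i++) {
        PROJECT* p = projects[i];
        if (!p->runnable[rsc_type]) continue;
        if (!best || p->std_per_rsc[rsc_type] > best->std_per_rsc[rsc_type]) {
            best = p;
        }
    }
    return best;
}
}}}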

Work fetch is based on a weighted sum of LTDs.
Work is typically fetched from the project for which this sum is greatest,
and is typically requested for all resource types.
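
Again as an illustrative sketch, with invented field names and weights rather than the real client data structures, the selection might look like:
{{{
#!cpp
// Illustrative sketch only: choose the project to ask for work by
// comparing a weighted sum of long-term debts (LTDs) across resource types.
#include <vector>

struct PROJECT {
    double ltd_per_rsc[2];   // LTD for each resource type (0 = CPU, 1 = GPU)
    bool contactable;        // not suspended, not backed off, etc.
};

// rsc_weight[i]: relative importance of resource i (e.g. its peak speed).
PROJECT* choose_fetch_project(
    std::vector<PROJECT*>& projects, const double rsc_weight[2]
) {
    PROJECT* best = nullptr;
    double best_sum = 0;
    for (size_t i = 0; i < projects.size(); i++) {
        PROJECT* p = projects[i];
        if (!p->contactable) continue;
        double sum = 0;
        for (int r = 0; r < 2; r++) {
            sum += rsc_weight[r] * p->ltd_per_rsc[r];
        }
        if (!best || sum > best_sum) {
            best = p;
            best_sum = sum;
        }
    }
    // Work would then typically be requested from 'best' for all resource types.
    return best;
}
}}}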

These policies fail to meet their goals in many cases.
Here are two scenarios that illustrate the underlying problems:

=== Example 1 ===

A host has a fast GPU and a slow CPU.
Project A has apps for both GPU and CPU.
Project B has apps only for the CPU.
The two projects have equal resource shares.

In the current system, each project will get 50% of the CPU.
The target behavior, which matches resource shares better,
is that project B gets 100% of the CPU
and project A gets 100% of the GPU.
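
To make the mismatch concrete, here is a small worked comparison; the device speeds (100 GFLOPS GPU, 10 GFLOPS CPU) are invented purely for illustration:
{{{
#!cpp
// Illustrative arithmetic only; the device speeds are made up.
#include <cstdio>

int main() {
    double gpu = 100, cpu = 10;   // hypothetical GFLOPS

    // Current policy: A gets the GPU plus half the CPU; B gets half the CPU.
    double a_cur = gpu + 0.5 * cpu;   // 105
    double b_cur = 0.5 * cpu;         // 5

    // Target policy: A gets the GPU; B gets the whole CPU.
    double a_tgt = gpu;               // 100
    double b_tgt = cpu;               // 10

    printf("current: B gets %.1f%% of total throughput\n",
        100 * b_cur / (a_cur + b_cur));   // ~4.5%
    printf("target:  B gets %.1f%% of total throughput\n",
        100 * b_tgt / (a_tgt + b_tgt));   // ~9.1%
    return 0;
}
}}}
With these made-up numbers, neither allocation gives B its nominal 50% of total throughput, but the target allocation roughly doubles B's share while costing A nothing on the GPU.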

=== Example 2 ===

Same host as in Example 1,
with an additional project C that has only CPU apps.

In this case A's CPU LTD stays around zero,
while the CPU LTDs of B and C go unboundedly negative
until they are clamped at the cutoff.
All information about the relative debt of B and C is lost.
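
The effect of the cutoff can be seen in a tiny sketch; the cutoff and debt values below are invented for illustration:
{{{
#!cpp
// Illustrative sketch only: once two debts are both clamped at the cutoff,
// their relative ordering is gone.
#include <algorithm>
#include <cstdio>

const double LTD_CUTOFF = -86400.0 * 10;   // hypothetical cutoff

int main() {
    double ltd_b = -86400.0 * 12;   // B's CPU LTD, already below the cutoff
    double ltd_c = -86400.0 * 25;   // C's CPU LTD, much further below it

    // Both are clamped to the same value...
    ltd_b = std::max(ltd_b, LTD_CUTOFF);
    ltd_c = std::max(ltd_c, LTD_CUTOFF);

    // ...so the scheduler can no longer tell B and C apart.
    printf("B: %f  C: %f\n", ltd_b, ltd_c);   // identical
    return 0;
}
}}}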