| | 28 | == Example == |
| | 29 | |
| | 30 | Suppose that: |
| | 31 | * Project A has only GPU jobs and project B has both GPU and CPU jobs. |
| | 32 | * A host is attached to projects A and B with equal resource shares. |
| | 33 | * The host's GPU is twice as fast as its CPU. |
| | 34 | |
| | 35 | In this case, the target behavior is for the host to use |
| | 36 | 100% of the CPU for project B, |
| | 37 | 25% of the GPU for project B, |
| | 38 | and 75% of the GPU for project A. |
| | 39 | This provides equal processing to the two projects (1.5 CPU-equivalents each, since the GPU counts as two CPUs). |
| | 40 | |
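The arithmetic behind this example can be checked with a small sketch. It is illustrative only: the speed units, `processing`, and `target_split` are hypothetical names, and the only inputs taken from the example are the 2:1 GPU-to-CPU speed ratio and the equal resource shares.

```python
# Speeds in arbitrary units; the example only fixes the 2:1 ratio.
CPU_SPEED = 1.0
GPU_SPEED = 2.0


def processing(cpu_frac, gpu_frac):
    """Processing a project receives from the given fractions of each resource."""
    return cpu_frac * CPU_SPEED + gpu_frac * GPU_SPEED


def target_split():
    """Return (cpu_fraction, gpu_fraction) per project for equal processing."""
    total = CPU_SPEED + GPU_SPEED
    per_project = total / 2  # equal resource shares
    # Project A has no CPU jobs, so project B takes the whole CPU.
    # The remainder of B's half has to come from the GPU.
    b_gpu = (per_project - CPU_SPEED) / GPU_SPEED
    a_gpu = 1.0 - b_gpu
    return {"A": (0.0, a_gpu), "B": (1.0, b_gpu)}


split = target_split()
for name, (cpu_frac, gpu_frac) in split.items():
    print(name, cpu_frac, gpu_frac, processing(cpu_frac, gpu_frac))
```

Under these assumptions project B ends up with 25% of the GPU and project A with 75%, and both receive 1.5 units of processing.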
| 51 | | For compatibility with old servers, the message still has '''work_req_seconds'''; |
| 52 | | this is the max of (cpu,cuda)_req_seconds. |
| | 61 | For compatibility with old servers, the message still has '''work_req_seconds''', |
| | 62 | which is the max of (cpu,cuda)_req_seconds. |
| | 63 | |
| | 64 | The semantics are: a scheduler should send jobs for a resource type |
| | 65 | only if the request for that type is nonzero. |
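A minimal sketch of these semantics follows, assuming the request is held in a plain dictionary. The field names `cpu_req_seconds`, `cuda_req_seconds`, and `work_req_seconds` come from the text above; `build_request` and `resource_types_to_send` are hypothetical helpers, not actual client or scheduler functions.

```python
def build_request(cpu_req_seconds, cuda_req_seconds):
    """Build a request dict; work_req_seconds is kept for old servers."""
    return {
        "cpu_req_seconds": cpu_req_seconds,
        "cuda_req_seconds": cuda_req_seconds,
        # Compatibility field: the max of the per-resource requests.
        "work_req_seconds": max(cpu_req_seconds, cuda_req_seconds),
    }


def resource_types_to_send(request):
    """Resource types a scheduler should send jobs for (nonzero requests only)."""
    return [
        name
        for name in ("cpu", "cuda")
        if request[name + "_req_seconds"] > 0
    ]


request = build_request(cpu_req_seconds=0.0, cuda_req_seconds=3600.0)
print(request["work_req_seconds"], resource_types_to_send(request))
```

In this example the scheduler would send only GPU jobs, while an old server reading `work_req_seconds` would still see a request for 3600 seconds of work.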
| 65 | | The notion of long-term debt |
| | 78 | === Long-term debt === |
| | 79 | |
| | 80 | We'll continue to use the idea of '''long-term debt''' (LTD). |
| | 81 | LTD represents how much work is "owed" to each project. |
| | 82 | A project's LTD increases over time in proportion to its resource share, |
| | 83 | and decreases as the project uses resources. |
| | 84 | Simplified summary: when we need work for a resource, |
| | 85 | we ask the project that may have that type of job and whose LTD is greatest. |
| | 86 | |
| | 87 | The idea of using recent average credit (RAC) as a surrogate for LTD was set aside for various reasons. |
| | 88 | |
| | 89 | The notion of LTD needs to span resources; |
| | 90 | otherwise, in the above example, projects A and B would each get 50% of the GPU. |
| | 91 | |
| | 92 | On the other hand, if there's a single cross-resource LTD, |
| | 93 | and only one project has GPU jobs, |
| | 94 | then that project's LTD would decrease without bound, |
| | 95 | and the other projects' LTDs would increase without bound. |
| | 96 | This is undesirable. |
| | 97 | It could be fixed by limiting the LTD to a finite range, |
| | 98 | but this would lose information. |
| | 99 | |
| | 100 | So the current plan is: |
| | 101 | |
| | 102 | * There is a separate LTD for each resource. |
| | 103 | * The "overall LTD", which is used in the work-fetch decision, is the sum of the resource LTDs, weighted by the speed of the resource (FLOPs per instance-second). |
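The plan above can be sketched as follows. This is an illustration under stated assumptions, not the client's implementation: the per-resource LTD is assumed to be measured in instance-seconds, and the project dictionaries, `overall_ltd`, and `pick_project` are hypothetical.

```python
def overall_ltd(resource_ltd, flops_per_instance_second):
    """Sum of per-resource LTDs, each weighted by that resource's speed."""
    return sum(
        ltd * flops_per_instance_second[resource]
        for resource, ltd in resource_ltd.items()
    )


def pick_project(projects, resource, flops_per_instance_second):
    """Pick the project that may have jobs for `resource` and has the greatest overall LTD."""
    candidates = [p for p in projects if resource in p["job_types"]]
    if not candidates:
        return None
    best = max(
        candidates,
        key=lambda p: overall_ltd(p["ltd"], flops_per_instance_second),
    )
    return best["name"]


# Hypothetical host: the GPU is twice as fast as the CPU.
speeds = {"cpu": 1e9, "gpu": 2e9}
projects = [
    {"name": "A", "job_types": {"gpu"}, "ltd": {"cpu": 10.0, "gpu": -5.0}},
    {"name": "B", "job_types": {"cpu", "gpu"}, "ltd": {"cpu": -10.0, "gpu": 8.0}},
]
print(pick_project(projects, "gpu", speeds))
print(pick_project(projects, "cpu", speeds))
```

Here project A's overall LTD is 0 and project B's is 6e9, so B is asked for GPU work. B is also the only candidate for CPU work, because A has no CPU jobs.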
| | 104 | |
| | 105 | |