== Example ==

Suppose that:
* Project A has only GPU jobs and project B has both GPU and CPU jobs.
* A host is attached to projects A and B with equal resource shares.
* The host's GPU is twice as fast as its CPU.

In this case, the target behavior is for the host to use
100% of the CPU for project B,
25% of the GPU for project B,
and 75% of the GPU for project A.
This gives the two projects equal processing:
taking the CPU's speed as 1 unit, the GPU's speed is 2 units,
so B gets 1 + 0.25 × 2 = 1.5 units and A gets 0.75 × 2 = 1.5 units.

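The split in the example can be checked with a short sketch. The function name `gpu_split` and the speed units are illustrative assumptions, not part of the client; the sketch only covers the example's setup (A runs only on the GPU, B gets the whole CPU, equal resource shares).

```python
from fractions import Fraction

def gpu_split(cpu_speed, gpu_speed):
    """Return (share_A, share_B): the fraction of the GPU each project gets.

    Assumes the example's setup: A runs only on the GPU, B gets the
    whole CPU, and both projects have equal resource shares.
    """
    total = Fraction(cpu_speed) + Fraction(gpu_speed)
    target = total / 2            # equal shares: half of total throughput each
    b_gpu = target - cpu_speed    # what B is still owed after using the CPU
    if b_gpu < 0:
        b_gpu = Fraction(0)       # the CPU alone already exceeds B's half
    a_gpu = gpu_speed - b_gpu
    return a_gpu / gpu_speed, b_gpu / gpu_speed

share_a, share_b = gpu_split(cpu_speed=1, gpu_speed=2)
print(share_a, share_b)  # 3/4 1/4
```

With a GPU twice as fast as the CPU this reproduces the 75% / 25% split. If the CPU were faster than the GPU, the sketch would give A the entire GPU and the two projects could no longer be equalized.
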
For compatibility with old servers, the message still has '''work_req_seconds''',
which is the maximum of '''cpu_req_seconds''' and '''cuda_req_seconds'''.

The semantics are: a scheduler should send jobs for a resource type
only if the request for that type is nonzero.

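A minimal sketch of these semantics follows. A plain dict stands in for the request message, and both helper names are hypothetical; the actual message format and scheduler code are not shown here.

```python
def legacy_work_req_seconds(cpu_req_seconds, cuda_req_seconds):
    """Value sent as work_req_seconds for old servers: the larger request."""
    return max(cpu_req_seconds, cuda_req_seconds)

def resource_types_to_send(request):
    """Resource types a scheduler may send jobs for: those with a nonzero request."""
    types = []
    if request.get("cpu_req_seconds", 0) > 0:
        types.append("cpu")
    if request.get("cuda_req_seconds", 0) > 0:
        types.append("cuda")
    return types

# Hypothetical request: the client wants GPU work only.
request = {"cpu_req_seconds": 0.0, "cuda_req_seconds": 3600.0}
request["work_req_seconds"] = legacy_work_req_seconds(
    request["cpu_req_seconds"], request["cuda_req_seconds"])
print(request["work_req_seconds"])      # 3600.0
print(resource_types_to_send(request))  # ['cuda']
```

An old server would see only the 3600-second '''work_req_seconds''' value, while a new one would send GPU jobs and no CPU jobs.
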
=== Long-term debt ===

We'll continue to use the idea of '''long-term debt''' (LTD).
LTD represents how much work is "owed" to each project.
A project's LTD increases over time in proportion to its resource share,
and decreases as the project uses resources.
In simplified form: when we need work for a resource,
we ask the project that may have that type of job and whose LTD is greatest.

The idea of using recent average credit (RAC) as a surrogate for LTD was set aside for various reasons.

The notion of LTD needs to span resources;
otherwise, in the above example, projects A and B would each get 50% of the GPU.

On the other hand, if there's a single cross-resource LTD
and only one project has GPU jobs,
then that project's LTD would become unboundedly negative
and the others' unboundedly positive.
This is undesirable.
It could be fixed by limiting the LTD to a finite range,
but this would lose information.

So the current plan is:

* There is a separate LTD for each resource.
* The "overall LTD", which is used in the work-fetch decision, is the sum of the resource LTDs, weighted by the speed of the resource (FLOPs per instance-second).

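The plan above can be sketched as follows. The data layout, the function names, and the sample numbers are all hypothetical. The sketch assumes per-resource LTD is kept in instance-seconds, so that weighting by FLOPs per instance-second puts every term in the same unit.

```python
def overall_ltd(resource_ltd, flops_per_instance):
    """Sum of per-resource LTDs, each weighted by that resource's speed."""
    return sum(ltd * flops_per_instance[res] for res, ltd in resource_ltd.items())

def pick_project(projects, flops_per_instance, resource):
    """Among projects that may have jobs for `resource`, pick the greatest overall LTD."""
    candidates = [p for p in projects if resource in p["job_types"]]
    if not candidates:
        return None
    return max(candidates, key=lambda p: overall_ltd(p["ltd"], flops_per_instance))

# Hypothetical numbers: the GPU is twice as fast as the CPU.
flops = {"cpu": 1.0, "gpu": 2.0}
projects = [
    {"name": "A", "job_types": {"gpu"}, "ltd": {"cpu": 0.0, "gpu": -10.0}},
    {"name": "B", "job_types": {"cpu", "gpu"}, "ltd": {"cpu": -5.0, "gpu": 15.0}},
]
print(pick_project(projects, flops, "gpu")["name"])  # B
```

Here A's overall LTD is -20 and B's is 25, so B is asked for GPU work even though it has been using the CPU. This is the cross-resource effect that a purely per-resource decision would miss.
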