| 75 | |
| 76 | === Per-resource-type backoff === |
| 77 | |
| 78 | We need to handle the situation where e.g. there's a GPU shortfall |
| 79 | but no projects are supplying GPU work |
| 80 | (for either permanent or transient reasons). |
| 81 | We don't want an overall work-fetch backoff from those projects. |
| 82 | |
| 83 | Instead, we maintain a separate backoff timer per (project, resource type). |
| 84 | The backoff interval is doubled up to a limit whenever we ask for work of that type and don't get any work; |
| 85 | it's cleared whenever we get a job of that type. |
| 86 | |
| 87 | There is still an overall backoff timer for each project. |
| 88 | This is triggered by: |
| 89 | * requests from the project |
| 90 | * RPC failures |
| 91 | * job errors |
| 92 | and so on. |
| 93 | |
| 94 | *** Question: If we need to contact a project for tasks of two different types, and one of the backoffs is satisfied, do we ask for both types?
102 | | We propose the following: |
103 | | |
104 | | * For each project P and resource R there is a boolean flag D(P, R) indicating whether P should accumulate debt for R. The idea is that if D(P,R) is true, then it's likely that P would supply a job for R if we asked it. |
105 | | * D(P, R) is initially false. |
106 | | * If P supplies a job for R, D(P,R) is set to true. |
107 | | * If we send P a request that doesn't return any jobs, then for each resource R for which req_seconds(R)>0, D(P,R) is set to false. |
108 | | *** Proposed change. This could be too sensitive to temporary outages. Why not have the project report whether it supports the resource type at all, separately from whether it currently has work for it? This would mean a three-state response: "Work is returned", "No work, but the resource is supported", and "The resource is not supported". |
109 | | |
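The D(P, R) rules above can be sketched as a small state holder. This is only an illustration of the proposal, not the client's actual code: the class name `DebtEligibility`, the method names, and the dictionary-based request/reply shapes are all hypothetical.

```python
class DebtEligibility:
    """Tracks the proposed D(P, R) flag per (project, resource). Hypothetical sketch."""

    def __init__(self):
        # Missing entries count as False, matching "D(P, R) is initially false".
        self._flags = {}

    def should_accumulate(self, project, resource):
        return self._flags.get((project, resource), False)

    def on_reply(self, project, req_seconds, jobs_by_resource):
        """Update flags after a scheduler reply.

        req_seconds: resource -> seconds of work requested (assumed shape).
        jobs_by_resource: resource -> number of jobs returned (assumed shape).
        """
        # If P supplies a job for R, D(P, R) becomes true.
        for resource, count in jobs_by_resource.items():
            if count > 0:
                self._flags[(project, resource)] = True

        # If the request returned no jobs at all, clear D(P, R) for every
        # resource we actually asked for (req_seconds(R) > 0).
        if sum(jobs_by_resource.values()) == 0:
            for resource, seconds in req_seconds.items():
                if seconds > 0:
                    self._flags[(project, resource)] = False
```

Note that a resource with `req_seconds` of zero keeps its flag when a request comes back empty, since nothing was learned about it.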
110 | | === Per-resource-type backoff === |
111 | | |
112 | | We need to handle the situation where e.g. there's a GPU shortfall |
113 | | but no projects are supplying GPU work |
114 | | (for either permanent or transient reasons). |
115 | | We don't want an overall work-fetch backoff from those projects. |
116 | | |
117 | | Instead, we maintain a separate backoff timer per (project, PRSC). |
118 | | The backoff interval is doubled up to a limit whenever we ask for work of that type and don't get any work; |
119 | | it's cleared whenever we get a job of that type. |
120 | | |
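A minimal sketch of one such timer, kept per (project, resource type) pair, might look like the following. The text only says the interval is "doubled up to a limit", so the starting interval and the limit used here (60 seconds and one day) are assumptions, as are the class and method names.

```python
class ResourceBackoff:
    """Backoff timer for one (project, resource type) pair. Hypothetical sketch."""

    def __init__(self, min_interval=60.0, max_interval=86400.0):
        # Both bounds are assumed values, not taken from the design note.
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.interval = 0.0   # 0 means "not backed off"
        self.until = 0.0      # time at which the backoff expires

    def on_no_work(self, now):
        """We asked for work of this type and got none: double the interval."""
        if self.interval == 0.0:
            self.interval = self.min_interval
        else:
            self.interval = min(2.0 * self.interval, self.max_interval)
        self.until = now + self.interval

    def on_job_received(self):
        """We got a job of this type: clear the backoff."""
        self.interval = 0.0
        self.until = 0.0

    def is_backed_off(self, now):
        return now < self.until

    def at_max(self):
        return self.interval >= self.max_interval


# One timer per (project, resource type), created on first use.
timers = {}

def timer_for(project, resource):
    return timers.setdefault((project, resource), ResourceBackoff())
```

Passing `now` explicitly rather than reading the clock inside the class keeps the sketch easy to exercise with fixed times.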
121 | | *** Proposed clarification - the overall contact backoff would be the minimum of the backoff for each resource type. |
122 | | *** Question: If the project asks for a communications backoff, and one of the resource-type backoffs would expire within the project-requested backoff, how do we handle that? |
123 | | *** Question: If we need to contact a project for tasks of two different types, and one of the backoffs is satisfied, do we ask for both types? |
| 126 | |
| 127 | The design is as follows. |
| 128 | A project P accumulates debt for a resource when: |
| 129 | * P is not backed off for that resource, and the backoff interval is not at the max. |
| 130 | * P is not suspended via GUI, and "no more tasks" is not set |
| 131 | |
| 132 | The rate at which P accumulates debt is its resource share relative |
| 133 | to all the projects satisfying the above. |
| 134 | |
| 135 | When an application has used N instances of a resource for a time T, |
| 136 | its debt decreases by an amount proportional to N*T. |
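As a rough illustration of these rules for a single resource, the sketch below credits debt to eligible projects in proportion to resource share and charges debt for usage. The names `ProjectState`, `accumulate`, and `charge` are hypothetical, and the constant of proportionality `k` is assumed to be 1 because the text does not specify it.

```python
from dataclasses import dataclass


@dataclass
class ProjectState:
    """Per-project state for one resource type. Hypothetical sketch."""
    resource_share: float
    backed_off: bool = False       # backed off for this resource
    backoff_at_max: bool = False   # backoff interval has reached its limit
    suspended: bool = False        # suspended via GUI
    no_more_tasks: bool = False    # "no more tasks" is set
    debt: float = 0.0

    def eligible(self):
        # A project accumulates debt only if none of the blocking conditions hold.
        return not (self.backed_off or self.backoff_at_max
                    or self.suspended or self.no_more_tasks)


def accumulate(projects, dt):
    """Credit debt over an interval dt, split by share among eligible projects."""
    eligible = [p for p in projects if p.eligible()]
    total_share = sum(p.resource_share for p in eligible)
    if total_share <= 0:
        return
    for p in eligible:
        p.debt += dt * p.resource_share / total_share


def charge(project, instances, seconds, k=1.0):
    """Reduce debt by an amount proportional to N*T (k is an assumed constant)."""
    project.debt -= k * instances * seconds


a = ProjectState(resource_share=100.0)
b = ProjectState(resource_share=300.0)
c = ProjectState(resource_share=100.0, suspended=True)
accumulate([a, b, c], dt=8.0)       # c is excluded, so shares are 100:300
charge(b, instances=2, seconds=1.5)
print(a.debt, b.debt, c.debt)       # 2.0 3.0 0.0
```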