| | 75 | |
| | 76 | === Per-resource-type backoff === |
| | 77 | |
| | 78 | We need to handle the situation where e.g. there's a GPU shortfall |
| | 79 | but no projects are supplying GPU work |
| | 80 | (for either permanent or transient reasons). |
| | 81 | We don't want an overall work-fetch backoff from those projects. |
| | 82 | |
| | 83 | Instead, we maintain a separate backoff timer per (project, resource type). |
| | 84 | The backoff interval is doubled up to a limit whenever we ask for work of that type and don't get any work; |
| | 85 | it's cleared whenever we get a job of that type. |
| | 86 | |
| | 87 | There is still an overall backoff timer for each project. |
| | 88 | This is triggered by: |
| | 89 | * requests from the project |
| | 90 | * RPC failures |
| | 91 | * job errors |
| | 92 | and so on. |
| | 93 | |
| | 94 | *** Question: If we need to contact a project for tasks of two different types, and one of the backoffs is satisfied, do we ask for both types? |
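
The per-(project, resource type) backoff described above can be sketched roughly as follows. This is only an illustration, not the client's actual code: the names `ResourceBackoff`, `Project`, `MIN_BACKOFF`, and `MAX_BACKOFF` are hypothetical, and the starting interval and the limit are assumed values, since the text only says the interval is "doubled up to a limit".

```python
from dataclasses import dataclass, field

MIN_BACKOFF = 60.0      # seconds; assumed first interval, not given in the design
MAX_BACKOFF = 86400.0   # seconds; assumed value for "the limit"


@dataclass
class ResourceBackoff:
    """Backoff state for one (project, resource type) pair."""
    interval: float = 0.0   # current interval; 0 means no backoff in effect
    until: float = 0.0      # time before which we don't ask for this type

    def request_failed(self, now: float) -> None:
        # We asked for work of this type and got none: double, up to the limit.
        self.interval = min(max(self.interval * 2, MIN_BACKOFF), MAX_BACKOFF)
        self.until = now + self.interval

    def got_job(self) -> None:
        # We received a job of this type: clear the backoff.
        self.interval = 0.0
        self.until = 0.0

    def backed_off(self, now: float) -> bool:
        return now < self.until


@dataclass
class Project:
    name: str
    overall_until: float = 0.0   # overall backoff (project request, RPC failure, job errors)
    per_resource: dict = field(default_factory=dict)

    def backoff(self, resource: str) -> ResourceBackoff:
        return self.per_resource.setdefault(resource, ResourceBackoff())

    def can_request(self, resource: str, now: float) -> bool:
        # Both the overall timer and the per-resource timer must have expired.
        return now >= self.overall_until and not self.backoff(resource).backed_off(now)


p = Project("P")
p.backoff("gpu").request_failed(now=0.0)    # interval 60, backed off until t=60
p.backoff("gpu").request_failed(now=60.0)   # interval 120, backed off until t=180
gpu_blocked = not p.can_request("gpu", now=100.0)
cpu_allowed = p.can_request("cpu", now=100.0)
```

In this sketch a GPU shortfall at a project with no GPU work only delays further GPU requests to that project; CPU requests to the same project are unaffected unless the overall timer is also running.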
| 102 | | We propose the following: |
| 103 | | |
| 104 | | * For each project P and resource R there is a boolean flag D(P, R) indicating whether P should accumulate debt for R. The idea is that if D(P,R) is true, then it's likely that P would supply a job for R if we asked it. |
| 105 | | * D(P, R) is initially false. |
| 106 | | * If P supplies a job for R, D(P,R) is set to true. |
| 107 | | * If we send P a request that doesn't return any jobs, then for each resource R for which req_seconds(R)>0, D(P,R) is set to false. |
| 108 | | *** Proposed change. This could be too sensitive to temporary outages. Why not have the project report whether the resource type is supported, separately from whether it currently has work? This would mean a three-state response: "Work is returned", "No work, but the resource is supported", and "The resource is not supported".
| 109 | | |
| 110 | | === Per-resource-type backoff === |
| 111 | | |
| 112 | | We need to handle the situation where e.g. there's a GPU shortfall |
| 113 | | but no projects are supplying GPU work |
| 114 | | (for either permanent or transient reasons). |
| 115 | | We don't want an overall work-fetch backoff from those projects. |
| 116 | | |
| 117 | | Instead, we maintain a separate backoff timer per (project, PRSC). |
| 118 | | The backoff interval is doubled up to a limit whenever we ask for work of that type and don't get any work; |
| 119 | | it's cleared whenever we get a job of that type. |
| 120 | | |
| 121 | | *** Proposed clarification - the overall contact backoff would be the minimum of the backoff for each resource type. |
| 122 | | *** Question: If the project asks for a communications backoff, and one of the resource-type backoffs would expire within the project-requested backoff, how do we handle that?
| 123 | | *** Question: If we need to contact a project for tasks of two different types, and one of the backoffs is satisfied, do we ask for both types?
| | 126 | |
| | 127 | The design is as follows. |
| | 128 | A project P accumulates debt for a resource when: |
| | 129 | * P is not backed off for that resource, and the backoff interval is not at the max. |
| | 130 | * P is not suspended via GUI, and "no more tasks" is not set |
| | 131 | |
| | 132 | The rate at which P accumulates debt is its resource share relative |
| | 133 | to all the projects satisfying the above. |
| | 134 | |
| | 135 | When an application has used N instances of a resource for a time T, |
| | 136 | its debt decreases by an amount proportional to N*T. |
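
The debt rules above can be sketched for a single resource type as follows. This is a hedged illustration rather than the client's implementation: `ProjectState`, `eligible`, `update_debts`, and the proportionality constant `k` are hypothetical names, and the choice to add `dt * share / total` per update is an assumption about units that the text does not spell out.

```python
from dataclasses import dataclass


@dataclass
class ProjectState:
    """Per-project state for one resource type (hypothetical fields)."""
    resource_share: float
    backed_off: bool = False        # backed off for this resource
    backoff_at_max: bool = False    # backoff interval has reached the limit
    suspended: bool = False         # suspended via GUI
    no_more_tasks: bool = False     # "no more tasks" is set
    debt: float = 0.0


def eligible(p: ProjectState) -> bool:
    # The two conditions under which P accumulates debt for the resource.
    return (not p.backed_off and not p.backoff_at_max
            and not p.suspended and not p.no_more_tasks)


def update_debts(projects, dt, usage=None, k=1.0):
    """Advance debts by dt seconds.

    usage maps project name -> (N, T): N instances used for T seconds.
    k is an assumed proportionality constant for the N*T charge.
    """
    usage = usage or {}
    total = sum(p.resource_share for p in projects.values() if eligible(p))
    for name, p in projects.items():
        if total > 0 and eligible(p):
            # Accumulate at a rate equal to the share relative to eligible projects.
            p.debt += dt * p.resource_share / total
        n, t = usage.get(name, (0, 0.0))
        p.debt -= k * n * t   # charge proportional to N*T


projects = {
    "A": ProjectState(resource_share=100),
    "B": ProjectState(resource_share=300),
    "C": ProjectState(resource_share=100, suspended=True),
}
update_debts(projects, dt=10.0, usage={"B": (2, 3.0)})
```

Here C is excluded from the share total because it is suspended, so A and B split the accumulation 1:3, and B is then charged for using two instances for three seconds.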