| 114 | | {{{ |
| 115 | | <reliable_max_avg_turnaround>secs</reliable_max_avg_turnaround> |
| 116 | | <reliable_max_error_rate>X</reliable_max_error_rate> |
| 117 | | }}} |
| 118 | | Hosts whose average turnaround is at most reliable_max_avg_turnaround |
| 119 | | and whose error rate is at most reliable_max_error_rate |
| 120 | | are considered 'reliable'. |
| 121 | | {{{ |
| 122 | | <reliable_reduced_delay_bound>X</reliable_reduced_delay_bound> |
| 123 | | }}} |
| 124 | | When a result is sent to a reliable host, multiply the delay bound by reliable_reduced_delay_bound (typically 0.5 or so). |
| 125 | | {{{ |
| 126 | | <reliable_on_priority>X</reliable_on_priority> |
| 127 | | <reliable_priority_on_over>X</reliable_priority_on_over> |
| 128 | | <reliable_priority_on_over_except_error>X</reliable_priority_on_over_except_error> |
| 129 | | }}} |
| 130 | | Results with priority at least '''reliable_on_priority''' will be sent only to reliable hosts. |
| 131 | | Increase priority of duplicate results by '''reliable_priority_on_over'''; |
| 132 | | increase priority of duplicates caused by timeout (not error) by '''reliable_priority_on_over_except_error'''. |
| 133 | | |
| | 138 | == Scheduling: accelerating retries == |
| | 139 | |
| | 140 | The goal of this mechanism (which works with job-cache and matchmaker scheduling, |
| | 141 | but not locality scheduling) is to send timeout-generated retries to |
| | 142 | hosts that are likely to finish them fast. |
| | 143 | Here's how it works: |
| | 144 | * Hosts are deemed "reliable" (a slight misnomer) if they satisfy turnaround time and error rate criteria. |
| | 145 | * A job instance is deemed "need-reliable" if its priority is at or above a threshold ('''reliable_on_priority'''). |
| | 146 | * The scheduler tries to send need-reliable jobs to reliable hosts. When it does, it reduces the delay bound of the job. |
| | 147 | * When job replicas are created in response to errors or timeouts, their priority is raised relative to the job's base priority. |
| | 148 | |
| | 149 | The configurable parameters are: |
| | 150 | {{{ |
| | 151 | <reliable_on_priority>X</reliable_on_priority> |
| | 152 | }}} |
| | 153 | Results with priority at least '''reliable_on_priority''' are treated as "need-reliable". |
| | 154 | With matchmaker scheduling, they'll be sent preferentially to reliable hosts; |
| | 155 | with job-cache scheduling, they'll be sent ONLY to reliable hosts. |
| | 156 | |
| | 157 | {{{ |
| | 158 | <reliable_max_avg_turnaround>secs</reliable_max_avg_turnaround> |
| | 159 | <reliable_max_error_rate>X</reliable_max_error_rate> |
| | 160 | }}} |
| | 161 | Hosts whose average turnaround is at most '''reliable_max_avg_turnaround''' |
| | 162 | and whose error rate is at most '''reliable_max_error_rate''' are considered "reliable". |
| | 163 | Make sure you set these loosely enough (that is, high enough) that a significant fraction (e.g. 25%) of your hosts qualify. |
| | 164 | {{{ |
| | 165 | <reliable_reduced_delay_bound>X</reliable_reduced_delay_bound> |
| | 166 | }}} |
| | 167 | When a need-reliable result is sent to a reliable host, |
| | 168 | multiply the delay bound by '''reliable_reduced_delay_bound''' (typically 0.5 or so). |
| | 169 | {{{ |
| | 170 | <reliable_priority_on_over>X</reliable_priority_on_over> |
| | 171 | <reliable_priority_on_over_except_error>X</reliable_priority_on_over_except_error> |
| | 172 | }}} |
| | 173 | |
| | 174 | If '''reliable_priority_on_over''' is nonzero, |
| | 175 | increase the priority of duplicate jobs by that amount over the job's base priority. |
| | 176 | Otherwise, if '''reliable_priority_on_over_except_error''' is nonzero, |
| | 177 | increase the priority of duplicates caused by timeout (not error) by that amount. |
| | 178 | (Typically only one of these is nonzero, and it is set equal to '''reliable_on_priority''' so that the boosted retries reach the need-reliable threshold.) |
| | 179 | |
| | 180 | NOTE: this mechanism can be used to preferentially send ANY job, |
| | 181 | not just retries, to fast/reliable hosts. |
| | 182 | To do so, set the workunit's priority to '''reliable_on_priority''' or greater. |
| | 183 | |
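The rules above can be sketched in a few lines of Python. This is only an illustration of the logic as described on this page, not the scheduler's actual code: the `Config` class, the function names, and the numeric values are all hypothetical, and it assumes a retry's priority is computed as the job's base priority plus the configured boost.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Field names mirror the config options above; the class itself is hypothetical.
    reliable_on_priority: int
    reliable_max_avg_turnaround: float  # seconds
    reliable_max_error_rate: float
    reliable_reduced_delay_bound: float
    reliable_priority_on_over: int = 0
    reliable_priority_on_over_except_error: int = 0


def host_is_reliable(cfg, avg_turnaround, error_rate):
    """A host is 'reliable' if it meets both the turnaround and error-rate limits."""
    return (avg_turnaround <= cfg.reliable_max_avg_turnaround
            and error_rate <= cfg.reliable_max_error_rate)


def needs_reliable(cfg, priority):
    """A result is 'need-reliable' if its priority is at least the threshold."""
    return priority >= cfg.reliable_on_priority


def effective_delay_bound(cfg, delay_bound, priority, reliable_host):
    """Shrink the delay bound only for need-reliable results sent to reliable hosts."""
    if needs_reliable(cfg, priority) and reliable_host:
        return delay_bound * cfg.reliable_reduced_delay_bound
    return delay_bound


def retry_priority(cfg, base_priority, caused_by_error):
    """Priority of a duplicate result, following the 'if ... otherwise ...' rule above."""
    if cfg.reliable_priority_on_over:
        return base_priority + cfg.reliable_priority_on_over
    if cfg.reliable_priority_on_over_except_error and not caused_by_error:
        return base_priority + cfg.reliable_priority_on_over_except_error
    return base_priority


# Illustrative values only: boost timeout retries up to the need-reliable threshold.
cfg = Config(
    reliable_on_priority=10,
    reliable_max_avg_turnaround=86400,
    reliable_max_error_rate=0.05,
    reliable_reduced_delay_bound=0.5,
    reliable_priority_on_over_except_error=10,
)
prio = retry_priority(cfg, base_priority=0, caused_by_error=False)
print(needs_reliable(cfg, prio))
print(effective_delay_bound(cfg, 604800, prio, host_is_reliable(cfg, 3600, 0.01)))
```

With these example values, a retry caused by a timeout is raised to the threshold and has its delay bound halved when it goes to a reliable host. A retry caused by an error keeps its base priority and the full delay bound.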