114 | | {{{ |
115 | | <reliable_max_avg_turnaround>secs</reliable_max_avg_turnaround> |
116 | | <reliable_max_error_rate>X</reliable_max_error_rate> |
117 | | }}} |
118 | | Hosts whose average turnaround is at most reliable_max_avg_turnaround |
119 | | and whose error rate is at most reliable_max_error_rate |
120 | | are considered 'reliable'. |
121 | | {{{ |
122 | | <reliable_reduced_delay_bound>X</reliable_reduced_delay_bound> |
123 | | }}} |
124 | | When a result is sent to a reliable host, multiply the delay bound by reliable_reduced_delay_bound (typically 0.5 or so). |
125 | | {{{ |
126 | | <reliable_on_priority>X</reliable_on_priority> |
127 | | <reliable_priority_on_over>X</reliable_priority_on_over> |
128 | | <reliable_priority_on_over_except_error>X</reliable_priority_on_over_except_error> |
129 | | }}} |
130 | | Results with priority at least '''reliable_on_priority''' will be sent only to reliable hosts. |
131 | | Increase priority of duplicate results by '''reliable_priority_on_over'''; |
132 | | increase priority of duplicates caused by timeout (not error) by '''reliable_priority_on_over_except_error'''. |
133 | | |
| 138 | == Scheduling: accelerating retries == |
| 139 | |
| 140 | The goal of this mechanism (which works with job-cache and matchmaker scheduling, |
| 141 | but not locality scheduling) is to send timeout-generated retries to |
| 142 | hosts that are likely to finish them fast. |
| 143 | Here's how it works: |
| 144 | * Hosts are deemed "reliable" (a slight misnomer) if they satisfy turnaround time and error rate criteria. |
| 145 | * A job instance is deemed "need-reliable" if its priority is at or above a threshold. |
| 146 | * The scheduler tries to send need-reliable jobs to reliable hosts. When it does, it reduces the delay bound of the job. |
| 147 | * When job replicas are created in response to errors or timeouts, their priority is raised relative to the job's base priority. |
| 148 | |
| 149 | The configurable parameters are: |
| 150 | {{{ |
| 151 | <reliable_on_priority>X</reliable_on_priority> |
| 152 | }}} |
| 153 | Results with priority at least '''reliable_on_priority''' are treated as "need-reliable". |
| 154 | With matchmaker scheduling, they'll be sent preferentially to reliable hosts; |
| 155 | with job-cache scheduling, they'll be sent ONLY to reliable hosts. |
| 156 | |
| 157 | {{{ |
| 158 | <reliable_max_avg_turnaround>secs</reliable_max_avg_turnaround> |
| 159 | <reliable_max_error_rate>X</reliable_max_error_rate> |
| 160 | }}} |
| 161 | Hosts whose average turnaround is at most reliable_max_avg_turnaround |
| 162 | and whose error rate is at most reliable_max_error_rate are considered 'reliable'. |
| 163 | Choose these limits so that a significant fraction (e.g. 25%) of your hosts qualify. |
| 164 | {{{ |
| 165 | <reliable_reduced_delay_bound>X</reliable_reduced_delay_bound> |
| 166 | }}} |
| 167 | When a need-reliable result is sent to a reliable host, |
| 168 | multiply the delay bound by '''reliable_reduced_delay_bound''' (typically 0.5 or so). |
| 169 | {{{ |
| 170 | <reliable_priority_on_over>X</reliable_priority_on_over> |
| 171 | <reliable_priority_on_over_except_error>X</reliable_priority_on_over_except_error> |
| 172 | }}} |
| 173 | |
| 174 | If '''reliable_priority_on_over''' is nonzero, |
| 175 | increase the priority of duplicate jobs by that amount over the job's base priority. |
| 176 | Otherwise, if '''reliable_priority_on_over_except_error''' is nonzero, |
| 177 | increase the priority of duplicates caused by timeout (not error) by that amount. |
| 178 | (Typically only one of these is nonzero, and it is set equal to '''reliable_on_priority'''.) |
| 179 | |
| 180 | NOTE: this mechanism can be used to preferentially send ANY job, |
| 181 | not just retries, to fast/reliable hosts. |
| 182 | To do so, set the workunit's priority to '''reliable_on_priority''' or greater. |
| 183 | |
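The rules described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the scheduler's actual code: the `Config` and `Host` classes, the function names, and the `cause` argument are all hypothetical, and only the configuration option names come from the text.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Field names mirror the config options; the class itself is hypothetical.
    reliable_on_priority: int
    reliable_max_avg_turnaround: float  # seconds
    reliable_max_error_rate: float
    reliable_reduced_delay_bound: float
    reliable_priority_on_over: int = 0
    reliable_priority_on_over_except_error: int = 0


@dataclass
class Host:
    # Hypothetical per-host statistics.
    avg_turnaround: float  # seconds
    error_rate: float


def is_reliable(host: Host, cfg: Config) -> bool:
    """A host is 'reliable' if it meets both the turnaround and error-rate limits."""
    return (host.avg_turnaround <= cfg.reliable_max_avg_turnaround
            and host.error_rate <= cfg.reliable_max_error_rate)


def needs_reliable(priority: int, cfg: Config) -> bool:
    """A result is 'need-reliable' if its priority is at least reliable_on_priority."""
    return priority >= cfg.reliable_on_priority


def retry_priority(base_priority: int, cause: str, cfg: Config) -> int:
    """Priority of a retry; cause is 'timeout' or 'error' (assumed labels)."""
    if cfg.reliable_priority_on_over:
        return base_priority + cfg.reliable_priority_on_over
    if cfg.reliable_priority_on_over_except_error and cause == "timeout":
        return base_priority + cfg.reliable_priority_on_over_except_error
    return base_priority


def effective_delay_bound(delay_bound: float, priority: int,
                          host: Host, cfg: Config) -> float:
    """Shorten the delay bound when a need-reliable result goes to a reliable host."""
    if needs_reliable(priority, cfg) and is_reliable(host, cfg):
        return delay_bound * cfg.reliable_reduced_delay_bound
    return delay_bound
```

With a configuration in which only '''reliable_priority_on_over_except_error''' is nonzero and equal to '''reliable_on_priority''', a retry caused by a timeout becomes need-reliable while a retry caused by an error keeps the base priority. The sketch does not model how the job-cache and matchmaker schedulers choose hosts.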