289 | | granted credit = claimed credit. |
290 | | |
291 | | For jobs that are replicated, granted credit should be |
292 | | set to the min of the valid results |
293 | | (min is used instead of average to remove the incentive |
294 | | for cherry-picking, see below). |
295 | | |
296 | | However, there are still some possible forms of cheating. |
297 | | |
298 | | * One-time cheats (like claiming 1e304) can be prevented by |
299 | | capping VNPFC(J) at some multiple (say, 10) of VNPFC^mean^(A). |
300 | | * Cherry-picking: suppose an application has two types of jobs, |
301 | | which run for 1 second and 1 hour respectively. |
302 | | Clients can figure out which is which, e.g. by running a job for 2 seconds |
303 | | and seeing if it's exited. |
304 | | Suppose a client systematically refuses the 1 hour jobs |
305 | | (e.g., by reporting a crash or never reporting them). |
306 | | Its VNPFC^mean^(H, A) will quickly decrease, |
307 | | and soon it will be getting several thousand times more credit |
308 | | per actual work than other hosts! |
309 | | Countermeasure: |
310 | | whenever a job errors out, times out, or fails to validate, |
311 | | set the host's error rate back to the initial default, |
312 | | and set its VNPFC^mean^(H, A) to VNPFC^mean^(A) for all apps A. |
313 | | This puts the host into a state where several dozen of its |
314 | | subsequent jobs will be replicated. |
315 | | |
316 | | == Error rate, host punishment, and turnaround time estimation == |
317 | | |
318 | | Unrelated to the credit proposal, but in a similar spirit. |
319 | | |
320 | | Due to hardware problems (e.g. a malfunctioning GPU) |
321 | | a host may have a 100% error rate for one app version |
322 | | and a 0% error rate for another. |
323 | | The same applies to turnaround time. |
324 | | |
325 | | So we'll move the "error_rate" and "turnaround_time" |
326 | | fields from the host table to host_app_version. |
327 | | |
328 | | The host punishment mechanism is designed to deal with malfunctioning hosts. |
329 | | For each host the server maintains '''max_results_day'''. |
330 | | This is initialized to a project-specified value (e.g. 200) |
331 | | and scaled by the number of CPUs and/or GPUs. |
332 | | It's decremented if the client reports a crash |
333 | | (but not if the job was aborted). |
334 | | It's doubled when a successful (but not necessarily valid) |
335 | | result is received. |
336 | | |
337 | | This should also be per-app-version, |
338 | | so we'll move "max_results_day" from the host table to host_app_version. |
| 290 | in this case, granted credit = claimed credit. |
| 291 | |
| 292 | For jobs that are replicated, |
| 293 | granted credit is set to: |
| 294 | * if the host with the larger claim is on scale probation, granted = smaller |
| 295 | * else if larger > 2*smaller, granted = 1.5*smaller |
| 296 | * else granted = (larger+smaller)/2 |
| 297 | |
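The three-way rule for replicated jobs can be sketched as follows. This is a minimal illustration, assuming exactly two valid results per job; the function name `granted_credit` and its parameters are hypothetical and do not come from the server code.

```python
def granted_credit(claim_1: float, probation_1: bool,
                   claim_2: float, probation_2: bool) -> float:
    """Granted credit for a job with two valid results.

    Each result carries its claimed credit and a flag saying whether the
    reporting host is on scale probation (hypothetical parameter names).
    """
    # Order the two results by claimed credit.
    (smaller, _), (larger, larger_on_probation) = sorted(
        [(claim_1, probation_1), (claim_2, probation_2)],
        key=lambda result: result[0],
    )
    if larger_on_probation:
        # Do not trust the larger claim at all.
        return smaller
    if larger > 2 * smaller:
        # Claims disagree by more than 2x: limit the influence of the larger.
        return 1.5 * smaller
    # Claims roughly agree: average them.
    return (larger + smaller) / 2
```

For example, claims of 10 and 12 from hosts not on probation give 11, while claims of 10 and 30 give 15 rather than 20.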
| 298 | However, two kinds of cheating still have to be dealt with: |
| 299 | |
| 300 | === One-time cheats === |
| 301 | |
| 302 | For example, claiming a PFC of 1e304. |
| 303 | This can be minimized by |
| 304 | capping VNPFC(J) at some multiple (say, 20) of VNPFC^mean^(A). |
| 305 | If the cap is enforced, the host's error rate is set to the initial value, |
| 306 | so it won't do single replication for a while, |
| 307 | and scale_probation (see below) is set to true. |
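A minimal sketch of the cap, assuming a multiple of 20 as suggested above. The `HostState` class, the `cap_vnpfc` function, and the value of `INITIAL_ERROR_RATE` are hypothetical names and numbers chosen for illustration, not the project's actual identifiers or defaults.

```python
from dataclasses import dataclass

INITIAL_ERROR_RATE = 0.1  # assumed placeholder; the real initial value is not given here
CAP_MULTIPLE = 20         # "some multiple (say, 20)"


@dataclass
class HostState:
    error_rate: float
    scale_probation: bool = False


def cap_vnpfc(vnpfc_job: float, vnpfc_mean_app: float, host: HostState,
              multiple: float = CAP_MULTIPLE) -> float:
    """Return VNPFC(J), capped at multiple * VNPFC^mean^(A).

    When the cap is enforced, the host's error rate is reset to its initial
    value (so it loses single replication for a while) and it is put on
    scale probation.
    """
    limit = multiple * vnpfc_mean_app
    if vnpfc_job > limit:
        host.error_rate = INITIAL_ERROR_RATE
        host.scale_probation = True
        return limit
    return vnpfc_job
```

With an application mean of 2.0, a claim of 1e304 is cut to 40.0 and the host is flagged, while a claim of 3.0 passes through unchanged.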
| 365 | == Error rate, host punishment, and turnaround time estimation == |
| 366 | |
| 367 | Unrelated to the credit proposal, but in a similar spirit. |
| 368 | |
| 369 | Due to hardware problems (e.g. a malfunctioning GPU) |
| 370 | a host may have a 100% error rate for one app version |
| 371 | and a 0% error rate for another. |
| 372 | The same applies to turnaround time. |
| 373 | |
| 374 | So we'll move the "error_rate" and "turnaround_time" |
| 375 | fields from the host table to host_app_version. |
| 376 | |
| 377 | The host punishment mechanism is designed to deal with malfunctioning hosts. |
| 378 | For each host the server maintains '''max_results_day'''. |
| 379 | This is initialized to a project-specified value (e.g. 200) |
| 380 | and scaled by the number of CPUs and/or GPUs. |
| 381 | It's decremented if the client reports a crash |
| 382 | (but not if the job was aborted). |
| 383 | It's doubled when a successful (but not necessarily valid) |
| 384 | result is received. |
| 385 | |
| 386 | This should also be per-app-version, |
| 387 | so we'll move "max_results_day" from the host table to host_app_version. |
| 388 | |
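The per-app-version quota update described above might look like the sketch below. Several details are assumptions not stated in the text: the decrement is taken to be 1, the quota is floored at 1, doubling is capped at the initial scaled value, and the outcome strings, class, and function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostAppVersion:
    # Fields that move from the host table to host_app_version.
    max_results_day: int
    error_rate: float
    turnaround_time: float


def initial_quota(project_value: int, n_devices: int) -> int:
    """Project-specified value (e.g. 200) scaled by the number of CPUs/GPUs."""
    return project_value * n_devices


def update_quota(hav: HostAppVersion, outcome: str, quota_cap: int) -> None:
    """Adjust max_results_day after a reported result.

    outcome is one of "crash", "aborted", "success" (assumed labels).
    """
    if outcome == "crash":
        # Decrement on a client-reported crash; floor of 1 is an assumption.
        hav.max_results_day = max(1, hav.max_results_day - 1)
    elif outcome == "success":
        # Double on a successful (not necessarily valid) result;
        # capping at quota_cap is an assumption.
        hav.max_results_day = min(quota_cap, hav.max_results_day * 2)
    # "aborted" leaves the quota unchanged.
```

Keeping one such record per (host, app version) pair means a host with a malfunctioning GPU is throttled only for the GPU app version, while its CPU app versions keep their full quota.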