Changes between Version 8 and Version 9 of AdaptiveReplication


Timestamp:
Mar 9, 2015, 1:45:31 PM
Author:
davea
Comment:

--

  • AdaptiveReplication

    v8 v9  
    44one of the hosts is known to be highly reliable.
    55The overhead of replication is high - at least 50% of total CPU time
    6 is spent checking validity.
     6is spent checking result validity.
    77
    8 '''Adaptive replication''' is an optional policy that avoids replicating a job
    9 if it has been sent to a highly reliable host.
     8'''Adaptive replication''' is an optional policy that avoids replicating jobs
     9that are sent to highly reliable hosts.
    1010The goal of this policy is to provide a target level of confidence
    1111with minimal overhead - perhaps only 5% or 10% of total CPU time.
     
    1313== Policy ==
    1414
    15 BOINC maintains an estimate E(H) of host H's recent error rate.
    16 This is maintained as follows:
    17 
    18  * It is initialized to 0.1
    19  * It is multiplied by 0.95 when H reports a correct (replicated) result.
    20  * It is incremented by 0.1 when H reports an incorrect (replicated) result.
    21 
    22 Thus, it takes a long time to earn a good reputation
    23 and a short time to lose it.
     15BOINC maintains the number CV(H, V) of consecutive valid results
     16returned by host H using app version V.
     17This is incremented when a replicated job computed with (H, V) is validated,
     18and is zeroed when such a job is found to be invalid.
     19(V is included because, for example, some hosts may be less reliable
     20for GPU jobs than for CPU jobs).
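The CV(H, V) bookkeeping described in the added lines can be sketched as follows. This is a minimal illustration, not BOINC's actual code: the class name `ConsecutiveValidTracker`, its methods, and the use of integer IDs as keys are all assumptions made for the example.

```python
from collections import defaultdict


class ConsecutiveValidTracker:
    """Hypothetical tracker for CV(H, V): consecutive valid results per
    (host, app version) pair. Only replicated jobs should be recorded,
    since only those are checked against other results."""

    def __init__(self):
        # Maps (host_id, app_version_id) -> consecutive valid count.
        self._cv = defaultdict(int)

    def record(self, host_id, app_version_id, valid):
        """Update the counter after a replicated job has been checked."""
        key = (host_id, app_version_id)
        if valid:
            self._cv[key] += 1   # validated: extend the streak
        else:
            self._cv[key] = 0    # found invalid: streak is zeroed
        return self._cv[key]

    def get(self, host_id, app_version_id):
        """Return CV(H, V); pairs never seen count as 0."""
        return self._cv.get((host_id, app_version_id), 0)
```

Keying on the pair rather than on the host alone means a host's GPU app version can lose its streak without affecting the count for its CPU app version.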
    2421
    2522The adaptive replication policy is as follows.
    2623
    2724 * Each job is initially marked as unreplicated.
    28  * On each request, the scheduler decides whether to trust the host as follows:
    29   * If E(H) > A, don't trust the host.
    30   * Otherwise, trust the host with probability 1 - sqrt( E(H)/A ).
     25 * When sending a job using app version V, the scheduler decides whether to trust the host as follows:
     26  * If CV(H, V) < 10, don't trust the host.
     27  * Otherwise, trust the host randomly with probability 1 - 1/CV(H, V).
    3128 * If we decide to trust the host, preferentially send it unreplicated jobs.
    32  * Otherwise, preferentially send it replicated jobs.  If we have to send it an unreplicated job, mark it as replicated and create new instances accordingly.
    33 
    34 In the current code base (as of r18056), A is hardcoded to be 0.05 in sched_send.cpp as `ER_MAX`.
     29 * Otherwise, preferentially send it replicated jobs.
     30   If we have to send it an unreplicated job, mark it as replicated and create new instances accordingly.
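The v9 trust decision can be sketched as below. This is an illustration under stated assumptions rather than the scheduler's real code: the function names and the constant name `MIN_CONSECUTIVE_VALID` are hypothetical; only the threshold of 10 and the probability 1 - 1/CV(H, V) come from the text.

```python
import random

# Threshold from the policy text; the constant name is an assumption.
MIN_CONSECUTIVE_VALID = 10


def trust_probability(cv):
    """Probability of trusting a host whose CV(H, V) equals cv."""
    if cv < MIN_CONSECUTIVE_VALID:
        return 0.0               # too few consecutive valid results
    return 1.0 - 1.0 / cv        # approaches 1 as the streak grows


def should_trust(cv, rng=random):
    """Randomly decide whether to trust the host for this job."""
    return rng.random() < trust_probability(cv)
```

Under this rule a host with CV(H, V) = 10 is trusted about 90% of the time and one with CV(H, V) = 100 about 99% of the time, so even long-standing reliable hosts still receive an occasional replicated job and continue to be spot-checked.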
    3531
    3632== Using adaptive replication ==
     
    5349Scheduler:
    5450 * Decide whether to trust host as described above.
    55  * If we send an unreplicated job (i.e., target_nresults=1 and app.target_nresults>1) to an untrusted host, set wu.target_nresults = app.target_nresults and flag the WU for transitioning.
    56 
    57 Validator:
    58  * Don't update host.error_rate for unreplicated results (i.e., wu.target_nresults=1 and app.target_nresults>1).
     51 * If we send an unreplicated job (i.e., target_nresults=1 and app.target_nresults>1) to an untrusted host,
     52   set wu.target_nresults = app.target_nresults and flag the WU for transitioning.
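The scheduler step above might look roughly like this. It is a simplified sketch: the `App` and `Workunit` classes are stand-ins for the real database records, and the `needs_transition` field is a hypothetical name for whatever mechanism flags the WU for the transitioner.

```python
from dataclasses import dataclass


@dataclass
class App:
    target_nresults: int          # replication level configured for the app


@dataclass
class Workunit:
    target_nresults: int          # 1 while the job is unreplicated
    needs_transition: bool = False  # hypothetical flag name


def is_unreplicated(wu, app):
    """A job is unreplicated if it asks for 1 result but its app wants more."""
    return wu.target_nresults == 1 and app.target_nresults > 1


def on_send(wu, app, host_trusted):
    """Promote an unreplicated job to replicated when the host is untrusted."""
    if is_unreplicated(wu, app) and not host_trusted:
        wu.target_nresults = app.target_nresults
        wu.needs_transition = True   # so that new instances get created
    return wu
```

A job sent to a trusted host is left unchanged, as is a job whose app does not use replication at all (app.target_nresults = 1).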