Changes between Version 2 and Version 3 of AutoFlops


Timestamp: Sep 9, 2009, 12:53:45 PM
Author: davea

== Goals ==

Removed (v2):

 * eliminate the need for projects to supply FLOPs estimates for jobs
 * eliminate the need for projects to supply FLOPS estimates for app versions (in app_plan())
 * simplify the credit-granting policy

== Outline ==

=== Server ===
For each app, maintain
 * flops_avg: the estimated average number of FLOPs used by the app's jobs
 * flops_stdev: the standard deviation of the above

Note: if a project has different types of jobs for a given app,
with widely differing durations,
it should create separate apps for them.
That will reduce the variance of the estimate.

Initially flops_avg is set to a high value (e.g. 1 GFLOP-day).

Update: whenever a completed job is reported,
let x = duration * (host's flops_est).
Update flops_avg in a way that favors decrease over increase.
That way, hosts that execute the app efficiently (close to peak hardware speed)
have a larger weight in the estimate.

Added (v3):

 * Eliminate the need for projects to supply FLOPs estimates for jobs
 * Eliminate the need for projects to supply FLOPS estimates for app versions (in app_plan())
 * Improve job completion time estimates by using different duration correction factors for different app versions
 * Simplify the credit-granting policy

== App units ==

'''App units''' are a project-defined measure of how much computation a completed job did.
They are typically a count of iterations of the app's main loop.
They should be approximately proportional to FLOPs performed,
but it doesn't matter what the proportion is
(i.e., you don't have to count the FLOPs in your main loop).

Apps use a new API to report how many app units they've performed
(call this right before boinc_exit()):
{{{
void boinc_app_units(double x);
}}}
This is optional (the default is one).
However, if an app's jobs have highly variable size and don't report app units,
BOINC will tend to give too little credit for them.
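
As an illustration, here's how an app might call the new API; a minimal sketch in which do_work_chunk() is a hypothetical function standing in for one main-loop iteration:
{{{
#include "boinc_api.h"   // boinc_init(), boinc_exit(); boinc_app_units() is the new call

bool do_work_chunk();    // hypothetical: does one iteration, returns false when done

int main() {
    boinc_init();
    double app_units = 0;
    while (do_work_chunk()) {
        app_units += 1;             // one app unit per main-loop iteration
    }
    boinc_app_units(app_units);     // report app units right before exiting
    boinc_exit(0);
    return 0;                       // not reached; boinc_exit() doesn't return
}
}}}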

The following API functions are deprecated:
{{{
boinc_ops_per_cpu_second()
boinc_ops_cumulative()
}}}

Job submission has a new optional argument '''predicted_app_units''':
the predicted number of app units the job will take.
The default is one.
The '''rsc_fpops_est''' and '''rsc_fpops_bound''' arguments
to job creation are deprecated.

The predictions don't have to be exact.
In fact, it's OK if they're systematically too high or low,
as long as there's a linear correlation.
However, if predicted app units are not linearly correlated with
actual app units, bad completion time estimates will result.

The server maintains the mean and variance of both actual and predicted app units,
and scales predictions by the ratio of the means.
It also maintains '''au_variance''', the variance of (actual/predicted),
which is used for completion time estimates and bounds.
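
A sketch of how the server might maintain these statistics, assuming exponential averaging with a fixed weight (all field and function names are illustrative, not existing BOINC code):
{{{
struct AppAuStats {
    double mean_actual_aus;      // mean of actual app units
    double mean_predicted_aus;   // mean of predicted app units
    double au_mean;              // mean of (actual/predicted)
    double au_variance;          // variance of (actual/predicted)
};

const double W = 0.01;           // averaging weight (assumption)

// Called when a completed job reports its actual app units.
void update_au_stats(AppAuStats& s, double actual, double predicted) {
    s.mean_actual_aus += W*(actual - s.mean_actual_aus);
    s.mean_predicted_aus += W*(predicted - s.mean_predicted_aus);
    double d = actual/predicted - s.au_mean;
    s.au_mean += W*d;
    s.au_variance += W*(d*d - s.au_variance);   // approximate running variance
}

// Scale a job's prediction by the ratio of the means.
double scaled_prediction(const AppAuStats& s, double predicted_aus) {
    return predicted_aus * s.mean_actual_aus / s.mean_predicted_aus;
}
}}}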

== Job completion time estimates and bounds ==

The BOINC client maintains a per-app-version estimate seconds_per_app_unit.
The completion time estimate of a job J is
{{{
seconds_per_app_unit
* J.predicted_aus
* (app.mean_actual_aus/app.mean_predicted_aus + X*au_stdev)
}}}
where au_stdev is the square root of au_variance,
and X is, say, 2 (meaning that for roughly 95% of jobs, the actual completion time
will be less than the estimate).

The completion time bound for a job is the same,
but with X=4 or so (meaning that 99.99% of jobs will complete without
being aborted due to the time limit).
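
As a sketch, both quantities could be computed by one function (parameter names are illustrative):
{{{
#include <cmath>

// Completion time estimate (X = 2) or bound (X = 4 or so) for a job.
double completion_time(
    double seconds_per_app_unit,   // client's per-app-version estimate
    double predicted_aus,          // J.predicted_aus
    double mean_actual_aus,        // app.mean_actual_aus
    double mean_predicted_aus,     // app.mean_predicted_aus
    double au_variance,            // variance of (actual/predicted)
    double X                       // safety factor
) {
    double au_stdev = std::sqrt(au_variance);
    return seconds_per_app_unit
        * predicted_aus
        * (mean_actual_aus/mean_predicted_aus + X*au_stdev);
}
}}}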

== Estimating FLOPs per app unit ==

For credit-granting purposes
we want to estimate the number of FLOPs per app unit.

When the scheduler dispatches a job,
it knows the numbers of CPUs and GPUs used,
as well as the peak hardware FLOPS of each device
(for CPUs, this is the Whetstone benchmark result;
for GPUs, it's derived from hardware parameters).
We define '''peak_flops_per_sec(J)'''
as the sum, over the devices used, of the device count times that device's peak hardware FLOPS.

The server sends the client peak_flops_per_sec(J).
When the client returns a job, it includes a value
raw_flops_per_sec(J).
This is usually the same as peak_flops_per_sec(J),
but it may be less (see note 2 below).

Suppose a job J is executed on a given host using app version V,
and that it reports A app units and uses elapsed time T.
We then define raw_flops(J) as T * raw_flops_per_sec(J),
and raw_flops_per_au(J) as raw_flops(J)/A.
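
In code, these definitions amount to something like the following sketch (struct and field names are illustrative):
{{{
// Quantities known about a job's execution.
struct JobExecution {
    double ncpus, ngpus;          // device counts used by the job
    double cpu_flops, gpu_flops;  // peak FLOPS per device (Whetstone / GPU parameters)
    double elapsed_time;          // T: elapsed seconds
    double app_units;             // A: app units reported by the app
};

double peak_flops_per_sec(const JobExecution& j) {
    return j.ncpus*j.cpu_flops + j.ngpus*j.gpu_flops;
}

// Assumes raw_flops_per_sec(J) == peak_flops_per_sec(J), the usual case.
double raw_flops_per_au(const JobExecution& j) {
    double raw_flops = j.elapsed_time * peak_flops_per_sec(j);
    return raw_flops / j.app_units;
}
}}}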

If we run jobs on lots of different hosts,
the distribution of raw_flops_per_au(J) will have variance,
because applications run with different efficiency on different hardware.
Suppose an app is memory-intensive.
Running on machine A with a fast CPU (say 10 GFLOPS Whetstone) and a slow memory system,
it might do 1 GFLOPS (10% efficiency),
while on machine B with a slow CPU (say 1 GFLOPS Whetstone) and a fast memory system,
it might also do 1 GFLOPS (100% efficiency).
For the same job, A would thus report a raw_flops_per_au(J) of ten times
the true FLOPs per app unit, while B would report the true value.

In the absence of FLOPs counting, we can't estimate the true FLOPs per app unit.
The best we can do is to use raw_flops_per_au(J) from the most efficient hosts.
So we use an order statistic,
say the 5th percentile value, from the distribution of raw_flops_per_au(J).
We'll call this est_flops_per_au(app).

Notes:

1) Typically GPUs are much less efficient than CPUs.
If a large fraction of jobs (say, >95%) are being done by GPUs,
then the 5th percentile value won't give us a good estimate.
To deal with this we could maintain separate arrays
for GPU and CPU app versions, and take the min of the two 5th-percentile values.

2) If an application has only very inefficient app versions,
we'll get an inflated estimate of FLOPs per app unit.
Assuming that GPU apps tend to be less efficient than CPU apps,
this will happen with projects that have only a GPU app (e.g., GPUGRID.net).
Here's a scheme to address this problem (a sketch of one possible reading follows the list):

a) have the client report raw_flops_per_sec instead of using the server value.

b) pass the client a flag saying that an app has both CPU and GPU versions.

c) If a client has an app version V for an app with only GPU versions,
and it also has app versions V1...Vn for apps that have
both CPU and GPU versions,
then use the average of the est_flops_per_au for V1...Vn
in the raw_flops_per_au(J) that it reports for V.

The net effect is to propagate efficiency estimates between projects.
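
Step (c) is terse; one possible reading, heavily hedged: treat the ratio of the server's est_flops_per_au to the locally observed raw_flops_per_au for each mixed (CPU+GPU) app as an efficiency estimate, average those ratios over V1...Vn, and apply the average to the value reported for V. All names are illustrative:
{{{
#include <vector>

struct MixedAppVersion {
    double est_flops_per_au;   // server's estimate, for an app with CPU and GPU versions
    double raw_flops_per_au;   // value observed locally for this version
};

// Adjust the value reported for version V of a GPU-only app,
// using efficiency observed on versions V1...Vn of mixed apps.
double adjusted_raw_flops_per_au(
    double raw_value_for_V,
    const std::vector<MixedAppVersion>& vs
) {
    if (vs.empty()) return raw_value_for_V;
    double efficiency_sum = 0;
    for (const MixedAppVersion& v : vs) {
        efficiency_sum += v.est_flops_per_au / v.raw_flops_per_au;   // <= 1 in general
    }
    return raw_value_for_V * (efficiency_sum / vs.size());
}
}}}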

3) If an app's jobs have variance that is not reflected in app units,
est_flops_per_au may be way too low
(suppose, e.g., that 5% of jobs end immediately).
A solution to this problem is to define a "reference job"
that's executed on all machines,
and use only that job to compute est_flops_per_au.

== Credit ==

Grant each validated job credit proportional to
{{{
app.mean_app_units * app.est_flops_per_au
}}}
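
For instance, a minimal sketch assuming the constant of proportionality is BOINC's traditional cobblestone scale (200 credits per day of work at 1 GFLOPS; the document itself doesn't fix the constant):
{{{
// 200 credits per GFLOPS-day (assumed scale).
const double CREDIT_PER_FLOP = 200.0 / (1e9 * 86400);

// Every validated job of an app gets the same credit.
double job_credit(double mean_app_units, double est_flops_per_au) {
    return mean_app_units * est_flops_per_au * CREDIT_PER_FLOP;
}
}}}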

All jobs for a given app get the same credit.
A user doesn't get more credit for a longer-than-average job,
but it averages out in the end.

This removes the incentive for hackers to report exaggerated
job statistics (say, app units).

However, it introduces a new form of credit cheating: "cherry picking",
i.e. completing only short jobs
(based on input files, or on trial execution)
and aborting or discarding others.

This can be discouraged by mechanisms
that reduce a host's jobs-per-day quota when a job is aborted or times out.

== Anonymous platform notes ==

== Implementation ==

=== Protocol ===

Request message:
 * for each reported job, add <app_units>
 * for each app version, add <seconds_per_app_unit>

Reply message:
 * for each app, <est_flops_per_au>
 * for each app version, <

=== Database ===

app.est_flops_per_au

=== Transitioner ===

The transitioner maintains, for each app,
an array of the N most recent values of raw_flops_per_au(J) for that app.
When this array is full, the transitioner computes the 5th percentile value of the array.
It then exponentially averages this value with app.est_flops_per_au.
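
A sketch of that computation; the batch size and averaging weight are free parameters:
{{{
#include <algorithm>
#include <vector>

const double AVG_WEIGHT = 0.5;   // exponential averaging weight (assumption)

// Called when the array of recent raw_flops_per_au(J) values is full;
// returns the new app.est_flops_per_au.
double update_est_flops_per_au(std::vector<double> samples, double est_flops_per_au) {
    // 5th percentile via partial sort
    size_t k = samples.size() / 20;
    std::nth_element(samples.begin(), samples.begin() + k, samples.end());
    double p5 = samples[k];
    // exponential average with the previous estimate
    return AVG_WEIGHT * p5 + (1 - AVG_WEIGHT) * est_flops_per_au;
}
}}}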

=== Scheduler ===

Job completion time estimate:

Removed (v2):

 app.flops_avg / (host's flops_est for this app version)

Added (v3):

{{{
estimated_time =
        J.predicted_app_units
        * A.predicted_app_units_scale
        * app_version.seconds_per_app_unit   // as reported by host
}}}
or
{{{
estimated_time =
        J.predicted_app_units
        * A.predicted_app_units_scale
        * A.est_raw_flops_per_au
        / J.flops_estimate                   // FLOPS as estimated by scheduler
}}}

=== client ===

Removed (v2):

For each app version, maintain

 * flops_est: dynamic estimate of the real FLOPS of the app version on this host.

Initially this is based on the peak hardware speed, i.e. ngpus*(GPU peak FLOPS) + avg_ncpus*(Whetstone).
Update: when a job finishes, let x = (job.flops / duration).
Update flops_est accordingly (but cap it at the peak hardware speed).

Note: this replaces the "duration correction factor".

=== protocol ===

Request message: add flops_est for each app version

== Credit ==

Grant each validated job credit proportional to app.flops_avg

Note: a user doesn't get more credit for a longer-than-average job.
But it averages out in the end.

Added (v3):

For each app version, maintain

 * seconds_per_app_unit: dynamic estimate of elapsed seconds per app unit

When a job completes, update this as an exponential average.
This replaces "duration correction factor".
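
A sketch of that update; the averaging weight is a free parameter, and names are illustrative:
{{{
const double W = 0.1;   // exponential averaging weight (assumption)

// Called when a job completes on this host.
void update_seconds_per_app_unit(
    double& seconds_per_app_unit,   // per-app-version estimate
    double elapsed_time,            // seconds the job actually took
    double app_units                // app units it reported
) {
    double sample = elapsed_time / app_units;
    seconds_per_app_unit += W * (sample - seconds_per_app_unit);
}
}}}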