Changes between Version 2 and Version 3 of AutoFlops


Timestamp: Sep 9, 2009, 12:53:45 PM
Author: davea

== Goals ==

Removed (v2):

 * eliminate the need for projects to supply FLOPs estimates for jobs
 * eliminate the need for projects to supply FLOPS estimates for app versions (in app_plan())
 * simplify the credit-granting policy

== Outline ==

=== Server ===
For each app, maintain
 * flops_avg: the estimated average number of FLOPs used by the app's jobs
 * flops_stdev: the standard deviation of the above

Note: if a project has different types of jobs for a given app,
with widely differing durations,
it should create separate apps for them.
That will reduce the variance of the estimate.

Initially flops_avg is set to a high value (e.g. 1 GFLOP-day).

Update: whenever a completed job is reported,
let x = duration * (host's flops_est).
Update flops_avg in a way that favors decrease over increase.
That way, hosts that execute the app efficiently (close to peak hardware speed)
have a larger weight in the estimate.

Added (v3):

 * Eliminate the need for projects to supply FLOPs estimates for jobs
 * Eliminate the need for projects to supply FLOPS estimates for app versions (in app_plan())
 * Improve job completion time estimates by using different duration correction factors for different app versions
 * Simplify the credit-granting policy

== App units ==

'''App units''' are a project-defined measure of how much computation a completed job did.
They are typically a count of iterations of the app's main loop.
They should be approximately proportional to FLOPs performed,
but it doesn't matter what the proportion is
(i.e., you don't have to count the FLOPs in your main loop).

Apps use a new API to report how many app units they've performed
(call this right before boinc_exit()):
{{{
void boinc_app_units(double x);
}}}
This is optional (the default is one).
However, if an app's jobs have highly variable size and don't report app units,
BOINC will tend to give too little credit for them.
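
As an illustration, here's how an app might call the new API; a minimal sketch in which do_work_chunk() is a hypothetical function standing in for one main-loop iteration:
{{{
#include "boinc_api.h"   // boinc_init(), boinc_exit(); boinc_app_units() is the new call

bool do_work_chunk();    // hypothetical: does one iteration, returns false when done

int main() {
    boinc_init();
    double app_units = 0;
    while (do_work_chunk()) {
        app_units += 1;             // one app unit per main-loop iteration
    }
    boinc_app_units(app_units);     // report app units right before exiting
    boinc_exit(0);
    return 0;                       // not reached; boinc_exit() doesn't return
}
}}}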

The following API functions are deprecated:
{{{
boinc_ops_per_cpu_second()
boinc_ops_cumulative()
}}}

Job submission has a new optional argument '''predicted_app_units''':
the predicted number of app units the job will take.
The default is one.
The '''rsc_fpops_est''' and '''rsc_fpops_bound''' arguments
to job creation are deprecated.

The predictions don't have to be exact.
In fact, it's OK if they're systematically too high or low,
as long as there's a linear correlation.
However, if predicted app units are not linearly correlated with
actual app units, bad completion time estimates will result.

The server maintains the mean and variance of both actual and predicted app units,
and scales predictions by the ratio of the means.
It also maintains '''au_variance''', the variance of (actual/predicted),
which is used for completion time estimates and bounds.
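
A sketch of how the server might maintain these statistics, assuming exponential averaging with a fixed weight (all field and function names are illustrative, not existing BOINC code):
{{{
struct AppAuStats {
    double mean_actual_aus;      // mean of actual app units
    double mean_predicted_aus;   // mean of predicted app units
    double au_mean;              // mean of (actual/predicted)
    double au_variance;          // variance of (actual/predicted)
};

const double W = 0.01;           // averaging weight (assumption)

// Called when a completed job reports its actual app units.
void update_au_stats(AppAuStats& s, double actual, double predicted) {
    s.mean_actual_aus += W*(actual - s.mean_actual_aus);
    s.mean_predicted_aus += W*(predicted - s.mean_predicted_aus);
    double d = actual/predicted - s.au_mean;
    s.au_mean += W*d;
    s.au_variance += W*(d*d - s.au_variance);   // approximate running variance
}

// Scale a job's prediction by the ratio of the means.
double scaled_prediction(const AppAuStats& s, double predicted_aus) {
    return predicted_aus * s.mean_actual_aus / s.mean_predicted_aus;
}
}}}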

== Job completion time estimates and bounds ==

The BOINC client maintains a per-app-version estimate seconds_per_app_unit.
The completion time estimate of a job J is
{{{
seconds_per_app_unit
* J.predicted_aus
* (app.mean_actual_aus/app.mean_predicted_aus + X*au_stdev)
}}}
where au_stdev is the square root of au_variance,
and X is, say, 2 (meaning that for roughly 95% of jobs, the actual completion time
will be less than the estimate).

The completion time bound for a job is the same,
but with X=4 or so (meaning that 99.99% of jobs will complete without
being aborted due to the time limit).
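
As a sketch, both quantities could be computed by one function (parameter names are illustrative):
{{{
#include <cmath>

// Completion time estimate (X = 2) or bound (X = 4 or so) for a job.
double completion_time(
    double seconds_per_app_unit,   // client's per-app-version estimate
    double predicted_aus,          // J.predicted_aus
    double mean_actual_aus,        // app.mean_actual_aus
    double mean_predicted_aus,     // app.mean_predicted_aus
    double au_variance,            // variance of (actual/predicted)
    double X                       // safety factor
) {
    double au_stdev = std::sqrt(au_variance);
    return seconds_per_app_unit
        * predicted_aus
        * (mean_actual_aus/mean_predicted_aus + X*au_stdev);
}
}}}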

== Estimating FLOPs per app unit ==

For credit-granting purposes
we want to estimate the number of FLOPs per app unit.

When the scheduler dispatches a job,
it knows the numbers of CPUs and GPUs used,
as well as the peak hardware FLOPS of each device
(for CPUs, this is the Whetstone benchmark result;
for GPUs, it's derived from hardware parameters).
We define '''peak_flops_per_sec(J)'''
as the sum, over the devices used, of the device count times that device's peak hardware FLOPS.

The server sends the client peak_flops_per_sec(J).
When the client returns a job, it includes a value
raw_flops_per_sec(J).
This is usually the same as peak_flops_per_sec(J),
but it may be less (see note 2 below).

Suppose a job J is executed on a given host using app version V,
and that it reports A app units and uses elapsed time T.
We then define raw_flops(J) as T * raw_flops_per_sec(J),
and raw_flops_per_au(J) as raw_flops(J)/A.
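
In code, these definitions amount to something like the following sketch (struct and field names are illustrative):
{{{
// Quantities known about a job's execution.
struct JobExecution {
    double ncpus, ngpus;          // device counts used by the job
    double cpu_flops, gpu_flops;  // peak FLOPS per device (Whetstone / GPU parameters)
    double elapsed_time;          // T: elapsed seconds
    double app_units;             // A: app units reported by the app
};

double peak_flops_per_sec(const JobExecution& j) {
    return j.ncpus*j.cpu_flops + j.ngpus*j.gpu_flops;
}

// Assumes raw_flops_per_sec(J) == peak_flops_per_sec(J), the usual case.
double raw_flops_per_au(const JobExecution& j) {
    double raw_flops = j.elapsed_time * peak_flops_per_sec(j);
    return raw_flops / j.app_units;
}
}}}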

If we run jobs on lots of different hosts,
the distribution of raw_flops_per_au(J) will have variance,
because applications run with different efficiency on different hardware.
Suppose an app is memory-intensive.
Running on machine A with a fast CPU (say 10 GFLOPS Whetstone) and a slow memory system,
it might do 1 GFLOPS (10% efficiency),
while on machine B with a slow CPU (say 1 GFLOPS Whetstone) and a fast memory system,
it might also do 1 GFLOPS (100% efficiency).
For the same job, A would thus report a raw_flops_per_au(J) of ten times
the true FLOPs per app unit, while B would report the true value.

In the absence of FLOPs counting, we can't estimate the true FLOPs per app unit.
The best we can do is to use raw_flops_per_au(J) from the most efficient hosts.
So we use an order statistic,
say the 5th percentile value, from the distribution of raw_flops_per_au(J).
We'll call this est_flops_per_au(app).

Notes:

1) Typically GPUs are much less efficient than CPUs.
If a large fraction of jobs (say, >95%) are being done by GPUs,
then the 5th percentile value won't give us a good estimate.
To deal with this we could maintain separate arrays
for GPU and CPU app versions, and take the min of the two 5th-percentile values.

2) If an application has only very inefficient app versions,
we'll get an inflated estimate of FLOPs per app unit.
Assuming that GPU apps tend to be less efficient than CPU apps,
this will happen with projects that have only a GPU app (e.g., GPUGRID.net).
Here's a scheme to address this problem (a sketch of one possible reading follows the list):

a) have the client report raw_flops_per_sec instead of using the server value.

b) pass the client a flag saying that an app has both CPU and GPU versions.

c) If a client has an app version V for an app with only GPU versions,
and it also has app versions V1...Vn for apps that have
both CPU and GPU versions,
then use the average of the est_flops_per_au for V1...Vn
in the raw_flops_per_au(J) that it reports for V.

The net effect is to propagate efficiency estimates between projects.
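
Step (c) is terse; one possible reading, heavily hedged: treat the ratio of the server's est_flops_per_au to the locally observed raw_flops_per_au for each mixed (CPU+GPU) app as an efficiency estimate, average those ratios over V1...Vn, and apply the average to the value reported for V. All names are illustrative:
{{{
#include <vector>

struct MixedAppVersion {
    double est_flops_per_au;   // server's estimate, for an app with CPU and GPU versions
    double raw_flops_per_au;   // value observed locally for this version
};

// Adjust the value reported for version V of a GPU-only app,
// using efficiency observed on versions V1...Vn of mixed apps.
double adjusted_raw_flops_per_au(
    double raw_value_for_V,
    const std::vector<MixedAppVersion>& vs
) {
    if (vs.empty()) return raw_value_for_V;
    double efficiency_sum = 0;
    for (const MixedAppVersion& v : vs) {
        efficiency_sum += v.est_flops_per_au / v.raw_flops_per_au;   // <= 1 in general
    }
    return raw_value_for_V * (efficiency_sum / vs.size());
}
}}}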

3) If an app's jobs have variance that is not reflected in app units,
est_flops_per_au may be way too low
(suppose, e.g., that 5% of jobs end immediately).
A solution to this problem is to define a "reference job"
that's executed on all machines,
and use only that job to compute est_flops_per_au.

== Credit ==

Grant each validated job credit proportional to
{{{
app.mean_app_units * app.est_flops_per_au
}}}
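
For instance, a minimal sketch assuming the constant of proportionality is BOINC's traditional cobblestone scale (200 credits per day of work at 1 GFLOPS; the document itself doesn't fix the constant):
{{{
// 200 credits per GFLOPS-day (assumed scale).
const double CREDIT_PER_FLOP = 200.0 / (1e9 * 86400);

// Every validated job of an app gets the same credit.
double job_credit(double mean_app_units, double est_flops_per_au) {
    return mean_app_units * est_flops_per_au * CREDIT_PER_FLOP;
}
}}}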

All jobs for a given app get the same credit.
A user doesn't get more credit for a longer-than-average job,
but it averages out in the end.

This removes the incentive for hackers to report exaggerated
job statistics (say, app units).

However, it introduces a new form of credit cheating: "cherry picking",
i.e. completing only short jobs
(based on input files, or on trial execution)
and aborting or discarding others.

This can be discouraged by mechanisms
that reduce a host's jobs-per-day quota when a job is aborted or times out.

== Anonymous platform notes ==

== Implementation ==

=== Protocol ===

Request message:
 * for each reported job, add <app_units>
 * for each app version, add <seconds_per_app_unit>

Reply message:
 * for each app, <est_flops_per_au>
 * for each app version, <

=== Database ===

app.est_flops_per_au

=== Transitioner ===

The transitioner maintains, for each app,
an array of the N most recent values of raw_flops_per_au(J) for that app.
When this array is full, the transitioner computes the 5th percentile value of the array.
It then exponentially averages this value with app.est_flops_per_au.
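
A sketch of that computation; the batch size and averaging weight are free parameters:
{{{
#include <algorithm>
#include <vector>

const double AVG_WEIGHT = 0.5;   // exponential averaging weight (assumption)

// Called when the array of recent raw_flops_per_au(J) values is full;
// returns the new app.est_flops_per_au.
double update_est_flops_per_au(std::vector<double> samples, double est_flops_per_au) {
    // 5th percentile via partial sort
    size_t k = samples.size() / 20;
    std::nth_element(samples.begin(), samples.begin() + k, samples.end());
    double p5 = samples[k];
    // exponential average with the previous estimate
    return AVG_WEIGHT * p5 + (1 - AVG_WEIGHT) * est_flops_per_au;
}
}}}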

=== Scheduler ===

Job completion time estimate:

Removed (v2):

 app.flops_avg / (host's flops_est for this app version)

Added (v3):

{{{
estimated_time =
        J.predicted_app_units
        * A.predicted_app_units_scale
        * app_version.seconds_per_app_unit   // as reported by host
}}}
or
{{{
estimated_time =
        J.predicted_app_units
        * A.predicted_app_units_scale
        * A.est_raw_flops_per_au
        / J.flops_estimate                   // FLOPS as estimated by scheduler
}}}

=== client ===

Removed (v2):

For each app version, maintain

 * flops_est: dynamic estimate of the real FLOPS of the app version on this host.

Initially this is based on the peak hardware speed, i.e. ngpus*(GPU peak FLOPS) + avg_ncpus*(Whetstone).
Update: when a job finishes, let x = (job.flops / duration).
Update flops_est accordingly (but cap it at the peak hardware speed).

Note: this replaces the "duration correction factor".

=== protocol ===

Request message: add flops_est for each app version

== Credit ==

Grant each validated job credit proportional to app.flops_avg

Note: a user doesn't get more credit for a longer-than-average job.
But it averages out in the end.

Added (v3):

For each app version, maintain

 * seconds_per_app_unit: dynamic estimate of elapsed seconds per app unit

When a job completes, update this as an exponential average.
This replaces "duration correction factor".
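
A sketch of that update; the averaging weight is a free parameter, and names are illustrative:
{{{
const double W = 0.1;   // exponential averaging weight (assumption)

// Called when a job completes on this host.
void update_seconds_per_app_unit(
    double& seconds_per_app_unit,   // per-app-version estimate
    double elapsed_time,            // seconds the job actually took
    double app_units                // app units it reported
) {
    double sample = elapsed_time / app_units;
    seconds_per_app_unit += W * (sample - seconds_per_app_unit);
}
}}}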