wiki:AutoFlops

Version 7 (modified by davea, 15 years ago) (diff)

--

T(DesignDocument)?

Automated estimation of job and app version characteristics

Goals

  • Eliminate the need for projects to supply FLOPs estimate for jobs
  • Eliminate the need for projects to supply FLOPS estimates for app versions (in app_plan())
  • Improve job completion time estimates by using different duration correction factors for different app versions
  • Simplify the credit-granting policy

App units

App units are a project-defined, application-specific measure of computation. They are typically a count of iterations of the app's main loop. They should be roughly proportional to FLOPs performed, but it doesn't matter what the proportion is (i.e., you don't have to count the FLOPs in your main loop).

Apps use a new API to report how many app units they've performed (call this right before boinc_exit()):

void boinc_app_units(double x);

This is optional (the default is one). However, if an app's jobs have highly variable size and don't report app units, BOINC will tend to give too little credit for them.

The following API functions are deprecated:

boinc_ops_per_cpu_second()
boinc_ops_cumulative()

Job submission has a new optional argument predicted_app_units: the predicted number of app units the job will take. The default is one. The rsc_fpops_est and rsc_fpops_bound arguments to job creation are deprecated.

The predictions don't have to be exact. In fact, it's OK if they're systematically too high or low. However, if predicted app units are not linearly correlated with actual app units, bad completion time estimates will result.

The server maintains the mean and variance of both actual and predicted app units, and scales prediction by the ratio of the means. It also maintains au_variance, the variance of (actual/predicted), which is used for completion time estimates and bounds.

Job completion time estimates and bounds

The BOINC client maintains a estimate seconds_per_app_unit(V) of elapsed time per app unit for the app version V. This is computed as a weighted average of seconds per app unit for recently completed jobs It reports this to the scheduler.

The scheduler's completion time estimate for a job J using app version V on a given host is

seconds_per_app_unit(V)
* J.predicted_aus
* (app.actual_aus_mean/app.predicted_aus_mean + X*app.actual_aus_stdev)

where X is, say, 2 (meaning that for 95% of jobs, the actual completion time will be less than the estimate).

The completion time bound for a job is the same, but with X=4 or so (meaning that 99.99% of jobs will complete without being aborted due to time limit).

Estimating FLOPS per app unit

For credit-granting purposes we want to estimate the number of FLOPs per app unit.

When the scheduler dispatches a job, it knows the numbers of CPUs and GPUs used, as well as the peak hardware FLOPS of each device (for CPUs, this is the Whetstone benchmark result; for GPUs, it's derived from hardware parameters). We define peak_flops_per_sec(J) as the sum of these counts times the peak hardware FLOPS of the devices.

The server sends the client peak_flops_per_sec(J). When the client returns a job, it includes a value raw_flops_per_sec(J). This is usually the same as peak_flops_per_sec(J) but it may be less (see note 2 below).

Suppose a job J is executed on a given host using app version V, and that it reports A app units and uses elapsed time T. We then define

raw_flops(J) = T * raw_flops_per_sec(J)
raw_flops_per_au(J) = raw_flops(J)/A

If we run jobs on lots of different hosts, the distribution of raw_flops_per_au(J) will have variance, because applications run with different efficiency on different hardware. Suppose an app is memory-intensive. Running on machine A with a fast CPU (say 10 GFLOPS Whetstone) and a slow memory system, it might do 1 GFLOPS (10% efficiency), while on machine B with a slow CPU (say 1 GFLOPS Whetstone) and a fast memory system, it might also do 1 GFLOPS (100% efficiency). A and B would report raw_flops_per_au(J) as 10 GFLOPS and 1 GFLOPS respectively.

In the absence of FLOPS counting, we can't estimate the true FLOPS/app_unit. The best we can do is to use raw_flops_per_au(J) from the most efficient hosts. So what we will do is to use an order statistic, say the 5th percentile value, from the distribution of raw_flops_per_au(J). We'll call this est_flops_per_au(app).

Notes:

1) Typically GPUs are much less efficient than CPUs. If a large fraction of jobs (say, >95%) are being done by GPUs, then the 5th percentile value won't give us a good estimate. To deal with this we could maintain separate arrays for GPU and CPU app versions, and take the min of the two 5th-percentile values.

2) If an application has only very inefficient app versions, we'll get an inflated estimate of FLOPS/app_unit. Assuming that GPU apps tend to be more inefficient than CPU apps, this will happen with projects that have only a GPU app (e.g., GPUGRID.net). Here's a scheme to address this problem:

a) the server sends the client est_flops_per_au, and a flag indicating whether the app has both GPU and CPU versions

b) If a client has an app version V for an app with only GPU versions, and it also has app versions V1...Vn for apps that have both CPU and GPU versions, then let

raw_flops_per_au(J) = average of the est_flops_per_au for V1...Vn

otherwise raw_flops_per_au(J) = peak_flops_per_au(J)

The net effect is to propagate efficiency estimates between projects.

3) If an app's jobs have variance that is not reflected in app units, est_flops_per_au may be way too low (suppose, e.g., that 5% of jobs end immediately). A solution to this problem is to define a "reference job" that's executed on all machines, and use only that job to compute est_flops_per_au.

4) A job's elapsed time is influenced by factors such as non-BOINC jobs and overcommitment by BOINC. These can potentially lead to overestimates of FLOPS per app unit. However, this effect will be minimal if a reasonable fraction (> 5%) of hosts don't run non-BOINC jobs and don't have overcommitment.

Credit

Grant each validated job credit proportional to

app.mean_app_units * app.est_flops_per_au

All jobs for a given app get the same credit. A user doesn't get more credit for a longer-than-average job, but it averages out in the end.

This removes the incentive for hackers to report exaggerated job statistics (say, app units).

However, it introduces a new form of credit cheating: "cherry picking", i.e. completing only short jobs (based on input files, or on trial execution) and aborting or discarding others.

This can be discouraged by server mechanisms:

  • reducing the number of jobs/day when a job is aborted or times out.
  • resend jobs (Richard Haselgrove's idea: don't always resend; its expensive. Do it randomly, or when a host is under suspicion).

Implementation notes

Protocol

Request message:

  • for each reported job
    • <app_units>
    • <raw_flops_per_au>
  • for each app version
    • <seconds_per_app_unit>
    • <raw_flops_per_sec>

Reply message:

  • for each app
    • <est_flops_per_au>
  • for each app version
  • for each job
    • <peak_flops_per_sec>

Database

app.est_flops_per_au

Transitioner

The transitioner maintains, for each app, an array of the N most recent values of raw_flops_per_au(J) for that app. When this array is full, the transitioner computes the 5th percentile value of the array. It then exponentially averages this value with app.est_flops_per_au.

client

For each app version, maintain

  • seconds_per_app_unit: dynamic estimate of elapsed seconds per app unit

When a job completes, update this as an exponential average. This replaces "duration correction factor".

Anonymous platform notes