Changes between Initial Version and Version 1 of RuntimeEstimation

Timestamp: Apr 8, 2010, 4:06:19 PM
Author: davea
Comment: --

+= Job runtime estimation =
+== The old system ==
+Jobs have a FLOP count estimate, wu.rsc_fpops_est.
+When sending an app version to a host,
+the scheduler estimates its FLOPS.
+This is either the CPU benchmark,
+or a value assigned by the app_plan() function.
+The app_plan function is expected to predict
+the performance of an app on all possible hosts.
+The client maintains a per-project duration correction factor (DCF),
+which was intended to measure the efficiency of the project's apps,
+and the systematic error in wu.rsc_fpops_est.
+DCF was used to scale runtime estimates on both client and server side.
+Problems with the old system:
+ * Projects can have lots of apps.  A single DCF does not suffice.
+ * Projects can't be expected to predict app performance.
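+As a rough illustration of the old scheme, the sketch below scales a runtime estimate by a single per-project DCF.
+The function name and the formula (FLOP count divided by FLOPS, times DCF) are assumptions made for illustration; the text above only says that DCF was used to scale runtime estimates.

```python
def old_runtime_estimate(rsc_fpops_est: float, flops: float, dcf: float = 1.0) -> float:
    """Estimated runtime in seconds under the old system (illustrative only).

    rsc_fpops_est: the job's FLOP count estimate (wu.rsc_fpops_est)
    flops:         FLOPS estimate (CPU benchmark or app_plan() value)
    dcf:           per-project duration correction factor
    """
    if flops <= 0:
        raise ValueError("flops must be positive")
    return dcf * rsc_fpops_est / flops


# Two different apps of the same project share one DCF, so an error that is
# specific to one app is applied to the other as well.
app_a_seconds = old_runtime_estimate(3.6e12, 1e9, dcf=1.5)
app_b_seconds = old_runtime_estimate(7.2e12, 1e9, dcf=1.5)
```

+Because the same dcf value feeds both calls, a correction learned from one app distorts the estimate for the other, which is the first problem listed above.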
+== The new system ==
+Projects still have to supply wu.rsc_fpops_est.
+The new system has a large overlap with [CreditNew the new credit system].
+In particular, we now maintain:
+ * A '''host_app_version''' database record
+   per (host, app version), or per (host, app, resource type) in the case of anonymous platform.
+   This record includes the average elapsed time per wu.rsc_fpops_est.
+ * For each app version, a '''pfc_scale''' which approximates the efficiency
+   of the app version relative to the most efficient version.
+The app_plan() function now returns peak FLOPS,
+not the expected actual FLOPS.
+In the process of selecting an app version for each job,
+the scheduler estimates its actual FLOPS.
+This is stored in BEST_APP_VERSION.HOST_USAGE.flops.
+=== Regular case ===
+An app version's FLOPS estimate is initially the peak FLOPS.
+We then look at the host_app_version record.
+If it exists, and there are sufficient samples, we set
+{{{
+estimated_flops = 1/host_app_version.et.avg
+}}}
+Otherwise, if app_version.pfc_scale is defined,
+{{{
+estimated_flops *= app_version.pfc_scale
+}}}
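+The regular-case logic above can be sketched in Python as follows.
+This is a minimal illustration, not the scheduler's code: the class, the field name et_n, and the MIN_SAMPLES threshold are hypothetical, since the text does not say how many samples count as sufficient.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SAMPLES = 10  # hypothetical threshold for "sufficient samples"


@dataclass
class HostAppVersion:
    et_avg: float  # average elapsed time per unit of wu.rsc_fpops_est
    et_n: int      # number of samples behind et_avg (hypothetical field name)


def estimate_flops_regular(
    peak_flops: float,
    hav: Optional[HostAppVersion],
    pfc_scale: Optional[float],
) -> float:
    """Return the estimated actual FLOPS of an app version on a host."""
    # Start from the peak FLOPS returned by app_plan().
    estimated_flops = peak_flops
    if hav is not None and hav.et_n >= MIN_SAMPLES and hav.et_avg > 0:
        # Enough history for this (host, app version): use measured speed.
        estimated_flops = 1.0 / hav.et_avg
    elif pfc_scale is not None:
        # No usable history: scale peak FLOPS by the version's efficiency.
        estimated_flops *= pfc_scale
    return estimated_flops
```

+With no record and no pfc_scale, the function simply returns the peak FLOPS, matching the initial estimate described above.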
+=== Anonymous platform case ===
+If the host_app_version record exists and there are sufficient samples,
+{{{
+estimated_flops = 1/host_app_version.et.avg
+}}}
+Otherwise, we use the estimate supplied by the client.
+This may be specified in the app_info.xml file.
+If not, the current client passes the peak FLOPS.
+Older clients (predating GPU support) don't pass a FLOPS estimate.
+In this case we use the CPU benchmark.
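+The fallback order for the anonymous platform case can be sketched as below.
+The function and parameter names are hypothetical, as is the default min_samples value; client_flops stands for whatever FLOPS figure the client sent (from app_info.xml or its peak FLOPS), and is None for older clients that send nothing.

```python
from typing import Optional


def estimate_flops_anonymous(
    et_avg: Optional[float],
    et_samples: int,
    client_flops: Optional[float],
    cpu_benchmark_flops: float,
    min_samples: int = 10,  # hypothetical threshold for "sufficient samples"
) -> float:
    """Return the estimated FLOPS for an anonymous-platform app version."""
    # 1. Prefer measured history from the host_app_version record.
    if et_avg is not None and et_avg > 0 and et_samples >= min_samples:
        return 1.0 / et_avg
    # 2. Otherwise use the estimate supplied by the client, if any.
    if client_flops is not None:
        return client_flops
    # 3. Older clients pass no estimate: fall back to the CPU benchmark.
    return cpu_benchmark_flops
```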
+The estimated FLOPS is used to estimate job runtime on the server side.
+However, the only way to change the client's runtime estimate is by
+adjusting the wu.rsc_fpops_est that we send to the client.
+So, in the first case above (the estimate comes from the host_app_version record), we scale wu.rsc_fpops_est by
+{{{
+(old estimated flops)/(new estimated flops)
+}}}
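+A short worked example may help show why this scaling works.
+It assumes, for illustration only, that the client computes runtime as the FLOP count it receives divided by the old FLOPS estimate; the function names are hypothetical.

```python
def scaled_fpops_est(rsc_fpops_est: float, old_flops: float, new_flops: float) -> float:
    """Scale wu.rsc_fpops_est by (old estimated flops)/(new estimated flops)."""
    if old_flops <= 0 or new_flops <= 0:
        raise ValueError("FLOPS estimates must be positive")
    return rsc_fpops_est * old_flops / new_flops


def client_runtime(fpops_est_sent: float, client_flops: float) -> float:
    """Runtime the client would predict (assumed formula, see lead-in)."""
    return fpops_est_sent / client_flops


# The client still believes the old estimate of 1e10 FLOPS, while the server
# has measured 2.5e9 FLOPS from the host_app_version record.
sent = scaled_fpops_est(1e13, old_flops=1e10, new_flops=2.5e9)
client_seconds = client_runtime(sent, client_flops=1e10)
server_seconds = 1e13 / 2.5e9
```

+Under that assumption, the client's prediction from the scaled value agrees with the server's estimate of about 4000 seconds, even though the client never learns the new FLOPS figure.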