= Job runtime estimation =

== The old system ==

Jobs have a FLOP count estimate, wu.rsc_fpops_est.

When sending an app version to a host,
the scheduler estimates its FLOPS.
This is either the CPU benchmark,
or a value assigned by the app_plan() function.

The app_plan() function is expected to predict
the performance of an app on all possible hosts.

The client maintains a per-project duration correction factor (DCF),
which was intended to measure the efficiency of the project's apps,
and the systematic error in wu.rsc_fpops_est.
DCF was used to scale runtime estimates on both client and server side.

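As a rough illustration of how DCF entered a runtime estimate, the sketch below assumes the estimate is wu.rsc_fpops_est divided by the FLOPS estimate, then multiplied by DCF. The function name and the exact form of the formula are assumptions made for illustration; they are not taken from the BOINC source.

```python
def old_runtime_estimate(rsc_fpops_est: float, flops: float, dcf: float = 1.0) -> float:
    """Illustrative old-style runtime estimate, in seconds.

    rsc_fpops_est: the job's FLOP count estimate (wu.rsc_fpops_est)
    flops:         the scheduler's FLOPS estimate for the app version
    dcf:           per-project duration correction factor (assumed multiplicative)
    """
    if flops <= 0:
        raise ValueError("flops must be positive")
    return dcf * rsc_fpops_est / flops


# A job estimated at 1e12 FLOPs on a 1 GFLOPS host, with a DCF of 2
runtime = old_runtime_estimate(1e12, 1e9, dcf=2.0)
```

Because one DCF applies to every app in the project, a correction learned from one app skews the estimates for all the others.
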
Problems with the old system:

* Projects can have lots of apps. A single DCF does not suffice.
* Projects can't be expected to predict app performance.

== The new system ==

Projects still have to supply wu.rsc_fpops_est.

The new system has a large overlap with [CreditNew the new credit system].
In particular, we now maintain:

* A '''host_app_version''' database record
  per (host, app version), or per (host, app, resource type) in the case of anonymous platform.
  This record includes the average elapsed time per unit of wu.rsc_fpops_est (et.avg).
* For each app version, a '''pfc_scale''' that approximates the efficiency
  of the app version relative to the most efficient version.

The app_plan() function now returns peak FLOPS,
not the expected actual FLOPS.

In the process of selecting an app version for each job,
the scheduler estimates its actual FLOPS.
This is stored in BEST_APP_VERSION.HOST_USAGE.flops.

=== Regular case ===

An app version's FLOPS estimate is initially the peak FLOPS.
We then look at the host_app_version record.
If it exists and has sufficient samples, we set
{{{
estimated_flops = 1/host_app_version.et.avg
}}}

Otherwise, if app_version.pfc_scale is defined, we set
{{{
estimated_flops *= app_version.pfc_scale
}}}

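The regular-case logic can be sketched as a small function. This is a minimal sketch, not the scheduler's code: the `ElapsedTimeStats` type, the function name, and the `MIN_SAMPLES` threshold are hypothetical, since the text says only "sufficient samples". The guard against a non-positive average is also an added assumption.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SAMPLES = 10  # hypothetical threshold for "sufficient samples"


@dataclass
class ElapsedTimeStats:
    """Stand-in for host_app_version.et (hypothetical type)."""
    n: int       # number of samples collected
    avg: float   # average elapsed time per unit of wu.rsc_fpops_est


def estimate_flops_regular(
    peak_flops: float,
    et: Optional[ElapsedTimeStats],
    pfc_scale: Optional[float],
) -> float:
    # Start from the peak FLOPS returned by app_plan().
    estimated_flops = peak_flops
    # Prefer per-host statistics when the record exists and has enough samples.
    if et is not None and et.n >= MIN_SAMPLES and et.avg > 0:
        return 1.0 / et.avg
    # Otherwise scale peak FLOPS by the app version's efficiency, if known.
    if pfc_scale is not None:
        estimated_flops *= pfc_scale
    return estimated_flops
```

With no statistics at all, the function returns the peak FLOPS unchanged, which matches the initial estimate described above.
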
=== Anonymous platform case ===

If the host_app_version record exists and has sufficient samples, we set
{{{
estimated_flops = 1/host_app_version.et.avg
}}}

Otherwise, we use the estimate supplied by the client.
This may be specified in the app_info.xml file.
If not, the current client passes the peak FLOPS.

Older clients (predating GPU support) don't pass a FLOPS estimate.
In this case we use the CPU benchmark.

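The fallback order for anonymous platform can be sketched as follows. The function and parameter names are hypothetical, as is the default sample threshold; a missing client estimate (as with older clients) is modeled here as `None`.

```python
from typing import Optional


def estimate_flops_anonymous(
    et_avg: Optional[float],        # host_app_version.et.avg, or None if no record
    et_samples: int,                # number of samples behind et_avg
    client_flops: Optional[float],  # estimate passed by the client, or None
    cpu_benchmark_flops: float,     # host CPU benchmark
    min_samples: int = 10,          # hypothetical "sufficient samples" threshold
) -> float:
    # 1. Per-host statistics, when present and sufficiently sampled.
    if et_avg is not None and et_avg > 0 and et_samples >= min_samples:
        return 1.0 / et_avg
    # 2. Client-supplied estimate (from app_info.xml, or peak FLOPS).
    if client_flops is not None:
        return client_flops
    # 3. Older clients pass nothing, so fall back to the CPU benchmark.
    return cpu_benchmark_flops
```

For example, `estimate_flops_anonymous(None, 0, None, 2e9)` models an old client with no history and yields the CPU benchmark value.
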
The estimated FLOPS is used to estimate job runtime on the server side.

However, the only way to change the client's runtime estimate is by
adjusting the wu.rsc_fpops_est that we send to the client.
So, in the first case above, we scale wu.rsc_fpops_est by
{{{
(old estimated flops)/(new estimated flops)
}}}
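
To see why this ratio works, the sketch below assumes the client computes its runtime estimate as the received wu.rsc_fpops_est divided by its own (old) FLOPS estimate. The function name is hypothetical and the client-side formula is an assumption for illustration.

```python
def scaled_rsc_fpops_est(rsc_fpops_est: float, old_flops: float, new_flops: float) -> float:
    """Scale the FLOP count sent to the client by old_flops / new_flops."""
    if old_flops <= 0 or new_flops <= 0:
        raise ValueError("FLOPS estimates must be positive")
    return rsc_fpops_est * old_flops / new_flops


# The server believes the app runs at 1 GFLOPS; the client assumes 2 GFLOPS.
sent = scaled_rsc_fpops_est(1e12, old_flops=2e9, new_flops=1e9)

# Assumed client-side estimate: received FLOP count / client's FLOPS estimate.
client_runtime = sent / 2e9
# Server-side estimate: original FLOP count / new FLOPS estimate.
server_runtime = 1e12 / 1e9
```

Under that assumption the old FLOPS value cancels out, so the client's runtime estimate agrees with the server's.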