= Job runtime estimation =

== The old system ==

Jobs have a FLOP count estimate, wu.rsc_fpops_est.

When sending an app version to a host,
the scheduler estimates its FLOPS.
This is either the CPU benchmark,
or a value assigned by the app_plan() function.

The app_plan function is expected to predict
the performance of an app on all possible hosts.

The client maintains a per-project duration correction factor (DCF),
which was intended to measure the efficiency of the project's apps,
and the systematic error in wu.rsc_fpops_est.
DCF was used to scale runtime estimates on both client and server side.
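
The old estimate can be illustrated with a small Python sketch. It assumes DCF is applied as a simple multiplier on the raw estimate (FLOP count divided by estimated FLOPS). The function name and the sample numbers are hypothetical and are not taken from the BOINC source.

```python
def old_runtime_estimate(rsc_fpops_est, flops, dcf=1.0):
    """Estimated runtime in seconds under the old scheme (illustrative only)."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    # Raw estimate: FLOP count divided by the estimated FLOPS of the app
    # version, then scaled by the per-project duration correction factor
    # (assumed here to be a plain multiplier).
    return rsc_fpops_est / flops * dcf


# Hypothetical job: 3.6e12 FLOPs on a host estimated at 1e9 FLOPS.
raw = old_runtime_estimate(3.6e12, 1e9)
corrected = old_runtime_estimate(3.6e12, 1e9, dcf=1.5)
```

Because the sketch uses one DCF per project, the same multiplier is applied to every app of that project, whatever its actual efficiency.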

Problems with the old system:

* Projects can have lots of apps. A single DCF does not suffice.
* Projects can't be expected to predict app performance.

== The new system ==

Projects still have to supply wu.rsc_fpops_est.

The new system has a large overlap with [CreditNew the new credit system].
In particular, we now maintain:

* A '''host_app_version''' database record
  per (host, app version), or per (host, app, resource type) in the case of anonymous platform.
  This record includes the average elapsed time per unit of wu.rsc_fpops_est (et.avg).
* For each app version, a '''pfc_scale''' which approximates the efficiency
  of the app version relative to the most efficient version.

The app_plan() function now returns peak FLOPS,
not the expected actual FLOPS.

In the process of selecting an app version for each job,
the scheduler estimates its actual FLOPS.
This is stored in BEST_APP_VERSION.HOST_USAGE.flops.

=== Regular case ===

An app version's FLOPS estimate is initially the peak FLOPS.
We then look at the host_app_version record.
If it exists, and there are sufficient samples, we set
{{{
estimated_flops = 1/host_app_version.et.avg
}}}

Otherwise, if app_version.pfc_scale is defined, we scale the peak FLOPS:

{{{
estimated_flops *= app_version.pfc_scale
}}}
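
The regular-case rules can be sketched in Python as follows. The names ElapsedTimeStats, estimate_flops_regular, and the MIN_SAMPLES threshold are assumptions made for illustration; the text only says "sufficient samples" and does not give a number.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SAMPLES = 10  # assumed threshold for "sufficient samples"


@dataclass
class ElapsedTimeStats:
    """Hypothetical stand-in for host_app_version.et."""
    n: int       # number of samples collected
    avg: float   # average elapsed seconds per unit of wu.rsc_fpops_est


def estimate_flops_regular(peak_flops: float,
                           et: Optional[ElapsedTimeStats],
                           pfc_scale: Optional[float]) -> float:
    # Start from the peak FLOPS of the app version.
    estimated_flops = peak_flops
    # Prefer per-host statistics when the record exists and has enough samples.
    if et is not None and et.n >= MIN_SAMPLES and et.avg > 0:
        return 1.0 / et.avg
    # Otherwise fall back to the version-wide efficiency scale, if defined.
    if pfc_scale is not None:
        estimated_flops *= pfc_scale
    return estimated_flops
```

For example, with a peak of 1e10 FLOPS, no usable host record, and pfc_scale = 0.5, the sketch returns 5e9.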

=== Anonymous platform case ===

If the host_app_version record exists and there are sufficient samples, we set
{{{
estimated_flops = 1/host_app_version.et.avg
}}}

Otherwise, we use the estimate supplied by the client.
This may be specified in the app_info.xml file.
If not, the current client passes the peak FLOPS.

Older clients (predating GPU support) don't pass a FLOPS estimate.
In this case we use the CPU benchmark.
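
The fallback order for the anonymous platform case can be sketched like this in Python. The function name, its parameters, and the MIN_SAMPLES threshold are hypothetical; a missing host_app_version record is modeled as et_n = 0, and an older client that sends no estimate is modeled as client_flops = None.

```python
from typing import Optional

MIN_SAMPLES = 10  # assumed threshold for "sufficient samples"


def estimate_flops_anon(et_n: int,
                        et_avg: float,
                        client_flops: Optional[float],
                        cpu_benchmark_flops: float) -> float:
    # 1. Per-host statistics, when enough samples have been collected.
    if et_n >= MIN_SAMPLES and et_avg > 0:
        return 1.0 / et_avg
    # 2. The estimate sent by the client (from app_info.xml, or peak FLOPS).
    if client_flops is not None:
        return client_flops
    # 3. Older clients send nothing, so use the CPU benchmark.
    return cpu_benchmark_flops
```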

The estimated FLOPS is used to estimate job runtime on the server side.

However, the only way to change the client's runtime estimate is by
adjusting the wu.rsc_fpops_est that we send to the client.
So, in the first case above (the host_app_version record exists with sufficient samples),
we scale wu.rsc_fpops_est by
{{{
(old estimated flops)/(new estimated flops)
}}}
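
A short Python sketch shows why this scaling works. It assumes the client computes its runtime estimate as wu.rsc_fpops_est divided by its own (old) FLOPS figure, ignoring any other correction. The function name and the numbers are hypothetical.

```python
def scaled_rsc_fpops_est(rsc_fpops_est, old_flops, new_flops):
    """FLOP estimate to send so the client's runtime matches the server's."""
    if old_flops <= 0 or new_flops <= 0:
        raise ValueError("FLOPS values must be positive")
    return rsc_fpops_est * old_flops / new_flops


rsc_fpops_est = 1e12   # project-supplied FLOP estimate
old_flops = 1e10       # estimate the client will keep using
new_flops = 2e9        # 1/host_app_version.et.avg on the server

# Runtime the server expects, based on the new estimate.
server_runtime = rsc_fpops_est / new_flops

# Runtime the client derives from the scaled value and its old FLOPS figure.
sent = scaled_rsc_fpops_est(rsc_fpops_est, old_flops, new_flops)
client_runtime = sent / old_flops
```

Under these assumptions both sides arrive at the same runtime, even though the client never learns the new FLOPS estimate.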