| 1 | = Homogeneous app version = |
| 2 | |
| 3 | BOINC's [HomogeneousRedundancy homogeneous redundancy] (HR) mechanism lets you |
| 4 | specify that multiple instances of a job must be run on hosts |
| 5 | whose CPU and OS type are similar, |
| 6 | to ensure that correct results are identical or |
| 7 | sufficiently similar to compare. |
| 8 | |
| 9 | The HR mechanism doesn't handle GPU app versions; |
| 10 | e.g. it can't prevent situations where one instance is |
| 11 | run with a GPU app version and another instance is run with a CPU app version. |
| 12 | |
| 13 | We considered adding GPU info to the HR mechanism. |
| 14 | This turned out to be infeasible. |
| 15 | |
| 16 | Instead, Kevin Reed and I propose adding a new mechanism called '''homogeneous app version''' (HAV), |
| 17 | which ensures that instances of a given job are run using the same app version |
| 18 | (e.g., Win32/CUDA). |
| 19 | This can be specified on a per-application basis. |
| 20 | |
| 21 | Notes: |
| 22 | |
| 23 | * You can use this together with HR. |
| 24 | * Use this only when you're sure that all app versions are correct, |
| 25 | since it eliminates cross-checking between versions. |
| 26 | |
| 27 | == Implementation notes == |
| 28 | |
| 29 | New DB fields: |
| 30 | |
| 31 | * APP::homogeneous_app_version (bool) |
| 32 | * WORKUNIT::app_version_id (int) |
| 33 | |
| 34 | The latter is maintained like wu.hr_class: |
| 35 | it's set when we first dispatch an instance of the job, |
| 36 | and it's cleared if all instances error out. |
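
The bookkeeping for wu.app_version_id might look roughly like the sketch below, written in Python for brevity rather than in the scheduler's own language. The `Workunit` class, the two helper functions, and the convention that 0 means "not yet committed" are assumptions made for illustration; they are not taken from the BOINC source. In practice this would apply only to apps with homogeneous_app_version set.

```python
from dataclasses import dataclass


@dataclass
class Workunit:
    # Assumed convention: 0 means the job is not committed to any app version.
    app_version_id: int = 0


def on_dispatch(wu: Workunit, app_version_id: int) -> None:
    """Commit the job to an app version when its first instance is dispatched."""
    if wu.app_version_id == 0:
        wu.app_version_id = app_version_id


def on_all_instances_errored(wu: Workunit) -> None:
    """Release the commitment so a later dispatch can choose a version again."""
    wu.app_version_id = 0


wu = Workunit()
on_dispatch(wu, 7)   # first instance: commits the job to version 7
on_dispatch(wu, 9)   # later instance: the existing commitment is kept
committed = wu.app_version_id
on_all_instances_errored(wu)
released = wu.app_version_id
```

Here `committed` holds the version chosen at first dispatch, and `released` shows the field cleared again after every instance has failed.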
| 37 | |
| 38 | Change to best_app_version(): |
| 39 | {{{ |
| 40 | if app.homogeneous_app_version and wu.app_version_id |
| 41 |     check if this host supports the app version's platform |
| 42 |     if app version has plan class, check if host can handle it |
| 43 |     check if we need work for the resource type |
| 44 | }}} |
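
The three checks above can be sketched as follows, again in Python for brevity. The `AppVersion` and `Host` classes, their field names, the "nvidia" resource label, and the function `committed_version_usable` are hypothetical; the sketch also assumes that a host failing any check is simply not sent the job, which the pseudocode does not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class AppVersion:
    id: int
    platform: str
    plan_class: str = ""   # empty string: the version has no plan class
    resource: str = "cpu"  # resource type the version runs on


@dataclass
class Host:
    platforms: set
    plan_classes: set = field(default_factory=set)  # plan classes the host can handle
    needs_work: set = field(default_factory=set)    # resource types it is requesting work for


def committed_version_usable(host: Host, av: AppVersion) -> bool:
    """Apply the three checks to the app version the job is committed to."""
    if av.platform not in host.platforms:
        return False
    if av.plan_class and av.plan_class not in host.plan_classes:
        return False
    return av.resource in host.needs_work


cuda = AppVersion(id=7, platform="windows_intelx86", plan_class="cuda", resource="nvidia")
gpu_host = Host({"windows_intelx86"}, {"cuda"}, {"nvidia"})
cpu_host = Host({"windows_intelx86"}, set(), {"cpu"})

gpu_ok = committed_version_usable(gpu_host, cuda)
cpu_ok = committed_version_usable(cpu_host, cuda)
```

In this example the GPU host passes all three checks, while the CPU-only host is rejected at the plan class check, so it would never run an instance of a job already committed to the CUDA version.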
| 45 | |
| 46 | In some cases this may result in using a non-optimal app version; |
| 47 | e.g. we might use a CUDA 2.0 version for a host capable of running CUDA 2.3. |
| 48 | So be it. |
| 49 | |
| 50 | It's possible that the shared-memory job cache could get clogged up |
| 51 | with jobs already committed to rare app versions. |
| 52 | I don't have a plan for dealing with this. |