= New credit system design =

== Introduction ==

We can estimate the peak FLOPS of a given processor.
For CPUs, this is the Whetstone benchmark score.
For GPUs, it's given by a manufacturer-supplied formula.

Applications access memory,
and the speed of a host's memory system is not reflected
in its Whetstone score.
So a given job might take the same amount of CPU time
on a 1 GFLOPS host as on a 10 GFLOPS host.
The "efficiency" of an application running on a given host
is the ratio of actual FLOPS to peak FLOPS.

GPUs typically have a much higher (50-100X) peak speed than CPUs.
However, application efficiency is typically lower
(very roughly, 10% for GPUs, 50% for CPUs).

== The first credit system ==

In the first iteration of the credit system, "claimed credit" was defined as
{{{
C1 = H.whetstone * J.cpu_time
}}}
There were then various schemes for taking the
average or min of the claimed credit of the
replicas of a job, and using that as the "granted credit".
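
As a rough illustration, this scheme can be sketched as follows
(the names and the credit-unit constant are hypothetical,
not the actual BOINC code):
{{{
#include <algorithm>
#include <vector>

// Sketch of the first credit system (hypothetical names).
// CREDIT_PER_GFLOPS_SEC is whatever constant defines the credit unit.
const double CREDIT_PER_GFLOPS_SEC = 1.0;  // placeholder value

// Claimed credit: peak (benchmark) speed times CPU time.
double claimed_credit_v1(double whetstone_gflops, double cpu_time_sec) {
    return whetstone_gflops * cpu_time_sec * CREDIT_PER_GFLOPS_SEC;
}

// Granted credit: e.g., the min (or average) of the replicas' claims.
double granted_credit_v1(const std::vector<double>& replica_claims) {
    return *std::min_element(replica_claims.begin(), replica_claims.end());
}
}}}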

We call this system "Peak-FLOPS-based" because
it's based on the CPU's peak performance.

The problem with this system is that, for a given app version,
efficiency can vary widely.
In the above example,
the 10 GFLOPS host would claim 10X as much credit,
and its owner would be upset when it was granted
only a tenth of that.

Furthermore, the credit granted to a given host for a
series of identical jobs could vary widely,
depending on the host it was paired with by replication.

So host neutrality was achieved,
but in a way that seemed arbitrary and unfair to users.

== The second credit system ==

To address the problems with host neutrality,
we switched to the philosophy that
credit should be proportional to the number of FLOPs actually performed
by the application.
We added API calls to let applications report this.
We call this approach "Actual-FLOPs-based".

SETI@home had an application that allowed counting of FLOPs,
and they adopted this system.
They added a scaling factor so that the average credit
was about the same as in the first credit system.

Not all projects could count FLOPs, however.
So SETI@home published their average credit per CPU second,
and other projects continued to use benchmark-based credit,
but multiplied it by a scaling factor to match SETI@home's average.

This system had several problems:

* It didn't address GPUs.
* Projects that couldn't count FLOPs still had the host-neutrality problem.
* It didn't address single replication.

== Goals of the new (third) credit system ==

* Completely automate credit - projects don't have to
change code, settings, etc.

* Device neutrality: similar jobs should get similar credit
regardless of what processor or GPU they run on.

* Limited project neutrality: different projects should grant
about the same amount of credit per CPU hour,
averaged over hosts.
Projects with GPU apps should grant credit in proportion
to the efficiency of the apps.
(This means that projects with efficient GPU apps will
grant more credit on average. That's OK.)

== Peak FLOP Count (PFC) ==

This system uses the Peak-FLOPS-based approach,
but addresses its problems in a new way.

When a job is issued to a host, the scheduler specifies usage(J,D),
J's usage of processing resource D:
how many CPUs, and how many GPUs (possibly fractional).

If the job is finished in elapsed time T,
we define peak_flop_count(J), or PFC(J), as
{{{
PFC(J) = T * sum over devices D (usage(J, D) * peak_flop_rate(D))
}}}
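
For concreteness, here is a minimal sketch of this computation
(the type and function names are hypothetical, not the actual scheduler code):
{{{
#include <vector>

// A processing resource D used by job J (hypothetical type).
struct ResourceUsage {
    double usage;           // usage(J, D): instances used (may be fractional)
    double peak_flop_rate;  // peak FLOPS of device D
};

// PFC(J) = T * sum over devices D of usage(J, D) * peak_flop_rate(D)
double peak_flop_count(double elapsed_time, const std::vector<ResourceUsage>& devices) {
    double peak_flops = 0;
    for (const ResourceUsage& d : devices) {
        peak_flops += d.usage * d.peak_flop_rate;
    }
    return elapsed_time * peak_flops;
}
}}}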

Notes:

* We use elapsed time instead of actual device time (e.g., CPU time).
If a job uses a resource inefficiently
(e.g., a CPU job that does lots of disk I/O)
PFC() won't reflect this. That's OK.
* usage(J,D) may not be accurate; e.g., a GPU job may take
more or less CPU than the scheduler thinks it will.
Eventually we may switch to a scheme where the client
dynamically determines the CPU usage.
For now, though, we'll just use the scheduler's estimate.

The idea of the system is that granted credit for a job J
is proportional to PFC(J),
but is normalized in the following ways:

== Version normalization ==

If a given application has multiple versions (e.g., CPU and GPU versions),
the average granted credit should be the same for each version.
The adjustment is always downwards:
we maintain the average PFC*(V) of PFC() for each app version V,
find the minimum X,
then scale each app version's jobs by (X/PFC*(V)).
The result is called NPFC(J).

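A minimal sketch of this normalization, assuming hypothetical names
for the averages (not the actual server code):
{{{
// pfc_avg_v:    recent average of PFC() over jobs of app version V  (PFC*(V))
// min_pfc_avg:  the minimum of PFC*(V) over all versions of the app (X)
// The scaling is always downwards, since min_pfc_avg <= pfc_avg_v.
double version_normalized_pfc(double pfc_j, double pfc_avg_v, double min_pfc_avg) {
    return pfc_j * (min_pfc_avg / pfc_avg_v);   // NPFC(J)
}
}}}
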
Notes:
* This mechanism provides device neutrality.
* This addresses the common situation
where an app's GPU version is much less efficient than the CPU version
(i.e., the ratio of actual FLOPS to peak FLOPS is much lower).
To a certain extent, this mechanism shifts the system
towards the "Actual FLOPs" philosophy,
since credit is granted based on the most efficient app version.
It's not exactly "Actual FLOPs", since the most efficient
version may not be 100% efficient.
* Averages are computed as a moving average,
so that the system will respond quickly as job sizes change
or new app versions are deployed
(one possible form of such a moving average is sketched below).
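
For example, a recent average could be maintained like this
(a sketch under assumptions; the actual averaging constant and
update rule used by the server may differ):
{{{
#include <algorithm>

// A simple exponentially-weighted "recent average" (hypothetical).
struct RecentAverage {
    double avg = 0;
    double n = 0;   // effective sample count, capped to bound the window

    void update(double sample) {
        n = std::min(n + 1.0, 100.0);   // assume a window of roughly 100 samples
        avg += (sample - avg) / n;
    }
};
}}}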

== Project normalization ==

If an application has both CPU and GPU versions,
then the version normalization mechanism uses the CPU
version as a "sanity check" to limit the credit granted for GPU jobs.

Suppose a project has an app with only a GPU version,
so there's no CPU version to act as a sanity check.
If we grant credit based only on GPU peak speed,
the project will grant much more credit per GPU hour than
other projects, violating limited project neutrality.

The solution to this is: if an app has only GPU versions,
then we scale its granted credit by a factor,
obtained from a central BOINC server,
which is based on the average scaling factor
for that GPU type among projects that
do have both CPU and GPU versions.

Notes:

* Projects will run a periodic script to update the scaling factors.
* Rather than GPU type, we'll actually use plan class,
since e.g. the average efficiency of CUDA 2.3 apps may be different
from that of CUDA 2.1 apps.
* Initially we'll obtain scaling factors from large projects
that have both GPU and CPU apps (e.g., SETI@home).
Eventually we'll use an average (weighted by work done) over multiple projects;
a sketch of that computation follows.
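
A minimal sketch of such a work-weighted average, using hypothetical
names (this is not the actual script):
{{{
#include <vector>

// One sample per project that has both CPU and GPU versions of an app
// in the given plan class (hypothetical type).
struct ProjectSample {
    double scale_factor;   // that project's GPU-to-CPU credit scaling factor
    double work_done;      // weight: work done under that plan class
};

// Work-weighted average scaling factor for a plan class.
double plan_class_scale(const std::vector<ProjectSample>& samples) {
    double num = 0, denom = 0;
    for (const ProjectSample& s : samples) {
        num += s.scale_factor * s.work_done;
        denom += s.work_done;
    }
    return denom > 0 ? num / denom : 1.0;
}
}}}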

== Host normalization ==

For a given application, all hosts should get the same average granted credit per job.
To ensure this, for each application A we maintain the average NPFC*(A),
and for each host H we maintain NPFC*(H, A).
The "claimed credit" for a given job J is then
{{{
NPFC(J) * (NPFC*(A)/NPFC*(H, A))
}}}
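
A sketch of this step, with hypothetical names for the two averages
(not the actual server code):
{{{
// npfc_avg_app:  recent average of NPFC() over all jobs of application A  (NPFC*(A))
// npfc_avg_host: recent average of NPFC() over host H's jobs of A         (NPFC*(H, A))
double claimed_credit(double npfc_j, double npfc_avg_app, double npfc_avg_host) {
    return npfc_j * (npfc_avg_app / npfc_avg_host);
}
}}}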

Notes:
* NPFC* is averaged over jobs, not hosts.
* Both averages are recent averages, so that they respond to
changes in job sizes and app version characteristics.
* This assumes that all hosts are sent the same distribution of jobs.
There are two situations where this is not the case:
a) job-size matching, and b) GPUGrid.net's scheme for sending
some (presumably larger) jobs to GPUs with more processors.
To deal with this, we'll weight the average by workunit.rsc_flops_est.

== Replication and cheating ==

Host normalization mostly eliminates the incentive to cheat
by claiming excessive credit
(i.e., by falsifying benchmark scores or elapsed time).
An exaggerated claim will increase NPFC*(H,A),
causing subsequent claimed credit to be scaled down proportionately.
This means that no special cheat-prevention scheme
is needed for single replication;
granted credit = claimed credit.

For jobs that are replicated, granted credit is
set to the min of the claimed credit of the valid replicas
(min is used instead of average to remove the incentive
for cherry-picking; see below).
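
As a sketch (hypothetical names; the cap against one-time cheats is
described in the notes below):
{{{
#include <algorithm>
#include <vector>

// Granted credit for a replicated job: the min of the valid replicas'
// claims, capped at a multiple of NPFC*(A) to neutralize one-time cheats.
double granted_credit(const std::vector<double>& valid_claims, double npfc_avg_app) {
    double cap = 10 * npfc_avg_app;    // "some multiple (say, 10) of NPFC*(A)"
    double granted = *std::min_element(valid_claims.begin(), valid_claims.end());
    return std::min(granted, cap);
}
}}}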

However, there are still some possible forms of cheating.

* One-time cheats (like claiming 1e304) can be prevented by
capping NPFC(J) at some multiple (say, 10) of NPFC*(A).
* Cherry-picking: suppose an application has two types of jobs,
which run for 1 second and 1 hour respectively.
Clients can figure out which is which, e.g. by running a job for 2 seconds
and seeing if it has exited.
Suppose a client systematically refuses the 1-hour jobs
(e.g., by reporting a crash or never reporting them).
Its NPFC*(H, A) will quickly decrease,
and soon it will be getting several thousand times more credit
per unit of actual work than other hosts!
Countermeasure:
whenever a job errors out, times out, or fails to validate,
set the host's error rate back to the initial default,
and set its NPFC*(H, A) to NPFC*(A) for all apps A.
This puts the host in a state where several dozen of its
subsequent jobs will be replicated.

== Implementation ==
