= New credit system design =

== Introduction ==

We can estimate the peak FLOPS of a given processor.
For CPUs, this is the Whetstone benchmark score.
For GPUs, it's given by a manufacturer-supplied formula.

Applications access memory,
and the speed of a host's memory system is not reflected
in its Whetstone score.
So a given job might take the same amount of CPU time
on a 1 GFLOPS host as on a 10 GFLOPS host.
The "efficiency" of an application running on a given host
is the ratio of actual FLOPS to peak FLOPS.
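
For concreteness, a minimal sketch of this efficiency definition in C++;
the numbers are purely illustrative, not measurements:

{{{
#include <cstdio>

// "Efficiency" of an app on a host: actual FLOPS / peak FLOPS.
double efficiency(double actual_flops, double peak_flops) {
    return actual_flops / peak_flops;
}

int main() {
    // Illustrative numbers only: a 10 GFLOPS (Whetstone) host running
    // an app that performs 1 GFLOPS of useful computation.
    double peak   = 10e9;
    double actual = 1e9;
    std::printf("efficiency: %.0f%%\n", 100 * efficiency(actual, peak));
    return 0;
}
}}}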

GPUs typically have a much higher (50-100X) peak speed than CPUs.
However, application efficiency is typically lower
(very roughly, 10% for GPUs, 50% for CPUs).

== The first credit system ==

In the first iteration of the credit system, "claimed credit" was defined as
{{{
C1 = H.whetstone * J.cpu_time
}}}
There were then various schemes for taking the
average or min of the claimed credit of the
replicas of a job, and using that as the "granted credit".
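
A minimal sketch of this scheme, using the min-of-replicas variant;
the struct and function names are hypothetical:

{{{
#include <algorithm>
#include <vector>

// Hypothetical records; field names mirror the formula above.
struct Host { double whetstone; };  // benchmark score (FLOPS)
struct Job  { double cpu_time; };   // seconds

// Claimed credit under the first system: C1 = H.whetstone * J.cpu_time
double claimed_credit(const Host& h, const Job& j) {
    return h.whetstone * j.cpu_time;
}

// One granting scheme: grant the min of the claims of a job's replicas.
double granted_credit(const std::vector<double>& replica_claims) {
    return *std::min_element(replica_claims.begin(), replica_claims.end());
}
}}}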

We call this system "Peak-FLOPS-based" because
it's based on the CPU's peak performance.

The problem with this system is that, for a given app version,
efficiency can vary widely between hosts.
In the above example,
the 10 GFLOPS host would claim 10X as much credit,
and its owner would be upset when it was granted
only a tenth of that.

Furthermore, the credit granted to a given host for a
series of identical jobs could vary widely,
depending on the hosts it was paired with by replication.

So host neutrality was achieved,
but in a way that seemed arbitrary and unfair to users.

== The second credit system ==

To address these problems,
we switched to the philosophy that
credit should be proportional to the number of FLOPs actually performed
by the application.
We added API calls to let applications report this.
We call this approach "Actual-FLOPs-based".

SETI@home had an application that allowed counting of FLOPs,
and they adopted this system.
They added a scaling factor so that the average credit
was about the same as in the first credit system.

Not all projects could count FLOPs, however.
So SETI@home published their average credit per CPU second,
and other projects continued to use benchmark-based credit,
but multiplied it by a scaling factor to match SETI@home's average.

This system had several problems:

* It didn't address GPUs.
* Projects that couldn't count FLOPs still had the host neutrality problem.
* It didn't address single replication
(there was no cheat-prevention when a job has only one replica).

== Goals of the new (third) credit system ==

* Completely automate credit: projects don't have to
change code, settings, etc.

* Device neutrality: similar jobs should get similar credit
regardless of what processor or GPU they run on.

* Limited project neutrality: different projects should grant
about the same amount of credit per CPU hour,
averaged over hosts.
Projects with GPU apps should grant credit in proportion
to the efficiency of the apps.
(This means that projects with efficient GPU apps will
grant more credit on average. That's OK.)
== Peak FLOP Count (PFC) ==

This system uses the Peak-FLOPS-based approach,
but addresses its problems in a new way.

When a job is issued to a host, the scheduler specifies usage(J,D),
J's usage of processing resource D:
how many CPUs and how many GPUs (possibly fractional).

If the job is finished in elapsed time T,
we define peak_flop_count(J), or PFC(J), as
{{{
PFC(J) = T * sum over devices D (usage(J, D) * peak_flop_rate(D))
}}}

Notes:

* We use elapsed time instead of actual device time (e.g., CPU time).
If a job uses a resource inefficiently
(e.g., a CPU job that does lots of disk I/O),
PFC() won't reflect this. That's OK.
* usage(J,D) may not be accurate; e.g., a GPU job may take
more or less CPU than the scheduler thinks it will.
Eventually we may switch to a scheme where the client
dynamically determines the CPU usage.
For now, though, we'll just use the scheduler's estimate.
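
A sketch of this computation; the `Device` and `Usage` structures are
assumptions for illustration, not BOINC's actual data structures:

{{{
#include <vector>

// A processing resource (the CPU, or a GPU type) and its peak speed.
struct Device { double peak_flop_rate; };  // FLOPS

// usage(J,D): how many instances of D the scheduler assigned to the job
// (possibly fractional).
struct Usage { Device device; double instances; };

// PFC(J) = T * sum over devices D of usage(J,D) * peak_flop_rate(D)
double peak_flop_count(double elapsed_time, const std::vector<Usage>& usage) {
    double total_rate = 0;
    for (const Usage& u : usage) {
        total_rate += u.instances * u.device.peak_flop_rate;
    }
    return elapsed_time * total_rate;
}
}}}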

The idea of the system is that the granted credit for a job J
is proportional to PFC(J),
but is normalized in the following ways:

== Version normalization ==

If a given application has multiple versions (e.g., CPU and GPU versions),
the average granted credit should be the same for each version.
The adjustment is always downwards:
we maintain the average PFC*(V) of PFC() for each app version V,
find the minimum X of these averages,
then scale each app version's jobs by (X/PFC*(V)).
The result is called NPFC(J).
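
A sketch of this scaling step, assuming the per-version averages PFC*(V)
are already maintained; the names are hypothetical:

{{{
#include <algorithm>
#include <vector>

struct AppVersion {
    double pfc_avg;  // PFC*(V): recent average of PFC() for this version
    double scale;    // factor applied to this version's jobs
};

// Find the minimum average X across versions (assumed non-empty), then
// scale each version by X/PFC*(V). The adjustment is always downwards.
void update_version_scales(std::vector<AppVersion>& versions) {
    double x = versions[0].pfc_avg;
    for (const AppVersion& v : versions) x = std::min(x, v.pfc_avg);
    for (AppVersion& v : versions) v.scale = x / v.pfc_avg;
}

// NPFC(J) for a job J with peak FLOP count pfc, run under version v.
double npfc(double pfc, const AppVersion& v) {
    return pfc * v.scale;
}
}}}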

Notes:
* This mechanism provides device neutrality.
* This addresses the common situation
where an app's GPU version is much less efficient than the CPU version
(i.e., the ratio of actual FLOPs to peak FLOPs is much lower).
To a certain extent, this mechanism shifts the system
towards the "Actual FLOPs" philosophy,
since credit is granted based on the most efficient app version.
It's not exactly "Actual FLOPs", since the most efficient
version may not be 100% efficient.
* Averages are computed as a moving average,
so that the system will respond quickly as job sizes change
or new app versions are deployed (see the sketch after this list).
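
The exact form of the moving average isn't specified here; one common
choice with the desired recency behavior, shown purely as an assumption,
is an exponential moving average:

{{{
// One possible "recent average": an exponential moving average.
// ALPHA is a tuning assumption, not a value from this design.
const double ALPHA = 0.01;

// Recent samples dominate, so the average tracks changes in job
// sizes and newly deployed app versions.
void update_average(double& avg, double sample) {
    avg += ALPHA * (sample - avg);
}
}}}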

== Project normalization ==

If an application has both CPU and GPU versions,
then the version normalization mechanism uses the CPU
version as a "sanity check" to limit the credit granted for GPU jobs.

Suppose a project has an app with only a GPU version,
so there's no CPU version to act as a sanity check.
If we grant credit based only on GPU peak speed,
the project will grant much more credit per GPU hour than
other projects, violating limited project neutrality.

The solution is: if an app has only GPU versions,
we scale its granted credit by a factor,
obtained from a central BOINC server,
based on the average scaling factor
for that GPU type among projects that
do have both CPU and GPU versions.
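
A sketch of how such a factor might be applied; the lookup table, its
keys, and its contents are assumptions about the eventual mechanism:

{{{
#include <map>
#include <string>

// Scaling factors fetched from a central BOINC server, keyed by plan
// class (see the notes below); both keys and values are illustrative.
std::map<std::string, double> cross_project_scale = {
    {"cuda23", 0.12},
};

// For an app with only GPU versions, scale its credit by the average
// factor seen at projects that have both CPU and GPU versions.
double scaled_credit(double credit, const std::string& plan_class) {
    auto it = cross_project_scale.find(plan_class);
    return it == cross_project_scale.end() ? credit : credit * it->second;
}
}}}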

Notes:

* Projects will run a periodic script to update the scaling factors.
* Rather than GPU type, we'll actually use plan class,
since e.g. the average efficiency of CUDA 2.3 apps may be different
from that of CUDA 2.1 apps.
* Initially we'll obtain scaling factors from large projects
that have both GPU and CPU apps (e.g., SETI@home).
Eventually we'll use an average (weighted by work done) over multiple projects.

== Host normalization ==

For a given application, all hosts should get the same average granted credit per job.
To ensure this, for each application A we maintain the average NPFC*(A),
and for each host H we maintain NPFC*(H, A).
The "claimed credit" for a given job J is then
{{{
NPFC(J) * (NPFC*(A)/NPFC*(H, A))
}}}
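
In code, assuming the two averages are maintained elsewhere
(names hypothetical):

{{{
// Claimed credit for a job J, on host H, under application A:
// NPFC(J) scaled by the ratio of the app-wide average NPFC*(A)
// to this host's average NPFC*(H, A).
double claimed_credit(double npfc_j,
                      double npfc_avg_app,     // NPFC*(A)
                      double npfc_avg_host) {  // NPFC*(H, A)
    return npfc_j * (npfc_avg_app / npfc_avg_host);
}
}}}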

Notes:
* NPFC* is averaged over jobs, not hosts.
* Both averages are recent averages, so that they respond to
changes in job sizes and app version characteristics.
* This assumes that all hosts are sent the same distribution of jobs.
There are two situations where this is not the case:
a) job-size matching, and b) GPUGrid.net's scheme for sending
some (presumably larger) jobs to GPUs with more processors.
To deal with this, we'll weight the average by workunit.rsc_fpops_est.

== Replication and cheating ==

Host normalization mostly eliminates the incentive to cheat
by claiming excessive credit
(i.e., by falsifying benchmark scores or elapsed time).
An exaggerated claim will increase NPFC*(H,A),
causing subsequent claimed credit to be scaled down proportionately.
This means that no special cheat-prevention scheme
is needed for single replication;
granted credit = claimed credit.

For jobs that are replicated, granted credit is
set to the min of the valid results
(min is used instead of average to remove the incentive
for cherry-picking; see below).

However, there are still some possible forms of cheating;
countermeasures for both are sketched after this list.

* One-time cheats (like claiming 1e304) can be prevented by
capping NPFC(J) at some multiple (say, 10) of NPFC*(A).
* Cherry-picking: suppose an application has two types of jobs,
which run for 1 second and 1 hour respectively.
Clients can figure out which is which, e.g. by running a job for 2 seconds
and seeing if it has exited.
Suppose a client systematically refuses the 1 hour jobs
(e.g., by reporting a crash or never reporting them).
Its NPFC*(H, A) will quickly decrease,
and soon it will be getting several thousand times more credit
per unit of actual work than other hosts!
Countermeasure:
whenever a job errors out, times out, or fails to validate,
set the host's error rate back to the initial default,
and set its NPFC*(H, A) to NPFC*(A) for all apps A.
This puts the host in a state where several dozen of its
subsequent jobs will be replicated.
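
A sketch of both countermeasures; the cap multiple, the default error
rate, and the data structures are assumptions consistent with the text:

{{{
#include <algorithm>
#include <map>
#include <string>

const double CAP_MULTIPLE = 10;         // cap claims at 10 * NPFC*(A)
const double DEFAULT_ERROR_RATE = 0.1;  // initial default; value assumed

struct HostAppStats { double npfc_avg; };  // NPFC*(H, A)

struct Host {
    double error_rate;  // high error rate => jobs get replicated
    std::map<std::string, HostAppStats> per_app;  // keyed by app name
};

// One-time cheats: clamp an outlandish NPFC(J) claim.
double capped_npfc(double npfc_j, double npfc_avg_app) {
    return std::min(npfc_j, CAP_MULTIPLE * npfc_avg_app);
}

// Cherry-picking: when one of the host's jobs errors out, times out,
// or fails to validate, reset its error rate and set NPFC*(H, A) to
// NPFC*(A) for all apps A, so its next several dozen jobs are replicated.
void on_job_failure(Host& h, const std::map<std::string, double>& app_npfc_avg) {
    h.error_rate = DEFAULT_ERROR_RATE;
    for (const auto& entry : app_npfc_avg) {
        h.per_app[entry.first].npfc_avg = entry.second;
    }
}
}}}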

== Implementation ==
