Changes between Version 2 and Version 3 of CreditNew
- Timestamp: Oct 30, 2009, 3:54:58 PM
= New credit system design =

== Peak FLOPS and efficiency ==

BOINC estimates the peak FLOPS of each processor.
For CPUs, this is the Whetstone benchmark score.
For GPUs, it's given by a manufacturer-supplied formula.

…
is the ratio of actual FLOPS to peak FLOPS.

GPUs typically have a much higher (50-100X) peak speed than CPUs.
However, application efficiency is typically lower
(very roughly, 10% for GPUs, 50% for CPUs).

== Credit system goals ==

Some possible goals in designing a credit system:

 * Device neutrality: similar jobs should get similar credit regardless of what processor or GPU they run on.

 * Project neutrality: different projects should grant about the same amount of credit per day for a given host.

It's easy to show that both goals can't be satisfied simultaneously
when there is more than one type of processing resource.

== The first credit system ==

In the first iteration of BOINC's credit system,
"claimed credit" was defined as
{{{
C1 = H.whetstone * J.cpu_time
}}}
There were then various schemes for taking the average or min
of the claimed credit of the replicas of a job,
and using that as the "granted credit".

We call this system "Peak-FLOPS-based" because
…

The problem with this system is that, for a given app version,
efficiency can vary widely between hosts.
In the above example,
the 10 GFLOPS host would claim 10X as much credit,
and its owner would be upset when it was granted only a tenth of that.

Furthermore, the credits granted to a given host for a
series of identical jobs could vary widely,
depending on the host it was paired with by replication.
This seemed arbitrary and unfair to users.
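
To make the efficiency problem concrete, here is a minimal sketch (C++, not code from BOINC) of the Peak-FLOPS-based claim. The only formula taken from the text is C1 = H.whetstone * J.cpu_time; the 1 GFLOPS benchmark, the 1000-second runtime, and the assumption that both replicas need the same CPU time are hypothetical, chosen to reproduce the 10X disparity described above.

{{{
#include <algorithm>
#include <cstdio>

// Claimed credit under the first credit system:
//   C1 = H.whetstone * J.cpu_time
// (any cobblestone-style scaling constant is ignored; only ratios matter here)
double claimed_credit_v1(double whetstone_flops, double cpu_time) {
    return whetstone_flops * cpu_time;
}

int main() {
    // Hypothetical replica pair for one job: a 1 GFLOPS host and a 10 GFLOPS
    // host that both need 1000 s of CPU time (the faster host runs the app
    // at proportionally lower efficiency, so the actual FLOPs are identical).
    double c_slow = claimed_credit_v1(1e9, 1000);    // 1e12
    double c_fast = claimed_credit_v1(10e9, 1000);   // 1e13: a 10X larger claim

    // A "min of the replicas" granting scheme gives both hosts the smaller
    // claim, so the faster host is granted a tenth of what it claimed.
    double granted = std::min(c_slow, c_fast);
    printf("claims %.0f and %.0f, granted %.0f\n", c_slow, c_fast, granted);
    return 0;
}
}}}

The two claims differ by 10X even though the actual work is the same; whichever way the replica claims are combined, one of the two owners ends up unhappy.
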
== The second credit system ==

We then switched to the philosophy that
credit should be proportional to the number of FLOPs actually performed
by the application.
…
SETI@home had an application that allowed counting of FLOPs,
and they adopted this system.
They added a scaling factor so that the average credit per job
was the same as in the first credit system.

Not all projects could count FLOPs, however.
…

 * It didn't address GPUs.
 * Projects that couldn't count FLOPs still had device neutrality problems.
 * It didn't prevent credit cheating when single replication was used.

…
change code, settings, etc.

 * Device neutrality

 * Limited project neutrality: different projects should grant
…

== Peak FLOP Count (PFC) ==

This system goes back to the Peak-FLOPS-based approach,
but addresses its problems in a new way.

When a job is issued to a host, the scheduler specifies usage(J,D),
J's usage of processing resource D:
how many CPUs and how many GPUs (possibly fractional).

If the job is finished in elapsed time T,
…
(e.g., a CPU job that does lots of disk I/O)
PFC() won't reflect this. That's OK.
The key thing is that BOINC reserved the device for the job,
whether or not the job used it efficiently.
 * usage(J,D) may not be accurate; e.g., a GPU job may take
   more or less CPU than the scheduler thinks it will.
…
For now, though, we'll just use the scheduler's estimate.

The idea of the system is that granted credit for a job J is proportional to PFC(J),
but is normalized in the following ways:

== Cross-version normalization ==

…
find the minimum X,
then scale each app version's jobs by (X/PFC*(V)).
The result is called "Version-Normalized Peak FLOP Count", or VNPFC(J).

Notes:
…
or new app versions are deployed.
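
A sketch of how PFC and the cross-version scaling above might be computed (C++; the struct and function names are illustrative, not BOINC's). The definition of PFC itself falls in a part of the page not shown here, so the pfc() function assumes PFC(J) = T * sum over devices D of usage(J,D) * peak_flops(D), which is what the surrounding text describes.

{{{
#include <algorithm>
#include <map>
#include <vector>

// One processing resource (CPU or a GPU type) allocated to a job.
struct Usage {
    double count;       // usage(J,D): devices used, possibly fractional
    double peak_flops;  // peak FLOPS of one device of this type
};

// Assumed definition: PFC(J) = T * sum of usage(J,D) * peak_flops(D).
double pfc(double elapsed_time, const std::vector<Usage>& usage) {
    double peak = 0;
    for (const Usage& u : usage) {
        peak += u.count * u.peak_flops;
    }
    return elapsed_time * peak;
}

// Cross-version normalization: find the minimum X of the per-version
// averages PFC*(V), then scale a version's jobs by X/PFC*(V).
// pfc_avg maps app version ID to the recent average PFC of its jobs.
double vnpfc(double pfc_j, int version_id,
             const std::map<int, double>& pfc_avg) {
    double x = pfc_avg.at(version_id);
    for (const auto& kv : pfc_avg) {
        x = std::min(x, kv.second);
    }
    return pfc_j * (x / pfc_avg.at(version_id));
}
}}}

The effect is that a version with a large average PFC (e.g., an inefficient GPU version) has its jobs scaled down toward the most efficient version, so credit tracks the work done rather than the peak speed reserved.
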
== Cross-project normalization ==

If an application has both CPU and GPU versions,
…

The solution to this is: if an app has only GPU versions,
then we scale its granted credit by the average scaling factor
for that GPU type among projects that
do have both CPU and GPU versions.
This factor is obtained from a central BOINC server.

Notes:
…

== Host normalization ==

For a given application, all hosts should get the same average granted credit per job.
To ensure this, for each application A we maintain the average VNPFC*(A),
and for each host H we maintain VNPFC*(H, A).
The "claimed credit" for a given job J is then
{{{
VNPFC(J) * (VNPFC*(A)/VNPFC*(H, A))
}}}

Notes:
 * VNPFC* is averaged over jobs, not hosts.
 * Both averages are exponential recent averages,
   so that they respond to changes in job sizes and app version characteristics.
 * This assumes that all hosts are sent the same distribution of jobs.
   There are two situations where this is not the case:
   a) job-size matching, and b) GPUGrid.net's scheme for sending
   some (presumably larger) jobs to GPUs with more processors.
   To deal with this, we can weight jobs by workunit.rsc_flops_est.

== Replication and cheating ==

…
by claiming excessive credit
(i.e., by falsifying benchmark scores or elapsed time).
An exaggerated claim will increase VNPFC*(H, A),
causing subsequent claimed credit to be scaled down proportionately.
This means that no special cheat-prevention scheme
…

 * One-time cheats (like claiming 1e304) can be prevented by
   capping VNPFC(J) at some multiple (say, 10) of VNPFC*(A).
 * Cherry-picking: suppose an application has two types of jobs,
   which run for 1 second and 1 hour respectively.
   Clients can figure out which is which, e.g. by running a job for 2 seconds
   and seeing if it has exited.
   Suppose a client systematically refuses the 1-hour jobs
   (e.g., by reporting a crash or never reporting them).
   Its VNPFC*(H, A) will quickly decrease,
   and soon it will be getting several thousand times more credit
   per unit of actual work than other hosts!
   Countermeasure:
   whenever a job errors out, times out, or fails to validate,
   set the host's error rate back to the initial default,
   and set its VNPFC*(H, A) to VNPFC*(A) for all apps A.
   This puts the host in a state where several dozen of its
   subsequent jobs will be replicated.

== Implementation ==

Database changes:

New table "host_app_version":
{{{
int host_id;
int app_version_id;
double avg_vnpfc;      // recent average
int njobs;
double total_vnpfc;
}}}

New fields in "app_version":
{{{
double avg_vnpfc;
int njobs;
double total_vnpfc;
}}}

New field in "app":
{{{
double min_avg_vnpfc;  // min value of app_version.avg_vnpfc
}}}
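
To show how the new fields might be used, here is a sketch of the claimed-credit formula and the one-time-cheat cap (C++; the struct names are illustrative). avg_vnpfc stands in for the recent averages VNPFC*(A) and VNPFC*(H, A), the cap factor of 10 is the example value suggested in the cheating section, and the smoothing weight in the last function is arbitrary.

{{{
#include <algorithm>

// Recent-average VNPFC per app (VNPFC*(A)) and per host/app pair
// (VNPFC*(H, A)); these correspond to the avg_vnpfc fields above.
struct AppAvg     { double avg_vnpfc; };
struct HostAppAvg { double avg_vnpfc; };

// Claimed credit = VNPFC(J) * (VNPFC*(A) / VNPFC*(H, A)),
// with VNPFC(J) first capped at 10 * VNPFC*(A) so that a one-time
// claim of, say, 1e304 can't produce a huge grant.
double claimed_credit(double vnpfc_j, const AppAvg& app,
                      const HostAppAvg& host_app) {
    double capped = std::min(vnpfc_j, 10 * app.avg_vnpfc);
    return capped * (app.avg_vnpfc / host_app.avg_vnpfc);
}

// One possible form of the "exponential recent average": each new job's
// VNPFC moves the average by a small fraction, so it tracks changes in
// job sizes and app version characteristics.
void update_recent_avg(double& avg, double sample) {
    avg += 0.01 * (sample - avg);
}
}}}

Because an exaggerated claim also inflates VNPFC*(H, A), the ratio in claimed_credit() pushes that host's subsequent claims back down, which is the self-correcting behavior described in the cheating section.
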