 1. app versions now include avg_ncpus, coprocessor usage, and a FLOPS estimate (this defaults to the CPU benchmark, but for GPU and multithread apps it will be different). This info is sent from the server.
 1. Estimating the duration of unstarted jobs: jobs are now associated with specific app versions. The estimated duration of an unstarted job is the WU's FLOP estimate divided by the app version's FLOPS, scaled by the duration correction factor.
 1. Duration correction factor: this is now based on elapsed time (i.e. wall time during which the job has been running) rather than CPU time. Yes, this is affected by non-BOINC CPU load; that's as it should be.
 1. "CPU efficiency" is no longer maintained; it's subsumed in DCF.
 1. Estimating the duration of running jobs: this is a weighted average of static and dynamic estimates. The dynamic estimate is now based on elapsed time rather than CPU time. So if a GPU job has been running for 5 min, is 25% done, and has used 1 min of CPU, its dynamic estimate is 20 min (not 4 min).
 1. Round-robin simulation: this was modified to reflect multithread and coproc apps (e.g., if the host has 1 GPU, only one coproc app can run at a time). If CPUs are idle because coprocs are in use, this is not counted towards the CPU shortfall.
 1. scheduler_cpus() and enforce_schedule() take coprocs and avg_ncpus into account. They try to keep GPUs busy if possible.
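
The two duration estimates described above can be sketched as follows (an illustrative sketch with hypothetical names such as `wu_flop_estimate` and `dcf`; not the client's actual code):

{{{#!python
# Sketch of the job-duration estimates described above (hypothetical names,
# not the actual BOINC client code).

def estimated_duration_unstarted(wu_flop_estimate, app_version_flops, dcf):
    # Unstarted job: WU FLOP estimate / app version FLOPS, scaled by
    # the duration correction factor (DCF).
    return wu_flop_estimate / app_version_flops * dcf

def dynamic_estimate(elapsed_time, fraction_done):
    # Running job: based on elapsed (wall) time, not CPU time.
    return elapsed_time / fraction_done

# The GPU example above: 5 min elapsed, 25% done (CPU time is ignored).
# The dynamic estimate uses elapsed time, so it is 20 min, not 4 min.
print(dynamic_estimate(5.0, 0.25))  # 20.0
}}}
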
 '''Task scheduling policy''':: Of the runnable tasks, which ones to execute?
 '''Work-fetch policy''':: When to ask a project for more work, which project to ask, and how much work of each processor type to ask for?
=== Wall CPU time ===
'''Wall CPU time''' is the amount of wall-clock time a process has been runnable at the OS level. The actual CPU time may be less than this, e.g. if the process does a lot of paging, or if other (non-BOINC) processing jobs run at the same time.

BOINC uses wall CPU time as the measure of CPU resource usage. Wall CPU time is more fair than actual CPU time in the case of paging apps. In addition, the measurement of actual CPU time depends on apps to report it correctly, and they may not do this.

=== Normalized CPU time === #NormalizedCPUTime
The '''normalized CPU time''' of a result is an estimate of the wall time it will take to complete, taking into account

 * the fraction of time BOINC runs ('on-fraction')
 * the fraction of time computation is enabled ('active-fraction')
 * CPU efficiency (the ratio of actual CPU time to wall CPU time)

but ''not'' taking into account the resource shares of projects.

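These factors can be combined as in the following sketch (illustrative only; the function and parameter names are assumptions, not BOINC client code):

{{{#!python
# Hypothetical sketch: estimate the wall time to complete a result from its
# remaining CPU time, discounting by the factors listed above.

def normalized_cpu_time(remaining_cpu_time, on_fraction, active_fraction,
                        cpu_efficiency):
    # Each factor reduces the rate at which CPU time is accumulated,
    # so the expected wall time grows by the inverse of their product.
    return remaining_cpu_time / (on_fraction * active_fraction * cpu_efficiency)

# A result needing 3600 CPU seconds, with on-fraction 0.8,
# active-fraction 0.9, and CPU efficiency 1.0:
print(normalized_cpu_time(3600, 0.8, 0.9, 1.0))  # ~5000 seconds of wall time
}}}
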
=== Project-normalized CPU time ===

The '''project-normalized CPU time''' of a result is an estimate of the wall time it will take to complete, taking into account the above factors plus the project's resource share relative to other potentially runnable projects.

The 'work_req' element of a scheduler RPC request is in units of project-normalized CPU time. In deciding how much work to send, the scheduler must take into account the project's resource share fraction, and the host's on-fraction and active-fraction.

For example, suppose a host has 1 GFLOP/sec CPUs, the project's resource share fraction is 0.5, the host's on-fraction is 0.8, and the host's active-fraction is 0.9. Then the expected processing rate per CPU is

{{{
(1 GFLOP/sec) * 0.5 * 0.8 * 0.9 = 0.36 GFLOP/sec
}}}

If the host requests 1000 project-normalized CPU seconds of work, the scheduler should send it at least 360 GFLOPs of work.

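The arithmetic in this example can be checked directly (values taken from the text above):

{{{#!python
# Check of the arithmetic in the example above.
cpu_flops = 1e9                 # 1 GFLOP/sec per CPU
resource_share_fraction = 0.5
on_fraction = 0.8
active_fraction = 0.9

rate = cpu_flops * resource_share_fraction * on_fraction * active_fraction
print(rate)                     # ~3.6e8 FLOPS, i.e. 0.36 GFLOP/sec

work_req = 1000                 # project-normalized CPU seconds requested
print(rate * work_req)          # ~3.6e11 FLOPs, i.e. 360 GFLOPs
}}}
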
=== Result states ===

R is '''runnable''' if
 * neither R nor R.project is suspended, and
 * R's input files have been downloaded, and
 * R hasn't finished computing.

R is '''nearly runnable''' if
 * neither R nor R.project is suspended, and
 * none of R's input files is in a 'download deferred' state, and
 * R hasn't finished computing.

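These two predicates can be sketched as follows, using a simple dict-based representation (the field names are hypothetical, not BOINC's data structures):

{{{#!python
# Minimal sketch of the two result-state predicates above.

def runnable(r):
    return (not r["suspended"] and not r["project_suspended"]
            and r["files_downloaded"] and not r["finished"])

def nearly_runnable(r):
    return (not r["suspended"] and not r["project_suspended"]
            and not r["download_deferred"] and not r["finished"])

r = {"suspended": False, "project_suspended": False,
     "files_downloaded": False, "download_deferred": False,
     "finished": False}
print(runnable(r), nearly_runnable(r))  # False True: still downloading,
                                        # but no download is deferred
}}}
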
=== Project states ===

P is '''runnable''' if
 * P has at least one runnable result (this implies that P is not suspended).

P is '''downloading''' if
 * P is not suspended, and
 * P has at least one result whose files are being downloaded and none of the downloads is deferred.

P is '''fetchable''' (i.e. the work-fetch policy allows work to be fetched from it) if
 * P is not suspended, and
 * P is not deferred (i.e. its minimum RPC time is in the past), and
 * P's no-new-work flag is not set, and
 * P is not overworked (see definition below), and
 * a fetch of P's master file is not pending.

P is '''latency-limited''' if
 * the client's last scheduler RPC to P returned a 'no work because of deadlines' flag, and
 * the RPC reply's delay request has not yet elapsed.

This means that P has work available, but its scheduler didn't send any because the work's deadlines couldn't be met given the existing work queue.

P is '''potentially runnable''' if
 * P is either runnable, downloading, fetchable, overworked, or latency-limited.

This means that, to the best of the client's knowledge, the client could do work for P if it wanted to.

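The project-state combinators can be sketched the same way; 'potentially runnable' is simply the disjunction of the other states (hypothetical field names again):

{{{#!python
# Sketch of the 'fetchable' and 'potentially runnable' predicates above.

def fetchable(p):
    return (not p["suspended"] and not p["deferred"]
            and not p["no_new_work"] and not p["overworked"]
            and not p["master_fetch_pending"])

def potentially_runnable(p):
    return (p["runnable"] or p["downloading"] or fetchable(p)
            or p["overworked"] or p["latency_limited"])

p = {"suspended": False, "deferred": False, "no_new_work": False,
     "overworked": False, "master_fetch_pending": False,
     "runnable": False, "downloading": False, "latency_limited": False}
print(potentially_runnable(p))  # True: the project is fetchable
}}}
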
=== Debt ===

Intuitively, a project's 'debt' is how much work is owed to it, relative to other projects. BOINC uses two types of debt; each is defined for a set S of projects. In each case, the debt is recalculated periodically as follows:
 * A = the wall CPU time used by projects in S during this period
 * R = the sum of the resource shares of projects in S
 * For each project P in S:
   * F = P.resource_share / R (i.e., P's fractional resource share)
   * W = A*F (i.e., how much wall CPU time P should have gotten)
   * P.debt += W - P.wall_cpu_time (i.e., what P should have gotten minus what it actually got)
 * P.debt is normalized so that the mean or minimum is zero.

'''Short-term debt''' is used by the CPU scheduler. It is adjusted over the set of runnable projects. It is normalized so that the minimum short-term debt is zero, and the maximum short-term debt is no greater than 86,400 (i.e. one day).

'''Long-term debt''' is used by the work-fetch policy. It is defined for all projects, and adjusted over the set of potentially runnable projects. It is normalized so that the average long-term debt, over all projects, is zero.

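One recalculation period can be sketched as follows (illustrative only, with a minimum-is-zero normalization as used for short-term debt; this is not the client's actual accounting code):

{{{#!python
# Sketch of one debt-recalculation period as described above.

def update_debts(projects, wall_cpu_time):
    # projects: {name: (resource_share, debt)}
    # wall_cpu_time: {name: wall CPU time used this period}
    A = sum(wall_cpu_time.values())
    R = sum(share for share, _ in projects.values())
    debts = {}
    for name, (share, debt) in projects.items():
        F = share / R          # fractional resource share
        W = A * F              # what the project should have gotten
        debts[name] = debt + W - wall_cpu_time[name]
    # Normalize so the minimum debt is zero (short-term-debt style).
    m = min(debts.values())
    return {name: d - m for name, d in debts.items()}

# Two projects with shares 2:1; only P1 ran this period.
debts = update_debts({"P1": (2, 0.0), "P2": (1, 0.0)},
                     {"P1": 300.0, "P2": 0.0})
print(debts)  # P1 was owed 200 but got 300, so P2 ends up 200 ahead of P1
}}}
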
== Round-robin simulation ==

The CPU scheduling and work fetch policies use the results of a simulation of weighted round-robin scheduling applied to the set of nearly runnable results. The simulation takes into account on-fraction and active-fraction. It produces the following outputs:

 * deadline_missed(R): whether result R misses its deadline.
 * deadlines_missed(P): the number of results R of P for which deadline_missed(R).
 * total_shortfall: the additional normalized CPU time needed to keep all CPUs busy for the next min_queue seconds.
 * shortfall(P): the additional normalized CPU time needed for project P to keep it from running out of work in the next min_queue seconds.

In the example below, projects A and B have resource shares 2 and 1 respectively. A has results A1 and A2, and B has result B1. The computer has two CPUs. From time 0 to 4, all three results run with equal weighting. At time 4, result A2 finishes. From time 4 to 8, project A gets only a 0.5 share because it has only one result. At time 8, result A1 finishes.

In this case, shortfall(A) is 4, shortfall(B) is 0, and total_shortfall is 2.
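
These numbers can be reproduced with a small hand-built simulation. The segment schedule below is taken from the timeline described above; the accounting rule (charging a project whenever it runs fewer instances than its share of the CPUs) and the min_queue horizon of 10 are assumptions for illustration, not the client's actual code:

{{{#!python
# Hand-built timeline from the example: (duration, results running).
# 2 CPUs; A has share 2, B has share 1; min_queue horizon assumed to be 10.
NCPUS = 2
shares = {"A": 2.0, "B": 1.0}
segments = [
    (4.0, ["A1", "A2", "B1"]),   # time 0-4: all three results share 2 CPUs
    (4.0, ["A1", "B1"]),         # time 4-8: A2 done; A1 and B1 each get a CPU
    (2.0, ["B1"]),               # time 8-10: only B1 left; one CPU idle
]

total_share = sum(shares.values())
share_cpus = {p: s / total_share * NCPUS for p, s in shares.items()}

shortfall = {p: 0.0 for p in shares}
total_shortfall = 0.0
for dt, running in segments:
    # Instances each project actually occupies in this segment.
    per_project = {p: min(sum(1 for r in running if r.startswith(p)), NCPUS)
                   for p in shares}
    used = min(len(running), NCPUS)
    total_shortfall += dt * (NCPUS - used)
    for p in shares:
        shortfall[p] += dt * max(0.0, share_cpus[p] - per_project[p])

print(shortfall, total_shortfall)
# shortfall(A)=4, shortfall(B)=0, total=2 (up to float rounding)
}}}
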
 * For each processor type, the '''shortfall''': that is, the number of additional instance-seconds needed to keep all instances busy until max_buffer. In the example below, the shortfall (the gray areas) is 13.
The scheduler RPC mechanism may select a project to contact because of a user request, an outstanding trickle-up message, or a result that is overdue for reporting. If it does so, it will also request work from that project. Otherwise, the RPC mechanism chooses the project P for which

{{{
P.work_request_size > 0 and
P.long_term_debt + shortfall(P) is greatest
}}}

and requests work from that project. Note: P.work_request_size is in units of normalized CPU time, so the actual work request (which is in units of project-normalized CPU time) is P.work_request_size divided by P's resource share fraction relative to potentially runnable projects.

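The selection rule above can be sketched as follows (a minimal illustration with a hypothetical data layout, not the client's actual code):

{{{#!python
# Sketch of the work-fetch choice described above: among projects with a
# positive work request, pick the one with the largest
# long_term_debt + shortfall.

def choose_project(projects):
    # projects: {name: (work_request_size, long_term_debt, shortfall)}
    candidates = {name: ltd + sf
                  for name, (req, ltd, sf) in projects.items() if req > 0}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

projects = {
    "P1": (100.0,  500.0,   0.0),
    "P2": (200.0, -300.0, 900.0),   # largest debt + shortfall -> chosen
    "P3": (0.0,   9999.0,   0.0),   # no work request -> skipped
}
print(choose_project(projects))  # P2
}}}
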
----

== Scheduler work-send policy ==

NOTE: the following has not been implemented, and is independent of the above policies.

The scheduler should avoid sending results whose deadlines are likely to be missed, or which are likely to cause existing results to miss their deadlines. This will be accomplished as follows:
 * Scheduler requests include the connection period, a list of queued results (with estimated time remaining and deadline), and the project resource fractions.
 * The scheduler won't send results whose deadlines are less than now + min_queue.
 * The scheduler does an EDF simulation of the initial workload to determine by how much each result misses its deadline. For each result R being considered for sending, the scheduler does another EDF simulation. If R meets its deadline and no result misses its deadline by more than it did previously, R is sent.
 * If the scheduler has work but doesn't send any because of deadline misses, it returns a 'no work because of deadlines' flag. If the last RPC to a project returned this flag, the project is marked as latency-limited and accumulates LTD.

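An EDF feasibility check of this kind can be sketched as follows (simplified to a single CPU; an illustration of the rule above, not the server's implementation):

{{{#!python
# Sketch of an EDF send-decision: process queued results
# earliest-deadline-first and record by how much each misses its deadline.

def edf_miss(results, now=0.0):
    # results: list of (name, time_remaining, deadline); single CPU assumed.
    t = now
    misses = {}
    for name, remaining, deadline in sorted(results, key=lambda r: r[2]):
        t += remaining
        misses[name] = max(0.0, t - deadline)   # 0.0 means the deadline is met
    return misses

def can_send(queue, candidate):
    # Send the candidate only if it meets its own deadline and no queued
    # result misses its deadline by more than it did before.
    before = edf_miss(queue)
    after = edf_miss(queue + [candidate])
    if after[candidate[0]] > 0:
        return False
    return all(after[name] <= before[name] for name in before)

queue = [("R1", 10.0, 15.0), ("R2", 10.0, 30.0)]
print(can_send(queue, ("R3", 5.0, 22.0)))   # True: all deadlines still met
print(can_send(queue, ("R4", 20.0, 25.0)))  # False: R4 would finish too late
}}}
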
----

== Describing scenarios ==

We encourage the use of the following notation for describing scheduling scenarios (times are given in hours):

{{{
P(C, D, R)
}}}

This describes a project with

 * C = CPU time per task
 * D = delay bound
 * R = fractional resource share

A scenario is described by a list of projects, plus the following optional parameters:
 * NCPUS: number of CPUs (default 1)
 * min_queue
 * leave_in_memory
 * cpu_scheduling_period

An example scenario description is:
{{{
P1(1000, 2000, .5)
P2(1, 10, .5)
NCPUS=4
}}}

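A small parser for this notation can be sketched as follows (the notation is from the text above; the parser itself is an illustrative sketch, not part of BOINC):

{{{#!python
import re

# Parse scenario descriptions in the P(C, D, R) notation described above.
PROJECT_RE = re.compile(r"(\w+)\(([^,]+),([^,]+),([^)]+)\)")

def parse_scenario(text):
    projects, params = {}, {}
    for line in text.strip().splitlines():
        line = line.strip()
        m = PROJECT_RE.fullmatch(line)
        if m:
            name, c, d, r = m.groups()
            projects[name] = {"cpu_time": float(c), "delay_bound": float(d),
                              "resource_share": float(r)}
        elif "=" in line:
            key, value = (s.strip() for s in line.split("=", 1))
            params[key] = value
    return projects, params

projects, params = parse_scenario("""
P1(1000, 2000, .5)
P2(1, 10, .5)
NCPUS=4
""")
print(projects["P1"]["delay_bound"], params["NCPUS"])  # 2000.0 4
}}}
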
== Scenarios ==

=== Scenario 1 ===

{{{
P1(0.1, 1, .5)
P2(1, 24, .25)
P3(1, 24, .25)
NCPUS = 2
leave_in_memory = false
cpu_scheduling_period = 1
}}}

Typically one CPU will process 6-minute tasks for P1, and the other CPU will alternate between P2 and P3. It's critical that the scheduler run each task of P2 and P3 for the full CPU scheduling period. If we went strictly by debt, we'd end up switching between them every 6 minutes, and both P2 and P3 would have to resume from a checkpoint each time. For some apps (e.g. Einstein@home) resuming from a checkpoint takes several minutes, so we'd end up wasting most of the time on one CPU.