
== Major projects ==

=== Handle heterogeneous GPUs ===

Currently BOINC requires that all GPUs of a given vendor (NVIDIA, ATI, Intel) be similar,
and it treats them as a single pool
(i.e. jobs are not associated with a particular GPU instance).
This model has a number of drawbacks on machines with multiple different GPUs.

Change the model so that each GPU is treated separately.
This will require extensive changes to the client, scheduler, and RPC protocol.
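As a sketch of what "treated separately" could mean, here is a minimal, hypothetical model in which jobs are bound to individual GPU instances rather than to a per-vendor pool. All names here are illustrative, not BOINC's actual data structures:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GpuInstance:
    vendor: str          # "NVIDIA", "ATI", or "Intel"
    device_num: int      # index of this physical GPU
    peak_flops: float    # used to pick the best free device

@dataclass
class Job:
    name: str
    gpu: Optional[GpuInstance] = None   # bound at scheduling time

def assign_job(job: Job, gpus: list, busy: set) -> bool:
    """Bind a job to the fastest idle GPU instance, if any."""
    free = [g for g in gpus if g.device_num not in busy]
    if not free:
        return False
    job.gpu = max(free, key=lambda g: g.peak_flops)
    busy.add(job.gpu.device_num)
    return True
```

The point of the sketch is that a job records which physical device it runs on, so a fast and a slow GPU from the same vendor can be scheduled independently.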

=== Eliminate O(N^2) algorithms ===

The client's job scheduler has several O(N^2) algorithms,
where N is the number of jobs queued on the client.
These cause the client to use lots of CPU time when N is large (e.g. 1,000).
Change these to O(N log N).

=== Automated testing of BOINC ===

Help us add unit tests to the BOINC code,
and design end-to-end tests that exercise the entire system
under a range of use cases and error conditions.
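For example, a unit test for a small pure function might look like this (`parse_version` is a hypothetical helper, used only to show the shape of such a test):

```python
import unittest

def parse_version(s: str) -> tuple:
    """Parse a dotted version string like '7.16.3' into a tuple of ints."""
    return tuple(int(part) for part in s.split("."))

class TestParseVersion(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(parse_version("7.16.3"), (7, 16, 3))

    def test_ordering(self):
        # tuple comparison gives correct version ordering
        self.assertLess(parse_version("7.9.0"), parse_version("7.16.3"))
```

Tests in this style can be run with `python -m unittest`; end-to-end tests would additionally stand up a test project and drive the client against it.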

=== Accelerating batch completion ===

Volunteer computing resources are unreliable: computers fail,
people uninstall BOINC, and so on.
Roughly 5% of jobs fail or time out.
This means that in a batch of 10,000 jobs, 500 or so will fail.
We retry these (after a delay of a few days), and 25 or so of the retries will fail, and so on.
Thus it can take quite a long time to finish the entire batch.

This problem can be solved by using more reliable computers to handle retries
and jobs at the end of a batch.
Doing so, however, is tricky.
We have some ideas on how to
[PortalFeatures prioritize batches] and [JobPrioritization prioritize jobs].
Complete these designs and implement them.
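The arithmetic above can be captured in a few lines: with a 5% failure rate, a 10,000-job batch needs about four rounds of retries, and with a delay of a few days per round the tail alone can add weeks. This is a back-of-the-envelope model, not a simulation of the actual scheduler:

```python
def retry_rounds(batch_size: int, fail_rate: float = 0.05) -> list:
    """Expected number of jobs still failing after each round of retries."""
    rounds = []
    failing = float(batch_size)
    while failing >= 1:
        failing *= fail_rate      # each round, 5% of the retries fail again
        rounds.append(failing)
    return rounds
```

`retry_rounds(10000)` yields roughly [500, 25, 1.25, 0.06], i.e. four rounds of retries before the expected number of failures drops below one job.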

=== Improve app version selection ===

The scheduler's logic for selecting app versions is clumsy.
Replace it with logic that, at the start of a request,
selects a version for each (app, resource type) pair
and stores these in an array.
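A sketch of that structure, assuming a per-request scoring function (a dict stands in for the array, and all names are illustrative):

```python
def select_versions(app_versions: list, score) -> dict:
    """Map each (app, resource type) pair to its best-scoring version.

    Built once at the start of a scheduler request; later per-job
    lookups are then O(1) instead of rescanning all versions.
    """
    best = {}
    for av in app_versions:
        key = (av["app"], av["resource"])
        if key not in best or score(av) > score(best[key]):
            best[key] = av
    return best
```

Each job in the request then simply looks up its (app, resource type) entry instead of re-running the selection logic.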

=== Remodel the preferences system ===

Details are [wiki:PrefsRemodel here].

=== Dynamic deadline adjustment ===

Currently, when the scheduler sends a job to the client,
the job has a fixed deadline.
If the job hasn't been completed and reported to the scheduler by then,
the server will generate a new instance of the job.
In some cases this is wasteful:
if the client is 90% finished with the job by the deadline,
it may be better to let it finish than to create a new instance.
The proposal, in general terms:
* Have the client report the status (fraction done and elapsed time) of in-progress jobs.
* Allow the scheduler to extend the deadlines of jobs under some conditions.
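The extension decision might look roughly like this, estimating remaining time from the client's report; the thresholds are illustrative assumptions, not part of any agreed design:

```python
def maybe_extend_deadline(fraction_done: float, elapsed: float,
                          deadline: float, now: float,
                          min_fraction: float = 0.9,
                          max_extension: float = 86400.0):
    """Return an extended deadline, or None to let the job time out.

    The remaining time is a naive linear projection: if 90% of the
    job took `elapsed` seconds, the last 10% should take about
    elapsed / 9 seconds more.
    """
    if fraction_done < min_fraction:
        return None
    remaining = elapsed * (1.0 - fraction_done) / fraction_done
    if remaining > max_extension:
        return None
    return max(deadline, now + remaining)
```

A real design would also have to bound how often a deadline can be extended, so a stalled client cannot hold a job indefinitely.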