| 41 | | |
| 42 | | == Major projects == |
| 43 | | |
| 44 | | === Handle heterogeneous GPUs === |
| 45 | | |
| 46 | | Currently BOINC requires that all GPUs of a given vendor (NVIDIA, ATI, Intel) be similar, |
| 47 | | and it treats them as a single pool |
| 48 | | (i.e. jobs are not associated with a particular GPU instance). |
| 49 | | This model has a number of drawbacks on machines with multiple different GPUs. |
| 50 | | |
| 51 | | Change the model so that each GPU is treated separately. |
| 52 | | This will require extensive changes to the client, scheduler, and RPC protocol. |
| 53 | | |
| 54 | | === Eliminate O(n!^2) algorithms === |
| 55 | | |
| 56 | | The client's job scheduler has several O(N!^2) algorithms, |
| 57 | | where N is the number of jobs queued on the client. |
| 58 | | These cause the client to use a lot of CPU time when N is large (e.g. 1,000). |
| 59 | | Change these to O(N log N). |
| 60 | | |
| 61 | | === Automated testing of BOINC === |
| 62 | | |
| 63 | | Help us add unit tests to the BOINC code, |
| 64 | | and to design end-to-end tests that exercise the entire system |
| 65 | | under a range of use cases and error conditions. |
| 66 | | |
| 67 | | === Accelerating batch completion === |
| 68 | | |
| 69 | | Volunteer computing resources are unreliable - computers fail, |
| 70 | | people uninstall BOINC, and so on. |
| 71 | | Roughly 5% of jobs fail or time out. |
| 72 | | This means that in a batch of 10,000 jobs, 500 or so will fail. |
| 73 | | We retry these (after a delay of a few days) and 25 or so will fail, and so on. |
| 74 | | Thus it can take quite a long time to finish the entire batch. |
| 75 | | |
| 76 | | This problem can be solved by using more reliable computers to handle retries |
| 77 | | and jobs at the end of a batch. |
| 78 | | Doing so, however, is tricky. |
| 79 | | We have some ideas on how to |
| 80 | | [PortalFeatures prioritize batches] and [JobPrioritization prioritize jobs]. |
| 81 | | Complete these designs and implement them. |
| 82 | | |
| 83 | | === Improve app version selection === |
| 84 | | |
| 85 | | The scheduler's logic for selecting app versions is clumsy. |
| 86 | | Replace it with logic that, at the start of a request, |
| 87 | | selects a version for each (app, resource type) |
| 88 | | and stores these in an array. |
| 89 | | |
| 90 | | === Remodel the preferences system === |
| 91 | | Details are [wiki:PrefsRemodel here]. |
| 92 | | |
| 93 | | === Dynamic deadline adjustment === |
| 94 | | |
| 95 | | Currently, when the scheduler sends a job to the client, |
| 96 | | the job has a fixed deadline. |
| 97 | | If the job hasn't been completed and reported to the scheduler by then, |
| 98 | | the server will generate a new instance of the job. |
| 99 | | In some cases this is wasteful. |
| 100 | | If the client is 90% finished with the job by the deadline, |
| 101 | | it may be better to let it finish than to create a new instance. |
| 102 | | The proposal, in general terms: |
| 103 | | * Have the client report the status (fraction done and elapsed time) of in-progress jobs. |
| 104 | | * Allow the scheduler to extend the deadlines of jobs under some conditions. |
| 105 | | |
| | 143 | |
| | 144 | == Major projects == |
| | 145 | |
| | 146 | === Handle heterogeneous GPUs === |
| | 147 | |
| | 148 | Currently BOINC requires that all GPUs of a given vendor (NVIDIA, ATI, Intel) be similar, |
| | 149 | and it treats them as a single pool |
| | 150 | (i.e. jobs are not associated with a particular GPU instance). |
| | 151 | This model has a number of drawbacks on machines with multiple different GPUs. |
| | 152 | |
| | 153 | Change the model so that each GPU is treated separately. |
| | 154 | This will require extensive changes to the client, scheduler, and RPC protocol. |
| | 155 | |
| | 156 | === Eliminate O(n!^2) algorithms === |
| | 157 | |
| | 158 | The client's job scheduler has several O(N!^2) algorithms, |
| | 159 | where N is the number of jobs queued on the client. |
| | 160 | These cause the client to use a lot of CPU time when N is large (e.g. 1,000). |
| | 161 | Change these to O(N log N). |
| | 162 | |
| | 163 | === Automated testing of BOINC === |
| | 164 | |
| | 165 | Help us add unit tests to the BOINC code, |
| | 166 | and to design end-to-end tests that exercise the entire system |
| | 167 | under a range of use cases and error conditions. |
| | 168 | |
| | 169 | === Accelerating batch completion === |
| | 170 | |
| | 171 | Volunteer computing resources are unreliable - computers fail, |
| | 172 | people uninstall BOINC, and so on. |
| | 173 | Roughly 5% of jobs fail or time out. |
| | 174 | This means that in a batch of 10,000 jobs, 500 or so will fail. |
| | 175 | We retry these (after a delay of a few days) and 25 or so will fail, and so on. |
| | 176 | Thus it can take quite a long time to finish the entire batch. |
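
The arithmetic above can be checked with a short sketch. It assumes each round of retries fails independently at the same rate, which is a simplification; the function names are hypothetical and not part of BOINC.

```python
def expected_failures_per_round(batch_size, failure_rate, rounds):
    """Expected number of jobs still unfinished after each round.

    Assumes every round fails independently at the same rate
    (a simplifying assumption, not a measured property of BOINC).
    """
    return [batch_size * failure_rate ** k for k in range(1, rounds + 1)]


def rounds_until_done(batch_size, failure_rate):
    """Number of rounds until fewer than one job is expected to remain."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    remaining = float(batch_size)
    rounds = 0
    while remaining >= 1.0:
        remaining *= failure_rate  # jobs expected to fail this round
        rounds += 1
    return rounds


if __name__ == "__main__":
    print(expected_failures_per_round(10_000, 0.05, 3))
    print(rounds_until_done(10_000, 0.05))
```

Under these assumptions a batch of 10,000 jobs needs the initial round plus three rounds of retries, and every retry round adds its own delay of a few days.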
| | 177 | |
| | 178 | This problem can be solved by using more reliable computers to handle retries |
| | 179 | and jobs at the end of a batch. |
| | 180 | Doing so, however, is tricky. |
| | 181 | We have some ideas on how to |
| | 182 | [PortalFeatures prioritize batches] and [JobPrioritization prioritize jobs]. |
| | 183 | Complete these designs and implement them. |
| | 184 | |
| | 185 | === Improve app version selection === |
| | 186 | |
| | 187 | The scheduler's logic for selecting app versions is clumsy. |
| | 188 | Replace it with logic that, at the start of a request, |
| | 189 | selects a version for each (app, resource type) and stores these in an array. |
| | 190 | |
| | 191 | === Remodel the preferences system === |
| | 192 | Details are [wiki:PrefsRemodel here]. |
| | 193 | |
| | 194 | === Dynamic deadline adjustment === |
| | 195 | |
| | 196 | Currently, when the scheduler sends a job to the client, the job has a fixed deadline. |
| | 197 | If the job hasn't been completed and reported to the scheduler by then, |
| | 198 | the server will generate a new instance of the job. |
| | 199 | In some cases this is wasteful. |
| | 200 | If the client is 90% finished with the job by the deadline, |
| | 201 | it may be better to let it finish than to create a new instance. |
| | 202 | The proposal, in general terms: |
| | 203 | * Have the client report the status (fraction done and elapsed time) of in-progress jobs. |
| | 204 | * Allow the scheduler to extend the deadlines of jobs under some conditions. |
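
One possible shape for such a policy is sketched below. It is only an illustration: the `JobStatus` type, the 90% threshold, and the one-day cap are assumptions made for the example, not existing BOINC data structures or settings. The remaining time is projected linearly from the reported fraction done and elapsed time.

```python
from dataclasses import dataclass


@dataclass
class JobStatus:
    """Status of an in-progress job as reported by the client (hypothetical)."""
    fraction_done: float  # between 0.0 and 1.0
    elapsed_time: float   # seconds of run time so far


def estimated_remaining(status):
    """Project remaining run time linearly from progress so far."""
    if status.fraction_done <= 0.0:
        return float("inf")
    return status.elapsed_time * (1.0 - status.fraction_done) / status.fraction_done


def deadline_extension(status, min_fraction_done=0.9, max_extension=86_400.0):
    """Return the number of seconds to extend the deadline, or 0.0 for none.

    The thresholds are illustrative defaults, not values taken from BOINC.
    """
    if status.fraction_done < min_fraction_done:
        return 0.0  # too little progress: creating a new instance is cheaper
    remaining = estimated_remaining(status)
    if remaining > max_extension:
        return 0.0  # job would still take too long to be worth waiting for
    return remaining


if __name__ == "__main__":
    print(deadline_extension(JobStatus(fraction_done=0.9, elapsed_time=9_000.0)))
    print(deadline_extension(JobStatus(fraction_done=0.5, elapsed_time=9_000.0)))
```

A real policy would probably also weigh the host's reliability and how close the batch is to completion; this sketch only covers the "90% finished" case described above.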
| | 205 | |