Changes between Version 11 and Version 12 of CondorBoinc
- Timestamp: Mar 5, 2013, 12:23:56 AM
CondorBoinc
--- v11
+++ v12
 so that a BOINC-based volunteer computing project can provide computing resources to a Condor pool.

-A central design goal is transparency:
-from the job submitter's viewpoint,
-things should look exactly like Condor:
-i.e. they prepare Condor submit files and use condor_submit.
+A central design goal is transparency
+from the job submitter's viewpoint.

 Condor-B must address some basic differences between Condor and BOINC:
…
 A job is associated with an application, not an app version.

-== BOINC environment choices ==
+== Assumptions ==

-BOINC offers three "environments" in which applications can be deployed:
- * '''Native''':
-   This requires making source-code modifications and recompiling
-   for different platforms, linking with the BOINC API library.
- * '''Virtual machine-based''':
-   This would eliminate multi-platform issues
-   but would require volunteer hosts to have VirtualBox installed.
- * '''BOINC wrapper''':
-   Requires apps to be built for different platforms, but no source code mods.
+For simplicity, we'll assume that the BOINC project has been
+configured to run a certain set of applications
+for which jobs are commonly submitted to Condor.
+For each of these applications, admins must

-Using the BOINC wrapper is the path of least resistance at this point.
-
-The Condor pool admins will select a set of applications to run under BOINC.
-For each app, they must
-
- * Create a BOINC "application"
- * Create input and output templates
- * Compile the app for one or more platforms
- * Create BOINC "app versions", with associated job.xml files for the wrapper
-
-== Data model ==
-
-Goal: minimize data transfer and storage on the BOINC server.
-To do this, we'll add the following mechanism to BOINC:
-
- * DB tables for files, and for batch/file associations (with lease ends).
-   File names will be based on MD5s.
- * Web RPCs for querying and uploading files.
- * Daemon for deleting files and DB records of files with
-   no associations, or past all lease ends.
-
-For output files, we'll take the approach that each job has (from BOINC's viewpoint)
-a single output file, which is a zipped archive of its actual output files.
-This will get copied to the submitter host, unzipped,
-and its components moved to the appropriate directory.
+ * Create a BOINC application record
+ * Create input and output templates.
+   Note: in general, the set of input/output files, and their names,
+   must be fixed ahead of time.
+   If an application produces output files with indeterminate names,
+   it must combine these into a zip file
+   (the BOINC wrapper can do this).
+ * Build the app for one or more platforms (ways of doing this are discussed below).
+ * Create BOINC "app versions".

 == Job submission mechanism ==
…

  * A "BOINC GAHP" program: runs as a daemon process on the submit node.
-   This does the following:
-    * Handle RPCs (over pipes) from the Condor job router to
-      submit and monitor jobs.
-    * Periodically poll the BOINC server for completed jobs;
-      when a job is newly completed,
-      download its output from the BOINC server,
-      and store it into the appropriate directories on the submit node.
+   This handles RPCs (over pipes) from the Condor job router to submit and monitor jobs.
  * A new class in Condor's job_router for managing communication
    with the BOINC GAHP.
…
 === GAHP protocol ===

-The API exported by the BOINC GAHP has the following functions:
+The GAHP protocol is text-based.
+Each request and reply consists of a single line.
+
+Each of the main commands returns S (success) or E (error) depending
+on whether it was syntactically valid.
+The command takes a <req id> argument.
+
+The commands are:
+{{{
+BOINC_SUBMIT <req id> <batch name> <app name> <#jobs>
+   <job name> <#args> <arg1> <arg2> ...
+      <#input files>
+         <src path> <dst filename>
+         ...
+      ALL|<#output files>
+         <filename> ...
+   ...
+Result:
+   NULL (success) or <err msg>
+}}}
+Notes:
+ * The batch name must be unique over all submissions
+ * The output file descriptions are optional;
+   in any case, they must agree with the app's output template.
+ * As of now, <dst filename> will always be the filename part
+   of <src path>
+ * We could add a <dir> argument to prepend to input paths.

 {{{
-submit_jobs()
-   inputs:
-      batch_name (unique within project)
-      app_name
-      jobs
-         job name
-         cmdline
-         for each input file
-            path on submit node
-            name by which app will open file
-               (currently, this will always be filename part of path)
-         bool return_all_output_files (or regular expression)
-         if this is not set
-            for each output file
-               open name (what the app will create)
-   output:
-      error code
+BOINC_QUERY_BATCH <req id> <batch name>
+
+Result:
+   NULL|<err msg> <job1> <status1> ...
 }}}
-
-The BOINC GAHP handles this as follows:
-
- * Make list of all input files
- * Eliminate duplicates in file list
- * Compute MD5s of files
- * Do query_files() RPC to see which files are already on BOINC server
- * Do upload_files() RPC to copy needed files to BOINC server
- * Do submit_jobs() RPC to BOINC server; create batch, jobs
+Notes:
+ * status is either NOT_STARTED, IN_PROGRESS, DONE, or ERROR

 {{{
-query_batch
-   in: batch name
-   out: list of jobs
-      job name
-      status (done/error/in prog/not in prog)
+BOINC_FETCH_OUTPUT <req id> <job name> <dir>
+   ALL|<#files>
+      <src name> <dst name>
+      ...
+Result:
+   NULL|error_msg
 }}}
+Retrieves a job's output files.

 {{{
-retrieve_job_outputs
-   in:
-      job name
-      destination directory for output files
-      bool return_all_output_files (or regular expression)
-      if this is not set
-         for each output file
-            open name (what the app created)
-            destination name
-
-   out: status
+BOINC_ABORT_JOBS <req id> <job name> ...
+Result:
+   NULL|<err msg>
 }}}
+Abort the given jobs.

 {{{
-abort_jobs
-   in: list of job names
+BOINC_RETIRE_BATCH <req id> <batch name>
+Result:
+   NULL|<err msg>
 }}}
+Retire the given batch; its files and database records can be deleted.

 {{{
-set_lease
-   in: batch name
-      new lease end time
+BOINC_SET_LEASE <req id> <batch name> <new lease time>
+Result:
+   NULL|<err msg>
 }}}
+Set the "lease time" for a batch.
+After this time its files and database records can be deleted.

-=== BOINC Web RPCs ===
+=== Project selection and authentication ===

-{{{
-query_files()
-   in: list of physical file names
-   out: list of those not present on server
-}}}
+For the time being we'll do it this way:
+Each job submitter has a separate account on the BOINC project
+(these accounts can be assigned [MultiUser access rights and quotas]).
+The account has a private '''authenticator''' (a random string).

-{{{
-upload_files()
-   in: batch name
-      filename
-      file contents
-   out: error code
+The job submitter will create a configuration file containing
+ * the URL of the BOINC project
+ * the account authenticator

-   uploads files and creates DB records (see below)
-}}}
+The BOINC GAHP will read this configuration file at startup,
+and will handle requests using that account on that project.

-{{{
-submit_jobs()
-   in: same as for GAHP, except include both logical and physical name
-   out: error code
-}}}
+Note: we could generalize this a bit by including the
+project URL and authenticator as an argument to each GAHP request.

-=== Atomicity ===
+== Data model ==

-(We need to decide about this).
+The BOINC GAHP uses BOINC's
+[RemoteInputFiles#Content-basedfilemanagement content-based file management system]
+to manage input files.
+In this system, files are stored on the BOINC server
+with names based on their MD5.
+This provides automatic file immutability
+It minimizes server disk usage and network transfer in cases where
+a given file is used by many jobs or batches.

-=== Authentication ===
+The BOINC database stores records associating files and batches;
+a file is deleted only when it is no longer associated with any batches.

-All the above APIs will take a "credentials" argument,
-which may be either a BOINC authenticator or x.509 certificate;
-we'll need to decide this.
-Two general approaches:
+== Implementation notes ==

- * Each job submitter has a separate account on the BOINC project
-   (created ahead of time in a way TBD).
-   This is preferred because it allows BOINC to enforce quotas.
- * All jobs belong to a single BOINC account.
+The BOINC GAHP handles BOINC_SUBMIT as follows:

-== Changes to BOINC ==
+ * Do an RPC to create a "batch" record
+ * Make list of all input files; eliminate duplicates
+ * Compute MD5s of files
+ * Do an RPC to see which files are already on the BOINC server
+   and create batch/file associations for these files
+   (this avoids a race condition with the file cleanup daemon).
+ * Do an RPC to copy needed files to BOINC server,
+   and create batch/file associations for these files.
+ * Do an RPC create jobs

- * The job creation primitives (create_work()) will let you directly specify
-   the logical names of input files,
-   rather than specifying them in a template.
- * Add lease_end field to batch
+== Ways to deploy applications on BOINC ==

-== BOINC GAHP implementation notes ==
+BOINC offers three "environments" in which applications can be deployed:
+ * '''Native''':
+   This requires making source-code modifications and building the app
+   for different platforms, linking with the BOINC API library.
+ * '''BOINC wrapper''':
+   Requires apps to be built for different platforms, but no source code mods.
+ * '''Virtual machine-based''':
+   This would eliminate multi-platform issues
+   but would require volunteer hosts to have VirtualBox installed.

-The BOINC GAHP could be implemented in Python or C++.
-My inclination is to use C++.
-
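
As an illustration of the client-side part of the BOINC_SUBMIT handling listed under "Implementation notes" in Version 12 (list the input files, eliminate duplicates, compute MD5s, work out which files still need uploading), here is a minimal Python sketch. The function names, the `md5_` name prefix, and the `names_on_server` set (standing in for the reply to the "which files are already on the BOINC server" RPC) are assumptions made for illustration only; the page does not specify the actual naming scheme or RPC interface.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_uploads(input_paths, names_on_server):
    """Deduplicate input files and decide which ones need uploading.

    input_paths: source paths taken from all jobs in the batch.
    names_on_server: set of content-based names the server already has
        (hypothetical stand-in for the query RPC's reply).
    Returns (by_name, to_upload): both map content-based name -> source path.
    """
    # Eliminate duplicate paths while preserving order.
    unique_paths = list(dict.fromkeys(str(Path(p)) for p in input_paths))

    # Name each file by its MD5; identical contents collapse to one entry.
    by_name = {}
    for p in unique_paths:
        name = "md5_" + md5_of_file(p)  # hypothetical naming scheme
        by_name.setdefault(name, p)

    # Only files the server does not already hold need to be copied.
    to_upload = {n: p for n, p in by_name.items() if n not in names_on_server}
    return by_name, to_upload


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        a = Path(d) / "a.dat"
        b = Path(d) / "b.dat"
        a.write_bytes(b"same contents")
        b.write_bytes(b"same contents")
        by_name, to_upload = plan_uploads([a, b, a], set())
        print(len(by_name), "distinct file(s);", len(to_upload), "to upload")
```

In the scheme described on the page, the batch/file associations are created by the server-side RPCs, so a sketch like this would only decide what to send; it does not model the race with the file cleanup daemon.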