Changes between Version 11 and Version 12 of CondorBoinc


Timestamp:
Mar 5, 2013, 12:23:56 AM (11 years ago)
Author:
davea
Comment:

--

  • CondorBoinc

    v11 v12  
    66so that a BOINC-based volunteer computing project can provide computing resources to a Condor pool.
    77
    8 A central design goal is transparency:
    9 from the job submitter's viewpoint,
    10 things should look exactly like Condor:
    11 i.e. they prepare Condor submit files and use condor_submit.
     8A central design goal is transparency
     9from the job submitter's viewpoint.
    1210
    1311Condor-B must address some basic differences between Condor and BOINC:
     
    3129   A job is associated with an application, not an app version.
    3230
    33 == BOINC environment choices ==
     31== Assumptions ==
    3432
    35 BOINC offers three "environments" in which applications can be deployed:
    36  * '''Native''':
    37    This requires making source-code modifications and recompiling
    38    for different platforms, linking with the BOINC API library.
    39  * '''Virtual machine-based''':
    40    This would eliminate multi-platform issues
    41    but would require volunteer hosts to have VirtualBox installed.
    42  * '''BOINC wrapper''':
    43    Requires apps to be built for different platforms, but no source code mods.
     33For simplicity, we'll assume that the BOINC project has been
     34configured to run a certain set of applications
     35for which jobs are commonly submitted to Condor.
     36For each of these applications, admins must
    4437
    45 Using the BOINC wrapper is the path of least resistance at this point.
    46 
    47 The Condor pool admins will select a set of applications to run under BOINC.
    48 For each app, they must
    49 
    50  * Create a BOINC "application"
    51  * Create input and output templates
    52  * Compile the app for one or more platforms
    53  * Create BOINC "app versions", with associated job.xml files for the wrapper
    54 
    55 == Data model ==
    56 
    57 Goal: minimize data transfer and storage on the BOINC server.
    58 To do this, we'll add the following mechanism to BOINC:
    59 
    60  * DB tables for files, and for batch/file associations (with lease ends).
    61    File names will be based on MD5s.
    62  * Web RPCs for querying and uploading files.
    63  * Daemon for deleting files and DB records of files with
    64    no associations, or past all lease ends.
    65 
    66 For output files, we'll take the approach that each job has (from BOINC's viewpoint)
    67 a single output file, which is a zipped archive of its actual output files.
    68 This will get copied to the submitter host, unzipped,
    69 and its components moved to the appropriate directory.
     38 * Create a BOINC application record
     39 * Create input and output templates.
     40   Note: in general, the set of input/output files, and their names,
     41   must be fixed ahead of time.
     42   If an application produces output files with indeterminate names,
     43   it must combine these into a zip file
     44   (the BOINC wrapper can do this).
     45 * Build the app for one or more platforms (ways of doing this are discussed below).
     46 * Create BOINC "app versions".
    7047
    7148== Job submission mechanism  ==
     
    7552
    7653 * A "BOINC GAHP" program: runs as a daemon process on the submit node.
    77    This does the following:
    78    * Handle RPCs (over pipes) from the Condor job router to
    79      submit and monitor jobs.
    80    * Periodically poll the BOINC server for completed jobs;
    81      when a job is newly completed,
    82      download its output from the BOINC server,
    83      and store it into the appropriate directories on the submit node.
     54   This handles RPCs (over pipes) from the Condor job router to submit and monitor jobs.
    8455 * A new class in Condor's job_router for managing communication
    8556   with the BOINC GAHP.
     
    8960=== GAHP protocol ===
    9061
    91 The API exported by the BOINC GAHP has the following functions:
     62The GAHP protocol is text-based.
     63Each request and reply consists of a single line.
     64
     65Each of the main commands returns S (success) or E (error) depending
     66on whether it was syntactically valid.
     67Each command takes a <req id> argument.
     68
     69The commands are:
     70{{{
     71BOINC_SUBMIT <req id> <batch name> <app name> <#jobs>
     72  <job name> <#args> <arg1> <arg2> ...
     73  <#input files>
     74    <src path> <dst filename>
     75    ...
     76  ALL|<#output files>
     77    <filename> ...
     78  ...
     79Result:
     80  NULL (success) or <err msg>
     81}}}
     82Notes:
     83 * The batch name must be unique over all submissions
     84 * The output file descriptions are optional;
     85   in any case, they must agree with the app's output template.
     86 * As of now, <dst filename> will always be the filename part
     87   of <src path>
     88 * We could add a <dir> argument to prepend to input paths.
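
As a rough illustration of the BOINC_SUBMIT format above, the following Python sketch assembles a request. Since each request is a single line, it assumes that the tokens shown on separate lines in the description are joined with single spaces, and that no token contains whitespace (the page defines no quoting rule). The names `Job` and `build_submit_line` are hypothetical and are not part of the GAHP.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Job:
    """One job in a batch (hypothetical helper type, not part of the GAHP)."""
    name: str
    args: List[str] = field(default_factory=list)
    # Each entry is (src path, dst filename).
    input_files: List[Tuple[str, str]] = field(default_factory=list)
    # None means "ALL"; otherwise the explicit list of output filenames.
    output_files: Optional[List[str]] = None


def build_submit_line(req_id: int, batch_name: str, app_name: str,
                      jobs: List[Job]) -> str:
    """Build a single-line BOINC_SUBMIT request.

    Assumption: tokens are separated by single spaces and never
    contain whitespace themselves.
    """
    tokens = ["BOINC_SUBMIT", str(req_id), batch_name, app_name, str(len(jobs))]
    for job in jobs:
        tokens += [job.name, str(len(job.args)), *job.args]
        tokens.append(str(len(job.input_files)))
        for src_path, dst_filename in job.input_files:
            tokens += [src_path, dst_filename]
        if job.output_files is None:
            tokens.append("ALL")
        else:
            tokens.append(str(len(job.output_files)))
            tokens += job.output_files
    return " ".join(tokens)
```

For example, a one-job batch with two arguments, one input file, and all output files requested yields `BOINC_SUBMIT 1 batch1 myapp 1 job1 2 -n 5 1 /home/u/in.txt in.txt ALL`.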
    9289
    9390{{{
    94 submit_jobs()
    95     inputs:
    96         batch_name (unique within project)
    97         app_name
    98         jobs
    99             job name
    100             cmdline
    101             for each input file
    102                 path on submit node
    103                 name by which app will open file
    104                      (currently, this will always be filename part of path)
    105             bool return_all_output_files (or regular expression)
    106             if this is not set
    107                 for each output file
    108                     open name (what the app will create)
    109     output:
    110         error code
     91BOINC_QUERY_BATCH <req id> <batch name>
     92
     93Result:
     94  NULL|<err msg> <job1> <status1> ...
    11195}}}
    112 
    113 The BOINC GAHP handles this as follows:
    114 
    115  * Make list of all input files
    116  * Eliminate duplicates in file list
    117  * Compute MD5s of files
    118  * Do query_files() RPC to see which files are already on BOINC server
    119  * Do upload_files() RPC to copy needed files to BOINC server
    120  * Do submit_jobs() RPC to BOINC server; create batch, jobs
     96Notes:
     97 * status is one of NOT_STARTED, IN_PROGRESS, DONE, or ERROR
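
A client could interpret a BOINC_QUERY_BATCH result line roughly as follows. This is a minimal sketch, assuming that a successful reply starts with the token `NULL` followed by whitespace-separated job name and status pairs, and that any other reply is an error message. `parse_query_batch_result` and `BatchQueryError` are hypothetical names.

```python
from typing import Dict

VALID_STATUSES = {"NOT_STARTED", "IN_PROGRESS", "DONE", "ERROR"}


class BatchQueryError(Exception):
    """Raised when the reply carries an error message or is malformed."""


def parse_query_batch_result(line: str) -> Dict[str, str]:
    """Parse 'NULL <job1> <status1> ...' into {job name: status}.

    Assumption: anything not starting with NULL is treated as <err msg>.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "NULL":
        raise BatchQueryError(line.strip() or "empty reply")
    rest = tokens[1:]
    if len(rest) % 2 != 0:
        raise BatchQueryError("job name without a status")
    statuses = {}
    # Even positions hold job names, odd positions hold their statuses.
    for job_name, status in zip(rest[0::2], rest[1::2]):
        if status not in VALID_STATUSES:
            raise BatchQueryError("unknown status: " + status)
        statuses[job_name] = status
    return statuses
```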
    12198
    12299{{{
    123 query_batch
    124    in: batch name
    125    out: list of jobs
    126       job name
    127       status (done/error/in prog/not in prog)
     100BOINC_FETCH_OUTPUT <req id> <job name> <dir>
     101    ALL|<#files>
     102    <src name> <dst name>
     103    ...
     104Result:
     105  NULL|error_msg
    128106}}}
     107Retrieves a job's output files.
    129108
    130109{{{
    131 retrieve_job_outputs
    132     in:
    133         job name
    134         destination directory for output files
    135         bool return_all_output_files (or regular expression)
    136         if this is not set
    137             for each output file
    138                 open name (what the app created)
    139                 destination name
    140 
    141     out: status
     110BOINC_ABORT_JOBS <req id> <job name> ...
     111Result:
     112  NULL|<err msg>
    142113}}}
     114Abort the given jobs.
    143115
    144116{{{
    145 abort_jobs
    146     in: list of job names
     117BOINC_RETIRE_BATCH <req id> <batch name>
     118Result:
     119  NULL|<err msg>
    147120}}}
     121Retire the given batch; its files and database records can be deleted.
    148122
    149123{{{
    150 set_lease
    151     in: batch name
    152         new lease end time
     124BOINC_SET_LEASE <req id> <batch name> <new lease time>
     125Result:
     126  NULL|<err msg>
    153127}}}
     128Set the "lease time" for a batch.
     129After this time its files and database records can be deleted.
    154130
    155 === BOINC Web RPCs ===
     131=== Project selection and authentication ===
    156132
    157 {{{
    158 query_files()
    159     in: list of physical file names
    160     out: list of those not present on server
    161 }}}
     133For the time being we'll do it this way:
     134Each job submitter has a separate account on the BOINC project
     135(these accounts can be assigned [MultiUser access rights and quotas]).
     136The account has a private '''authenticator''' (a random string).
    162137
    163 {{{
    164 upload_files()
    165     in: batch name
    166         filename
    167         file contents
    168     out: error code
     138The job submitter will create a configuration file containing
     139 * the URL of the BOINC project
     140 * the account authenticator
    169141
    170 uploads files and creates DB records (see below)
    171 }}}
     142The BOINC GAHP will read this configuration file at startup,
     143and will handle requests using that account on that project.
    172144
    173 {{{
    174 submit_jobs()
    175     in: same as for GAHP, except include both logical and physical name
    176     out: error code
    177 }}}
     145Note: we could generalize this a bit by including the
     146project URL and authenticator as an argument to each GAHP request.
    178147
    179 === Atomicity ===
     148== Data model ==
    180149
    181 (We need to decide about this).
     150The BOINC GAHP uses BOINC's
     151[RemoteInputFiles#Content-basedfilemanagement content-based file management system]
     152to manage input files.
     153In this system, files are stored on the BOINC server
     154with names based on their MD5.
     155This provides automatic file immutability.
     156It minimizes server disk usage and network transfer in cases where
     157a given file is used by many jobs or batches.
    182158
    183 === Authentication ===
     159The BOINC database stores records associating files and batches;
     160a file is deleted only when it is no longer associated with any batches.
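
The deletion rule above can be illustrated with a small in-memory model. This is only a sketch of the bookkeeping, not BOINC's actual database schema; the class `FileAssociations` and its methods are hypothetical, and files are identified by their MD5-based names.

```python
from collections import defaultdict
from typing import Dict, Set


class FileAssociations:
    """Tracks which batches reference which files (in-memory stand-in
    for the batch/file association records)."""

    def __init__(self) -> None:
        self._batches_by_file: Dict[str, Set[str]] = defaultdict(set)

    def associate(self, batch_name: str, file_md5: str) -> None:
        """Record that a batch uses a file."""
        self._batches_by_file[file_md5].add(batch_name)

    def retire_batch(self, batch_name: str) -> Set[str]:
        """Drop a batch's associations and return the files that are
        no longer associated with any batch (and so can be deleted)."""
        deletable = set()
        for file_md5 in list(self._batches_by_file):
            batches = self._batches_by_file[file_md5]
            batches.discard(batch_name)
            if not batches:
                deletable.add(file_md5)
                del self._batches_by_file[file_md5]
        return deletable
```

A file shared by two batches survives the retirement of the first batch and becomes deletable only when the second one is retired.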
    184161
    185 All the above APIs will take a "credentials" argument,
    186 which may be either a BOINC authenticator or x.509 certificate;
    187 we'll need to decide this.
    188 Two general approaches:
     162== Implementation notes ==
    189163
    190  * Each job submitter has a separate account on the BOINC project
    191    (created ahead of time in a way TBD).
    192    This is preferred because it allows BOINC to enforce quotas.
    193  * All jobs belong to a single BOINC account.
     164The BOINC GAHP handles BOINC_SUBMIT as follows:
    194165
    195 == Changes to BOINC ==
     166 * Do an RPC to create a "batch" record
     167 * Make list of all input files; eliminate duplicates
     168 * Compute MD5s of files
     169 * Do an RPC to see which files are already on the BOINC server
     170   and create batch/file associations for these files
     171   (this avoids a race condition with the file cleanup daemon).
     172 * Do an RPC to copy needed files to BOINC server,
     173   and create batch/file associations for these files.
     174 * Do an RPC to create jobs
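
The local part of the steps above (de-duplicating input files, computing MD5s, and deciding what to upload) might look like the following sketch. The RPCs themselves are not shown: the set `files_on_server` stands in for the reply to the "which files are already on the server" RPC, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Set, Tuple


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_uploads(input_paths: Iterable[str],
                 files_on_server: Set[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (md5_by_path, to_upload).

    md5_by_path maps each distinct input path to its MD5.
    to_upload maps each MD5 missing from the server to one local path
    holding that content, so identical files are uploaded only once.
    """
    # dict.fromkeys removes duplicate paths while keeping their order.
    unique_paths = list(dict.fromkeys(input_paths))
    md5_by_path = {path: md5_of_file(path) for path in unique_paths}
    to_upload: Dict[str, str] = {}
    for path, digest in md5_by_path.items():
        if digest not in files_on_server and digest not in to_upload:
            to_upload[digest] = path
    return md5_by_path, to_upload
```

Keying uploads by MD5 rather than by path means two different paths with identical content result in a single upload.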
    196175
    197  * The job creation primitives (create_work()) will let you directly specify
    198   the logical names of input files,
    199   rather than specifying them in a template.
    200  * Add lease_end field to batch
     176== Ways to deploy applications on BOINC ==
    201177
    202 == BOINC GAHP implementation notes ==
     178BOINC offers three "environments" in which applications can be deployed:
     179 * '''Native''':
     180   This requires making source-code modifications and building the app
     181   for different platforms, linking with the BOINC API library.
     182 * '''BOINC wrapper''':
     183   Requires apps to be built for different platforms, but no source code mods.
     184 * '''Virtual machine-based''':
     185   This would eliminate multi-platform issues
     186   but would require volunteer hosts to have VirtualBox installed.
    203187
    204 The BOINC GAHP could be implemented in Python or C++.
    205 My inclination is to use C++.
    206