Changes between Version 5 and Version 6 of JobSizeMatching


Ignore:
Timestamp:
Apr 19, 2013, 12:56:39 PM (11 years ago)
Author:
davea
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • JobSizeMatching

    v5 v6  
    11= Job size matching =
    22
    3 The difference in throughput between a slow resource
     3The difference in throughput between a slow processor
    44(e.g. an Android device that runs infrequently)
    5 and a fast resource (e.g. a GPU that's always on)
     5and a fast processor (e.g. a GPU that's always on)
    66can be a factor of 1,000 or more.
    77Having a single job size can therefore present problems:
    88
    9  * If the size is too small, hosts with GPUs get huge numbers of jobs
    10    (which causes various problems) and there is a high DB load on the server.
     9 * If the size is too small, hosts with GPUs get huge numbers of jobs.
     10   This causes performance problems on the client
     11   and a high DB load on the server.
    1112 * If the size is too large, slow hosts can't get jobs,
    1213   or they get jobs that take weeks to finish.
     
    1415This document describes a set of mechanisms that address these issues.
    1516
    16 == Regulating the flow of jobs into shared memory ==
     17== Job size classes ==
    1718
    18 Let's suppose that an app's work generator can produce several sizes of job -
    19 say, small, medium, and large.
    20 '''We won't address the issue of how to pick these sizes.'''
     19We'll assume that jobs for a given application can be generated
     20in several discrete '''size classes'''
     21(the number of size classes is a parameter of the application).
    2122
    22 How can we prevent shared memory from becoming "clogged" with jobs one size?
     23BOINC will try to send jobs of size class i
     24to devices whose effective speed is in the ith quantile,
     25where 'effective speed' is the product of the
     26device speed and the host's on-fraction.
    2327
    24 One approach would be to allocate slots for each size.
    25 This would be complex because we already have two allocation schemes
    26 (for HR and all_apps).
     28This involves 3 new integer DB fields:
     29 * app.n_size_classes
     30 * workunit.size_class
     31 * result.size_class
    2732
    28 We could modify the work generator to so that it polls the number of unsent
    29 jobs of each size, and creates a few more jobs of a given size when this
    30 number falls below a threshold.
     33The size class of a job is specified in the call to create_work().
    3134
    32 Problem: this might not be able to handle a large spike in demand.
    33 We'd like to be able to have a large buffer of unsent jobs in the DB.
     35Notes:
     36 * It's up to the project to decide how large jobs of different
     37   size classes should be.
     38   E.g. if there are 2 classes, the ratio of large to small could be 2:1 or 100:1.
     39   We should design tools to guide this decision.
     40 * It's up to the project to decide how many jobs
     41   of each size class to create.
     42 * app.n_size_classes can be changed but you need to be careful
     43   because of existing jobs.
    3444
    35 Solution:
    36  * when jobs are created (in the transitioner) set their state to
    37   INACTIVE rather than UNSENT.
    38   (a per-app flag would indicate this should be done).  This flag would be numeric and indicate the number of job classes (num_size_classes).  A value of > 1 would enable this mechanism.
    39  * have a new daemon (called it the "regulator") that polls for number of unsent
    40   jobs of each type, and changes a few jobs from INACTIVE to UNSENT.  This daemon should be designed to manage all apps with num_size_classes > 1 or a specific app passed in on the command line.  The size of buffer for each job class should be based on a parameter from the command line for this daemon.  Additionally, the frequency of polling should also be based upon this field.  The criteria for which jobs to advance to the next state should be the same as those available in the feeder.
    41  * Add a "size_class" field to workunit and result to indicate S/M/L.  This field should be integer (or small int).  A project should not set this field larger than the value of num_size_classes.
     45== Computing device quantiles ==
     46
     47The order statistics of device effective speed will be computed
     48by a new program '''size_census'''.
     49For each app with n_size_classes>1 this does:
     50
     51 * enumerate host_app_versions for that app
     52 * compute "effective speed" as hav.et.avg * host.available_frac
     53 * sort the results, and find the quantile points
     54 * write these to a flat file
     55
     56The feeder reads these flat files (on startup, and when they change)
     57and stores the quantile points in shared memory.
    4258
    4359== Scheduler changes ==
    4460
    45 We need to revamp the scheduler.
    46 Here's how things currently work:
     61When the scheduler sends jobs of a given app to a given processor,
     62it should preferentially send jobs whose size class matches
     63the quantile of the processor.
     64Failing that, it should preferentially send jobs of smaller
     65size classes rather than larger.
    4766
    48  * The scheduler makes up to 5 passes through the array:
    49   * "need reliable" jobs
    50   * beta jobs
    51   * previously infeasible jobs
    52   * locality scheduling lite (job uses file already on client)
    53   * unrestricted
    54  * We maintain a data structure that maps app to the "best" app version for that app.
    55   * In the "need reliable" phase this includes only reliable app versions;
    56     the map is cleared at the end of the phase.
    57   * If we satisfy the request for a particular resource and the best app version
    58     uses that resource, we clear the entry.
     67To implement this policy,
     68we'll use a new version of score-based scheduling.
     69This will work as follows.
    5970
    60 New approach:
    61 Do it one resource at a time (GPUs first).
     71The scheduler will make separate pass for each resource types (GPUs first).
    6272For each resource:
    6373
    6474 * For each app, find the best app version and decide whether it's reliable
    65  * For each of these app versions, find the expected speed
    66    (taking on-fraction etc. into account).
    67    Based on this, and the statistics of the host population,
    68    decide what size job to send for this resource.
    69  * Scan the job array, starting at a random point.
     75 * For each of these app versions, find the effective speed.
     76   Find which quantile the device falls into.
     77 * Scan N entries of the job array, starting at a random point.
    7078   Make a list of jobs for which an app version is available,
    7179   and that are of the right size.
    72  * Sort the list by a score that combines the above criteria
     80 * For each job, compute a "score" that includes various factors.
    7381   (reliable, beta, previously infeasibly, locality scheduling lite).
     82 * Include a factor for job size;
     83   decrement the score of jobs that are too small,
     84   and decrement more for jobs that are too large.
     85 * Sort the list by score.
    7486 * Scan the list; for each job
    7587  * Make sure it's still in the array
     
    7991  * Leave loop if resource request is satisfied or we're out of disk space
    8092
    81 [knreed] - I think that we will need to add a parameter that controls how many jobs in the job array that are scanned by each host.  I think that for WCG, we would probably do something like have the job array be 3-4000 in length and have a given device scan 500 entries.  We would need to experiment with this to minimize contention between active requests and the resource load maintaining a job array of that size.
    82 == Open questions ==
     93Notes:
     94 * The scan size N is configurable.
     95  E.g. WCG plans to have a job array of 3-4000, and N = ~500.
     96  We'd need to experiment with this to minimize contention between active requests
     97  and the resource load maintaining a job array of that size.
     98 * All other factors being equal, the scheduler will send jobs of other apps
     99  rather than send a wrong-size job.
     100  This could potentially lead to starvation issues; we'll have to see.
    83101
    84 1) How to choose job sizes?
    85   [knreed] - Projects should make the call.  However, it would be interesting to create tool that would scan the db and report back a distribution of device-resource available compute power over a 24 hour (elapsed) time period.  This would help the project identify target 'flops' sizes.
     102== Regulating the flow of jobs into shared memory ==
    86103
     104How can we prevent shared memory from becoming "clogged" with jobs of one size?
    87105
    88 2) Given an estimated speed, how to decide which size to send?
    89   [knreed] - One thought.  Using the distribution of device-resource available compute power in a 24 hour period for a given app, break the population into app.num_size_classes groups.  The feeder will maintain a distribution of jobs that it has assigned to shared memory such that is also divides the jobs into app.num_size_classes groups.  A device-resource should be assigned a job that matches the same category as the device-resource.
     106One approach would be to allocate slots for each size.
     107This would be too complex because we already have two allocation schemes
     108(for HR and all_apps).
    90109
     110We could modify the work generator to so that it polls the number of unsent
     111jobs of each size, and creates a few more jobs of a given size when this
     112number falls below a threshold.
     113Problem: this might not be able to handle a large spike in demand.
     114We'd like to be able to have a large buffer of unsent jobs in the DB.
    91115
    92 3) How do we prevent infeasible jobs from clogging the shared memory?  I think the only new consideration is that if an app has homogenous app version set and the resource currently looking at the job can process the job, just not with the best app version, should the job be assigned?
     116Instead, we'll do the following:
     117 * when jobs are created (in the transitioner) set their state to
     118  INACTIVE rather than UNSENT.
     119  This is done if app.n_size_classes > 1
     120 * have a new daemon ('''size_regulator''') that polls for the number of unsent
     121  jobs of each type, and changes a few jobs from INACTIVE to UNSENT
     122  is N falls below a certain level.
    93123
     124The parameters of '''size_regulator''' are:
     125 * The name of the app
     126 * Low- and high-water marks for the number of UNSENT jobs of each size class.
     127 * The polling frequency.
     128 * Ordering options as for the feeder
     129  (--priority_order, --priority_order_create_time, --random_order).
     130  This determines the order in which jobs are marked as UNSENT.