Changes between Initial Version and Version 1 of JobSizeMatching


Timestamp:
Feb 13, 2013, 1:13:27 PM (12 years ago)
Author:
davea

  • JobSizeMatching

    v1 v1  
     1= Job size matching =
     2
     3The difference in throughput between a slow resource
     4(e.g. an Android device that runs infrequently)
     5and a fast resource (e.g. a GPU that's always on)
     6can be a factor of 1,000 or more.
     7Having a single job size can therefore present problems:
     8
     9 * If the size is too small, hosts with GPUs get huge numbers of jobs
     10   (which causes various problems) and there is a high DB load on the server.
     11 * If the size is too large, slow hosts can't get jobs,
     12   or they get jobs that take weeks to finish.
     13
     14This document describes a set of mechanisms that address these issues.
     15
     16== Regulating the flow of jobs into shared memory ==
     17
     18Let's suppose that an app's work generator can produce several sizes of job -
     19say, small, medium, and large.
     20'''We won't address the issue of how to pick these sizes.'''
     21
     22How can we prevent shared memory from becoming "clogged" with jobs of one size?
     23
     24One approach would be to allocate slots for each size.
     25This would be complex because we already have two allocation schemes
     26(for HR and all_apps).
     27
     28We could modify the work generator so that it polls the number of unsent
     29jobs of each size, and creates a few more jobs of a given size when this
     30number falls below a threshold.
     31
     32Problem: this might not be able to handle a large spike in demand.
     33We'd like to be able to have a large buffer of unsent jobs in the DB.
     34
     35Solution:
     36 * when jobs are created (in the transitioner) set their state to
     37  INACTIVE rather than UNSENT.
     38  (a per-app flag would indicate this should be done).
     39 * have a new daemon (call it the "regulator") that polls for the number of unsent
     40  jobs of each size, and changes a few jobs from INACTIVE to UNSENT.
     41 * Add a "size_class" field to workunit and result to indicate S/M/L.
     42
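The regulator idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: jobs are held in an in-memory list rather than the project DB, and the names `Job`, `regulate`, and `target_unsent` are hypothetical and not part of any existing code. Only the INACTIVE/UNSENT states and the `size_class` field come from the proposal.

```python
from collections import Counter
from dataclasses import dataclass

# State names from the proposal; the string values are placeholders.
INACTIVE, UNSENT = "INACTIVE", "UNSENT"


@dataclass
class Job:
    id: int
    size_class: str          # "S", "M", or "L"
    state: str = INACTIVE    # new jobs start INACTIVE when the per-app flag is set


def regulate(jobs, target_unsent):
    """One polling pass of a hypothetical regulator.

    For each size class, move INACTIVE jobs to UNSENT until that class has
    target_unsent UNSENT jobs or no INACTIVE jobs of that class remain.
    Returns the ids of the jobs that were activated.
    """
    # Count the jobs that are already UNSENT, per size class.
    unsent = Counter(j.size_class for j in jobs if j.state == UNSENT)
    activated = []
    for job in jobs:
        if job.state != INACTIVE:
            continue
        if unsent[job.size_class] < target_unsent:
            job.state = UNSENT
            unsent[job.size_class] += 1
            activated.append(job.id)
    return activated


jobs = [Job(1, "S"), Job(2, "S"), Job(3, "S"), Job(4, "L", UNSENT), Job(5, "L")]
first_pass = regulate(jobs, target_unsent=2)
second_pass = regulate(jobs, target_unsent=2)
```

A real daemon would run this pass periodically against the DB. The large INACTIVE backlog acts as the buffer for demand spikes, while the number of UNSENT jobs per size class stays small.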
     43== Scheduler changes ==
     44
     45We need to revamp the scheduler.
     46Here's how things currently work:
     47
     48 * The scheduler makes up to 5 passes through the shared-memory job array:
     49  * "need reliable" jobs
     50  * beta jobs
     51  * previously infeasible jobs
     52  * locality scheduling lite (job uses file already on client)
     53  * unrestricted
     54 * We maintain a data structure that maps app to the "best" app version for that app.
     55  * In the "need reliable" phase this includes only reliable app versions;
     56    the map is cleared at the end of the phase.
     57  * If we satisfy the request for a particular resource and the best app version
     58    uses that resource, we clear the entry.