Changes between Version 7 and Version 8 of JobSizeMatching

Timestamp: Apr 19, 2013, 1:15:48 PM
Author: davea
Comment: --

JobSizeMatching

-                      v7
+                      v8
 Having a single job size can therefore present problems:
  * If the size is too small, hosts with GPUs get huge numbers of jobs.
+ * If the size is small, hosts with GPUs get huge numbers of jobs.
    This causes performance problems on the client
    and a high DB load on the server.
  * If the size is too large, slow hosts can't get jobs,
+ * If the size is large, slow hosts can't get jobs,
    or they get jobs that take weeks to finish.
 …
 We'll assume that jobs for a given application can be generated
 in several discrete '''size classes'''
 (the number of size classes is a parameter of the application).
+in several discrete '''size classes''';
+the number of size classes is a parameter of the application.
 BOINC will try to send jobs of size class i
 to devices whose effective speed is in the ith quantile,
+where 'effective speed' is the product of the
+device speed and the host's on-fraction.
+where 'effective speed' is the product of the device speed and the host's on-fraction.
 This involves 3 new integer DB fields:
 …
 The size class of a job is specified in the call to create_work().
+Apps with n_size_classes > 1 are called '''multi-size apps'''.
+A project can have both multi-size and non-multi-size apps.
 Notes:
 …
 The order statistics of device effective speed will be computed
 by a new program '''size_census'''.
 For each app with n_size_classes>1 this does:
+For each multi-size app this does:
  * enumerate host_app_versions for that app
 …
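A minimal sketch of the effective-speed definition and of the kind of quantile computation a census pass might perform. It assumes the effective speeds for one app are already loaded into a list; the function names and the equal-count split are illustrative assumptions, not the actual size_census code.

```python
from bisect import bisect_right


def effective_speed(device_speed: float, on_fraction: float) -> float:
    """Effective speed = device speed times the host's on-fraction."""
    return device_speed * on_fraction


def quantile_boundaries(speeds: list[float], n_size_classes: int) -> list[float]:
    """Return n_size_classes - 1 cut points that split the speeds into
    roughly equal-count quantiles (assumed splitting rule)."""
    ordered = sorted(speeds)
    n = len(ordered)
    if n == 0:
        return []
    return [ordered[(k * n) // n_size_classes] for k in range(1, n_size_classes)]


def quantile_of(speed: float, boundaries: list[float]) -> int:
    """0-based index of the quantile that an effective speed falls into."""
    return bisect_right(boundaries, speed)
```

With boundaries computed once per census run, a device's quantile is a single binary search, so the scheduler would not need to rescan host data per request.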
 == Scheduler changes ==
 When the scheduler sends jobs of a given app to a given processor,
+When the scheduler sends jobs of a given multi-size app to a given processor,
 it should preferentially send jobs whose size class matches
 the quantile of the processor.
 …
  * For each job, compute a "score" that includes various factors
    (reliable, beta, previously infeasible, locality scheduling lite).
  * Include a factor for job size;
+ * For multi-size apps, include a factor for job size;
    decrement the score of jobs that are too small,
    and decrement more for jobs that are too large.
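The size factor described above can be sketched as follows. The penalty weights, the scaling by distance between size class and quantile, and the function names are all assumptions for illustration; the text only states that too-large jobs are penalized more than too-small ones.

```python
def size_penalty(job_size_class: int, host_quantile: int,
                 too_small_weight: float = 1.0,
                 too_large_weight: float = 2.0) -> float:
    """Penalty for a size mismatch; larger for jobs that are too large.
    Weights are hypothetical, not values from the scheduler."""
    diff = job_size_class - host_quantile
    if diff < 0:  # job is smaller than the device's quantile
        return too_small_weight * -diff
    if diff > 0:  # job is larger than the device's quantile
        return too_large_weight * diff
    return 0.0


def job_score(base_score: float, job_size_class: int,
              host_quantile: int, multi_size: bool) -> float:
    """Apply the size factor only for multi-size apps."""
    if not multi_size:
        return base_score
    return base_score - size_penalty(job_size_class, host_quantile)
```

Because the size term is subtracted from a score that already includes the other factors, a wrong-size job can still be sent when nothing better is available.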
 …
   and the resource load maintaining a job array of that size.
  * All other factors being equal, the scheduler will send jobs of other apps
   rather than send a wrong-size job.
   This could potentially lead to starvation issues; we'll have to see.
+  rather than send a job of non-optimal size class.
+  This could potentially lead to starvation issues; we'll have to see if this is a problem.
 == Regulating the flow of jobs into shared memory ==
 …
 Instead, we'll do the following:
+ * when jobs are created (in the transitioner) set their state to
+  INACTIVE rather than UNSENT.
+  This is done if app.n_size_classes > 1
+ * when jobs are created for a multi-size app (in the transitioner),
+  set their state to INACTIVE rather than UNSENT.
  * have a new daemon ('''size_regulator''') that polls for the number of unsent
   jobs of each type, and changes a few jobs from INACTIVE to UNSENT
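One polling pass of such a regulator might look like the sketch below. It works on an in-memory list of job records rather than the DB, treats a "type" as an (app, size class) pair, and uses hypothetical `low_water` and `batch` parameters; none of these details come from the actual size_regulator.

```python
from collections import Counter

INACTIVE, UNSENT = "INACTIVE", "UNSENT"


def regulate_once(jobs: list[dict], low_water: int, batch: int) -> int:
    """For each (app, size_class) with fewer than low_water UNSENT jobs,
    promote up to batch INACTIVE jobs to UNSENT. Returns the number promoted."""
    unsent = Counter((j["app"], j["size_class"])
                     for j in jobs if j["state"] == UNSENT)
    promoted = Counter()
    for job in jobs:
        if job["state"] != INACTIVE:
            continue
        key = (job["app"], job["size_class"])
        if unsent[key] >= low_water or promoted[key] >= batch:
            continue
        job["state"] = UNSENT
        unsent[key] += 1
        promoted[key] += 1
    return sum(promoted.values())
```

Calling this in a loop with a sleep between passes would keep a small, bounded supply of UNSENT jobs per type.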