Changes between Version 4 and Version 5 of JobSizeMatching

Timestamp: Mar 18, 2013, 8:05:56 AM (12 years ago)
Author: Kevin Reed
Comment: thoughts from kevin

JobSizeMatching

-                      v4
+                      v5
  * when jobs are created (in the transitioner) set their state to
   INACTIVE rather than UNSENT.
-  (a per-app flag would indicate this should be done).
+  (a per-app flag would indicate this should be done).  This flag would be numeric and give the number of job size classes (num_size_classes).  A value greater than 1 would enable this mechanism.
  * have a new daemon (called it the "regulator") that polls for number of unsent
-  jobs of each type, and changes a few jobs from INACTIVE to UNSENT.
- * Add a "size_class" field to workunit and result to indicate S/M/L.
+  jobs of each type, and changes a few jobs from INACTIVE to UNSENT.  This daemon should be designed to manage either all apps with num_size_classes > 1 or a specific app passed in on the command line.  The buffer size for each job class should be set by a command-line parameter of this daemon.  The polling frequency should likewise be set from the command line.  The criteria for choosing which jobs to advance to the next state should be the same as those available in the feeder.
+ * Add a "size_class" field to workunit and result to indicate S/M/L.  This field should be an integer (or small int).  A project should not set this field to a value greater than num_size_classes.
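
The regulator pass described above can be sketched as follows. This is a minimal illustration, not BOINC code: the `Job` class, the `regulate` function, and the `buffer_size` argument are hypothetical names, and an in-memory list stands in for the project database that a real daemon would query. It assumes the input list is already in the order in which jobs should be released.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical state labels standing in for the real job states.
INACTIVE, UNSENT = "INACTIVE", "UNSENT"


@dataclass
class Job:
    job_id: int
    app: str
    size_class: int
    state: str = INACTIVE


def regulate(jobs, buffer_size):
    """One polling pass: top up each (app, size_class) to buffer_size UNSENT jobs.

    Returns the ids of the jobs moved from INACTIVE to UNSENT.
    """
    # Count the jobs that are already UNSENT, per app and size class.
    unsent = Counter((j.app, j.size_class) for j in jobs if j.state == UNSENT)
    promoted = []
    for job in jobs:  # assumed to be in the desired release order
        if job.state != INACTIVE:
            continue
        key = (job.app, job.size_class)
        if unsent[key] < buffer_size:
            job.state = UNSENT
            unsent[key] += 1
            promoted.append(job.job_id)
    return promoted
```

A daemon built along these lines would call `regulate` once per polling interval, with `buffer_size` and the interval both taken from its command line.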
 == Scheduler changes ==
 …
   * Leave loop if resource request is satisfied or we're out of disk space
+[knreed] - I think that we will need to add a parameter that controls how many entries of the job array are scanned for each host request.  For WCG, we would probably make the job array 3-4000 entries long and have a given device scan 500 entries.  We would need to experiment with this to minimize both contention between active requests and the resource load of maintaining a job array of that size.
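
A bounded scan of the kind [knreed] suggests might look like the sketch below. All names are hypothetical (`scan_job_array`, `max_scan`, and the two callbacks), the job array is modelled as a plain list whose empty slots are `None`, and the caller-supplied starting slot with wrap-around is an assumption of this sketch rather than a description of the existing scheduler.

```python
def scan_job_array(job_array, start, max_scan, is_feasible, request_satisfied):
    """Scan at most max_scan slots, beginning at start and wrapping around.

    is_feasible(job) decides whether this host can run the job;
    request_satisfied(picked) decides whether the request is already filled.
    """
    picked = []
    n = len(job_array)
    for offset in range(min(max_scan, n)):
        if request_satisfied(picked):
            break  # leave the loop once the resource request is satisfied
        slot = (start + offset) % n
        job = job_array[slot]
        if job is not None and is_feasible(job):
            picked.append(job)
            job_array[slot] = None  # mark the slot as taken
    return picked
```

With an array of 4000 entries and `max_scan=500`, each request touches at most one eighth of the array, which is the trade-off that would need tuning by experiment.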
 == Open questions ==
-How to choose job sizes?
+) How to choose job sizes?
+  [knreed] - Projects should make the call.  However, it would be interesting to create a tool that scans the db and reports the distribution of compute power available per device-resource over a 24 hour (elapsed) time period.  This would help the project identify target 'flops' sizes.
-Given an estimated speed, how to decide which size to send?
+) Given an estimated speed, how to decide which size to send?
+  [knreed] - One thought.  Using the distribution of compute power available per device-resource in a 24 hour period for a given app, break the population into app.num_size_classes groups.  The feeder will maintain a distribution of the jobs that it has assigned to shared memory such that it also divides the jobs into app.num_size_classes groups.  A device-resource should be assigned a job from the category that matches its own.
+) How do we prevent infeasible jobs from clogging the shared memory?  I think the only new consideration is this: if an app has homogeneous app version set and the resource currently looking at the job can process the job, just not with the best app version, should the job be assigned?
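
The grouping idea in the [knreed] reply about choosing a size from an estimated speed can be sketched with equal-count quantiles. This is only one possible reading: the function names are hypothetical, `speeds` stands for whatever per-device compute-power figures a project collects over the 24 hour window, and class indices here run from 0, which may differ from how a project numbers size_class.

```python
from bisect import bisect_right


def class_boundaries(speeds, num_size_classes):
    """Return num_size_classes - 1 cut points that split speeds into
    groups of roughly equal count. speeds must be non-empty."""
    ordered = sorted(speeds)
    n = len(ordered)
    return [ordered[(k * n) // num_size_classes] for k in range(1, num_size_classes)]


def size_class_for(speed, boundaries):
    """Map a speed to a class index in 0 .. len(boundaries)."""
    return bisect_right(boundaries, speed)
```

Applying the same boundaries to device-resources and to job size estimates would put both into matching categories, so a device in class k is offered jobs of class k.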