wiki:ClientSchedOctTen

Version 4 (modified by davea, 14 years ago)


Client scheduling changes

Design document for changes to the client work fetch and job scheduling policies, started Oct 2010.

This supersedes the following design docs:

Problems with current system

The current policies, described here, maintain a long-term debt (LTD) and a short-term debt (STD) for each (project, processor type) pair.

Job scheduling for a given processor type is based on STD: projects with greater STD for that type are given priority.

Work fetch is based on a weighted sum of LTDs. Work is typically fetched from the project for which this sum is greatest, and typically work is requested for all processor types.

These policies fail to meet their goals in many cases. Here are two scenarios that illustrate the underlying problems:

Example 1

A host has a fast GPU and a slow CPU. Project A has apps for both GPU and CPU. Project B has apps only for CPU. Equal resource shares.

In the current system each project will get 50% of the CPU. The target behavior, which matches resource shares better, is that project B gets 100% of the CPU and project A gets 100% of the GPU.

Example 2

Same host. Additional project C has only CPU apps.

In this case A's CPU LTD will stay around zero, while the CPU LTDs of B and C go unboundedly negative and are clamped at the cutoff. All information about the relative debt of B and C is lost.

The bottom line

The current approach - extending the STD/LTD model to multiple processor types - seemed good at the time but turned out to be the wrong approach.

Proposal: credit-driven scheduling

The idea is to make resource share apply to credit. If two projects have the same resource share, they should have the same RAC (recent average credit). This suggests the following principle, which can apply to both work fetch and job scheduling:

  • Normalize RAC and resource share so that each one sums to 1 across projects.
  • For a project P, let G(P) = share(P) - RAC(P).
  • Give priority to projects for which G(P) is highest, i.e. that aren't getting as much credit as they should.
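The principle above can be sketched in a few lines of Python. The project names, resource shares, and RAC values below are illustrative, not taken from any real host:

```python
# Sketch of the credit-driven priority rule: normalize resource share
# and RAC across projects, then prioritize by G(P) = share(P) - RAC(P).

def normalize(values):
    """Scale a dict of nonnegative values so they sum to 1."""
    total = sum(values.values())
    return {k: v / total for k, v in values.items()}

def credit_priorities(share, rac):
    """G(P) = share(P) - RAC(P), both normalized across projects."""
    share_n = normalize(share)
    rac_n = normalize(rac)
    return {p: share_n[p] - rac_n[p] for p in share}

share = {"A": 100.0, "B": 100.0}   # equal resource shares
rac   = {"A": 300.0, "B": 100.0}   # A has been granted more credit

g = credit_priorities(share, rac)
# B is under-credited relative to its share, so it gets priority.
best = max(g, key=g.get)
```

With equal shares, normalized share is 0.5 for each project, so the under-credited project ends up with positive G and wins.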

This does two things:

  • It gives resource shares their correct semantics: they now control something that volunteers can actually see, namely credit.
  • It penalizes projects that grant inflated credit: the more credit a project grants, the less work a given host will do for it, assuming the host is attached to multiple projects. (The converse is potentially true - a project would get more work done by granting less credit. This is minimized by a mechanism described below.)

Note: I've glossed over the issue of the time scale over which RAC is averaged. The RAC reported by servers has a half-life of a week. For purposes of scheduling a different (probably longer) period would be better. The client could potentially compute its own RAC based on changes in total credit. However, it's probably OK to just use the server-reported RAC.
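If the client did compute its own RAC from changes in total credit, it could use the same exponentially-smoothed average the servers use. A minimal sketch, assuming a one-week half-life as stated above (the function name and interface are hypothetical, not part of the BOINC client API):

```python
import math

SECONDS_PER_DAY = 86400.0
HALF_LIFE = 7 * SECONDS_PER_DAY  # one-week half-life, as for server RAC

def update_rac(rac, credit_delta, dt):
    """Exponentially-smoothed average credit per day.

    rac: current average (credit/day); credit_delta: credit granted
    over the last dt seconds. Old contributions decay by half every
    HALF_LIFE seconds; the recent rate is blended in with the
    complementary weight."""
    decay = math.exp(-dt * math.log(2) / HALF_LIFE)
    rate = credit_delta / (dt / SECONDS_PER_DAY)  # credit/day over the interval
    return rac * decay + rate * (1 - decay)
```

With no credit granted, RAC halves every week; with a steady credit rate, it converges to that rate.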

Recent average FLOPS

There are some problems with credit-driven scheduling:

  • There may be a long and variable delay between completing a job and getting credit for it.
  • Jobs may fail to get credit, e.g. because they don't validate.

To deal with these issues, I propose using not RAC by itself, but the combination of RAC and recent average FLOPS (RAF) per project. RAF is known immediately and locally, so it is unaffected by credit-granting delays and failures; combining it with RAC also addresses the issue of projects that grant too little credit.

Work fetch

In addition to G(P), the work fetch policy also needs to take into account the amount of work currently queued for a project, so that it doesn't keep getting work from the same project. To accomplish this, we define Q(P) as the number of FLOPS currently queued for P, normalized so that the sum over projects is 1.

We then define the work-fetch priority of a project as

WFP(P) = share(P) - (RAC(P) + A*RAF(P) + B*Q(P))/(1+A+B)

where A and B are parameters, probably around 1.
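Putting the definition together, a minimal sketch of the work-fetch priority computation (project names and numbers are illustrative; A and B default to 1 as suggested above):

```python
# WFP(P) = share(P) - (RAC(P) + A*RAF(P) + B*Q(P)) / (1 + A + B),
# with every input normalized to sum to 1 across projects.

def normalize(values):
    total = sum(values.values())
    return {k: v / total for k, v in values.items()}

def work_fetch_priority(share, rac, raf, queued, A=1.0, B=1.0):
    share_n = normalize(share)
    rac_n = normalize(rac)
    raf_n = normalize(raf)
    q_n = normalize(queued)  # Q(P): queued FLOPS, normalized
    return {p: share_n[p] - (rac_n[p] + A * raf_n[p] + B * q_n[p]) / (1 + A + B)
            for p in share}

# Two projects, equal shares and credit history, but A has more
# work queued, so work is fetched from B.
wfp = work_fetch_priority({"A": 1, "B": 1}, {"A": 1, "B": 1},
                          {"A": 1, "B": 1}, {"A": 3, "B": 1})
```

Dividing by (1 + A + B) keeps the subtracted term normalized, so WFP stays comparable to share regardless of the parameter values.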

Job scheduling

As the job scheduling policy picks jobs to run (e.g. on a multiprocessor) it needs to take into account the jobs already scheduled, so that it doesn't always schedule multiple jobs from the same project. To accomplish this, as each job is scheduled we update RAF(P) as if the job had run for one scheduling period. The job-scheduling priority is then

JSP(P) = share(P) - (RAC(P) + C*RAF(P))

where C is a parameter, probably around 1.
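The scheduling loop above can be sketched as follows. The per-pick RAF charge (`period_flops`), the job representation, and all names are illustrative assumptions, not the client's actual data structures:

```python
def schedule_jobs(share, rac, raf, runnable, slots, period_flops, C=1.0):
    """Pick jobs one at a time by JSP(P) = share(P) - (RAC(P) + C*RAF(P)).

    After each pick, charge the chosen project's RAF as if the job had
    run for one scheduling period, so one project doesn't fill every slot."""
    def normalize(values):
        total = sum(values.values())
        return {k: v / total for k, v in values.items()}

    raf = dict(raf)  # copy; we mutate it as we schedule
    scheduled = []
    for _ in range(slots):
        candidates = {p for p, jobs in runnable.items() if jobs}
        if not candidates:
            break
        share_n, rac_n, raf_n = normalize(share), normalize(rac), normalize(raf)
        jsp = {p: share_n[p] - (rac_n[p] + C * raf_n[p]) for p in candidates}
        best = max(jsp, key=jsp.get)
        scheduled.append(runnable[best].pop(0))
        raf[best] += period_flops  # one scheduling period's worth of FLOPS
    return scheduled
```

With equal shares and RAC, the project with lower RAF is picked first; its RAF is then charged, so the next slot goes to the other project.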