Changes between Version 15 and Version 16 of HomogeneousRedundancy


Timestamp: Mar 15, 2012, 12:27:42 PM (13 years ago)
Author: davea
Comment: --

  • HomogeneousRedundancy

    v15 v16  
    33= Dealing with numerical discrepancies =
    44
    5 Most numerical applications produce different outcomes for a given workunit depending on the machine architecture, operating system, compiler, and compiler flags. For some applications these discrepancies produce only small differences in the final output, and results can be validated using a 'fuzzy comparison' function that allows for deviations of a few percent.
     5Most numerical applications produce different outcomes for a given workunit
     6depending on the machine architecture, operating system, compiler, and compiler flags.
     7For some applications these discrepancies produce only small differences in the final output,
     8 and results can be validated using a 'fuzzy comparison' function that allows for deviations of a few percent.
    69
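As an illustration of the 'fuzzy comparison' idea described above, here is a minimal sketch in Python. It is not BOINC's validator interface: the function name `fuzzy_equal`, the 2% relative tolerance, and the small absolute tolerance are all assumptions chosen for the example.

```python
import math


def fuzzy_equal(a, b, rel_tol=0.02, abs_tol=1e-12):
    """Return True if two result vectors agree element by element.

    rel_tol is the allowed relative deviation (0.02 = 2%, an assumed value);
    abs_tol keeps values very close to zero from failing the relative test.
    """
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )
```

A check like this only helps when numerical discrepancies stay small; it cannot separate correct from erroneous results once they diverge widely.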
    7 Other applications are 'divergent' in the sense that small numerical differences lead to unpredictably large differences in the final output. For such applications it may be difficult to distinguish between results that are correct but differ because of numerical discrepancies, and results that are erroneous. The 'fuzzy comparison' approach does not work for such applications.
     10Other applications are 'divergent' in the sense that small numerical differences lead to unpredictably large differences in the final output.
     11For such applications it may be difficult to distinguish between results that are correct
     12but differ because of numerical discrepancies, and results that are erroneous.
     13The 'fuzzy comparison' approach does not work for such applications.
    814
    915== Eliminating discrepancies ==
    1016
    11 One approach is to eliminate numerical discrepancies. Some notes on how to do this for Fortran programs are given in a paper, [//MOM1MP01.pdf Massive Tracking on Heterogeneous Platforms] and in an earlier [//fortran_numerics.txt text document], both courtesy of Eric !McIntosh from CERN.
     17One approach is to eliminate numerical discrepancies.
     18Some notes on how to do this for Fortran programs are given in a paper,
     19[//MOM1MP01.pdf Massive Tracking on Heterogeneous Platforms] and in an earlier [//fortran_numerics.txt text document],
     20both courtesy of Eric !McIntosh from CERN.
    1221
    1322== Homogeneous redundancy ==
    1423
    1524BOINC provides a feature called '''homogeneous redundancy''' (HR) to handle divergent applications.
    16 HR divides hosts into 'numerical equivalence classes': two hosts are in the same class if they return identical results for your applications.
     25HR divides hosts into 'numerical equivalence classes':
     26two hosts are in the same class if they return identical results for your applications.
    1727The BOINC scheduler will send results for a given workunit only to hosts in the same class;
    1828this lets you use strict equality to compare redundant results.
     
    2434in the [ProjectConfigFile config.xml] file, where N is the "HR type" to use (see below).
    2535
    26 Alternatively, you can enable HR for a single application by setting the `homogeneous_redundancy` field in its database record to the HR type for use with that application.
     36Alternatively, you can enable HR for a single application by setting the `homogeneous_redundancy` field
     37in its database record to the HR type for use with that application.
    2738
    2839An "HR type" is a host classification.
     
    3344 2:: A coarse-grained classification in which there are 4 classes: Windows, Linux, Mac-PPC and Mac-Intel.
    3445
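To make the coarse-grained classification concrete, the sketch below maps a host to one of the four classes and then compares redundant results with strict equality. It is illustrative only: the real logic lives in sched/hr.cpp, and the function names, the string matching rules, and the `"unknown"` fallback here are hypothetical.

```python
def coarse_hr_class(os_name, cpu_vendor):
    """Hypothetical mapping of a host to one of four coarse classes."""
    os_l = os_name.lower()
    if "windows" in os_l:
        return "Windows"
    if "linux" in os_l:
        return "Linux"
    if "darwin" in os_l or "mac" in os_l:
        # Assumed rule: PowerPC Macs report a vendor string containing "power".
        return "Mac-PPC" if "power" in cpu_vendor.lower() else "Mac-Intel"
    return "unknown"


def results_agree(results):
    """Strictly compare the outputs returned for one workunit.

    results is a list of (hr_class, output_bytes) pairs. With HR enabled,
    all results for a workunit should come from a single class.
    """
    classes = {hr_class for hr_class, _ in results}
    if len(classes) != 1:
        raise ValueError("results span more than one HR class")
    outputs = {output for _, output in results}
    return len(outputs) == 1
```

Because every result for a workunit comes from the same class, the comparison can be byte-for-byte rather than fuzzy.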
    35 The proper classification depends on your application, and how it's compiled (compiler, compiler options, math libraries) on the various platforms.
     46The proper classification depends on your application,
     47and how it's compiled (compiler, compiler options, math libraries) on the various platforms.
    3648For example, WCG reports that the following gcc options (on Linux) cause their apps to produce identical results on all processor types:
    3749{{{
     
    5163'''sched/hr.cpp'''.
    5264
    53 == Taking a census of hosts ==
     65== Scheduling considerations ==
    5466
    55 If you use HR, it's important to tell the feeder roughly what fraction
    56 of hosts belong to each HR class;
    57 this allows it to allocate space in its shared-memory work array
    58 in proportion to this fraction.
    59 This information is passed to the feeder in a file '''hr_info.txt'''
    60 in your project's root directory.
    61 You can generate this file by running '''sched/census'''.
    62 Run this as a periodic task to track changes
    63 in your host population; example config.xml entry:
     67When HR is used, once an instance of a job has been sent to a host,
     68the job is "committed" to the HR class of that host.
     69This can potentially lead to a situation where the scheduler's
     70[ProjectOptions#Job-cachescheduling job cache]
     71contains only jobs committed to a particular HR class,
     72and hosts of other HR classes won't get jobs.
     73You can use the '''show_shmem''' command to check whether this is happening.
     74
     75For most projects this doesn't occur.
     76If it does, BOINC provides a mechanism that allocates slots in the job cache
     77to different HR classes, in proportion to the aggregate processing rate
     78of hosts in each class.
     79To enable this, put
     80{{{
     81<hr_allocate_slots/>
     82}}}
     83in your '''config.xml''' file.
     84
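The proportional allocation described above can be sketched as follows. This is not the feeder's actual code: the function name `allocate_slots`, the dictionary input, and the largest-remainder rounding are assumptions made for the example.

```python
def allocate_slots(rates, total_slots):
    """Split total_slots among HR classes in proportion to processing rate.

    rates maps an HR class name to the aggregate processing rate of its
    hosts (non-negative numbers). Largest-remainder rounding is used so
    that the allocated slots always sum to total_slots.
    """
    total = sum(rates.values())
    if total <= 0:
        raise ValueError("total processing rate must be positive")
    exact = {c: total_slots * r / total for c, r in rates.items()}
    slots = {c: int(v) for c, v in exact.items()}  # round down first
    leftover = total_slots - sum(slots.values())
    # Hand the remaining slots to the classes with the largest remainders.
    by_remainder = sorted(exact, key=lambda c: exact[c] - slots[c], reverse=True)
    for c in by_remainder[:leftover]:
        slots[c] += 1
    return slots
```

For example, with rates of 60, 30 and 10 for three classes and a 100-slot cache, the classes receive 60, 30 and 10 slots respectively.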
     85If you use this mechanism, you must periodically run a program called '''census'''
     86that computes the shares for each HR class.
     87To do so, add the following config.xml entry:
    6488{{{
    6589<task>
     
    7397You can use this, e.g., during the period when your project is starting up
    7498and doesn't have a lot of hosts yet.
     99Copy it to your project's root directory as '''hr_info.txt'''.
    75100
    76101If you send the feeder a SIGUSR1 signal,
     
    87112Instead, you can either:
    88113
    89  * Wait until there are no jobs in progress, either by waiting for them to finish or by canceling workunits using the [HtmlOps administrative web page]
     114 * Wait until there are no jobs in progress,
     115  either by waiting for them to finish or by canceling workunits using the [HtmlOps administrative web page]
    90116 * Create a new application.