Changes between Version 15 and Version 16 of HomogeneousRedundancy
Timestamp: Mar 15, 2012, 12:27:42 PM
HomogeneousRedundancy
--- v15
+++ v16
  = Dealing with numerical discrepancies =

- Most numerical applications produce different outcomes for a given workunit depending on the machine architecture, operating system, compiler, and compiler flags. For some applications these discrepancies produce only small differences in the final output, and results can be validated using a 'fuzzy comparison' function that allows for deviations of a few percent.
+ Most numerical applications produce different outcomes for a given workunit
+ depending on the machine architecture, operating system, compiler, and compiler flags.
+ For some applications these discrepancies produce only small differences in the final output,
+ and results can be validated using a 'fuzzy comparison' function that allows for deviations of a few percent.

- Other applications are 'divergent' in the sense that small numerical differences lead to unpredictably large differences in the final output. For such applications it may be difficult to distinguish between results that are correct but differ because of numerical discrepancies, and results that are erroneous. The 'fuzzy comparison' approach does not work for such applications.
+ Other applications are 'divergent' in the sense that small numerical differences lead to unpredictably large differences in the final output.
+ For such applications it may be difficult to distinguish between results that are correct
+ but differ because of numerical discrepancies, and results that are erroneous.
+ The 'fuzzy comparison' approach does not work for such applications.

  == Eliminating discrepancies ==

- One approach is to eliminate numerical discrepancies. Some notes on how to do this for Fortran programs are given in a paper, [//MOM1MP01.pdf Massive Tracking on Heterogeneous Platforms] and in an earlier [//fortran_numerics.txt text document], both courtesy of Eric !McIntosh from CERN.
+ One approach is to eliminate numerical discrepancies.
+ Some notes on how to do this for Fortran programs are given in a paper,
+ [//MOM1MP01.pdf Massive Tracking on Heterogeneous Platforms] and in an earlier [//fortran_numerics.txt text document],
+ both courtesy of Eric !McIntosh from CERN.

  == Homogeneous redundancy ==

  BOINC provides a feature called '''homogeneous redundancy''' (HR) to handle divergent applications.
- HR divides hosts into 'numerical equivalence classes': two hosts are in the same class if they return identical results for your applications.
+ HR divides hosts into 'numerical equivalence classes':
+ two hosts are in the same class if they return identical results for your applications.
  The BOINC scheduler will send results for a given workunit only to hosts in the same class;
  this lets you use strict equality to compare redundant results.
…
  in the [ProjectConfigFile config.xml] file, where N is the "HR type" to use (see below).

- Alternatively, you can enable HR for a single application by setting the `homogeneous_redundancy` field in its database record to the HR type for use with that application.
+ Alternatively, you can enable HR for a single application by setting the `homogeneous_redundancy` field
+ in its database record to the HR type for use with that application.

  An "HR type" is a host classification.
…
  2:: A coarse-grained classification in which there are 4 classes: Windows, Linux, Mac-PPC and Mac-Intel.

- The proper classification depends on your application, and how it's compiled (compiler, compiler options, math libraries) on the various platforms.
+ The proper classification depends on your application,
+ and how it's compiled (compiler, compiler options, math libraries) on the various platforms.
  For example, WCG reports that the following gcc options (on Linux) cause their apps to produce identical results on all processor types:
  {{{
…
  '''sched/hr.cpp'''.

- == Taking a census of hosts ==
+ == Scheduling considerations ==

- If you use HR, it's important to tell the feeder roughly what fraction
- of hosts belong to each HR class;
- this allows it to allocate space in its shared-memory work array
- in proportion to this fraction.
- This information is passed to the feeder in a file '''hr_info.txt'''
- in your project's root directory.
- You can generate this file by running '''sched/census'''.
- Run this as a periodic task to track changes
- in your host population; example config.xml entry:
+ When HR is used, once an instance of a job has been sent to a host,
+ the job is "committed" to the HR class of that host.
+ This can potentially lead to a situation where the scheduler's
+ [ProjectOptions#Job-cachescheduling job cache]
+ contains only jobs committed to a particular HR class,
+ and hosts of other HR classes won't get jobs.
+ You can use the '''show_shmem''' command to check whether this is happening.
+
+ For most projects this doesn't occur.
+ If it does, BOINC provides a mechanism that allocates slots in the job cache
+ to different HR classes, in proportion to the aggregate processing rate
+ of hosts in each class.
+ To enable this, put
+ {{{
+ <hr_allocate_slots/>
+ }}}
+ in your '''config.xml''' file.
+
+ If you use this mechanism, you must periodically run a program called '''census'''
+ that computes the shares for each HR class.
+ To do so, add the following config.xml entry:
  {{{
  <task>
…
  You can use this e.g., during the period when your project is starting up
  and doesn't have a lot of hosts yet.
+ Copy it to your project's root directory as '''hr_info.txt'''.

  If you send the feeder a SIGUSR1 signal,
…
  Instead, you can either:

- * Wait until there are no jobs in progress, either by waiting for them to finish or by canceling workunits using the [HtmlOps administrative web page]
+ * Wait until there are no jobs in progress,
+ either by waiting for them to finish or by canceling workunits using the [HtmlOps administrative web page]
  * Create a new application.
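
The text added under "Scheduling considerations" says that job-cache slots are allocated to HR classes in proportion to the aggregate processing rate of the hosts in each class. The following is a minimal sketch of what such a proportional split could look like. It is not BOINC's implementation: the function name `allocate_slots`, the class labels, the rates, and the largest-remainder rounding are all assumptions made for illustration.

```python
def allocate_slots(class_rates, total_slots):
    """Split total_slots among HR classes in proportion to class_rates.

    class_rates maps a hypothetical HR class label to the aggregate
    processing rate of its hosts (assumed non-negative). Rounding uses
    the largest-remainder method, which is an assumption of this sketch
    and not documented BOINC behavior.
    """
    total_rate = sum(class_rates.values())
    if total_rate <= 0 or total_slots <= 0:
        return {cls: 0 for cls in class_rates}

    # Ideal (fractional) share of the job cache for each class.
    ideal = {
        cls: total_slots * rate / total_rate
        for cls, rate in class_rates.items()
    }
    # Start from the integer part of each share.
    slots = {cls: int(share) for cls, share in ideal.items()}
    # Give the leftover slots to the classes with the largest remainders.
    leftover = total_slots - sum(slots.values())
    by_remainder = sorted(
        ideal, key=lambda cls: ideal[cls] - slots[cls], reverse=True
    )
    for cls in by_remainder[:leftover]:
        slots[cls] += 1
    return slots


if __name__ == "__main__":
    # Made-up rates for the four coarse-grained classes of HR type 2.
    rates = {"Windows": 600.0, "Linux": 250.0, "Mac-PPC": 50.0, "Mac-Intel": 100.0}
    print(allocate_slots(rates, 100))
```

With these made-up rates, a 100-slot cache would be split 60/25/5/10. A class with a small but nonzero rate can still round down to zero slots under this scheme, so a real allocator might want to reserve a minimum number of slots per class.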