Project configuration
The following elements in the <config>
section of your config.xml file
control various aspects of your project.
Booleans default to false, and can be expressed as
<tag>1</tag> (true), <tag>0</tag> (false), or <tag/> (true).
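As a rough illustration of the boolean convention above, the following Python sketch reads flags from a <config> fragment. The helper name config_bool and the sample values are made up for this example. BOINC's own parser lives in the server code and may differ in details, such as how it treats values other than 0 and 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag names come from this page, the values are illustrative.
SAMPLE = """
<config>
    <workload_sim>1</workload_sim>
    <user_filter>0</user_filter>
    <ignore_delay_bound/>
</config>
"""

def config_bool(config, tag):
    """Return the boolean value of `tag` under the convention described above."""
    elem = config.find(tag)
    if elem is None:
        return False              # absent: booleans default to false
    text = (elem.text or "").strip()
    if text == "":
        return True               # <tag/> means true
    # <tag>1</tag> is true, <tag>0</tag> is false.
    # Assumption: any other value is treated as true in this sketch.
    return text != "0"

config = ET.fromstring(SAMPLE)
for name in ("workload_sim", "user_filter", "ignore_delay_bound", "nowork_skip"):
    print(name, config_bool(config, name))
```

Here nowork_skip is absent from the sample, so it reads as false.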
Scheduler
These options control how jobs are dispatched to clients. This is also affected by the parameters you pass to the feeder.
General
- <ban_cpu>regexp</ban_cpu>
- Any host for which p_vendor<tab>p_model matches the given regular expression will not be sent jobs. This is a POSIX extended regular expression. For example, to exclude clients with AMD K6 processors, use
<ban_cpu>.*AMD.*\t.*Family 5 Model 8 Stepping 0.*</ban_cpu>
- <ban_os>regexp</ban_os>
- Any host for which os_name<tab>os_version matches the given regular expression will not be sent jobs. This is a POSIX extended regular expression.
- <distinct_beta_apps>0|1</distinct_beta_apps>
- If set, user application selection applies to beta test applications as well as others.
- <ignore_delay_bound/>
- By default, results are not sent to hosts too slow to complete them within their delay bound. If this flag is set, this rule is not enforced.
- <maintenance_delay>nseconds</maintenance_delay>
- If the project is down, tell clients to delay their next request for at least the given number of seconds.
- <max_results_accepted>N</max_results_accepted>
- Ignore reported results beyond the first N. This limits the scheduler's memory usage, which can prevent crashes. Note: this doesn't cause results to get lost; the client will report the rest of the results in the next RPC.
- <multiple_clients_per_host>0|1</multiple_clients_per_host>
- Set this if some of your hosts run multiple BOINC clients simultaneously (this is the case on projects that use Condor and/or grid resources, which require each client to use only 1 CPU). If set, the scheduler will skip a check that tries to locate the host based on its IP address.
- <nowork_skip> 0|1 </nowork_skip>
- If the scheduler has no work, it replies to RPCs without doing any database access (e.g., without looking up the user or host record). This reduces DB load, but it fails to update preferences when users click on Update. Use it if your server DB is overloaded.
- <report_grace_period>x</report_grace_period>
- <grace_period_hours>x</grace_period_hours>
- A "grace period" (in seconds or hours respectively) for task reporting. A task is considered timed out (and a new replica is generated) if it is not reported by client_deadline + x.
- <user_filter>0|1</user_filter>
- If set, use the "batch" field of workunits to select which user is allowed to process the job. If batch is nonzero, only send the job to the user with that ID.
- <workload_sim>0|1</workload_sim>
- Use a more expensive, but more accurate, method to decide whether hosts can complete jobs within their delay bound.
App version selection
- <prefer_primary_platform> 0|1 </prefer_primary_platform>
- Send hosts app versions for their primary platform if one exists; e.g. if a host is 64-bit, don't send it a 32-bit CPU version if a 64-bit CPU version exists. Use this option only if you're sure that your 64-bit versions are faster than the 32-bit versions.
- <version_select_random_factor>X</version_select_random_factor>
- In predicting which app version will be faster for a given host, multiply the projected FLOPS by a uniform random variable with mean 1 and this standard deviation (default 0.1).
Job limits
- <one_result_per_user_per_wu/>
- If set, send at most one instance of a given job to a given user. This increases the effectiveness of replication-based validation by making it more difficult for hackers to get all the instances of a given job.
- <one_result_per_host_per_wu/>
- If present, send at most one result of a given workunit to a given host. This is weaker than one_result_per_user_per_wu; it's useful if you're using homogeneous redundancy and most of the hosts of a particular class belong to a single user.
- <min_sendwork_interval> N </min_sendwork_interval>
- Minimum number of seconds between sending jobs to a given host. You can use this to limit the impact of faulty hosts.
- <max_wus_in_progress> N </max_wus_in_progress>
- <max_wus_in_progress_gpu> M </max_wus_in_progress_gpu>
- Limit the number of jobs in progress on a given host (and thus limit average turnaround time). Starting with 6.8, the BOINC client reports the resources used by in-progress jobs; in this case, the max CPU jobs in progress is N*NCPUS and the max GPU jobs in progress is M*NGPUS. Otherwise, the overall maximum is N*NCPUS + M*NGPUS.
See the following section for a more powerful way of expressing limits on in-progress jobs.
- <gpu_multiplier> GM </gpu_multiplier>
- If your project uses GPUs, set this to roughly the ratio of GPU speed to CPU speed. Used in the calculation of job limits (see next 2 items).
- <max_wus_to_send> N </max_wus_to_send>
- Maximum jobs returned per scheduler RPC is N*(NCPUS + GM*NGPUS). You can use this to limit the impact of faulty hosts. Default is 10.
- <max_ncpus>N</max_ncpus>
- An upper bound on NCPUS (default: 64)
- <daily_result_quota> N </daily_result_quota>
- Each host has a field MRD in the interval [1 .. daily_result_quota]; it's initially daily_result_quota, and is adjusted as the host sends good or bad results. The maximum number of jobs sent to a given host in a 24-hour period is MRD*(NCPUS + GM*NGPUS). You can use this to limit the impact of faulty hosts.
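The formulas in this section can be checked with a short Python sketch. The function names and the sample host (8 CPUs, 1 GPU, gpu_multiplier of 4, MRD of 100) are assumptions chosen for illustration; only the formulas themselves come from the descriptions above.

```python
def effective_ncpus(ncpus, max_ncpus=64):
    """NCPUS is bounded above by <max_ncpus> (default 64)."""
    return min(ncpus, max_ncpus)

def max_jobs_per_rpc(max_wus_to_send, ncpus, ngpus, gm):
    """<max_wus_to_send>: N*(NCPUS + GM*NGPUS) jobs per scheduler RPC."""
    return max_wus_to_send * (ncpus + gm * ngpus)

def max_jobs_per_day(mrd, ncpus, ngpus, gm):
    """<daily_result_quota>: MRD*(NCPUS + GM*NGPUS) jobs per 24 hours."""
    return mrd * (ncpus + gm * ngpus)

def max_in_progress(n, m, ncpus, ngpus, client_reports_resources=True):
    """<max_wus_in_progress> / <max_wus_in_progress_gpu>.

    Clients that report resource usage get separate (CPU, GPU) limits;
    otherwise a single overall limit applies.
    """
    if client_reports_resources:
        return n * ncpus, m * ngpus
    return n * ncpus + m * ngpus

# Hypothetical host and settings, for illustration only.
ncpus = effective_ncpus(8)
ngpus, gm = 1, 4

print(max_jobs_per_rpc(10, ncpus, ngpus, gm))    # 120
print(max_jobs_per_day(100, ncpus, ngpus, gm))   # 1200
print(max_in_progress(2, 3, ncpus, ngpus))       # (16, 3)
print(max_in_progress(2, 3, ncpus, ngpus, client_reports_resources=False))  # 19
```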
Job limits (advanced)
The following is a more adaptable way of expressing limits on the number of jobs in progress on a host. You can specify limits for specific apps, and for your project as a whole. Within each of these, you can specify limits for CPU jobs, GPU jobs, or total. In the case of CPU and GPU jobs, you can specify whether the limit should be scaled by the number of devices present on the host.
This uses a separate config file, config_aux.xml. The syntax is:
<?xml version="1.0" ?>
<config>
    <max_jobs_in_progress>
        <project>
            <total_limit>
                <jobs>N</jobs>
            </total_limit>
            [ <gpu_limit> ]
                <jobs>N</jobs>
                [ <per_proc/> ]     if set, limit is per processor
            [ <cpu_limit> ] ...
        </project>
        <app>
            <app_name>name</app_name>
            [ <total_limit> ... ]
            [ <cpu_limit> ... ]
            [ <gpu_limit> ... ]
        </app>
        ...
    </max_jobs_in_progress>
</config>
These limits are enforced only for 6.12+ clients.
Job scheduling
The default job scheduling mechanism is that the feeder (a daemon program) maintains a cache of jobs in shared memory. Scheduler instances get jobs from this cache, reducing their database access overhead.
- <shmem_work_items>N</shmem_work_items>
- The size of the job cache. Default is 100 jobs.
- <feeder_query_size>N</feeder_query_size>
- The size of the feeder's enumeration query. Default is 200.
- <sched_old>0|1</sched_old>
- Use an old mechanism in which the scheduler scans the cache multiple times, looking for jobs according to different criteria. The current mechanism makes a single pass through the cache.
- <job_size_matching>0|1</job_size_matching>
- If set, enable multi-size applications; favor sending large jobs to fast hosts. To use this, you must run the size_census.php program as a periodic task to maintain statistics on the distribution of host speeds.
- <rte_no_stats>0|1</rte_no_stats>
- If set, don't use statistics (host/app version or app version) in job runtime estimation. Use this if the runtime distribution is not unimodal, e.g. for universal apps that lump together a lot of actual applications.
Homogeneous redundancy
- <homogeneous_redundancy>N</homogeneous_redundancy>
- If zero (default) don't use the homogeneous redundancy mechanism. Otherwise, specifies the granularity of host classification (1=fine, 2=coarse). (Note: you may also specify this on a per-application basis).
- <hr_allocate_slots>0|1</hr_allocate_slots>
- If set, allocate job-cache slots to homogeneous redundancy (HR) classes. Use this if your job cache is getting clogged with jobs committed to a particular HR class.
- <hr_class_static>0|1</hr_class_static>
- Suppress a mechanism that clears the HR class of jobs that have error instances and no in-progress or completed instances. Use this if you assign HR classes to jobs.
Accelerating retries
The goal of this mechanism is to send timeout-generated retries to hosts that are likely to finish them fast. Here's how it works:
- Hosts are deemed "reliable" (a slight misnomer) if they satisfy turnaround time and error rate criteria.
- A job instance is deemed "need-reliable" if its priority is above a threshold.
- The scheduler tries to send need-reliable jobs to reliable hosts. When it does, it reduces the delay bound of the job.
- When job replicas are created in response to errors or timeouts, their priority is raised relative to the job's base priority.
The configurable parameters are:
- <reliable_on_priority>X</reliable_on_priority>
- Results with priority at least reliable_on_priority are treated as "need-reliable". They'll be sent preferentially to reliable hosts.
- <reliable_max_avg_turnaround>secs</reliable_max_avg_turnaround>
- Hosts whose average turnaround is at most reliable_max_avg_turnaround and that have at least 10 consecutive valid results are considered 'reliable'. Make sure you set this low enough that a significant fraction (e.g. 25%) of your hosts qualify.
- <reliable_reduced_delay_bound>X</reliable_reduced_delay_bound>
- When a need-reliable result is sent to a reliable host, multiply the delay bound by reliable_reduced_delay_bound (typically 0.5 or so).
- <reliable_priority_on_over>X</reliable_priority_on_over>
- <reliable_priority_on_over_except_error>X</reliable_priority_on_over_except_error>
- If reliable_priority_on_over is nonzero, increase the priority of duplicate jobs by that amount over the job's base priority. Otherwise, if reliable_priority_on_over_except_error is nonzero, increase the priority of duplicates caused by timeout (not error) by that amount. (Typically only one of these is nonzero, and is equal to reliable_on_priority.)
NOTE: this mechanism can be used to preferentially send ANY job, not just retries, to fast/reliable hosts. To do so, set the workunit's priority to reliable_on_priority or greater.
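The rules above can be summarized in a small Python sketch. This is one reading of the descriptions on this page, not the scheduler's actual code. The Host and Config classes, the function names, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Host:
    avg_turnaround: float      # seconds
    consecutive_valid: int     # consecutive valid results

@dataclass
class Config:
    reliable_on_priority: int
    reliable_max_avg_turnaround: float
    reliable_reduced_delay_bound: float
    reliable_priority_on_over: int = 0
    reliable_priority_on_over_except_error: int = 0

def is_reliable(host, cfg):
    """Turnaround and consecutive-valid-result criteria."""
    return (host.avg_turnaround <= cfg.reliable_max_avg_turnaround
            and host.consecutive_valid >= 10)

def needs_reliable(priority, cfg):
    return priority >= cfg.reliable_on_priority

def retry_priority(base_priority, cfg, caused_by_timeout):
    """Priority of a replica created after an error or timeout."""
    if cfg.reliable_priority_on_over:
        return base_priority + cfg.reliable_priority_on_over
    if cfg.reliable_priority_on_over_except_error and caused_by_timeout:
        return base_priority + cfg.reliable_priority_on_over_except_error
    return base_priority

def delay_bound_for(delay_bound, priority, host, cfg):
    """Shorten the delay bound when a need-reliable job goes to a reliable host."""
    if needs_reliable(priority, cfg) and is_reliable(host, cfg):
        return delay_bound * cfg.reliable_reduced_delay_bound
    return delay_bound

# Hypothetical settings: boost timeout retries only, halve their delay bound.
cfg = Config(reliable_on_priority=10,
             reliable_max_avg_turnaround=86400,
             reliable_reduced_delay_bound=0.5,
             reliable_priority_on_over_except_error=10)
fast_host = Host(avg_turnaround=43200, consecutive_valid=25)

prio = retry_priority(0, cfg, caused_by_timeout=True)
print(prio)                                           # 10
print(delay_bound_for(604800, prio, fast_host, cfg))  # 302400.0
print(retry_priority(0, cfg, caused_by_timeout=False))  # 0
```

With these sample settings, a retry caused by an error keeps its base priority, so it is not treated as need-reliable.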
Locality scheduling
- <locality_scheduling/>
- When possible, send work that uses the same files that the host already has. This is intended for projects which have large data files, where many different workunits use the same data file. In this case, to reduce download demands on the server, it may be advantageous to retain the data files on the hosts, and send them work for the files that they already have. See Locality Scheduling.
- <locality_scheduling_wait_period> N </locality_scheduling_wait_period>
- This element only has an effect when used in conjunction with the previous locality scheduling element. It tells the scheduler to use 'trigger files' to inform the project that more work is needed for specific files. The period is the number of seconds which the scheduler will wait to see if the project can create additional work. Together with project-specific daemons or scripts this can be used for 'just-in-time' workunit creation. See Locality Scheduling.
Job retransmission
- <resend_lost_results> 0|1 </resend_lost_results>
- If set, and an <other_results> list is present in the scheduler request, resend any in-progress results not in the list. This is recommended; it may increase the efficiency of your project. For reasons that are not well understood, a BOINC client sometimes fails to receive the scheduler reply. This flag addresses that issue: it causes the SAME results to be resent by the scheduler, if the client has failed to receive them. Note: this will increase the load on your DB server; you can minimize this by creating an index:
alter table result add index res_host_state (hostid, server_state);
- <send_result_abort>0|1</send_result_abort>
- If set, and the client is processing a result for a WU that has been canceled or is not in the DB (i.e. there's no chance of getting credit), tell the client to abort the result regardless of state. If the client is processing a result for a WU that has been assimilated or is overdue (i.e. there's a chance of not getting credit), tell the client to abort the result if it hasn't started yet. Note: this will increase the load on your DB server.
Data distribution
- <replace_download_url_by_timezone>URL</replace_download_url_by_timezone>
- OUTDATED: BERND, PLEASE UPDATE. When the scheduler sends work to hosts, it replaces the download URL appearing in the data and executable file descriptions with the download URL closest to the host's timezone. The project must provide a two-column file called 'download_servers' in the project root directory. This is a list of all download servers that will be inserted when work is sent to hosts. The first column is an integer giving the server's offset in seconds from UTC. The second column is the server URL in a format such as http://einstein.phys.uwm.edu. The download servers must have identical file hierarchies and contents, and the path to files and executables must start with '/download/...' as in 'http://X/download/123/some_file_name'.
- <cache_md5_info> 0|1 </cache_md5_info>
- When creating work, keep a record (in files called foo.md5) of the file length and md5 sum of data files and executables. This can greatly reduce the time needed to create work, if (1) these files are re-used, and (2) there are many of these files, and (3) reading the files from disk is time-consuming.
Logging
The contents of the log files are controlled by the following:
- <debug_assignment/>
- Explain the sending of assigned work.
- <debug_credit/>
- Show credit details in validator logs.
- <debug_edf_sim_detail/>
- Show the details of EDF simulation
- <debug_edf_sim_workload/>
- Show the initial conditions of EDF simulation
- <debug_handle_results/>
- Show the handling of reported jobs.
- <debug_locality/>
- Show locality scheduling debugging info.
- <debug_prefs/>
- Show the propagation of global prefs.
- <debug_quota/>
- Show info related to job quotas (per RPC, max in progress, max per day)
- <debug_request_details/>
- Show details of request message.
- <debug_request_headers/>
- Show HTTP request headers.
- <debug_resend/>
- Show resending of lost jobs.
- <debug_send/>
- High-level job dispatch info, e.g. work request parameters and jobs actually sent.
- <debug_send_scan/>
- Job dispatch: info about scans through the shared-memory job cache.
- <debug_send_job/>
- Job dispatch: info at the level of individual jobs (e.g. why they weren't sent).
- <debug_user_messages/>
- Show messages we're sending to the user.
- <debug_vda/>
- Show details for volunteer data archival.
- <debug_version_select/>
- Explain app version selection.
The overall verbosity is controlled by the following:
- <sched_debug_level> N </sched_debug_level>
- Log messages have a "level": 1=minimal, 2=normal, 3=debug. The messages enabled by the above flags have level=2. If you set this option to N, only messages of level N or less will be written.
Scheduler debugging
- <scheduler_log_buffer>N</scheduler_log_buffer>
- Set the output buffer of the scheduler log to N bytes. Set it to zero to see all msgs before a crash.
- <debug_req_reply_dir>path</debug_req_reply_dir>
- If specified, each scheduler instance will write three files in this directory: PID_C_sched.log, PID_C_sched_request.xml and (if all goes well) PID_C_sched_reply.xml. PID is the process ID of this scheduler instance; C is an internal counter within the process if FCGI is used. The sched.log file contains nothing other than the PID and the IP address of the client. This should allow you to identify the scheduler instance responsible for a given Apache error log message ("premature end of script headers") when a scheduler has crashed. sched_request.xml is the scheduler request, and if the scheduler doesn't crash in between, the reply to the client is kept in sched_reply.xml.
Client control
- <next_rpc_delay>x</next_rpc_delay>
- In each scheduler reply, tell the clients to do another scheduler RPC after at most x seconds, regardless of whether they need work. This is useful, e.g., to ensure that in-progress jobs can be canceled in a bounded amount of time.
- <verify_files_on_app_start/>
- Before starting or restarting an app, check the contents of input files and app version files by either MD5 or digital signature check. This detects user tampering with files (but doesn't really increase security, since the user could also change MD5s or signatures in the client state file).
- <symstore>URL</symstore>
- URL of your project's symbol store, used for debugging Windows applications.
- <min_core_client_version_announced> N </min_core_client_version_announced>
- Announce a new version of the BOINC core client, which in the future will be the minimum required version. In conjunction with the next tag, you can warn users with version below this to upgrade by a specified deadline. The version number is encoded as 10000*major + 100*minor + release.
- <min_core_client_upgrade_deadline> N </min_core_client_upgrade_deadline>
- Use in conjunction with the previous tag. The value given here is the Unix epoch returned by time(2) until which hosts can update their core client. After this time, they may be shut out of the project. Before this time, they will receive messages warning them to upgrade.
- <msg_to_host/>
- If present, check the msg_to_host table on each RPC, and send the client any messages queued for it.
- <non_cpu_intensive> 0|1 </non_cpu_intensive>
- If this flag is present, the project will be treated specially by the client:
- The client will download one result at a time.
- This result will be executed whenever computation is enabled (bypassing the normal scheduling mechanism).
This is intended for applications that use little CPU time, e.g. that do network or host measurements.
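The version encoding and deadline format used by min_core_client_version_announced and min_core_client_upgrade_deadline above can be computed as in the following Python sketch. The function names, the version 7.16.11, and the date are arbitrary examples chosen for illustration.

```python
from datetime import datetime, timezone

def encode_version(major, minor, release):
    """Encode a client version as 10000*major + 100*minor + release."""
    return 10000 * major + 100 * minor + release

def decode_version(n):
    """Inverse of encode_version; returns (major, minor, release)."""
    major, rest = divmod(n, 10000)
    minor, release = divmod(rest, 100)
    return major, minor, release

print(encode_version(7, 16, 11))   # 71611
print(decode_version(61200))       # (6, 12, 0)

# An upgrade deadline is a Unix epoch value, as returned by time(2).
# Example: midnight UTC on 1 January 2030.
deadline = int(datetime(2030, 1, 1, tzinfo=timezone.utc).timestamp())
print(deadline)                    # 1893456000
```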
Upload certificates
- <dont_generate_upload_certificates/>
- Don't put upload certificates in results. This makes result generation a lot faster, since no encryption is done, but you lose protection against DoS attacks on your upload servers.
- <ignore_upload_certificates/>
- If upload certificates are not generated, this option must be enabled to force the file upload handler to accept files.
Default preferences
- <default_disk_max_used_gb> X </default_disk_max_used_gb>
- Sets the default value for the disk_max_used_gb preference so it's consistent between the scheduler and web pages. The scheduler uses it when a request for work doesn't include preferences, or the preference is set to zero. The web page scripts use it to set the initial value when displaying or editing preferences the first time, or when the user never saved them. Default is 100.
- <default_disk_max_used_pct> X </default_disk_max_used_pct>
- Sets the default value for the disk_max_used_pct preference so it's consistent between the scheduler and web pages. The scheduler uses it when a request for work doesn't include preferences, or the preference is set to zero. The web page scripts use it to set the initial value when displaying or editing preferences the first time, or when the user never saved them. Default is 50.
- <default_disk_min_free_gb> X </default_disk_min_free_gb>
- Sets the default value for the disk_min_free_gb preference so it's consistent between the scheduler and web pages. The scheduler uses it when a request for work doesn't include preferences. The web page scripts use it to set the initial value when displaying or editing preferences the first time, or when the user never saved them. Also, the scheduler uses this setting to override any smaller preference from the host; this enforces a minimum free disk space to keep the drive from filling up. We recommend setting this no smaller than .001 (1 MB, or 1,000,000 bytes). Default is .001.
File deletion policy
- <delete_delay_hours>X</delete_delay_hours>
- Wait X hours before deleting files. This provides a 'grace period' during which late results will still get credit.
- <httpd_user>username</httpd_user>
- The user name under which the web server runs. As a safeguard, the file deleter skips files not owned by this user.
Server status page options
- <www_host>hostname</www_host>
- Host name of web server.
- <sched_host>hostname</sched_host>
- Host name of scheduling server.
- <uldl_host>hostname</uldl_host>
- Host name of upload/download server.
- <uldl_pid>path</uldl_pid>
- pid file of upload/download server (default: /etc/httpd/run/httpd.pid).
- <ssh_exe>path</ssh_exe>
- path to ssh (default: /usr/bin/ssh).
- <ps_exe>path</ps_exe>
- path to ps (which supports "w" flag) (default: /bin/ps).
Web site features
- <profile_screening/>
- If present, don't show profile pictures until they've been screened and approved by project admins.
- <show_results/>
- Enable web site features that show results (per user, host, etc.).
- <dont_suppress_pending/>
- Do not hide incomplete results when using adaptive replication.
- <no_forum_rating/>
- Disable forum post rating.
- <no_web_account_creation/>
- Don't allow account creation via the web. See also <disable_account_creation> and <disable_account_creation_rpc>.
- <akismet_key> 1234567890ab </akismet_key>
- If set, akismet.com is used to check post contents to protect forums from spam. See Protecting message boards from spam for more information.
- <users_per_page>N</users_per_page>
- Number of entries per page for top users/teams/hosts. Default is 20.
- <teams_per_page>N</teams_per_page>
- <hosts_per_page>N</hosts_per_page>
- <recaptcha_public_key>X</recaptcha_public_key>
- <recaptcha_private_key>X</recaptcha_private_key>
- Enable the use of Recaptcha for profile creation/editing; see Protecting message boards from spam for more information.
- <profile_min_credit>X</profile_min_credit>
- The minimum amount of credit to create or edit a profile.
- <team_forums_members_only>0|1</team_forums_members_only>
- If set, team message boards are visible only to team members.
- <moderators_vote_to_ban>0|1</moderators_vote_to_ban>
- If set, banishments require a majority vote among moderators.
Miscellaneous
- <project_user_name> X </project_user_name>
- Only this user is allowed to execute the start script. Use this to prevent root from starting the project, which would lead to bad file permissions.
- <min_core_client_version> N </min_core_client_version>
- If the scheduler gets a request from a client with a version number less than this, it returns an error message and doesn't do any other processing. The version number is expressed as an integer with the encoding 10000*major + 100*minor + release. You can also specify this separately for each application.
- <ended>0|1</ended>
- Project has permanently ended. Tell clients so user can be notified.
- <disable_account_creation/>
- If present, disallow account creation via Web and RPC. See also <no_web_account_creation>.
- <disable_account_creation_rpc/>
- If present, disallow account creation via Web RPCs.
- <disable_team_creation/>
- If present, disallow team creation via Web and RPC.
- <min_passwd_length> N </min_passwd_length>
- Minimum length of user passwords. Default is 6.
- <request_time_stats_log/>
- If present, the scheduler will tell clients (via scheduler replies) to upload (via scheduler requests) their time stats log (which contains records of when the client started and stopped running).
- <fuh_debug_level> N </fuh_debug_level>
- Verbosity level for file upload handler log output. 1=minimal, 2=normal (default), 3=verbose.
- <dont_store_success_stderr/>
- If present, don't store the stderr log in the database for successful workunits. May be useful to save on database size. Available since r18528.
Database info
- <db_name>name</db_name>
- <db_user>database_user_name</db_user>
- [ <db_host>hostname</db_host> ]
- [ <db_passwd>database_password</db_passwd> ]
- Database name, user name, hostname (default "localhost") and password (default none). The hostname can be of the form hostname:port.
Replica Database info
Describes a replica database server, which is used for read-only queries (optional). NOTE: according to Bernd Machenschalk, using a separate user account with read-only access to the replica increases the apparent consistency between main and replica.
- [ <replica_db_name>name</replica_db_name> ]
- [ <replica_db_user>database_user_name</replica_db_user> ]
- [ <replica_db_host>hostname</replica_db_host> ]
- [ <replica_db_passwd>database_password</replica_db_passwd> ]
- [ <replica_fallback_mode>N</replica_fallback_mode> ]
- N is one of:
- 0 (default): use db_user if no replica_db_user is specified; first try replica_db_host (if specified), then db_host.
- 1: only use replica_db_user; first try replica_db_host, then db_host.
- 2: only use replica_db_user; only try replica_db_host.
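One way to read the three replica_fallback_mode values is as an ordered list of (user, host) connection attempts. The Python sketch below is an interpretation of the descriptions above, not BOINC's actual code. The function name, the use of a plain dict for the options, and the sample user and host names are all assumptions.

```python
def replica_connection_plan(cfg):
    """Return the ordered (user, host) attempts for read-only queries."""
    mode = int(cfg.get("replica_fallback_mode", 0))
    db_host = cfg.get("db_host", "localhost")
    replica_user = cfg.get("replica_db_user")
    replica_host = cfg.get("replica_db_host")

    if mode == 0:
        # Mode 0: fall back to db_user if no replica_db_user is given.
        user = replica_user or cfg["db_user"]
    else:
        # Modes 1 and 2: only replica_db_user may be used.
        user = replica_user
    if user is None:
        return []   # assumption: nothing to try without a permitted user

    hosts = []
    if replica_host:
        hosts.append(replica_host)   # the replica is always tried first
    if mode in (0, 1):
        hosts.append(db_host)        # modes 0 and 1 fall back to the main DB
    return [(user, host) for host in hosts]

# Hypothetical option values, for illustration only.
base = {"db_user": "boincadm", "db_host": "db1", "replica_db_host": "db2"}

print(replica_connection_plan(base))
print(replica_connection_plan({**base, "replica_db_user": "boinc_ro",
                               "replica_fallback_mode": "1"}))
print(replica_connection_plan({**base, "replica_db_user": "boinc_ro",
                               "replica_fallback_mode": "2"}))
```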
Hosts, directories, and URLs
(These are created by make_project; normally you don't need to change them.)
- <master_url>URL</master_url>
- <long_name>name</long_name>
- <host>hostname</host>
- name of project's main host, as given by Python's socket.gethostname(). Daemons and tasks run on this host by default.
- <shmem_key>shared_memory_key</shmem_key>
- ID of scheduler shared memory. Must be unique on host.
- <download_url>URL</download_url>
- URL of data server for download.
- <download_dir>path</download_dir>
- absolute path of download directory.
- <uldl_dir_fanout>N</uldl_dir_fanout>
- fan-out factor of upload and download directories (see Hierarchical upload/download directories).
- <upload_url>URL</upload_url>
- URL of file upload handler.
- <upload_dir>path</upload_dir>
- absolute path of upload directory.
- <log_dir>path</log_dir>
- absolute path of logfile directory.
- <sandbox_dir>path</sandbox_dir>
- Location to store user-uploaded files (only for projects using remote job submission with user sandbox). Defaults to project_dir/sandbox
- <bin_dir>relative-path</bin_dir> <!-- relative to project_dir -->
- <cgi_bin_dir>relative-path</cgi_bin_dir> <!-- relative to project_dir -->
- <sched_lockfile_dir>path</sched_lockfile_dir>
- Enables scheduler locking (recommended) and specifies the directory where scheduler lockfiles are stored. Must be writable by the Apache user.
Parsing project options
A program or script can access project options as follows:
- C/C++: use the SCHED_CONFIG class (sched/sched_config.C,h)
- PHP: use the get_config() and parse_config() functions in inc/util.inc
- scripts: use the bin/parse_config program