[[PageOutline]]
= Trouble-shooting a BOINC server =
== Trouble-shooting tools ==
=== Log files ===
Each server component (scheduler, feeder, transitioner, etc.) has its own log file.
These files are in the '''log_HOSTNAME''' subdirectory of the project directory.
Most error conditions are reported in the log files.
If you're interested in the history of a particular job,
grep for `WU#12345` or `RESULT#12345` (where 12345 represents the ID) in the log files.
The [HtmlOps html/ops pages] also provide an interface for this.
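The grep described above can also be scripted when you want the matches grouped by log file. The following is a minimal sketch, not part of BOINC: the function name `job_history` is hypothetical, and it assumes the log files in the log directory end in `.log`.

```python
import re
from pathlib import Path


def job_history(log_dir, kind, job_id):
    """Return (log file name, line) pairs that mention e.g. WU#12345.

    kind is "WU" or "RESULT"; job_id is the numeric database ID.
    Assumption: log files in log_dir are named *.log.
    """
    if kind not in ("WU", "RESULT"):
        raise ValueError("kind must be 'WU' or 'RESULT'")
    # (?!\d) keeps WU#12345 from also matching WU#123456.
    pattern = re.compile(rf"\b{kind}#{int(job_id)}(?!\d)")
    hits = []
    for path in sorted(Path(log_dir).glob("*.log")):
        with open(path, errors="replace") as log:
            for line in log:
                if pattern.search(line):
                    hits.append((path.name, line.rstrip("\n")))
    return hits
```

For example, `job_history("log_HOSTNAME", "WU", 12345)` would list every line in that directory's log files that mentions workunit 12345.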
To control the verbosity of the log files:
* Scheduler: set the desired [ProjectOptions#Loggin logging options]
* File upload handler: set [ProjectOptions#misc fuh_debug_level].
* Daemons: pass the command-line argument `-d N` (1 = least verbose, 4 = most verbose).
If you run server components with '''-d 4''', their database queries will be logged.
This is verbose but extremely useful for tracking down database-level problems.
=== Examining the database ===
The [wiki:HtmlOps admin web interface] provides a web-based interface for
browsing your project's database.
You can also use MySQL tools such as:
* The [http://dev.mysql.com/doc/refman/5.0/en/mysql.html mysql interpreter].
The '[http://dev.mysql.com/doc/refman/5.0/en/show-processlist.html show processlist;]' query
is useful for diagnosing DB performance problems.
* [http://jeremy.zawodny.com/mysql/mytop/ mytop]: like 'top' for MySQL; shows running queries.
* [http://www.phpmyadmin.net/ phpMyAdmin]: a general-purpose web interface to MySQL.
=== Examining shared memory ===
The command
{{{
bin/show_shmem
}}}
will print a textual summary of the contents of the shared-memory structure
that caches jobs and information about applications.
== Trouble-shooting the job pipeline ==
* Are workunits (jobs) getting created correctly?
Examine the database to see.
If you're using a work generator, check its log file.
* Are results (job instances) getting created?
Examine the database to see.
If you don't see results, check the transitioner log file.
* Are jobs getting into shared memory?
Use show_shmem (see above).
You should see jobs.
If not, check the feeder log file.
* Is the scheduler sending jobs? If not, check its log file, preferably with the [ProjectOptions#Loggin logging options] enabled that show:
  * details of app version selection
  * details of job assignment
  * details of quota enforcement
* Are clients processing jobs correctly?
Check the status and stderr output of completed jobs.
* Are output files getting uploaded?
Check the file upload handler log file.
* Are jobs getting validated?
Check the validator log file.
* Are jobs getting assimilated?
Check the assimilator log file.
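The checklist above is ordered: the first stage that fails tells you where to look. As an illustration only, it can be written down as a small lookup table. The stage keys and the function name `first_failure` below are hypothetical; the places to look are taken from the checklist.

```python
# Ordered pipeline stages from the checklist, each paired with the place
# the checklist says to look when that stage is not working.
# The stage keys are illustrative names, not BOINC identifiers.
PIPELINE = [
    ("workunits_created", "work generator log file"),
    ("results_created", "transitioner log file"),
    ("jobs_in_shared_memory", "feeder log file"),
    ("jobs_sent", "scheduler log file"),
    ("jobs_processed", "status and stderr output of completed jobs"),
    ("output_uploaded", "file upload handler log file"),
    ("jobs_validated", "validator log file"),
    ("jobs_assimilated", "assimilator log file"),
]


def first_failure(observed):
    """Return (stage, where_to_look) for the earliest stage not marked OK.

    observed maps stage keys to True/False; missing keys count as failing.
    Returns None when every stage is marked OK.
    """
    for stage, where in PIPELINE:
        if not observed.get(stage, False):
            return stage, where
    return None
```

Working from the top of the table matters because a failure early in the pipeline (for example, no results being created) makes every later stage look broken as well.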
== Debugging the scheduler ==
If the scheduler is acting incorrectly or crashing,
and you like mucking around in C++ source code,
you can run it under a debugger like `gdb`.
The scheduler is a CGI program;
it reads a request from stdin and writes a reply to stdout.
So you can debug it as follows:
* Copy the "sched_request_X.xml" file from a client to the machine running the scheduler. (X is derived from your project URL.)
* Run the scheduler under the debugger, giving it this file as stdin, i.e.:
{{{
gdb cgi
(set a breakpoint if desired)
r < sched_request_X.xml
}}}
* You may have to doctor the database as follows to keep the scheduler from rejecting the request (N is the ID of the host that sent the request):
{{{
update host set rpc_seqno=0, rpc_time=0 where id=N;
}}}
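Because the scheduler reads a request from stdin and writes a reply to stdout, a saved request can also be replayed without a debugger, for example to check whether a crash is reproducible. The following is a rough sketch under that assumption; the function name `replay_request` is hypothetical, and `scheduler_cmd` is whatever command starts your scheduler binary.

```python
import subprocess


def replay_request(scheduler_cmd, request_path, timeout=60):
    """Feed a saved request file to the scheduler on stdin.

    scheduler_cmd is an argument list such as ["./cgi"] (an assumption;
    use the path of your own scheduler binary).
    Returns (exit code, reply bytes written to stdout).
    """
    with open(request_path, "rb") as request:
        proc = subprocess.run(
            scheduler_cmd,
            stdin=request,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    return proc.returncode, proc.stdout
```

On POSIX systems a negative exit code means the process was killed by a signal, which is one quick way to tell a crash apart from a rejected request.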
As an alternative to this, edit `sched/handle_request.cpp`,
and put a call to `debug_sched("debug_sched");` just before `sreply.write(fout, sreq);`.
Then, after recompiling, touch a file called 'debug_sched' in the project root directory.
This will cause transcripts of all subsequent scheduler requests and replies
to be written to the `cgi-bin/` directory with separate small files for each request.
The file names are `sched_request_H_R` and `sched_reply_H_R` where H=hostid and R=rpc sequence number.
This can be turned off by deleting the 'debug_sched' file.
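Since each request and reply lands in its own small file, it helps to pair them up by host and sequence number. The following is a minimal sketch; the function name `pair_transcripts` is hypothetical, and it assumes that H and R appear in the file names as plain decimal numbers with no file extension.

```python
import re
from pathlib import Path

# Matches sched_request_H_R and sched_reply_H_R (H = hostid, R = rpc sequence number).
TRANSCRIPT = re.compile(r"^sched_(request|reply)_(\d+)_(\d+)$")


def pair_transcripts(directory):
    """Map (hostid, rpc_seqno) to {"request": Path, "reply": Path}.

    An entry may hold only one of the two keys if the other file is missing.
    """
    pairs = {}
    for path in Path(directory).iterdir():
        match = TRANSCRIPT.match(path.name)
        if match:
            kind = match.group(1)
            key = (int(match.group(2)), int(match.group(3)))
            pairs.setdefault(key, {})[kind] = path
    return pairs
```

For example, `pair_transcripts("cgi-bin")` run from the project directory would group the transcripts so that a request with no matching reply stands out.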
To get core files for scheduler crashes, uncomment the following line in sched/sched_main.cpp, and recompile:
{{{
#define DUMP_CORE_ON_SEGV 1
}}}