
Version 14 (modified by davea, 12 years ago)


Database purging utility

As a BOINC project operates, the size of its workunit and result tables increases. Eventually they become so large that adding a field or building an index may take hours or days. To keep this from happening, BOINC provides a utility, db_purge, that writes result and workunit records to XML-format archive files and then deletes them from the database. Workunits are purged only when their input files have been deleted; because of BOINC's file-deletion policy, this implies that all their results are completed. So when a workunit is purged, all of its results are purged too.

db_purge creates an archive/ directory and stores archive files there.

db_purge is normally run as a daemon, specified in the config.xml file. It has the following command-line options:

--min_age_days X
Purge only WUs whose mod_time is at least X days (X may be < 1) in the past. Recommended value: 7 or so; this lets users examine their recent results.
--max N
Purge at most N WUs, then exit
--max_wu_per_file N
Write at most N WUs to each archive file. Recommended value: 10,000 or so.
--mod N R
Process only WUs with ID mod N == R. This lets you run multiple instances for increased throughput. Must be used with --no_archive.
--zip
Compress archive files using zip
--gzip
Compress archive files using gzip
--daily_dir
Write archives in a new directory (YYYY_MM_DD) each day
-d N
Set logging verbosity to N (1,2,3)
--no_archive
Don't archive workunits or results
--dont_delete
Don't delete from DB, for testing only
--one_pass
Do one pass, then quit
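Since db_purge normally runs as a daemon declared in config.xml, a daemon entry for it might look like the following sketch (the flag values are illustrative, not required; adjust them to your project's needs):

```xml
<daemon>
    <!-- purge WUs older than a week, gzip the archives,
         one directory per day, moderate log verbosity -->
    <cmd>db_purge -d 2 --min_age_days 7 --max_wu_per_file 10000 --gzip --daily_dir</cmd>
</daemon>
```

This goes inside the <daemons> section of config.xml alongside the other project daemons.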

Recording credited jobs

You may want to keep a record of which user returned a valid result for each workunit (e.g. so that you can give them credit for specific discoveries). If you use db_purge, this information is removed from the database; it can be extracted from the XML archives, but it's slow.

Alternatively, you can store the user/workunit association in a separate table called credited_job. The records in this table are small - they store only a user ID and a 64-bit integer which is taken from the workunit's 'opaque' field (typically an index in a separate database). By default, the table has indices that allow you to efficiently enumerate workunits for a given user and users for a given workunit.

To use the credited_job table, pass the --update_credited_job flag to your validator.
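For illustration, assuming the standard BOINC schema (credited_job has columns userid and workunitid, both indexed), the two lookups described above might look like this; the numeric values are hypothetical:

```sql
-- All workunits credited to a given user (userid 42 is hypothetical)
SELECT workunitid FROM credited_job WHERE userid = 42;

-- All users credited for a given workunit; the value 12345 is the
-- workunit's 'opaque' field, not its database ID
SELECT userid FROM credited_job WHERE workunitid = 12345;
```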

Archive file format

The archive files have names of the form wu_archive_TIME and result_archive_TIME where TIME is the Unix time the file was created. In addition, db_purge generates index files 'wu_index' and 'result_index' associating each WU and result ID with the timestamp of its archive file.

Both types of index file consist of rows of the form:

ID     TIME

The ID field of the WU or result, 5 spaces, and the timestamp part of the archive filename where the record with that ID can be found.
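For example, a wu_index entry might read (ID and timestamp invented):

```
1234567     1387061956
```

meaning that the record for workunit 1234567 can be found in wu_archive_1387061956.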

The format of a record in the result archive file is:

<result_archive>
  <id>%d</id>
  <create_time>%d</create_time>
  <workunitid>%d</workunitid>
  <server_state>%d</server_state>
  <outcome>%d</outcome>
  <client_state>%d</client_state>
  <hostid>%d</hostid>
  <userid>%d</userid>
  <report_deadline>%d</report_deadline>
  <sent_time>%d</sent_time>
  <received_time>%d</received_time>
  <name>%s</name>
  <cpu_time>%.15e</cpu_time>
  <xml_doc_in>%s</xml_doc_in>
  <xml_doc_out>%s</xml_doc_out>
  <stderr_out>%s</stderr_out>
  <batch>%d</batch>
  <file_delete_state>%d</file_delete_state>
  <validate_state>%d</validate_state>
  <claimed_credit>%.15e</claimed_credit>
  <granted_credit>%.15e</granted_credit>
  <opaque>%f</opaque>
  <random>%d</random>
  <app_version_num>%d</app_version_num>
  <appid>%d</appid>
  <exit_status>%d</exit_status>
  <teamid>%d</teamid>
  <priority>%d</priority>
  <mod_time>%s</mod_time>
</result_archive>

The format of a record in the WU archive file is:

<workunit_archive>
  <id>%d</id>
  <create_time>%d</create_time>
  <appid>%d</appid>
  <name>%s</name>
  <xml_doc>%s</xml_doc>
  <batch>%d</batch>
  <rsc_fpops_est>%.15e</rsc_fpops_est>
  <rsc_fpops_bound>%.15e</rsc_fpops_bound>
  <rsc_memory_bound>%.15e</rsc_memory_bound>
  <rsc_disk_bound>%.15e</rsc_disk_bound>
  <need_validate>%d</need_validate>
  <canonical_resultid>%d</canonical_resultid>
  <canonical_credit>%.15e</canonical_credit>
  <transition_time>%d</transition_time>
  <delay_bound>%d</delay_bound>
  <error_mask>%d</error_mask>
  <file_delete_state>%d</file_delete_state>
  <assimilate_state>%d</assimilate_state>
  <hr_class>%d</hr_class>
  <opaque>%f</opaque>
  <min_quorum>%d</min_quorum>
  <target_nresults>%d</target_nresults>
  <max_error_results>%d</max_error_results>
  <max_total_results>%d</max_total_results>
  <max_success_results>%d</max_success_results>
  <result_template_file>%s</result_template_file>
  <priority>%d</priority>
  <mod_time>%s</mod_time>
</workunit_archive>
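As a sketch of how the index and archive files fit together, the following self-contained shell fragment builds a toy wu_index and archive file (IDs, names, and timestamp are all invented) and then looks up a workunit record the way you would against the real files in the archive/ directory:

```shell
#!/bin/sh
# Toy demonstration with hypothetical data; real files live in archive/.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# Index format: ID, 5 spaces, archive-file timestamp
printf '101     1387061956\n102     1387061956\n' > wu_index

# Two abbreviated workunit records in one archive file
cat > wu_archive_1387061956 <<'EOF'
<workunit_archive>
  <id>101</id>
  <name>wu_example_101</name>
</workunit_archive>
<workunit_archive>
  <id>102</id>
  <name>wu_example_102</name>
</workunit_archive>
EOF

wuid=101
# Step 1: map the WU ID to its archive timestamp via the index
ts=$(awk -v id="$wuid" '$1 == id { print $2; exit }' wu_index)
echo "WU $wuid is in wu_archive_$ts"

# Step 2: print the <workunit_archive> record with the matching <id>
awk -v id="$wuid" '
    /<workunit_archive>/            { buf = ""; keep = 0 }
    { buf = buf $0 "\n" }
    $0 ~ ("<id>" id "</id>")        { keep = 1 }
    /<\/workunit_archive>/ && keep  { printf "%s", buf; exit }
' "wu_archive_$ts"
```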

Restoring archived results into the DB

To extract old statistics, one has to recover the XML archive files (records are periodically removed from the result table by db_purge). This section describes a quick hack for loading the XML archives back into a database. For now, only the result_archive table is recreated, without the nested XML fields.

Note: this procedure relies on MySQL's LOAD XML INFILE statement, which first appeared in the MySQL 6.0 alpha releases. The 6.0 line was later dropped from the MySQL roadmap, so that version must be dug up somewhere on the net; try looking for mysql-6.0.11-alpha-linux-x86_64-glibc23.tar.gz, which may be found e.g. at ftp://ftp.fu-berlin.de/unix/databases/mysql/Downloads/MySQL-6.0/ . (LOAD XML INFILE was subsequently included in MySQL 5.5 and later, so a current MySQL release should also work.) MySQL 6.0 should not be used on the production DB, but only on a sandbox machine (called "local" henceforth).

Prerequisites:

  • A local working installation of mysql 6.0 alpha
  • A local image of the current BOINC database (e.g. restored from a recent dump)

Procedure:

  1. Put the following script, boinc_xml_import, somewhere it can be executed:

#!/bin/sh

# boinc_xml_import: iterate over the archive files given as arguments,
# uncompress them, strip the nested XML fields (which LOAD XML cannot
# handle), and feed the result to mysql for import.

tmpf=/tmp/boinc_import.$$
trap 'rm -f "$tmpf"' EXIT

for f in "$@"; do
    echo "Processing $f ..."

    # Remove the 3 XML snippets (xml_doc_in, xml_doc_out, stderr_out)
    # of each result. Treat the whole file as one string.
    zcat -f "$f" | perl -w -e '
        use strict;
        my $l;
        { local $/ = undef;
          $l = <>;        }
        $l =~ s|<xml_doc_in>.+?</stderr_out>|    |sg;
        print $l;
    ' > "$tmpf"

    mysql YOURPROJECT <<EOF
          CREATE TABLE IF NOT EXISTS result_archive LIKE result;
          SET AUTOCOMMIT=0;
          SET UNIQUE_CHECKS=0;
          SET FOREIGN_KEY_CHECKS=0;
          LOAD XML CONCURRENT INFILE '$tmpf' IGNORE INTO TABLE result_archive
             ROWS IDENTIFIED BY '<result_archive>';
          SET FOREIGN_KEY_CHECKS=1;
          SET UNIQUE_CHECKS=1;
          COMMIT;
EOF

    echo "... done"
done
  2. Copy the .xml.gz archive files to the local machine and run the following command (it will take a long time):
    boinc_xml_import *.xml.gz
    
  3. At the end of the run, a result_archive table will have been reconstructed on the local machine. To add the most recent results, still contained in the result table, do (this will also take a while):
    INSERT INTO result_archive SELECT * FROM result;
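A couple of sanity checks after the import may be useful; these are illustrative queries assuming the standard BOINC result schema (the received_time column holds a Unix timestamp):

```sql
-- How many archived results were loaded?
SELECT COUNT(*) FROM result_archive;

-- Rough time span covered by the reconstructed table
SELECT FROM_UNIXTIME(MIN(received_time)), FROM_UNIXTIME(MAX(received_time))
  FROM result_archive WHERE received_time > 0;
```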