Opened 16 years ago

Closed 16 years ago

Last modified 15 years ago

#713 closed Defect (fixed)

Multiple applications break anonymous platform

Reported by: Richard Haselgrove
Owned by: Bruce Allen
Priority: Critical
Milestone: Undetermined
Component: Server - Scheduler
Version:
Keywords:
Cc:

Description

There are reports from two projects - Einstein and SETI - with similar characteristics.

The user's message log shows:

8/5/2008 12:09:29 AM|Einstein@Home|[error] State file error: missing application einstein_S5R4

8/5/2008 12:09:29 AM|Einstein@Home|[error] Can't handle task h1_0248.60_S5R4_55_S5R4a in scheduler reply

8/5/2008 12:09:29 AM|Einstein@Home|[error] State file error: missing task h1_0248.60_S5R4_55_S5R4a

8/5/2008 12:09:29 AM|Einstein@Home|[error] Can't handle task h1_0248.60_S5R4_55_S5R4a_0 in scheduler reply

This happens when the project has two different applications (S5R3 and S5R4 at Einstein; SAH_enh and Astropulse at SETI), and the user has an app_info.xml file which specifies only one of them. The application specified in app_info.xml gets work as normal; the application NOT specified generates the errors above.

What happens next depends on whether the project has resend_lost_work enabled.

If resend_lost_work is NOT enabled (SETI), the result is a 'ghost' task in the database, and the user (hopefully) receives work for their specified application at the next attempt.

If resend_lost_work is enabled (SETI_Beta and Einstein), the scheduler tries, and fails, to send the same task endlessly: work for that project on that host dries up.

Change History (7)

comment:1 Changed 16 years ago by Bruce Allen

As a workaround, please download the appropriate S5R4 app, and add that to your app_info.xml file as well. Then please let us know whether this works properly or not.

comment:2 Changed 16 years ago by davea

Resolution: fixed
Status: new → closed

If a client has an app_info.xml for a project, the current server code won't send (or resend) a job for an app not listed in app_info.xml. (That's the intended semantics).

Reopen this if problem persists after these projects have updated their server code.

comment:3 Changed 16 years ago by Juha

Resolution: fixed
Status: closed → reopened

Tested with BOINC 6.2.11 and server revision 15755. The server is running the test project and has only the uppercase application installed.

The following app_info.xml creates the situation Richard describes:

<app_info>
    <app>
        <name>foo</name>
    </app>
    <file_info>
        <name>uppercase_2.1_windows_intelx86.exe</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>foo</app_name>
        <version_num>201</version_num>
        <file_ref>
            <file_name>uppercase_2.1_windows_intelx86.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>

The app name is incorrect on purpose to show the bug.

The bug can be fixed with the following patch:

Index: sched_send.C
===================================================================
--- sched_send.C        (revision 15755)
+++ sched_send.C        (working copy)
@@ -157,6 +157,7 @@
                 // means the client already has the app version
         }
         reply.wreq.best_app_versions.push_back(bavp);
+        if (!bavp->avp) return NULL;
         return bavp;
     }

comment:4 Changed 16 years ago by Bruce Allen

Richard,

Thanks for the additional detail and the suggested patch. However I am confused about [1] what the problem is that you are fixing and [2] what the effects of your proposed solution are. Could you please provide some additional details about (a) what was the situation BEFORE your patch and (b) what was the situation AFTER your patch to the server code.

I thought that the problem we were trying to address was the one where a user's app_info.xml did not contain a version for EVERY application that a project was running. Did you try my suggested work-around?

In your comments above you seem to be addressing a different case: one where app_info.xml contains incorrect information. I don't understand how that is relevant.

Finally, you also need to respond to David Anderson's comment about policy, which you can find here: http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2008-August/011427.html Please note that this thread contains (currently) three messages.

comment:5 in reply to:  4 ; Changed 16 years ago by Richard Haselgrove

Replying to ballen:

Richard,

Thanks for the additional detail and the suggested patch. However I am confused about [1] what the problem is that you are fixing and [2] what the effects of your proposed solution are. Could you please provide some additional details about (a) what was the situation BEFORE your patch and (b) what was the situation AFTER your patch to the server code.

I thought that the problem we were trying to address was the one where a user's app_info.xml did not contain a version for EVERY application that a project was running. Did you try my suggested work-around?

In your comments above you seem to be addressing a different case: one where app_info.xml contains incorrect information. I don't understand how that is relevant.

Finally, you also need to respond to David Anderson's comment about policy, which you can find here: http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2008-August/011427.html Please note that this thread contains (currently) three messages.

Hi Bruce,

Slight confusion here - I'm just the observer and reporter: the coding patch is by Juha. I think I'm right in saying that he has a full test BOINC server updated to changeset 15755 (i.e. after your initial patch, and after DaveA's revision at changeset 15753) running in a VM, and he's saying that because his only available application (uppercase) isn't listed in app_info.xml, he's generating the same error messages as we see when S5R4 or Astropulse is missing from Einstein/SETI app_info. As to how the patch works - that's way above my payscale. You'll have to ask him.

(The details above come from this SETI Beta message: http://setiweb.ssl.berkeley.edu/beta/forum_thread.php?id=1368&nowrap=true#34575)

Yes, I did try the workaround, and it works perfectly: with a fully-populated app_info.xml, and all applications present, work downloads and tasks run as expected. I haven't completed and reported one yet, because they seem to be running more slowly than expected, but I'm confident it's OK.

However, this is NOT the resolution to the problem. The problem is not limited to resends: it happens when the scheduler first allocates work in response to the host's fetch request, and it creates ghost tasks in the database. Remember that SETI has disabled 'resend lost results', so your initial patch is irrelevant there.

The two SETI apps have a runtime differential of at least 30:1, so it is natural that some users will want to pick and choose by deliberately naming only one or the other in an app_info file. That's the original semantic definition for anonymous platform, and I think it's still correct. The problem is that, at the moment, the server application isn't doing what the manual says it should.

comment:6 in reply to:  5 Changed 16 years ago by Juha

Sorry for my earlier rather brief post - I'm constantly surprised that people can't read my mind :)

Replying to Richard Haselgrove:

Slight confusion here - I'm just the observer and reporter: the coding patch is by Juha. I think I'm right in saying that he has a full test BOINC server updated to changeset 15755 (i.e. after your initial patch, and after DaveA's revision at changeset 15753) running in a VM, and he's saying that because his only available application (uppercase) isn't listed in app_info.xml, he's generating the same error messages as we see when S5R4 or Astropulse is missing from Einstein/SETI app_info. As to how the patch works - that's way above my payscale. You'll have to ask him.

That's pretty much correct. The VM is the one from The BOINC server virtual machine page. It has been upgraded a few times, and just before I started testing I again grabbed the newest BOINC source code, compiled and installed it.

The reason I had only one application was to keep things simple. The bug is the same no matter how many applications the server has. Since I had only one application, I had to put incorrect information in app_info.xml. Had I had two or more applications, I could have listed only one of them in app_info.xml. Also, it has been reported at SETI that clients that have only the old app (pre-_enhanced) are getting work for _enhanced and astropulse even though SETI doesn't have the old application any more (at least it's not listed in apps.php). See Joe's post in the email thread you linked to.

So, what is the bug? get_app_version() has the wrong return value for the first workunit of each of the applications the server has when the client uses anonymous platform and doesn't have an application for the workunit.

// return BEST_APP_VERSION for the given host, or NULL if none
//
//
BEST_APP_VERSION* get_app_version(
    SCHEDULER_REQUEST& sreq, SCHEDULER_REPLY& reply, WORKUNIT& wu
)

So, get_app_version() should return NULL when the client doesn't have an application for the workunit. Instead, it incorrectly returns an instance of BEST_APP_VERSION. This makes scan_work_array() (or any other function that uses get_app_version()) think the client can crunch the workunit, so the workunit is sent to the client. That results in the client error messages Richard reported and either a 'ghost' workunit or, with resend_lost_result, the server sending the workunit to the client again and again.

For the second, third, etc. workunit of each application, get_app_version() has the correct return value. Why?

    bavp = new BEST_APP_VERSION;
    bavp->appid = wu.appid;
    if (anonymous(sreq.platforms.list[0])) {
        found = sreq.has_version(*app);
        if (!found) {
            if (config.debug_send) {
                log_messages.printf(MSG_DEBUG,
                    "Didn't find anonymous platform app for %s\n", app->name
                );
            }
            bavp->avp = 0;
        } else {
            // snip this part, it works
        }
        reply.wreq.best_app_versions.push_back(bavp);
        return bavp;
    }

The first workunit adds an entry to reply.wreq.best_app_versions with bavp->avp == NULL (and returns an instance of BEST_APP_VERSION, as you can see). Now that reply.wreq.best_app_versions has an entry for the application, the first part

    for (i=0; i<reply.wreq.best_app_versions.size(); i++) {
        bavp = reply.wreq.best_app_versions[i];
        if (bavp->appid == wu.appid) {
            if (!bavp->avp) return NULL;
            return bavp;
        }
    }

of get_app_version() becomes active. That part correctly returns NULL if the client doesn't have an application for the workunit.

OK, then my patch. All it does is make get_app_version() return NULL for the first workunit when bavp->avp == NULL, that is, when the client doesn't have an application for the workunit.
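The caching behaviour described above can be reproduced in isolation. The following is only a minimal sketch, not the scheduler code: BestAppVersion, Reply, the has_version flag (standing in for bavp->avp != NULL), the client_appids list (standing in for the apps named in app_info.xml) and the patched switch are all simplified, hypothetical stand-ins introduced for illustration.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-in for BEST_APP_VERSION.
struct BestAppVersion {
    int appid;
    bool has_version;  // stands in for bavp->avp != NULL
};

// Hypothetical stand-in for reply.wreq; it only holds the per-request cache.
struct Reply {
    std::vector<std::unique_ptr<BestAppVersion>> best_app_versions;
};

// Sketch of the anonymous-platform path of get_app_version().
// client_appids models the apps listed in the client's app_info.xml.
// patched == false models the behaviour before the proposed one-line patch.
BestAppVersion* get_app_version(
    Reply& reply, const std::vector<int>& client_appids,
    int wu_appid, bool patched
) {
    // First part: consult the cache. This path already returns NULL
    // when the client has no version for the app.
    for (auto& cached : reply.best_app_versions) {
        if (cached->appid == wu_appid) {
            return cached->has_version ? cached.get() : nullptr;
        }
    }

    // Second part: first workunit seen for this app in this request.
    bool found = std::find(client_appids.begin(), client_appids.end(),
                           wu_appid) != client_appids.end();
    reply.best_app_versions.push_back(std::unique_ptr<BestAppVersion>(
        new BestAppVersion{wu_appid, found}));
    BestAppVersion* bavp = reply.best_app_versions.back().get();

    // The proposed patch: do not hand back an entry without a usable version.
    if (patched && !bavp->has_version) return nullptr;
    return bavp;
}
```

With patched set to false, the first call for an app the client lacks returns a non-null pointer while the second call returns a null pointer, which matches the "first workunit only" behaviour. With patched set to true, both calls return a null pointer, and the cache entry is still recorded so later workunits take the fast path.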

I apologize for all confusion I have caused and hope this clears things up.

Richard Haselgrove said:

However, this is NOT the resolution to the problem. The problem is not limited to resends: it happens when the scheduler first allocates work in response to the host's fetch request, and it creates ghost tasks in the database. Remember that SETI has disabled 'resend lost results', so your initial patch is irrelevant there.

And then there are those users who don't check BOINC log messages and don't check the forums, which means more 'ghost' workunits.

comment:7 Changed 16 years ago by davea

Resolution: fixed
Status: reopened → closed

(In [15765]) - scheduler: fix bug that caused jobs to be sent to clients using anonymous platform even if they don't have the necessary app version. Also, send an explanatory message in this case. Fixes #713.
