Opened 18 years ago
Last modified 12 years ago
#139 reopened Enhancement
[PATCH] Project-by-project network disable (similar to communications deferred)
Reported by: | MikeMarsUK | Owned by: | davea |
---|---|---|---|
Priority: | Major | Milestone: | Undetermined |
Component: | Manager | Version: | |
Keywords: | patch | Cc: | Pepo |
Description
Hi,
One thing which would be useful would be a way for the user to prevent communications to one particular project at a time (for example, when there are server troubles). If the existing 'communications deferred' facility could be extended to allow the user to defer communications themselves for a period of time, that would be sufficient.
Perhaps the facility should take into account the deadlines of any work units in progress.
This request is prompted by the recent server-out-of-space problems at CPDN, which are causing problems for multi-project users on dial-up networks (other ideas have also been suggested to handle the same problem).
-Cheers,
Mike
Attachments (3)
Change History (22)
comment:1 Changed 17 years ago by
Component: | Client - Scheduler Policy → Client - Manager |
---|---|
Owner: | changed from davea to romw |
Priority: | Undetermined → Major |
comment:2 Changed 17 years ago by
I agree with Mike. When the CPDN servers could not accept intermediate or final file uploads for a recent prolonged period, some multi-project crunchers who followed our advice to suspend BOINC network activity while the problem lasted ran out of work on other projects. I think this problem arose whether members had dial-up or broadband.
Mo
comment:3 Changed 17 years ago by
Owner: | changed from romw to davea |
---|
comment:4 Changed 15 years ago by
Resolution: | → wontfix |
---|---|
Status: | new → closed |
probably not worth the trouble
comment:5 Changed 15 years ago by
Ironic that this was closed on the same day that the CPDN administrator wrote:
"cpdn-upload1.comlab is shown as down at the moment. The server is running but it's shut down apache as the data partition is full. I've got nowhere else to put the data at the moment so this may well cause a problem for hadam3p uploads until I can obtain more hardware."
comment:6 Changed 15 years ago by
Resolution: | wontfix |
---|---|
Status: | closed → reopened |
I think such a facility would be worth the trouble and not just for members of CPDN. The disk of the CPDN server that takes file uploads has filled up more than once and I would be surprised if this has never occurred on any other project. File upload servers of course also occasionally crash or fail.
When this has happened we have advised members to either suspend BOINC network activity altogether or suspend tasks before they complete. Not all members read this advice and some do not realise in time. They can find themselves with partially-uploaded files which I consider to be in a fragile state. Such files can be the end result of long periods of processing.
If members in this position urgently need to fetch work to keep the computer busy, the only possibilities at the moment are either to suspend network activity altogether, letting at least part of the computer's processing capability run idle, or to allow BOINC network activity and put partially-uploaded files at greater risk.
Now that CUDA devices and multi-core computers are increasingly common, requiring large amounts of work to keep all cores busy, I consider MikeMarsUK's proposal even more desirable than it was two years ago.
I am therefore taking the liberty of reopening this ticket in the hope that Mike's proposal will be reconsidered.
comment:7 Changed 15 years ago by
I'm not sure what you mean by "fragile state". BOINC has a mechanism for backing off and retrying file transfers. If this mechanism doesn't work we need to fix it. We can't rely on user actions to provide reliability.
comment:8 follow-up: 12 Changed 15 years ago by
I consider files waiting for transfer, whether already partially uploaded or not, to be in a fragile state for several reasons:
- the moment a transfer is attempted files are on a 14-day countdown to abandonment by BOINC. An extension of this period was requested on one of the BOINC mailing lists but did not meet with universal approval, it being thought that two weeks is plenty of time to obtain funding for a new server, order it, take delivery and install it
- many users with files already waiting for transfer give priority to fetching new work and leave network activity enabled. This causes multiple failed upload attempts of the untransferable files. Every failed attempt increases the likelihood of file corruption
- the greatest risk to waiting untransferable files is probably user action. Because the only apparent solution within the BOINC Manager Transfers tab is the Retry button, users may attempt this repeatedly. When this does not work users may resort to increasingly desperate attempts, for example repeatedly disallowing and reallowing BOINC network activity. (I have seen this action cause BOINC to abandon a file.)
It would therefore be helpful if users had a button within the Transfers tab to suspend transfers either of selected files or to a specific project. It would allow users to take a safe precaution, in many cases avoiding all the risks I have outlined. Some users, having taken safe action, would be spared some worry, and I believe that the proportion of successfully uploaded results would increase.
Milo, the CPDN programmer who usually looks after the servers, said yesterday after reading this ticket 'It sounds like it would be very useful, particularly now.'
comment:9 Changed 15 years ago by
Users at SETI have now also realised how useful this would be, during their current server outage:
http://setiathome.berkeley.edu/forum_thread.php?id=54188&nowrap=true#908786
comment:10 Changed 15 years ago by
I have implemented the requested functionality, tested by a number of users over the past 2 months.
The changes allow networking to be suspended and resumed for selected projects, adding a new "Suspend network"/"Resume network" button to BOINC Manager's Projects tab.
When project networking is suspended, any in-progress uploads will have their timers reset, and uploads will not be restarted until project networking is resumed. No scheduler requests will be made, but any pending downloads for the project will be completed. The project's status will be displayed as "Network activity suspended by user".
If a network suspended project generates a trickle-up this will be shown in the project's status message as "Network activity suspended by user, Trickle upload pending".
A scheduler request can be forced at any time by clicking the Update button. That will send any pending trickle-up messages and (if required) request new work for the project. If new tasks are allocated any required downloads will be made automatically without the need to enable project networking.
The status message on the Tasks tab for completed tasks which haven't been uploaded will be "Uploading, project networking suspended".
The status message on the Transfers tab for blocked uploads will be "Upload pending, project networking suspended".
When project networking is resumed any blocked uploads will be started.
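The behaviour described above can be summarised as a small state model. This is only an illustrative sketch, not code from the patch: the class and method names are hypothetical, the "Uploading" string for the non-suspended case is an assumption, and only the other status strings are taken from the description.

```python
from dataclasses import dataclass


@dataclass
class ProjectNetworkState:
    """Hypothetical per-project state; not a real BOINC class."""
    network_suspended: bool = False
    trickle_up_pending: bool = False

    def project_status(self) -> str:
        # Status text shown for the project on the Projects tab.
        if not self.network_suspended:
            return ""
        status = "Network activity suspended by user"
        if self.trickle_up_pending:
            status += ", Trickle upload pending"
        return status

    def may_start_upload(self) -> bool:
        # Uploads stay blocked until project networking is resumed.
        return not self.network_suspended

    def may_start_download(self) -> bool:
        # Pending downloads complete even while networking is suspended.
        return True

    def may_contact_scheduler(self, update_clicked: bool = False) -> bool:
        # Clicking Update forces a scheduler request regardless of suspension.
        return update_clicked or not self.network_suspended

    def upload_status(self) -> str:
        # Status text for an upload on the Transfers tab.
        if self.network_suspended:
            return "Upload pending, project networking suspended"
        return "Uploading"  # assumed wording for the normal case
```

For example, a suspended project with a pending trickle-up reports "Network activity suspended by user, Trickle upload pending", refuses new uploads, and still allows a scheduler request when Update is clicked.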
I have patches (at revision 18840) available for boinc_core_release_6_6a, boinc_core_release_6_8 and boinc_trunk but the attachment option seems to be disabled at the moment.
comment:11 Changed 15 years ago by
Thank you, Thyme, and also to your testers.
I assume that for the time being this is only available in your own private build.
Changed 15 years ago by
Attachment: | trac139_trunk.patch added |
---|
Patch for boinc_trunk (revision 18844)
Changed 15 years ago by
Attachment: | trac139_6_6a.patch added |
---|
Patch for boinc_core_release_6_6a (revision 18844)
Changed 15 years ago by
Attachment: | trac139_6_8.patch added |
---|
Patch for boinc_core_release_6_8 (revision 18844)
comment:12 Changed 15 years ago by
Replying to mo.v:
- many users with files already waiting for transfer give priority to fetching new work and leave network activity enabled. This causes multiple failed upload attempts of the untransferable files. Every failed attempt increases the likelihood of file corruption
How could a file get corrupted by just trying to upload it too many times?
- the greatest risk to waiting untransferable files is probably user action. Because the only apparent solution within the BOINC Manager Transfers tab is the Retry button, users may attempt this repeatedly. When this does not work users may resort to increasingly desperate attempts, for example repeatedly disallowing and reallowing BOINC network activity. (I have seen this action cause BOINC to abandon a file.)
BOINC abandons files if they spend more than 14 days trying to upload, but there is no limit for the number of retries (I thought there was, but checked code and there isn't).
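To make the distinction concrete, here is a minimal sketch of a retry policy that gives up on elapsed time rather than on retry count. It is not BOINC's code: the function names and the delay bounds (one minute to four hours) are assumptions chosen for illustration; only the 14-day figure comes from this thread.

```python
GIVE_UP_SECONDS = 14 * 24 * 3600  # 14-day limit mentioned in this thread


def next_backoff(failures: int,
                 min_delay: float = 60.0,
                 max_delay: float = 4 * 3600.0) -> float:
    """Delay before the next retry: doubles per failure, capped at max_delay.

    The bounds are illustrative assumptions, not BOINC's actual values.
    """
    return min(max_delay, min_delay * 2 ** max(0, failures - 1))


def should_abandon(first_attempt: float, now: float,
                   give_up: float = GIVE_UP_SECONDS) -> bool:
    """Abandon based only on elapsed time; the retry count is never checked."""
    return now - first_attempt > give_up
```

Under this policy, any number of failed retries leaves the file in place; only the time since the first attempt decides whether it is abandoned.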
comment:13 Changed 14 years ago by
Cc: | Pepo added |
---|
comment:14 Changed 14 years ago by
Keywords: | patch added |
---|---|
Summary: | Project-by-project network disable (similar to communications deferred) → [PATCH] Project-by-project network disable (similar to communications deferred) |
comment:15 Changed 14 years ago by
This ticket is now 3 years old. I am sure that many members of more than one project would still at times find this option useful. It would be nice to know whether Thyme Lawn's patch would still work on new versions of BOINC, or whether there are any reasons why this extra function would not be advisable.
Thank you
Mo
comment:16 follow-up: 18 Changed 14 years ago by
My preference is to make BOINC do the right thing automatically rather than add new manual controls.
If there are problems with server communication (file transfers or scheduler RPCs) the "network suspend" is supposed to happen automatically. Is this mechanism not working? Do its parameters (e.g. 2-week give-up time on transfers) need to be adjusted?
comment:17 Changed 14 years ago by
The previous 2-week give-up time limit for transfers was changed a while ago to 90 days as a result of Ticket 919 (http://boinc.berkeley.edu/trac/ticket/919), which led to this changeset: http://boinc.berkeley.edu/trac/changeset/18845. I agree that this makes the problem less pressing and hope that 90 days is still the permitted time limit.
It's a while since I had a problem with server communication, but I'm pretty sure that when I had a file that couldn't upload, the BOINC network suspend didn't kick in. I'm sure BOINC network activity remained open, though with the upload backoff.
comment:18 Changed 12 years ago by
Replying to davea:
My preference is to make BOINC do the right thing automatically rather than add new manual controls.
If there are problems with server communication (file transfers or scheduler RPCs) the "network suspend" is supposed to happen automatically. Is this mechanism not working?
If you mean the upload backoff timer, yes, that mechanism is working correctly. What it doesn't cater for is the bandwidth wasted when uploads repeatedly fail at 100% (as has been happening recently with some CPDN uploads).
For "BOINC to do the right thing automatically" perhaps scheduler replies should be including a status indication for each of the project's upload and download servers. That would allow clients to automatically suspend a file transfer when none of the servers the file could be sent to/received from is running. When file transfers are suspended by that mechanism the client would have to make periodic scheduler requests to check server availability.
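A rough sketch of the client-side decision this would enable is shown below. Everything in it is hypothetical: the scheduler reply does not currently carry such a field, and the function name, the status mapping, and the choice to treat unreported servers as available are all assumptions.

```python
def transfer_should_wait(candidate_servers, server_status):
    """Return True only if every server the file could use is reported down.

    candidate_servers: server names the file could be sent to or fetched from.
    server_status: hypothetical mapping from server name to True (running)
    or False (down), as it might be taken from a scheduler reply.
    Servers missing from the mapping are assumed to be running.
    """
    if not candidate_servers:
        return False
    return all(server_status.get(name) is False for name in candidate_servers)
```

With a status of `{"upload1": False, "upload2": True}`, a file that can only go to `upload1` would wait, while a file that can also go to `upload2` would proceed.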
comment:19 Changed 12 years ago by
To upload a file, the client already sends two requests; the second is where the file data is actually sent. Thus the CPDN problem, where the client sends the whole file and only then gets an error, can be fixed entirely server-side by making the first request return an error if the server has any problems.
This has also been suggested on a per file basis.
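The server-side fix can be sketched as a toy model of the two-request exchange. This is not the real upload handler: the class, its methods, and the free-space check are hypothetical stand-ins for "the server has any problems".

```python
class ToyUploadServer:
    """Hypothetical stand-in for an upload server with limited disk space."""

    def __init__(self, free_bytes: int):
        self.free_bytes = free_bytes
        self.bytes_received = 0

    def first_request(self, file_size: int) -> bool:
        # Proposed behaviour: report problems here, before any data is sent.
        return file_size <= self.free_bytes

    def second_request(self, data: bytes) -> None:
        # The file data is only transferred in the second request.
        self.bytes_received += len(data)
        self.free_bytes -= len(data)


def upload(server: ToyUploadServer, data: bytes) -> int:
    """Return the number of bytes actually sent over the network."""
    if not server.first_request(len(data)):
        return 0  # error reported up front, so no bandwidth is wasted
    server.second_request(data)
    return len(data)
```

In this model a server with a full data partition rejects the first request, so the client never sends the file and can simply back off and retry later.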