#897 closed Defect (fixed)
CUDA tasks error if start delayed by slow CUDA preemption
Reported by: | Richard Haselgrove | Owned by: | davea |
---|---|---|---|
Priority: | Major | Milestone: | 6.8 |
Component: | Client - Scheduler Policy | Version: | 6.6.28 |
Keywords: | Cc: |
Description
Description: If a CUDA task is scheduled to run immediately, an existing CUDA task may be preempted to make way for it. A delay has been introduced to allow the preempted task to exit fully and release its allocated memory. If this delay is invoked, the scheduled task is eventually started with a "Resume" status instead of an "Initial" status, and fails because its data files are not available.
Reproducibility: Easy. For SETI: manually suspend the next task due to run (FIFO). Allow the current task to finish, and let the following task (the one after the suspended task) start and get well into CUDA processing. Then resume the suspended task. It will be scheduled immediately, and will fail with "SETI@home error -5 Can't open file (work_unit.sah) in read_wu_state() errno=2" if the preempted task takes too long to clean up. As in:
10-May-2009 19:13:07 [SETI@home] task 27fe09ac.24805.3754.6.8.29_1 resumed by user
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] scheduling 27fe09ac.24805.3754.6.8.29_1 (coprocessor job, FIFO)
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 0: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 4: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] scheduling 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 27fe09ac.24805.3754.6.8.29_1 sched state 0 next 2 task state 0
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 0: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 4: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] scheduling 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 27fe09ac.24805.3754.6.8.29_1 sched state 2 next 2 task state 0
10-May-2009 19:13:08 [SETI@home] [cpu_sched_debug] scheduling 27fe09ac.24805.3754.6.8.29_1 (coprocessor job, FIFO)
10-May-2009 19:13:08 [SETI@home] [cpu_sched_debug] 0: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:08 [SETI@home] [cpu_sched_debug] 4: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:08 [SETI@home] [cpu_sched_debug] scheduling 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:08 [SETI@home] [cpu_sched_debug] 27fe09ac.24805.3754.6.8.29_1 sched state 2 next 2 task state 0
10-May-2009 19:13:08 [SETI@home] [cpu_sched] Starting 27fe09ac.24805.3754.6.8.29_1(resume)
10-May-2009 19:13:08 [SETI@home] Restarting task 27fe09ac.24805.3754.6.8.29_1 using setiathome_enhanced version 608
10-May-2009 19:13:10 [SETI@home] Computation for task 27fe09ac.24805.3754.6.8.29_1 finished
10-May-2009 19:13:10 [SETI@home] Output file 27fe09ac.24805.3754.6.8.29_1_0 for task 27fe09ac.24805.3754.6.8.29_1 absent
Suspect code area: Changeset 17797 for cpu_sched.cpp. The pre-existing code assumes that "sched state 0 next 2" holds for every newly scheduled task at the moment it is started. Changeset 17797 violates this assumption: tasks that are scheduled, but have their startup delayed, should remain in state 0 for the next attempt.
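To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It is not the BOINC client code: `SchedState`, `Task`, `enforce_pass`, `simulate` and the `keep_state_when_deferred` switch are hypothetical names invented for illustration, and the state numbers are simply the values printed as "sched state" in the log above. The assumption being modelled is that the "initial" versus "resume" decision is taken from the scheduler state, and that a deferred start has already moved that state from 0 to 2.

```cpp
#include <cassert>
#include <string>

// Hypothetical scheduler states; the numbers mirror the "sched state" log values.
enum SchedState { SCHED_UNINITIALIZED = 0, SCHED_PREEMPTED = 1, SCHED_SCHEDULED = 2 };

// Hypothetical, simplified stand-in for a task record.
struct Task {
    SchedState sched_state = SCHED_UNINITIALIZED;
    SchedState next_state = SCHED_UNINITIALIZED;
    bool process_started = false;
    std::string last_start_mode;  // "initial" or "resume"; empty until started
};

// One scheduler pass for a task that has been chosen to run.
// gpu_busy models the preempted CUDA task not yet having exited.
// keep_state_when_deferred selects the behaviour the ticket asks for.
void enforce_pass(Task& t, bool gpu_busy, bool keep_state_when_deferred) {
    t.next_state = SCHED_SCHEDULED;
    if (t.process_started) return;

    // The start mode is derived from the scheduler state, not from
    // whether the process has ever actually run.
    bool first_time = (t.sched_state == SCHED_UNINITIALIZED);

    if (gpu_busy) {
        // Startup is deferred. Advancing the state here is the suspected bug:
        // the next pass no longer sees "sched state 0".
        if (!keep_state_when_deferred) t.sched_state = SCHED_SCHEDULED;
        return;
    }

    t.last_start_mode = first_time ? "initial" : "resume";
    t.process_started = true;
    t.sched_state = SCHED_SCHEDULED;
}

// Run a number of deferred passes, then one pass with the GPU free,
// and report how the task was finally started.
std::string simulate(bool keep_state_when_deferred, int deferred_passes) {
    assert(deferred_passes >= 0);
    Task t;
    for (int i = 0; i < deferred_passes; ++i) {
        enforce_pass(t, true, keep_state_when_deferred);
    }
    enforce_pass(t, false, keep_state_when_deferred);
    return t.last_start_mode;
}
```

In this model, `simulate(false, 2)` ends with a "resume" start after two deferred passes, matching the "sched state 0 next 2" then "sched state 2 next 2 task state 0" sequence in the log, whereas `simulate(true, 2)` keeps the task in state 0 and ends with an "initial" start. An alternative design would be to key the decision on the task (process) state, which stays 0 in the log until the process really starts.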
Reporting history
On boinc_alpha mailing list
10 May 2009 "CUDA: computation error on restart" (with log)
Comment: This may be the same as Trac #890, but there isn't enough detail in that ticket to be certain.
Change History (3)
comment:1 Changed 16 years ago by
comment:2 Changed 16 years ago by
changeset:18217 appears to address this issue.
Awaiting new client to test before closing ticket.
comment:3 Changed 15 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Aha. The SETI replica database has now caught up to the extent that I can display the errored task I sacrificed in the cause of testing the 'Reproducibility' recipe.
http://setiathome.berkeley.edu/result.php?resultid=1232092555