Opened 15 years ago

Closed 15 years ago

Last modified 15 years ago

#897 closed Defect (fixed)

CUDA tasks error if start delayed by slow CUDA preemption

Reported by: Richard Haselgrove
Owned by: davea
Priority: Major
Milestone: 6.8
Component: Client - Scheduler Policy
Version: 6.6.28
Keywords:
Cc:

Description

Description If a CUDA task is scheduled to run immediately, an existing CUDA task may be preempted to make way for it. A delay has been introduced to allow the preempted task to exit fully and release its allocated memory. If this delay is invoked, the newly scheduled task is eventually started with a "Resume" status instead of an "Initial" status, and fails because the data files are not available.

Reproducibility Easy. For SETI: manually suspend the next task due to run (FIFO). Allow the current task to finish, and let the task after the suspended one start and get well into CUDA processing. Then resume the intermediate (suspended) task. It is scheduled immediately and, if the preempted task takes too long to clean up, fails with "SETI@home error -5 Can't open file (work_unit.sah) in read_wu_state() errno=2". For example:

10-May-2009 19:13:07 [SETI@home] task 27fe09ac.24805.3754.6.8.29_1 resumed by user
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] scheduling 27fe09ac.24805.3754.6.8.29_1 (coprocessor job, FIFO)
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 0: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 4: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] scheduling 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 27fe09ac.24805.3754.6.8.29_1 sched state 0 next 2 task state 0
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 0: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 4: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] scheduling 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:07 [SETI@home] [cpu_sched_debug] 27fe09ac.24805.3754.6.8.29_1 sched state 2 next 2 task state 0
10-May-2009 19:13:08 [SETI@home] [cpu_sched_debug] scheduling 27fe09ac.24805.3754.6.8.29_1 (coprocessor job, FIFO)
10-May-2009 19:13:08 [SETI@home] [cpu_sched_debug] 0: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:08 [SETI@home] [cpu_sched_debug] 4: 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:08 [SETI@home] [cpu_sched_debug] scheduling 27fe09ac.24805.3754.6.8.29_1
10-May-2009 19:13:08 [SETI@home] [cpu_sched_debug] 27fe09ac.24805.3754.6.8.29_1 sched state 2 next 2 task state 0
10-May-2009 19:13:08 [SETI@home] [cpu_sched] Starting 27fe09ac.24805.3754.6.8.29_1(resume)
10-May-2009 19:13:08 [SETI@home] Restarting task 27fe09ac.24805.3754.6.8.29_1 using setiathome_enhanced version 608
10-May-2009 19:13:10 [SETI@home] Computation for task 27fe09ac.24805.3754.6.8.29_1 finished
10-May-2009 19:13:10 [SETI@home] Output file 27fe09ac.24805.3754.6.8.29_1_0 for task 27fe09ac.24805.3754.6.8.29_1 absent
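
The state transitions in a log like the one above can be pulled out mechanically. The following is only a minimal sketch for reading such logs, not part of the BOINC client: `SchedLine` and `parse_sched_line` are hypothetical names, and the line layout is assumed to be exactly the "sched state N next N task state N" form shown in the [cpu_sched_debug] output.

```cpp
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical holder for the three numbers printed on a
// "sched state A next B task state C" debug line.
struct SchedLine {
    int sched_state = 0;
    int next_state = 0;
    int task_state = 0;
};

// Returns true and fills `out` if `line` contains a state triple
// in the format seen in the log above; otherwise leaves `out` untouched.
bool parse_sched_line(const std::string& line, SchedLine& out) {
    const char* p = std::strstr(line.c_str(), "sched state ");
    if (p == nullptr) {
        return false;
    }
    int sched = 0, next = 0, task = 0;
    if (std::sscanf(p, "sched state %d next %d task state %d",
                    &sched, &next, &task) != 3) {
        return false;
    }
    out.sched_state = sched;
    out.next_state = next;
    out.task_state = task;
    return true;
}
```

Applied to the log above, the two matching lines give "0 next 2" at 19:13:07 and "2 next 2" on the following passes, with the task state still 0 each time; the "Starting ... (resume)" line comes only after that.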

Suspect code area Changeset 17797 for cpu_sched.cpp. The pre-existing code assumes that "sched state 0 next 2" will hold for every newly scheduled task at the moment it is started. Changeset 17797 violates this assumption: a task that is scheduled, but has its startup delayed, should remain in state 0 until the next attempt.
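
To make the suspected sequence concrete, here is a small model of it. This is a sketch under stated assumptions, not the code in cpu_sched.cpp: every name in it (`SchedState`, `Task`, `StartMode`, `enforce_pass`, `keep_state_on_delay`) is hypothetical, and only the numeric values 0 and 2 are taken from the "sched state" numbers in the log. It assumes the client picks "Initial" versus "Resume" from whether the recorded state is still 0 when the task is actually launched.

```cpp
#include <cassert>

// Hypothetical names; the values mirror the "sched state" numbers in the log.
enum SchedState { SCHED_UNINITIALIZED = 0, SCHED_SCHEDULED = 2 };

// What the (modelled) client does with the task on a given pass.
enum class StartMode { None, Initial, Resume };

struct Task {
    SchedState sched_state = SCHED_UNINITIALIZED;
    SchedState next_sched_state = SCHED_UNINITIALIZED;
    bool started = false;
};

// One scheduler pass over a task that has been chosen to run.
//   gpu_busy            : the preempted CUDA task has not finished exiting.
//   keep_state_on_delay : false models the reported behaviour,
//                         true models the behaviour the ticket asks for.
StartMode enforce_pass(Task& t, bool gpu_busy, bool keep_state_on_delay) {
    assert(!t.started);  // this model only covers tasks not yet launched
    t.next_sched_state = SCHED_SCHEDULED;

    if (gpu_busy) {
        // Startup is delayed. Recording the task as scheduled here is
        // what makes the later launch look like a resume.
        if (!keep_state_on_delay) {
            t.sched_state = t.next_sched_state;
        }
        return StartMode::None;
    }

    // Assumed rule: a task still in state 0 gets a first-time start.
    StartMode mode = (t.sched_state == SCHED_UNINITIALIZED)
                         ? StartMode::Initial
                         : StartMode::Resume;
    t.sched_state = t.next_sched_state;
    t.started = true;
    return mode;
}
```

In this model, one delayed pass followed by a normal pass yields `StartMode::Resume` when `keep_state_on_delay` is false, matching the "(resume)" start in the log, and `StartMode::Initial` when it is true.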

Reporting history On boinc_alpha mailing list
10 May 2009 "CUDA: computation error on restart" (with log)

Comment This may be the same as trac #890, but there isn't enough detail in that ticket to be certain.

Change History (3)

comment:1 Changed 15 years ago by Richard Haselgrove

Aha. The SETI replica database has now caught up to the extent that I can display the errored task I sacrificed in the cause of testing the 'Reproducibility' recipe.

http://setiathome.berkeley.edu/result.php?resultid=1232092555

comment:2 Changed 15 years ago by Richard Haselgrove

changeset:18217 appears to address this issue.

Awaiting new client to test before closing ticket.

comment:3 Changed 15 years ago by romw

Resolution: fixed
Status: new → closed