mod_shib: session cache core dumps on graceful restart
Description
Environment
Shibboleth SP 2.3.1 64-bit
Apache HTTPD 2.2.15 64-bit
Solaris 10 u8 x86
Activity
Scott Cantor December 17, 2010 at 2:39 PM
Closing after release.
Scott Cantor June 8, 2010 at 5:05 PM
No sign of any delays in my testing, so marking fixed.
Please open a new issue if you can provide any additional insight into what you're seeing with the patch (as you said).
Scott Cantor June 8, 2010 at 4:28 PM
I'm attempting to reproduce one or the other condition on an OpenSolaris VM, but the race isn't showing up for me. I'll apply my patch and see if it shows any delays.
If I can't reproduce, I'll mark this issue resolved since the patch brings this code in sync with other areas in the codebase.
Scott Severtson June 8, 2010 at 4:23 PM
Agreed that this issue may not be related to the race condition you've patched. Also agreed that the method of shutdown (restart versus stop) should not make a difference.
I will do some more research on our end, and open another bug once I have more conclusive information. Thanks.
Scott Cantor June 8, 2010 at 2:31 PM
Whether it's graceful or not can't have any impact on it; it's the same shutdown code that runs when the object is destroyed, and it signals the thread at that point.
I can understand the original crash, since that's just a race issue, but a full stop here shouldn't be different than a restart in terms of how it signals the thread to wake up. The same kind of code is used in shibd as well, so shutdown there would also have to hang on other threads.
We're experiencing repeatable core dumps when gracefully restarting Apache on Solaris x86. This prevents HTTPD from restarting cleanly, and disrupts our ability to reconfigure Apache without downtime.
pstack core.httpd.12817
lwp# 1 / thread# 1 --------------------
fffffd7fff06cdea __lwp_wait () + a
fffffd7fff063eee _thrp_join () + 3e
fffffd7fff0640cc pthread_join () + 1c
fffffd7ffe47892d _ZN6shibsp7SSCacheD0Ev () + 8d
fffffd7ffe4db3e6 _ZN56_GLOBAL__N_impl_XMLServiceProvider.cpp_DFF67DD7_CE7B99549XMLConfigD0Ev () + b6
fffffd7ffe4464ea _ZN6shibsp8SPConfig18setServiceProviderEPNS_15ServiceProviderE () + 3a
fffffd7ffe447c78 _ZN6shibsp8SPConfig4termEv () + 78
fffffd7ffec8eedb shib_exit () + 4b
fffffd7fff270d25 apr_pool_destroy () + 65
0000000000470e1b child_main () + 38b
0000000000471137 make_child () + 147
0000000000471703 ap_mpm_run () + 553
000000000042fd81 main () + 8b1
000000000042f08c _start () + 6c
lwp# 2 / thread# 2 --------------------
00000000000d271d ???????? ()
fffffd7fff06b9ef _SUNW_Unwind_ForcedUnwind () + 53
fffffd7ffe6693da _ex_unwind () + 1a
fffffd7fff06b19c _thrp_unwind () + 3c
fffffd7fff063e6c _thr_exit_common () + 9c
fffffd7fff063eae pthread_exit () + e
fffffd7ffe5cd559 ???????? ()
fffffd7ffe47addd _ZN6shibsp7SSCache7cleanupEv () + 1cd
fffffd7ffe47af79 _ZN6shibsp7SSCache10cleanup_fnEPv () + 19
fffffd7fff06727b _thr_setup () + 5b
fffffd7fff0674b0 _lwp_start ()
We get 3 core dumps on every restart, one per Apache worker thread on our test machine. The stack traces for each are consistent. We do not see similar core dumps when performing a stop followed by a start, nor from a non-graceful restart.
The mod_shib log doesn't provide any clues, nor does the HTTPD error log, even when both are set to DEBUG.
Per Scott Cantor:
Based on the trace, it's probably a race condition I've seen in some newer code doing similar cleanup of a background thread that needs to be patched.
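For readers unfamiliar with the pattern under discussion, the following is a minimal sketch of a destructor that signals a background cleanup thread and then joins it. It is not the Shibboleth code or the actual patch: the class `CacheWithCleanup`, its members, and the use of `std::thread` in place of the raw pthread calls seen in the trace are all assumptions made for illustration. The two points it tries to show are that the shutdown flag is set and checked under the same mutex the thread waits on, so a wake-up cannot be missed, and that the thread function ends by returning normally.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a cache that owns a background cleanup thread.
class CacheWithCleanup {
public:
    explicit CacheWithCleanup(std::chrono::milliseconds interval)
        : m_interval(interval), m_thread(&CacheWithCleanup::cleanup, this) {}

    ~CacheWithCleanup() {
        {
            // Set the flag under the mutex so the thread cannot check it,
            // miss the change, and then go back to sleep.
            std::lock_guard<std::mutex> lock(m_mutex);
            m_shutdown = true;
        }
        m_cond.notify_all();  // wake the thread if it is waiting
        m_thread.join();      // block until cleanup() has returned
    }

    // Number of cleanup passes completed so far.
    int sweeps() {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_sweeps;
    }

private:
    void cleanup() {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (!m_shutdown) {
            // Sleep until the interval elapses or shutdown is signaled.
            m_cond.wait_for(lock, m_interval, [this] { return m_shutdown; });
            if (m_shutdown)
                break;
            ++m_sweeps;  // placeholder for purging expired sessions
        }
        // Fall off the end of the function; no explicit thread-exit call.
    }

    std::mutex m_mutex;
    std::condition_variable m_cond;
    bool m_shutdown = false;
    int m_sweeps = 0;
    std::chrono::milliseconds m_interval;
    std::thread m_thread;  // declared last so the members above exist before it starts
};
```

With this shape, destroying the object while the thread is in the middle of a long wait should return promptly, regardless of whether the destruction is triggered by a stop or a restart.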