NPE in IndexedXMLObjectChildrenList
Description
Environment
Activity
Brent Putman March 4, 2021 at 5:04 AM
Huge thanks for that, it's a big help.
Unfortunately the lack of any future errors won't necessarily mean it's fixed (maybe statistically suggests it), but if you do see continued errors we'll positively know it's not fixed.
Scott Cantor March 4, 2021 at 1:15 AM
I've deployed a patched opensaml-core to production tonight.
Brent Putman March 3, 2021 at 6:53 PMEdited
That might be true in a general case, but here I don't think it's an issue. It's because the critical section is just the classic "check map and create key/value if it doesn't exist". So it (logically) should actually only happen once for a given key, the first time something requests that QName index. It should be ok now, since that's guarded and so (physically) it will absolutely only happen once - only 1 thread can now eval "!objectIndex.containsKey(index)" as true.
And then the code below checkAndCreateIndex
doesn't mutate anything, just reading the map, which is now guaranteed to be in a consistent state.
My read of the race is: The actual error was an NPE in ListView#isEmpty()
. That must be because the view's indexList
member is null. That means that in the view's ctor the "indexList = backingList.get(index)"
must have assigned null. And I think that could happen if one thread was partially through checkAndCreateIndex
such that a second thread evaled the "!objectIndex.containsKey(index)" as false, fell through and later (before the first thread finished the critical section) called the map get() before the value was actually available from the map.
I'm pretty sure that can happen since - and I always forget this - the plain Java collection types (at least the hashtable-based ones) are NOT thread-safe for mutation. That is in fact the reason for the existence for the ConcurrentHashMap, -HashSet
impls - to at least allow the collection to be structurally correct and not corrupted, etc in the fact of concurrent mods. Esp important since those will internally re-hash and re-structure themselves based on the capacity algorithm, when you add new items. (They don't of course help in this kind of logical concurrency issue of "check and create if doesn't exist", because that involves 2+ calls to the collection).
Scott Cantor March 3, 2021 at 1:24 PM
I haven't exactly figured out what the actual race is, but my thought was that it might be a problem to have one thread reading the collection below the call to checkAndCreateIndex
at the same time something else is inside it.
Since both threads have to go through that initial step, I can imagine it's fine, but since I don't quite understand what's happening now I'm being paranoid.
I'll try and get a patched version in but I really haven't seen the issue often, so it probably won't prove it's fixed unfortunately.
Brent Putman March 3, 2021 at 4:17 AM
Yes, but simply reading the map in multiple threads doesn't cause an issue, I think. The initial setup and mutation of the index map for a given QName key is now governed by the mutex, which is where I think the problem is. If you see otherwise, let me know. Always good to have multiple sets of eyes on concurrency problems (semi-pun intended).
Yeah, if you could cherry pick this and test in your production IdP that would be awesome.
We're seeing around a dozen or two of these errors each day in production:
ERROR [net.shibboleth.idp.saml.profile:-2] - Uncaught runtime exception java.lang.NullPointerException: null at org.opensaml.core.xml.util.ListView.isEmpty(IndexedXMLObjectChildrenList.java:332)
Unfortunately this is only happening in production so I haven't been able to turn on DEBUG logging to troubleshoot further.
Please let me know if I can provide additional information.