Support hash-based directory structure in LocalDynamic

Description

Modern filesystems don't have hard performance limits on directory size, but you do run into issues with filesystem commands when the entry count gets really big. So it might be nice to implement a simple mechanism in the LocalDynamic provider to put files into subdirectories based on a hash prefix.
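
A minimal sketch of the idea (names and layout here are hypothetical, just for illustration, not the eventual API): the first few characters of the hashed file name become one or two levels of intermediate directories.

{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;

// Minimal sketch of the idea (hypothetical names, not the eventual API):
// derive intermediate directories from a prefix of the hashed file name,
// e.g. "31a03291....xml" -> sourceDirectory/31/a0/31a03291....xml
public class HashPrefixPathDemo {

    static Path resolve(Path sourceDirectory, String fileName, int segments, int segmentLength) {
        Path dir = sourceDirectory;
        for (int i = 0; i < segments; i++) {
            dir = dir.resolve(fileName.substring(i * segmentLength, (i + 1) * segmentLength));
        }
        return dir.resolve(fileName);
    }

    public static void main(String[] args) {
        System.out.println(resolve(Paths.get("/opt/metadata"),
                "31a03291e813c141d9a16c7a98f7b20b77785510.xml", 2, 2));
        // prints /opt/metadata/31/a0/31a03291e813c141d9a16c7a98f7b20b77785510.xml
    }
}
{code}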

Environment

None

Activity


Brent Putman, November 6, 2020 at 2:49 AM

The IdP parser and schema were harder to work out than I thought, but I finally got something which I think makes sense.

For the sourceDirectory convenience way of configuring the local filesystem resolver, when the default source key generator strategy is used (the piece that determines, say, SHA-1 hashing), there are two new attributes for specifying the intermediate directory segment number and length. If you aren't using the default source key generator strategy, or just want a custom intermediate directory strategy, there's the usual bean ref attribute that overrides the convenience attributes.
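
Roughly, the two attributes would configure something like the following default strategy (class and field names here are hypothetical; the real wiring is in the commits below):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the default strategy the two new attributes configure:
// take N segments of M characters each from the front of the source key.
public class SegmentingIntermediateDirectoryStrategy implements Function<String, List<String>> {

    private final int segmentNumber;  // number of intermediate directory levels
    private final int segmentLength;  // characters per directory name

    public SegmentingIntermediateDirectoryStrategy(final int segmentNumber, final int segmentLength) {
        this.segmentNumber = segmentNumber;
        this.segmentLength = segmentLength;
    }

    @Override
    public List<String> apply(final String key) {
        final List<String> segments = new ArrayList<>();
        for (int i = 0; i < segmentNumber; i++) {
            segments.add(key.substring(i * segmentLength, (i + 1) * segmentLength));
        }
        return segments;
    }
}
{code}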

If/when I get to adding the Base64URL source key strategy support, I'll also add the appropriate intermediate directory strategy impl(s), plus utility bean(s) in the IdP that can be wired in via the new bean ref.

java-opensaml: 5a5ecb18728517ea13f20572b2785c9690a8e4d8
java-identity-provider: 39427322864632ff12dc30bf398f3413a04bb8dc


Brent Putman, July 16, 2020 at 9:36 PM

Having done some work on this, I think it does need to be a pluggable impl in FilesystemLoadSaveManager, for a reason which is obvious in hindsight.

It's because in actual real-world usage of the provider, the only currently extant source key generator for the filesystem case produces the SHA-1 hash of the entityID, and the key then supplied to the manager is "<sha-1>.xml". But the manager doesn't know anything about that, so just taking that string key as input and computing the digest would lead to a completely different directory hash value.

My realization is that people will find it really weird and awkward if the hashed directory structure isn't just the first X chars of the SHA-1 hash of the entityID. I personally would certainly want it to work that way.

So I think it will be pluggable in the load-save manager component. Then in the IdP, for the custom parser and wiring, in the default case of the SHA-1 source key generator you'd wire in an impl that just takes the first X chars of the key as-is.
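
To make the mismatch concrete, here's a quick standalone demonstration (the entityID is made up, illustrative only):

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Standalone demo of the mismatch (illustrative only): the key the manager
// sees is "<sha-1-of-entityID>.xml", so its first chars already are the
// desired prefix; digesting that key string again gives an unrelated value.
public class KeyHashMismatchDemo {

    static String sha1Hex(final String s) throws Exception {
        final StringBuilder hex = new StringBuilder();
        for (final byte b : MessageDigest.getInstance("SHA-1")
                .digest(s.getBytes(StandardCharsets.UTF_8))) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(final String[] args) throws Exception {
        final String key = sha1Hex("https://idp.example.org/idp/shibboleth") + ".xml";
        System.out.println("first chars of key:        " + key.substring(0, 2));
        System.out.println("first chars of SHA-1(key): " + sha1Hex(key).substring(0, 2));
        // The two prefixes will (almost certainly) differ; only the first
        // matches the "first X chars of SHA-1(entityID)" expectation.
    }
}
{code}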

We could have other impls for other source key cases that could be wired in. For example, I plan to support the Base64URL encoding of the entityID (https://shibboleth.atlassian.net/browse/OSJ-207), and there you'd still likely want the directory hashing based on the SHA-1 hash of the entityID. If you just naively took the first X chars of either the encoded or decoded form, you'd mostly get everything clumped in one or two directories, since virtually all entityIDs start with either "http" or "urn". So that impl would decode the Base64URL input and then SHA-1 hash the entityID to get the directory hash.
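
A hypothetical sketch of that impl (none of this code exists yet; names are invented): strip the extension, decode the Base64URL form back to the entityID, then use a prefix of its SHA-1 hex digest.

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.List;

// Hypothetical sketch of a Base64URL-aware strategy (not yet-written code):
// decode the source key back to the entityID, SHA-1 hash that, and use a
// prefix of the hex digest as the intermediate directory name.
public class Base64URLIntermediateDirectoryDemo {

    static List<String> segments(final String key, final int segmentLength) throws Exception {
        final String encoded = key.endsWith(".xml") ? key.substring(0, key.length() - 4) : key;
        final byte[] entityID = Base64.getUrlDecoder().decode(encoded);
        final byte[] digest = MessageDigest.getInstance("SHA-1").digest(entityID);
        final StringBuilder hex = new StringBuilder();
        for (final byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return List.of(hex.substring(0, segmentLength));
    }

    public static void main(final String[] args) throws Exception {
        final String encoded = Base64.getUrlEncoder().withoutPadding()
                .encodeToString("https://idp.example.org/idp/shibboleth"
                        .getBytes(StandardCharsets.UTF_8));
        System.out.println(segments(encoded + ".xml", 2));  // one well-distributed two-char segment
    }
}
{code}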

Ian Young, June 22, 2020 at 9:35 AM

Bias in any subset of the bits in the hash values would be a very bad sign for a cryptographic hash, so I think you're safe to assume there is none. Nothing wrong with actually trying it, of course. 
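
For anyone who does want to try it, a throwaway check along these lines (synthetic entityIDs, illustrative only) would show the spread of the first byte:

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Quick empirical check (illustrative only): bucket a batch of synthetic
// entityIDs by the first byte of their SHA-1 and eyeball the spread.
public class PrefixDistributionCheck {

    public static void main(final String[] args) throws Exception {
        final MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        final int[] buckets = new int[256];
        for (int i = 0; i < 100_000; i++) {
            final byte[] digest = sha1.digest(
                    ("https://idp.example.org/entity/" + i).getBytes(StandardCharsets.UTF_8));
            buckets[digest[0] & 0xff]++;
        }
        int min = Integer.MAX_VALUE, max = 0;
        for (final int count : buckets) {
            min = Math.min(min, count);
            max = Math.max(max, count);
        }
        // With a uniform first byte you'd expect ~390 per bucket (100,000 / 256).
        System.out.println("min=" + min + " max=" + max);
    }
}
{code}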

Scott Cantor, June 19, 2020 at 4:18 PM

When I've seen it done I think it's usually just one level. I'd probably favor that, just to avoid it getting deeper than necessary.

Brent Putman, June 19, 2020 at 4:15 PM

I think maybe my comment about the math wasn't very clear. You get 64K "leaf" directories in total which could contain files, but not 64K in a single directory. You have (potentially) 256 first-level dirs, and in each of those (potentially) 256 second-level dirs. So the max number of dirs you'd see in an 'ls', for example, would be 256.

(I say "potentially" because there is no need to create them all up front.  They can be created lazily only when needed to store a file that hashes to that path).

So assuming an (unlikely) perfect distribution of the hashed files, if you had, say, 64K files, you'd have only 1 file in each second-level dir. If you had 1 million files, it would be 1,000,000 / 65,536 ≈ 15 files per directory.

So having talked that out...maybe we don't need 2 levels - maybe 1 level of 256 is enough?  That would distribute 100,000 files as roughly 390 files per dir.
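
For reference, the same arithmetic as a throwaway snippet:

{code:java}
// Back-of-envelope check of the numbers above (plain arithmetic, integer division).
public class FilesPerDirMath {
    public static void main(final String[] args) {
        final long[] fileCounts = {65_536L, 100_000L, 1_000_000L};
        for (final long files : fileCounts) {
            System.out.printf("%,9d files: 1 level -> %,6d per dir, 2 levels -> %,4d per dir%n",
                    files, files / 256, files / (256L * 256L));
        }
        // 1,000,000 / 65,536 ~= 15 and 100,000 / 256 ~= 390, matching the figures above.
    }
}
{code}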

I guess it all depends on what size dataset we think is actually realistic to expect.

The other thing I personally don't know is the characteristics of the SHA algorithms when you take only some of the bits rather than all of them. I.e., in reality, how "distributed" are the first 8 bits or the first 16 bits? The whole hash value is supposed to be relatively well distributed, but I'm not sure about the mathematical properties of a subset of the bits. Seems like it would be, but this is not my area...

Completed

Details

Created March 9, 2020 at 4:27 PM
Updated November 6, 2020 at 2:57 PM
Resolved November 6, 2020 at 3:06 AM