[WIP] faster group membership resolution in AD
This is a follow up to !215 (closed) after recent versions of GitLab changed the way LDAP group sync works.
The new implementation is still not working acceptably in our environment. The LDAP query used to recursively evaluate all group members takes 70-80 seconds per group (it seems to increase with the total number of user accounts in the Active Directory). In any case, it never completes within the default 10s timeout, and extending the timeout would result in extremely slow LDAP group sync.
This MR proposes an alternative method to resolve group membership, which is much faster in large AD environments (but probably a bit slower in small ones). The method is to recursively walk the nested groups. In our environment, with this patch GitLab can sync 1000 LdapGroupLinks in about 300 sec, i.e. 250 times faster.
In its current state, it does not behave exactly as the upstream implementation: nested groups outside the group_base path are currently ignored, but this is just a matter of search scope. Feedback on the approach would be appreciated.
Merge request reports
Activity
mentioned in merge request !215 (closed)
Thanks @cernvcs We've been discussing changing the recursive lookup. We'll review it as soon as possible.
@cernvcs @dblessing maybe it would be good if we have a call together, I find myself writing too many comments. I like what this MR does but I have a strong feeling we can make it even better by slightly changing how it does what it does.
Edited by Jacob Vosmaer (GitLab)@cernvcs this is the issue where @dblessing and I were thinking out loud about doing what you just did here (minus ranged member retrieval). https://gitlab.com/gitlab-org/gitlab-ee/issues/593
The reason I suggest having a call is that I have difficulty understanding what some of this code does. And if it is difficult for me, having 3 years experience with the EE LDAP code, then it will probably also be challenging for most other people on our team.
A few weeks ago we were reminded what can go wrong when code gets too hard to understand, when it turned out that in some cases LDAP group sync started incorrectly adding the extra LDAP users to GitLab groups. This forced us to do an 8.7 security patch release.
My conclusion from that incident is that we should hold the Ruby code in lib/gitlab/ldap to an extra high standard of intelligibility. But me spewing comments all over this MR is probably not the best way to make it better. This is why I suggest having a call.
Thanks @jacobvosmaer-gitlab for the review
I fully agree that this is not finalized code, even less ready for long-term maintenance, and it may not apply to the general use case in its current state; the main idea with this MR was to share how we quickly patched the issue internally, in case some of what we did could be useful. I'm glad to hear that you were already considering something in the same direction.
I'm happy to participate in a call if it can help, please feel free to contact me directly on the notification email address for this account to organize this.
Thanks @cernvcs I have sent an email to you and @dblessing
Call notes (present: @dblessing, @jacobvosmaer-gitlab , @cernvcs (Alexandre Lossent, Borja Aparicio)) :
- server side recursive lookup took >1 min per group, too slow
- AD specialist suggested doing recursive group lookup client side
- MR also adds range retrieval needed for >1500 members in group
- when the group has >1500 members the normal member atribute will be reported empty (!)
- with server-side recursive lookup, automatic result pagination was (probably) hiding the 'group is too large' problem. Client-side recursive lookup forces us to implement attribute range retrieval.
two challenges about client-side recursive lookups:
- distinguish users from groups
- avoid memberships loops => not hard but subtle; range retrieval may also have repetitions
To distinguish users from groups CERN used GitLab's group_base. This could cause a change in behavior if current EE installations use nested groups where some subgroup is outside of the group_base.
CERN also relies on Net::LDAP::Filter.eq("objectClass", "group"). After the call Drew points out: we may also have to look for the 'groupOfNames' class.
It is OK if we reimplement rather than refine CERN's MR. The final performance is what matters to them.
Other requests: can we alleviate confusion for new GitLab users who do not have access to anything the first time they log in.
Jacob suggests: notice for new users that it may take X minutes for their projects/groups to show up.
Jacob points out that maintaining a separate group membership algorithm has a high complexity cost and may show inconsistent results. Still, CERN argues, it might be worth to fix this because it is such a crucial part of what GitLab EE is about. Signing in for the first time and seeing no groups/projects is not a good first impression, also from a marketing viewpoint.
Other request: make group batch sync frequency configurable. Drew: we are thinking about this, it needs to be tweaked at several levels (sidekiq-cron, ExclusiveLease).
@cernvcs Thanks for your patience. I'm starting work on this to see if we can get it in for 8.10. I may have some questions or ask you to test things at some point.
Thanks @dblessing
Happy to test.mentioned in commit 8883e292
@jacobvosmaer-gitlab We should consider shipping this essentially as-is for 8.10. A couple of other customers have run in to this issue and patching with these changes has worked for them.
Milestone changed to %8.12
mentioned in merge request !719 (merged)
Mentioned in merge request !719 (merged)