We need to provide a new column in the database, last_activity_at, linked to a user, showing the last time this user did something with the system, either log in to the UI or do some git action via ssh or http.
That way, administrators of an instance can query the DB to know who's active, and who's not.
@regisF I think we already have most of the data in the "events" table. For this table data older than 12 months is pruned (comes with 8.12).
Login dates are also already provided (users.current_sign_in_at and users.last_sign_in_at) but we may remove some of this information. Right now whenever you log in we update a bunch of columns of the users table and this isn't ideal. We haven't really made any decisions yet though.
Performance wise this shouldn't be too much of a problem.
Random thought here... Are numbers such as 3829 or 193 relevant? (to check the license price, for example?) Or would it be more useful some sort of timeline graph? /cc @JobV
@yorickpeterse wasn't there a discussion about removing entries from the Events table after a certain length of time? Would that be harmful for this feature?
@connorshea Correct, we remove events older than 12 months. For this particular feature that should be fine if we limit things to at most the past 12 months.
@regisF Do you think the Overview screen in the admin panel would be a good place for this? In that page we show some information similar to what we want to add.
I thought about combining it with the Statistics section, but that'd probably put too much information together. Maybe that screen needs a makeover as a whole.
@cperessini Yes it's a good place for that, but as you say, it needs a slight makeover then. However for the sake of being able to ship it in this release, the makeover has to be small.
From the list in the issue description, I'm not sure we keep anything other than push events (I assume both ssh and http?) in the Events table. Am I wrong ?
I've tried to figure out what services are involved/called (ignoring auth calls to the API) and came up with the following (probably not accurate) table:
Operation
Table
Service
git push ssh
Events
GitlabShell (gitlab-shell) and GitPushService after as a hook?
So, I'm not sure I could keep track of all the things using hooks (Specially things like clone, and obviously login counts) which was my first thought. Do you think is possible to unify all of backend code for keeping track of events/counts in one of the apps? I'm thinking I may have to extract everything to some sort of library/gem that can get called through the different services, to avoid duplication. It could be that this is much simpler than I thought and I'm just missing something - hence why I'm asking this before I add a new MR with changes...
I'm assuming we'll want to render this data statically on pageload, but respond to users' selection of new date ranges asynchronously. Seems like overkill to refresh the page (and thus all the info on the Admin > Overview view) just to display a new activity range. Do you have thoughts on this @jameslopez or @yorickpeterse?
We don't track 'git pull' / 'git clone' at the moment as @yorickpeterse said.
Events translate into the 'Activity' tab of the project. If we decide to track 'pulls' we should not create an event for each 'pull' because that would spam the activity feed with useless information.
One thing to bear in mind is that we cannot distinguish well between checking for changes in a git repository and actually fetching changes. All we see on our end is 'successfully authenticated read attempt from repo X'. Some people use Git clients (or CI systems, hello Jenkins) that periodically check for changes, so the 'number of pulls' when measured as 'successful authentication attempts for repo read' will be very high.
That means the table I've put is right and I'll need another model for tracking all of the things that are missing, which is fine.
One thing to bear in mind is that we cannot distinguish well between checking for changes in a git repository and actually fetching changes. All we see on our end is 'successfully authenticated read attempt from repo X'. Some people use Git clients (or CI systems, hello Jenkins) that periodically check for changes, so the 'number of pulls' when measured as 'successful authentication attempts for repo read' will be very high.
I see... @jacobvosmaer-gitlab Are you saying is not possible to exclusively track and distinguish git pull / clone (ssh/http) ? Do you think we can intercept those commands somehow? Also, where do you see that log? gitlab-shell? :)
@regisF Here's a design for the overview screen in the admin panel. I cleaned up the design a little to make it more readable. I wonder if we should show the Unique users line just once in a single place instead of repeating it for every line.
@jameslopez all we can see is: the user, the access method, the repository, and download/upload. We cannot distinguish git clone from git fetch or git pull: these all use the same server-side operation. (All the server sees is git upload-pack or git receive-pack.)
@jacobvosmaer-gitlab ouch.. Thanks for all the info! Is there any other level up the stack we could figure this out, that you can think of? Like Nginx for git clone/fetch/pull via HTTP? Or in any log...
@jameslopez no. You type git clone or git pull, a git upload-pack process is created on the server, and then all further details are exchanged directly between your local git process and the git upload-pack process on the server. We are not listening in on that stream.
In the case of HTTP we can see a little bit more: if the client decides it does not want to download anything (because it already has all the changes it needs locally) it only issues a GET, vs a GET followed by a POST. But that only tells us whether the client downloaded something (instead of just querying the repo status). Not whether the client command was git clone or git pull.
And while we could track this little bit of extra information for HTTP it would possibly be unsatisfying because we do not have the corresponding information for SSH.
Customers need this feature so they can know if they have to upgrade or downgrade the seat count, basically. The only way this feature is valuable is if it shows both logins through UI and git activities. The latter is important because some devs do not use the UI at all, so we still need to know if people are accessing the repositories with git.
That being said, to sum up: from what you are saying, we can't currently know git pull/push/clone operations with SSH and possibly with HTTP. Will it ever be possible to know? Can we change how we proceed to know it, or is it a technical impossibility?
At first glance the indents for via ssh and via https seem so small that to me it almost just looks like those lines are incorrectly aligned, rather than indented. Is that just me? Or is that a pattern we use elsewhere?
@regisF Rigth, I was working off of the CE Admin Panel. Let's move the license stuff below the tables. Otherwise the layout looks weird because of the number of columns changing:
@cperessini can we then do something about the big projects, users and groups block? Now that we have two rows of big blocks, the second row looks weird. I think the second row could be smaller.
@regisF I was working on a new style for the Project/User/Group panels. I was holding off on it to make this iteration smaller, but we can include it to solve the problem of the big blocks. I made two versions for the Projects/User/Groups panels.
@regisF seeing whether people access repositories with Git clients is not a problem at all. We can also see whether it is download access or upload access. But breaking down download access into specific client side commands like git pull, git clone etc. is not possible.
@jameslopez I don't see in that second link how gitolite distinguishes between pull and clone. I strongly doubt that it does considering how much work you have to do to get this information.
To get this information you need to listen in on the data exchanged between the git processes on the client and the server. It is probably technically possible but I think it is total engineering overkill. In gitlab-shell you would need:
change the behavior of bin/gitlab-shell from using exec to staying around
add signal handlers to terminate git-upload-pack when gitlab-shell receives a signal
copy all data sent to the git-upload-pack process into the gitlab-shell process first
understand enough about the SSH transport protocol of Git to parse the messages being sent to see what the client is asking for
So you would have to add a partial Git transport protocol parser, process supervision code, and stream copying / stream inspection code to gitlab-shell just to see this. It would make Git SSH access in GitLab considerably more complex and we would have to work hard to make sure that it not becomes less reliable than it is now. Is that worth it?
@chriscool just to be sure, am I correct when I say that in order to see on the server side if a client is doing git clone (i.e. a full clone, instead of a partial git pull) over SSH, you need to inspect the stdin stream of git-upload-pack and parse protocol messages?
@jacobvosmaer-gitlab that's certainly overkill. Perhaps we can just track all git upload-pack and git receive-pack and merge them in a stat called something like Other Git operations (as we can track git push separately). Now, not sure if that would still be useful /cc @regisF
@jacobvosmaer-gitlab with GIT_TRACE_PACKET it shoud be possible to see if the client sends some "have XXXX" or not.
By the way there is not much difference on the server side if the client does git init; git remote add origin <url>; git fetch or just git clone, so I don't think it's useful to separate those.
@regisF how does this relate to the EE usage ping? Can we make sure that both are showing and sending exactly the same data? This will make it easier for customers to understand what data the usage ping is sending out. Right now we do that with a JSON string that times out, not ideal. /cc @JobV
When I first asked for this data, my only hope was that either the "last_sign_in_at" or "last_credential_check_at" or some other value in the user table would actually represent the last time that this user did something with the system, either log in to the UI or do some git action via ssh or http. Armed with a column like that, we can query our database to our hearts' content to find out who is or isn't using the sytem. I can see how this could cause performance problems when the "user" is really something like phabricator that is slamming us all to hell, so maybe there would be a time window during which it wouldn't be updated, like there is for the git gc counting.
Having the admin page display a more data-driven "active" vs "non-active" statistic rather than "active" vs "blocked" is a useful thing from just a big picture perspective, but I'm not sure it's even really useful for the license sizing purposes you imagine it being used for, since the license is still applying only to non-blocked users, right?
@regisF Ideally it would be one single column that combined all evidence of life, but if you want to keep the current sign_in column and add one for git operations that could work as well.
I presume that as far as inactive users are concerned this issue will only expose a list of users inactive over some period of time. One of our partners has a prospect who is interested in making such users inactive. I presumed a separate issue will be required for that sort of follow-on functionality. Therefore I present issue #1085 ...
I forgot to say that I had updated the body of this issue last week to reflect what we do for this issue. The simplest route: a new column in the DB to track user activity.
This has now been merged and should be in 8.13. We ended up adding a new table for this for performance reasons. I've attached some examples on how to query this data... /cc @dstanley
mySQL/postgreSQL examples - active users last month:
SELECTu.username,a.last_activity_atFROMusersuINNERJOINuser_activitiesaONu.id=a.user_idWHERElast_activity_at>(SELECTNow()-INTERVAL1month)-- INTERVAL '1 MONTH' in postgreSQLORDERBYlast_activity_at;
SELECTCount(*)FROMuser_activitiesWHERElast_activity_at>(SELECTNow()-INTERVAL1month);-- INTERVAL '1 MONTH' in postgreSQL