Data about activity of user base in an instance

@regisF I think we already have most of the data in the "events" table. For this table data older than 12 months is pruned (comes with 8.12).

Login dates are also already provided (users.current_sign_in_at and users.last_sign_in_at) but we may remove some of this information. Right now whenever you log in we update a bunch of columns of the users table and this isn't ideal. We haven't really made any decisions yet though.

Performance wise this shouldn't be too much of a problem.

@regisF where is this info found? I suppose somewhere in the admin panel.

Added ~313021 label

Milestone changed to %8.13

Reassigned to @dimitrieh

@dimitrieh Will you look at the UX work for this issue? Thanks!

Added ~126459 label

Random thought here... Are numbers such as 3829 or 193 relevant (to check the license price, for example)? Or would some sort of timeline graph be more useful? /cc @JobV

@jameslopez numbers are! Timeline graph would be nice, but not necessary at this point in time.

Reassigned to @cperessini

@cperessini Switching you to this issue, as the other issue I had added you to was more in Dimitrie's area. Thanks!

@yorickpeterse wasn't there a discussion about removing entries from the Events table after a certain length of time? Would that be harmful for this feature?

@cperessini we have to find a place to display this data in the administration panel :-)

@connorshea Correct, we remove events older than 12 months. For this particular feature that should be fine if we limit things to at most the past 12 months.

@regisF do we have a slack channel to discuss this issue? Might be a good idea to create one...

@regisF Do you think the Overview screen in the admin panel would be a good place for this? In that page we show some information similar to what we want to add.

I thought about combining it with the Statistics section, but that'd probably put too much information together. Maybe that screen needs a makeover as a whole.

@cperessini Yes it's a good place for that, but as you say, it needs a slight makeover then. However for the sake of being able to ship it in this release, the makeover has to be small.

Mentioned in issue #574

@DouweM @yorickpeterse @jacobvosmaer-gitlab sorry to bug you for some obvious questions, but you may save me a fair amount of time here :)

From the list in the issue description, I'm not sure we keep anything other than push events (I assume both ssh and http?) in the Events table. Am I wrong ?

I've tried to figure out what services are involved/called (ignoring auth calls to the API) and came up with the following (probably not accurate) table:

Operation	Table	Service
git push ssh	Events	GitlabShell (gitlab-shell) and GitPushService after as a hook?
git push http	Events	GitPushService (gitlab-rails + gitlab-shell + Workhorse?)
git pull ssh	-	GitlabShell (gitlab-shell)
git pull http	-	GitLab::Workhorse (gitlab-rails + Workhorse)
git clone ssh	-	gitlab-shell
git clone http	-	gitlab-shell + gitlab-rails
Login count	User (no historical data)	gitlab-rails

So, I'm not sure I could keep track of all the things using hooks (especially things like clone, and obviously login counts), which was my first thought. Do you think it is possible to unify all of the backend code for keeping track of events/counts in one of the apps? I'm thinking I may have to extract everything to some sort of library/gem that can get called through the different services, to avoid duplication. It could be that this is much simpler than I thought and I'm just missing something, which is why I'm asking this before I add a new MR with changes...

@jameslopez Git events wise we only track push events I believe, we don't track pulls (and there's no solid way of doing that either I think).

I'm assuming we'll want to render this data statically on pageload, but respond to users' selection of new date ranges asynchronously. Seems like overkill to refresh the page (and thus all the info on the Admin > Overview view) just to display a new activity range. Do you have thoughts on this @jameslopez or @yorickpeterse?

@brycepj I would say it makes sense to load it async, yep.

@brycepj Yup, loading this asynchronously always seems like a good thing.

We don't track 'git pull' / 'git clone' at the moment as @yorickpeterse said.

Events translate into the 'Activity' tab of the project. If we decide to track 'pulls' we should not create an event for each 'pull' because that would spam the activity feed with useless information.

One thing to bear in mind is that we cannot distinguish well between checking for changes in a git repository and actually fetching changes. All we see on our end is 'successfully authenticated read attempt from repo X'. Some people use Git clients (or CI systems, hello Jenkins) that periodically check for changes, so the 'number of pulls' when measured as 'successful authentication attempts for repo read' will be very high.

Thanks a lot @yorickpeterse and @jacobvosmaer-gitlab !

That means the table I've put is right and I'll need another model for tracking all of the things that are missing, which is fine.

One thing to bear in mind is that we cannot distinguish well between checking for changes in a git repository and actually fetching changes. All we see on our end is 'successfully authenticated read attempt from repo X'. Some people use Git clients (or CI systems, hello Jenkins) that periodically check for changes, so the 'number of pulls' when measured as 'successful authentication attempts for repo read' will be very high.

I see... @jacobvosmaer-gitlab Are you saying it is not possible to exclusively track and distinguish git pull / clone (ssh/http)? Do you think we can intercept those commands somehow? Also, where do you see that log? gitlab-shell? :)

@regisF Here's a design for the overview screen in the admin panel. I cleaned up the design a little to make it more readable. I wonder if we should show the Unique users line just once in a single place instead of repeating it for every line.

cc @brycepj

@jameslopez all we can see is: the user, the access method, the repository, and download/upload. We cannot distinguish git clone from git fetch or git pull: these all use the same server-side operation. (All the server sees is git upload-pack or git receive-pack.)
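To make the point above concrete, here is a minimal sketch of what can be derived from the command string an SSH Git client sends. It assumes the server receives a string of the form `git-upload-pack 'group/project.git'`; the function name, the `download`/`upload` labels, and the example repository path are hypothetical, and this is not gitlab-shell's actual code.

```python
import shlex

# Assumed mapping from the server-side Git command to a coarse access type.
# git-upload-pack serves clone, fetch, and pull alike; git-receive-pack serves push.
ACCESS_BY_COMMAND = {
    "git-upload-pack": "download",
    "git-receive-pack": "upload",
}


def classify_ssh_command(original_command):
    """Return (access_type, repo_path) for an SSH command string, or None."""
    try:
        parts = shlex.split(original_command)
    except ValueError:
        # Unbalanced quotes or similar: not a command we can interpret.
        return None
    if len(parts) != 2:
        return None
    access = ACCESS_BY_COMMAND.get(parts[0])
    if access is None:
        return None
    return access, parts[1]


if __name__ == "__main__":
    print(classify_ssh_command("git-upload-pack 'group/project.git'"))
    print(classify_ssh_command("git-receive-pack 'group/project.git'"))
```

Nothing in the command string says whether the client ran git clone, git fetch, or git pull, which is why the sketch can only return a coarse download/upload label.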

@jacobvosmaer-gitlab ouch.. Thanks for all the info! Is there any other level up the stack we could figure this out, that you can think of? Like Nginx for git clone/fetch/pull via HTTP? Or in any log...

@jameslopez no. You type git clone or git pull, a git upload-pack process is created on the server, and then all further details are exchanged directly between your local git process and the git upload-pack process on the server. We are not listening in on that stream.

In the case of HTTP we can see a little bit more: if the client decides it does not want to download anything (because it already has all the changes it needs locally) it only issues a GET, vs a GET followed by a POST. But that only tells us whether the client downloaded something (instead of just querying the repo status). Not whether the client command was git clone or git pull.

And while we could track this little bit of extra information for HTTP it would possibly be unsatisfying because we do not have the corresponding information for SSH.

These are all the commands we can distinguish for SSH: https://gitlab.com/gitlab-org/gitlab-shell/blob/0b4fd0af16555b2bdc28f6b18781d72226c5d56c/lib/gitlab_shell.rb#L11-12

For Git HTTP, we can only see git upload-pack vs git receive-pack and whether it is a GET or a POST. https://gitlab.com/gitlab-org/gitlab-workhorse/blob/c3d62d2b54a06f5dba34fe445919a438527cd0df/internal/upstream/routes.go#L61-63
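As a rough illustration of the HTTP case described above, the sketch below classifies a request by method and URL, following the Git smart HTTP layout (a GET on `info/refs?service=...` for the ref advertisement, then a POST to the service endpoint when data is actually exchanged). The function name and the four labels are hypothetical, and this is not how Workhorse is implemented.

```python
from urllib.parse import parse_qs, urlsplit


def classify_git_http(method, url):
    """Return a coarse label for a Git smart HTTP request, or None if it is not one."""
    parts = urlsplit(url)
    path = parts.path
    if method == "GET" and path.endswith("/info/refs"):
        # Ref advertisement: the client is only asking what the repository contains.
        service = parse_qs(parts.query).get("service", [""])[0]
        if service == "git-upload-pack":
            return "read-status-check"
        if service == "git-receive-pack":
            return "write-status-check"
        return None
    if method == "POST" and path.endswith("/git-upload-pack"):
        # The client went on to request objects (clone, fetch, or pull).
        return "read-transfer"
    if method == "POST" and path.endswith("/git-receive-pack"):
        # The client is sending objects (push).
        return "write-transfer"
    return None


if __name__ == "__main__":
    base = "https://gitlab.example.com/group/project.git"
    print(classify_git_http("GET", base + "/info/refs?service=git-upload-pack"))
    print(classify_git_http("POST", base + "/git-upload-pack"))
```

A client that polls for changes and finds none would only ever produce the first label, which is the "little bit more" that HTTP reveals compared with SSH.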

Thanks @cperessini This looks good. We still need this section though:

@jacobvosmaer-gitlab @jameslopez Just so we are all on the same page.

Customers need this feature so they can know if they have to upgrade or downgrade the seat count, basically. The only way this feature is valuable is if it shows both logins through UI and git activities. The latter is important because some devs do not use the UI at all, so we still need to know if people are accessing the repositories with git.

That being said, to sum up: from what you are saying, we can't currently know git pull/push/clone operations with SSH and possibly with HTTP. Will it ever be possible to know? Can we change how we proceed to know it, or is it a technical impossibility?

@cperessini I like the design! Nice job!

At first glance the indents for via ssh and via https seem so small that to me it almost just looks like those lines are incorrectly aligned, rather than indented. Is that just me? Or is that a pattern we use elsewhere?

@regisF Right, I was working off of the CE Admin Panel. Let's move the license stuff below the tables. Otherwise the layout looks weird because of the number of columns changing:

3 columns
4 columns
3 columns

@brycepj Thanks for that comment! I'll try to improve that

@cperessini can we then do something about the big projects, users and groups block? Now that we have two rows of big blocks, the second row looks weird. I think the second row could be smaller.

@regisF I was working on a new style for the Project/User/Group panels. I was holding off on it to make this iteration smaller, but we can include it to solve the problem of the big blocks. I made two versions for the Projects/User/Groups panels.

Bottom panels with big numbers for the count

Bottom panels with the count in a badge

@cperessini Option A works better for me.

@regisF seeing whether people access repositories with Git clients is not a problem at all. We can also see whether it is download access or upload access. But breaking down download access into specific client side commands like git pull, git clone etc. is not possible.

@jacobvosmaer-gitlab @regisF I wonder if we could implement that in gitlab-shell somehow. GitHub certainly has this https://github.com/blog/1873-clone-graphs and it looks like gitolite has it too (which we used ages ago) http://stackoverflow.com/questions/23468737/pull-clone-statistics-with-gitolite

@jameslopez I don't see in that second link how gitolite distinguishes between pull and clone. I strongly doubt that it does considering how much work you have to do to get this information.

To get this information you need to listen in on the data exchanged between the git processes on the client and the server. It is probably technically possible but I think it is total engineering overkill. In gitlab-shell you would need to:

change the behavior of bin/gitlab-shell from using exec to staying around
add signal handlers to terminate git-upload-pack when gitlab-shell receives a signal
copy all data sent to the git-upload-pack process into the gitlab-shell process first
understand enough about the SSH transport protocol of Git to parse the messages being sent to see what the client is asking for

So you would have to add a partial Git transport protocol parser, process supervision code, and stream copying / stream inspection code to gitlab-shell just to see this. It would make Git SSH access in GitLab considerably more complex and we would have to work hard to make sure that it does not become less reliable than it is now. Is that worth it?

@chriscool just to be sure, am I correct when I say that in order to see on the server side if a client is doing git clone (i.e. a full clone, instead of a partial git pull) over SSH, you need to inspect the stdin stream of git-upload-pack and parse protocol messages?

@jacobvosmaer-gitlab that's certainly overkill. Perhaps we can just track all git upload-pack and git receive-pack and merge them in a stat called something like Other Git operations (as we can track git push separately). Now, not sure if that would still be useful /cc @regisF

@jacobvosmaer-gitlab with GIT_TRACE_PACKET it should be possible to see if the client sends some "have XXXX" or not. By the way there is not much difference on the server side if the client does git init; git remote add origin <url>; git fetch or just git clone, so I don't think it's useful to separate those.
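To show the kind of stream inspection being discussed, here is a small sketch that walks Git pkt-lines (a 4-hex-digit length prefix that counts itself, with `0000` as the flush packet) and reports whether the client sent any `have` lines. It assumes a complete, already captured copy of the bytes the client wrote to git-upload-pack; the function names and the placeholder object IDs are hypothetical, and real traffic would need more careful handling.

```python
def iter_pkt_lines(data):
    """Yield each pkt-line payload in data; yield None for flush/special packets."""
    pos = 0
    while pos + 4 <= len(data):
        length = int(data[pos:pos + 4], 16)
        if length < 4:
            # 0000 (flush) and the other short special packets carry no payload.
            yield None
            pos += 4
            continue
        yield data[pos + 4:pos + length]
        pos += length


def client_sent_haves(data):
    """True if any pkt-line in the client's request starts with 'have '."""
    return any(
        line is not None and line.startswith(b"have ")
        for line in iter_pkt_lines(data)
    )


def pkt(payload):
    """Encode one payload as a pkt-line (helper for building example input)."""
    return b"%04x" % (len(payload) + 4) + payload


if __name__ == "__main__":
    want = pkt(b"want " + b"a" * 40 + b"\n")
    have = pkt(b"have " + b"b" * 40 + b"\n")
    done = pkt(b"done\n")
    print(client_sent_haves(want + b"0000" + done))         # client has nothing yet
    print(client_sent_haves(want + b"0000" + have + done))  # client already has objects
```

Even with this in place, a request without `have` lines only means the client started from an empty repository, so it would not separate git clone from git init followed by git fetch.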

@jameslopez git push is the same thing as git receive-pack: that is just what it looks like on the server.

@chriscool thanks!

@jameslopez @jacobvosmaer-gitlab

To sum up: we know when someone pushes to Git, and we know when someone pulls from Git, but basically we can't know the details of those operations.

That means we could report on the admin panel the following data, over a given time period and per unique user:

Login in the UI
Git pull operations, without any details
Git push operations, without any details

Please confirm. I'll then validate with users who've asked for this.
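A minimal sketch of the report proposed above, assuming activity is available as `(user_id, kind, timestamp)` records. The record shape and the kind names `ui_login`, `git_read`, and `git_write` are hypothetical and do not correspond to an existing GitLab table.

```python
from datetime import datetime


def unique_users_by_activity(records, start, end):
    """Count distinct users per activity kind with start <= timestamp < end."""
    users = {"ui_login": set(), "git_read": set(), "git_write": set()}
    for user_id, kind, timestamp in records:
        if kind in users and start <= timestamp < end:
            users[kind].add(user_id)
    return {kind: len(ids) for kind, ids in users.items()}


if __name__ == "__main__":
    records = [
        (1, "ui_login", datetime(2016, 9, 3)),
        (1, "git_read", datetime(2016, 9, 4)),
        (1, "git_read", datetime(2016, 9, 5)),  # same user again: counted once
        (2, "git_write", datetime(2016, 9, 6)),
        (3, "git_read", datetime(2016, 7, 1)),  # outside the selected period
    ]
    print(unique_users_by_activity(records, datetime(2016, 9, 1), datetime(2016, 10, 1)))
```

Counting distinct users rather than raw operations keeps clients that poll a repository frequently from inflating the numbers.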

@regisF that makes sense. I'm not sure about calling it Git pull operations, though: I think @jacobvosmaer-gitlab mentioned we can't distinguish whether it's a pull/clone/fetch... https://gitlab.com/gitlab-org/gitlab-ee/issues/1022#note_16209650 - perhaps something like Git read operations...

@dstanley can you please read this discussion (specifically this comment) and let me know if this would be satisfactory for your needs?

cc @dblessing @JobV scope has changed a little bit.

@regisF how does this relate to the EE usage ping? Can we make sure that both are showing and sending exactly the same data? This will make it easier for customers to understand what data the usage ping is sending out. Right now we do that with a JSON string that times out, not ideal. /cc @JobV

When I first asked for this data, my only hope was that either the "last_sign_in_at" or "last_credential_check_at" or some other value in the user table would actually represent the last time that this user did something with the system, either log in to the UI or do some git action via ssh or http. Armed with a column like that, we can query our database to our hearts' content to find out who is or isn't using the system. I can see how this could cause performance problems when the "user" is really something like phabricator that is slamming us all to hell, so maybe there would be a time window during which it wouldn't be updated, like there is for the git gc counting.
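The time-window idea can be sketched as follows: only write a new "last activity" timestamp when the stored one is older than some window, so a client that hits the server constantly causes at most one write per window. The function name and the one-hour default are assumptions for illustration, not values taken from GitLab.

```python
from datetime import datetime, timedelta


def should_record_activity(last_activity_at, now, window=timedelta(hours=1)):
    """Return True if the stored timestamp is missing or older than the window."""
    if last_activity_at is None:
        return True
    return now - last_activity_at >= window


if __name__ == "__main__":
    stored = datetime(2016, 10, 1, 12, 0)
    # Ten minutes later: skip the write.
    print(should_record_activity(stored, datetime(2016, 10, 1, 12, 10)))
    # Two hours later: record the new timestamp.
    print(should_record_activity(stored, datetime(2016, 10, 1, 14, 0)))
```

The trade-off is precision: the stored timestamp can lag the real last activity by up to one window, which should not matter when the question is whether someone was active in the past month.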

Having the admin page display a more data-driven "active" vs "non-active" statistic rather than "active" vs "blocked" is a useful thing from just a big picture perspective, but I'm not sure it's even really useful for the license sizing purposes you imagine it being used for, since the license is still applying only to non-blocked users, right?

@jameslopez @regisF @dblessing @JobV

Um, hi

Thanks @dstanley for your input. So basically, just having a last_sign_in_at and last_git_operation_at (or similar) columns would be enough for you?

@regisF Ideally it would be one single column that combined all evidence of life, but if you want to keep the current sign_in column and add one for git operations that could work as well.

@dstanley thanks for that context!

@JobV usage ping issue https://gitlab.com/gitlab-org/gitlab-ee/issues/1044

Mentioned in issue #1085

I presume that as far as inactive users are concerned this issue will only expose a list of users inactive over some period of time. One of our partners has a prospect who is interested in making such users inactive. I presumed a separate issue will be required for that sort of follow-on functionality. Therefore I present issue #1085 ...

Mentioned in merge request !781 (merged)

Reassigned to @jameslopez

@awhildy are we still doing UX on this and if not should we remove the label?

Removed ~126459 label

@awhildy @jschatz1 I removed the label. We don't need UX for this.

I forgot to say that I had updated the body of this issue last week to reflect what we do for this issue. The simplest route: a new column in the DB to track user activity.

Status changed to closed by merge request !781 (merged)

This has now been merged and should be in 8.13. We ended up adding a new table for this for performance reasons. I've attached some examples on how to query this data... /cc @dstanley

MySQL/PostgreSQL examples - active users last month:

SELECT u.username,
       a.last_activity_at
FROM   users u
       INNER JOIN user_activities a
               ON u.id = a.user_id
WHERE  last_activity_at > (SELECT Now() - INTERVAL 1 month) -- INTERVAL '1 MONTH' in postgreSQL
ORDER  BY last_activity_at;

SELECT Count(*)
FROM   user_activities
WHERE  last_activity_at > (SELECT Now() - INTERVAL 1 month); -- INTERVAL '1 MONTH' in postgreSQL

Mentioned in issue #450

mentioned in merge request gitlab-com/www-gitlab-com!4428 (closed)

mentioned in merge request gitlab-foss!14785
