User Cohort Insights

Cohort analysis is used to observe what happens to a group of users that joined in a particular time period.

For us (GitLab.com), measuring churn from revenue over time, which is one of the common use cases of user cohorts, isn't possible because we don't have revenue.

We could, however, use the last activity date to determine whether new users actually stay active on GitLab.

-	Cohort total	0	1	2	3
AUG	273	0%	6.92%	12.31%	37.06%
SEP	324	0%	4.09%	14.38%
OCT	312	0%	7.04%
NOV	145	0%

In the table above, we would see that for the October cohort, 7% of users are already inactive 1 month after they signed up. We could take action to reduce this "churn" of activity. I see the value here for our use case. If we monetize some features in GitLab.com in the future, we'll also be able to build user cohorts for the different features we monetize.

That being said, I don't see how that would benefit other EE users. What do you think @JobV ?

@regisF People are interested in seeing how their colleagues are using GitLab. It'd be great if we had a system where we could plug in new features to see usage over time, for instance. This helps organisations make payment decisions.

The first and simplest user cohort we can measure is the last activity of the users.

As a matter of fact, this is the only usage data we record over time so far, and we record it at the instance level.

Using the last_activity_date (defined in https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/781), we can measure how users stay active over time. This will be really useful for GitLab.com.

-	Cohort total	Month 0	Month 1	Month 2	Month 3
AUG	273	100%	96.92% (261)	89.31% (251)	37.06% (98)
SEP	324	100%	94.09% (282)	78.38% (254)
OCT	312	100%	97.04% (298)
NOV	145	100%

We would read the table above like: for the October cohort, 97.04% of the users who registered in October are still active one month after their registration.
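Each cell in a table like this is the number of still-active users divided by the cohort total. A minimal sketch of that arithmetic follows; the method name `retention_row` and the counts are hypothetical and are not taken from the mock table above.

```ruby
# Hypothetical helper: turns raw "still active" counts for one cohort into
# the "percentage (count)" cells of a retention row.
# Index 0 is the signup month itself.
def retention_row(cohort_total, active_counts)
  active_counts.map do |count|
    # Guard against an empty cohort to avoid dividing by zero.
    percentage = cohort_total.zero? ? 0.0 : (100.0 * count / cohort_total).round(2)
    { count: count, percentage: percentage }
  end
end

# Made-up cohort of 200 users: 190 still active after one month, 150 after two.
row = retention_row(200, [200, 190, 150])
row.each_with_index do |cell, month|
  puts "Month #{month}: #{cell[:percentage]}% (#{cell[:count]})"
end
```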

This table will be shown in the administration panel interface under a new tab called Reports. (I'll update the body of this issue if we agree on the next step)

mentioned in merge request gitlab-com/www-gitlab-com!4428 (closed)

@regisF we're needing this more and more. Let's do this for 9.0 if possible.

added gitlab-ee~~901432 gitlab-ee~~481018 labels

changed milestone to gitlab-ee%11

@regisF gitlab-ee~~481018 doesn’t have capacity to build this. gitlab-ee~~901432 might

We don't have access to last_activity_date anymore. It was moved out of the database for performance reasons, and that data is now stored in Redis (https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/915).

Creating this table is still feasible, but more complex.

We have an API call to see last activity of all the members of a GitLab instance, which returns something like:

[
  {
    "username": "user1",
    "last_activity_at": "2015-12-14 01:00:00"
  },
  {
    "username": "user2",
    "last_activity_at": "2015-12-15 01:00:00"
  },
  {
    "username": "user3",
    "last_activity_at": "2015-12-16 01:00:00"
  }
]

So on one hand, we can get a list of all users who were active in the last 30 days.

On the other hand we know the registration dates of each user in the database. We can combine the two to populate the table above.
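A rough sketch of how the two sources could be combined, assuming the API payload has the shape shown above. The registration dates, the extra `user4`, and the helper names `month_key` and `cohort_pairs` are invented for illustration.

```ruby
require 'json'
require 'time'

# Payload shape copied from the API response above.
activity_json = <<~JSON
  [
    { "username": "user1", "last_activity_at": "2015-12-14 01:00:00" },
    { "username": "user2", "last_activity_at": "2015-12-15 01:00:00" },
    { "username": "user3", "last_activity_at": "2015-12-16 01:00:00" }
  ]
JSON

# Hypothetical registration dates, standing in for users.created_at.
registered_at = {
  'user1' => Time.utc(2015, 10, 3),
  'user2' => Time.utc(2015, 10, 20),
  'user3' => Time.utc(2015, 12, 1),
  'user4' => Time.utc(2015, 11, 5) # registered, but no activity recorded
}

def month_key(time)
  time.strftime('%Y-%m')
end

# Counts users per [signup month, last active month] pair.
# The last active month is nil for users without recorded activity.
def cohort_pairs(activity, registered_at)
  last_activity = activity.to_h do |row|
    # Assumption: the API timestamps are in UTC.
    [row['username'], Time.parse("#{row['last_activity_at']} UTC")]
  end

  counts = Hash.new(0)
  registered_at.each do |username, created|
    last = last_activity[username]
    counts[[month_key(created), last && month_key(last)]] += 1
  end
  counts
end

counts = cohort_pairs(JSON.parse(activity_json), registered_at)
counts.each { |(signup, last_active), users| puts "#{signup} -> #{last_active.inspect}: #{users}" }
```

The resulting pairs are the raw material for the table: each row is a signup month, and the last active month decides how many columns a user is counted in.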

The problem now is that the data is in Redis, and Redis can be flushed at any moment. We would then lose data, and this table would become not just useless but grossly inaccurate. I think we have to periodically move the data stored in Redis to the database, in order to keep the information forever.

@jameslopez @yorickpeterse can you help me there? Am I right in saying that if Redis dies for some reason, we don't have the data anymore? What would be required to save it to the database and how to proceed?

@regisF Redis periodically flushes data to disk, which it can use to restore in the event of an error. However, the data could be flushed explicitly at any given time. Indeed, it's best to treat Redis as temporary storage, similar to /tmp.

Considering the recent increase in requests for reporting, and considering this is all specific to GitLab, I think it might be worth looking into building a simple application that pulls data from the DB/Redis, then stores this elsewhere (e.g. like our version app). This means we can build whatever GitLab needs, without having to somehow shoehorn this into GitLab itself.

@yorickpeterse thanks for your input.

As a matter of fact, as we want to display a user cohort about last activity date straight in the administration panel of an instance, we don't need the system you are talking about. We still need however to make this data persistent at the instance level. How can we do it simply?

Hmm, I see a few options here:

1. We have a background job that moves the data from Redis to the DB (daily, for instance). This means the data won't be super accurate, but I'm not sure we need that anyway.
2. We completely avoid rake cache:clear (which we probably run at least every time we upgrade) and use zone cache clearing instead. So we substitute cache:clear with something like cache:clear:gitlab and never flush the activity data.
3. We pull and store the data in the DB of another app (we can query the GitLab API to get the data). Then we can query this app via an API to get the info.
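The first option, a periodic job that copies activity timestamps from Redis to the database, could look roughly like the sketch below. Everything here is an assumption made for illustration: the class name `ActivityPersister` is invented, and plain hashes stand in for Redis and for the users table so the example runs on its own.

```ruby
# Illustrative only: `cache` plays the role of Redis and `database` the role
# of a persistent column on the users table. Both are plain hashes of
# username => Time.
class ActivityPersister
  def initialize(cache, database)
    @cache = cache
    @database = database
  end

  # Copies every cached timestamp to the database, keeping the newest value.
  # After a run, a cache flush only loses activity recorded since that run.
  def run
    @cache.each do |username, cached_at|
      stored_at = @database[username]
      @database[username] = cached_at if stored_at.nil? || cached_at > stored_at
    end
    @database
  end
end

cache = { 'user1' => Time.utc(2017, 3, 20), 'user3' => Time.utc(2017, 3, 5) }
database = { 'user1' => Time.utc(2017, 3, 1), 'user2' => Time.utc(2017, 2, 1) }

ActivityPersister.new(cache, database).run
cache.clear # simulate a flush: the persisted timestamps survive
```

Run once a day, a job like this bounds the possible data loss to one day of activity, which matches the accuracy trade-off described in the first option.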

added gitlab-ee~1281159 label

@jameslopez I have the impression that solution 1 might be what we need here. We don't need great accuracy - once per day is enough. I've documented the issue here https://gitlab.com/gitlab-org/gitlab-ce/issues/27790

added gitlab-ee~992792 label

changed milestone to gitlab-ee%22

@regisF Please talk with @mydigitalself about when we should do this!

changed milestone to gitlab-ee%16

@DouweM @mydigitalself we want to do this in 9.1

Before we do this, as described in https://gitlab.com/gitlab-org/gitlab-ce/issues/23361, we need https://gitlab.com/gitlab-org/gitlab-ce/issues/27790 first, which hopefully will be done in 9.0. We can't move forward without this.

@regisF As it is right now, gitlab-ee~481018 doesn't have capacity for this in 9.1—the major efforts as discussed with @mydigitalself are going to be https://gitlab.com/gitlab-org/gitlab-ce/issues/27084, https://gitlab.com/gitlab-org/gitlab-ce/issues/18471, and https://gitlab.com/gitlab-org/gitlab-ce/issues/28433.

added gitlab-ee~1599850 label

changed milestone to gitlab-ee%23

@mydigitalself I moved this to 9.2. We can't postpone it any further, as we want to have the CE usage ping and this is a requirement for that.

@DouweM @mydigitalself I'm not completely satisfied with how this was moved. I understand that we have limited capacity and many important things (everything is important ^tm), but this is extremely important to our business (as it'll allow us to enable the usage ping for CE). Let's try to still make it work, by trading it for something else.

@DouweM is there any way we can make capacity for this? E.g. do Teams integration later?

Alternatively, @smcgivern and @victorwu is there any chance you could pick this up?

@smcgivern : Anything we can do here for 9.1? Have all BE resources already started something in 9.1? I don't see anything in particular that we can swap out?

changed milestone to gitlab-ee%22

moved from gitlab-ee#1245

@regisF told me that the usage ping data itself should be pretty-printed on this page, too: https://gitlab.com/gitlab-org/gitlab-ee/issues/1498

Do we also need a design for when usage ping is not enabled? Or will we just not show the link to the page in that case?

@regisF is there a limit to either axis of this? The first user for GitLab.com is from September 2012, so the table will be really tall and wide.

@smcgivern it could be nice to have a blank state for this screen. If we can make it, nice, if not, let's do it in another release.

About limits:

- We should have the same number of columns for both the X and Y axes.
- I think if we could fit one year of data, that'd be nice, to show progress. Meaning 12 rows.
- At launch, we won't have historical data for CE, so they'll all be empty. Once we have the table, let's see how we can still make it look good.
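A small sketch of the one-year window described above, assuming cohorts are keyed by the first day of each month. The method name `cohort_months` is hypothetical.

```ruby
require 'date'

# Returns the first day of each of the last `months` months, oldest first.
# The same list can label both the rows (signup months) and, as offsets,
# the columns, which keeps both axes the same size.
def cohort_months(today, months = 12)
  first_of_month = Date.new(today.year, today.month, 1)
  # Date#<< moves a date back by the given number of months.
  (0...months).map { |offset| first_of_month << offset }.reverse
end

cohort_months(Date.new(2017, 3, 20)).each { |month| puts month.strftime('%b %Y') }
```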

@regisF one thing to note: the first column won't always be 100%. For instance, if a user was created through the API, but has never logged in, pushed, etc., then they won't have an activity date.

This is OK performance-wise on staging (it will need a MySQL version too; I'm using last_sign_in_at instead of the activity column for now):

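# Count users created since 2016 who have signed in at least once,
# grouped by signup month and by month of last sign-in (PostgreSQL only: DATE_TRUNC).
User.where('created_at > ?', Time.utc(2016))
    .where.not(last_sign_in_at: nil)
    .group("DATE_TRUNC('month', created_at)")
    .group("DATE_TRUNC('month', last_sign_in_at)")
    .reorder(nil)
    .count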
D, [2017-03-20T14:52:15.099336 #59274] DEBUG -- :    (1048.8ms)  SELECT COUNT(*) AS count_all, DATE_TRUNC('month', created_at) AS date_trunc_month_created_at, DATE_TRUNC('month', last_sign_in_at) AS date_trunc_month_last_sign_in_at FROM "users" WHERE (created_at > '2016-01-01 00:00:00.000000') AND ("users"."last_sign_in_at" IS NOT NULL) GROUP BY DATE_TRUNC('month', created_at), DATE_TRUNC('month', last_sign_in_at)

assigned to @smcgivern

@smcgivern you are right about the first column not always being 100%. I wonder how the table would look in that case.

@smcgivern I believe the query should use current_sign_in_at as last_sign_in_at is for the previous session? Also, see https://gitlab.com/gitlab-org/gitlab-ce/issues/29523 (the index on current_sign_in_at will probably be removed), maybe it shouldn't if we start to actually query on current_sign_in_at?

@rymai I won't be using either! I'll be using the activity column. But that's a good point for when I test it.

I will have an MR in soon; the only column in the WHERE will be created_at. Although I will group by activity, that will be grouped by the first day of the month, so I'm not certain an index is needed there either.

I won't be using either! I'll be using the activity column

@smcgivern Aha ok, makes sense (don't know what I was thinking)!

mentioned in issue #23361 (closed)

added docs-review label

mentioned in issue #30469 (closed)

@regisF @smcgivern : I'm helping out with https://gitlab.com/gitlab-org/gitlab-ce/issues/30469 and https://gitlab.com/gitlab-org/gitlab-ce/issues/30470 from a product perspective. So this issue here is of course very relevant. I've read through this and looked at the merge request (https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/1545). Could you help me with these questions?

1. Wanted to verify what @smcgivern mentions here: https://gitlab.com/gitlab-org/gitlab-ce/issues/29551#note_25812161. So we are using created_at to define the cohort size, but we are using last_activity_at to define active-ness? So a user can be created but not active during a given month? And that's why we can have < 100% active-ness for Month 0 of a cohort?
2. So we are using last_activity_at as the definition of when a user was last active, right? Suppose the user was created and active in Dec 2016, and last_activity_at is in April 2017 as of now (now being April 6). Do we automatically assume that the user was active in Jan, Feb, and Mar? The screenshots seem to imply so, with the monotonically decreasing numbers for each cohort (and also what @regisF mentioned in https://gitlab.com/gitlab-org/gitlab-ce/issues/29551#note_25497316). So I wanted to verify that's how we are defining it, i.e. we just assume that the users were active in those intermediary months. If the user wasn't actually active during Jan/Feb/Mar, then a person viewing the cohort page in Jan/Feb/Mar and again in Apr would get seemingly inconsistent results. Not something we need to address right away, but I wanted to understand this.

@victorwu we defined a cohort to be all the users who registered in a month and have since had any activity, so the first column is always 100%.

For the second question, you're right, it's monotonically decreasing
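To make the monotonic behaviour concrete, here is a sketch under the definition given above: a user counts as active in every month up to and including the month of their last activity. The method name `cohort_row` and the counts are hypothetical. The input has the same shape as a count grouped by signup month and last activity month.

```ruby
require 'date'

# counts: { [signup_month, last_activity_month] => number of users },
# with both months given as the first day of the month.
# Returns, for one cohort, the number of users counted as active in each
# month from signup onwards.
def cohort_row(counts, signup_month, months)
  months.select { |month| month >= signup_month }.map do |month|
    counts.sum do |(signup, last_active), users|
      # A user is counted for `month` if their last activity is in or after it.
      signup == signup_month && last_active && last_active >= month ? users : 0
    end
  end
end

dec, jan, feb, mar, apr = (0..4).map { |i| Date.new(2016, 12, 1) >> i }

# Made-up December cohort: 10 users last active in Dec, 5 in Feb, 20 in Apr.
counts = { [dec, dec] => 10, [dec, feb] => 5, [dec, apr] => 20 }

p cohort_row(counts, dec, [dec, jan, feb, mar, apr])
```

Because each user contributes to every month up to their last activity, the numbers in a row can never increase from left to right. The five users last active in February are counted for January as well, whether or not they were actually active then, which is the assumption raised in the second question.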

closed via commit gitlab-ee@7c5ba0f9e3d86c96d4cecf3c791621536b18c83e

mentioned in issue #31192 (moved)

mentioned in merge request !14785
