We would read the table above like: for the October cohort, 97.04% of the users who registered in October are still active one month after their registration.
Month 0 is always 100%. It's based on the user registration date (created_at)
Month 1 contains all the users who were still active one month after the user registration date. This is based on the last_activity_date field.
This table will be shown in the administration panel interface under a new tab called User cohorts (see below)
Cohort analysis is used to observe what happens to a group of users that joined in a particular time period.
For us (GitLab.com), we can't use revenues to know the churn over time, which is one of the use case of user cohorts (because we don't have revenues).
We could use last activity date however to determine if new users actually stay active on GitLab.
-
Cohort total
0
1
2
3
AUG
273
0%
6.92%
12.31%
37.06%
SEP
324
0%
4.09%
14.38%
OCT
312
0%
7.04%
NOV
145
0%
In the table above, we would see that for the October cohort, 7% of users are already inactive 1 month after they've signed up. We could take actions to reduce the "churn" of activity here. I see the value here, for our use case. In case we monetize some features in GitLab.com in the future, we'll also be able to do user cohorts for the different features we'll monetize.
That being said, I don't see how that would benefit other EE users. What do you think @JobV ?
@regisF People are interesting in seeing how their colleagues are using GitLab. It'd be great if we'd have a system where we could plugin new features to see usage over time, for instance. This helps organisations making payment decisions.
We would read the table above like: for the October cohort, 97.04% of the users who registered in October are still active one month after their registration.
This table will be shown in the administration panel interface under a new tab called Reports. (I'll update the body of this issue if we agree on the next step)
So one hand, we can get a list of all users who were active in the last 30 days.
On the other hand we know the registration dates of each user in the database. We can combine the two to populate the table above.
The problem now is that the data is in Redis, and Redis can be flushed at any moment - therefore we would lose data and this table will become not only useless, but grossly inaccurate also. I think we have to move the data that is stored on Redis periodically to the database, in order to keep the information forever.
@jameslopez@yorickpeterse can you help me there? Am I right in saying that if Redis dies for some reason, we don't have the data anymore? What would be required to save it to the database and how to proceed?
@regisF Redis periodically flushes data to disk, which it can use to restore in the event of an error. However, the data could be flushed explicitly at any given time. Indeed it's best to treat Redis as a temporary storage, similar to /tmp.
Considering the recent increase in requests for reporting, and considering this is all specific to GitLab, I think it might be worth looking into building a simple application that pulls data from the DB/Redis, then stores this elsewhere (e.g. like our version app). This means we can build whatever GitLab needs, without having to somehow shoehorn this into GitLab itself.
As a matter of fact, as we want to display a user cohort about last activity date straight in the administration panel of an instance, we don't need the system you are talking about. We still need however to make this data persistent at the instance level. How can we do it simply?
We have a background job that moves the data from Redis to the DB (daily, for instance). This means the data won't be super accurate, but not sure we need that anyway.
We completely avoid rake cache:clear (which we probably run at least every time we upgrade) and use zone cache clearing instead. So substitute cache:clear for something like cache:clear:gitlab and never flush the activity data.
We pull and store stuff in the DB on another app (we can query the GitLab API to get the data). Then we can query this app via an API to get the info.
@DouweM@mydigitalself I'm not completely satisfied with how this was moved. I understand that we have limited capacity and many important things (everything is important ^tm), but this is extremely important to our business (as it'll allow us to enable the usage ping for CE). Let's try to still make it work, by trading it for something else.
@DouweM is there any way we can make capacity for this? E.g. do Teams integration later?
Alternatively, @smcgivern and @victorwu is there any chance you could pick this up?
@smcgivern : Anything we can do here for 9.1? Have all BE resources already started something in 9.1? I don't see anything in particular that we can swap out?
@regisF one thing to note: the first column won't always be 100%. For instance, if a user was created through the API, but has never logged in, pushed, etc., then they won't have an activity date.
This is OK performance-wise on staging (it will need a MySQL version too; I'm using last_sign_in_at instead of the activity column for now):
User.where('created_at > ?', Time.utc(2016)).where.not(last_sign_in_at: nil).group("DATE_TRUNC('month', created_at)").group("DATE_TRUNC('month', last_sign_in_at)").reorder(nil).countD, [2017-03-20T14:52:15.099336 #59274] DEBUG -- : (1048.8ms) SELECT COUNT(*) AS count_all, DATE_TRUNC('month', created_at) AS date_trunc_month_created_at, DATE_TRUNC('month', last_sign_in_at) AS date_trunc_month_last_sign_in_at FROM "users" WHERE (created_at > '2016-01-01 00:00:00.000000') AND ("users"."last_sign_in_at" IS NOT NULL) GROUP BY DATE_TRUNC('month', created_at), DATE_TRUNC('month', last_sign_in_at)
@smcgivern I believe the query should use current_sign_in_at as last_sign_in_at is for the previous session? Also, see https://gitlab.com/gitlab-org/gitlab-ce/issues/29523 (the index on current_sign_in_at will probably be removed), maybe it shouldn't if we start to actually query on current_sign_in_at?
@rymai I won't be using either! I'll be using the activity column But good point when I test it.
I will have an MR in soon; the only column in the WHERE will be created_at. Although I will group by activity, that will be grouped by the first day of the month, so I'm not certain an index is needed there either.
Wanted to verify what @smcgivern mentions here: https://gitlab.com/gitlab-org/gitlab-ce/issues/29551#note_25812161. So we are using created_at to define the cohort size. But we are using last_activity_at to define active-ness? So a user can be created by not active during a given month? So that's why we can have < 100% active-ness for Month 0 of a cohort?
So we are using last_activity_at to indicate when a user last was active as a definition here right? So suppose the user was created and active in Dec 2016. And last_activity_at is in April 2017 as of now (now being April 6). Do we automatically assume that the user was active in Jan, Feb, and Mar? The screenshots seem to imply so with the monotonically decreasing numbers for each cohort (and also what @regisF mentioned in https://gitlab.com/gitlab-org/gitlab-ce/issues/29551#note_25497316). So I wanted to verify that's how we were defining it, i.e. we just assume that the users were active in those intermediary months. So if the user wasn't actually active during Jan/Feb/Mar, then a person viewing the cohort page in Jan/Feb/Mar and in Apr would get seemingly inconsistent results. Not something we need to address right away. But wanted to understand this.