WIP: Use a new table for user contribution stats
What does this MR do?
It moves the user contribution stats to a new table, instead of computing them from the events table.
Are there points in the code the reviewer needs to double check?
Why was this MR needed?
This will allow downsizing the events table.
Screenshots (if relevant)
Does this MR meet the acceptance criteria?
- Changelog entry added
- Documentation created/updated
- API support added
- Tests
  - Added for this feature/bug
  - All builds are passing
- Conform by the merge request performance guides
- Conform by the style guides
- Branch has no merge conflicts with master (if it does - rebase it please)
- Squashed related commits together
What are the relevant issue numbers?
Merge request reports
Activity
@yorickpeterse Hi, what do you think of the migration?
- Resolved by username-removed-714523
- Resolved by username-removed-714523
- Resolved by yorickpeterse-staging
@leibo I posted some comments on the migration above.
@yorickpeterse Hi, I fixed according to your comments, WDYT?
- Resolved by username-removed-714523
- Resolved by username-removed-714523
Reassigned to @leibo
@yorickpeterse fixed, moving on I think
@leibo Looks good so far
Mentioned in issue #24344 (closed)
@yorickpeterse Added the model and Sidekiq worker, take a look.
- Resolved by username-removed-714523
- Resolved by username-removed-714523
Added 661 commits:
- 31c2b3b1...75c8faf7 - 660 commits from branch gitlab-org:master
- ecb46de4 - Use a new table for user conribution stats
@yorickpeterse Ok, comments fixed, how do we move on? What else is needed?
@leibo Two things remain:
- The contribution calendar should use this new table
- Tests
@yorickpeterse I'm wondering about two more things:
- How will the user_contribution table be updated with all the data from the past?
- How do we tell the worker to run every day?
How will the user_contribution table be updated with all the data from the past?
We can run a migration that populates the table with data from the past year. This migration may take quite a while to run so perhaps we need to split it into chunks and run those in parallel. Let's start with a simple migration and we can measure how long this takes on GitLab's staging environment.
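For illustration, a backfill along those lines might look roughly like this (a sketch only; it assumes the aggregation is wrapped in something like the UserContribution.calculate_for(date) method used elsewhere in this MR):

    # Sketch of a backfill migration over the past year, one day per chunk.
    # UserContribution.calculate_for(date) is assumed to aggregate that day's
    # events into the user_contributions table.
    class PopulateUserContributions < ActiveRecord::Migration
      DOWNTIME = false

      def up
        (1..365).each do |offset|
          UserContribution.calculate_for(offset.days.ago.to_date)
        end
      end

      def down
        execute 'DELETE FROM user_contributions'
      end
    end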
How do we tell the worker to run every day?
In config/initializers/1_settings.rb we define various cronjobs using this pattern:

    Settings.cron_jobs['trending_projects_worker'] ||= Settingslogic.new({})
    Settings.cron_jobs['trending_projects_worker']['cron'] = '0 1 * * *'
    Settings.cron_jobs['trending_projects_worker']['job_class'] = 'TrendingProjectsWorker'
We can probably run the worker around 01:00.
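For this MR the entry could look something like the following (a sketch; the worker name is an assumption):

    # config/initializers/1_settings.rb (sketch; 'user_contributions_worker' is an assumed name)
    Settings.cron_jobs['user_contributions_worker'] ||= Settingslogic.new({})
    Settings.cron_jobs['user_contributions_worker']['cron'] = '0 1 * * *' # daily at 01:00
    Settings.cron_jobs['user_contributions_worker']['job_class'] = 'UserContributionsWorker'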
@yorickpeterse Does gitlab have a notion of 'data migrations', or should I just use a normal migration?
@leibo We use regular migrations for it.
Added 617 commits:
- defba3a6...448c19aa - 616 commits from branch gitlab-org:master
- e5786eef - Use a new table for user conribution stats
@yorickpeterse I've added the new migration and the schedule. Once you've gone over them I'll continue with the tests and contribution calendar.
- Resolved by username-removed-714523
- Resolved by username-removed-714523
@yorickpeterse Hi, just an update: I've made some of the changes, but I'm having trouble in my environment: for some reason, when I try to push my code to the remote branch, I get rejected. I've posted questions on the GitLab forum and IRC channel; I hope I can resolve this soon.
@yorickpeterse Hi again, I can't get this to work at the moment - very weird. Can you help in any way, or point me to someone who can?
@leibo What's the link of the support forum question?
@leibo You could try re-creating the project, perhaps that solves the problem. Alternatively, make sure that your SSH keys and all that are still set up properly.
@yorickpeterse Ok, I finally got some help in the IRC channel, take a look at the new migration :)
- Resolved by username-removed-714523
@yorickpeterse Ok, fixed the migration as discussed. WDYT? I'm not sure how to test it - do Sidekiq workers run by default when running GDK? How can I see if the job was scheduled, or whether it failed/succeeded?
@leibo All Sidekiq workers are executed by default, given the queues are in config/sidekiq_queues.yml. Testing is best done in two steps:
- Writing an RSpec test
- Scheduling a job from a Rails console (mostly to see if you might have missed anything)
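For example, from a Rails console (assuming the worker ends up being named UserContributionsWorker):

    # Both calls are standard Sidekiq; the worker name is an assumption.
    UserContributionsWorker.new.perform    # run the job inline, in the console process
    UserContributionsWorker.perform_async  # enqueue it so Sidekiq picks it up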
- Resolved by username-removed-714523
- Resolved by username-removed-714523
- Resolved by username-removed-714523
changed milestone to %8.16
mentioned in issue #22623 (moved)
@yorickpeterse @smcgivern Any conclusions? How should we progress with this?
@leibo Yes. If I understand @smcgivern correctly we should only add an extra minute of delay for every job, instead of delaying per month.
So:
- We set T to the current time
- Every job is scheduled to run at T + 15 minutes
- For every job we add an extra minute to the delay, so job 10 runs at T + 15 minutes + 10 minutes. This spreads the jobs out over the course of the next 6 hours
The initial delay of 15 minutes or so is needed to prevent multiple jobs from running by the time all migrations are done (as this could take a while).
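As a rough sketch of that scheduling (the worker name and per-day argument are assumptions):

    # Schedule one aggregation job per day of the past year, staggered by one
    # minute each, starting 15 minutes from now (T).
    (0...365).each do |index|
      date = (index + 1).days.ago.to_date
      UserContributionsWorker.perform_in(15.minutes + index.minutes, date.to_s)
    end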
@smcgivern: Does this sound correct?
Edited by yorickpeterse-staging
@yorickpeterse sounds good to me! This is assuming that a minute's delay is sufficient, of course.
@yorickpeterse I think this change is what you meant, take a look
@yorickpeterse regarding the workers: now when I run the migration, I expect many jobs to be added to Sidekiq (one job every minute) and then run. How can I see that they are indeed scheduled? How can I see that they ran? Do the Sidekiq workers start by default when starting the GDK process?
@leibo You can check what is currently in the queue by inspecting the Redis data. Using a Rails console you can do something like Sidekiq::Queue.new('queue_name_here').size to get the size. See https://github.com/mperham/sidekiq/wiki/API for more info.
- Resolved by username-removed-714523
- Resolved by username-removed-714523
- Resolved by username-removed-714523
mentioned in merge request !8495
@yorickpeterse Fixed latest comments, LMK what you think.
Hi @leibo - coming in a little late with some comments here, but I have some questions around this. To start, I'm a public contributor who created the MR https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8495 - that MR alters the user contribution calendar view to display all activity counts, which is why it has been associated with this one, as we have similar goals.
Note that I am happy for my MR to use this table and park that work until this is completed; however, let me wear a few hats here.
As a user: It looks as though these contributions are calculated on a daily basis. Is the expected behaviour such that users will have to wait a day until commits are displayed? I assume not, and that a Sidekiq worker will be fired to update the counts? If that is the case, then I think we are only complicating things by having a separate process and only alleviating some up-front CPU in reading the events table.
As an engineer: looking at the transaction code in https://gitlab.com/gitlab-org/gitlab-ce/issues/22623, unless there is row/table locking, that code is racy. I think Postgres is the underlying store, and isn't there an upsert method? That would ensure the db engine handles the concerns you raised. In addition, if you are wanting to mitigate table size, then drop the ID as the PK and use the user_id and date as a composite PK instead.
Also, even if we start reading contribution counts from this new table, will the actual event ever be removed from the events table? For example, I see an event being a commit. That's not something that will be removed, I'm sure, and we can still derive contributor info from it, effectively rendering this new table redundant.
As an architect: the reason for this MR is to downsize the events table. If this is true, then I don't think event data should be backed by a relational datastore. The events are append-only, e.g. logging, so an append-only data store would be ideal here - Kafka? This may go against the Boring Solution value, but you clearly have upper limits to storage space which can be removed by changing to a store built for this kind of information.
If I can help to get this MR into a release, that will help unblock my MR too. I'm in no way wanting to impede progress, but I've had to deal with aggregated append-only data before, and a relational database is not the way forward. Just wanted to make sure that we are all ok with the steps going forward. Thanks, Kyle
Edited by username-removed-91641
Addendum: Been out for a walk and thought more about this =]
As I see it (IMO), to remove/reduce the size of the events table properly at web scale, you would look to move the storage of append-only data to a more suitable storage engine, e.g. Kafka. Now I know that involves a whole heap more time and people, so that isn't going to be the solution at this stage.
Though, adding a new table for just user contribution events is an anti-pattern; you will end up having separate tables for each piece of aggregated data. That is not what an RDBMS is built for, and it doesn't solve the issue of size this MR is trying to address. It just shifts it.
So let's find some middle ground. Why don't we just create a Postgres view to do our aggregation for us? We then remove all the migration code and don't have to worry about background worker jobs. The events table will still be used for the storage, yes, but this MR doesn't reduce any size overall that I can see; it just shifts the problem. The lookup from the controller code will be super fast as the view will handle the aggregate values. This also means we can easily add more aggregate values as needed in future, and when a complete fix is implemented (Kafka), it will be easier to switch.
Thoughts? Kyle
Edited by username-removed-91641
mentioned in issue #14078 (moved)
I'll try to answer your questions since I'm the one who originally proposed making these changes.
It looks as though this contributions are calculated on a daily basis. Is the expected behaviour such that users will have to wait a day until commits are displayed?
Yes. Statistics will not be displayed in those little squares in the calendar until they have been aggregated. You can however still click on one and see the events for that day. This is intentional: the overhead of doing this on the fly is far greater than the value it adds.
I assume not and that a Sidekiq worker will be fired to update the counts?
No, the worker is scheduled daily.
If this is the case, then I think we are only complicating things by having a separate process and only alleviating some up-front CPU in reading the events table.
Not exactly. The logic used for this is pretty simple, and it greatly simplifies (and speeds up) the process of displaying it in a calendar.
looking at the transaction code in #22623 (moved), unless there is row/table locking, then that code is racy.
The transaction code was a sketch, and yes it is racy but that's OK due to the way it handles races. You can never insert duplicate data due to the database constraints that are present. The transaction sketch is not used in this MR as we went with a simpler approach.
if you are wanting to mitigate table size, then drop the ID as the PK and use the User_ID & Date as a composite PK instead.
Dropping the id column is not a bad idea since we don't need it.

Also even if we start reading contribution counts from this new table, will the actual event ever be removed from the event table?
Raw events are removed once they are older than a year (this was implemented a while ago). Contribution calendars in turn display only the data of the past year.
That's not something that will be removed I'm sure and we can still derive contributor info from it, effectively rendering this new table redundant.
I'm not sure if I understand this part. The changes in this MR only deal with the counts per user per date, nothing more.
If this is true, then I don't think event data should be backed by a relational datastore
I disagree. Events are inherently relational since they are tied to users, projects, etc. We're not going to introduce more complex systems (e.g. Kafka) just for this.
As I see it (IMO), to remove/reduce the size of the events table properly at web scale, you would look to move the storage of append-only data to a more suitable storage engine, e.g. Kafka. Now I know that involves a whole heap more time and people, so that isn't going to be the solution at this stage.
See above, we're not going to add another data store (besides the RDBMS and Redis) just for events data.
Though, adding a new table for just user contribution events is an anti-pattern; you will end up having separate tables for each piece of aggregated data. That is not what an RDBMS is built for, and it doesn't solve the issue of size this MR is trying to address. It just shifts it.
I disagree. Storing aggregated data in separate tables makes perfect sense. You don't want to dump all of this in a single table, as that is terrible for performance.
So let's find some middle ground. Why don't we just create a Postgres view to do our aggregation for us?
Because regular views are terribly slow; they just execute the query on the fly and create a temporary table with the results. At best this will perform the same as executing the underlying query directly, in the worst case it will be slower due to the temporary table creation overhead.
Materialized views are basically the same as what we're doing in this merge request, except MySQL doesn't support them.
The events table will still be used for the storage, yes, but this MR doesn't reduce any size overall that I can see; it just shifts the problem.
I think you are missing a crucial part here, something that is mentioned in https://gitlab.com/gitlab-org/gitlab-ce/issues/22623. This MR is not meant to reduce the size of events at all. Instead it is meant to allow us to get closer to removing data we don't need, as there is one component less using it directly. The second reason for this MR is performance: calculating these statistics on the fly just doesn't work.

To summarise, we'll continue with this MR. Introducing something complex like Kafka just for storing events is out of the question until we truly need it. Even then I'd rather opt for removing data/storing less, instead of using more complex systems.
- Resolved by username-removed-714523
- Resolved by username-removed-714523
- Resolved by username-removed-443319
@yorickpeterse Thanks for your detailed and prompt response. Good talking points, and it helps me better understand the end goal. I COMPLETELY agree about NOT complicating the codebase with a differing tech! Forgive my confusion around the downsizing of the events table; perhaps we should change the description under "Why was this MR needed?"
Good news in regards to being backed by MySQL: you can easily upsert with INSERT ON DUPLICATE KEY UPDATE. Though I still have two concerns with this MR.
- User experience with the expectation that today's commits should be viewable today.
It's the current user behaviour to see today's commits. Granted, someone can click on the clear white box of the calendar, but who will "really" do that to force a value to be displayed? If we are only pre-canning values for the previous day, it may make sense to change the UI to only show up to the previous day. From a user's point of view, it almost looks as though we are taking something away by not having the most recent contributions displayed. I see this being a bone of contention.
- Multiple tables for aggregated data.
I think you're missing my point, so let me clarify. I in no way want a single table with all aggregate data, that would be crazy, yes! But IMO you don't want to have to create a new table for each new piece of aggregate data, e.g. a user_login_count, user_forgot_password_attempts, user_commented_on_x - granted these are contrived examples, but they mimic this new table's behaviour. This is the bad pattern you want to steer away from, IMO. As this table still seems to be more of a key-value pair, may I ask what the reasons were for not using the existing Redis as the backing store? Doing so would remove the potentially bad pattern above. I take it the convenience of being able to return the data via the FK is the reason, which is all OK by me.
Regrettably I don't have the experience/insights that you do with the code base. I'm just trying to be pragmatic, looking at design decisions now that will have an impact in future. Thanks again for your responses; I wasn't aware that events are dropped after a year, so I'm glad that these values are going to be retained! K
Edited by username-removed-91641
@kylehqcom No problem, and thanks for bringing this up :)
User experience with expectation that todays commits should be viewable today.
This is an experience that will thus have to change. Either we document that today's counts are not visible in the calendar until the end of the day, or we change the tooltip so it shows something like "No data yet" instead of "0 contributions".
I suspect that some users may not like it, but the reasoning for this should be something our users can understand:
Generating this data on the fly won't work given enough users and events, and this limit can be reached fairly easily. Furthermore, the counts per date are not important enough to warrant degraded performance.
We'll have to make sure this is documented/announced properly, but I wouldn't worry too much about this.
But IMO you don't want to have to create a new table for each new aggregate data
The examples you give are simple counters of "The number of X for Y". What we are counting for user contributions is instead "The number of X for Y for Z", or in this case "The number of contributions per user per date". Data like this is fine to store in an RDBMS.
As this table still seems to be more of a key value pair, may I ask what were the reasons for not using the existing Redis as the backing store?
Using Redis there are two ways we could store this data (that I can think of):
- Keys using the format contributions/user-id/date, and the value being the contributions as an integer
- Keys using the format contributions/user-id, and the value being a list of counts per day (one value for every day)
The first solution requires 365 GET commands to get the data of the last year, though inserts are pretty easy (just a SET). The second solution is rather complex because:
- We need to pop the first value before adding a new one when the length is greater than 365
- Values are mapped to dates relative to the current date. For example, value 364 is for today, 363 for yesterday, etc. This means you can't get the counts for a given date; only for offsets relative to today
Meanwhile, using SQL we can do something like this:
SELECT * FROM user_contributions WHERE date >= now() - '1 year'::interval AND user_id = X
And boom, we have all the counts for the past year for a user. Updating the data in turn is just a simple insert for every (user, date) pair.

So in short, using an RDBMS has the benefits of:
- Not needing a new system
- Being persistent (remember that Redis, despite syncing to disk, is ephemeral)
- Making it easy to insert and query the data
- Making it easy to join related data (e.g. user information) on to the statistics, if needed
- Making data migrations easy as we can leverage Rails' database migration system, instead of having to do something on our own when using Redis
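As a rough ActiveRecord illustration of the query above (a sketch; it assumes the UserContribution model introduced by this MR):

    # Counts per day for one user over the past year, i.e. the calendar squares.
    class UserContribution < ActiveRecord::Base
      scope :for_last_year, ->(user) { where(user_id: user.id).where('date >= ?', 1.year.ago.to_date) }
    end

    UserContribution.for_last_year(user)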
I suspect that some users may not like it, but the reasoning for this should be something our users can understand:
Cool, as long as it's been aired & agreed, just a matter of changing expectations and behaviour =]
So in short, using an RDBMS has the benefits of:
Nice rationale, all OK by me.

Thank you for responding to my points! I will follow this MR so that I can utilize the new table in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8495 Best, Kyle
Edited by username-removed-91641
@yorickpeterse Ok, fixed the comments. Should I move on to modifying the contribution calendar to use it?
mentioned in merge request !8821 (merged)
@leibo Yes, you can go ahead with that.
@leibo As mentioned in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8821#note_22450468, a bunch of event types need to be filtered out, which we're currently not doing. To be specific, we only want to include events where (using pseudo SQL/Ruby):
- action = Event::PUSHED, or
- target_type in ('MergeRequest', 'Issue') AND action in (Event::CREATED, Event::CLOSED, Event::MERGED), or
- target_type = 'Note' AND action = Event::COMMENTED
@yorickpeterse said (comment was lost due to DB outage):
@leibo No, you should use the variables instead as the values may change in the future. Make sure you're using double quoted strings and use string interpolation (e.g. execute ".... #{Event::COMMENTED} ...").
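Putting the filter and the interpolation together, the migration/model code might look roughly like this (a sketch; the surrounding INSERT ... SELECT and column names are assumptions about the final code):

    # Sketch of the event filter using the Event constants via string interpolation.
    execute <<-SQL
      INSERT INTO user_contributions (user_id, date, contributions)
      SELECT author_id, #{connection.quote(date)}, count(*) AS contributions
      FROM events
      WHERE created_at >= #{connection.quote(date.beginning_of_day)}
        AND created_at <= #{connection.quote(date.end_of_day)}
        AND author_id IS NOT NULL
        AND (
          action = #{Event::PUSHED}
          OR (target_type IN ('MergeRequest', 'Issue')
              AND action IN (#{Event::CREATED}, #{Event::CLOSED}, #{Event::MERGED}))
          OR (target_type = 'Note' AND action = #{Event::COMMENTED})
        )
      GROUP BY author_id
    SQL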
@yorickpeterse ok, got it - done.
- Resolved by yorickpeterse-staging
added 1 commit
- 77e3fdc5 - example of usage of new user_contributions table
@yorickpeterse I've added an example of how the new user contribution table should be used in this temporary commit. What do you think? There seems to be a problem here - how do I redefine this line:
events.select {|event| event.visible_to_user?(current_user) }
- Resolved by yorickpeterse-staging
- Resolved by yorickpeterse-staging
added 5048 commits
- 77e3fdc5...57d5a549 - 5046 commits from branch gitlab-org:master
- e202ed84 - Use a new table for user conribution stats
- f47e2f12 - example of usage of new user_contributions table
added 1 commit
- 88971f58 - example of usage of new user_contributions table
added 1 commit
- e9b7f11e - example of usage of new user_contributions table
- Resolved by username-removed-714523
added 1 commit
- b1e8ec43 - example of usage of new user_contributions table
@yorickpeterse Fixed the above mentioned issues. What do you think about this one?

"@yorickpeterse I've added an example of how the new user contribution table should be used in this temporary commit. What do you think? There seems to be a problem here - how do I redefine this line: events.select {|event| event.visible_to_user?(current_user) }"
@leibo Where is that line used? For just the calendar we should display the counts directly, regardless of visibility. This means they'd show private contributions, but since they're only counts this shouldn't expose any unwanted information.
The list of events for a given date can be left untouched since it doesn't use the contribution calendar.
- Resolved by yorickpeterse-staging
added 1 commit
- efa31379 - example of usage of new user_contributions table
@yorickpeterse Regarding the permissions issue: Can you point me to where we should replace the call to the events table with the new call to the contribution table? I assume it's in user_contribution.rb, but I'm not sure exactly where.
- Resolved by yorickpeterse-staging
- Resolved by yorickpeterse-staging
added 1 commit
- 68546d8c - example of usage of new user_contributions table
@yorickpeterse Ok, I removed the inner worker and am using the original worker.
@yorickpeterse Can you outline the changes needed in contributions_calendar.rb?

Can you outline the changes needed in contributions_calendar.rb?

This file needs to be changed so that instead of aggregating events on the fly it uses user_contributions. I'm not exactly sure where this is done, but if the new setup produces the same output we probably only have to change a small amount of code.

mentioned in merge request gitlab-com/www-gitlab-com!4935 (merged)
@yorickpeterse Hi! I went over contributions_calendar.rb and have a fundamental question: right now, the behaviour on the calendar page is that when clicking a 'square', which represents a date, the user sees a full description of the events on that day. This will no longer be possible with the new table, as we only aggregate events. How do you think we should handle this?

@leibo It will still be possible since we don't use the contributions table for that. The contributions table is only used to display the counts per day (those squares) in the calendar.
@yorickpeterse got it, thanks!
added 1 commit
- 18df87e9 - example of usage of new user_contributions table
@yorickpeterse Added the usage of the new table to contributions_calendar.rb. WDYT?
- Resolved by username-removed-714523
- Resolved by username-removed-714523
@yorickpeterse Fixed your comments. I didn't understand your class comment - what part of the code do you want to move to a class? What class do you suggest creating?
@yorickpeterse Ok, good idea - fixed.
- Resolved by username-removed-443319
- Resolved by yorickpeterse-staging
@yorickpeterse Fixed your comments. Anything else or should we continue to tests?
@yorickpeterse One more point: I noticed the test "Every Sidekiq worker uses the cronjob queue when the worker runs as a cronjob" is failing. I want to change the queue to the cronjob queue as required by the test. Any objections?

@yorickpeterse Also note that this test: "only shows private events to authorized users" will fail by definition.

Anything else or should we continue to tests?

You can move on to the tests.

One more point: I noticed the test "Every Sidekiq worker uses the cronjob queue when the worker runs as a cronjob" is failing

That's OK, the worker can use that queue. Don't forget to also update the migration.

Also note that this test: "only shows private events to authorized users" will fail by definition.

If that test merely tests private event counts then I think we can get rid of it.
What's the status of this MR?
@blackst0ne Fixing existing failing tests and writing new ones for new functionality. Towards the end I guess...
@leibo, ok, thank you!
added 3063 commits
- 3ce835e4...37b5b7a5 - 3062 commits from branch gitlab-org:master
- 233a2e8f - Use a new table for user conribution stats
added 245 commits
- 26364140...ac67ce85 - 244 commits from branch gitlab-org:master
- d8ca4252 - Use a new table for user conribution stats
@yorickpeterse While fixing some failing tests, I discovered that while before our change the code was counting all events with action: Event::PUSHED regardless of the target, we now check only for such events with a specific target. WDYT? Should I fix this to ignore the target on such events? I've added a change that fixes this, tell me if it looks ok to you
Edited by username-removed-714523

@yorickpeterse Another issue from the tests: it seems from the failing tests that the DB does not accept date.beginning_of_day as a viable input. Should I use date.to_time.beginning_of_day instead?

we now check only for such events with a specific target. WDYT? Should I fix this to ignore the target on such events

If I look at the migration/model it seems to count PUSHED events without setting a target, matching the old behaviour. The target is only filtered for CREATED, CLOSED, MERGED, and COMMENTED events.

Should I use date.to_time.beginning_of_day instead?

Yes.

@yorickpeterse I've changed the database INSERT query to use date.to_time.beginning_of_day but the rspec test is still failing. Can you take a look? Also the rollback test is failing, can you look at that too? Thank you!

@leibo The latest pipeline is giving a 404 for me. Could you do a force push to trigger a new build? Also make sure this MR is up to date with master; we recently fixed a bunch of tests that would sporadically fail.
added 836 commits
- cb6cc2a1...46aadc5c - 835 commits from branch gitlab-org:master
- a37dddc5 - Use a new table for user conribution stats
@yorickpeterse I've rebased and pushed - a lot of new test failures :/ Anyway, the tests related to this MR are the db:rollback one, and rspec 8 20, where I can't run an INSERT query for some reason. Can you take a look?
@leibo Unfortunately the pipeline is giving a 404 for me, so I'm unable to check the output. Could you share the output of the db:rollback test here?

@yorickpeterse See attached.
@leibo It seems that MySQL doesn't like the index index_user_contributions_on_user_id_and_date being removed before the foreign key is removed. We can probably work around this by adding an explicit down method to the migration CreateUserContribution. This method can just drop the user_contributions table.
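Something as small as this should do (a sketch):

    # Explicit rollback for CreateUserContribution; dropping the table also
    # removes its indexes and foreign keys.
    def down
      drop_table :user_contributions
    end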
@yorickpeterse pipeline_rspec_17.txt
Thank you for that, it fixed the test. Can you look at the other failing pipeline? For some reason I can't insert a row into the UserContributions DB table. I've attached the failure file.
@leibo Your problem appears to be the following:
ActiveRecord::StatementInvalid: Mysql2::Error: Incorrect datetime value: '2017-05-01 00:00:00 +0000' for column 'created_at' at row 1:

    INSERT INTO user_contributions (user_id, date, contributions)
    SELECT author_id, '2017-05-01', count(*) AS contributions
    FROM events
    WHERE created_at >= '2017-05-01 00:00:00 +0000'
      AND created_at <= '2017-05-01 23:59:59 +0000'
      AND author_id IS NOT NULL
      AND (
        (target_type in ('MergeRequest', 'Issue') AND action in (1, 3, 7))
        OR (target_type = 'Note' AND action = 6)
        OR action = 5
      )
    GROUP BY author_id

    # ./config/initializers/connection_fix.rb:20:in `execute'
    # ./app/models/user_contribution.rb:5:in `calculate_for'
    # ./spec/features/calendar_spec.rb:187:in `block (4 levels) in <top (required)>'
    # ./spec/spec_helper.rb:65:in `block (2 levels) in <top (required)>'
    # ------------------
    # --- Caused by: ---
    # Mysql2::Error:
    # Incorrect datetime value: '2017-05-01 00:00:00 +0000' for column 'created_at' at row 1
    # ./config/initializers/connection_fix.rb:20:in `execute'
Most likely MySQL doesn't like the zone offset (+0000), so you'll need to get rid of that. The easiest way I can think of doing this is to simply quote the values using connection.quote so you get something like this:

    WHERE created_at >= #{connection.quote(date.beginning_of_day)}
      AND created_at <= #{connection.quote(date.end_of_day)}
@yorickpeterse rspec-12.txt Still failing - see attached file.
@leibo You quoted the data incorrectly. As shown in my example you should not manually add any quotes and only use connection.quote.

@yorickpeterse Ok, that worked, my bad! So now that all existing tests are passing, I'm moving on to my own tests for new functionality.
@yorickpeterse Hi, I've added a first version of the tests. Let me know what other tests I should add, and anything else. Take a look.
- Resolved by username-removed-714523
- Resolved by username-removed-714523
- Resolved by username-removed-714523
- Resolved by username-removed-714523
- Resolved by username-removed-714523
- Resolved by username-removed-443319
- Resolved by yorickpeterse-staging
@yorickpeterse Fixed your comments. What more should we do?
- Resolved by username-removed-443319
@leibo Some things that are worth checking:
- In the interface, does the contribution calendar still show the contributions correctly? This is mostly to make sure there's no difference between what the tests do, and what the runtime does.
- Could you add a changelog entry using bin/changelog?
- Some tests appear to be failing, could you look into those?
Other than that we'll need to run some tests on our staging/production database to see how the queries used and the migration will perform, but we'll do that once the tests and everything are passing.
added 4107 commits
- ec8e63e1...fc1090d9 - 4106 commits from branch gitlab-org:master
- 178c7cb0 - Use a new table for user conribution stats
@yorickpeterse I checked the interface, of course, and it looks ok; I added the changelog entry, and the failing tests don't look related to this MR. What do you think?
@leibo Since the MR is behind quite a bit (over 800 commits), could you rebase with master and force push? That might solve the spec failures too.
For future/personal reference:

The output of EXPLAIN ANALYZE for the query used to aggregate data is this:

    QUERY PLAN
    -------------------------------------------------------------------------------
    HashAggregate  (cost=324567.08..325121.72 rows=55464 width=44) (actual time=959.466..984.066 rows=52582 loops=1)
      Group Key: author_id
      ->  Index Scan using index_events_on_created_at on events  (cost=0.56..323412.00 rows=231015 width=4) (actual time=0.144..776.022 rows=251554 loops=1)
            Index Cond: ((created_at >= '2017-06-20 00:00:00+00'::timestamp with time zone) AND (created_at <= '2017-06-20 23:59:59+00'::timestamp with time zone))
            Filter: ((author_id IS NOT NULL) AND ((((target_type)::text = ANY ('{MergeRequest,Issue}'::text[])) AND (action = ANY ('{1,3,7}'::integer[]))) OR (((target_type)::text = 'Note'::text) AND (action = 6)) OR (action = 5)))
            Rows Removed by Filter: 20002
    Planning time: 0.225 ms
    Execution time: 995.561 ms
So we spend roughly 1 second to get the data of a single day. The exact time varies depending on the amount of events.
Should this become problematic we can change this to aggregating data every hour, but this will require additional logic to perform an upsert so we update existing data if present. Upserts are supported in both MySQL and PostgreSQL (albeit slightly differently), but this will require PostgreSQL 9.5 or newer if I'm not mistaken.
For now this should suffice.
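For reference, an hourly upsert on PostgreSQL could look roughly like this (a sketch, not part of this MR; it assumes a unique index on (user_id, date)):

    # Re-aggregate today's counts so far and upsert them (PostgreSQL 9.5+ syntax).
    execute <<-SQL
      INSERT INTO user_contributions (user_id, date, contributions)
      SELECT author_id, #{connection.quote(Date.today)}, count(*)
      FROM events
      WHERE created_at >= #{connection.quote(Date.today.beginning_of_day)}
        AND created_at <= #{connection.quote(Time.current)}
        AND author_id IS NOT NULL
      GROUP BY author_id
      ON CONFLICT (user_id, date)
      DO UPDATE SET contributions = EXCLUDED.contributions
    SQL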
added 1044 commits
- 60af6309...8fa464e6 - 1043 commits from branch gitlab-org:master
- d7d46a32 - Use a new table for user conribution stats
@yorickpeterse Ok, rebased, most tests are passing now.
- Resolved by username-removed-443319
assigned to @smcgivern
@smcgivern Database wise this looks good, could you (or somebody) else also take a look?
mentioned in issue #34308 (moved)
mentioned in issue #34151 (closed)
- Resolved by username-removed-443319
- Resolved by username-removed-443319
- Resolved by username-removed-443319
- Resolved by username-removed-443319
- Resolved by username-removed-443319
- spec/models/user_contribution_spec.rb 0 → 100644

    require "spec_helper"

    describe UserContribution, models: true do
      describe '.calculate_for' do
        it 'writes all user contributions for a given date' do
          contributor = create(:user)
          project = create(:empty_project)
          Event.create!(
            project: project,
            action: Event::CREATED,
            target: create(:issue, project: project, author: contributor),
            author: contributor,
            created_at: 1.day.ago
          )

          UserContribution.calculate_for(1.day.ago)
          expect(UserContribution.first.contributions).to eq(1)

@leibo thanks, really great work on this! Most of my comments are pretty minor, but I would like some more comprehensive tests for UserContribution.

assigned to @leibo
@leibo what is the status of this MR?
It blocks https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8495 which implements this very popular feature request: https://gitlab.com/gitlab-org/gitlab-ce/issues/14078
@blackst0ne As you can see from the conversation, there are still a few fixes I need to address. I will give this a look in the next few days to see if I have time to do this soon.
@blackst0ne @smcgivern Hi, unfortunately I won't be able to get to those fixes in the next few months. The MR is almost complete, just needs a few small touches. Ido
@leibo thank you for letting us know.
@smcgivern how can we finish this MR?
@yorickpeterse how important is this MR to you? If I finish it (while still giving credit to @leibo), would you be happy to review?
added coach will finish label
@smcgivern Fine by me
@leibo https://gitlab.com/leibo/gitlab-ce/tree/user-contribution-table gives me a 404
Can you make sure that the access rights for your fork are set up correctly, please?

assigned to @smcgivern
@yorickpeterse I've been looking at this, and there are a few failures after rebasing onto the latest master, which I think come down to a difference of opinion about what this feature does:
- https://gitlab.com/gitlab-org/gitlab-ce/issues/23403 - this changes the number of contributions you can see, depending on the visibility of the events involved. (Basic example: if all of your events are in private repos, and you view as an anonymous user, it looks like you had zero events.) If we change this, then the counts won't match the events you can see.
- https://gitlab.com/gitlab-org/gitlab-ce/issues/27616 and https://gitlab.com/gitlab-org/gitlab-ce/issues/1943 - these allow the events to be grouped by timezone, which isn't possible if we just store a date.
Wdyt? Is it still worth pursuing this?
assigned to @yorickpeterse
@smcgivern I would treat the calendar differently from the list of events displayed below it, and force the timezone to UTC. Ultimately the calendar is just an indication of how "busy" someone has been, so it doesn't have to be a 100% match to any filters and whatnot you may set.
The big benefit of this table would be that it allows us to remove more potentially expensive queries from the user profile. Caching doesn't really help since the first person hitting a cold cache will still have a bad experience.
So yes, I think it's definitely worth pursuing as long as our documentation is clear that the calendar doesn't display the exact same data as the event feeds.
Edited by yorickpeterse-staging
@yorickpeterse in that case, I won't have time to do this. The above issues show that people clearly do expect the events count to match the list of events, which it won't once this is in.
I'd recommend discussing this in the issue with the relevant PM (which is @jramsay, I think).
Thanks @smcgivern. With the impending change to show the count of all contributions regardless of privacy in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8495, I think the difference between the count in the calendar and the list is unavoidable and should be documented below the calendar.
No longer being able to see the count of contributions until the daily job runs is a little disappointing. I sometimes check it and expect to see results for today.
@yorickpeterse in its current state, does the contribution graph use UTC? Or the timezone of the contributor's profile? Dropping time will mean that early-Monday and late-Friday activity will look like Saturday and Sunday activity for people close to the date line, which is misleading.
@jramsay If we were to use some kind of background job it would use UTC as the timezone. I think the current implementation (the one in master, that is) uses your profile's timezone.

No longer being able to see the count of contributions until the daily job runs is a little disappointing. I sometimes check it and expect to see results for today.
It would be possible to run the job every hour and only calculate contributions for the hours since midnight. There are a few ways of doing this but they're all fairly easy to implement.
I think the current implementation (the one in master, that is) uses your profile's timezone.

Actually this is incorrect. Currently it uses the gitlab.yml time_zone setting. This was to allow for cases like a GitLab installation in Melbourne, in which everybody's contributions early in the day would look as though they took place on the prior calendar day (if it were using UTC time). The same could go for something like Honolulu, in which a contribution near the end of the workday would appear to take place on the following day. I think preserving this feature is worthwhile.

I think it would be really nice if users could customize this to their liking. In a global company like GitLab, users are in several timezones. My contribution calendar, for instance, shows a lot of activity on Saturday, but most of the time it's actually because I worked late on a Friday evening.

I don't think this would be terribly difficult to account for. Every profile could have a UTC offset setting for their calendar (which could default to the global setting in gitlab.yml), and the calendar content could be generated with that timezone in mind. If a user ever changes their preferred timezone, treat it as an expired cache and re-generate the calendar. It shouldn't be prohibitively expensive, since we're essentially generating this from scratch for a dynamic timezone right now.

Edited by username-removed-636429

@mikegreiling the idea is to not calculate this dynamically at all, though. I was going to say we could store by hour, but that won't work for non-whole-hour offsets (like India).
changed milestone to %Backlog
With regards to privacy, it needs to be easy for people to maintain their privacy, and activity on private projects is part of that. The current implementation protects privacy by calculating counts based on the permissions of the viewer.
Moving to a job based calculation approach would need to err on the side of caution, and aggregate public and private contributions so that the user can decide what they'd like to show.
For reference https://gitlab.com/gitlab-org/gitlab-ee/issues/1233
One option is this: when calculating the statistics we use two counters instead of 1:
- A counter that stores the public contributions (= contributions for public projects)
- A counter that stores private contributions (= contributions for private/internal projects)
If a user then sets their calendar to show both we'll just sum the two counters, otherwise we'll use the public one. This does however mean that changing project membership won't update past statistics, but I think that's not that big of a deal.
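Schematically, displaying the calendar count would then be something like (a sketch; the column and setting names are assumptions):

    # Two counters per (user, date) row; which ones are summed depends on the
    # user's calendar privacy setting (names here are assumptions).
    count = if user.show_private_contributions?
              row.public_contributions + row.private_contributions
            else
              row.public_contributions
            end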
@mikegreiling the idea is to not calculate this dynamically at all, though. I was going to say we could store by hour, but that won't work for non-whole-hour offsets (like India).
I think you misunderstood my comment. I'm not suggesting we calculate this dynamically. I was just illustrating that when/if a user were to change their preferred timezone, we'd have to recalculate the calendar again, which given that we're currently doing this dynamically, wouldn't be a terribly expensive operation. But 99.9% of the time this would be served from the pre-calculated database table.
One option is this: when calculating the statistics we use two counters instead of 1:
- A counter that stores the public contributions (= contributions for public projects)
- A counter that stores private contributions (= contributions for private/internal projects)
If a user then sets their calendar to show both we'll just sum the two counters, otherwise we'll use the public one. This does however mean that changing project membership won't update past statistics, but I think that's not that big of a deal.
what about a case in which a public project or group is suddenly made private, or vice versa? Won't we need a way to trigger a re-build of their contribution calendar?