
WIP: Use a new table for user contribution stats


What does this MR do?

It moves the user contribution stats to a new table, instead of calculating them from the events table.

Are there points in the code the reviewer needs to double check?

Why was this MR needed?

This will allow downsizing the events table.

Screenshots (if relevant)

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

#22623 (moved)

Activity

  • @leibo I posted some comments on the migration above.

  • Added 1 commit:

    • 0cd5c1c1 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Hi, I fixed according to your comments, WDYT?

  • Reassigned to @leibo

  • Added 1 commit:

    • 76160d06 - Use a new table for user conribution stats

    Compare with previous version

  • username-removed-714523 Resolved all discussions

  • @yorickpeterse fixed, moving on I think

  • @leibo Looks good so far :thumbsup:

  • Mentioned in issue #24344 (closed)

  • Added ~164274 label

  • Added 1 commit:

    • 31c2b3b1 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Added the model and Sidekiq worker, take a look.

  • username-removed-714523 Resolved all discussions

  • Added 661 commits:

    Compare with previous version

  • @yorickpeterse Ok, comments fixed, how do we move on? What else is needed?

  • Added 1 commit:

    • defba3a6 - Use a new table for user conribution stats

    Compare with previous version

  • @leibo Two things remain:

    1. The contribution calendar should use this new table
    2. Tests
  • @yorickpeterse I'm wondering about two more things:

    1. How will the user_contribution table be updated with all the data from the past?
    2. How do we tell the worker to run every day?
  • @leibo

    How will the user_contribution table be updated with all the data from the past?

    We can run a migration that populates the table with data from the past year. This migration may take quite a while to run so perhaps we need to split it into chunks and run those in parallel. Let's start with a simple migration and we can measure how long this takes on GitLab's staging environment.

    How do we tell the worker to run every day?

    In config/initializers/1_settings.rb we define various cronjobs using this pattern:

    Settings.cron_jobs['trending_projects_worker'] ||= Settingslogic.new({})
    Settings.cron_jobs['trending_projects_worker']['cron'] = '0 1 * * *'
    Settings.cron_jobs['trending_projects_worker']['job_class'] = 'TrendingProjectsWorker'

    We can probably run the worker around 01:00.
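
    For this MR the entry could look roughly like this (the settings key and worker name are illustrative, not final):

    Settings.cron_jobs['user_contribution_worker'] ||= Settingslogic.new({})
    Settings.cron_jobs['user_contribution_worker']['cron'] = '0 1 * * *' # daily at 01:00
    Settings.cron_jobs['user_contribution_worker']['job_class'] = 'UserContributionWorker'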

  • @yorickpeterse Does gitlab have a notion of 'data migrations', or should I just use a normal migration?

  • @leibo We use regular migrations for it.

  • Added 617 commits:

    Compare with previous version

  • @yorickpeterse I've added the new migration and the schedule. Once you've gone over them I'll continue to the tests and contribution calendar.

  • @yorickpeterse Hi, just an update: I've made some of the changes, but I'm experiencing trouble in my environment: for some reason when I try to push my code to the remote branch, I get rejected. I've posted questions on the GitLab forum and IRC channel, hope I can resolve this soon.

  • @yorickpeterse Hi again, I can't get this to work at the moment - very weird. Can you help in any way or point to someone who can help?

  • @leibo What's the link of the support forum question?

  • @leibo You could try re-creating the project, perhaps that solves the problem. Alternatively, make sure that your SSH keys and all that are still set up properly.

  • added 1 commit

    • 3973daa0 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Ok, I finally got some help in the IRC channel, take a look at the new migration :)

  • added 1 commit

    • e0cf9afb - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Ok, fixed the migration as discussed. WDYT? I'm not sure how to test it - do Sidekiq workers go up by default when running GDK? How can I see if the job was scheduled? Or if it failed/succeeded?

  • @leibo All Sidekiq workers are executed by default, as long as their queues are listed in config/sidekiq_queues.yml. Testing is best done in two steps:

    1. Writing an RSpec test
    2. Scheduling a job from a Rails console (mostly to see if you might have missed anything)
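
    For the second step, something along these lines should work from a Rails console (assuming the worker ends up being called UserContributionWorker; adjust to the real class name):

    UserContributionWorker.new.perform    # run the aggregation inline in the console
    UserContributionWorker.perform_async  # or enqueue it and let Sidekiq pick it up
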
  • changed milestone to %8.16

  • mentioned in issue #22623 (moved)

  • @yorickpeterse @smcgivern Any conclusions? How should we progress with this?

  • @leibo Yes. If I understand @smcgivern correctly we should only add an extra minute of delay for every job, instead of delaying per month.

    So:

    1. We set T to the current time
    2. Every job is scheduled to run at T + 15 minutes
    3. For every job we add an extra minute to the delay, so job 10 runs at T + 15 minutes + 10 minutes. This spreads the jobs out over the course of the next 6 hours

    The initial delay of 15 minutes or so is needed to prevent multiple jobs from running by the time all migrations are done (as this could take a while).
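
    As a rough sketch (class names and arguments are illustrative, the real migration may differ), the backfill migration could schedule the jobs like so:

    # One job per day of the past year, spread out over roughly the next 6 hours.
    # Passing the date to the worker is an assumption about its signature.
    start = Time.now

    (1.year.ago.to_date..Date.yesterday).each_with_index do |date, index|
      delay = 15.minutes + index.minutes
      UserContributionWorker.perform_at(start + delay, date.to_s)
    end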

    @smcgivern: Does this sound correct?

    Edited by yorickpeterse-staging
  • @yorickpeterse sounds good to me! This is assuming that a minute's delay is sufficient, of course.

  • added 1 commit

    • 2d6722b1 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse I think this change is what you meant, take a look

  • @yorickpeterse regarding the workers: now when I run the migration, I expect many jobs to be added to Sidekiq (one job every minute), and then run. How can I see that they are indeed scheduled? How can I see that they ran? Do the Sidekiq workers go up by default when starting the GDK process?

  • @leibo You can check what is currently in the queue by inspecting the Redis data. Using a Rails console you can do something like Sidekiq::Queue.new('queue_name_here').size to get the size. See https://github.com/mperham/sidekiq/wiki/API for more info.
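
    For example (the queue name here is just a guess; use whatever queue the worker ends up on):

    require 'sidekiq/api'

    Sidekiq::Queue.new('cronjob').size # jobs still waiting in the queue
    Sidekiq::ScheduledSet.new.size     # jobs scheduled to run later
    Sidekiq::RetrySet.new.size         # jobs that failed and will be retried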

  • yorickpeterse-staging mentioned in merge request !8495

  • added 1 commit

    • a047bd5c - Use a new table for user conribution stats

    Compare with previous version

  • username-removed-714523 resolved all discussions

  • @yorickpeterse Fixed latest comments, LMK what you think.

  • Hi @leibo - coming in a little late with some comments here, but I have some questions around this. To start, I'm a public contributor who has created the MR https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8495 - that MR alters the user's contribution calendar view to display all activity counts, hence why it has been associated with this one, as we have similar goals.

    Note that I am happy for my MR to use this table and park that work until this is completed, however let me wear a few hats here.

    As a user: It looks as though these contributions are calculated on a daily basis. Is the expected behaviour such that users will have to wait a day until commits are displayed? I assume not, and that a Sidekiq worker will be fired to update the counts? If this is the case, then I think we are only complicating things by having a separate process and only alleviating some up-front CPU in reading the events table.

    As an engineer: looking at the transaction code in https://gitlab.com/gitlab-org/gitlab-ce/issues/22623, unless there is row/table locking, that code is racy. I think Postgres is the underlying store, and isn't there an upsert method? This would ensure the DB engine handles the concerns you raised. In addition, if you are wanting to mitigate table size, then drop the ID as the PK and use the user_id & date as a composite PK instead.

    Also even if we start reading contribution counts from this new table, will the actual event ever be removed from the event table? For example, I see an event being a commit. That's not something that will be removed I'm sure and we can still derive contributor info from it, effectively rendering this new table redundant.

    As an architect: the reason for this MR is to downsize the events table. If this is true, then I don't think event data should be backed by a relational datastore. The events are append-only, e.g. logging, so an append-only data store would be ideal here (Kafka?). This may go against the boring solution value, but you clearly have upper limits to storage space which can be removed by changing to a store built for this kind of information.

    If I can help to get this MR into a release, that will help unblock my MR too. I'm in no way wanting to impede progress, but I've had to deal with aggregated append-only data before and a relational database is not the way forward. Just wanted to make sure that we are all ok with the steps going forward. Thanks Kyle

    Edited by username-removed-91641
  • Addendum: Been out for a walk and thought more about this =]

    As I see it (IMO), to properly remove/reduce the size of the events table at web scale, you would look to move the storage of append-only data to a more suitable storage engine, e.g. Kafka. Now I know that involves a whole heap more time and people, so that isn't going to be the solution at this stage.

    Though, adding a new table for just user contribution events is an anti-pattern: you will end up having separate tables for each piece of aggregated data. That's not what an RDBMS is built for, and it doesn't solve the size issue this MR is trying to address. It just shifts it.

    So let's find some middle ground. Why don't we just create a Postgres view to do our aggregation for us? We then remove all the migration code and don't have to worry about background worker jobs. The events table will still be used for the storage, yes, but this MR doesn't reduce any size overall that I can see, just shifts the problem. The lookup from the controller code will be super fast as the view will handle the aggregate values. This also means we can easily add more aggregate values as needed in future, and when a complete fix is implemented (Kafka), it will be easier to switch.

    Thoughts? Kyle

    Edited by username-removed-91641
  • mentioned in issue #14078 (moved)

  • @kylehqcom

    I'll try to answer your questions since I'm the one who originally proposed making these changes.

    It looks as though this contributions are calculated on a daily basis. Is the expected behaviour such that users will have to wait a day until commits are displayed?

    Yes. Statistics will not be displayed in those little squares in the calendar until they have been aggregated. You can however still click on one and see the events for that day. This is intentional; the overhead of doing this on the fly is far greater than the value it adds.

    I assume not and that a Sidekiq worker will be fired to update the counts?

    No, the worker is scheduled daily.

    If this is the case, then I think we are only complicating things by having a separate process and only alleviating some up-front CPU in reading the events table.

    Not exactly. The logic used for this is pretty simple, and it greatly simplifies (and speeds up) the process of displaying it in a calendar.

    looking at the transaction code in #22623 (moved), unless there is row/table locking, then that code is racy.

    The transaction code was a sketch, and yes it is racy but that's OK due to the way it handles races. You can never insert duplicate data due to the database constraints that are present. The transaction sketch is not used in this MR as we went with a simpler approach.

    if you are wanting to mitigate table size, then drop the ID as the PK and use the User_ID & Date as a composite PK instead.

    Dropping the id column is not a bad idea since we don't need it.

    Also even if we start reading contribution counts from this new table, will the actual event ever be removed from the event table?

    Raw events are removed once they are older than a year (this was implemented a while ago). Contribution calendars in turn display only the data of the past year.

    That's not something that will be removed I'm sure and we can still derive contributor info from it, effectively rendering this new table redundant.

    I'm not sure if I understand this part. The changes in this MR only deal with the counts per user per date, nothing more.

    If this is true, then I don't think event data should be backed by a relational datastore

    I disagree. Events are inherently relational since they are tied to users, projects, etc. We're not going to introduce more complex systems (e.g. Kafka) just for this.

    As I see it (IMO), to properly remove/reduce the size of the events table at web scale, you would look to move the storage of append-only data to a more suitable storage engine, e.g. Kafka. Now I know that involves a whole heap more time and people, so that isn't going to be the solution at this stage.

    See above, we're not going to add another data store (besides the RDBMS and Redis) just for events data.

    Though, adding a new table for just user contribution events is an anti-pattern: you will end up having separate tables for each piece of aggregated data. That's not what an RDBMS is built for, and it doesn't solve the size issue this MR is trying to address. It just shifts it.

    I disagree. Storing aggregated data in separate tables makes perfect sense. You don't want to dump all of this in a single table, as that is terrible for performance.

    So lets find some middle ground. Why don't we just create a Postgres View to do our aggregation for us?

    Because regular views are terribly slow; they just execute the query on the fly and create a temporary table with the results. At best this will perform the same as executing the underlying query directly, in the worst case it will be slower due to the temporary table creation overhead.

    Materialized views are basically the same as what we're doing in this merge request, except MySQL doesn't support them.

    The events table will still be used for the storage yes, but this MR doesn't reduce any size overall that I can see, just shifts the problem.

    I think you are missing a crucial part here, something that is mentioned in https://gitlab.com/gitlab-org/gitlab-ce/issues/22623. This MR is not meant to reduce the size of events at all. Instead it is meant to allow us to get closer to removing data we don't need as there is one component less using it directly. The second reason for this MR is performance, calculating these statistics on the fly just doesn't work.

    To summarise, we'll continue with this MR. Introducing something complex like Kafka just for storing events is out of the question until we truly need it. Even then I'd rather opt for removing data/storing less, instead of using more complex systems.

  • @yorickpeterse Thanks for your detailed and prompt response. Good talking points, and it helps me better understand the end goal. I COMPLETELY agree about NOT complicating the codebase with a differing tech! Forgive my confusion around the downsizing of the events table, perhaps we should change the description of "Why was this MR needed?"

    Good news in regards to being backed by MySQL: you can easily upsert with an INSERT ... ON DUPLICATE KEY UPDATE. Though I still have two concerns with this MR.

    1. User experience with the expectation that today's commits should be viewable today.

      It's the current user behaviour to see today's commits. Granted, someone can click on the clear white box of the calendar, but who will "really" do that to force a value to be displayed? If we are only pre-canning values for the previous day, it may make sense to change the UI to only show up to the previous day? From a user's point of view, it almost looks as though we are taking away something by not having the most recent contributions displayed. I see this being a bone of contention.

    2. Multiple tables for aggregated data.

      I think you're missing my point so let me help clarify. I in no way would want a single table with all aggregate data, that would be crazy, yes! But IMO you don't want to have to create a new table for each new piece of aggregate data, e.g. a user_login_count, user_forgot_password_attempts, user_commented_on_x - granted these are conflated examples, but they mimic this new table's behaviour. This is the bad pattern you want to steer away from IMO. As this table still seems to be more of a key-value pair, may I ask what were the reasons for not using the existing Redis as the backing store? Doing so would remove the potentially bad pattern above. I take it the convenience of being able to return the data via the FK is the reason, which is all OK by me.

    Regrettably I don't have the experience/insights that you do with the code base. I'm just trying to be pragmatic, looking at design decisions now that will have an impact in future. Thanks again for your responses, I wasn't aware that events are dropped after a year, so I'm glad that these values are going to be retained! K

    Edited by username-removed-91641
  • @kylehqcom No problem, and thanks for bringing this up :)

    User experience with expectation that todays commits should be viewable today.

    This is an experience that will thus have to change. Either we document that today's counts are not visible in the calendar until the end of the day, or we change the tooltip so it shows something like "No data yet" instead of "0 contributions".

    I suspect that some users may not like it, but the reasoning for this should be something our users can understand:

    Generating this data on the fly won't work given enough users and events, and this limit can be reached fairly easily. Furthermore, the counts per date are not important enough to warrant degraded performance.

    We'll have to make sure this is documented/announced properly, but I wouldn't worry too much about this.

    But IMO you don't want to have to create a new table for each new aggregate data

    The examples you give are simple counters of "The number of X for Y". What we are counting for user contributions is instead "The number of X for Y for Z", or in this case "The number of contributions per user per date". Data like this is fine to be stored in a RDBMS.

    As this table still seems to be more of a key value pair, may I ask what were the reasons for not using the existing Redis as the backing store?

    Using Redis there are two ways we could store this data (that I can think of):

    1. Keys using the format contributions/user-id/date, and the value being the contributions as an integer
    2. Keys using the format contributions/user-id, and the value being a list of counts per day (one value for every day)

    The first solution requires 365 GET commands to get the data of the last year, though inserts are pretty easy (just a SET). The second solution is rather complex because:

    1. We need to pop the first value before adding a new one when the length is greater than 365
    2. Values are mapped to dates relative to the current date. For example, value 364 is for today, 363 for yesterday, etc. This means you can't get the counts for a given date; only for offsets relative to today

    Meanwhile, using SQL we can do something like this:

    SELECT *
    FROM user_contributions
    WHERE date >= now() - '1 year'::interval
    AND user_id = X

    And boom, we have all the counts for the past year for a user. Updating the data in turn is just a simple insert for every (user, date) pair.

    So in short, using an RDBMS has the benefits of:

    1. Not needing a new system
    2. Being persistent (remember that Redis, despite syncing to disk, is ephemeral)
    3. Making it easy to insert and query the data
    4. Making it easy to join related data (e.g. user information) on to the statistics, if needed
    5. Making data migrations easy as we can leverage Rails' database migration system, instead of having to do something on our own when using Redis
  • @yorickpeterse

    I suspect that some users may not like it, but the reasoning for this should be something our users can understand:

    Cool, as long as it's been aired & agreed, just a matter of changing expectations and behaviour =]

    So in short, using an RDBMS has the benefits of:

    :thumbsup: Nice rationale, all ok by me.

    Thank you for responding to my points! I will follow this MR so that I can utilize the new table in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8495 Best Kyle

    Edited by username-removed-91641
  • username-removed-128633 added ~18308 ~480950 labels

  • added 1 commit

    • 0f0b86f6 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Ok, fixed the comments. Should I move on to modifying the contribution calendar to use it?

  • mentioned in merge request !8821 (merged)

  • @leibo Yes, you can go ahead with that.

  • @leibo As mentioned in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8821#note_22450468 a bunch of event types need to be filtered out, which we're currently not doing. To be specific, we only want to include events where (using pseudo sql/ruby):

    • action = Event::PUSHED or
    • target_type in ('MergeRequest', 'Issue') AND action in (Event::CREATED, Event::CLOSED, Event::MERGED) or
    • target_type = 'Note' and action = Event::COMMENTED
  • @yorickpeterse said (comment was lost due to DB outage):

    @leibo No, you should use the variables instead as the values may change in the future. Make sure you're using double quoted strings and use string interpolation (e.g. execute ".... #{Event::COMMENTED} ...").
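
    For illustration, the filter from the previous comment could then be built with the constants interpolated rather than hard-coded (a sketch only; a double-quoted string or heredoc is needed for #{} interpolation to work):

    action_filter = <<-SQL
      (
        action = #{Event::PUSHED}
        OR (
          target_type IN ('MergeRequest', 'Issue')
          AND action IN (#{Event::CREATED}, #{Event::CLOSED}, #{Event::MERGED})
        )
        OR (target_type = 'Note' AND action = #{Event::COMMENTED})
      )
    SQL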

  • added 1 commit

    • 4ff131bd - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse ok, got it - done.

  • Douwe Maan removed milestone

  • added 1 commit

    • 9ee90a91 - Use a new table for user conribution stats

    Compare with previous version

  • added 1 commit

    • 77e3fdc5 - example of usage of new user_contributions table

    Compare with previous version

  • @yorickpeterse I've added an example of how the new user contribution table should be used in this temporary commit. What do you think? There seems to be a problem here - how do I redefine this line:

    events.select {|event| event.visible_to_user?(current_user) }
  • added 5048 commits

    • 77e3fdc5...57d5a549 - 5046 commits from branch gitlab-org:master
    • e202ed84 - Use a new table for user conribution stats
    • f47e2f12 - example of usage of new user_contributions table

    Compare with previous version

  • username-removed-714523 marked as a Work In Progress from leibo/gitlab-ce@0619fb629dec69f28cb12b6ad896bbb38c841700

  • added 1 commit

    • 88971f58 - example of usage of new user_contributions table

    Compare with previous version

  • added 1 commit

    • e9b7f11e - example of usage of new user_contributions table

    Compare with previous version

  • added 1 commit

    • b1e8ec43 - example of usage of new user_contributions table

    Compare with previous version

  • @yorickpeterse Fixed the above-mentioned issues. What do you think about this one?

    "I've added an example of how the new user contribution table should be used in this temporary commit. What do you think? There seems to be a problem here - how do I redefine this line:

    events.select {|event| event.visible_to_user?(current_user) }"

  • @leibo Where is that line used? For just the calendar we should display the counts directly, regardless of visibility. This means they'd show private contributions, but since they're only counts this shouldn't expose any unwanted information.

    The list of events for a given date can be left untouched since it doesn't use the contribution calendar.

  • added 1 commit

    • efa31379 - example of usage of new user_contributions table

    Compare with previous version

  • @yorickpeterse Regarding the permissions issue: Can you point me to where we should replace the call to the events table with the new call to the contribution table? I assume it's in user_contribution.rb, but I'm not sure exactly where.

  • added 1 commit

    • 68546d8c - example of usage of new user_contributions table

    Compare with previous version

  • @yorickpeterse Ok, I removed the inner worker and am using the original worker.

  • @yorickpeterse Can you outline the changes needed in contributions_calendar.rb?

  • @leibo

    Can you outline the changes needed in contributions_calendar.rb?

    This file needs to be changed so that it uses user_contributions instead of aggregating events on the fly. I'm not exactly sure where this is done, but if the new setup produces the same output we probably only have to change a small amount of code.
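
    As a very rough illustration (not the actual implementation; the model and column names are the ones used elsewhere in this MR), the per-day counts could come straight from the new table:

    # date => contributions hash for the past year, read from user_contributions
    UserContribution
      .where(user_id: user.id)
      .where('date >= ?', 1.year.ago.to_date)
      .pluck(:date, :contributions)
      .to_h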

  • @yorickpeterse Hi! I went over contributions_calendar.rb and have a fundamental question: Right now, the behaviour on the calendar page is that when clicking on a 'square', which represents a date, the user sees a full description of the events on that day. This will no longer be possible with the new table, as we only aggregate events. How do you think we should handle this?

  • @leibo It will still be possible since we don't use the contributions table for that. The contributions table is only used to display the counts per day (those squares) in the calendar.

  • @yorickpeterse got it, thanks!

  • added 1 commit

    • 18df87e9 - example of usage of new user_contributions table

    Compare with previous version

  • added 1 commit

    • a1ccd439 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Added the usage of the new table to contributions_calendar.rb. WDYT?

  • added 1 commit

    • b493a138 - Use a new table for user conribution stats

    Compare with previous version

  • added 1 commit

    • 97b30fd5 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Fixed your comments. I didn't understand your class comment - what part of the code do you want to move to a class? What class do you suggest creating?

  • added 1 commit

    • 6503b834 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Ok, good idea - fixed.

  • added 1 commit

    • 04d6ec8d - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Fixed your comments. Anything else or should we continue to tests?

  • @yorickpeterse One more point: I noticed the test "Every Sidekiq worker uses the cronjob queue when the worker runs as a cronjob" is failing. I want to change the queue to the cronjob queue as required by the test. Any objections?

  • @yorickpeterse Also note that this test: "only shows private events to authorized users" will fail by definition.

  • username-removed-128633 changed title from WIP: Use a new table for user conribution stats to WIP: Use a new table for user contribution stats

  • @leibo

    Anything else or should we continue to tests?

    You can move on to the tests.

    One more point: I noticed the test "Every Sidekiq worker uses the cronjob queue when the worker runs as a cronjob" is failing

    That's OK, the worker can use that queue. Don't forget to also update the migration.

    Also note that this test: "only shows private events to authorized users" will fail by definition.

    If that test merely tests private event counts then I think we can get rid of it.

  • added 1 commit

    • 30a570bb - Use a new table for user conribution stats

    Compare with previous version

  • added 1 commit

    • 3ce835e4 - Use a new table for user conribution stats

    Compare with previous version

  • @leibo

    What's the status of this MR?

  • @blackst0ne Fixing existing failing tests and writing new ones for new functionality. Towards the end I guess...

  • added 3063 commits

    Compare with previous version

  • added 1 commit

    • 26364140 - Use a new table for user conribution stats

    Compare with previous version

  • added 245 commits

    Compare with previous version

  • @yorickpeterse While fixing some failing tests, I discovered that while before our change the code was counting all events with action: Event::PUSHED regardless of the target, we now only count such events with a specific target. WDYT? Should I fix this to ignore the target on such events? I've added a change that fixes this, tell me if it looks ok to you

    Edited by username-removed-714523
  • added 1 commit

    • 78b24039 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Another issue from the tests: it seems from the failing tests that the DB does not accept date.beginning_of_day as a viable input. Should I use date.to_time.beginning_of_day instead?

  • @leibo

    we now check only for such events with a specific target. WDYT? Should I fix this to ignore the target on such events

    If I look at the migration/model it seems to count PUSHED events without filtering on a target, matching the old behaviour. The target is only filtered for CREATED, CLOSED, MERGED, and COMMENTED events.

    Should I use date.to_time.beginning_of_day instead?

    Yes.

  • added 1 commit

    • cb6cc2a1 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse I've changed the database INSERT query to use date.to_time.beginning_of_day but the rspec test is still failing. Can you take a look? Also the rollback test is failing, can you look at that too? Thank you!

  • @leibo The latest pipeline is giving a 404 for me. Could you do a force push to trigger a new build? Also make sure this MR is up to date with master, we recently fixed a bunch of tests that would sporadically fail.

  • added 836 commits

    Compare with previous version

  • @yorickpeterse I've rebased and pushed - a lot of new test failures :/ anyway, the tests related to this MR are the DB:rollback, and rspec 8 20, where I can't run an INSERT query for some reason. Can you take a look?

  • @leibo Unfortunately the pipeline is giving a 404 for me, so I'm unable to check the output. Could you share the output of the db:rollback test here?

  • @leibo It seems that MySQL doesn't like the index index_user_contributions_on_user_id_and_date being removed before the foreign key is removed. We can probably work around this by adding an explicit down method to the migration CreateUserContribution. This method can just drop the user_contributions table.
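
    Something like this in that migration should do:

    def down
      drop_table :user_contributions
    end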

  • added 1 commit

    • 42b73ef9 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse pipeline_rspec_17.txt

    Thank you for that, it fixed the test. Can you look at the other failing pipeline? For some reason I can't insert a row into the user_contributions DB table. I've attached the failure file.

  • @leibo Your problem appears to be the following:

    ActiveRecord::StatementInvalid:
      Mysql2::Error: Incorrect datetime value: '2017-05-01 00:00:00 +0000' for column 'created_at' at row 1:       INSERT INTO user_contributions (user_id, date, contributions)
            SELECT author_id, '2017-05-01', count(*) AS contributions
            FROM events
            WHERE created_at >= '2017-05-01 00:00:00 +0000'
            AND created_at <= '2017-05-01 23:59:59 +0000'
            AND author_id IS NOT NULL
            AND (
              (
                target_type in ('MergeRequest', 'Issue')
                AND action in (
                  1,
                  3,
                  7
                )
              )
              OR (target_type = 'Note' AND action = 6)
              OR action = 5
            )
            GROUP BY author_id
    # ./config/initializers/connection_fix.rb:20:in `execute'
    # ./app/models/user_contribution.rb:5:in `calculate_for'
    # ./spec/features/calendar_spec.rb:187:in `block (4 levels) in <top (required)>'
    # ./spec/spec_helper.rb:65:in `block (2 levels) in <top (required)>'
    # ------------------
    # --- Caused by: ---
    # Mysql2::Error:
    #   Incorrect datetime value: '2017-05-01 00:00:00 +0000' for column 'created_at' at row 1
    #   ./config/initializers/connection_fix.rb:20:in `execute'

    Most likely MySQL doesn't like the zone offset (+0000), so you'll need to get rid of that. The easiest way I can think of doing this is to simply quote the values using connection.quote so you get something like this:

    WHERE created_at >= #{connection.quote(date.beginning_of_day)}
    AND created_at <= #{connection.quote(date.end_of_day)}
  • added 1 commit

    • 119e7232 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse rspec-12.txt Still failing - see attached file.

  • @leibo You quoted the data incorrectly. As shown in my example you should not manually add any quotes and only use connection.quote.

  • added 1 commit

    • 696a2811 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Ok, that worked, my bad! So now that all existing tests are passing, I'm moving on to my own tests for new functionality.

  • @yorickpeterse Hi, I've added a first version of the tests. Let me know what tests I should add and anything else. Take a look

  • @yorickpeterse Fixed your comments. What more should we do?

  • @leibo Some things that are worth checking:

    • In the interface, does the contribution calendar still show the contributions correctly? This is mostly to make sure there's no difference between what the tests do, and what the runtime does.
    • Could you add a changelog entry using bin/changelog?
    • Some tests appear to be failing, could you look into those?

    Other than that we'll need to run some tests on our staging/production database to see how the used queries and migration will perform, but we'll do that once the tests and everything are passing.

  • added 4107 commits

    • ec8e63e1...fc1090d9 - 4106 commits from branch gitlab-org:master
    • 178c7cb0 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse I checked the interface of course, and it looks ok, added the change log entry, and the failing tests don't look related to this MR. What do you think?

  • @leibo Since the MR is behind quite a bit (over 800 commits), could you rebase with master and force push? That might solve the spec failures too.

  • For future/personal references:

    The output of EXPLAIN ANALYZE for the query used to aggregate data is this:

                                                                                                                 QUERY PLAN                                                                                                              
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     HashAggregate  (cost=324567.08..325121.72 rows=55464 width=44) (actual time=959.466..984.066 rows=52582 loops=1)
       Group Key: author_id
       ->  Index Scan using index_events_on_created_at on events  (cost=0.56..323412.00 rows=231015 width=4) (actual time=0.144..776.022 rows=251554 loops=1)
             Index Cond: ((created_at >= '2017-06-20 00:00:00+00'::timestamp with time zone) AND (created_at <= '2017-06-20 23:59:59+00'::timestamp with time zone))
             Filter: ((author_id IS NOT NULL) AND ((((target_type)::text = ANY ('{MergeRequest,Issue}'::text[])) AND (action = ANY ('{1,3,7}'::integer[]))) OR (((target_type)::text = 'Note'::text) AND (action = 6)) OR (action = 5)))
             Rows Removed by Filter: 20002
     Planning time: 0.225 ms
     Execution time: 995.561 ms

    So we spend roughly 1 second to get the data of a single day. The exact time varies depending on the amount of events.

    Should this become problematic we can change this to aggregating data every hour, but this will require additional logic to perform an upsert so we update existing data if present. Upserts are supported in both MySQL and PostgreSQL (albeit slightly differently), but this will require PostgreSQL 9.3 or newer if I'm not mistaken.
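
    Should we ever go that route, a rough sketch of the PostgreSQL side could look like the following (INSERT ... ON CONFLICT needs PostgreSQL 9.5 or newer and assumes a unique index on (user_id, date); MySQL would use INSERT ... ON DUPLICATE KEY UPDATE instead):

    # `date` is the day being aggregated, as in the existing query.
    execute <<-SQL
      INSERT INTO user_contributions (user_id, date, contributions)
      SELECT author_id, #{connection.quote(date)}, count(*)
      FROM events
      WHERE created_at >= #{connection.quote(date.beginning_of_day)}
      AND created_at <= #{connection.quote(Time.current)}
      AND author_id IS NOT NULL
      -- plus the action/target_type filter used by the daily aggregation
      GROUP BY author_id
      ON CONFLICT (user_id, date)
      DO UPDATE SET contributions = EXCLUDED.contributions
    SQL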

    For now this should suffice.

  • added 1044 commits

    • 60af6309...8fa464e6 - 1043 commits from branch gitlab-org:master
    • d7d46a32 - Use a new table for user conribution stats

    Compare with previous version

  • @yorickpeterse Ok, rebased, most tests are passing now.

  • @smcgivern Database-wise this looks good, could you (or somebody else) also take a look?

  • mentioned in issue #34308 (moved)

  • mentioned in issue #34151 (closed)

  • require 'spec_helper'

    describe UserContributionWorker do
      describe "#perform" do
        it "calls the calculation of user contributions for the given date" do
  • describe UserContribution, models: true do
      describe '.calculate_for' do
        it 'writes all user contributions for a given date' do
          contributor = create(:user)
          project = create(:empty_project)
          Event.create!(
            project: project,
            action: Event::CREATED,
            target: create(:issue, project: project, author: contributor),
            author: contributor,
            created_at: 1.day.ago
          )

          UserContribution.calculate_for(1.day.ago)
          expect(UserContribution.first.contributions).to eq(1)
  • 1 require "spec_helper"
    2
    3 describe UserContribution, models: true do
    4 describe '.calculate_for' do
    5 it 'writes all user contributions for a given date' do
    6 contributor = create(:user)
    7 project = create(:empty_project)
    8 Event.create!(
    9 project: project,
    10 action: Event::CREATED,
    11 target: create(:issue, project: project, author: contributor),
    12 author: contributor,
    13 created_at: 1.day.ago
    14 )
  • @leibo thanks, really great work on this! Most of my comments are pretty minor, but I would like some more comprehensive tests for UserContribution.

  • @leibo what is the status of this MR? :slight_smile:

    It blocks https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8495 which implements this very popular feature request: https://gitlab.com/gitlab-org/gitlab-ce/issues/14078

  • @blackst0ne As you can see from the conversation, there are still a few fixes I need to address. I will give this a look in the next few days to see if I have time to do this soon.

  • @blackst0ne @smcgivern Hi, unfortunately I won't be able to get to those fixes in the next few months. The MR is almost complete, just needs a few small touches. Ido

  • @leibo thank you for letting us know. :thumbsup_tone1:

    @smcgivern how can we finish this MR? :slight_smile:

  • @yorickpeterse how important is this MR to you? If I finish it (while still giving credit to @leibo), would you be happy to review?

  • @leibo https://gitlab.com/leibo/gitlab-ce/tree/user-contribution-table gives me a 404 :thinking: Can you make sure that the access rights for your fork are set up correctly, please?

  • Actually, I can fetch from this repo, so it doesn't matter!

  • @yorickpeterse I've been looking at this, and there are a few failures after rebasing onto the latest master, which I think are a difference of opinion about what this feature does:

    1. https://gitlab.com/gitlab-org/gitlab-ce/issues/23403 - this changes the number of contributions you can see, depending on the visibility of the events you took. (Basic example: if all of your events are in private repos, and you view as an anonymous user, it looks like you had zero events.) If we change this, then the counts won't match the events you can see.
    2. https://gitlab.com/gitlab-org/gitlab-ce/issues/27616 and https://gitlab.com/gitlab-org/gitlab-ce/issues/1943 - these allow the events to be grouped by timezone, which isn't possible if we just store a date.

    Wdyt? Is it still worth pursuing this?

  • @smcgivern I would treat the calendar differently from the list of events displayed below it, and force the timezone to UTC. Ultimately the calendar is just an indication of how "busy" someone has been, so it doesn't have to be a 100% match to any filters and whatnot you may set.

    The big benefit of this table would be that it allows us to remove more potentially expensive queries from the user profile. Caching doesn't really help since the first person hitting a cold cache will still have a bad experience.

    So yes, I think it's definitely worth pursuing as long as our documentation is clear that the calendar doesn't display the exact same data as the event feeds.

    Edited by yorickpeterse-staging
  • @yorickpeterse in that case, I won't have time to do this. The above issues show that people clearly do expect the events count to match the list of events, which it won't once this is in.

    I'd recommend discussing this in the issue with the relevant PM (which is @jramsay, I think).

  • Thanks @smcgivern. With the impending change to show the count of all contributions regardless of privacy in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/8495, I think the difference between the count in the calendar and the list is unavoidable and should be documented below the calendar.

    No longer being able to see the count of contributions until the daily job runs is a little disappointing. I sometimes check it and expect to see results for today.

  • @yorickpeterse in its current state, does the contribution graph use UTC? Or the timezone of the contributor's profile? Dropping the time will mean that early-Monday and late-Friday activity will look like Saturday and Sunday activity for people close to the date line, which is misleading.

  • @jramsay If we were to use some kind of background job it would use UTC as the timezone. I think the current implementation (the one in master that is) uses your profile's timezone.

    No longer being able to see the count of contributions until the daily job runs is a little disappointing. I sometimes check it and expect to see results for today.

    It would be possible to run the job every hour and only calculate contributions for the hours since midnight. There are a few ways of doing this but they're all fairly easy to implement.

  • I think the current implementation (the one in master that is) uses your profile's timezone.

    Actually this is incorrect. Currently it uses the gitlab.yml time_zone setting. This was to allow for cases like a GitLab installation in Melbourne in which everybody's contributions early in the day would look as though they took place on the prior calendar day (if it were using UTC time). The same could go for something like Honolulu in which a contribution near the end of the workday would appear to take place on the following day. I think preserving this feature is worthwhile.

    I think it would be really nice if users could customize this to their liking. In a global company like GitLab, users are in several timezones. My contribution calendar, for instance, shows a lot of activity on Saturday, but most of the time it's actually because I worked late on a Friday evening.

    I don't think this would be terribly difficult to account for. Every profile could have a UTC offset setting for their calendar (which could default to the global setting in gitlab.yml), and the calendar content could be generated with that timezone in mind. If a user ever changes their preferred timezone, treat it as an expired cache and re-generate the calendar. It shouldn't be prohibitively expensive, since we're essentially generating this from scratch for a dynamic timezone right now.

    Edited by username-removed-636429
  • @mikegreiling the idea is to not calculate this dynamically at all, though. I was going to say we could store by hour, but that won't work for non-whole-hour offsets (like India).

  • removed assignee

  • changed milestone to %Backlog

  • With regards to privacy, it needs to be easy for people to maintain their privacy, and activity on private projects is part of that. The current implementation protects privacy by calculating counts based on the permissions of the viewer.

    Moving to a job based calculation approach would need to err on the side of caution, and aggregate public and private contributions so that the user can decide what they'd like to show.

    For reference https://gitlab.com/gitlab-org/gitlab-ee/issues/1233

  • One option is this: when calculating the statistics we use two counters instead of 1:

    1. A counter that stores the public contributions (= contributions for public projects)
    2. A counter that stores private contributions (= contributions for private/internal projects)

    If a user then sets their calendar to show both we'll just sum the two counters, otherwise we'll use the public one. This does however mean that changing project membership won't update past statistics, but I think that's not that big of a deal.
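
    Purely as an illustration of what that aggregation could look like (the two new columns are hypothetical; in GitLab a visibility_level of 20 means public):

    execute <<-SQL
      INSERT INTO user_contributions (user_id, date, public_contributions, private_contributions)
      SELECT events.author_id,
             #{connection.quote(date)},
             SUM(CASE WHEN projects.visibility_level = 20 THEN 1 ELSE 0 END),
             SUM(CASE WHEN projects.visibility_level < 20 THEN 1 ELSE 0 END)
      FROM events
      INNER JOIN projects ON projects.id = events.project_id
      WHERE events.created_at >= #{connection.quote(date.beginning_of_day)}
      AND events.created_at <= #{connection.quote(date.end_of_day)}
      AND events.author_id IS NOT NULL
      GROUP BY events.author_id
    SQL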

  • @mikegreiling the idea is to not calculate this dynamically at all, though. I was going to say we could store by hour, but that won't work for non-whole-hour offsets (like India).

    I think you misunderstood my comment. I'm not suggesting we calculate this dynamically. I was just illustrating that when/if a user were to change their preferred timezone, we'd have to recalculate the calendar again, which given that we're currently doing this dynamically, wouldn't be a terribly expensive operation. But 99.9% of the time this would be served from the pre-calculated database table.

  • One option is this: when calculating the statistics we use two counters instead of 1:

    1. A counter that stores the public contributions (= contributions for public projects)

    2. A counter that stores private contributions (= contributions for private/internal projects)

    If a user then sets their calendar to show both we'll just sum the two counters, otherwise we'll use the public one. This does however mean that changing project membership won't update past statistics, but I think that's not that big of a deal.

    What about a case in which a public project or group is suddenly made private, or vice versa? Won't we need a way to trigger a rebuild of their contribution calendar?
