BI at GitLab
We currently have no single source of truth (SSOT), systems or people in place at GitLab to collect and analyse data and make decisions based on it.
To help set a little context, if you have 20 minutes to spare, here's a great podcast with Redpoint's Tom Tunguz on data supply chain: http://www.thetwentyminutevc.com/tomasztunguz/ If you enjoy reading, I'd also suggest his book: https://www.amazon.co.uk/Winning-Data-Transform-Culture-Empower/dp/1119257239
Recommendations
- Hire dedicated, specialist BI headcount with domain experience, including experience of product selection (Redshift/Looker/Mixpanel/Heap/Segment/Tableau), unless we have internal candidates with these skills.
- Evaluate BI tool chain, including storage infrastructure and analytics tooling
- Define a set of initial priority metrics
- Implement solution on GitLab.com - need to understand how to do this in a way that's friendly to open source - perhaps as an integration
- Collate customer data from version.gitlab.com into BI solution
- Explore ways to do this on self-hosted GitLab instances so this data is (voluntarily) fed back to us
Examples of issues
1. Lacking SSOT
Issues such as https://gitlab.com/gitlab-com/infrastructure/issues/1207 shouldn't really exist. We should have a place where anyone can go that not only graphs these metrics over time, but also allows people to easily query the data.
In the same issue, we are running analysis on active projects, whereas a similar query was dealt with in a different way previously. Without an SSOT or a place to go to for data, we are not re-using prior knowledge and effort, which is inefficient.
2. Data not stored for analytics
Our customer data on GitLab.com is architected and stored specifically for the GitLab application. This makes complete sense. What this means is that querying the data in different ways, in particular for cohort analysis, distribution analysis, aggregation, etc., is not effective.
In the GitLab.com data analysis that was done recently, these queries could not be executed against the live data set for performance reasons. Instead, we had to run them against stale data in staging.
There are numerous technology solutions that are designed to store data for the purposes of insight. To the best of my knowledge, Prometheus isn't designed for this type of use case.
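As a rough illustration of the cohort analysis mentioned above, here is a minimal Python sketch. It assumes activity events have already been exported into a simple analytics-friendly shape of (user, signup month, activity date). The event rows, field layout and function names are hypothetical and are not taken from GitLab's schema or from any specific BI product.

```python
from collections import defaultdict
from datetime import date

# Hypothetical activity events: (user_id, signup_month, activity_date).
# These rows are made up for illustration; they are not GitLab data.
EVENTS = [
    ("u1", "2017-01", date(2017, 1, 15)),
    ("u1", "2017-01", date(2017, 2, 3)),
    ("u2", "2017-01", date(2017, 1, 20)),
    ("u3", "2017-02", date(2017, 2, 10)),
    ("u3", "2017-02", date(2017, 3, 1)),
    ("u3", "2017-02", date(2017, 4, 5)),
]


def months_between(signup_month, day):
    """Whole calendar months from the signup month to the activity date."""
    year, month = (int(part) for part in signup_month.split("-"))
    return (day.year - year) * 12 + (day.month - month)


def cohort_retention(events):
    """Return {signup_month: {months_since_signup: active_user_count}}."""
    # Collect the distinct users active in each (cohort, month offset) cell.
    active = defaultdict(lambda: defaultdict(set))
    for user_id, signup_month, day in events:
        offset = months_between(signup_month, day)
        active[signup_month][offset].add(user_id)
    # Reduce each set of users to a count, with cohorts and offsets sorted.
    return {
        cohort: {offset: len(users) for offset, users in sorted(by_offset.items())}
        for cohort, by_offset in sorted(active.items())
    }


if __name__ == "__main__":
    for cohort, counts in cohort_retention(EVENTS).items():
        print(cohort, counts)
```

Each cohort maps to the number of distinct users active in each month since signup. Against an application-shaped schema, the same question would typically need joins and scans across large live tables, which is the performance problem described above.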
3. Lack of customer data, in particular, feature usage
In 9.0 we are removing the sidebar. Regardless of whether or not this is a good thing to do, we have absolutely no data on how many people use the sidebar or have intentionally pinned it.
We have no easily consumable dashboards to understand basic usage of features, how frequently they are used and, if we make product enhancements, how these affect customer usage.
Piwik doesn't even appear to work properly to give us a sense of page popularity.
4. No dedicated resources available to query and analyse data
We don't have any people in GitLab whose primary job is to query data. This means that we are competing for resources with other initiatives and begging, borrowing and stealing to get answers.