Monitoring platform as an alternative to New Relic
Currently we're trying out New Relic as a performance monitoring tool for gitlab.com. While New Relic takes care of some of the basics, it has various shortcomings that make it less than ideal.
One of the most important parts of running any production application is monitoring quite literally everything (OK, almost everything). Every web request, Git command, API call, database query, system call, etc. should be monitored.
The average monitoring solution usually only does the following:
- Measure the time a web request took in total
- Measure the time per database query
- Divide request timings into categories (Ruby, database, middleware, etc)
If you're lucky you are also able to measure individual methods, though this almost always requires explicit instrumentation.
The problem here is that the above items only cover a portion of what you really want to monitor. More importantly, services such as New Relic don't make it easy (or possible at all) to insert custom metrics. While New Relic has "Insights", this service is extremely limited (based on using it for over a year). It targets less technical users (at the cost of flexibility) and doesn't come even remotely close to a proper time series database.
Another drawback of common solutions is that they all assume you're only building web applications. In New Relic, for example, most graphs display time in milliseconds without scaling it to seconds or minutes when needed. This means that for background workers it's quite common to see timings such as "40k milliseconds" instead of just "40 seconds".
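To illustrate the kind of scaling I mean, here is a minimal sketch in Python. The function name `format_duration` and the unit thresholds are my own assumptions for illustration, not something New Relic or any other tool provides:

```python
def format_duration(ms: float) -> str:
    """Render a duration given in milliseconds using the largest fitting unit."""
    if ms < 1000:
        return f"{ms:g} ms"
    seconds = ms / 1000
    if seconds < 60:
        return f"{seconds:g} s"
    minutes = seconds / 60
    if minutes < 60:
        return f"{minutes:.1f} min"
    return f"{minutes / 60:.1f} h"


print(format_duration(250))      # a fast web request
print(format_duration(40_000))   # a background job: "40 s" rather than "40k milliseconds"
print(format_duration(150_000))  # a longer background job
```

Whatever tool we pick should do something like this on the graph axis itself, so web requests and background jobs can share the same dashboards.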
Use Cases
Whatever system we pick should support at least the following features/use-cases:
- Custom dashboards and graphs, with support for more than just line graphs (histograms, stacked graphs, bar charts, hell, maybe even pie charts).
- The ability to insert custom metrics and aggregate them upon insertion (e.g. counters); basically what you'd usually get from a proper time series database.
- Authentication: we should not expose our monitoring to the public as most systems don't come with a permission system that allows defining read-only vs admin roles. If the system does have this I'm fine with exposing graphs to the public in a read-only fashion.
- Alerting (Slack, email, PagerDuty, etc), though this isn't a hard requirement (but definitely would be nice to have).
- FOSS and/or self-hosting support. This makes it possible for other users to re-use the same setup if they so desire without having to send all their data to some third party.
I'm specifically leaving error tracking out of this list as few services provide both and there are plenty of decent error tracking services out there (Rollbar, Bugsnag, etc).
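As a rough sketch of what "aggregate upon insertion" means for counters: the class below is a toy in-memory stand-in for a time series database, written in Python. `CounterStore`, its methods and the `git.push` metric name are all hypothetical and only there to show the idea of rolling increments up into fixed time buckets as they arrive:

```python
from collections import defaultdict


class CounterStore:
    """Toy in-memory stand-in for a time series database (hypothetical)."""

    def __init__(self, bucket_seconds: int = 60):
        self.bucket_seconds = bucket_seconds
        # (metric name, bucket start timestamp) -> aggregated count
        self.buckets = defaultdict(int)

    def increment(self, name: str, timestamp: float, by: int = 1) -> None:
        # Aggregate on insertion: round the timestamp down to the start of
        # its bucket and add to that bucket instead of storing every event.
        bucket_start = int(timestamp // self.bucket_seconds) * self.bucket_seconds
        self.buckets[(name, bucket_start)] += by

    def series(self, name: str):
        # Return (bucket start, count) pairs for one metric, oldest first.
        return sorted(
            (ts, count) for (n, ts), count in self.buckets.items() if n == name
        )


store = CounterStore(bucket_seconds=60)
for ts in (0, 10, 59, 60, 125):
    store.increment("git.push", ts)

print(store.series("git.push"))  # [(0, 3), (60, 1), (120, 1)]
```

A real time series database would handle storage, retention and querying for us; the point is only that inserting a metric should be about this cheap and simple from the application's side.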
Data to Monitor
Off the top of my head I can think of the following we should monitor at some point:
- The total time per transaction (web or background).
- Memory usage over time, unrelated to the transactions (see below).
- Ruby memory statistics (heap, etc), unrelated to the transactions (see below).
- Time spent per Git command (shell command or rugged library call).
- Time spent per DB query, per transaction (we should also be able to visualize this unrelated to the originating transaction).
- Time spent in important low-level I/O calls (fopen, fread, etc), which mostly applies to the background workers.
- Time spent in any external API calls (e.g. for service notifications).
- Benchmarking statistics from the test suite.
- Total build times per project, ideally for all projects (allowing us to easily see any CI performance improvements).
There's probably a lot more, but this is what I can think of right now. Ultimately I never want to run into a case where something performs badly and nobody has any idea why. This system should be our crystal ball, giving us an understanding of the unknown.
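Most of the timings listed above boil down to "wrap a block of code and record how long it took, plus some tags". A minimal sketch of that pattern in Python follows. The `measure` helper, the `recorded` list and the tag names are hypothetical, and `time.sleep` stands in for a real Git command; in practice the samples would be sent to whatever backend we end up picking:

```python
import time
from contextlib import contextmanager

# Hypothetical sink; a real setup would ship these to the metrics backend.
recorded = []


@contextmanager
def measure(name, **tags):
    """Record the wall-clock duration of the wrapped block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed operations are timed too.
        duration_ms = (time.perf_counter() - start) * 1000
        recorded.append({"name": name, "tags": tags, "duration_ms": duration_ms})


# A background transaction containing one (simulated) Git command.
with measure("transaction", kind="background"):
    with measure("git_command", command="gc"):
        time.sleep(0.01)  # stand-in for shelling out to Git

for sample in recorded:
    print(sample["name"], sample["tags"], round(sample["duration_ms"], 2))
```

Because the inner block finishes first, the Git timing is recorded before the enclosing transaction. Tagging each sample lets us graph Git or database timings on their own, unrelated to the originating transaction.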
Possible Solutions
In no particular order and unrelated to how suitable they are, here are some possible solutions:
- http://prometheus.io/
- https://appsignal.com/
- https://www.datadoghq.com/
- https://github.com/github/brubeck combined with Graphite (Graphite is a disaster to get up and running)
- https://influxdb.com/ combined with Grafana (http://docs.grafana.org/datasources/influxdb/)
Personally I'm interested in the combination of InfluxDB and Grafana. While it requires some manual work, Grafana at least comes with authentication support (via Google, GitHub, LDAP and a bunch of other services). However, it doesn't come with alerting as far as I can tell.
Datadog is another possible solution, though it's proprietary and as far as I can tell offers no self-hosting option.
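For the InfluxDB option, custom metrics could be written using its text-based line protocol (`measurement,tag=value field=value timestamp`). Below is a rough Python sketch of formatting a single point, assuming my understanding of the escaping rules is right. The helper names, the `sql_query` measurement and the tag/field names are made up for illustration, and the escaping should be checked against the InfluxDB documentation before relying on it:

```python
def _escape(value: str, chars: str) -> str:
    # Backslash-escape each of the given special characters.
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value) -> str:
    # bool must be checked before int, since bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Format one point as a line of text (hypothetical helper)."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += f",{_escape(key, ',= ')}={_escape(tags[key], ',= ')}"
    body = ",".join(
        f"{_escape(k, ',= ')}={_format_field(v)}" for k, v in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


line = to_line(
    "sql_query",
    {"host": "web1", "action": "ProjectsController#show"},
    {"duration_ms": 12.5, "rows": 3},
    1434055562000000000,
)
print(line)
```

Tags are indexed and meant for things we want to group or filter by (host, controller action), while fields hold the measured values themselves.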