We are getting to the point where we are not really happy with CheckMK.
The main reasons are:
it's not trivial to add metrics, and the plugins on offer don't give us enough control.
on a bad network weather day we get a lot of false positives.
the alerting capabilities seem limited, too binary.
the UI is too complex, and it's hard (or impossible) to build dashboards that show system health in one view.
we are maintaining two monitoring tools: InfluxDB for performance and CheckMK for host monitoring.
It would be interesting to try something else and see what the capabilities are. Initial conversations leaned towards hiring a service; @stanhu is handling that possibility.
The next possibility was to reconsider Prometheus, since it is a time series database that also includes alerting.
Nothing decided here, just gathering thoughts.
Possible systems to try (please gather pros and cons of each):
I like Sensu too, although its job is mostly alerting/checking rather than providing a central dashboard for visualizing data. It's easy to add checks.
It works really well in dynamic environments because clients can self-register on-the-fly. Configuration is also easy - it's all JSON, and it also supports custom metadata so you can send additional details about a check. This includes Runbooks (Yelp does this), or links to graphs/dashboards, etc. Yelp has done lots of talks and open-sourced a lot of the stuff they use with Sensu. See http://www.slideshare.net/solarkennedy/sensu-yelp-a-guided-tour as a starting point. Uchiwa is the open-source dashboard, or there is an Enterprise dashboard, too. It looks like this has improved a lot recently and they have full RBAC now (plus a GitLab integration, somehow). Sensu supports all Nagios checks out of the box and writing new ones in Ruby is easy with their tools.
The main downside is that Sensu has quite a few moving parts. You need Sensu server, Redis, RabbitMQ and then the clients. Clients talk directly to RMQ and then Sensu processes and stores data in Redis. Most of these components shouldn't be a problem for us except RabbitMQ. In my experience, though, RMQ is really stable and overall easy to manage. Cookbook support was good when I last used it 7 months ago.
Great summary, @dblessing. One thing I will add is that Sensu doesn't provide a timeseries database to store metrics. You still need a solution for that, and it looks like Prometheus attempts to provide that.
I looked at Prometheus briefly last year. It was fairly easy to set up, but at the time it only supported their own dashboard, PromDash. They have since integrated with Grafana. However, at the time I also wasn't convinced their database could handle the amount of timeseries data that we send to InfluxDB.
I think the action item here is to set Prometheus up and see how it handles some basic data.
I do like pushing time series to the same system and having the ability to map application performance with systems metrics (a thing we can't do right now since we have 2 separate systems).
I'll investigate Prometheus a bit deeper and try to draft a plan for how we can start probing this.
Regarding handling the load: we had to invest a lot of time into InfluxDB before we managed to actually write the metrics without losing (too much) data, and we are still having issues with it locking up from time to time, so I'm not sure we can say it is actually handling the load.
It's worth noting that Sensu does have the ability to send metrics like load in addition to simple availability. It does not store those in time-series but there are available handlers to ship to InfluxDB or Graphite. When I used Sensu previously we shipped those metrics to Graphite alongside all of our other time-series data. Worked pretty well. The limiting factor is the check interval. Sensu may only poll system load every 3 minutes, but a metric collector like CollectD polls every 10 seconds or something. Depends on what you want/need.
Prometheus is inspired by Borgmon - the monitoring system for Borg (Google's internal fleet-management tool) - and is built by ex-Google people, so it is designed to scale to insane numbers of hosts and services.
After a brief discussion with @jnijhof we came to the conclusion that we need to write down the questions we want the pros/cons to answer.
Initially it looks like we want to keep support for things like monitoring SSL certificate expiration. We just need to write down the full list of what would make the perfect solution: one that fully covers CheckMK and also provides the feature set of Grafana/InfluxDB.
I want to lay out some principles and have us try to decide.
My main gist is this: the fleet is small, checkmk gives me the willies (+ #33 (closed)), if you look at influx funny it will fail badly, the alerting must ultimately rely on your datastore (influx) and so it really shouldn't fail, and infra metrics and application metrics are orthogonal.
So with all of that in mind I believe we want one well tuned influx server dedicated to infra metrics only, with collectd as the sole sender.
Uses minimal resources and is battle tested.
A few key relevant plugins and recent improvements (v5.5+) will take us very far. It's entirely possible we can get away with using one short and simple config across the entire fleet.
The measurements
The big ones: CPU, Memory, Disk - percentage used.
Everything else is gravy and should be added thoughtfully. Infra focused. Add stuff like inodes and process counts slowly one by one.
Why not a pull method like Prometheus?
In short: we can always try both simultaneously and decide later, this is just faster to start with.
The advantages of pull over push exist but are small and if you keep reading you'll see that there's an answer for everything.
The real consideration is not the model but the performance of the specific tools (influx vs Prometheus).
The Time-Series Store - Influxdb
Still a bit of a nightmare sometimes to run.
Upgrading versions is often an adventure
Upgrading versions is often necessary (#232 (closed))
Don't dump everything into one influx! - Infra metrics go into a dedicated influx infra server
Consider a buffer
I would go as far as using something like Riemann as a forwarder.
While it's a very interesting tool in and of itself, don't get distracted by the many features it duplicates from the setup above; the idea is that you can transparently send to two or more InfluxDBs via Riemann. A repeater/splitter.
This way you can test an InfluxDB upgrade instead of rolling the dice, or run an active-active setup, get finer sharding control, and even aggregate a bit before forwarding to ease the load on InfluxDB.
One can only assume one day this bit will be totally unnecessary but we are babying influx. Avoid if possible.
The big pro is that in case whatever sits downstream of Riemann fails you can still see what's going on and still have alerts.
Bird's Eye View of the Fleet
Templated Grafana outclasses checkmk in every way.
If a server disappears you will see it missing immediately, you can alert on the sudden absence of points as well.
Again, the fleet is small; all the Azure, DO, AWS, foo instances can and should be accurately reflected here and accessible to all. So at least in terms of infra you know the situation.
Alerting KISS
Gently query influx with a simple script, send to slack (and wherever else).
I actually think we can get this done this week and that we need this last week.
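To make this concrete, a minimal sketch of such a script, assuming InfluxDB's HTTP query API (0.9+) and a Slack incoming webhook; the database name, measurement, field, threshold and webhook URL below are placeholders:

```python
#!/usr/bin/env python
# Minimal sketch of the "query influx, send to slack" idea.
# Database, measurement, field, threshold and webhook URL are placeholders.
import requests

INFLUX_URL = "http://influx-infra.example.com:8086/query"
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
THRESHOLD = 90.0  # percent disk used

def check_disk_usage():
    # InfluxDB HTTP query API: returns JSON series, grouped here by host tag.
    params = {
        "db": "infra",
        "q": "SELECT last(used_percent) FROM disk GROUP BY host",
    }
    resp = requests.get(INFLUX_URL, params=params, timeout=10)
    resp.raise_for_status()
    alerts = []
    for series in resp.json()["results"][0].get("series", []):
        host = series["tags"]["host"]
        used = series["values"][0][1]
        if used is not None and used > THRESHOLD:
            alerts.append("{}: disk at {:.1f}%".format(host, used))
    return alerts

def notify(alerts):
    # Slack incoming webhooks accept a simple JSON payload with a text field.
    if alerts:
        requests.post(SLACK_WEBHOOK, json={"text": "\n".join(alerts)}, timeout=10)

if __name__ == "__main__":
    notify(check_disk_usage())
```

Cron it every minute or two and the alerting depends on nothing but influx and the network.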
check_mk has a problem when a host node doesn't respond within the specified amount of time: the node's checks disappear entirely, except for the ones run from outside the host. Whatever solution we choose next must take such situations into account as well.
@maratkalibek Agreed, please get all the ideas that have been thrown around and update the description of this issue (which I'm assigning to you at this point)
Also, could you please write a plan covering how we are going to test the different options, how we plan to replace the system we have now with something better, how we plan to migrate what we monitor right now, and what the main criteria are that we need to focus on to make the final decision?
Please use lists or check blocks to keep track of what the next step is and where we stand right now.
"symptom-based monitoring," in contrast to "cause-based monitoring". Do your users care if your MySQL servers are down? No, they care if their queries are failing. (Perhaps you're cringing already, in love with your Nagios rules for MySQL servers? Your users don't even know your MySQL servers exist!) Do your users care if a support (i.e. non-serving-path) binary is in a restart-loop? No, they care if their features are failing. Do they care if your data push is failing? No, they care about whether their results are fresh.
I'm personally investigating Prometheus and it seems to be quite interesting and powerful.
Instead of clustering it offers federation, which could open the door to having a metrics server for ourselves while publishing a set of pre-chewed metrics in the open so anyone can go and check, like these people are already doing: http://demo.robustperception.io:9090/consoles/index.html
The way we can write "plugins" is to just write metrics to a file and have the node-exporter export them by watching a directory.
A different way is by pushing metrics to a pushgateway.
In any case, I think that this lowers the barrier for monitoring because just writing to a file is enough to add monitoring to a system.
This could also open the door to stop sending udp packets to influxdb for performance metrics and just write them to a file, which will allow us to enable this feature even in CI as it would simplify the setup.
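To make the write-to-a-file idea concrete, a minimal sketch assuming the Python prometheus_client library; the metric names and the directory are made up, and the node exporter's textfile collector has to be pointed at that directory:

```python
# Sketch of a file-based "plugin": a job writes its metrics into the directory
# the node exporter's textfile collector watches. Names and path are illustrative.
import time
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

TEXTFILE_DIR = "/var/lib/node_exporter/textfile"  # hypothetical path

registry = CollectorRegistry()
last_run = Gauge("gitlab_backup_last_run_timestamp_seconds",
                 "Unix timestamp of the last backup run", registry=registry)
duration = Gauge("gitlab_backup_duration_seconds",
                 "How long the last backup took", registry=registry)

start = time.time()
# ... run the actual job here ...
duration.set(time.time() - start)
last_run.set(time.time())

# write_to_textfile writes to a temp file and renames it, so the exporter
# never serves a half-written file.
write_to_textfile(TEXTFILE_DIR + "/backup.prom", registry)
```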
It looks like Prometheus in general follows the Unix idea of having really small tools that know how to do just one thing really well, so we can compose the system as we want.
All the systems are separated and only care about one thing (SRP), so we have the metrics database, something to build graphs (grafana or promdash, which is getting deprecated), push gateways or metrics exporters, and finally a really simple alerting system that is configured with rules in a file.
Ok, my notes on prometheus after setting it up at home:
Prometheus
TL;DR
We should build a federated set of nodes: some for the workers, some for other services, one node aggregating and exposing sanitized metrics to a public node, and two Grafana servers - a public one consuming the sanitized metrics, and another with both the InfluxDB and Prometheus sources so we can build dashboards with application metrics and system metrics side by side.
There is endless power in the simplicity, flexibility and composability of this system.
The good
Composability - one Prometheus server can scrape another, building a federated, hierarchical set of nodes. Alerting is a separate component. They pipe into each other like good old Unix.
Simplicity - anything that can serve an HTTP page can serve metrics, no configuration required.
Scale - it is achieved in a different way than clustering: by aggregating metrics before ingesting them, by federating nodes and distributing the load naturally, and by using Google's LevelDB to store the metrics.
Flexibility - connecting to a node presents a browsing page where you can start building graphs or ask whatever you can think of, querying and shaping the data live. The query language is simple (reminds me of Lisp) and can be mastered in a couple of minutes. The server will also show you the metrics it has as plain text, which lets you investigate things you hadn't even thought about.
Multi-dimensional dataset - one metric has many labels. This way we can store the data of every node separately, build an aggregation for the quick dashboards, and still query for specifics if we need to dive deep into a particular node, grouping the data in many ways.
Smells like Unix - small components that know how to do one thing really well: alerting is handled by the alerting component, publishing metrics is handled by the node-exporter or the push-gateway (two different tools for two different goals), and it is all glued together by as many Prometheus servers as you want.
Lots of metrics exporters already - and we could build our own just by serving plain HTTP; the language doesn't matter, and no low-level networking knowledge is required (boring! yay!)
Pull model - contrary to the push model we are so used to: if Prometheus cannot scrape a server, it is down and you will get an alert. Scraping happens at whatever interval you decide (5s by default), letting you reduce network traffic instead of sending a gazillion UDP packets with nanosecond+random precision so they don't collide. The scraped server is the one doing the aggregation and serves this data already digested, so no data loss.
File-based configuration - not an outdated web interface you can't figure out: if you want to add a rule, you write it in text, push, and SIGHUP. This allows us to keep our monitoring configuration in a git repo, and also to share it or add it to the GitLab omnibus package.
Tiny memory footprint - for one host, both the node-exporter and Prometheus (running inside Docker), with a week of metrics so far, have left a resident memory footprint of 8MB.
The bad
We are not using it, yet.
Still not 1.0, some changes in the APIs are expected to happen.
But it has been used in production already.
I can't honestly think of anything here.
Summary
Prometheus is a different kind of monitoring system. It is built using the Unix philosophy of having a small set of composable pieces, each of which knows how to do one thing really well. Then you glue all the bits together over HTTP.
It does not compare directly to things like nagios or checkmk because it works in a different way.
The way this is thought out is to use metrics as the source of truth, to aggregate them before ingestion, and to use rules to trigger alerting whenever a metric goes past a given threshold. This results in getting alerts while a system is degrading, not once it has stopped or crashed 5 minutes ago. The whole spirit of this way of monitoring is to reduce the noise on a bad network weather day, and to simplify maintaining and evolving the setup.
The main components are:
Prometheus the service - works as a web server and a polling service that scrapes the configured URLs at the designated rhythm. It stores these metrics in LevelDB, using recording rules to aggregate and digest data.
LevelDB is a sorted key-value store developed by Google. I think it will scale just fine.
This service can also expose selected metrics to another Prometheus server, enabling federation of servers.
Services can be registered in two ways: by changing the configuration file and sending a HUP signal, or by using a service-discovery mechanism that enables dynamic registration.
Alerting - a separate system handles forwarding and deduplication of alerts: Prometheus sends alerts to this gateway, and it takes care of notifying or silencing. This system is separate from Prometheus the service, and we could have more than one, each dealing with different metrics and different sets of rules on how to react.
Whatever exports metrics - this can be done by any web application; there is also a node exporter that even supports being fed with raw text files.
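To show how little the last piece needs, a minimal sketch of a process exposing its own metrics, assuming the Python prometheus_client library; the metric names and port are made up, and any language that can serve HTTP would do:

```python
# Sketch: any process can expose metrics by serving the text format over HTTP.
# Metric names and the port are illustrative only.
import random
import time
from prometheus_client import start_http_server, Counter, Gauge

REQUESTS = Counter("myapp_requests_total", "Requests handled, by status",
                   ["status"])
QUEUE_DEPTH = Gauge("myapp_queue_depth", "Jobs currently waiting in the queue")

if __name__ == "__main__":
    start_http_server(9200)  # serves /metrics on this port
    while True:
        # Stand-in for real work: record a request and a queue-depth sample.
        REQUESTS.labels(status="200").inc()
        QUEUE_DEPTH.set(random.randint(0, 50))
        time.sleep(5)
```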
Alerting
Alerts should be clear and actionable.
and
keep alerting simple, alert on symptoms, have good consoles to allow pinpointing causes, and avoid having pages where there is nothing to do.
Alerting originally even allowed pushing a runbook whenever an alert was triggered, putting the right information up front and reducing the stress of whoever is on call.
Replies to concerns:
The concerns posted by @jnijhof, with some replies based on my experiments:
Q: Server pulls data from agents/clients, which will give us the possibility to secure the monitoring server a bit better.
A: Yes, that is the model with Prometheus: the server pulls from the clients, so we can even use HTTPS to secure the metrics in transit, plus shared tokens or even basic auth.
Q: Lightweight agents/clients, with as small a memory footprint as possible. We don't want to overload a server just for the monitoring software.
A: Quite lightweight; we could even have no client at all and just serve the metrics from the Rails application that is already running. This should reduce the memory footprint to basically zero.
Q: Plugin support, easy to write/add plugins
A: As easy as serving HTTP - pick your language - or just write stuff to a file and serve it with the node exporter. We could serve metrics from CI, the Docker image server, Workhorse, and gitlab-rails without any configuration whatsoever; gitlab-shell should push metrics to a local push gateway.
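And for the gitlab-shell style short-lived case, a minimal sketch of pushing to a local push gateway, again assuming the Python client; the job name, metric and gateway address are placeholders:

```python
# Sketch: a short-lived job pushes its metrics to a local pushgateway,
# which Prometheus then scrapes. Names and the gateway address are illustrative.
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge("gitlab_shell_command_duration_seconds",
                 "How long the last git command took", registry=registry)

start = time.time()
# ... do the actual short-lived work here ...
duration.set(time.time() - start)

# The pushgateway keeps the last value per job/instance grouping.
push_to_gateway("localhost:9091", job="gitlab_shell", registry=registry)
```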
How could we build this system and take advantage of it while maintaining our current infrastructure?
Build one host that will be hosting grafana and prometheus, maybe use our "performance.gitlab.net" internal host (not the public one) so it starts doing something.
Install the basic node exporter in all the hosts that are used in GitLab.com and setup prometheus to scrape them.
Install the blackbox checker on the same host that holds Prometheus and configure it so its checks are scraped.
Install ceph and postgres metrics checks in the right servers, configure scraping.
Install and configure the alert manager to start notifying when things go wrong.
Build a second Prometheus node that will check that the first one is available and up (up is one of the metrics, a gauge); if it's not there, trigger an alert - see the sketch after this list.
Migrate InfluxDB into our network and set up the application; configure the Prometheus Grafana to use this new InfluxDB as a second data source so we have both application and system metrics in the same endpoint.
Build sanitized version of our metrics so we can publish performance data.
Build another Prometheus that will scrape the sanitized metrics from ours, add an insecure Grafana, rinse and repeat.
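As a stop-gap until the watchdog node from the list above exists, the up gauge can also be checked from a tiny external script; a minimal sketch, assuming a Prometheus version that exposes the /api/v1/query endpoint, with placeholder host names:

```python
# Sketch: external sanity check on the "up" gauge, handy while the second
# (watchdog) Prometheus node is not in place yet.
# Assumes the /api/v1/query endpoint; the host name is a placeholder.
import requests

PROMETHEUS = "http://prometheus.internal.example.com:9090"

def down_targets():
    resp = requests.get(PROMETHEUS + "/api/v1/query",
                        params={"query": "up == 0"}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # Each entry carries the labels of a target that failed its last scrape.
    return [r["metric"].get("instance", "unknown") for r in result]

if __name__ == "__main__":
    down = down_targets()
    if down:
        print("Targets down: " + ", ".join(down))
```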
@pcarranza LGTM, I'm not sure if we should start with the federated part or just one host, but I trust your judgement on this. Keep it simple and boring please. Also, this should be part of our Omnibus package. That would suggest one single server.
Federation is bonus points we can get eventually; what I love is that it is so boring to do that it does not add any complexity, it's just composing things.
As we discussed, we are going to work closely with you to roll this out bit by bit and not dump a huge load up front.
Since we already have grafana we have a simple last step (provide configuration), so probably the way to integrate with omnibus goes somewhere in this direction:
Build our metrics exporters from workers manually (non-omnibus)
Wire our grafana instance with the prometheus source (non-omnibus)
Pull these exporters into omnibus and replace the manual ones.
Pull the scraper into omnibus.
Switch our grafana source to the new omnibus based prometheus source.
Rinse and repeat with the alerting component.
So, no work for you initially, and we will make it simple to integrate bit by bit even with the feature "partially enabled" but not fully operational at the beginning.