To collect metrics, we use Prometheus, an open-source tool originally developed at SoundCloud. It uses a pull model to scrape metrics from many endpoints (normally specific pods in the cluster), allowing you to store and query this time-series data (generally through a frontend like Grafana). Our cluster-level monitoring system provides a baseline set of metrics for every pod, and developers are also encouraged to expose custom business-logic metrics, which are scraped automatically. Alongside this, we deploy the node-exporter DaemonSet, which exposes node-level metrics: CPU and memory usage, disk and network I/O, and so on. Alerts (handled by the Prometheus Alertmanager) are triggered once specific conditions have been met (e.g. if the number of replicas backing a given service drops below a certain threshold, or if the latency for a given SQL query is too high).
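As a rough illustration, the two alert conditions above could be expressed as Prometheus alerting rules along the lines of the sketch below. These are not our exact rules: the sketch assumes replica counts come from kube-state-metrics and that services expose a histogram named sql_query_duration_seconds; the deployment name, thresholds, and labels are all placeholders.

```yaml
groups:
  - name: service-health
    rules:
      # Page if a service drops below its minimum replica count.
      # Assumes kube-state-metrics is scraped; "checkout" is a placeholder deployment.
      - alert: TooFewReplicas
        expr: kube_deployment_status_replicas_available{deployment="checkout"} < 2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "checkout has fewer than 2 available replicas"

      # Warn when the p99 latency of any SQL query exceeds 500ms over 5 minutes.
      # Assumes applications export a sql_query_duration_seconds histogram.
      - alert: SlowSQLQuery
        expr: histogram_quantile(0.99, sum(rate(sql_query_duration_seconds_bucket[5m])) by (le, query)) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency for {{ $labels.query }} is above 500ms"
```

Prometheus evaluates the expressions; Alertmanager then takes care of routing, grouping, and silencing whatever these rules fire.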
We quickly reached the limits of a single-instance Prometheus setup and have since federated metrics across multiple instances. This has allowed us to scale to many thousands of metrics, capturing everything from CPU usage to the latencies of individual SQL queries and inter-service RPCs.
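In practice, federation means a global Prometheus scrapes the /federate endpoint of each underlying instance, pulling in only a selected subset of series. The snippet below is a minimal sketch following the standard federation pattern, not our actual configuration; the match[] selector and shard hostnames are placeholders.

```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    # Keep the job/instance labels as set by the scraped Prometheus servers.
    honor_labels: true
    metrics_path: '/federate'
    params:
      # Only pull the series needed at the global level.
      'match[]':
        - '{job=~"kubernetes-.*"}'
    static_configs:
      - targets:
          # Placeholder addresses for the per-shard Prometheus instances.
          - 'prometheus-shard-0:9090'
          - 'prometheus-shard-1:9090'
```

The match[] selectors are the important part: pulling every series into the global instance would simply recreate the single-instance scaling problem one level up.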