Overview
Metrics provide numerical measurements of your system’s performance over time. The Observability Bundle includes four components that work together to collect, store, and visualize metrics from your Kubernetes clusters:
- Prometheus: Collects and stores short-term metrics
- Grafana Mimir: Long-term metrics storage and querying
- Kube State Metrics: Exposes Kubernetes object state as metrics
- Prometheus Node Exporter: Exposes node hardware and OS metrics
Prometheus
Prometheus is the industry-standard monitoring system and time-series database for cloud-native environments. It automatically discovers services in Kubernetes and scrapes metrics from them.
What Prometheus Monitors
Application Metrics:
- Request rates, latencies, error rates (RED metrics)
- Custom business metrics from your applications
- Metrics from /metrics endpoints
Infrastructure Metrics:
- Pod CPU and memory usage
- Container resource consumption
- Service endpoint availability
Collection and Storage:
- Scrapes metrics every 15-30 seconds (configurable)
- Stores data locally for 15-30 days (typical configuration)
- Remote writes to Mimir for long-term retention
PromQL Basics
Prometheus uses PromQL (Prometheus Query Language) for querying metrics.
Configuration
Prometheus is configured via the Observability Bundle’s GitOps workflow.
Grafana Mimir
Mimir provides long-term, scalable storage for Prometheus metrics. It’s designed to handle billions of active time series across multiple tenants.
Why Mimir?
Long-term Retention:
- Stores metrics for months or years
- Prometheus typically keeps 15-30 days locally
- Historical analysis and capacity planning
Scalability:
- Scales to handle millions of samples per second
- Distributed architecture handles large metric volumes
- No single point of failure
Prometheus Compatibility:
- Receives data via Prometheus remote write
- Queries with PromQL
- Drop-in long-term storage backend for Prometheus
How It Works
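As a rough illustration of how the two stores divide the work, the Python sketch below picks a data source for a query based on how far back it reaches. It assumes a 15-day local Prometheus retention (the low end of the typical range); the function name and the returned labels are hypothetical and not part of the bundle.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Prometheus keeps 15 days locally (typical range is 15-30 days).
PROMETHEUS_RETENTION = timedelta(days=15)


def pick_data_source(query_start: datetime, now: datetime) -> str:
    """Return which store can answer a query that starts at query_start.

    Hypothetical helper: Mimir receives samples via remote write and keeps
    them long-term, while Prometheus only holds what is still in local storage.
    """
    if now - query_start <= PROMETHEUS_RETENTION:
        return "prometheus"
    return "mimir"


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
recent = pick_data_source(now - timedelta(hours=6), now)
historical = pick_data_source(now - timedelta(days=90), now)
print(recent, historical)
```

Because Mimir receives the same samples via remote write, pointing every query at Mimir is also a valid choice; the split here only shows where the local retention boundary sits.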
Configuration
Kube State Metrics
Kube State Metrics generates metrics about the state of Kubernetes objects. It listens to the Kubernetes API server and reports the state of objects (e.g., deployments, pods, nodes) rather than their resource consumption.
What It Exposes
Deployment Metrics:
- kube_deployment_status_replicas: Number of replicas the deployment currently has (the desired count is exposed as kube_deployment_spec_replicas)
- kube_deployment_status_replicas_available: Number of available replicas
- kube_deployment_status_replicas_unavailable: Number of unavailable replicas
Pod Metrics:
- kube_pod_status_phase: Pod phase (Pending, Running, Succeeded, Failed)
- kube_pod_status_ready: Whether pod is ready
- kube_pod_container_status_restarts_total: Container restart count
Node Metrics:
- kube_node_status_condition: Node conditions (Ready, MemoryPressure, DiskPressure)
- kube_node_status_allocatable: Allocatable resources per node
- kube_node_status_capacity: Total capacity per node
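To make the deployment metrics concrete, here is a minimal Python sketch that flags deployments with fewer available replicas than desired. It works on plain dictionaries of already-fetched values rather than querying Prometheus. The sample numbers are made up, and the desired count is read from kube_deployment_spec_replicas, a kube-state-metrics metric assumed to be scraped alongside the status metrics.

```python
def unhealthy_deployments(samples: dict[str, dict[str, float]]) -> list[str]:
    """Return deployments whose available replicas are below the desired count.

    `samples` maps a deployment name to the latest value of each metric.
    """
    degraded = []
    for name, metrics in samples.items():
        desired = metrics["kube_deployment_spec_replicas"]
        available = metrics["kube_deployment_status_replicas_available"]
        if available < desired:
            degraded.append(name)
    return sorted(degraded)


# Made-up sample values for illustration only.
latest = {
    "api": {
        "kube_deployment_spec_replicas": 3,
        "kube_deployment_status_replicas_available": 3,
    },
    "worker": {
        "kube_deployment_spec_replicas": 5,
        "kube_deployment_status_replicas_available": 2,
    },
}
degraded = unhealthy_deployments(latest)
print(degraded)
```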
Use Cases
- Cluster Health Monitoring
Configuration
Kube State Metrics typically requires minimal configuration.
Prometheus Node Exporter
The Node Exporter exposes hardware and OS metrics from Kubernetes nodes. It runs as a DaemonSet (one pod per node) to collect node-level performance data.
What It Exposes
CPU Metrics:
- node_cpu_seconds_total: CPU time spent in different modes (user, system, idle)
- Used to calculate CPU usage percentages
Memory Metrics:
- node_memory_MemTotal_bytes: Total memory
- node_memory_MemAvailable_bytes: Available memory
- node_memory_Cached_bytes: Cached memory
Disk Metrics:
- node_filesystem_size_bytes: Filesystem size
- node_filesystem_avail_bytes: Available space
- node_disk_io_time_seconds_total: Disk I/O time
Network Metrics:
- node_network_receive_bytes_total: Bytes received
- node_network_transmit_bytes_total: Bytes transmitted
- node_network_receive_errors_total: Receive errors
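The CPU usage percentage mentioned above comes from comparing two readings of node_cpu_seconds_total: the share of elapsed CPU time that was not spent in idle mode. The Python sketch below shows that arithmetic on made-up counter values; it assumes the counters did not reset between the two readings, and the function name is illustrative.

```python
def cpu_usage_percent(before: dict[str, float], after: dict[str, float]) -> float:
    """CPU usage between two readings of node_cpu_seconds_total, keyed by mode."""
    deltas = {mode: after[mode] - before[mode] for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("readings must be taken at different times")
    # Everything that is not idle time counts as usage.
    return 100.0 * (1.0 - deltas["idle"] / total)


# Made-up counter values (seconds) for one node, summed across cores.
t0 = {"user": 1000.0, "system": 400.0, "idle": 8600.0}
t1 = {"user": 1030.0, "system": 410.0, "idle": 8660.0}
usage = cpu_usage_percent(t0, t1)
print(f"{usage:.1f}%")  # prints "40.0%"
```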
Use Cases
- Node Resource Monitoring
Configuration
Node Exporter runs as a DaemonSet with access to the host network and PID namespaces.
Visualizing Metrics in Grafana
Once metrics are being collected, Grafana provides powerful visualization capabilities.
Pre-built Dashboards
The Observability Bundle includes dashboards for common monitoring needs:
- Kubernetes Cluster Monitoring: Overview of cluster health and resource usage
- Node Metrics: Detailed node performance (from Node Exporter)
- Pod Metrics: CPU, memory, network by pod
- Deployment Status: Replica counts, pod states
Creating Custom Dashboards
- Navigate to Grafana (access via ingress URL configured during setup)
- Click Dashboards → New Dashboard
- Add a panel and select Prometheus or Mimir as data source
- Write PromQL queries to visualize your metrics
Example panel:
- Panel Title: “API Request Rate”
- Data Source: Prometheus
- Query: rate(http_requests_total{job="api-service"}[5m])
- Visualization: Time series graph
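To show what a rate(...[5m]) query computes, the Python sketch below derives a per-second rate from raw counter samples, treating any drop in value as a counter reset. It is a simplification under stated assumptions: the real rate() function also extrapolates to the edges of the window, and the sample data and function name here are made up.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplified: the real rate() also extrapolates to the window edges.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. a pod restart); count from zero.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must increase")
    return increase / elapsed


# Made-up samples: one scrape per minute, with a reset before the fourth.
window = [(0, 100), (60, 160), (120, 220), (180, 30), (240, 90)]
rps = simple_rate(window)
print(rps)  # prints 0.875
```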
Alerting
Grafana can alert based on metric thresholds:
- Create an alert rule in a dashboard panel
- Define alert conditions (e.g., CPU > 80% for 5 minutes)
- Configure notification channels (email, Slack, PagerDuty)
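A condition such as "CPU > 80% for 5 minutes" only fires when the threshold is breached continuously for the whole period. The Python sketch below illustrates that logic on made-up samples; it is not Grafana's implementation, and the function and parameter names are hypothetical.

```python
def alert_fires(
    samples: list[tuple[float, float]], threshold: float, hold_seconds: float
) -> bool:
    """True if the value has stayed above threshold for at least hold_seconds.

    samples: (timestamp_seconds, value) pairs, oldest first.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # dipping below the threshold resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


# Made-up CPU percentages sampled once per minute.
sustained = [(0, 70), (60, 85), (120, 90), (180, 88), (240, 91), (300, 86), (360, 83)]
spike = [(0, 70), (60, 85), (120, 75), (180, 88), (240, 91)]
print(alert_fires(sustained, 80, 300), alert_fires(spike, 80, 300))
```

The short spike does not fire because the dip at the third sample resets the breach window, which is the point of requiring a duration: it filters out transient noise.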
Accessing Grafana
To access Grafana and view your metrics:
- Get the Grafana URL from your Observability Bundle configuration (set during setup)
- Log in with the credentials you configured:
  - Username: admin (default)
  - Password: Set during bundle configuration
Troubleshooting
Prometheus not scraping metrics
Check Prometheus targets:
- Access Prometheus UI (typically at http://prometheus:9090)
- Go to Status → Targets
- Look for targets in “down” state
Common causes:
- Network policies blocking Prometheus
- Service selector not matching pods
- Pods don’t expose a /metrics endpoint
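One way to check the last cause is to fetch a pod's /metrics response and confirm it looks like the Prometheus text format. The Python sketch below covers only the parsing half, on strings you have already fetched. The regular expression is a loose approximation of the format rather than a full parser, and the sample bodies are made up.

```python
import re

# Metric name, optional {labels}, a value, and an optional timestamp.
SAMPLE_LINE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*(\{.*\})?\s+\S+(\s+-?\d+)?$")


def count_metric_samples(body: str) -> int:
    """Count lines that look like Prometheus samples, skipping comments."""
    count = 0
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a # HELP / # TYPE comment
        if SAMPLE_LINE.match(line):
            count += 1
    return count


metrics_body = """# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
"""
html_body = "<html><body>404 Not Found</body></html>"
print(count_metric_samples(metrics_body), count_metric_samples(html_body))
```

A count of zero suggests the path serves something other than metrics, for example an error page.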
High memory usage in Prometheus
Prometheus memory usage scales with:
- Number of time series (unique label combinations)
- Scrape interval (more frequent = more data)
- Retention period (longer = more data stored)
Solutions:
- Reduce retention period (keep 7-15 days, use Mimir for long-term)
- Increase scrape interval (30s → 60s)
- Drop unnecessary metrics using relabeling
- Add more memory to Prometheus pods
Mimir remote write failing
Check the Prometheus logs for remote write errors.
Common issues:
- Mimir gateway not accessible (network/DNS issues)
- Authentication failure (check credentials)
- Rate limiting (ingestion rate too high)
Missing node or cluster metrics
- Verify kube-state-metrics is running
- Verify node-exporter is running on all nodes
- Check that Prometheus is scraping these exporters:
  - Access Prometheus UI → Status → Targets
  - Look for kube-state-metrics and node-exporter targets
Best Practices
Metric Naming
Follow Prometheus naming conventions:
- Use base units (seconds, not milliseconds)
- Append _total for counters
- Histogram buckets carry the _bucket suffix (client libraries add it, along with _sum and _count)
- Example: http_request_duration_seconds
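These conventions are easy to check mechanically. The Python sketch below is a small, hypothetical linter covering the rules above: valid characters, the _total suffix on counters, and base units. The list of non-base unit suffixes is an illustrative assumption, not an official list.

```python
import re

# Character set Prometheus has traditionally allowed in metric names.
METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
# Illustrative, non-exhaustive list of suffixes that suggest a non-base unit.
NON_BASE_UNITS = ("_milliseconds", "_ms", "_microseconds", "_minutes", "_kilobytes", "_megabytes")


def naming_problems(name: str, metric_type: str) -> list[str]:
    """Return a list of convention violations for a metric name."""
    problems = []
    if not METRIC_NAME.match(name):
        problems.append("invalid characters")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter should end in _total")
    # Look at the unit suffix that precedes any trailing _total.
    base = name[: -len("_total")] if name.endswith("_total") else name
    if base.endswith(NON_BASE_UNITS):
        problems.append("use base units such as seconds or bytes")
    return problems


print(naming_problems("http_request_duration_seconds", "histogram"))
print(naming_problems("http_requests", "counter"))
print(naming_problems("request_latency_ms", "gauge"))
```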
Label Usage
Keep cardinality low:
- Avoid high-cardinality labels (user IDs, timestamps)
- Use service name, environment, region as labels
- Don’t create unique label values for every request
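The reason cardinality matters is multiplicative: in the worst case, one metric produces a time series for every combination of label values. The Python sketch below works through that arithmetic; the label names and counts are made-up examples.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Made-up label sets: the second adds a per-user label.
low = series_count({"service": 20, "environment": 3, "region": 4})
high = series_count({"service": 20, "environment": 3, "region": 4, "user_id": 10_000})
print(low, high)  # prints 240 2400000
```

Adding a single user ID label turns a few hundred series into millions, and that growth is what drives Prometheus memory usage.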
Retention Strategy
Balance cost and utility:
- Prometheus: 7-15 days (recent, high-resolution)
- Mimir: months to years (long-term, downsampled)
- Adjust based on storage costs and query patterns
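To put numbers on the cost side, the Python sketch below applies the rough sizing rule from the Prometheus storage documentation: disk space is about retention time × ingested samples per second × bytes per sample, at roughly 1-2 bytes per sample. The 500,000 active series and 30-second scrape interval are made-up inputs, and the function name is hypothetical.

```python
def prometheus_disk_bytes(
    active_series: int,
    scrape_interval_s: float,
    retention_days: float,
    bytes_per_sample: float = 2.0,  # upper end of the 1-2 bytes rule of thumb
) -> float:
    """Rough local disk estimate for a Prometheus server."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


estimate = prometheus_disk_bytes(active_series=500_000, scrape_interval_s=30, retention_days=15)
print(f"{estimate / 1e9:.1f} GB")  # prints "43.2 GB"
```

Halving the retention or doubling the scrape interval halves the estimate, which is why those two settings are the usual levers.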
Resource Limits
Right-size resource allocations:
- Prometheus memory scales with active series
- Monitor and adjust based on actual usage
- Use horizontal scaling for very large deployments
Next Steps
- View Logs - Learn about log aggregation with Loki
- Configure Tracing - Set up distributed tracing
- Observability Overview - Return to bundle overview