Metrics - Skyhook

Overview

Metrics provide numerical measurements of your system’s performance over time. The Observability Bundle includes four components that work together to collect, store, and visualize metrics from your Kubernetes clusters:

Prometheus: Collects and stores short-term metrics
Grafana Mimir: Long-term metrics storage and querying
Kube State Metrics: Exposes Kubernetes object state as metrics
Prometheus Node Exporter: Exposes node hardware and OS metrics

Prometheus

Prometheus is the industry-standard monitoring system and time-series database for cloud-native environments. It automatically discovers services in Kubernetes and scrapes metrics from them.

What Prometheus Monitors

Application Metrics:

Request rates, latencies, error rates (RED metrics)
Custom business metrics from your applications
Metrics from /metrics endpoints

Kubernetes Metrics:

Pod CPU and memory usage
Container resource consumption
Service endpoint availability

Integration:

Scrapes metrics every 15-30 seconds (configurable)
Stores data locally for 15-30 days (typical configuration)
Remote writes to Mimir for long-term retention

PromQL Basics

Prometheus uses PromQL (Prometheus Query Language) for querying metrics. Example queries:

# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Memory usage by namespace
sum(container_memory_usage_bytes) by (namespace)

# Request rate for a service
rate(http_requests_total{job="my-service"}[5m])

# 95th percentile response time
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
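
To make the last query less opaque, here is a minimal Python sketch of the kind of estimate histogram_quantile produces: it finds the bucket containing the requested rank and interpolates linearly inside it. The function name and the sample bucket counts are illustrative assumptions, and the sketch ignores edge cases that Prometheus itself handles (for example negative bucket bounds and native histograms).

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total),
    mirroring the `le` label on *_bucket series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request-duration buckets (seconds -> cumulative request count).
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example_buckets)
```

With these sample buckets the 95th percentile lands inside the 0.5 to 1.0 second bucket, at roughly 0.81 seconds. The estimate is only as precise as the bucket boundaries, which is why bucket layout matters for latency metrics.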

Configuration

Prometheus is configured via the Observability Bundle’s GitOps workflow. Key configuration options:

enabled: true
valuesObject:
  # Scrape interval
  scrapeInterval: 30s

  # Retention period
  retention: 15d

  # Remote write to Mimir
  remoteWrite:
    - url: http://mimir-gateway/api/v1/push

  # Resource limits
  resources:
    limits:
      memory: 4Gi
      cpu: 2

For detailed Helm chart values, see the Prometheus chart documentation.

Grafana Mimir

Mimir provides long-term, scalable storage for Prometheus metrics. It’s designed to handle billions of active time series across multiple tenants.

Why Mimir?

Long-term Retention:

Stores metrics for months or years
Prometheus typically keeps 15-30 days locally
Historical analysis and capacity planning

Horizontal Scalability:

Scales to handle millions of samples per second
Distributed architecture handles large metric volumes
No single point of failure

Prometheus Compatible:

Receives data via Prometheus remote write
Queries with PromQL
Drop-in replacement for long-term storage

How It Works

Prometheus ──[remote write]──> Mimir ──[PromQL queries]──> Grafana
    │                                                           │
    └─────────────[PromQL queries for recent data]─────────────┘

Grafana can query both Prometheus (recent data) and Mimir (historical data) simultaneously, providing seamless access to both real-time and long-term metrics.
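
Outside Grafana, a script could make the same recent-versus-historical choice by looking at how far back a query reaches. The following is a minimal sketch, not part of the bundle: the base URLs, the 15-day cut-off, and the function name are assumptions. It also assumes Mimir serves its Prometheus-compatible API under the /prometheus prefix (its default) and leaves out any authentication or tenant headers your setup may require.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Hypothetical in-cluster addresses; substitute your own services.
PROMETHEUS_URL = "http://prometheus:9090"
MIMIR_URL = "http://mimir-gateway/prometheus"
# Assumed local retention of Prometheus (matches the 15d example value).
LOCAL_RETENTION = timedelta(days=15)


def range_query_url(query, start, end, step="60s", now=None):
    """Build a query_range URL, using Mimir when the range predates local retention."""
    now = now or datetime.now(timezone.utc)
    base = PROMETHEUS_URL if now - start <= LOCAL_RETENTION else MIMIR_URL
    params = urlencode({
        "query": query,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step,
    })
    return f"{base}/api/v1/query_range?{params}"
```

Because both backends speak the same HTTP query API and PromQL, only the base URL changes between the two cases.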

Configuration

enabled: true
valuesObject:
  # Storage backend (S3, GCS, Azure Blob)
  storage:
    backend: s3
    s3:
      endpoint: s3.amazonaws.com
      bucket: my-mimir-metrics

  # Retention period
  limits:
    compactor_blocks_retention_period: 1y

  # Ingester replication
  ingester:
    ring:
      replication_factor: 3

For detailed configuration, see the Mimir chart documentation.
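
The replication_factor value determines how many ingester failures a write can survive. As a rough sketch of the arithmetic, assuming Mimir's majority-quorum write behaviour (the function names are illustrative):

```python
def write_quorum(replication_factor):
    """Number of ingesters that must accept a sample for the write to succeed."""
    return replication_factor // 2 + 1


def tolerated_ingester_failures(replication_factor):
    """Ingesters that can be unavailable while writes still reach quorum."""
    return replication_factor - write_quorum(replication_factor)
```

With a replication factor of 3, two ingesters must accept each sample, so one ingester can be down without failing writes.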

Kube State Metrics

Kube State Metrics (kube-state-metrics) listens to the Kubernetes API server and generates metrics about the state of Kubernetes objects such as deployments, pods, and nodes. It reports object state (for example, how many replicas a deployment has available) rather than resource consumption, which comes from sources such as the kubelet and Node Exporter.

What It Exposes

Deployment Metrics:

kube_deployment_status_replicas: Number of replicas currently observed for the deployment (the desired count is kube_deployment_spec_replicas)
kube_deployment_status_replicas_available: Number of available replicas
kube_deployment_status_replicas_unavailable: Number of unavailable replicas

Pod Metrics:

kube_pod_status_phase: Pod phase (Pending, Running, Succeeded, Failed)
kube_pod_status_ready: Whether pod is ready
kube_pod_container_status_restarts_total: Container restart count

Node Metrics:

kube_node_status_condition: Node conditions (Ready, MemoryPressure, DiskPressure)
kube_node_status_allocatable: Allocatable resources per node
kube_node_status_capacity: Total capacity per node

Use Cases

Cluster Health Monitoring:

# Pods not in Running state (each pod has one 0/1 series per phase, so sum the values)
sum(kube_pod_status_phase{phase!="Running"})

# Active node conditions other than Ready (e.g. MemoryPressure, DiskPressure)
sum(kube_node_status_condition{condition!="Ready",status="true"})
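
When aggregating phase metrics, keep in mind that kube_pod_status_phase exposes one series per pod and phase, with value 1 for the pod's current phase and 0 for the others. The sketch below uses made-up pods to show why counting series and summing values give different answers.

```python
# Hypothetical samples: (pod, phase) -> value, shaped like kube_pod_status_phase.
samples = {
    ("web-1", "Running"): 1, ("web-1", "Pending"): 0, ("web-1", "Failed"): 0,
    ("job-1", "Running"): 0, ("job-1", "Pending"): 1, ("job-1", "Failed"): 0,
}

# Equivalent of the selector {phase!="Running"}.
not_running = {key: value for key, value in samples.items() if key[1] != "Running"}

series_count = len(not_running)               # what a count() over the selector sees
pods_not_running = sum(not_running.values())  # what a sum() over the selector sees
```

Here the selector matches four series but only one pod is actually outside the Running phase, so summing values is the aggregation that answers "how many pods".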

Capacity Planning:

# Allocatable CPU across the cluster
sum(kube_node_status_allocatable{resource="cpu"})

# Pod resource requests vs node capacity
sum(kube_pod_container_resource_requests{resource="memory"}) /
sum(kube_node_status_capacity{resource="memory"})

Deployment Monitoring:

# Deployments with unavailable replicas
kube_deployment_status_replicas_unavailable > 0

# Deployments not at desired replica count
kube_deployment_status_replicas != kube_deployment_spec_replicas

Configuration

Kube State Metrics typically requires minimal configuration:

enabled: true
valuesObject:
  # Resource limits
  resources:
    limits:
      memory: 256Mi
      cpu: 100m

  # Which resources to monitor (default: all)
  collectors:
    - deployments
    - pods
    - nodes
    - services
    - configmaps

For more details, see the Kube State Metrics chart.

Prometheus Node Exporter

The Node Exporter exposes hardware and OS metrics from Kubernetes nodes. It runs as a DaemonSet (one pod per node) to collect node-level performance data.

What It Exposes

CPU Metrics:

node_cpu_seconds_total: CPU time spent in different modes (user, system, idle)
Used to calculate CPU usage percentages

Memory Metrics:

node_memory_MemTotal_bytes: Total memory
node_memory_MemAvailable_bytes: Available memory
node_memory_Cached_bytes: Cached memory

Disk Metrics:

node_filesystem_size_bytes: Filesystem size
node_filesystem_avail_bytes: Available space
node_disk_io_time_seconds_total: Disk I/O time

Network Metrics:

node_network_receive_bytes_total: Bytes received
node_network_transmit_bytes_total: Bytes transmitted
node_network_receive_errors_total: Receive errors

Use Cases

Node Resource Monitoring:

# CPU usage by node
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage by node
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

# Disk usage by mount point
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
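
The CPU query above derives usage from an idle-time counter. As a simplified sketch of that arithmetic (the function names and sample values are assumptions, and Prometheus's own rate and irate functions differ in details such as window extrapolation):

```python
def per_second_rate(samples):
    """Per-second increase of a counter.

    `samples` is a list of (timestamp_seconds, counter_value), oldest first.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset; count from zero after the reset.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def cpu_usage_percent(idle_rates_per_core):
    """Turn per-core idle rates (idle seconds per second) into a busy percentage."""
    avg_idle = sum(idle_rates_per_core) / len(idle_rates_per_core)
    return 100 - avg_idle * 100


# Hypothetical idle-counter samples for one core, scraped every 30 seconds.
core0_idle = per_second_rate([(0, 100.0), (30, 127.0), (60, 154.0)])
```

A core that accumulates 0.9 idle seconds per second is about 10% busy; averaging across cores gives the per-node figure.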

Network Monitoring:

# Network traffic by node
rate(node_network_receive_bytes_total[5m])

# Network errors
rate(node_network_receive_errors_total[5m]) > 0

Disk I/O:

# Disk I/O utilization (fraction of each second the device spent on I/O)
rate(node_disk_io_time_seconds_total[5m])

Configuration

Node Exporter runs as a DaemonSet with access to the host network and PID namespaces:

enabled: true
valuesObject:
  # Host network access for accurate metrics
  hostNetwork: true
  hostPID: true

  # Resource limits
  resources:
    limits:
      memory: 128Mi
      cpu: 100m

For more configuration options, see the Node Exporter chart.

Visualizing Metrics in Grafana

Once metrics are being collected, Grafana provides powerful visualization capabilities.

Pre-built Dashboards

The Observability Bundle includes dashboards for common monitoring needs:

Kubernetes Cluster Monitoring: Overview of cluster health and resource usage
Node Metrics: Detailed node performance (from Node Exporter)
Pod Metrics: CPU, memory, network by pod
Deployment Status: Replica counts, pod states

Creating Custom Dashboards

Navigate to Grafana (access via ingress URL configured during setup)
Click Dashboards → New Dashboard
Add a panel and select Prometheus or Mimir as data source
Write PromQL queries to visualize your metrics

Example panel configuration:

Panel Title: “API Request Rate”
Data Source: Prometheus
Query: rate(http_requests_total{job="api-service"}[5m])
Visualization: Time series graph

Alerting

Grafana can alert based on metric thresholds:

Create an alert rule in a dashboard panel
Define alert conditions (e.g., CPU > 80% for 5 minutes)
Configure notification channels (email, Slack, PagerDuty)
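
A condition such as "CPU > 80% for 5 minutes" means the threshold must stay breached across consecutive evaluations before the alert fires. The sketch below models that idea in a simplified way; the state names, the one-evaluation-per-minute assumption, and the function itself are illustrative and do not cover Grafana's no-data or error handling.

```python
def alert_states(values, threshold=80.0, for_intervals=5):
    """Return the alert state after each evaluation of a metric.

    `for_intervals` is the "for" duration expressed in evaluation intervals,
    e.g. 5 for "5 minutes" with one evaluation per minute.
    """
    states = []
    breaching = 0  # consecutive evaluations above the threshold
    for value in values:
        breaching = (breaching + 1) if value > threshold else 0
        if breaching == 0:
            states.append("normal")
        elif breaching <= for_intervals:
            # Threshold breached, but not yet for the full duration.
            states.append("pending")
        else:
            states.append("firing")
    return states
```

A single dip below the threshold resets the counter, which is what keeps short spikes from paging anyone.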

Accessing Grafana

To access Grafana and view your metrics:

Get the Grafana URL from your Observability Bundle configuration (configured during setup)
Log in with the credentials you configured:
- Username: admin (default)
- Password: Set during bundle configuration

If you need to retrieve the password:

kubectl get secret grafana -o jsonpath="{.data.admin-password}" -n observability | base64 --decode

For security, change the default admin password and create additional user accounts as needed through the Grafana UI.

Troubleshooting

Prometheus not scraping metrics

Check Prometheus targets:

Access Prometheus UI (typically at http://prometheus:9090)
Go to Status → Targets
Look for targets in “down” state

Common issues:

Network policies blocking Prometheus
Service selector not matching pods
Pods don’t expose /metrics endpoint

Verify metrics endpoint:

kubectl port-forward pod/my-app-pod 8080:8080
curl http://localhost:8080/metrics
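
If the endpoint responds, the output is plain text with one sample per line. A minimal parsing sketch, useful for spot-checking that an expected metric is present, might look like this (the function name is hypothetical, and the sketch assumes samples carry no trailing timestamps):

```python
def parse_metrics(text):
    """Parse Prometheus text exposition output into {series: value}.

    Simplification: assumes each sample line ends with its value
    (no optional timestamp after it).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples
```

For anything beyond a quick check, an existing Prometheus client library parser is a safer choice than hand-rolled parsing.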

High memory usage in Prometheus

Prometheus memory usage scales with:

Number of time series (unique label combinations)
Scrape interval (more frequent = more data)
Retention period (longer = more blocks on disk, and more memory for queries over long ranges)

Solutions:

Reduce retention period (keep 7-15 days, use Mimir for long-term)
Increase scrape interval (30s → 60s)
Drop unnecessary metrics using relabeling
Add more memory to Prometheus pods
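
To get a feel for how scrape interval and retention affect stored data, the following sketch applies the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The function names and example figures are assumptions, and the result estimates disk usage, not memory.

```python
SECONDS_PER_DAY = 86_400


def ingested_samples_per_second(active_series, scrape_interval_s):
    """Each active series yields one sample per scrape."""
    return active_series / scrape_interval_s


def estimated_disk_bytes(active_series, scrape_interval_s, retention_days,
                         bytes_per_sample=2):
    """Rough on-disk size for the local TSDB; bytes_per_sample is an assumed average."""
    retention_s = retention_days * SECONDS_PER_DAY
    rate = ingested_samples_per_second(active_series, scrape_interval_s)
    return retention_s * rate * bytes_per_sample


# Hypothetical cluster: 1 million active series, 30s scrapes, 15 days retention.
estimate = estimated_disk_bytes(1_000_000, 30, 15)
```

For that hypothetical cluster the estimate is about 86 GB; doubling the scrape interval to 60s or halving retention cuts it roughly in half.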

Check cardinality:

# Access Prometheus UI
# Go to Status → TSDB Status
# Look for series with high cardinality

Mimir remote write failing

Check Prometheus logs:

kubectl logs -n observability prometheus-server-0 | grep -i "remote"

Common issues:

Mimir gateway not accessible (network/DNS issues)
Authentication failure (check credentials)
Rate limiting (ingestion rate too high)

Verify Mimir is running:

kubectl get pods -n observability | grep mimir

Missing node or cluster metrics

Verify kube-state-metrics is running:

kubectl get pods -n observability | grep kube-state-metrics

Verify node-exporter is running on all nodes:

kubectl get pods -n observability -l app=prometheus-node-exporter -o wide
# Should show one pod per node

Check Prometheus is scraping these exporters:

Access Prometheus UI → Status → Targets
Look for kube-state-metrics and node-exporter targets

Best Practices

Metric Naming

Follow Prometheus naming conventions:

Use base unit (seconds, not milliseconds)
Append _total for counters
Histograms expose _bucket, _sum, and _count series automatically, so name only the base metric
Example: http_request_duration_seconds
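
These conventions are easy to check mechanically. The following sketch is a hypothetical lint helper, not part of any Prometheus tooling, and it covers only the rules listed above plus the basic character set allowed in metric names.

```python
import re

# Unit suffixes that suggest a non-base unit (illustrative, not exhaustive).
NON_BASE_UNITS = r"_(milliseconds|ms|minutes|kilobytes|megabytes)(_|$)"


def naming_issues(name, metric_type):
    """Return a list of naming problems for a metric name and its type."""
    issues = []
    if not re.fullmatch(r"[a-zA-Z_:][a-zA-Z0-9_:]*", name):
        issues.append("invalid characters")
    if metric_type == "counter" and not name.endswith("_total"):
        issues.append("counter should end in _total")
    if re.search(NON_BASE_UNITS, name):
        issues.append("use base units such as seconds or bytes")
    return issues
```

For example, naming_issues("http_requests", "counter") reports the missing _total suffix, while http_request_duration_seconds passes as a histogram name.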

Label Usage

Keep cardinality low:

Avoid high-cardinality labels (user IDs, timestamps)
Use service name, environment, region as labels
Don’t create unique label values for every request
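
The reason high-cardinality labels hurt is multiplicative: the worst-case number of series for one metric is the product of the distinct values of each label. A small sketch with made-up label sets:

```python
from math import prod


def max_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical label sets for a single metric.
labels = {
    "service": [f"svc-{i}" for i in range(20)],
    "environment": ["dev", "staging", "prod"],
    "region": ["us-east", "us-west", "eu-west", "ap-south"],
}
baseline = max_series(labels)

# Adding a per-user label multiplies the bound by the number of users.
labels_with_user = {**labels, "user_id": [str(i) for i in range(10_000)]}
with_user = max_series(labels_with_user)
```

In this example the bound grows from 240 series to 2.4 million once user_id becomes a label, for a single metric name.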

Retention Strategy

Balance cost and utility:

Prometheus: 7-15 days (recent, high-resolution)
Mimir: months to years (long-term history)
Adjust based on storage costs and query patterns

Resource Limits

Right-size resource allocations:

Prometheus memory scales with active series
Monitor and adjust based on actual usage
Use horizontal scaling for very large deployments

Next Steps

View Logs - Learn about log aggregation with Loki
Configure Tracing - Set up distributed tracing
Observability Overview - Return to bundle overview
