## Overview
Sockudo provides comprehensive monitoring capabilities including Prometheus metrics, health checks, structured logging, and cluster health tracking.
## Prometheus Metrics

### Metrics Endpoint

Sockudo exposes Prometheus metrics on a dedicated port:

```yaml
environment:
  METRICS_ENABLED: "true"
  METRICS_DRIVER: "prometheus"
  METRICS_HOST: "0.0.0.0"
  METRICS_PORT: "9601"
  METRICS_PROMETHEUS_PREFIX: "sockudo_"
ports:
  - "9601:9601"
```

Access metrics at `http://localhost:9601/metrics`.
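To confirm the endpoint is serving data after startup (assuming the port mapping above):

```bash
# Quick check that the endpoint is live and exporting Sockudo metrics
curl -s http://localhost:9601/metrics | grep '^sockudo_' | head
```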
### Available Metrics

Sockudo exposes the following Prometheus metrics:

#### Connection Metrics

```
# Total active WebSocket connections
sockudo_connections_active{app_id="app-id"} 1523

# Total connections established
sockudo_connections_total{app_id="app-id"} 15230

# Connection errors
sockudo_connection_errors_total{app_id="app-id",reason="auth_failed"} 12
```

#### Channel Metrics

```
# Active channels
sockudo_channels_active{app_id="app-id"} 342

# Channel subscriptions
sockudo_channel_subscriptions_total{app_id="app-id",channel_type="private"} 891

# Channel subscription errors
sockudo_channel_subscription_errors_total{app_id="app-id",reason="invalid_auth"} 5
```

#### Message Metrics

```
# Messages sent
sockudo_messages_sent_total{app_id="app-id",event="user-update"} 4521

# Messages received
sockudo_messages_received_total{app_id="app-id",event="client-event"} 892

# Message send errors
sockudo_message_send_errors_total{app_id="app-id"} 3

# Message bytes sent
sockudo_message_bytes_sent_total{app_id="app-id"} 2840123

# Message bytes received
sockudo_message_bytes_received_total{app_id="app-id"} 428934
```

#### HTTP API Metrics

```
# HTTP requests
sockudo_http_requests_total{method="POST",path="/apps/:id/events",status="200"} 1823

# HTTP request duration (histogram)
sockudo_http_request_duration_seconds_bucket{method="POST",path="/apps/:id/events",le="0.1"} 1720

# HTTP errors
sockudo_http_errors_total{method="POST",path="/apps/:id/events",status="401"} 12
```

#### Presence Metrics

```
# Presence channel members
sockudo_presence_members{app_id="app-id",channel="presence-room"} 45

# Presence join events
sockudo_presence_joins_total{app_id="app-id"} 342

# Presence leave events
sockudo_presence_leaves_total{app_id="app-id"} 298
```

#### WebSocket Buffer Metrics

```
# Current buffer usage
sockudo_websocket_buffer_usage{socket_id="abc123"} 245

# Buffer full events (slow consumers)
sockudo_websocket_buffer_full_total{action="disconnect"} 8

# Messages dropped due to full buffer
sockudo_websocket_messages_dropped_total{socket_id="abc123"} 42
```

#### Rate Limiting Metrics

```
# Rate limit hits
sockudo_rate_limit_exceeded_total{app_id="app-id",limit_type="events_per_second"} 23

# Rate limiter errors
sockudo_rate_limiter_errors_total{app_id="app-id"} 1
```

#### System Metrics

```
# Process CPU usage
process_cpu_seconds_total 123.45

# Process memory
process_resident_memory_bytes 234567890

# Tokio runtime metrics
tokio_workers_count 4
tokio_blocking_queue_depth 0
```
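Once Prometheus is scraping these metrics (setup below), any of them can be sanity-checked through the Prometheus HTTP query API. A minimal sketch, assuming Prometheus is reachable on localhost:9090:

```bash
# Per-app message throughput over the last 5 minutes
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (app_id) (rate(sockudo_messages_sent_total[5m]))'
```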
## Prometheus Setup

### Docker Compose Configuration

```yaml
services:
  sockudo:
    environment:
      METRICS_ENABLED: "true"
      METRICS_PORT: "9601"
    ports:
      - "9601:9601"

  prometheus:
    image: prom/prometheus:latest
    container_name: sockudo-prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    networks:
      - sockudo-network

volumes:
  prometheus-data:
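```

To bring the stack up and confirm Prometheus can reach its scrape targets (a sketch assuming the compose file above):

```bash
# Validate the compose file, start the stack, and check target health
docker compose config --quiet
docker compose up -d
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'
```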
### Prometheus Configuration

Create `monitoring/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'sockudo-prod'
    environment: 'production'

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  # Scrape Sockudo metrics from all nodes
  - job_name: 'sockudo'
    static_configs:
      - targets:
          - 'sockudo-node1:9601'
          - 'sockudo-node2:9602'
          - 'sockudo-node3:9603'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):(\d+)'
        replacement: '${1}'

  # Scrape Redis metrics (if using redis_exporter)
  - job_name: 'redis'
    static_configs:
      - targets:
          - 'redis-exporter:9121'

  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - 'localhost:9090'
```
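Before (re)starting Prometheus, the configuration can be validated with promtool, which ships in the prom/prometheus image:

```bash
# Validate prometheus.yml and the rule files it references
docker run --rm --entrypoint promtool \
  -v "$(pwd)/monitoring:/etc/prometheus:ro" \
  prom/prometheus:latest check config /etc/prometheus/prometheus.yml
```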
### Multi-Node Scraping

For dynamic node discovery (for example, via Consul DNS service discovery, as shown here):

```yaml
scrape_configs:
  - job_name: 'sockudo'
    dns_sd_configs:
      - names:
          - 'sockudo.service.consul'
        type: 'A'
        port: 9601
```
### Alert Rules

Create `monitoring/rules/sockudo-alerts.yml`:

```yaml
groups:
  - name: sockudo
    interval: 30s
    rules:
      # High connection count
      - alert: HighConnectionCount
        expr: sockudo_connections_active > 50000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection count on {{ $labels.instance }}"
          description: "Node {{ $labels.instance }} has {{ $value }} active connections"

      # Connection errors
      - alert: HighConnectionErrors
        expr: rate(sockudo_connection_errors_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High connection error rate on {{ $labels.instance }}"
          description: "{{ $value }} connection errors per second"

      # Message send failures
      - alert: MessageSendFailures
        expr: rate(sockudo_message_send_errors_total[5m]) > 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High message send error rate"
          description: "{{ $value }} message send errors per second on {{ $labels.instance }}"

      # Rate limiting
      - alert: HighRateLimitHits
        expr: rate(sockudo_rate_limit_exceeded_total[5m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate limit hits for app {{ $labels.app_id }}"
          description: "{{ $value }} rate limit hits per second"

      # Slow consumers
      - alert: SlowConsumerDisconnects
        expr: rate(sockudo_websocket_buffer_full_total{action="disconnect"}[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow consumers being disconnected"
          description: "{{ $value }} slow consumer disconnects per second on {{ $labels.instance }}"

      # Node down
      - alert: SockudoNodeDown
        expr: up{job="sockudo"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Sockudo node {{ $labels.instance }} is down"
          description: "Node has been down for more than 1 minute"

      # High memory usage
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes > 3000000000  # 3GB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage: {{ $value | humanize }}B"

      # Redis connection errors
      - alert: RedisConnectionErrors
        expr: rate(sockudo_adapter_errors_total{adapter="redis"}[5m]) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis adapter errors on {{ $labels.instance }}"
          description: "{{ $value }} Redis errors per second"
```
## Health Checks

### Application Health

Sockudo provides health check endpoints:

```bash
# General health check
curl http://localhost:6001/up

# App-specific health check
curl http://localhost:6001/up/my-app-id
```

Response:

```json
{
  "status": "ok",
  "timestamp": 1709251234
}
```
### Docker Health Check

```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:6001/up/my-app"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 30s
```
### Kubernetes Liveness/Readiness

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sockudo
spec:
  containers:
    - name: sockudo
      image: sockudo:latest
      ports:
        - containerPort: 6001
        - containerPort: 9601
      livenessProbe:
        httpGet:
          path: /up
          port: 6001
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /up
          port: 6001
        initialDelaySeconds: 10
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 2
```
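To verify the probes behave as expected, apply the manifest and wait for readiness. A sketch; `sockudo-pod.yaml` is a placeholder name for the manifest above:

```bash
# Apply the manifest and wait for the readiness probe to pass
kubectl apply -f sockudo-pod.yaml
kubectl wait --for=condition=Ready pod/sockudo --timeout=120s

# If the pod never becomes Ready, inspect probe results
kubectl describe pod sockudo
```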
## Logging

### JSON Logging (Production)

Enable structured JSON logs:

```yaml
environment:
  LOG_OUTPUT_FORMAT: "json"
  LOG_INCLUDE_TARGET: "true"
  RUST_LOG: "info,sockudo=info"
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
```

Log format:

```json
{
  "timestamp": "2024-03-01T12:34:56.789Z",
  "level": "INFO",
  "target": "sockudo::websocket",
  "fields": {
    "message": "WebSocket connection established",
    "socket_id": "abc123",
    "app_id": "my-app",
    "remote_addr": "192.168.1.100"
  }
}
```
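Structured logs are easy to filter with jq. A sketch assuming the container is named `sockudo` and uses the json-file driver shown above:

```bash
# Tail only warnings and errors, printing timestamp and message
docker logs -f sockudo 2>&1 \
  | jq -r 'select(.level == "WARN" or .level == "ERROR") | "\(.timestamp) \(.fields.message)"'
```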
Log Levels
# Production: minimal logging
RUST_LOG="warn,sockudo=info"
# Development: detailed logging
RUST_LOG="debug,sockudo=debug"
# Troubleshooting: verbose logging
RUST_LOG="trace,sockudo=trace"
# Module-specific logging
RUST_LOG="info,sockudo::websocket=debug,sockudo::adapter=trace"
### Log Aggregation

#### Fluentd

```yaml
services:
  sockudo:
    logging:
      driver: fluentd
      options:
        fluentd-address: localhost:24224
        tag: sockudo.{{.Name}}
```

#### Loki

```yaml
services:
  sockudo:
    logging:
      driver: loki
      options:
        loki-url: "http://loki:3100/loki/api/v1/push"
        loki-external-labels: "job=sockudo,environment=production"
```
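After wiring up the driver, confirm that logs are arriving through Loki's query API. A sketch assuming Loki is reachable on localhost:3100:

```bash
# Fetch the 10 most recent Sockudo log lines containing "ERROR"
curl -sG 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="sockudo"} |= "ERROR"' \
  --data-urlencode 'limit=10'
```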
## Grafana Dashboards

### Grafana Setup

```yaml
services:
  grafana:
    image: grafana/grafana:latest
    container_name: sockudo-grafana
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_ANALYTICS_REPORTING_ENABLED: "false"
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources:ro
    networks:
      - sockudo-network

volumes:
  grafana-data:
```
### Datasource Configuration

Create `monitoring/grafana/datasources/prometheus.yml`:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
```
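To confirm Grafana is up and the datasource was provisioned (assuming the compose service above and the admin password from `GRAFANA_PASSWORD`):

```bash
# Grafana health endpoint
curl -s http://localhost:3000/api/health

# List provisioned datasources (requires admin credentials)
curl -s -u "admin:${GRAFANA_PASSWORD}" http://localhost:3000/api/datasources
```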
### Dashboard Panels

Key panels to include:

- **Connection Count**: `sockudo_connections_active`
- **Message Rate**: `rate(sockudo_messages_sent_total[1m])`
- **Error Rate**: `rate(sockudo_connection_errors_total[5m])`
- **Channel Count**: `sockudo_channels_active`
- **Memory Usage**: `process_resident_memory_bytes`
- **CPU Usage**: `rate(process_cpu_seconds_total[1m])`
- **Buffer Usage**: `sockudo_websocket_buffer_usage`
- **Rate Limits**: `sockudo_rate_limit_exceeded_total`
## Cluster Health Monitoring

Sockudo includes built-in cluster health tracking:

```json
{
  "adapter": {
    "cluster_health": {
      "enabled": true,
      "heartbeat_interval_ms": 10000,
      "node_timeout_ms": 30000,
      "cleanup_interval_ms": 10000
    }
  }
}
```

Features:

- Automatic node discovery
- Heartbeat monitoring (every 10s)
- Failed node detection (30s timeout)
- Automatic cleanup of stale connections

Metrics:

```
# Active cluster nodes
sockudo_cluster_nodes_active 3

# Failed nodes detected
sockudo_cluster_nodes_failed_total 1

# Heartbeat failures
sockudo_cluster_heartbeat_failures_total 5
```
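A quick way to watch cluster membership from any node's metrics endpoint:

```bash
# Refresh the cluster metrics every 10 seconds
watch -n 10 "curl -s http://localhost:9601/metrics | grep '^sockudo_cluster'"
```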
## Alerting

### Alertmanager Setup

```yaml
services:
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager-data:/alertmanager
    networks:
      - sockudo-network

volumes:
  alertmanager-data:
```
Create `monitoring/alertmanager.yml`:

```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'slack'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#sockudo-alerts'
        title: 'Sockudo Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
```
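The file can be validated with amtool, which ships in the prom/alertmanager image. Note that Prometheus only forwards alerts if its own configuration also includes an `alerting:` section pointing at the Alertmanager service (e.g. `alertmanager:9093`); the `prometheus.yml` example above does not yet include one.

```bash
# Validate alertmanager.yml before starting the container
docker run --rm --entrypoint amtool \
  -v "$(pwd)/monitoring/alertmanager.yml:/alertmanager.yml:ro" \
  prom/alertmanager:latest check-config /alertmanager.yml
```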