
Overview

Sockudo provides comprehensive monitoring capabilities, including Prometheus metrics, health checks, structured logging, and cluster health tracking.

Prometheus Metrics

Metrics Endpoint

Sockudo exposes Prometheus metrics on a dedicated port:
environment:
  METRICS_ENABLED: "true"
  METRICS_DRIVER: "prometheus"
  METRICS_HOST: "0.0.0.0"
  METRICS_PORT: "9601"
  METRICS_PROMETHEUS_PREFIX: "sockudo_"

ports:
  - "9601:9601"
Access metrics at: http://localhost:9601/metrics
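
A quick way to confirm the endpoint is serving data (adjust the host and port to your deployment):
# List all Sockudo-specific metrics
curl -s http://localhost:9601/metrics | grep '^sockudo_'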

Available Metrics

Sockudo exposes the following Prometheus metrics:

Connection Metrics

# Total active WebSocket connections
sockudo_connections_active{app_id="app-id"} 1523

# Total connections established
sockudo_connections_total{app_id="app-id"} 15230

# Connection errors
sockudo_connection_errors_total{app_id="app-id",reason="auth_failed"} 12

Channel Metrics

# Active channels
sockudo_channels_active{app_id="app-id"} 342

# Channel subscriptions
sockudo_channel_subscriptions_total{app_id="app-id",channel_type="private"} 891

# Channel subscription errors
sockudo_channel_subscription_errors_total{app_id="app-id",reason="invalid_auth"} 5

Message Metrics

# Messages sent
sockudo_messages_sent_total{app_id="app-id",event="user-update"} 4521

# Messages received
sockudo_messages_received_total{app_id="app-id",event="client-event"} 892

# Message send errors
sockudo_message_send_errors_total{app_id="app-id"} 3

# Message bytes sent
sockudo_message_bytes_sent_total{app_id="app-id"} 2840123

# Message bytes received
sockudo_message_bytes_received_total{app_id="app-id"} 428934

HTTP API Metrics

# HTTP requests
sockudo_http_requests_total{method="POST",path="/apps/:id/events",status="200"} 1823

# HTTP request duration (histogram)
sockudo_http_request_duration_seconds_bucket{method="POST",path="/apps/:id/events",le="0.1"} 1720

# HTTP errors
sockudo_http_errors_total{method="POST",path="/apps/:id/events",status="401"} 12
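
Because request duration is exported as a histogram, percentiles are derived at query time. A sketch of a p95 latency query in PromQL, using the bucket metric above:
# 95th percentile HTTP request latency over 5 minutes
histogram_quantile(0.95, sum by (le) (rate(sockudo_http_request_duration_seconds_bucket[5m])))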

Presence Metrics

# Presence channel members
sockudo_presence_members{app_id="app-id",channel="presence-room"} 45

# Presence join events
sockudo_presence_joins_total{app_id="app-id"} 342

# Presence leave events
sockudo_presence_leaves_total{app_id="app-id"} 298

WebSocket Buffer Metrics

# Current buffer usage
sockudo_websocket_buffer_usage{socket_id="abc123"} 245

# Buffer full events (slow consumers)
sockudo_websocket_buffer_full_total{action="disconnect"} 8

# Messages dropped due to full buffer
sockudo_websocket_messages_dropped_total{socket_id="abc123"} 42

Rate Limiting Metrics

# Rate limit hits
sockudo_rate_limit_exceeded_total{app_id="app-id",limit_type="events_per_second"} 23

# Rate limiter errors
sockudo_rate_limiter_errors_total{app_id="app-id"} 1

System Metrics

# Process CPU usage
process_cpu_seconds_total 123.45

# Process memory
process_resident_memory_bytes 234567890

# Tokio runtime metrics
tokio_workers_count 4
tokio_blocking_queue_depth 0

Prometheus Setup

Docker Compose Configuration

services:
  sockudo:
    environment:
      METRICS_ENABLED: "true"
      METRICS_PORT: "9601"
    ports:
      - "9601:9601"

  prometheus:
    image: prom/prometheus:latest
    container_name: sockudo-prometheus
    restart: unless-stopped
    
    ports:
      - "9090:9090"
    
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    
    networks:
      - sockudo-network

volumes:
  prometheus-data:
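
Because the compose file passes --web.enable-lifecycle, the running Prometheus can reload its configuration without a restart:
# Apply prometheus.yml changes to the running server
curl -X POST http://localhost:9090/-/reload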

Prometheus Configuration

Create monitoring/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'sockudo-prod'
    environment: 'production'

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  # Scrape Sockudo metrics from all nodes
  - job_name: 'sockudo'
    static_configs:
      - targets:
          - 'sockudo-node1:9601'
          - 'sockudo-node2:9602'
          - 'sockudo-node3:9603'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):(\d+)'
        replacement: '${1}'

  # Scrape Redis metrics (if using redis_exporter)
  - job_name: 'redis'
    static_configs:
      - targets:
          - 'redis-exporter:9121'

  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - 'localhost:9090'
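
Before deploying changes, validate the file with promtool, which ships with Prometheus:
# Check the scrape configuration for syntax errors
promtool check config monitoring/prometheus.yml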

Multi-Node Scraping

For dynamic node discovery:
scrape_configs:
  - job_name: 'sockudo'
    dns_sd_configs:
      - names:
          - 'sockudo.service.consul'
        type: 'A'
        port: 9601
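
If you are not running Consul, file-based discovery is a lightweight alternative; a minimal sketch (the targets file path is an assumption):
scrape_configs:
  - job_name: 'sockudo'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/sockudo-*.json'
        refresh_interval: 30s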

Alert Rules

Create monitoring/rules/sockudo-alerts.yml:
groups:
  - name: sockudo
    interval: 30s
    rules:
      # High connection count
      - alert: HighConnectionCount
        expr: sockudo_connections_active > 50000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection count on {{ $labels.instance }}"
          description: "Node {{ $labels.instance }} has {{ $value }} active connections"

      # Connection errors
      - alert: HighConnectionErrors
        expr: rate(sockudo_connection_errors_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High connection error rate on {{ $labels.instance }}"
          description: "{{ $value }} connection errors per second"

      # Message send failures
      - alert: MessageSendFailures
        expr: rate(sockudo_message_send_errors_total[5m]) > 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High message send error rate"
          description: "{{ $value }} message send errors per second on {{ $labels.instance }}"

      # Rate limiting
      - alert: HighRateLimitHits
        expr: rate(sockudo_rate_limit_exceeded_total[5m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate limit hits for app {{ $labels.app_id }}"
          description: "{{ $value }} rate limit hits per second"

      # Slow consumers
      - alert: SlowConsumerDisconnects
        expr: rate(sockudo_websocket_buffer_full_total{action="disconnect"}[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow consumers being disconnected"
          description: "{{ $value }} slow consumer disconnects per second on {{ $labels.instance }}"

      # Node down
      - alert: SockudoNodeDown
        expr: up{job="sockudo"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Sockudo node {{ $labels.instance }} is down"
          description: "Node has been down for more than 1 minute"

      # High memory usage
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes > 3000000000  # 3GB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage: {{ $value | humanize }}B"

      # Redis connection errors
      - alert: RedisConnectionErrors
        expr: rate(sockudo_adapter_errors_total{adapter="redis"}[5m]) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis adapter errors on {{ $labels.instance }}"
          description: "{{ $value }} Redis errors per second"
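
Alert rules can be validated the same way before Prometheus loads them:
# Check rule syntax and expressions
promtool check rules monitoring/rules/sockudo-alerts.yml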

Health Checks

Application Health

Sockudo provides health check endpoints:
# General health check
curl http://localhost:6001/up

# App-specific health check
curl http://localhost:6001/up/my-app-id
Response:
{
  "status": "ok",
  "timestamp": 1709251234
}
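
In deployment scripts, the endpoint can gate traffic cutover; a minimal sketch that waits up to 30 seconds for the node to report healthy:
# Poll the health endpoint before routing traffic
for i in $(seq 1 30); do
  curl -fsS http://localhost:6001/up && break
  sleep 1
done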

Docker Health Check

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:6001/up/my-app"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 30s

Kubernetes Liveness/Readiness

apiVersion: v1
kind: Pod
metadata:
  name: sockudo
spec:
  containers:
  - name: sockudo
    image: sockudo:latest
    ports:
    - containerPort: 6001
    - containerPort: 9601
    
    livenessProbe:
      httpGet:
        path: /up
        port: 6001
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    
    readinessProbe:
      httpGet:
        path: /up
        port: 6001
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 2

Logging

JSON Logging (Production)

Enable structured JSON logs:
environment:
  LOG_OUTPUT_FORMAT: "json"
  LOG_INCLUDE_TARGET: "true"
  RUST_LOG: "info,sockudo=info"

logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
Log format:
{
  "timestamp": "2024-03-01T12:34:56.789Z",
  "level": "INFO",
  "target": "sockudo::websocket",
  "fields": {
    "message": "WebSocket connection established",
    "socket_id": "abc123",
    "app_id": "my-app",
    "remote_addr": "192.168.1.100"
  }
}
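
JSON output pairs well with jq for ad-hoc filtering; for example (assuming the container is named sockudo):
# Show only error-level log lines
docker logs sockudo 2>&1 | jq -c 'select(.level == "ERROR")'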

Log Levels

# Production: minimal logging
RUST_LOG="warn,sockudo=info"

# Development: detailed logging
RUST_LOG="debug,sockudo=debug"

# Troubleshooting: verbose logging
RUST_LOG="trace,sockudo=trace"

# Module-specific logging
RUST_LOG="info,sockudo::websocket=debug,sockudo::adapter=trace"

Log Aggregation

Fluentd

services:
  sockudo:
    logging:
      driver: fluentd
      options:
        fluentd-address: localhost:24224
        tag: sockudo.{{.Name}}

Loki

services:
  sockudo:
    logging:
      driver: loki
      options:
        loki-url: "http://loki:3100/loki/api/v1/push"
        loki-external-labels: "job=sockudo,environment=production"
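
Note that the Loki logging driver is a Docker plugin and must be installed on each host before this compose file will start:
docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions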

Grafana Dashboards

Grafana Setup

services:
  grafana:
    image: grafana/grafana:latest
    container_name: sockudo-grafana
    restart: unless-stopped
    
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_ANALYTICS_REPORTING_ENABLED: "false"
    
    ports:
      - "3000:3000"
    
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./monitoring/grafana/datasources:/etc/grafana/provisioning/datasources:ro
    
    networks:
      - sockudo-network

volumes:
  grafana-data:

Datasource Configuration

Create monitoring/grafana/datasources/prometheus.yml:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

Dashboard Panels

Key panels to include:
  1. Connection Count: sockudo_connections_active
  2. Message Rate: rate(sockudo_messages_sent_total[1m])
  3. Error Rate: rate(sockudo_connection_errors_total[5m])
  4. Channel Count: sockudo_channels_active
  5. Memory Usage: process_resident_memory_bytes
  6. CPU Usage: rate(process_cpu_seconds_total[1m])
  7. Buffer Usage: sockudo_websocket_buffer_usage
  8. Rate Limits: sockudo_rate_limit_exceeded_total
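
For Grafana to load dashboard JSON files from the mounted directory, add a provisioning provider; a minimal sketch for monitoring/grafana/dashboards/provider.yml (the provider and folder names are assumptions):
apiVersion: 1

providers:
  - name: 'sockudo'
    folder: 'Sockudo'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards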

Cluster Health Monitoring

Sockudo includes built-in cluster health tracking:
{
  "adapter": {
    "cluster_health": {
      "enabled": true,
      "heartbeat_interval_ms": 10000,
      "node_timeout_ms": 30000,
      "cleanup_interval_ms": 10000
    }
  }
}
Features:
  • Automatic node discovery
  • Heartbeat monitoring (every 10s)
  • Failed node detection (30s timeout)
  • Automatic cleanup of stale connections
Metrics:
# Active cluster nodes
sockudo_cluster_nodes_active 3

# Failed nodes detected
sockudo_cluster_nodes_failed_total 1

# Heartbeat failures
sockudo_cluster_heartbeat_failures_total 5
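
These gauges feed naturally into the alert rules shown earlier; a hedged example to add alongside them, assuming an expected cluster size of 3:
# Fire when the cluster loses a node (expected size is an assumption)
- alert: ClusterNodeMissing
  expr: sockudo_cluster_nodes_active < 3
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Sockudo cluster has fewer active nodes than expected"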

Alerting

Alertmanager Setup

services:
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager-data:/alertmanager
    networks:
      - sockudo-network

volumes:
  alertmanager-data:
Create monitoring/alertmanager.yml:
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'slack'
  
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - channel: '#sockudo-alerts'
        title: 'Sockudo Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
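
amtool, which ships with Alertmanager, can validate the routing tree before deployment:
# Check receiver and route configuration
amtool check-config monitoring/alertmanager.yml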

Monitoring Checklist

  • Prometheus scraping all nodes
  • Alert rules configured
  • Grafana dashboards created
  • Health checks passing
  • JSON logging enabled
  • Log aggregation configured
  • Alertmanager routing set up
  • On-call rotation defined
  • Runbook documentation complete
  • Cluster health monitoring enabled
