Monitoring and Metrics

Overview

CloudGaming provides multi-layer monitoring across WebRTC streaming, signaling infrastructure, and host health. This guide covers all available metrics, health checks, and monitoring best practices.

WebRTC Statistics

Real-Time Transport Metrics

The Go/Pion WebRTC implementation tracks comprehensive transport statistics.

Packet Loss and Retransmission

// Tracked in gortc_main/main.go:236-245
type WebRTCStats struct {
    nackCount        uint32  // NACK (negative acknowledgment) count
    pliCount         uint32  // Picture Loss Indication count
    twccCount        uint32  // Transport-wide congestion control count
    pacerQueueLength uint32  // Pacer queue depth
    sendBitrateKbps  uint32  // Estimated send bitrate
}

Available Metrics:

Packet Loss - Percentage of RTP packets lost
RTT (Round-Trip Time) - Network round-trip latency in milliseconds
Jitter - Variation in packet arrival time, in milliseconds
NACK Count - Number of retransmission requests (negative acknowledgments)
PLI Count - Number of keyframe requests (Picture Loss Indication)
TWCC Count - Transport-wide congestion control feedback count
Send Bitrate - Estimated send bitrate in kbps
Pacer Queue Length - Number of frames waiting to send

Stats Monitoring Implementation

// Stats updated every 500ms (gortc_main/main.go:248-268)
func startStatsMonitoring() {
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()
    
    audioHealthTicker := time.NewTicker(5 * time.Second)
    defer audioHealthTicker.Stop()
    
    for {
        select {
        case <-ticker.C:
            updatePacerQueueLength()
        case <-audioHealthTicker.C:
            reportAudioQueueHealth()
        }
    }
}

RTCP Feedback

RTCP (RTP Control Protocol) provides real-time feedback, which is reported through the following stats callback:

// Callback signature (gortc_main/main.go:16-19)
typedef void (*WebRTCStatsCallback)(
    double packetLoss, double rtt, double jitter,
    unsigned int nackCount, unsigned int pliCount, 
    unsigned int twccCount, unsigned int pacerQueueLength, 
    unsigned int sendBitrateKbps
);

Log Example:

[Go/Pion] WebRTC Stats: loss=0.5%, rtt=28ms, jitter=2.1ms, nack=12, pli=1, bitrate=8500kbps, queue=2
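
For log aggregation it can help to turn these lines back into numbers. The following is a minimal sketch, assuming the log format matches the single example above; statsSample and parseStatsLine are hypothetical names that do not come from gortc_main.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// statsSample holds the fields of one "WebRTC Stats" log line.
type statsSample struct {
	LossPct, RTTMs, JitterMs      float64
	Nack, Pli, BitrateKbps, Queue int
}

// statsLine mirrors the log example above. The format is inferred from that
// single example, so adjust the pattern if the real output differs.
var statsLine = regexp.MustCompile(
	`WebRTC Stats: loss=(\d+(?:\.\d+)?)%, rtt=(\d+(?:\.\d+)?)ms, jitter=(\d+(?:\.\d+)?)ms,\s+` +
		`nack=(\d+), pli=(\d+), bitrate=(\d+)kbps, queue=(\d+)`)

// parseStatsLine extracts the metrics from a log line. ok is false when the
// line is not a WebRTC Stats line or a number does not fit its type.
func parseStatsLine(line string) (s statsSample, ok bool) {
	m := statsLine.FindStringSubmatch(line)
	if m == nil {
		return s, false
	}
	floats := []*float64{&s.LossPct, &s.RTTMs, &s.JitterMs}
	ints := []*int{&s.Nack, &s.Pli, &s.BitrateKbps, &s.Queue}
	for i, dst := range floats {
		v, err := strconv.ParseFloat(m[1+i], 64)
		if err != nil {
			return statsSample{}, false
		}
		*dst = v
	}
	for i, dst := range ints {
		v, err := strconv.Atoi(m[4+i])
		if err != nil {
			return statsSample{}, false
		}
		*dst = v
	}
	return s, true
}

func main() {
	line := "[Go/Pion] WebRTC Stats: loss=0.5%, rtt=28ms, jitter=2.1ms, nack=12, pli=1, bitrate=8500kbps, queue=2"
	if s, ok := parseStatsLine(line); ok {
		fmt.Printf("%+v\n", s)
	}
}
```

The pattern accepts any whitespace after the jitter field, so it also matches output that a log viewer has wrapped onto two lines.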

Audio Queue Monitoring

Queue Health Metrics

Audio queue depth indicates network congestion:

// Reported every 5 seconds (gortc_main/main.go:486-532)
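func reportAudioQueueHealth() {
    avgDepth := getAverageAudioQueueDepth()
    currentDepth := len(audioSendQueue)

    // Classify by average depth, checking the stricter threshold first.
    healthStatus := "GOOD"
    if avgDepth > 2.8 {
        healthStatus = "CRITICAL"
    } else if avgDepth > 2.0 {
        healthStatus = "WARNING"
    }

    // Simplified log line; the full output also includes min, max, and sample count.
    log.Printf("[Go/Pion] Audio Queue Health [%s]: current=%d, avg=%.1f",
        healthStatus, currentDepth, avgDepth)
}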

Health Status Thresholds:

GOOD: Average queue depth ≤ 2.0 packets
WARNING: Average queue depth > 2.0 and ≤ 2.8 packets
CRITICAL: Average queue depth > 2.8 packets
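
The thresholds can be expressed as a small pure function, which makes the boundary behavior easy to check. This is a sketch rather than the code in gortc_main: queueHealth and averageDepth are hypothetical helpers, and the sample depths are made up to give an average of 2.4.

```go
package main

import "fmt"

const (
	warningDepth  = 2.0 // averages above this are WARNING
	criticalDepth = 2.8 // averages above this are CRITICAL
)

// queueHealth maps an average queue depth to a status string.
// An average of exactly 2.0 is still GOOD and exactly 2.8 is still WARNING,
// because both comparisons are strict.
func queueHealth(avgDepth float64) string {
	switch {
	case avgDepth > criticalDepth:
		return "CRITICAL"
	case avgDepth > warningDepth:
		return "WARNING"
	default:
		return "GOOD"
	}
}

// averageDepth returns the mean of the sampled depths, or 0 for no samples.
func averageDepth(samples []int) float64 {
	if len(samples) == 0 {
		return 0
	}
	sum := 0
	for _, d := range samples {
		sum += d
	}
	return float64(sum) / float64(len(samples))
}

func main() {
	samples := []int{1, 2, 3, 4, 2, 3, 2, 2, 3, 2} // hypothetical depths
	avg := averageDepth(samples)
	fmt.Printf("avg=%.1f status=%s\n", avg, queueHealth(avg)) // avg=2.4 status=WARNING
}
```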

Log Example:

[Go/Pion] Audio Queue Health [WARNING]: current=3, avg=2.4, min=1, max=4, samples=10
[Go/Pion] ⚠️  Audio queue consistently congested - consider bitrate reduction

Buffer Pool Health

Memory Management Monitoring

The tiered buffer pool tracks allocation efficiency:

// Health check (gortc_main/main.go:310-344)
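func checkBufferPoolHealth() {
    // sum is a helper (not shown) that totals the per-tier counters.
    totalHits := sum(sampleBufPool.hits)
    totalMisses := sum(sampleBufPool.misses)
    total := totalHits + totalMisses
    if total == 0 {
        return // no requests yet, nothing to report
    }

    // Express the hit rate as a percentage so it matches the 85.0 threshold
    // and the %.1f%% format below.
    hitRate := 100 * float64(totalHits) / float64(total)

    if hitRate < 85.0 {
        log.Printf("⚠️  Low hit rate %.1f%%", hitRate)
    }
}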

Performance Indicators:

Hit Rate 95%+: Excellent - minimal allocations
Hit Rate 90-95%: Good - some allocations expected
Hit Rate 80-90%: Moderate - consider pool tuning
Hit Rate below 80%: Poor - high GC pressure
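
As a rough illustration of how these bands apply to raw counters, the sketch below computes a hit rate as a percentage and maps it to a rating. The function names are hypothetical, and because the listed ranges share their endpoints, the sketch assumes a boundary value belongs to the better band (90% counts as Good).

```go
package main

import "fmt"

// hitRatePercent returns hits as a percentage of all requests.
// ok is false when there were no requests, so callers never divide by zero.
func hitRatePercent(hits, misses uint64) (pct float64, ok bool) {
	total := hits + misses
	if total == 0 {
		return 0, false
	}
	return 100 * float64(hits) / float64(total), true
}

// rateHitRate maps a hit-rate percentage to the performance bands listed above.
// Assumption: a value on a shared boundary falls into the better band.
func rateHitRate(pct float64) string {
	switch {
	case pct >= 95:
		return "Excellent"
	case pct >= 90:
		return "Good"
	case pct >= 80:
		return "Moderate"
	default:
		return "Poor"
	}
}

func main() {
	// Counts taken from the Tier 4 line of this guide's buffer pool log example.
	if pct, ok := hitRatePercent(1523, 12); ok {
		fmt.Printf("%.1f%% -> %s\n", pct, rateHitRate(pct)) // 99.2% -> Excellent
	}
}
```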

Log Example:

[Go/Pion] Buffer Pool Statistics:
  Tier 4 (4096 bytes): 1523 hits, 12 misses, 12 allocs (99.2% hit rate)
  Tier 7 (32768 bytes): 8901 hits, 45 misses, 45 allocs (99.5% hit rate)
  Overall: 15234 requests, 99.4% hit rate, 89 total allocations
  ✅ Excellent performance - minimal allocations

Signaling Server Metrics

Prometheus Metrics Endpoint

The signaling server exposes metrics at /metrics:

curl http://localhost:3002/metrics

Available Metrics

Connection Metrics:

# Active WebSocket connections
signaling_active_connections 42

# Rooms with local connections  
signaling_local_rooms 8

Message Processing:

# Total messages forwarded
signaling_messages_forwarded_total 15234

# Schema validation rejections
signaling_schema_rejections_total 3

# Rate limit drops
signaling_rate_limit_drops_total 12

# Backpressure connection closes
signaling_backpressure_closes_total 0

Redis Health:

# Redis connection status (1=up, 0=down)
signaling_redis_up 1

# Circuit breaker status (1=open, 0=closed)
signaling_circuit_breaker_open 0

# Redis command latency histogram
signaling_redis_cmd_latency_seconds_bucket{le="0.005"} 1234
signaling_redis_cmd_latency_seconds_bucket{le="0.01"} 1240
signaling_redis_cmd_latency_seconds_bucket{le="0.025"} 1245

Fanout Performance:

# Local message fanout latency
signaling_fanout_latency_seconds_bucket{le="0.001"} 5678
signaling_fanout_latency_seconds_bucket{le="0.005"} 5690
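
For quick checks outside Prometheus, the sample lines above can be read with a few lines of Go. This is a deliberately small sketch, not a full parser for the Prometheus text format: it assumes one sample per line with no trailing timestamp, and parseMetrics is a hypothetical helper.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMetrics reads "name value" sample lines and returns them keyed by the
// metric name, including any {label="..."} part. Comment lines, blank lines,
// and lines whose last field is not a number are skipped.
func parseMetrics(text string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// Split on the last space so label values containing spaces survive.
		i := strings.LastIndexByte(line, ' ')
		if i < 0 {
			continue
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			continue
		}
		out[strings.TrimSpace(line[:i])] = v
	}
	return out
}

func main() {
	sample := `# Active WebSocket connections
signaling_active_connections 42
# Redis connection status (1=up, 0=down)
signaling_redis_up 1
signaling_redis_cmd_latency_seconds_bucket{le="0.005"} 1234
`
	m := parseMetrics(sample)
	fmt.Println("connections:", m["signaling_active_connections"])
	fmt.Println("redis up:", m["signaling_redis_up"] == 1)
}
```

In practice the input would be the body returned by the /metrics endpoint; a production setup would normally let Prometheus scrape it instead.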

Implementation Reference

See Server/metrics.js:1-117 for the complete metrics implementation.

Matchmaker Monitoring

Host Health Tracking

The matchmaker monitors host heartbeats.

Heartbeat Endpoint:

POST /api/host/heartbeat
Authorization: Bearer <HOST_SECRET>

{
  "hostId": "550e8400-e29b-41d4-a716-446655440000",
  "roomId": "game-room-1",
  "region": "us-west",
  "status": "idle",
  "capacity": 1,
  "availableSlots": 1
}

Response:

{
  "success": true,
  "ttl": 30
}
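
A host-side client could build this request as sketched below. The struct fields follow the JSON example above; the base URL, the secret value, the Content-Type header, and the function name newHeartbeatRequest are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// heartbeat mirrors the request body shown above.
type heartbeat struct {
	HostID         string `json:"hostId"`
	RoomID         string `json:"roomId"`
	Region         string `json:"region"`
	Status         string `json:"status"`
	Capacity       int    `json:"capacity"`
	AvailableSlots int    `json:"availableSlots"`
}

// newHeartbeatRequest builds the POST request without sending it.
// baseURL and hostSecret are supplied by the caller.
func newHeartbeatRequest(baseURL, hostSecret string, hb heartbeat) (*http.Request, error) {
	body, err := json.Marshal(hb)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/host/heartbeat", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+hostSecret)
	req.Header.Set("Content-Type", "application/json") // assumed; not stated in this guide
	return req, nil
}

func main() {
	// Hypothetical matchmaker address and secret.
	req, err := newHeartbeatRequest("http://localhost:3000", "example-secret", heartbeat{
		HostID: "550e8400-e29b-41d4-a716-446655440000", RoomID: "game-room-1",
		Region: "us-west", Status: "idle", Capacity: 1, AvailableSlots: 1,
	})
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)
	// To send it: resp, err := http.DefaultClient.Do(req), then close resp.Body.
}
```

Because the response carries a ttl of 30 seconds, a host would need to repeat this call at a shorter interval to stay registered.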

Host TTL Monitoring

GET /api/hosts/ttl

Response:

[
  {
    "hostId": "550e8400-e29b-41d4-a716-446655440000",
    "ttlSeconds": 28
  }
]
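
One way to use this response is to flag hosts whose key is close to expiring. The sketch below decodes the array shown above and returns the IDs under a threshold; hostsNearExpiry is a hypothetical helper, and the 10-second value used in main matches the warning threshold this guide lists for host heartbeat TTL.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hostTTL mirrors one element of the /api/hosts/ttl response.
type hostTTL struct {
	HostID     string `json:"hostId"`
	TTLSeconds int    `json:"ttlSeconds"`
}

// hostsNearExpiry returns the IDs of hosts whose remaining TTL is below
// thresholdSeconds, in response order.
func hostsNearExpiry(payload []byte, thresholdSeconds int) ([]string, error) {
	var hosts []hostTTL
	if err := json.Unmarshal(payload, &hosts); err != nil {
		return nil, err
	}
	var ids []string
	for _, h := range hosts {
		if h.TTLSeconds < thresholdSeconds {
			ids = append(ids, h.HostID)
		}
	}
	return ids, nil
}

func main() {
	// Second entry is a made-up host that has missed recent heartbeats.
	payload := []byte(`[
	  {"hostId": "550e8400-e29b-41d4-a716-446655440000", "ttlSeconds": 28},
	  {"hostId": "hypothetical-host-2", "ttlSeconds": 7}
	]`)
	ids, err := hostsNearExpiry(payload, 10)
	if err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Println("near expiry:", ids)
}
```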

Stale Host Pruning:

// Runs every 10 seconds (mm_server/Matchmaker.js:189-206)
async function pruneStaleIdleHosts() {
    const stale = [];
    const ids = await redisClient.sMembers('idle_hosts');
    for (const id of ids) {
        const ttl = await redisClient.ttl(`host:${id}`);
        if (ttl === -2) {  // TTL of -2 means the key no longer exists (expired)
            stale.push(id);
        }
    }
    if (stale.length > 0) {
        await redisClient.sRem('idle_hosts', stale);
    }
}

Health Check Endpoints

Signaling Server

Liveness Probe:

GET /healthz
# Returns: 200 OK

Readiness Probe:

GET /readyz
# Returns: 200 "ready" if Redis is connected
# Returns: 503 "not-ready" if Redis is down or draining

Matchmaker

Health Endpoints:

GET /healthz   # Liveness
GET /readyz    # Readiness  
GET /health    # General health
GET /          # Returns "ok"

All four endpoints return 200 OK immediately so that Railway's health checks do not kill the container.

Host Configuration Monitoring

Monitor these key settings from config.json:

Video Configuration

{
  "video": {
    "fps": 60,
    "bitrateStart": 8000000,
    "bitrateMin": 8000000,
    "bitrateMax": 12000000,
    "preset": "p2",
    "rc": "cbr"
  }
}
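
When monitoring these values, a simple sanity check can catch a misconfigured file early. The sketch below parses the video section and verifies that the three bitrates are ordered. The ordering rule and the function name checkVideoConfig are assumptions, not documented constraints of the host.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// videoConfig mirrors the "video" object shown above.
type videoConfig struct {
	FPS          int    `json:"fps"`
	BitrateStart int    `json:"bitrateStart"`
	BitrateMin   int    `json:"bitrateMin"`
	BitrateMax   int    `json:"bitrateMax"`
	Preset       string `json:"preset"`
	RC           string `json:"rc"`
}

type hostConfig struct {
	Video videoConfig `json:"video"`
}

// checkVideoConfig decodes the config and applies an assumed sanity rule:
// bitrateMin <= bitrateStart <= bitrateMax.
func checkVideoConfig(raw []byte) (videoConfig, error) {
	var cfg hostConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return videoConfig{}, err
	}
	v := cfg.Video
	if v.BitrateMin > v.BitrateStart || v.BitrateStart > v.BitrateMax {
		return v, errors.New("expected bitrateMin <= bitrateStart <= bitrateMax")
	}
	return v, nil
}

func main() {
	raw := []byte(`{"video": {"fps": 60, "bitrateStart": 8000000, "bitrateMin": 8000000,
	  "bitrateMax": 12000000, "preset": "p2", "rc": "cbr"}}`)
	v, err := checkVideoConfig(raw)
	if err != nil {
		fmt.Println("config problem:", err)
		return
	}
	fmt.Printf("fps=%d start=%d max=%d\n", v.FPS, v.BitrateStart, v.BitrateMax)
}
```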

Capture Settings

{
  "capture": {
    "mmcss": { "enable": true, "priority": 4 },
    "maxQueueDepth": 2,
    "skipUnchanged": true
  }
}

Audio Configuration

{
  "audio": {
    "bitrate": 80000,
    "frameSizeMs": 10,
    "enableFec": true,
    "latency": {
      "enforceSingleFrameBuffering": true,
      "targetOneWayLatencyMs": 40
    }
  }
}

Redis Monitoring

Circuit Breaker

The circuit breaker protects the signaling server against Redis failures:

// Server/ScalableSignalingServer.js:68-84
function noteRedisFailure() {
    redisFailureCount += 1;
    if (redisFailureCount >= config.cbErrorThreshold) {
        redisCircuitOpenUntil = Date.now() + config.cbOpenMs;
        setCircuitBreakerOpen(true);
    }
}

When the circuit opens:

New connections are rejected with WebSocket close code 1013 (Try Again Later) and the reason "Service unavailable"
Existing connections continue working
The circuit closes automatically after the timeout (config.cbOpenMs)
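
The open and auto-close behavior described above can be modeled in a few lines. This Go sketch is an illustration of the pattern, not a port of ScalableSignalingServer.js: the type and method names are hypothetical, time is passed in as milliseconds to keep it testable, and resetting the failure count on success is an assumption the excerpt above does not show.

```go
package main

import "fmt"

// breaker opens after errorThreshold failures and stays open for openMs.
type breaker struct {
	errorThreshold int   // failures needed to open the circuit
	openMs         int64 // how long the circuit stays open, in milliseconds
	failures       int
	openUntilMs    int64
}

// noteFailure records one failure at time nowMs and opens the circuit once
// the threshold is reached.
func (b *breaker) noteFailure(nowMs int64) {
	b.failures++
	if b.failures >= b.errorThreshold {
		b.openUntilMs = nowMs + b.openMs
	}
}

// noteSuccess clears the failure count (assumed behavior).
func (b *breaker) noteSuccess() { b.failures = 0 }

// isOpen reports whether new connections should be rejected at time nowMs.
// The circuit closes on its own once nowMs reaches openUntilMs.
func (b *breaker) isOpen(nowMs int64) bool {
	return nowMs < b.openUntilMs
}

func main() {
	b := &breaker{errorThreshold: 3, openMs: 5000} // hypothetical settings
	now := int64(0)
	for i := 0; i < 3; i++ {
		b.noteFailure(now)
	}
	fmt.Println("open right after failures:", b.isOpen(now))   // true
	fmt.Println("open after the timeout:", b.isOpen(now+5000)) // false
}
```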

Connection Status

// Check Redis connectivity
const pong = await redisClient.ping();
if (pong === 'PONG') {
    // Redis is healthy
}

Monitoring Best Practices

Alerting Thresholds

Critical Alerts:

WebRTC packet loss > 5%
RTT > 150ms for a sustained period
Audio queue depth > 2.8 (CRITICAL)
Buffer pool hit rate < 80%
Redis circuit breaker open
Signaling server Redis disconnected

Warning Alerts:

WebRTC packet loss > 2%
RTT > 100ms
Audio queue depth > 2.0 (WARNING)
Buffer pool hit rate < 90%
Host heartbeat TTL < 10 seconds
Rate limit drops increasing
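
The numeric thresholds above can be encoded as a small evaluation function, sketched here under some simplifications: only the four numeric metrics are covered, the "sustained period" condition for RTT is not modeled, and all type and function names are hypothetical.

```go
package main

import "fmt"

type severity int

const (
	sevOK severity = iota
	sevWarning
	sevCritical
)

func (s severity) String() string {
	switch s {
	case sevCritical:
		return "CRITICAL"
	case sevWarning:
		return "WARNING"
	default:
		return "OK"
	}
}

// above rates a metric where higher is worse (loss, RTT, queue depth).
func above(v, warn, crit float64) severity {
	switch {
	case v > crit:
		return sevCritical
	case v > warn:
		return sevWarning
	default:
		return sevOK
	}
}

// below rates a metric where lower is worse (buffer pool hit rate).
func below(v, warn, crit float64) severity {
	switch {
	case v < crit:
		return sevCritical
	case v < warn:
		return sevWarning
	default:
		return sevOK
	}
}

// sample is one snapshot of the monitored values.
type sample struct {
	PacketLossPct, RTTMs, AudioQueueAvg, PoolHitRatePct float64
}

// evaluate applies the warning and critical thresholds listed above.
func evaluate(s sample) map[string]severity {
	return map[string]severity{
		"packet_loss":   above(s.PacketLossPct, 2, 5),
		"rtt":           above(s.RTTMs, 100, 150),
		"audio_queue":   above(s.AudioQueueAvg, 2.0, 2.8),
		"pool_hit_rate": below(s.PoolHitRatePct, 90, 80),
	}
}

func main() {
	// Hypothetical snapshot: healthy network, congested audio queue.
	r := evaluate(sample{PacketLossPct: 0.5, RTTMs: 28, AudioQueueAvg: 2.4, PoolHitRatePct: 96})
	for _, k := range []string{"packet_loss", "rtt", "audio_queue", "pool_hit_rate"} {
		fmt.Printf("%s: %s\n", k, r[k])
	}
}
```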

Log Aggregation

Key Log Patterns:

# WebRTC stats
grep "WebRTC Stats" logs.txt

# Audio health
grep "Audio Queue Health" logs.txt

# Buffer pool performance
grep "Buffer Pool" logs.txt

# Redis errors
grep "Redis" logs.txt | grep -E "error|failed"

# Connection issues
grep -E "ICE|connection state" logs.txt

Grafana Dashboard Example

Panels to Include:

Active connections (signaling_active_connections)
Message throughput (rate(signaling_messages_forwarded_total[1m]))
Redis latency (signaling_redis_cmd_latency_seconds)
WebRTC packet loss percentage
Audio queue depth over time
Buffer pool hit rate
Host heartbeat count

Next Steps

Performance Tuning - Tune settings based on monitoring data
Troubleshooting - Debug issues found in monitoring