Performance Troubleshooting

This guide helps diagnose and address potential performance issues when running Retina, particularly the packetparser plugin on high-core-count systems.

Background

Community users have reported performance overhead when running the packetparser plugin (used in Advanced metrics mode) on systems with high CPU core counts under sustained network load. For detailed background, see the packetparser performance considerations.

Symptoms to Monitor

Watch for these indicators after deploying Retina:

Decreased network throughput compared to baseline
High CPU usage by Retina agent pods
Elevated context switches on nodes running Retina
Increased latency in network-intensive applications

Diagnostic Steps

Step 1: Identify Your Configuration

Check which plugins are enabled:

kubectl get configmap retina-config -n kube-system -o yaml | grep enabledPlugin

If packetparser is enabled, you're running Advanced metrics mode, which is more resource-intensive than Basic metrics mode.
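The check above can be wrapped in a small helper that names the mode outright. This is a minimal sketch: `metrics_mode` is a hypothetical function name, and it assumes the plugin list sits on a single `enabledPlugin` line, just as the grep above does.

```shell
# metrics_mode (hypothetical helper): reads ConfigMap YAML on stdin and
# prints "advanced" if packetparser appears on the enabledPlugin line,
# otherwise "basic". Assumes the plugin list is on one line.
metrics_mode() {
  if grep -E '^[[:space:]]*enabledPlugin' | grep -q 'packetparser'; then
    echo "advanced"
  else
    echo "basic"
  fi
}

# Usage against a live cluster:
#   kubectl get configmap retina-config -n kube-system -o yaml | metrics_mode
```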

Step 2: Check Node Specifications

# Check core count on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu

# Identify nodes with high core counts (32+)
kubectl get nodes -o json | jq '.items[] | select((.status.capacity.cpu | tonumber) >= 32) | {name: .metadata.name, cpu: .status.capacity.cpu}'
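Where jq is not installed, the same filter can be applied to the custom-columns output with awk. A minimal sketch: `high_core_nodes` is a hypothetical name, and it assumes the CPU column holds a plain core count.

```shell
# high_core_nodes (hypothetical helper): reads "NAME CPU" rows on stdin,
# skips the header row, and prints the names of nodes whose core count
# is at or above the threshold (default 32).
high_core_nodes() {
  threshold="${1:-32}"
  awk -v t="$threshold" 'NR > 1 && $2 + 0 >= t { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu \
#     | high_core_nodes 32
```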

Step 3: Monitor Retina Resource Usage

# Check CPU and memory usage of Retina pods
kubectl top pods -n kube-system -l app=retina

# For more detailed analysis, check a specific pod (here, the first one returned)
RETINA_POD=$(kubectl get pods -n kube-system -l app=retina -o jsonpath='{.items[0].metadata.name}')
kubectl top pod "$RETINA_POD" -n kube-system
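To pick out only the pods that look busy, the `kubectl top` output can be filtered by a CPU threshold. This is a sketch under assumptions: `busy_pods` is a hypothetical name, the 500m default is an arbitrary illustration rather than a recommended limit, and the CPU column is assumed to be reported in millicores (for example `850m`).

```shell
# busy_pods (hypothetical helper): reads `kubectl top pods` output on
# stdin and prints the name and CPU of pods above a millicore limit.
# "$2 + 0" lets awk read the leading number out of a value like "850m".
busy_pods() {
  limit_m="${1:-500}"
  awk -v l="$limit_m" 'NR > 1 && $2 + 0 > l { print $1, $2 }'
}

# Usage against a live cluster:
#   kubectl top pods -n kube-system -l app=retina | busy_pods 500
```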

Step 4: Establish Performance Baseline

Before and after Retina deployment, measure:

Network throughput (using your application's metrics or tools like iperf3)
Application response times
CPU utilization on nodes
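Once both measurements exist, the change can be expressed as a percentage so runs are comparable. A minimal sketch: `pct_change` is a hypothetical helper, and the figures in the usage comment are made-up placeholders, not measured results.

```shell
# pct_change (hypothetical helper): prints the percentage change from a
# baseline value ($1) to a value measured after deployment ($2).
# Negative output means the metric dropped.
pct_change() {
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (a - b) * 100 / b }'
}

# Usage with illustrative throughput figures in Gbit/s:
#   pct_change 10 8
```

For throughput, a negative value indicates a regression; for latency or CPU utilization, a positive value does.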

Mitigation Options

If you observe performance impact, consider these approaches:

Option 1: Use Basic Metrics Mode (Recommended)

Basic metrics mode provides node-level observability without the packetparser plugin:

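# Reinstall or upgrade Retina without packetparser
# (--namespace assumes the release lives in kube-system, as in the other commands in this guide)
helm upgrade retina oci://ghcr.io/microsoft/retina/charts/retina \
    --namespace kube-system \
    --set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\]" \
    --reuse-values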

Trade-off: You'll have node-level metrics only, not pod-level metrics.
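The escaped list passed to `--set` is easy to mistype. The sketch below builds it from a plain list of plugin names and leaves out packetparser; `basic_plugin_value` is a hypothetical helper, not part of Retina or Helm.

```shell
# basic_plugin_value (hypothetical helper): takes plugin names as
# arguments, drops packetparser, and prints the remainder in the
# backslash-escaped form used with --set above.
basic_plugin_value() {
  out=""
  for p in "$@"; do
    if [ "$p" != "packetparser" ]; then
      # Append "\," as the separator once the list is non-empty.
      out="${out:+$out\\,}$p"
    fi
  done
  printf '\\[%s\\]\n' "$out"
}

# Usage:
#   basic_plugin_value dropreason packetforward linuxutil dns packetparser
```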

Option 2: Enable Data Sampling

Reduce event volume by sampling packets:

apiVersion: v1
kind: ConfigMap
metadata:
  name: retina-config
  namespace: kube-system
data:
  config.yaml: |
    dataSamplingRate: 10  # Sample 1 out of every 10 packets

Trade-off: Reduced data granularity, but lower overhead.
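As a rough guide to choosing a rate, the expected event volume can be estimated by dividing the packet rate by the sampling rate. This is a back-of-the-envelope sketch that assumes uniform 1-in-N sampling, as the comment in the config describes; `sampled_events` is a hypothetical helper.

```shell
# sampled_events (hypothetical helper): estimates events per second
# given a packet rate ($1, packets/s) and a dataSamplingRate ($2),
# assuming 1 out of every N packets is sampled.
sampled_events() {
  echo $(( $1 / $2 ))
}

# Usage: 1,000,000 packets/s at dataSamplingRate 10
#   sampled_events 1000000 10
```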

Option 3: Use High Data Aggregation Level

Reduce events at the eBPF level:

apiVersion: v1
kind: ConfigMap
metadata:
  name: retina-config
  namespace: kube-system
data:
  config.yaml: |
    dataAggregationLevel: "high"

Trade-off: Disables host interface monitoring; API server latency metrics may be less reliable.
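After applying the change, it can help to confirm which level the ConfigMap actually carries. A minimal sketch: `aggregation_level` is a hypothetical helper, and it assumes the key appears on its own line as in the example above.

```shell
# aggregation_level (hypothetical helper): reads ConfigMap YAML on stdin
# and prints the value of dataAggregationLevel, with or without quotes.
aggregation_level() {
  sed -n 's/^[[:space:]]*dataAggregationLevel:[[:space:]]*"\{0,1\}\([a-z]*\)"\{0,1\}.*/\1/p'
}

# Usage against a live cluster:
#   kubectl get configmap retina-config -n kube-system -o yaml | aggregation_level
```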

Option 4: Selective Deployment

Deploy Retina only on nodes where you need detailed observability:

# Use node selectors or taints/tolerations
apiVersion: apps/v1
kind: DaemonSet
spec:
  template:
    spec:
      nodeSelector:
        retina-enabled: "true"
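The node selector only matches nodes that carry the label, so the chosen nodes need labeling first. The sketch below prints the `kubectl label` commands for review rather than running them; `label_commands` is a hypothetical helper, and the `retina-enabled=true` label simply mirrors the selector above.

```shell
# label_commands (hypothetical helper): reads node names on stdin, one
# per line, and prints a `kubectl label` command for each so the list
# can be reviewed before anything is applied.
label_commands() {
  while read -r node; do
    if [ -n "$node" ]; then
      printf 'kubectl label node %s retina-enabled=true\n' "$node"
    fi
  done
}

# Usage: review the output, then pipe it to sh to apply
#   printf 'node-a\nnode-b\n' | label_commands
```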

Advanced Diagnostics

Inspecting eBPF Maps

To see what data structures Retina is using:

# Access the node
kubectl debug node/<node-name> -it --image=ubuntu

# In the debug container, enter the host namespace
chroot /host

# List BPF maps (requires bpftool)
bpftool map list | grep retina

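# Check the packetparser map type
# (the kernel keeps only the first 15 characters of a map name, so if the
# full name is rejected, try the truncated form retina_packetpa)
bpftool map show name retina_packetparser_events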

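Currently, packetparser uses BPF_MAP_TYPE_PERF_EVENT_ARRAY, which bpftool reports as type perf_event_array. This map type keeps a separate buffer per CPU, so its overhead may grow with core count.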
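To summarize the map types at a glance, the plain `bpftool map list` output can be reduced to name and type pairs. This sketch assumes the usual plain-text layout (`<id>: <type>  name <name>  flags ...`); `retina_map_types` is a hypothetical helper, and the sample line in the usage comment is illustrative, not captured from a real node.

```shell
# retina_map_types (hypothetical helper): reads `bpftool map list`
# output on stdin and prints "<name> <type>" for maps whose name
# starts with "retina".
retina_map_types() {
  awk '$3 == "name" && $4 ~ /^retina/ { print $4, $2 }'
}

# Usage on a node:
#   bpftool map list | retina_map_types
# Illustrative input line:
#   12: perf_event_array  name retina_packetpa  flags 0x0
```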

Monitoring Event Rates (Advanced)

If you have bpftrace available on nodes:

# Monitor perf_event activity
sudo bpftrace -e '
  kprobe:perf_event_output { @events = count(); }
  interval:s:5 { print(@events); clear(@events); }
'

High event rates may correlate with increased CPU usage.
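The script above prints a count per 5-second interval; dividing by the interval gives a per-second rate that is easier to compare across nodes. A minimal sketch: `events_per_second` is a hypothetical helper that assumes bpftrace prints lines of the form `@events: <count>`.

```shell
# events_per_second (hypothetical helper): reads bpftrace output on
# stdin and converts each "@events: N" line into a per-second rate,
# given the interval length in seconds (default 5).
events_per_second() {
  interval="${1:-5}"
  awk -F': ' -v i="$interval" '/^@events:/ { print $2 / i }'
}

# Usage: pipe the output of the bpftrace command above into
#   events_per_second 5
```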

Reporting Issues

If you experience performance issues, please report them with:

Node specifications: CPU count, memory, kernel version
Retina configuration: Version, enabled plugins, configuration settings
Workload characteristics: Network throughput, number of pods, traffic patterns
Performance metrics: CPU usage, network throughput before/after, specific observations

Open an issue at: https://github.com/microsoft/retina/issues
