Linux Performance Monitoring and Tuning with perf, eBPF and ftrace

Performance analysis and tuning are critical skills for Linux system administrators, DevOps engineers, and performance engineers. Understanding where bottlenecks occur and how to optimize system behavior requires deep knowledge of Linux performance tools. This comprehensive guide explores three powerful performance analysis frameworks: perf, eBPF (Extended Berkeley Packet Filter), and ftrace, demonstrating how to diagnose and resolve performance issues in production systems.

Understanding Linux Performance Analysis

Performance analysis in Linux involves understanding multiple subsystems: CPU, memory, disk I/O, network, and application behavior. The key to effective performance tuning is identifying bottlenecks through methodical observation and measurement.

Performance Analysis Methodology

Effective performance analysis follows a systematic approach:

  1. Define the problem: Establish clear performance goals and metrics
  2. Measure current performance: Gather baseline metrics
  3. Identify bottlenecks: Determine limiting factors
  4. Hypothesize causes: Form theories about performance issues
  5. Test hypotheses: Use tools to validate or refute theories
  6. Implement solutions: Apply optimizations
  7. Verify improvements: Measure impact of changes
  8. Repeat: Continue iterative improvement

Common Performance Bottlenecks

Understanding typical bottleneck patterns helps focus analysis efforts:

  • CPU saturation: All CPU cores fully utilized, tasks waiting for CPU time
  • Memory pressure: Insufficient RAM, excessive swapping
  • Disk I/O bottleneck: Storage subsystem cannot keep up with demand
  • Network saturation: Network bandwidth exhausted or high latency
  • Lock contention: Threads waiting for locks, reducing parallelism
  • Context switching overhead: Excessive thread switching degrading performance
  • Cache misses: Poor CPU cache utilization reducing efficiency
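
As a quick first pass, each of the bottleneck classes above maps to a first-look command; the tools named here (perf and the BCC eBPF tools) are all covered in detail later in this guide:

## CPU saturation: which functions are hottest right now?
sudo perf top

## Memory pressure: page cache hit ratio
sudo cachestat-bpfcc 1

## Disk I/O: latency distribution over 5 seconds
sudo biolatency-bpfcc 5 1

## Scheduler pressure: run queue latency
sudo runqlat-bpfcc 5 1

## Network: TCP retransmissions
sudo tcpretrans-bpfcc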

Introduction to perf: The Performance Analysis Framework

perf is a powerful performance analysis tool built into the Linux kernel. It provides hardware and software event sampling, tracing, and profiling capabilities.

Installing perf

Installation varies by distribution:

## Debian/Ubuntu
sudo apt update
sudo apt install linux-tools-common linux-tools-generic linux-tools-$(uname -r)

## RHEL/CentOS/Fedora
sudo dnf install perf

## Arch Linux
sudo pacman -S perf

Verify installation:

perf --version

Basic perf Usage

System-wide CPU profiling:

## Record system-wide for 10 seconds
sudo perf record -a -g sleep 10

## View the recorded data
sudo perf report

## Record specific process
sudo perf record -p <PID> -g sleep 10

The -g flag enables call graph recording, providing stack traces that show function call relationships.
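
If the profiled binary was built without frame pointers, the default call graphs can come out truncated; perf can instead unwind stacks from DWARF debug information (larger perf.data files, but more complete stacks):

## DWARF-based stack unwinding instead of frame pointers
sudo perf record --call-graph dwarf -p <PID> -- sleep 10
sudo perf report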

Real-time performance monitoring:

## Top-like interface showing hottest functions
sudo perf top

## Monitor specific CPU
sudo perf top -C 0

## Monitor specific process
sudo perf top -p <PID>

Performance Counter Statistics

perf stat provides high-level performance counter statistics:

## Measure command execution
perf stat ./my-application

## Detailed counter statistics
perf stat -d ./my-application

## Custom event selection
perf stat -e cycles,instructions,cache-references,cache-misses ./my-application

## System-wide statistics for duration
sudo perf stat -a sleep 10

Example output interpretation:

Performance counter stats for './my-application':

         1,234.56 msec task-clock                #    0.995 CPUs utilized
               123      context-switches          #    0.100 K/sec
                12      cpu-migrations            #    0.010 K/sec
             1,234      page-faults               #    1.000 K/sec
     4,567,890,123      cycles                    #    3.700 GHz
     6,789,012,345      instructions              #    1.49  insn per cycle
     1,234,567,890      branches                  #  1000.000 M/sec
        12,345,678      branch-misses             #    1.00% of all branches

Key metrics to understand:

  • Cycles: CPU clock cycles consumed
  • Instructions: Number of instructions executed
  • IPC (Instructions Per Cycle): Efficiency metric; higher is better. Modern cores typically sustain 1-4, and values well below 1 usually indicate stalls (often memory-bound code)
  • Cache references/misses: Memory access patterns
  • Branch misses: Branch prediction failures
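
To feed these counters into scripts or dashboards, perf stat can emit machine-readable output with -x; a minimal sketch that derives the cache miss ratio with awk (the file name and the awk post-processing are illustrative, not part of perf):

## CSV-style output (perf stat writes counters to stderr)
perf stat -x, -e cache-references,cache-misses ./my-application 2> counters.csv

## Derive the miss ratio from the two counter lines
awk -F, '/cache-references/ {refs=$1} /cache-misses/ {miss=$1} END {printf "miss ratio: %.2f%%\n", 100*miss/refs}' counters.csv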

CPU Flame Graphs with perf

Flame graphs visualize stack traces, making it easy to identify hot code paths:

## Record with call stacks
sudo perf record -F 99 -a -g -- sleep 30

Clone the FlameGraph tools (the commands below assume the clone sits in the current directory):

git clone https://github.com/brendangregg/FlameGraph

## Generate flame graph using the FlameGraph scripts
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flamegraph.svg

Interpret flame graphs:

  • Width: Represents time spent in function (wider = more time)
  • Height: Call stack depth (top of stack is deepest)
  • Color: Typically random, sometimes indicates library/module
  • Plateaus: Functions consuming significant CPU time
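
To compare two recordings, for example before and after an optimization, the FlameGraph repository also includes difffolded.pl for differential flame graphs; the perf.data file names below are illustrative:

## Collapse both recordings, then render the difference
sudo perf script -i perf.data.before | ./FlameGraph/stackcollapse-perf.pl > before.folded
sudo perf script -i perf.data.after | ./FlameGraph/stackcollapse-perf.pl > after.folded
./FlameGraph/difffolded.pl before.folded after.folded | ./FlameGraph/flamegraph.pl > diff.svg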

Hardware Event Monitoring

Monitor hardware-level events:

## List available hardware events
perf list hardware

## Monitor cache events
sudo perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses ./my-app

## Monitor branch prediction
sudo perf stat -e branches,branch-misses ./my-app

## Memory access patterns
sudo perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./my-app

Tracepoint Analysis

perf can trace kernel and userspace tracepoints:

## List available tracepoints
perf list tracepoint

## Trace system calls
sudo perf trace -p <PID>

## Record specific tracepoints
sudo perf record -e 'syscalls:sys_enter_*' -a sleep 5
sudo perf script

## Trace scheduling events
sudo perf record -e 'sched:*' -a sleep 5

eBPF: Dynamic Kernel Instrumentation

eBPF (Extended Berkeley Packet Filter) allows running sandboxed programs in the kernel without modifying kernel source code or loading kernel modules. It’s revolutionized Linux observability and performance analysis.

Understanding eBPF Architecture

eBPF programs are:

  • Written in restricted C
  • Compiled to eBPF bytecode
  • Verified by kernel for safety
  • JIT-compiled to native code
  • Attached to kernel events (kprobes, uprobes, tracepoints, etc.)
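
To see which attach points exist on a given kernel, bpftrace (introduced later in this guide) can list them by pattern:

## List matching tracepoints and kprobes
sudo bpftrace -l 'tracepoint:syscalls:sys_enter_*'
sudo bpftrace -l 'kprobe:tcp_*'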

Installing BCC Tools

BCC (BPF Compiler Collection) provides high-level tools for eBPF:

## Debian/Ubuntu
sudo apt update
sudo apt install bpfcc-tools linux-headers-$(uname -r)

## RHEL/CentOS/Fedora
sudo dnf install bcc-tools kernel-devel

## Arch Linux
sudo pacman -S bcc bcc-tools

Verify installation:

ls /usr/share/bcc/tools/     # Fedora/RHEL and source builds
ls /usr/sbin/*-bpfcc         # Debian/Ubuntu (tools carry a -bpfcc suffix)

Essential BCC Tools

execsnoop: Trace new process execution:

sudo execsnoop-bpfcc

Shows every new process, useful for understanding system activity and detecting anomalies.

opensnoop: Trace file opens:

sudo opensnoop-bpfcc
sudo opensnoop-bpfcc -p <PID>
sudo opensnoop-bpfcc -n nginx

Identifies which files processes are accessing.

tcpconnect/tcpaccept: Trace TCP connections:

## Outbound connections
sudo tcpconnect-bpfcc

## Inbound connections
sudo tcpaccept-bpfcc

Essential for understanding network activity and debugging connectivity issues.

ext4slower: Trace slow ext4 filesystem operations:

## Show operations slower than 10ms
sudo ext4slower-bpfcc 10

Identifies I/O bottlenecks in filesystem operations.

biolatency: Block I/O latency histogram:

sudo biolatency-bpfcc

## Sample for 10 seconds with histogram
sudo biolatency-bpfcc 10 1

Provides distribution of I/O latencies, revealing storage performance characteristics.

cachestat: Page cache statistics:

sudo cachestat-bpfcc 1

Shows page cache hit ratio, indicating memory caching effectiveness.

funccount: Count function calls:

## Count kernel function calls
sudo funccount-bpfcc 'vfs_*'

## Count user-space function calls
sudo funccount-bpfcc 'c:malloc'

Useful for understanding function call frequency and hot paths.

trace: Trace arbitrary kernel and user functions:

## Trace kernel function with arguments
sudo trace-bpfcc 'do_sys_open "%s", arg2'

## Trace user function in library
sudo trace-bpfcc 'c:malloc "size = %d", arg1'

## Conditional tracing
sudo trace-bpfcc 'do_sys_open (arg3 & 0x40) "O_CREAT flag used"'

profile: CPU profiler using sampling:

## System-wide CPU profiling
sudo profile-bpfcc -F 49 -f 30

## Profile specific process
sudo profile-bpfcc -p <PID> 30

Creates frequency counts of stack traces, similar to perf but using eBPF.
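
The folded output from -f plugs directly into the FlameGraph scripts, producing a CPU flame graph without an intermediate perf.data file; a sketch assuming the FlameGraph clone from earlier sits in the current directory (output file names are illustrative):

## eBPF-based CPU flame graph
sudo profile-bpfcc -F 49 -f 30 > profile.folded
./FlameGraph/flamegraph.pl profile.folded > profile-flame.svg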

Advanced BCC Tools

offcputime: Analyze off-CPU time (blocked tasks):

sudo offcputime-bpfcc 30

Shows time spent blocked (I/O wait, lock contention, etc.) rather than executing.
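
The same flame graph approach applies here: with -f (folded output), blocked stacks can be rendered as an off-CPU flame graph, where width represents blocked time rather than CPU time (file names are illustrative):

## Folded off-CPU stacks for one process over 30 seconds
sudo offcputime-bpfcc -f -p <PID> 30 > offcpu.folded
./FlameGraph/flamegraph.pl --color=io --countname=us offcpu.folded > offcpu.svg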

wakeuptime: Analyze thread wake-up sources:

sudo wakeuptime-bpfcc 30

Identifies what’s waking up threads, useful for investigating scheduling overhead.

llcstat: LLC (Last Level Cache) statistics:

sudo llcstat-bpfcc

Monitors LLC cache hit/miss rates per process.

tcpretrans: Trace TCP retransmissions:

sudo tcpretrans-bpfcc

Network performance issues often manifest as retransmissions.

runqlat: Run queue latency histogram:

sudo runqlat-bpfcc 10 1

Shows time tasks spend waiting in the CPU run queue before being scheduled.

Writing Custom eBPF Programs with BCC

Simple example counting clone() system calls per process:

#!/usr/bin/env python3
from time import sleep
from bcc import BPF

## eBPF program
prog = """
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32);

int count_syscalls(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *count, zero = 0;
    
    count = counts.lookup_or_try_init(&pid, &zero);
    if (count) {
        (*count)++;
    }
    return 0;
}
"""

## Load and attach (get_syscall_fnname resolves the arch-specific syscall symbol)
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="count_syscalls")

print("Tracing clone() calls... Hit Ctrl-C to end")
try:
    sleep(30)
except KeyboardInterrupt:
    pass

## Print results
print("\nSyscall counts by PID:")
for k, v in b["counts"].items():
    print(f"PID {k.value}: {v.value} syscalls")

This demonstrates eBPF’s power to dynamically instrument the kernel with custom logic.

bpftrace: High-Level eBPF Scripting

bpftrace provides a high-level language for eBPF:

## Install bpftrace
sudo apt install bpftrace  # Debian/Ubuntu
sudo dnf install bpftrace  # Fedora

## One-liners
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }'
sudo bpftrace -e 'kprobe:vfs_read { @bytes = hist(arg2); }'

## Trace TCP connections
sudo bpftrace -e 'kprobe:tcp_connect { printf("%s connecting\n", comm); }'

## Profile user stacks
sudo bpftrace -e 'profile:hz:99 /pid == 1234/ { @[ustack] = count(); }'
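
bpftrace also supports BEGIN/END blocks and interval probes for periodic output; a small sketch printing the system-wide syscall rate once per second:

## Syscalls per second, printed every second
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @syscalls = count(); } interval:s:1 { print(@syscalls); clear(@syscalls); }'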

bpftrace scripts can be saved to files for reuse:

## tcp-accept-lat.bt
#!/usr/bin/env bpftrace

kprobe:inet_csk_accept {
    @start[tid] = nsecs;
}

kretprobe:inet_csk_accept /@start[tid]/ {
    $dur = nsecs - @start[tid];
    @accept_lat_us = hist($dur / 1000);
    delete(@start[tid]);
}

END {
    clear(@start);
}

Run with:

sudo bpftrace tcp-accept-lat.bt

ftrace: Function Tracer Built Into the Kernel

ftrace is a tracing framework built directly into the Linux kernel, providing powerful tracing capabilities without requiring additional userspace tools.

Enabling and Using ftrace

ftrace is controlled through files in /sys/kernel/debug/tracing/ (requires debugfs mounted); on recent kernels the same interface is also exposed via tracefs at /sys/kernel/tracing:

## Ensure debugfs is mounted (or use tracefs at /sys/kernel/tracing on newer kernels)
sudo mount -t debugfs none /sys/kernel/debug

## Change to the tracing directory (the echo commands below need a root shell, e.g. sudo -i)
cd /sys/kernel/debug/tracing

Basic ftrace Usage

List available tracers:

cat available_tracers

Common tracers include:

  • function: Traces all kernel functions
  • function_graph: Shows call graph with entry/exit
  • nop: No tracing (default)
  • irqsoff: Traces interrupt-disabled sections
  • preemptoff: Traces preemption-disabled sections

Enable function tracing:

## Set current tracer
echo function > current_tracer

## Start tracing
echo 1 > tracing_on

## Run workload
sleep 1

## Stop tracing
echo 0 > tracing_on

## View trace
cat trace | head -50

## Clear trace buffer
echo > trace
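
The trace file is a snapshot of the ring buffer; to consume events as a continuous stream instead (the buffer is drained as it is read), use trace_pipe:

## Stream trace events live (Ctrl-C to stop)
cat trace_pipe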

Function Graph Tracer

Shows function call hierarchy:

echo function_graph > current_tracer
echo 1 > tracing_on
sleep 1
echo 0 > tracing_on
cat trace | head -100

Output shows entry/exit of functions with duration:

 1)   0.123 us    |  mutex_unlock();
 1)   1.456 us    |  _cond_resched();
 1)               |  __alloc_pages_nodemask() {
 1)   0.234 us    |    get_page_from_freelist();
 1)   2.345 us    |  }

Filtering Functions

Set function filter:

## Trace only specific functions
echo do_sys_open > set_ftrace_filter
echo vfs_read >> set_ftrace_filter

## Use wildcards
echo 'tcp_*' > set_ftrace_filter

## Exclude functions
echo '!kfree' >> set_ftrace_filter

## Clear filter
echo > set_ftrace_filter
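
The function_graph tracer has a separate filter, set_graph_function, which limits graphing to a chosen entry function and everything it calls; a sketch using do_sys_open as the entry point:

## Graph only do_sys_open and its callees
echo do_sys_open > set_graph_function
echo function_graph > current_tracer
echo 1 > tracing_on
sleep 1
echo 0 > tracing_on
cat trace | head -50

## Clear the graph filter
echo > set_graph_function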

Event Tracing

ftrace provides access to kernel tracepoints:

## List available events
cat available_events

## Enable specific event
echo 1 > events/sched/sched_switch/enable

## Enable all events in subsystem
echo 1 > events/syscalls/enable

## View events
cat trace

## Disable events
echo 0 > events/enable
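
On kernels built with CONFIG_HIST_TRIGGERS, events can also be aggregated in-kernel through histogram triggers, avoiding the need to post-process a large raw trace; a sketch counting context switches per outgoing PID:

## Aggregate sched_switch events by prev_pid
echo 'hist:keys=prev_pid' > events/sched/sched_switch/trigger

## Read the in-kernel histogram
cat events/sched/sched_switch/hist

## Remove the trigger
echo '!hist:keys=prev_pid' > events/sched/sched_switch/trigger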

Function Profiling

Profile function execution time:

## Enable profiling
echo 1 > function_profile_enabled

## Run workload
sleep 5

## View results
cat trace_stat/function*

## Disable profiling
echo 0 > function_profile_enabled

Filtering with ftrace

PID filtering:

## Trace only specific PID
echo 1234 > set_ftrace_pid

## Trace multiple PIDs
echo 1234 > set_ftrace_pid
echo 5678 >> set_ftrace_pid

Event filtering:

## Filter events by condition
echo 'common_pid == 1234' > events/sched/sched_switch/filter
echo 'count > 1024' > events/syscalls/sys_enter_write/filter

trace-cmd: Front-end for ftrace

trace-cmd provides easier ftrace usage:

## Install trace-cmd
sudo apt install trace-cmd

## Record function trace
sudo trace-cmd record -p function -l do_sys_open sleep 1

## View recording
sudo trace-cmd report

## Record with function graph
sudo trace-cmd record -p function_graph sleep 1

## Record specific events
sudo trace-cmd record -e sched -e syscalls sleep 5
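
trace-cmd can also scope a recording to a single process, either an existing PID with -P or a freshly launched command with -F (the ./my-application binary is a placeholder):

## Trace scheduling events for an existing PID only
sudo trace-cmd record -e sched -P <PID> sleep 5

## Launch a command and trace only that process
sudo trace-cmd record -p function_graph -F ./my-application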

KernelShark: GUI for ftrace

KernelShark visualizes trace data:

## Install KernelShark
sudo apt install kernelshark

## Record trace
sudo trace-cmd record -e all sleep 5

## Open in GUI
kernelshark trace.dat

Practical Performance Analysis Scenarios

Scenario 1: High CPU Usage

Problem: Application consuming excessive CPU.

Analysis approach:

## 1. Identify hot functions with perf
sudo perf record -F 99 -p <PID> -g -- sleep 30
sudo perf report --stdio

## 2. Check CPU cache efficiency
sudo perf stat -e cache-references,cache-misses,instructions,cycles -p <PID> sleep 10

## 3. Profile with eBPF
sudo profile-bpfcc -p <PID> 30

## 4. Examine call frequency of a suspect function (here: libc malloc)
sudo funccount-bpfcc -p <PID> -d 10 'c:malloc'

Scenario 2: Slow I/O Performance

Problem: Application experiencing slow disk I/O.

Analysis approach:

## 1. Check I/O latency distribution
sudo biolatency-bpfcc 5 1

## 2. Identify slow operations
sudo ext4slower-bpfcc 10

## 3. Track file opens
sudo opensnoop-bpfcc -p <PID>

## 4. Monitor I/O patterns with ftrace
sudo trace-cmd record -e block sleep 10
sudo trace-cmd report

Scenario 3: Network Performance Issues

Problem: Network throughput lower than expected.

Analysis approach:

## 1. Monitor TCP connections
sudo tcpconnect-bpfcc
sudo tcpaccept-bpfcc

## 2. Check for retransmissions
sudo tcpretrans-bpfcc

## 3. Trace network syscalls with perf
sudo perf trace -e 'syscalls:sys_enter_send*,syscalls:sys_enter_recv*' -p <PID>

## 4. Profile the application's on-CPU time (user and kernel stacks)
sudo profile-bpfcc -p <PID> 30

Scenario 4: Lock Contention

Problem: Application experiencing lock contention.

Analysis approach:

## 1. Analyze off-CPU time
sudo offcputime-bpfcc -p <PID> 30

## 2. Check futex operations with perf
sudo perf trace -e 'syscalls:sys_enter_futex' -p <PID>

## 3. Function-level contention analysis
sudo funclatency-bpfcc -p <PID> c:pthread_mutex_lock

## 4. Stack traces of blocked threads
sudo bpftrace -e 'kprobe:finish_task_switch /pid == <PID>/ { @[kstack, ustack] = count(); }'

Performance Tuning Best Practices

Measurement and Baselines

  1. Establish baselines: Measure normal performance before issues occur
  2. Consistent methodology: Use same tools and metrics for comparison
  3. Document findings: Record observations and analysis
  4. Reproduce issues: Verify problems are consistent before optimization

Optimization Strategy

  1. Measure first: Never optimize without measurement
  2. Focus on bottlenecks: Optimize the slowest component first
  3. Change one thing: Isolate impact of individual optimizations
  4. Verify improvements: Measure after each change
  5. Consider trade-offs: Balance performance against complexity and maintainability

Tool Selection Guidelines

Use perf when:

  • CPU profiling and optimization
  • Hardware counter analysis needed
  • System-wide performance analysis
  • Detailed call graphs required

Use eBPF/BCC when:

  • Dynamic tracing without system restart
  • Minimal overhead required
  • Custom metrics needed
  • Production system analysis with safety guarantees

Use ftrace when:

  • Kernel function-level tracing needed
  • No external tools available
  • Understanding kernel code paths
  • Low-level kernel debugging

Conclusion

Linux provides an exceptionally powerful suite of performance analysis tools. perf offers comprehensive CPU profiling and hardware event monitoring. eBPF enables safe, dynamic kernel instrumentation with minimal overhead. ftrace provides built-in kernel function tracing capabilities.

Mastering these tools requires practice and understanding of Linux internals, but the investment pays dividends in production troubleshooting and optimization. Modern performance analysis is no longer about guessing—these tools provide concrete data about system behavior, enabling evidence-based optimization decisions.

The key to effective performance analysis is methodical investigation: form hypotheses, gather data, test theories, and verify results. With perf, eBPF, and ftrace in your toolkit, you have the instrumentation needed to understand and optimize complex system behavior at every level from hardware to application.


Thank you for reading! If you have any feedback or comments, please send them to [email protected].