🔗Performance checklists for SREs

🔗Linux Perf Analysis in 60s

  • uptime: load averages
  • dmesg -T | tail: kernel errors
  • vmstat 1: overall stats by time
  • mpstat -P ALL 1: CPU balance
  • pidstat 1: process usage
  • iostat -xz 1: disk I/O
  • free -m: memory usage
  • sar -n DEV 1: network I/O
  • sar -n TCP,ETCP 1: TCP stats
  • top: check overview
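
The ten commands above can be wrapped in a small script so they run in order with a bounded number of samples. This is only a sketch: `COUNT`, `run_check`, and `perf60` are hypothetical names chosen for the example, and `top -b -n 1` (batch mode) stands in for interactive `top` so the run can finish on its own.

```shell
#!/bin/sh
# Sketch: run the ten checklist commands once each, skipping missing tools.
# COUNT, run_check, and perf60 are hypothetical names for this example.
COUNT="${COUNT:-5}"   # samples per interval-based tool (assumed default)

run_check() {
    # $1 = what to look for; remaining arguments = the command to run.
    desc=$1
    shift
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "== SKIP: $1 not installed ($desc)"
        return 0
    fi
    echo "== $* ($desc)"
    "$@" 2>&1 || echo "== WARN: $1 exited non-zero"
    return 0
}

perf60() {
    run_check "load averages"          uptime
    run_check "kernel errors"          sh -c 'dmesg -T | tail'
    run_check "overall stats by time"  vmstat 1 "$COUNT"
    run_check "CPU balance"            mpstat -P ALL 1 "$COUNT"
    run_check "process usage"          pidstat 1 "$COUNT"
    run_check "disk I/O"               iostat -xz 1 "$COUNT"
    run_check "memory usage"           free -m
    run_check "network I/O"            sar -n DEV 1 "$COUNT"
    run_check "TCP stats"              sar -n TCP,ETCP 1 "$COUNT"
    run_check "check overview"         top -b -n 1
}

# Only run the whole checklist when asked to, e.g. `sh perf60.sh run`.
if [ "${1:-}" = "run" ]; then
    perf60
fi
```

Saved as `perf60.sh` (an assumed file name), `sh perf60.sh run` prints each command with the thing to look for next to it. Tools from the sysstat package (`mpstat`, `pidstat`, `iostat`, `sar`) are reported as skipped when they are not installed, rather than aborting the run.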

🔗Linux Disk Checklist

  • iostat -xz 1: any disk I/O? if not, stop looking
  • vmstat 1: is this swapping? or, high sys time?
  • df -h: are file systems nearly full?
  • ext4slower 10 (zfs*, xfs*, etc.): slow file system I/O?
  • bioslower 10: if so, check disks
  • ext4dist 1: check distribution and rate
  • biolatency 1: if interesting, check disks
  • cat /sys/devices/…/ioerr_cnt (if available): errors
  • smartctl -l error /dev/sda1 (if available): errors
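
The "are file systems nearly full?" question can be answered mechanically instead of by eye. The sketch below filters `df -P` output (the POSIX single-line format) for mount points at or above a usage threshold. The function name `nearly_full` and the 90% default are assumptions made for the example, not values from the checklist.

```shell
#!/bin/sh
# Sketch: list file systems at or above a usage threshold.
# nearly_full and THRESHOLD are hypothetical names; 90 is an assumed cutoff.

nearly_full() {
    # Reads `df -P` style output on stdin; $1 = threshold in percent.
    # Assumes mount points contain no spaces (field 6 is the mount point).
    awk -v limit="${1:-90}" '
        NR > 1 {
            pct = $5
            sub(/%$/, "", pct)          # "95%" -> "95"
            if (pct + 0 >= limit) print $6, $5
        }'
}

df -P | nearly_full "${THRESHOLD:-90}"
```

No output means nothing crossed the threshold. A lower cutoff can be tried with `THRESHOLD=80 sh script.sh`.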

🔗Linux Network Checklist

  • sar -n DEV,EDEV 1: at interface limits? or use nicstat
  • sar -n TCP,ETCP 1: active/passive load, retransmit rate
  • cat /etc/resolv.conf: it's always DNS
  • mpstat -P ALL 1: high kernel time? single hot CPU?
  • tcpretrans: what are the retransmits? state?
  • tcpconnect: connecting to anything unexpected?
  • tcpaccept: unexpected workload?
  • netstat -rnv: any inefficient routes?
  • check firewall config: anything blocking/throttling?
  • netstat -s: play 252 metric pickup
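
When `sar` is not installed, a rough retransmit figure can still be read from the kernel's TCP counters in `/proc/net/snmp`. The sketch below computes retransmitted segments as a percentage of segments sent. Unlike `sar -n TCP,ETCP 1`, which reports per-second rates, these counters are cumulative since boot, so the result is a long-run average. The function name `tcp_retrans_pct` is hypothetical.

```shell
#!/bin/sh
# Sketch: cumulative TCP retransmit percentage from /proc/net/snmp.
# tcp_retrans_pct is a hypothetical name used for this example.

tcp_retrans_pct() {
    # $1 = file in /proc/net/snmp format (defaults to the real one).
    # The first "Tcp:" line holds column names, the second holds values.
    awk '
        $1 == "Tcp:" && !seen {
            for (i = 2; i <= NF; i++) col[$i] = i
            seen = 1
            next
        }
        $1 == "Tcp:" {
            sent = $(col["OutSegs"])
            retrans = $(col["RetransSegs"])
            if (sent > 0) printf "%.2f\n", 100 * retrans / sent
            else print "0.00"
        }' "${1:-/proc/net/snmp}"
}

if [ -r /proc/net/snmp ]; then
    echo "TCP retransmits since boot: $(tcp_retrans_pct)%"
fi
```

To see what is happening right now rather than since boot, take two readings a few seconds apart and compare them, or use `sar` and `tcpretrans` as listed above.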

🔗Linux CPU Checklist

  • uptime: load averages
  • vmstat 1: system-wide utilization, run-queue length
  • mpstat -P ALL 1: CPU balance
  • pidstat 1: per-process CPU
  • CPU flame graph: CPU profiling
  • CPU subsecond offset heat map: look for gaps
  • perf stat -a -- sleep 10: IPC, LLC hit ratio
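
Load averages are easier to interpret relative to the number of CPUs. As a minimal sketch, the snippet below divides the 1-minute load average by the online CPU count; `load_per_cpu` is a hypothetical helper name. On Linux the load average also counts tasks in uninterruptible sleep (typically waiting on disk I/O), so a high value is a prompt to look at the next items in the checklist, not proof of CPU saturation.

```shell
#!/bin/sh
# Sketch: 1-minute load average divided by the number of online CPUs.
# load_per_cpu is a hypothetical helper name for this example.

load_per_cpu() {
    # $1 = load average, $2 = number of CPUs.
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

if [ -r /proc/loadavg ]; then
    # /proc/loadavg starts with the 1-, 5-, and 15-minute load averages.
    read -r load1 rest < /proc/loadavg
    cpus=$(getconf _NPROCESSORS_ONLN)
    echo "1-min load per CPU: $(load_per_cpu "$load1" "$cpus")"
fi
```

For example, `load_per_cpu 8.00 4` prints `2.00`, meaning roughly two runnable or blocked tasks per CPU on average.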

https://nbari.com/post/observability-tools/

{{< youtube zxCWXNigDpA>}}

Thanks to Brendan Gregg for all this info: http://www.brendangregg.com/ and http://www.brendangregg.com/blog/2016-05-04/srecon2016-perf-checklists-for-sres.html

The Realities of the Job of Delivering Reliability

{{< youtube Lf4RwlOdppg>}}