Testing and diagnosing bare metal servers

Jan 25, 2018

It is 2018, but people who prefer dedicated servers are still around :). I work at a hosting company, and we run diagnostics on every server before handing it to a customer. Here is a look behind the scenes.

We start with an unattended server: either it has just been freed, or it is new equipment from a vendor, usually Intel or Supermicro. An automated system boots the server via PXE with a Linux kernel and starts the diagnostics script. The script checks the health of every component (CPU, RAM, storage, network) and tests the performance of the entire server.

CPU

Check that the CPU does not overheat
Check that the CPU is stable

To stress the CPU we use the mprime-bin program and run it for 30 minutes.

/usr/bin/timeout 30m /opt/mprime -t
/bin/grep -i error /root/results.txt

Every minute we check the CPU temperature from the IPMI sensors; it must stay below 60 °C. You also need to check /proc/kmsg and mprime's results.txt file for more complex CPU errors.
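The per-minute temperature check could look roughly like the sketch below. It assumes that `ipmitool sdr type Temperature` is available and that CPU sensors have "CPU" in their name; sensor names differ between vendors, and the function names are made up for this example.

```shell
# Sketch only: sensor naming ("CPU" in the first column) is an assumption
# and differs between motherboard vendors.
CPU_TEMP_LIMIT=60   # degrees C, the limit mentioned above

# Read `ipmitool sdr type Temperature` output on stdin and print the
# highest CPU reading, e.g. from a line like:
#   CPU1 Temp        | 01h | ok  |  3.1 | 45 degrees C
max_cpu_temp() {
  awk -F'|' '
    $1 ~ /CPU/ && $5 ~ /degrees C/ {
      split($5, f, " ")            # f[1] is the numeric reading
      if (!seen || f[1] + 0 > max) { max = f[1] + 0; seen = 1 }
    }
    END { if (seen) print max; else exit 1 }'
}

# Succeed only if the reading is strictly below the limit.
cpu_temp_ok() {
  [ "$1" -lt "${CPU_TEMP_LIMIT}" ]
}

# Usage on a live system:
#   temp=$(ipmitool sdr type Temperature | max_cpu_temp) && cpu_temp_ok "$temp"
```

Sensors that report "no reading" are skipped, so a server with no usable CPU sensor makes `max_cpu_temp` fail instead of silently passing.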

RAM

We need to check every RAM cell. The classic Memtest86+ tool is not suitable, because it runs on bare metal and its free versions did not provide results. A memory test run from the operating system level, however, does not check all RAM cells. This is a weakness of our approach. We chose the memtester tool. Usage:

memtester "$(awk '/MemFree/ {print $2-1024}' /proc/meminfo)k" 5

The script checks the program's exit status, which should be 0. Zero means that the memory is fine.

We also collect the RAM read speed, but it is not an indicative parameter, so we stopped using it for evaluation.
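A non-zero status can be made more readable in the logs. According to the memtester man page, the exit code is a bitwise OR of three flags; the sketch below decodes it. The function name is hypothetical and not part of our script.

```shell
# Decode memtester's exit status into readable text.
# Per the memtester man page the status is 0 on success, otherwise a
# bitwise OR of: 1 = allocation/locking or invocation error,
# 2 = stuck address test failed, 4 = one of the other tests failed.
explain_memtester_status() {
  status=$1
  if [ "${status}" -eq 0 ]; then
    echo "ok"
    return 0
  fi
  if [ $((status & 1)) -ne 0 ]; then
    echo "could not allocate or lock memory (or bad invocation)"
  fi
  if [ $((status & 2)) -ne 0 ]; then
    echo "stuck address test failed"
  fi
  if [ $((status & 4)) -ne 0 ]; then
    echo "one of the other tests failed"
  fi
  return 1
}

# Usage on a live system:
#   memtester 1024k 5; explain_memtester_status $?
```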

Storage

We look up the installed disks with bash (legacy, but it still works):

Find any devices matching /dev/sd? and /dev/cciss/c0d?, and check whether each one is really a disk.

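hdlist() {
  # Candidate devices: SCSI/SATA disks and HP cciss logical drives
  HDLIST=$(ls /dev/sd? 2>/dev/null)
  HDLIST="${HDLIST} $(ls /dev/cciss/c0d? 2>/dev/null)"
  REAL_HDLIST=""
  for disk in ${HDLIST}; do
    # Keep only devices that can actually be opened for reading
    if head -c0 "${disk}" 2>/dev/null; then
      REAL_HDLIST="${REAL_HDLIST} ${disk}"
    fi
  done
  echo "${REAL_HDLIST}"
}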
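On newer systems the same list could probably be built from `lsblk` (util-linux) instead of globbing /dev. This is an alternative sketch, not the script we run, and `disks_from_lsblk` is a made-up name.

```shell
# Alternative sketch (not the production script): list whole disks via lsblk.
# Reads `lsblk -dn -o NAME,TYPE` output on stdin and prints /dev paths
# for rows whose TYPE is "disk", skipping loop devices, CD-ROMs, etc.
disks_from_lsblk() {
  awk '$2 == "disk" { print "/dev/" $1 }'
}

# Usage on a live system:
#   lsblk -dn -o NAME,TYPE | disks_from_lsblk
```

Unlike the `head -c0` check, this does not verify that each device can actually be opened.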

HDD

We fully wipe the HDD left by the previous customer:

# Quick wipe: new partition table, then zero the first sector
for DISK in $(hdlist)
do
  echo "Clearing ${DISK}"
  parted -s ${DISK} mklabel gpt
  dd if=/dev/zero of=${DISK} bs=512 count=1
done
# Optional full wipe: overwrite the whole disk with zeros
if [ "${FULL_HDD_CLEAR}" = "YES" ]; then
  echo "Clearing disks full (very slow)"
  wget -O /dev/null -q --no-check-certificate "${STATEURL}&info=slowhddclear"
  for DISK in $(hdlist)
  do
    echo "Clearing ${DISK}"
    dd if=/dev/zero of=${DISK} bs=1M
  done
fi

Check SMART values

Reallocated Sectors Count must be less than 100
Power_On_Hours must be less than 43200; a 5-year-old HDD goes to the trash
Current_Pending_Sector must equal 0
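These three limits can be applied to the output of `smartctl -A` with a few lines of awk. The sketch below assumes the classic ATA attribute table, where column 2 is the attribute name and column 10 is RAW_VALUE; the function name is hypothetical.

```shell
# Sketch: apply the three HDD limits above to `smartctl -A` output on stdin.
# Assumes the classic ATA attribute table (column 2 = attribute name,
# column 10 = RAW_VALUE). Prints each violation; fails if there is any.
check_hdd_smart() {
  awk '
    $2 == "Reallocated_Sector_Ct"  && $10 + 0 >= 100   { print "too many reallocated sectors: " $10; bad = 1 }
    $2 == "Power_On_Hours"         && $10 + 0 >= 43200 { print "disk too old, hours: " $10; bad = 1 }
    $2 == "Current_Pending_Sector" && $10 + 0 != 0     { print "pending sectors: " $10; bad = 1 }
    END { exit bad }'
}

# Usage on a live system:
#   smartctl -A /dev/sda | check_hdd_smart || echo "disk rejected"
```

Adding 0 to the raw value forces a numeric comparison, which also copes with drives that report hours in a form like "4321h+10m".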

Check disk speed

The speed is measured at 3 different places on the disk: at an offset of 4 GB from the start, in the middle, and 4 GB from the end. For each offset we run this code:

sysctl -w vm.drop_caches=3 > /dev/null
zcav -c 1 -s ${SKIP_COUNT} -r ${OFFSET} -l /tmp/zcav1.log -f ${DISK}
if [ $? -ne 0 ]; then
  echo err
  exit
fi
SPEED=$(awk '! /^#/ {speed+=$2; count+=1} END {print int(speed/count)}' /tmp/zcav1.log)

The speed depends on the HDD model, but it is usually at least 100 MB/s.
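The three measurement points can be derived from the disk size. The sketch below is only an illustration: the function name is made up, MiB is an assumed unit, and how the offsets are then passed to zcav is left out.

```shell
# Sketch: print the three test offsets (in MiB) for a disk of the given
# size in MiB. The 4 GiB margin comes from the text; using MiB as the
# unit is an assumption of this example.
speed_test_offsets() {
  size_mb=$1
  margin_mb=4096
  echo "${margin_mb} $((size_mb / 2)) $((size_mb - margin_mb))"
}

# Usage on a live system (blockdev prints the size in bytes):
#   speed_test_offsets $(( $(blockdev --getsize64 /dev/sda) / 1048576 ))
```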

SSD

Check SMART values

MediaWearoutIndicator is the percentage of disk lifetime remaining. For a new disk it is 100. All values > 10 are allowed.
ReallocatedSector: allowed values are < 100
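For the wear indicator, the number to compare is the normalized VALUE column of `smartctl -A`, not the raw value. A minimal sketch, assuming an Intel-style attribute named Media_Wearout_Indicator (other vendors report wear under different names) and a hypothetical function name:

```shell
# Sketch: check SSD wear from `smartctl -A` output on stdin.
# Assumes an attribute named Media_Wearout_Indicator whose normalized
# VALUE (column 4) starts at 100 on a new drive.
# Prints the value; fails if it is 10 or lower, or if the attribute is missing.
check_ssd_wear() {
  awk '
    $2 == "Media_Wearout_Indicator" {
      found = 1
      if ($4 + 0 <= 10) bad = 1
      print $4 + 0
    }
    END { exit (!found || bad) }'
}

# Usage on a live system:
#   smartctl -A /dev/sda | check_ssd_wear || echo "SSD rejected"
```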

RAID status

Identify the RAID controller model and check the RAID status. It must be optimal.

detect_raid_type() {
  RAIDSTR=$(lspci | grep -i raid)
  if echo "${RAIDSTR}" | grep -iq adaptec; then
    # This is Adaptec
    echo "adaptec"
  elif echo "${RAIDSTR}" | grep -iqE 'lsi|megaraid'; then
    # This is LSI
    echo "lsi"
  elif echo "${RAIDSTR}" | grep -iq '3ware'; then
    # This is 3ware
    echo "3ware"
  elif echo "${RAIDSTR}" | grep -iqE 'Hewlett-Packard.*Smart'; then
    # This is HP Smart Array
    echo "HP-SmartArray"
  elif dmesg | grep -q cciss/ ; then
    echo cciss
  else
    echo "unknown"
  fi
}

raid_status_lsi() {
  RSTATUS=$(megacli -LDInfo -Lall -aALL |awk -F: '$1 ~ /State/ {print $2}')
  if ! echo "${RSTATUS}" | grep -q 'Optimal' ;then
    echo "${RSTATUS}"
    return 1
  fi
}

raid_status_unknown() {
  echo "Unknown RAID"
  return 0
}

raid_status_cciss() {
  RSTATUS=$(cciss_vol_status /dev/cciss/c*d0)
  if ! echo ${RSTATUS} | grep -q "OK" ; then
    echo "${RSTATUS}"
    return 1
  fi
}
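The function names suggest a simple dispatch by controller type. One way to wire detection and status checks together is sketched below; `check_raid` is a hypothetical name, and the two `raid_status_*` functions in the block are stand-ins so that the example runs on its own.

```shell
# Sketch of a dispatcher: call raid_status_<type> when such a function is
# defined, otherwise fall back to raid_status_unknown.
# The two functions below are stand-ins for the real checks.
raid_status_lsi()     { echo "lsi: Optimal"; }
raid_status_unknown() { echo "Unknown RAID"; return 0; }

check_raid() {
  handler="raid_status_$1"
  # command -v also finds shell functions
  if ! command -v "${handler}" >/dev/null 2>&1; then
    handler="raid_status_unknown"
  fi
  "${handler}"
}

# Usage on a live system:
#   check_raid "$(detect_raid_type)" || echo "RAID is not healthy"
```

With this layout, supporting a new controller only requires adding one more `raid_status_<type>` function.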

Network

Check network download speed: it must be > 300 Mbit/s

curl -k --progress-bar -w "%{speed_download}" -o /dev/null "${CGI_MGR_URLv4}/speedtest_cgi?id=${AUTH_ID}&func=server.speedtest"
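curl reports `%{speed_download}` in bytes per second, so the value has to be converted before it is compared with the limit. A minimal sketch of that comparison, assuming 1 Mbit/s = 1,000,000 bit/s and a hypothetical function name:

```shell
# Sketch: decide whether a curl %{speed_download} value passes the limit.
# curl reports bytes per second; 1 Mbit/s is taken as 1,000,000 bit/s.
MIN_MBIT=300

# Prints the speed in Mbit/s and succeeds only if it is above the limit.
speed_ok() {
  awk -v bps="$1" -v limit="${MIN_MBIT}" 'BEGIN {
    mbit = bps * 8 / 1000000
    printf "%.0f\n", mbit
    exit !(mbit > limit)
  }'
}

# Usage: pass the number printed by the curl command above, e.g.
#   speed_ok 40000000 || echo "network too slow"
```

awk is used for the arithmetic because some curl versions print the speed with a decimal part, which shell integer arithmetic cannot handle.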

Statistics

On average, the script checks 323 servers per month, and 124 of them (about 38%) are marked as broken. We do not give servers with problems to customers; they go into a pool of broken servers, where datacenter engineers replace disks and repair fans. When we found a CPU or RAM problem, we always managed to get the part replaced under warranty.