Testing and diagnosing bare metal servers

Artem Artemev
Jan 25, 2018

It is 2018, but people who prefer dedicated servers are still around :). I work at a hosting company, and we run diagnostics on every server before handing it to a customer. Here is a look behind the scenes.

The starting point is an unattended server: either it has just been freed up or it is new equipment from a vendor, usually Intel or Supermicro. An automated system boots the server over PXE with a Linux kernel and starts the diagnostics script. The script checks the health of every component (CPU, RAM, storage, network) and tests the performance of the whole server.

CPU

  • Check CPU overheating status
  • Check that the CPU is stable

To stress the CPU we run the mprime-bin program for 30 minutes.

/usr/bin/timeout 30m /opt/mprime -t
/bin/grep -i error /root/results.txt

Every minute the script reads the CPU temperature from the IPMI sensors. The allowed CPU temperature is below 60 °C. You also need to check /proc/kmsg and the mprime results.txt file for more complex CPU errors.
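As an illustration of the temperature check, here is a minimal sketch, not our production code. It assumes the readings come from `ipmitool sdr type Temperature` and that CPU sensors have "cpu" somewhere in their name, which differs between vendors; the function names `max_cpu_temp` and `cpu_temp_ok` are made up for this example.

```shell
# Print the highest CPU temperature found in `ipmitool sdr type Temperature`
# output read from stdin. Matching sensor names on "cpu" is an assumption;
# sensors without a numeric reading are skipped.
max_cpu_temp() {
    awk -F'|' '
        tolower($1) ~ /cpu/ && $5 ~ /degrees C/ {
            t = $5 + 0
            if (t > max) max = t
        }
        END { print max + 0 }
    '
}

# Succeed when the temperature ($1) is below the limit ($2, default 60).
cpu_temp_ok() {
    [ "$1" -lt "${2:-60}" ]
}

# Hypothetical polling loop, once per minute for 30 minutes:
# i=0
# while [ "$i" -lt 30 ]; do
#     t=$(ipmitool sdr type Temperature | max_cpu_temp)
#     cpu_temp_ok "$t" || echo "CPU overheated: ${t}C"
#     sleep 60
#     i=$((i + 1))
# done
```

Keeping the parsing in a function that reads stdin makes it easy to try on saved sensor output before pointing it at real hardware.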

RAM

We should check every RAM cell. The classic Memtest86+ tool is not suitable, because it runs on bare metal and the free versions do not report a result. A memory test run from the operating system level, however, cannot check all RAM cells; this is a weakness of our approach. We chose the memtester tool. Usage:

memtester "$(awk '/MemFree/ {print $2-1024}' /proc/meminfo)k" 5

The script checks the program's exit status, which should be 0. Zero means that the memory is fine.
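A non-zero status can be broken down further. The sketch below decodes it using the bit values documented in the memtester man page (1: allocation, locking or invocation error; 2: stuck address test; 4: one of the other tests). The function name `memtester_verdict` and the wording of the messages are only for illustration.

```shell
# Translate a memtester exit status ($1) into a readable verdict.
# Prints "ok" and succeeds for 0, otherwise lists the failed areas.
memtester_verdict() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
        return 0
    fi
    msg=""
    if [ $((status & 1)) -ne 0 ]; then msg="${msg}alloc-or-invocation-error "; fi
    if [ $((status & 2)) -ne 0 ]; then msg="${msg}stuck-address-failure "; fi
    if [ $((status & 4)) -ne 0 ]; then msg="${msg}pattern-test-failure "; fi
    if [ -z "$msg" ]; then msg="unknown-status-${status} "; fi
    echo "${msg% }"
    return 1
}

# Hypothetical usage after a run:
# memtester 1024k 1 >/dev/null; memtester_verdict $?
```

Status 1 is worth separating from 2 and 4: it usually means the test could not run as requested, not that the memory is bad.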

We also collect the RAM read speed, but it is not an indicative parameter, so we stopped using it for evaluation.

Storage

Look up the installed disks with bash (legacy, but it still works):

Find all devices matching /dev/sd? and /dev/cciss/c0d?, and check whether each one is really a disk.

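hdlist() {
    # Candidate devices: SCSI/SATA disks and legacy HP cciss volumes
    HDLIST=$(ls /dev/sd? 2>/dev/null)
    HDLIST="${HDLIST} $(ls /dev/cciss/c0d? 2>/dev/null)"
    REAL_HDLIST=""
    for disk in ${HDLIST}; do
        # Keep only devices that can actually be opened for reading
        if head -c0 "${disk}" 2>/dev/null; then
            REAL_HDLIST="${REAL_HDLIST} ${disk}"
        fi
    done
    echo "${REAL_HDLIST}"
}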

HDD

We wipe the disks left by the previous customer:

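# Quick wipe: write a new partition table, then zero the first sector
for DISK in $(hdlist)
do
    echo "Clearing ${DISK}"
    parted -s "${DISK}" mklabel gpt
    dd if=/dev/zero of="${DISK}" bs=512 count=1
done
# Optional full wipe. ($FULL_HDD_CLEAR) is presumably substituted by the
# provisioning panel before the script runs; ${STATEURL} is set elsewhere.
if [ "($FULL_HDD_CLEAR)" = "YES" ]; then
    echo "Clearing disks full (very slow)"
    wget -O /dev/null -q --no-check-certificate "${STATEURL}&info=slowhddclear"
    for DISK in $(hdlist)
    do
        echo "Clearing ${DISK}"
        dd if=/dev/zero of="${DISK}" bs=1M
    done
fi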

Check SMART values

  • Reallocated Sectors Count must be less than 100
  • Power_On_Hours must be less than 43200 (about 5 years); older HDDs go to the trash
  • Current_Pending_Sector must equal 0
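The thresholds in the list above could be checked roughly like this. It is a sketch under assumptions: the input is the attribute table printed by `smartctl -A`, the drive reports the common attribute names `Reallocated_Sector_Ct`, `Power_On_Hours` and `Current_Pending_Sector`, and the raw value sits in the tenth column. The function name `hdd_smart_ok` is hypothetical.

```shell
# Read `smartctl -A` output on stdin. Print every attribute that violates
# a threshold as NAME=RAW_VALUE and return non-zero if there is any.
hdd_smart_ok() {
    awk '
        $2 == "Reallocated_Sector_Ct"  && ($10 + 0) >= 100   { print $2 "=" ($10 + 0); bad = 1 }
        $2 == "Power_On_Hours"         && ($10 + 0) >= 43200 { print $2 "=" ($10 + 0); bad = 1 }
        $2 == "Current_Pending_Sector" && ($10 + 0) != 0     { print $2 "=" ($10 + 0); bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Hypothetical usage:
# smartctl -A /dev/sda | hdd_smart_ok || echo "disk rejected"
```

Attribute names and raw value formats vary between vendors, so a real check needs per-model exceptions.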

Check the disk read speed

The speed is measured at 3 different places on the disk: at an offset of 4 GB from the start, in the middle, and 4 GB from the end. For each offset we run this function:

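# Drop the page cache so the read really hits the disk
sysctl -w vm.drop_caches=3 > /dev/null
zcav -c 1 -s ${SKIP_COUNT} -r ${OFFSET} -l /tmp/zcav1.log -f "${DISK}"
if [ $? -ne 0 ]; then
    echo err
    exit 1
fi
# Average the second column (speed) over all non-comment lines of the log
SPEED=$(awk '! /^#/ {speed+=$2; count+=1} END {print int(speed/count)}' /tmp/zcav1.log)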

The speed depends on the HDD model; it is usually not less than 100 MB/s.
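The three measurement positions can be derived from the disk size. The sketch below is an illustration only: the function name `speed_test_offsets_mib` is made up, offsets are expressed in MiB, and reading "4 GB from the end" as the start of the last measurement is my assumption.

```shell
# Given a disk size in bytes ($1), print the three start offsets in MiB:
# 4 GiB from the start, the middle, and 4 GiB before the end.
speed_test_offsets_mib() {
    size_mib=$(( $1 / 1048576 ))
    margin=4096    # 4 GiB expressed in MiB
    echo "$margin $(( size_mib / 2 )) $(( size_mib - margin ))"
}

# Hypothetical usage with a real device:
# speed_test_offsets_mib "$(blockdev --getsize64 /dev/sda)"
```

Measuring at several positions matters for spinning disks because the outer tracks are noticeably faster than the inner ones.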

SSD

Check SMART values

  • MediaWearoutIndicator: the percentage of disk lifetime remaining. For a new disk it is 100. All values > 10 are allowed
  • ReallocatedSector: allowed values are < 100

RAID status

Identify the RAID controller model and check the RAID status. It must be optimal.

detect_raid_type() {
    RAIDSTR=$(lspci | grep -i raid)
    if echo "${RAIDSTR}" | grep -iq adaptec; then
        # This is Adaptec
        echo "adaptec"
    elif echo "${RAIDSTR}" | grep -iqE 'lsi|megaraid'; then
        # This is LSI
        echo "lsi"
    elif echo "${RAIDSTR}" | grep -iq '3ware'; then
        # This is 3ware
        echo "3ware"
    elif echo "${RAIDSTR}" | grep -iqE 'Hewlett-Packard.*Smart'; then
        # This is HP Smart Array
        echo "HP-SmartArray"
    elif dmesg | grep -q cciss/ ; then
        echo cciss
    else
        echo "unknown"
    fi
}

raid_status_lsi() {
    RSTATUS=$(megacli -LDInfo -Lall -aALL | awk -F: '$1 ~ /State/ {print $2}')
    if ! echo "${RSTATUS}" | grep -q 'Optimal'; then
        echo "${RSTATUS}"
        return 1
    fi
}

raid_status_unknown() {
    echo "Unknown RAID"
    return 0
}

raid_status_cciss() {
    RSTATUS=$(cciss_vol_status /dev/cciss/c*d0)
    if ! echo "${RSTATUS}" | grep -q "OK"; then
        echo "${RSTATUS}"
        return 1
    fi
}
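One way to tie the detected type to its status function is a small dispatcher. This is a sketch, not the code we run: `run_raid_check` is a hypothetical name, it expects the output of `detect_raid_type` as its argument, and treating a type without a handler as "not a failure" mirrors `raid_status_unknown` above but is still an assumption.

```shell
# Call raid_status_<type> if such a function exists; otherwise report that
# no check is implemented and do not fail the server for it.
run_raid_check() {
    # Hyphens (as in "HP-SmartArray") are mapped to underscores so the
    # result is a valid function name.
    handler="raid_status_$(printf '%s' "$1" | tr '-' '_')"
    if command -v "$handler" >/dev/null 2>&1; then
        "$handler"
    else
        echo "No status check implemented for RAID type: $1"
        return 0
    fi
}

# Hypothetical usage:
# run_raid_check "$(detect_raid_type)" || echo "RAID is not optimal"
```

Adding support for another controller then only requires defining one more `raid_status_*` function.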

Network

  • Check the network download speed: it must be > 300 Mbit/s

curl -k --progress-bar -w "%{speed_download}" -o /dev/null "($CGI_MGR_URLv4)/speedtest_cgi?id=($AUTH_ID)&func=server.speedtest"
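curl reports `%{speed_download}` in bytes per second, so the value has to be converted before it is compared with the 300 Mbit/s limit. A minimal sketch, assuming 1 Mbit = 1,000,000 bits; the function names `speed_mbit` and `speed_ok` are made up for this example.

```shell
# Convert bytes per second ($1) to whole megabits per second.
speed_mbit() {
    awk -v bps="$1" 'BEGIN { printf "%d\n", bps * 8 / 1000000 }'
}

# Succeed when the speed ($1, bytes/s) exceeds the limit ($2, Mbit/s, default 300).
speed_ok() {
    [ "$(speed_mbit "$1")" -gt "${2:-300}" ]
}

# Hypothetical usage, with SPEEDTEST_URL standing in for the real endpoint:
# bps=$(curl -k -s -w '%{speed_download}' -o /dev/null "$SPEEDTEST_URL")
# speed_ok "$bps" || echo "network too slow: $(speed_mbit "$bps") Mbit/s"
```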

Statistics

On average the script checks 323 servers per month, and 124 of them (roughly 38%) are marked as broken. We do not hand servers with problems to customers; they go into a pool of broken servers, where datacenter engineers replace disks and repair fans. When the problem was in the CPU or RAM, we have always managed to get the part replaced under warranty.

HDD SMART stats

I have 1800 SMART reports from different HDDs, covering 103 models:

HDDs with the RawReadError_Rate attribute

Our script runs in the DCImanager panel for datacenter infrastructure.

How do you check your bare-metal servers?
