ixgbe Detected Tx Unit Hang

September 26, 2019

System environment
OS: Ubuntu 16.04 LTS
Kernel: 4.4.0-164-generic
Network card: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ (dual port) / ixgbe

 

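Symptoms and messages
The network port link flaps up and down repeatedly, and messages like the ones below are logged continuously.
The messages follow no particular timing pattern. Reloading the network module after they appear clears the symptom, but the same symptom returns before long.
A large number of dropped packets also accumulates on the device.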
…………………………………………………..
…………………………………………………..
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564889] ixgbe 0000:2e:00.0 eth2: Detected Tx Unit Hang
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564889]   Tx Queue             <2>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564889]   TDH, TDT             <74>, <8e>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564889]   next_to_use          <8e>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564889]   next_to_clean        <74>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564889] tx_buffer_info[next_to_clean]
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564889]   time_stamp           <100d400ce>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564889]   jiffies              <100d405df>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564894] ixgbe 0000:2e:00.0 eth2: Detected Tx Unit Hang
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564894]   Tx Queue             <7>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564894]   TDH, TDT             <c9>, <e4>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564894]   next_to_use          <e4>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564894]   next_to_clean        <c9>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564894] tx_buffer_info[next_to_clean]
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564894]   time_stamp           <100d400c1>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564894]   jiffies              <100d405df>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564901] ixgbe 0000:2e:00.0 eth2: Detected Tx Unit Hang
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564901]   Tx Queue             <4>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564901]   TDH, TDT             <181>, <190>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564901]   next_to_use          <190>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564901]   next_to_clean        <181>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564901] tx_buffer_info[next_to_clean]
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564901]   time_stamp           <100d400ce>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564901]   jiffies              <100d405df>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564907] ixgbe 0000:2e:00.0 eth2: Detected Tx Unit Hang
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564907]   Tx Queue             <3>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564907]   TDH, TDT             <3b>, <57>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564907]   next_to_use          <57>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564907]   next_to_clean        <3b>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564907] tx_buffer_info[next_to_clean]
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564907]   time_stamp           <100d400ce>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564907]   jiffies              <100d405df>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564912] ixgbe 0000:2e:00.0 eth2: Detected Tx Unit Hang
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564912]   Tx Queue             <0>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564912]   TDH, TDT             <191>, <1a8>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564912]   next_to_use          <1a8>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564912]   next_to_clean        <191>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564912] tx_buffer_info[next_to_clean]
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564912]   time_stamp           <100d400ce>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564912]   jiffies              <100d405df>
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564924] ixgbe 0000:2e:00.0 eth2: tx hang 1 detected on queue 7, resetting adapter
Sep 25 06:54:26 XXXXXXXX kernel: [55884.564926] ixgbe 0000:2e:00.0 eth2: tx hang 1 detected on queue 4, resetting adapter
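
The fields in these reports can be summarized with a little shell. The sketch below is illustrative only: `count_tx_hangs` and `stall_jiffies` are hypothetical helper names, not part of any tool, and the parsing assumes the log layout shown above. The first counts hang reports per Tx queue, and the second subtracts time_stamp from jiffies to show how long the descriptor at next_to_clean has been waiting.

```shell
# Count "Detected Tx Unit Hang" reports per Tx queue from kernel log text on stdin.
count_tx_hangs() {
    awk '
        /Detected Tx Unit Hang/ { pending = 1; next }
        pending && /Tx Queue/ {
            # The queue number is printed as <N> in the last field.
            q = $NF
            gsub(/[<>]/, "", q)
            count[q]++
            pending = 0
        }
        END { for (q in count) printf "queue %s: %d\n", q, count[q] }
    ' | sort
}

# Print jiffies minus time_stamp; both arguments are hex values as logged, e.g. '<100d405df>'.
stall_jiffies() {
    ts=$(printf '%s' "$1" | tr -d '<>')
    now=$(printf '%s' "$2" | tr -d '<>')
    echo $(( 0x$now - 0x$ts ))
}
```

Usage: `dmesg | count_tx_hangs`, or `stall_jiffies '<100d400ce>' '<100d405df>'`, which prints 1297 for the first report above. Assuming CONFIG_HZ=250 (worth confirming with `grep 'CONFIG_HZ=' /boot/config-$(uname -r)`), that is roughly 5 seconds.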
 
 
 
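Additional notes
The system bonds its two network devices together and uses an MTU of 9000.
I suspected this setup first and tested it, but the symptom was identical across bonding modes, with a single unbonded device, and with MTU 1500.
 
This has long been reported as a bug on Ubuntu 16.04 and earlier, on CentOS, and elsewhere. From what I found by searching, most suggested remedies boil down to the following three.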
 
 
1. Change network device options
:: I toggled the options below ON/OFF and checked the result, but the same symptom persisted
tso => tcp-segmentation-offload
gso => generic-segmentation-offload
gro => generic-receive-offload
sg => scatter-gather
ufo => udp-fragmentation-offload (Cannot change)
lro => large-receive-offload (Cannot change)
 
# ethtool -K eth2 gro off lro off
# ethtool -k eth2 | grep large-receive-offload
large-receive-offload: off
 
# ethtool -K eth2 gro on lro on
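
To make the on/off trials repeatable, the toggling can be scripted. A minimal sketch follows. `offload_cmds` is a hypothetical helper that only prints the ethtool commands so they can be reviewed before running, and it covers just the four features not marked "Cannot change" in the list above.

```shell
# Print (do not run) the ethtool commands that set each changeable offload
# on a device to the given state ("on" or "off").
offload_cmds() {
    dev=$1
    state=$2
    for feature in tso gso gro sg; do
        echo "ethtool -K $dev $feature $state"
    done
}
```

Usage: `offload_cmds eth2 off` lists the commands, and `offload_cmds eth2 off | sh` applies them as root. `ethtool -k eth2` then shows the resulting state.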

 
 
 
2. Change the network device driver (4.2.1-k -> 5.6.3)
:: The same symptom persisted even after updating the driver to the latest version
# ethtool -i eth2
driver: ixgbe
version: 4.2.1-k
firmware-version: 0x2b2c0001
expansion-rom-version: 
bus-info: 0000:2e:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
# modinfo ixgbe
filename:       /lib/modules/4.4.0-62-generic/kernel/drivers/net/ethernet/intel/ixgbe/ixgbe.ko
version:        4.2.1-k
license:        GPL
description:    Intel(R) 10 Gigabit PCI Express Network Driver
author:         Intel Corporation, <linux.nics@intel.com>
srcversion:     F5568BA52A50F97CB589A09
alias:          pci:v00008086d000015ACsv*sd*bc*sc*i*
alias:          pci:v00008086d000015ADsv*sd*bc*sc*i*
alias:          pci:v00008086d000015ABsv*sd*bc*sc*i*
alias:          pci:v00008086d000015AAsv*sd*bc*sc*i*
alias:          pci:v00008086d00001563sv*sd*bc*sc*i*
alias:          pci:v00008086d00001560sv*sd*bc*sc*i*
alias:          pci:v00008086d0000154Asv*sd*bc*sc*i*
alias:          pci:v00008086d00001557sv*sd*bc*sc*i*
alias:          pci:v00008086d00001558sv*sd*bc*sc*i*
alias:          pci:v00008086d0000154Fsv*sd*bc*sc*i*
alias:          pci:v00008086d0000154Dsv*sd*bc*sc*i*
alias:          pci:v00008086d00001528sv*sd*bc*sc*i*
alias:          pci:v00008086d000010F8sv*sd*bc*sc*i*
alias:          pci:v00008086d0000151Csv*sd*bc*sc*i*
alias:          pci:v00008086d00001529sv*sd*bc*sc*i*
alias:          pci:v00008086d0000152Asv*sd*bc*sc*i*
alias:          pci:v00008086d000010F9sv*sd*bc*sc*i*
alias:          pci:v00008086d00001514sv*sd*bc*sc*i*
alias:          pci:v00008086d00001507sv*sd*bc*sc*i*
alias:          pci:v00008086d000010FBsv*sd*bc*sc*i*
alias:          pci:v00008086d00001517sv*sd*bc*sc*i*
alias:          pci:v00008086d000010FCsv*sd*bc*sc*i*
alias:          pci:v00008086d000010F7sv*sd*bc*sc*i*
alias:          pci:v00008086d00001508sv*sd*bc*sc*i*
alias:          pci:v00008086d000010DBsv*sd*bc*sc*i*
alias:          pci:v00008086d000010F4sv*sd*bc*sc*i*
alias:          pci:v00008086d000010E1sv*sd*bc*sc*i*
alias:          pci:v00008086d000010F1sv*sd*bc*sc*i*
alias:          pci:v00008086d000010ECsv*sd*bc*sc*i*
alias:          pci:v00008086d000010DDsv*sd*bc*sc*i*
alias:          pci:v00008086d0000150Bsv*sd*bc*sc*i*
alias:          pci:v00008086d000010C8sv*sd*bc*sc*i*
alias:          pci:v00008086d000010C7sv*sd*bc*sc*i*
alias:          pci:v00008086d000010C6sv*sd*bc*sc*i*
alias:          pci:v00008086d000010B6sv*sd*bc*sc*i*
depends:        mdio,ptp,dca,vxlan
intree:         Y
vermagic:       4.4.0-62-generic SMP mod_unload modversions 
parm:           max_vfs:Maximum number of virtual functions to allocate per physical function - default is zero and maximum value is 63. (Deprecated) (uint)
parm:           allow_unsupported_sfp:Allow unsupported and untested SFP+ modules on 82599-based adapters (uint)
parm:           debug:Debug level (0=none,…,16=all) (int)
 
Install the latest driver as of today (2019-09-25)
# wget https://downloadmirror.intel.com/14687/eng/ixgbe-5.6.3.tar.gz
# tar zxvf ixgbe-5.6.3.tar.gz
# cd ixgbe-5.6.3/src/
# make install 
# rmmod ixgbe ; modprobe ixgbe RSS=8
 
# ethtool -i eth2
driver: ixgbe
version: 5.6.3
firmware-version: 0x2b2c0001, 1.1197.0
expansion-rom-version: 
bus-info: 0000:2e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
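
After reloading the module it is worth confirming that the new driver, and not the in-tree one, is actually bound to the interface. The sketch below is one way to pull the version out of `ethtool -i` output. `driver_version` is a hypothetical helper name, and it assumes the "key: value" layout shown above.

```shell
# Read `ethtool -i` output on stdin and print only the driver version.
driver_version() {
    awk -F': *' '$1 == "version" { print $2 }'
}
```

Usage: `ethtool -i eth2 | driver_version` should print 5.6.3 once the new module is loaded.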
 
 
 
3. Change the kernel
:: Kernel >= 4.18.0
The kernel packages available through apt on Ubuntu 16.04 only go up to 4.15.0-64, so I built the kernel from source.
 
# apt-get install -y build-essential libncurses5 libncurses5-dev bin86 kernel-package libssl-dev bison flex libelf-dev
# cd /usr/local/src
# wget https://mirrors.edge.kernel.org/pub/linux/kernel/v4.x/linux-4.18.19.tar.gz
# tar zxvf linux-4.18.19.tar.gz 
# mv linux-4.18.19 /usr/src
# cd /usr/src/linux-4.18.19
# cp /boot/config-4.4.0-161-generic .config
# make menuconfig
# make-kpkg -j 8 --initrd --revision=1.0 kernel_image
# dpkg -i ../linux-image-4.18.19_1.0_amd64.deb
# reboot 
# uname -r
4.18.19
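
A quick way to check whether a given host already meets the "Kernel >= 4.18.0" condition is a version comparison. This is a sketch that assumes GNU `sort -V` is available, as it is on Ubuntu. `kernel_at_least` is a hypothetical helper name.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort): the smaller version sorts first.
kernel_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

Usage: `kernel_at_least "$(uname -r)" 4.18.0 && echo "kernel is new enough"`.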
 
 
Changes patched in kernel 4.18 and later
 - ESP (Encapsulating Security Payload)
This issue has this upstream thread about the problem and per this Archlinux forum post, setting 
CONFIG_INET_ESP_OFFLOAD=n 
CONFIG_INET6_ESP_OFFLOAD=n 
fixes the problem. I have built a kernel with these unset and verified that these changes work. 
 
Old Kernel 
# grep "INET[A-Za-z0-9]*_ESP" .config
CONFIG_INET_ESP=m
CONFIG_INET6_ESP=m
 
 
# uname -r
4.18.19
 
# grep "INET[A-Za-z0-9]*_ESP" .config
CONFIG_INET_ESP=m
# CONFIG_INET_ESP_OFFLOAD is not set
CONFIG_INET6_ESP=m
# CONFIG_INET6_ESP_OFFLOAD is not set
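
The same check can be turned into a pass/fail test for any kernel config file. The sketch below uses a hypothetical helper, `esp_offload_unset`, and assumes the usual .config convention in which a disabled option appears as a "# ... is not set" comment or is absent entirely.

```shell
# Read a kernel .config on stdin and succeed only when neither
# CONFIG_INET_ESP_OFFLOAD nor CONFIG_INET6_ESP_OFFLOAD is set to y or m.
esp_offload_unset() {
    ! grep -Eq '^CONFIG_INET6?_ESP_OFFLOAD=[ym]'
}
```

Usage: `esp_offload_unset < /boot/config-$(uname -r) && echo "ESP offload is not built"`.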
 
:: ESP-related network options that did not exist before have appeared, and the system runs with them unset
# ethtool -k eth2 |grep -i esp
Cannot get device udp-fragmentation-offload settings: Operation not supported
tx-esp-segmentation: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
 
# ethtool -i eth2
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x00012b2c, 1.1197.0
expansion-rom-version: 
bus-info: 0000:2e:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
 
 
Summary
Some users resolved the problem with step 1 or 2, but in my case it was resolved by step 3, updating to kernel 4.18 or later.
 
 


Category: HARDWARE, LINUX, Virtualization/Cloud, Network/Monitoring, Solutions/Other IT

이 경현


http://www.cloudv.kr Smileserv Co., Ltd., 2nd Research Lab