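What is a PCIe Surprise Down Error
A PCIe Surprise Down Error occurs when the link below a PCIe Downstream Port goes down without the operating system's knowledge: the port's Data Link Layer state transitions from DL_Active to DL_Inactive because the Physical Layer reports that Physical LinkUp has changed from 1b to 0b. The definition hinges on the operating system being unaware of the event, so an event that the operating system knows about and that takes the link from link up to link down is not treated as a Surprise Down Error.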
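Where Surprise Down Errors are recorded
On Linux, running the dmesg command in a terminal shows whether any Surprise Down Error has been reported.
Note that dmesg shows errors that occurred in the past, even if the error condition no longer exists in the current environment because of a restart or for some other reason.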
Figure 1: dmesg error output
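To search a saved or live kernel log for such entries, a small filter can help. The following is a minimal sketch rather than a definitive tool: the function names `find_surprise_down_lines` and `read_dmesg` are hypothetical, and the two search patterns are assumptions, because the exact wording the kernel prints for this error can differ between kernel versions. Adjust the patterns to whatever your own dmesg output shows.

```python
import re
import subprocess

# Assumed patterns: a kernel message may spell the error out or use the
# lspci-style abbreviation SDES. Adjust them to match your kernel's output.
DEFAULT_PATTERNS = (r"surprise\s+down", r"\bSDES\b")


def find_surprise_down_lines(log_text, patterns=DEFAULT_PATTERNS):
    """Return the log lines that match any of the given patterns."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [
        line
        for line in log_text.splitlines()
        if any(rx.search(line) for rx in compiled)
    ]


def read_dmesg():
    """Run dmesg and return its output (may require root privileges)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `find_surprise_down_lines(read_dmesg())` returns the matching lines; the same function also works on the text of a log file that was saved earlier.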
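Running lspci -s $target_dsp -vvv and inspecting the Uncorrectable Error Status Register in the Downstream Port's Advanced Error Reporting capability structure also shows whether a Surprise Down Error is currently present.
Note that a Surprise Down Error is recorded on the Downstream Port and indicates a link failure between that Downstream Port and the Upstream Port of the attached device. lspci shows the live state of the current environment: if an error occurred earlier but the device attached to the Downstream Port has since restarted, the error status may already have been cleared.
In the Uncorrectable Error Status Register, the Surprise Down Error Status field (bit 5, abbreviated SDES) is 1 when a Surprise Down Error is present.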
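The check can be automated by parsing the UESta: line that lspci prints, or by testing bit 5 of a raw register value. The sketch below is illustrative only: the helper names `parse_flags`, `uesta_flags` and `sdes_bit_set` are hypothetical, and it assumes the `lspci -vvv` layout in which each flag name is followed by `+` (set) or `-` (clear).

```python
import re

SDES_BIT = 5  # Surprise Down Error Status bit position in the register


def parse_flags(flag_text):
    """Turn 'DLP- SDES+ TLP-' into {'DLP': False, 'SDES': True, 'TLP': False}."""
    return {
        name: sign == "+"
        for name, sign in re.findall(r"(\w+)([+-])", flag_text)
    }


def uesta_flags(lspci_text):
    """Return the flags of the first UESta: entry, or None if there is none."""
    # Capture only the run of NAME+/NAME- tokens that follows "UESta:",
    # so the next field (for example "UEMsk:") is not swallowed.
    match = re.search(r"UESta:\s*((?:\w+[+-]\s*)+)", lspci_text)
    if match is None:
        return None
    return parse_flags(match.group(1))


def sdes_bit_set(raw_status):
    """Check bit 5 of a raw Uncorrectable Error Status Register value."""
    return bool((raw_status >> SDES_BIT) & 1)
```

For example, `uesta_flags(text)["SDES"]` is `True` when the captured lspci output contains `SDES+` on its UESta: line.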
Capabilities: [150 v2] Advanced Error Reporting
    UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
    UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol+
    CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
    CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
    AERCap: First Error Pointer: 05, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
    HeaderLog: 00000000 00000000 00000000 00000000
    RootCmd: CERptEn+ NFERptEn+ FERptEn+
    RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd- FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
    ErrorSrc: ERR_COR: c009 ERR_FATAL/NONFATAL: 0000
Scenarios in which a Surprise Down Error occurs
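Common scenarios that lead to a Surprise Down Error are:
Hardware fault: a PCIe add-in card with poor contact is accidentally bumped, causing the physical link to go down.
Power problems: abnormal power delivery to the device, such as voltage fluctuation or a sudden loss of power, can make the device power off unexpectedly and produce a Surprise Down Error.
Hot-plug events: not following the standard hot-plug procedure, for example pulling a PCIe device straight out of a slot that does not support surprise removal, leaves the operating system unable to notice the removal in time and triggers a Surprise Down Error.
Unexpected restart: a PCIe device (for example a SmartNIC) may restart itself in certain abnormal situations, which causes a Surprise Down Error.
For instance, when the CPU embedded in a SmartNIC can no longer respond to certain processes because of a fault, it may deliberately trigger a reset of the NIC chip to recover.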
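Notes:
The PCIe specification describes in detail that events such as hot reset, link disable, power management and hot plug, listed below, are not treated as Surprise Down Errors: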
• If the Secondary Bus Reset bit in the Bridge Control register has been Set by software, then the subsequent transition to DL_Inactive must not be considered an error.
• If the Link Disable bit has been Set by software, then the subsequent transition to DL_Inactive must not be considered an error.
• If a Switch Downstream Port transitions to DL_Inactive due to an event above that Port, that transition to DL_Inactive must not be considered an error. Example events include the Switch Upstream Port propagating Hot Reset, the Switch Upstream Link transitioning to DL_Down, and the Secondary Bus Reset bit in the Switch Upstream Port being Set.
• If a PME_Turn_Off Message has been sent through this Port, then the subsequent transition to DL_Inactive must not be considered an error.
  • Note that the DL_Inactive transition for this condition will not occur until a power off, a reset, or a request to restore the Link is sent to the Physical Layer.
  • Note also that in the case where the PME_Turn_Off/PME_TO_Ack handshake fails to complete successfully, a Surprise Down error may be detected.
• If the Port is associated with a hot-pluggable slot (the Hot-Plug Capable bit in the Slot Capabilities register Set), and the Hot-Plug Surprise bit in the Slot Capabilities register is Set, then any transition to DL_Inactive must not be considered an error.
• If the Port is associated with a hot-pluggable slot (Hot-Plug Capable bit in the Slot Capabilities register Set), and Power Controller Control bit in Slot Control register is Set (Power-Off), then any transition to DL_Inactive must not be considered an error.
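The exemptions above amount to a simple decision rule, which can be sketched as follows. This is an illustrative model only: the `PortState` class, its field names and `is_surprise_down` are hypothetical and do not correspond to any real driver or library API. The sketch also does not model the case where the PME_Turn_Off/PME_TO_Ack handshake fails, in which a Surprise Down error may still be detected.

```python
from dataclasses import dataclass


@dataclass
class PortState:
    """Hypothetical snapshot of the listed conditions for one Downstream Port."""
    secondary_bus_reset_set: bool = False     # Bridge Control: Secondary Bus Reset
    link_disable_set: bool = False            # Link Disable bit set by software
    caused_by_event_above_port: bool = False  # Switch Downstream Ports only
    pme_turn_off_sent: bool = False           # PME_Turn_Off sent through the Port
    hot_plug_capable: bool = False            # Slot Capabilities: Hot-Plug Capable
    hot_plug_surprise: bool = False           # Slot Capabilities: Hot-Plug Surprise
    power_controller_off: bool = False        # Slot Control: Power Controller Control


def is_surprise_down(port):
    """Return True if a DL_Active -> DL_Inactive transition counts as an error."""
    exempt = (
        port.secondary_bus_reset_set
        or port.link_disable_set
        or port.caused_by_event_above_port
        or port.pme_turn_off_sent
        or (port.hot_plug_capable and port.hot_plug_surprise)
        or (port.hot_plug_capable and port.power_controller_off)
    )
    return not exempt
```

With all fields left at their defaults, `is_surprise_down(PortState())` reports an error; setting any one of the exempting conditions, such as `link_disable_set=True`, makes the same transition a non-error.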
Source: Sina Finance
