Sunday, July 12, 2015

LSI MegaRaid HBAs, overheating and one ugly hack

Summer is here!
Like many others, I have LSI MegaRaid HBAs that I use in desktop machines.

These things are great but they tend to overheat, and quite a few people have reported high temperatures (the chip reported 97C when idling in both my Dell T410 and my Dell T5610):


sudo megaclisas-status

-- Controller information --
-- ID | H/W Model                  | RAM    | Temp | Firmware     
c0    | LSI MegaRAID SAS 9271-8i   | 1024MB | 97C  | FW: 23.32.0-0009 
c1    | LSI MegaRAID SAS 9280-4i4e | 512MB  | N/A  | FW: 12.15.0-0205 
[...]
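
If you want to keep an eye on that Temp column rather than read it by hand, here is a minimal Python sketch that parses the controller table. It assumes the report looks exactly like the excerpt above; the function names and the 90C warning limit are made up for illustration and are not part of megaclisas-status.

```python
import re

# Matches a controller row such as:
# c0    | LSI MegaRAID SAS 9271-8i   | 1024MB | 97C  | FW: 23.32.0-0009
# Assumption: the column layout is the one shown in this post.
ROW = re.compile(r"^(c\d+)\s*\|\s*(.+?)\s*\|\s*\S+\s*\|\s*(\d+C|N/A)\s*\|")


def parse_controller_temps(report):
    """Return {controller_id: (model, temperature in C, or None for N/A)}."""
    temps = {}
    for line in report.splitlines():
        match = ROW.match(line.strip())
        if match:
            cid, model, temp = match.groups()
            temps[cid] = (model, None if temp == "N/A" else int(temp[:-1]))
    return temps


def overheating(temps, limit_c=90):
    """List controllers at or above limit_c (hypothetical threshold)."""
    return sorted(
        cid for cid, (_, temp) in temps.items()
        if temp is not None and temp >= limit_c
    )


# Sample text copied from the output above.
SAMPLE = """\
-- Controller information --
-- ID | H/W Model                  | RAM    | Temp | Firmware
c0    | LSI MegaRAID SAS 9271-8i   | 1024MB | 97C  | FW: 23.32.0-0009
c1    | LSI MegaRAID SAS 9280-4i4e | 512MB  | N/A  | FW: 12.15.0-0205
"""

if __name__ == "__main__":
    for cid in overheating(parse_controller_temps(SAMPLE)):
        print(f"{cid} is running hot")
```

Controllers that report N/A (like c1 here) are kept in the result with a temperature of None, so they are never flagged.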

I had never really bothered about the temperature, but when I started to rebuild my T410's boot LD (RAID-1) to swap the 2TB drives with 4TB drives I had, things got complicated quickly.

As soon as the mirror started rebuilding, the ROC temp (sitting around 97C) skyrocketed to 102C and soon enough the card shut itself down, dropping off the PCI-E bus and resetting the server.

After the machine reset, the mirroring process continued where it had left off, the temperature of the ROC increased and the whole system reset itself again.

Luckily, rebuilding a Logical Drive (LD) is a reliable process and it can recover successfully after a system reset.
I finished rebuilding the boot LD mirror with the computer case open so that ambient air would cool the card sufficiently to let the rebuild complete.

After things were back to normal, I started researching the issue and found others with similar problems.

Someone also attempted to fit a 40mm fan on top of the heatsink and described the specs in great detail:

There was even someone who had a business on eBay selling overpriced fans for MegaRaid controllers (talk about a known problem!):


So it seemed like a known but unacknowledged problem, and I set out to find a solution. Alas, I couldn't fit a fan to my 9271-8i's heatsink because I had no space left on top of it (all PCI-E slots were used in that machine).

After some experimentation with a spare 40mm fan I had, I came up with the following workaround:

1) Find a small, reliable 40mm fan to fit to the side of the overheating MegaRaid card.
 I went for the Noctua NF-A4x10 FLX fan (150,000 hours MTBF and no more than 20dBA)

2) Attach the fan to the heatsink of the 9271-8i card. Luckily, the heatsink of the other LSI card (a 9280-4i4e) was close enough to let me position the fan across both cards so that it would cool both heatsinks.

Here's a picture of the inside of the Dell T410 with the fan fitted on top of both cards.

With the fan attached, the cards run much cooler (even in the hot summer weather):
sudo megaclisas-status
-- Controller information --
-- ID | H/W Model                  | RAM    | Temp | Firmware     
c0    | LSI MegaRAID SAS 9271-8i   | 1024MB | 71C  | FW: 23.32.0-0009 
c1    | LSI MegaRAID SAS 9280-4i4e | 512MB  | N/A  | FW: 12.15.0-0205 
[...]

So with an $8.99 fan, I experienced a drop of 26C in chip temperature (97C down to 71C). This is one crude and ugly hack (the fan was carefully screwed to the heatsinks) but it does the job.

For the T5610, since I had more room in the PCI-E slots, I simply screwed the fan to the LSI's heatsink:

Again, this resulted in a significant temperature drop (down from 102C):

sudo megaclisas-status
-- Controller information --
-- ID | H/W Model                  | RAM    | Temp | Firmware     
c0    | LSI MegaRAID SAS 9271-8i   | 1024MB | 56C  | FW: 23.32.0-0009
[...]
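
For anyone who wants to compare the two machines side by side, here is a tiny Python sketch of the arithmetic, using only the before/after readings quoted in this post. The dictionary layout and the function name are just illustrative choices.

```python
# (before, after) chip temperatures in C, as quoted in this post.
READINGS = {
    "T410 / 9271-8i": (97, 71),
    "T5610 / 9271-8i": (102, 56),
}


def temperature_drops(readings):
    """Return how many degrees C each controller cooled down."""
    return {name: before - after for name, (before, after) in readings.items()}


if __name__ == "__main__":
    for name, drop in temperature_drops(READINGS).items():
        print(f"{name}: {drop}C cooler")
```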

Update (2016/07/22): I've removed the screws in the T410 and settled for something a little cleaner (a magnetic arm holding the fan) so that it cools both HBAs properly.




10 comments:

  1. The chassis temp of the T410 was around 26C at that time (room temp).
    It isn't exactly over specs.. :(

  2. Which kind of screws did you use? 9271-8i?

  3. Could you share where to acquire the magnetic arm? Thanks.

    1. The magnetic arms I used were these:
      http://www.akust.com/product/adjustable-magnetic-fan-bridge-mounting-kit
      No issues to report so far..
      Regards,
      Vincent

  4. I was adding an LSI 9260-8i to my HP 420 Workstation server. The HP 420 has a green clip used to hold the default-issue video card in place. I used the clip to attach a fan over the video and LSI cards as a cooler. I have pictures if you would like to add them to this post. Contact me and I will send you the pictures.

  5. Vincent, since you obviously have a 9361-8i, could you do me a really big favor and measure the size of the existing heatsink? I am looking at doing a custom-made job for one of these and running long dissipation lines down to the PCI slot cover, but as I don't have my card yet I can't start work on the idea.

  6. I am reading this blog with interest, as maybe you experts can help us with a problem. We have a customer using a Dell server with a MegaRAID® SAS 9271-8i 6Gb/s SAS and SATA RAID Controller Card in a video storage application. For the third time we have had a disc failure in the same slot, along with occasional crashes of the computer with messages saying video storage is lower than it should be. The company who sold us the machine and computer has only changed the drive, but I am thinking that the RAID controller board, the cables or maybe overheating is the problem. Before I buy a new RAID controller board, SAS drive and cables, have any of you seen similar problems, and could you point to the probable cause? The system is in Portugal, and it is quite hot in a small room. I suppose there is a log of core temperature prior to the failure of the disc. Sorry, I am not a computer engineer.
    Thanks
    Chris Price
    info@cinesonics.pt

    1. Hi Chris,
      I wouldn't be able to know where your problem comes from.
      I'd advise looking at the HBA's log to see if there's anything relevant there. Check the iDrac's log too.
      It could be several things: the backplane slot, the cables to/from the HBA.
      One thing I know for sure is that the 9271-8i will most likely overheat in a tower server during the summer without additional cooling.
      For my Dell PE T410 server, it resulted in the HBA shutting itself down and dropping off the PCI-E bus during a RAID rebuild.
      At the very least you should make sure the fans are functioning properly and that the small heatsink on the 9271-8i gets enough airflow.
      There's a temperature sensor on the 9271-8i, how hot is it? And check the drives' temperatures too. Some of them in the cage could
      be running hotter than others, especially if this is a tower server.
      The PERC cards usually have a larger heatsink and don't require that much airflow.
      Good luck,

  7. Hi Vincent, thanks for your advice. I suspect it is overheating, although the room is at 18 deg C. I will visit the customer next week and open the computer, which I did not do this time. The computer was provided as an OEM by the manufacturers for their film scanning machine.
    We have lost two drives and the computer crashes unexpectedly, this last time losing the config of the RAID completely. I have seen battery backup modules for this RAID controller. Do you think it is worth having one?
    Thanks again for your help.
