August 2014 updates

After some apparent RAID hardware issues earlier in the year, we decided to update and reconfigure the ROSA DAS systems to increase their resilience.

Background

The 'old' configuration on the DAS systems had a RAID-1 array for the OS (two disks mirroring each other exactly) and a RAID-0 array for observing data (8 disks with data striped across them for speed). The latter setup is needed for write performance but has no redundancy, so if a single drive fails the entire array is lost. It transpired that various power glitches had likely caused the RAID-0 arrays on DAS3 and DAS4 to fall out of sync, with one or more drives being regarded as foreign by the RAID BIOS. Unfortunately, the BIOS would warn about the foreign drives at boot time but continue to boot regardless, and the OS would then hang when trying to mount the data partition, as large parts of it were missing. The situation was remedied by using the RAID BIOS to erase the 'foreign' drive configuration and reconstruct the RAID-0, though at the cost of losing all data on the RAID-0 partition.

To guard against this problem recurring I reconfigured the arrays from RAID-0 to RAID-10. In this configuration the data is striped across mirrored pairs of hard drives, giving both redundancy and speed, though at the cost of capacity, as half of the drives now hold mirror copies rather than additional data. The capacity issue was mitigated by replacing the 147GB HDDs in DAS3-6 with 600GB units, resulting in a net increase in available storage; DAS1/2 already used 600GB HDDs, so their capacity has been reduced from around 4TB to 2TB, but they are of course now much more robust.
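The capacity trade-off described above can be checked with a little arithmetic. The following is a minimal Python sketch, not anything that runs on the DAS machines; the function name nominal_capacity_gb is made up for illustration, and the figures are nominal drive sizes, so the formatted capacities quoted on this page will be somewhat lower.

```python
def nominal_capacity_gb(level: str, n_drives: int, drive_gb: int) -> int:
    """Nominal (unformatted) usable capacity of an array of identical drives.

    Illustrative helper only; the name and interface are assumptions.
    """
    if level == "RAID-0":
        # Pure striping: every drive contributes its full size.
        return n_drives * drive_gb
    if level == "RAID-10":
        # Striping across mirrored pairs: half the drives hold mirror copies.
        if n_drives < 4 or n_drives % 2:
            raise ValueError("RAID-10 needs an even number of drives, at least 4")
        return (n_drives // 2) * drive_gb
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    cases = [
        ("DAS3-6 old (RAID-0, 8 x 147GB)", "RAID-0", 8, 147),
        ("DAS3-6 new (RAID-10, 8 x 600GB)", "RAID-10", 8, 600),
        ("DAS1/2 old (RAID-0, 8 x 600GB)", "RAID-0", 8, 600),
        ("DAS1/2 new (RAID-10, 8 x 600GB)", "RAID-10", 8, 600),
    ]
    for label, level, n, size in cases:
        print(f"{label}: {nominal_capacity_gb(level, n, size)} GB nominal")
```

For the drive counts and sizes given above this yields 1176 GB before and 2400 GB after for DAS3-6, and 4800 GB before and 2400 GB after for DAS1/2.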

Current configuration

Each DAS machine still has a RAID-1 for the OS, and now has a RAID-10 for the data partition. On most machines the RAID-10 is composed of eight 600GB HDDs for a total formatted capacity of 2TB; DAS3 has a 6-drive RAID-10 due to an apparent hardware issue with one of the drive bays (see below) and so has only 1.5TB of data space. Because of the transient hardware issue on DAS3, the redundancy of the RAID-10 was tested in actual operation, and it seemed to work as expected!
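To illustrate why the RAID-10 data arrays tolerate a drive failure where the old RAID-0 arrays did not, here is a minimal Python sketch of the two failure models. It is an illustration only: the function name array_survives is hypothetical, and the pairing of drives (0,1), (2,3), ... into mirrors is an assumption, since the actual pairing is decided by the RAID controller.

```python
def array_survives(level: str, n_drives: int, failed: set) -> bool:
    """Return True if the array's data is still readable after `failed` drives die.

    Illustrative model only; drive numbering and mirror pairing are assumptions.
    """
    if level == "RAID-0":
        # No redundancy: any single failure loses the whole array.
        return len(failed) == 0
    if level == "RAID-10":
        if n_drives < 4 or n_drives % 2:
            raise ValueError("RAID-10 needs an even number of drives, at least 4")
        # Assumed layout: drives (0,1), (2,3), ... form the mirrored pairs.
        pairs = [(i, i + 1) for i in range(0, n_drives, 2)]
        # Data survives as long as no pair has lost both of its drives.
        return all(not (a in failed and b in failed) for a, b in pairs)
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    print(array_survives("RAID-0", 8, {3}))            # single failure, RAID-0
    print(array_survives("RAID-10", 8, {3}))           # single failure, RAID-10
    print(array_survives("RAID-10", 8, {2, 3}))        # both halves of one mirror
    print(array_survives("RAID-10", 8, {0, 2, 4, 6}))  # one drive from each mirror
```

Under this model a single failed drive destroys a RAID-0 but not a RAID-10; the RAID-10 is only lost when both drives of the same mirrored pair fail.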

In addition to the hardware changes, I have configured the OS on the machines to mount the data partition after the main boot process, so that even if there is a problem the system should boot to a usable state rather than simply hang at a blank screen.

Known issues

  • There is an issue with drive bay 4 on DAS3 - each drive tested in the bay failed, and I believe the issue is the bay, not the drives. The machine has been configured to avoid using bays 3 and 4 in the RAID-10, and seems stable in this setup.
  • On two occasions we noted anomalies with the camera controller cards - they would show up in the GUI but not all operations would complete. This was resolved by reseating the PCI cards in the server. As this has happened before on DAS3 we swapped that controller to the second PCI-X slot.
  • The battery backup units on the RAID cards are depleted - the machine will complain about this at boot time, and the status LED on the server will flash orange with a ROMB error, but it does not affect operations.

Suggestions for future operations

To avoid problems with future runs I make the following suggestions:

  • The servers should be brought up a day or two before the run, and the gateway connected to the network, so I can SSH in from QUB and check for any drive problems.
  • The script check_das_drives should be run daily from the gateway root account to ensure that all arrays report being in the optimal state. See below for details on this script.
  • In case of problems, do not edit system configuration files - contact me (details below) immediately. I will make myself available during ROSA runs if at all possible; I can access the systems remotely and talk with observers via phone or Skype.
  • Given the age of the hardware, and the harsh lifecycle, we should expect problems and document them properly. I suggest we use this wiki to list any hardware issues (drive failures, controller glitches, etc) so we can spot patterns.

Checking DAS drive status

To be added

Contacting Robert

In case of problems, contact me immediately. As I don't normally check my QUB email off-site, use my personal address - robertryans @ me.com - or phone me on +447837835852. Mail to that account will show up on my phone immediately, and the number above is my mobile; my iPhone is rarely more than a metre from me.

Warning - use of these details for anything other than a ROSA emergency will result in severe offence to the caller.

public/research_areas/solar_physics/rosa_update_aug_2014.txt · Last modified: 2014/09/12 09:04 by Robert Ryans
