LVM2 boot disk encapsulation on RHEL7/CentOS7

Introduction


Hi everyone,
Life on overcloud nodes used to be simple, and everybody loved that single 'root' partition on the (currently less than 2TB) boot disk. This gave us overcloud nodes partitioned like this:

[root@msccld2-l-rh-cmp-12 ~]# df -h -t xfs
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 1.1T 4.6G 1.1T 1% /

The problem with this approach is that anything filling up any subdirectory on the boot disk fills the whole root filesystem and causes services to fail. This story is almost 30 years old.
For that reason, most security policies (think SCAP) insist that /var, /tmp and /home be separate logical volumes, and that every disk use LVM2 so that additional logical volumes can be added later.
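As a quick way to see whether a node meets that kind of policy, here is a minimal sketch that lists the directories which are not their own mount point. The function name `missing_mounts` and its arguments are hypothetical and only for illustration; it reads a mount table in the `/proc/self/mounts` format and does not check that the mounts are LVM volumes.

```shell
#!/usr/bin/env bash
# Print every required directory that is NOT its own mount point.
#   $1     : mount table in /proc/self/mounts format (defaults to the live one)
#   $2...  : directories to check (defaults to /var /tmp /home)
missing_mounts() {
    local table="${1:-/proc/self/mounts}"
    if [ "$#" -gt 0 ]; then
        shift
    fi
    local dirs=("$@")
    if [ "${#dirs[@]}" -eq 0 ]; then
        dirs=(/var /tmp /home)
    fi
    local dir
    for dir in "${dirs[@]}"; do
        # Field 2 of each mount table line is the mount point.
        if ! awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$table"; then
            echo "$dir"
        fi
    done
}

# Usage on a live node (prints nothing when all three are separate mounts):
#   missing_mounts
```

On the single-partition node shown above, this would print all three directories.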

To solve this problem, whole-disk image support is coming to Ironic. It landed in 5.6.0 (see [1]) but missed the OSP10 release. With whole-disk image support in Ironic, we could easily change overcloud-full.qcow2 to be a full-disk image with LVM and separate volumes. This work is a tremendous advance, thanks to Yolanda Robla. I hope it gets backported to stable/Newton (OSP10, our first LTS release).

I wanted to solve this issue for OSP10 (and maybe for previous versions too), so I started working on a tool to 'encapsulate' the existing overcloud partition into LVM2 during deployment. This is now working reliably, and I wanted to present the result here so it can be re-used for other purposes.

Resulting configuration

The resulting layout is fully configurable and the process is automated. It can create an arbitrary number of logical volumes on your freshly deployed overcloud node.
Here's an example for a compute node with a 64GB boot disk and an 8TB secondary disk:

[root@krynn-cmpt-1 ~]# df -t xfs
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rootdg-lv_root 16766976 3157044 13609932 19% /
/dev/mapper/rootdg-lv_tmp 2086912 33052 2053860 2% /tmp
/dev/mapper/rootdg-lv_var 33538048 428144 33109904 2% /var
/dev/mapper/rootdg-lv_home 2086912 33056 2053856 2% /home

[root@krynn-cmpt-1 ~]# pvs
PV VG Fmt Attr PSize PFree
/dev/sda2 rootdg lvm2 a-- 63.99g 11.99g

[root@krynn-cmpt-1 ~]# vgs
VG #PV #LV #SN Attr VSize VFree
rootdg 1 4 0 wz--n- 63.99g 11.99g
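To confirm from a script that a node ended up in this state, a small check along these lines could be used. It is only a sketch: the function names `dm_path` and `root_is_on_lv` are hypothetical, and the check relies on `findmnt` from util-linux being available on the node.

```shell
#!/usr/bin/env bash
# Build the device-mapper path for a VG/LV pair. device-mapper joins the two
# names with a single hyphen and doubles any hyphen inside either name.
dm_path() {
    local vg="$1" lv="$2"
    echo "/dev/mapper/${vg//-/--}-${lv//-/--}"
}

# Succeed only if the filesystem mounted on "/" sits on the expected LV.
root_is_on_lv() {
    local expected actual
    expected="$(dm_path "$1" "$2")"
    actual="$(findmnt -n -o SOURCE /)"
    [ "$actual" = "$expected" ]
}

dm_path rootdg lv_root    # prints /dev/mapper/rootdg-lv_root
```

On a converted node, `root_is_on_lv rootdg lv_root && echo OK` should print OK.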

Implementation

The tool (mostly a big fat shell script) will come into action at the end of firstboot and use a temporary disk to create the LVM2 structures and volumes. It will then set the root to this newly-created LV and will reboot the system.
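The first phase can be pictured with the dry-run sketch below. It is not the actual script: it only prints the kind of LVM commands such a phase would need and runs none of them. The function name `plan_phase_one` is hypothetical, and the data copy, fstab, GRUB and initramfs steps are left out.

```shell
#!/usr/bin/env bash
# DRY RUN ONLY: every command is printed, none is executed.
boot_dg=rootdg
temp_part=/dev/sdc1
declare -A boot_vols=( [lv_root]=16g [lv_var]=32g [lv_home]=2g [lv_tmp]=2g )

plan_phase_one() {
    local lv
    # Turn the temporary partition into a PV and build the VG on it.
    echo "pvcreate -ff -y ${temp_part}"
    echo "vgcreate ${boot_dg} ${temp_part}"
    # Sort the LV names so the plan is printed in a stable order.
    for lv in $(printf '%s\n' "${!boot_vols[@]}" | sort); do
        echo "lvcreate -y -n ${lv} -L ${boot_vols[$lv]} ${boot_dg}"
        echo "mkfs.xfs /dev/${boot_dg}/${lv}"
    done
    # Omitted here: copying the running system into the new filesystems,
    # updating /etc/fstab and GRUB, and installing the run-once service.
}

plan_phase_one
```

Printing the plan first is a cheap way to review what would happen to the temporary disk before anything destructive runs.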

When the system boots from the temporary disk, it will wipe clean the partition the system was originally installed on. Then it will proceed to mirror the LVs and the VG back to that single partition. Once finished, everything will be back where it was before, except for the temporary disk, which is wiped clean too.
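One way to picture this second phase is the dry-run sketch below. It assumes the data is moved back with `pvmove`, which is a substitution made here for illustration; the post does not say which LVM mechanism the tool uses for the mirroring. The function name `plan_phase_two` is hypothetical and nothing is executed.

```shell
#!/usr/bin/env bash
# DRY RUN ONLY: every command is printed, none is executed.
#   $1: volume group, $2: original partition, $3: temporary partition
plan_phase_two() {
    local vg="$1" orig_part="$2" temp_part="$3"
    # Clear the old filesystem signature and add the partition to the VG.
    echo "wipefs -a ${orig_part}"
    echo "pvcreate -ff -y ${orig_part}"
    echo "vgextend ${vg} ${orig_part}"
    # Move every allocated extent off the temporary PV (assumed technique).
    echo "pvmove ${temp_part} ${orig_part}"
    # Drop the temporary PV from the VG and wipe it.
    echo "vgreduce ${vg} ${temp_part}"
    echo "pvremove ${temp_part}"
    echo "wipefs -a ${temp_part}"
}

plan_phase_two rootdg /dev/sda2 /dev/sdc1
```

With `pvmove` the volumes stay mounted while their extents are relocated, which fits a service that runs once on a live system.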

Logs of all actions are kept on the nodes themselves:

[root@krynn-cmpt-1 ~]# ls -lrt /var/log/ospd/*root*log
-rw-r--r--. 1 root root 15835 Mar 20 16:53 /var/log/ospd/firstboot-encapsulate_rootvol.log
-rw-r--r--. 1 root root 2645 Mar 20 17:02 /var/log/ospd/firstboot-lvmroot-relocate.log



The first log details the execution of the initial part of the encapsulation: creating the VG and the LVs, setting up GRUB, injecting the run-once boot service, etc.
The second log details the execution of the run-once service that mirrors the volumes back to the original partition carved out by TripleO during a deploy.

It is called by the global multi-FirstBoot template here:

Which we called from the main environment file:


Configuration

The tool lets you change the name of the Volume Group, how many volumes are created, what size they are, etc. The only way to change these settings is to edit the lines marked as 'EDITABLE' at the top of your copy of the script, e.g.:

boot_dg=rootdg                                 # EDITABLE
boot_lv=lv_root                                # EDITABLE
# ${temp_disk} is the target disk. This disk will be wiped clean, be careful.
temp_disk=/dev/sdc                             # EDITABLE
temp_part="${temp_disk}1"
# Size the volume
declare -A boot_vols
boot_vols["${boot_lv}"]="16g"                   # EDITABLE
boot_vols["lv_var"]="32g"                       # EDITABLE
boot_vols["lv_home"]="2g"                       # EDITABLE
boot_vols["lv_tmp"]="2g"                        # EDITABLE
declare -A vol_mounts
vol_mounts["${boot_lv}"]="/"
vol_mounts["lv_var"]="/var"                     # EDITABLE
vol_mounts["lv_home"]="/home"                   # EDITABLE
vol_mounts["lv_tmp"]="/tmp"                     # EDITABLE


All of the fields marked 'EDITABLE' can be changed. A new LV can be added by inserting a new entry in both boot_vols and vol_mounts.
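As an illustration, the sketch below adds one extra volume and then sanity-checks the two arrays. The `lv_opt` entry and the helper names `unpaired_vols` and `total_gib` are hypothetical, and the size sum assumes every size is a whole number with a `g` suffix, as in the excerpt above.

```shell
#!/usr/bin/env bash
boot_lv=lv_root
declare -A boot_vols=( ["${boot_lv}"]="16g" [lv_var]="32g" [lv_home]="2g" [lv_tmp]="2g" )
declare -A vol_mounts=( ["${boot_lv}"]="/" [lv_var]="/var" [lv_home]="/home" [lv_tmp]="/tmp" )

# Hypothetical extra volume: one entry in EACH array.
boot_vols["lv_opt"]="4g"
vol_mounts["lv_opt"]="/opt"

# Print every LV that has a size but no mount point, or the reverse.
unpaired_vols() {
    local lv
    for lv in "${!boot_vols[@]}"; do
        [ -n "${vol_mounts[$lv]:-}" ] || echo "$lv"
    done
    for lv in "${!vol_mounts[@]}"; do
        [ -n "${boot_vols[$lv]:-}" ] || echo "$lv"
    done
}

# Sum the requested sizes, assuming whole numbers with a "g" suffix.
total_gib() {
    local lv total=0
    for lv in "${!boot_vols[@]}"; do
        total=$(( total + ${boot_vols[$lv]%g} ))
    done
    echo "$total"
}

unpaired_vols    # prints nothing when both arrays agree
total_gib        # prints 56
```

Without the extra entry the total is 52g, which matches the 11.99g left free on the 63.99g PV in the earlier pvs output. The total has to fit on both the temporary disk and the original partition.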

Warnings, Caveats and Limitations


Please be aware of the following warnings:
  • The tool will WIPE/ERASE/DESTROY whatever temporary disk you give it. (I use /dev/sdc because /dev/sdb is used for something else). This is less than ideal but I haven't found something better yet.
  • This tool has only been used on RHEL7.3 and above. It should work fine on CentOS7.
  • The tool -REQUIRES- a temporary disk. It will not function without it. It will WIPE THAT DISK.
  • This tool can be used outside of OSP-Director. In fact, this is how I developed the script, but you still REQUIRE a temporary disk.
  • This tool can be used with OSP-Director but it MUST be invoked in firstboot and it MUST execute last. One way to do this is to make it 'depend' on all of the previous first boot scripts. For my templates, it involved doing the following:
  • It lengthens your deployment time and causes an I/O storm on your machines as the data blocks are copied back and forth. For virtual environments, I have added 'rootdelay' and 'scsi_mod.scan=sync' to the kernel command line to help the nodes find their 'root' after reboot. If some nodes complain that they couldn't mount 'root' on unknown(0,0), this is the likely cause, and resetting the node manually should get everything back on track.
  • The resulting final configuration is fully RHEL-supported, nothing specific there.

  • THIS IS A WORK IN PROGRESS, feel free to report back success and/or failure.
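For the two kernel arguments mentioned in the warnings, here is a minimal sketch of adding them to a GRUB defaults file by hand. The function name `add_kernel_args` is hypothetical, the example works on a scratch copy rather than on /etc/default/grub, and `rootdelay=30` is an illustrative value since the post does not state one. It relies on GNU `sed -i`.

```shell
#!/usr/bin/env bash
# Append each kernel argument to GRUB_CMDLINE_LINUX in the given file,
# unless an argument with the same name is already present.
add_kernel_args() {
    local file="$1" arg current word found
    shift
    for arg in "$@"; do
        current="$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/\1/p' "$file")"
        found=no
        for word in $current; do
            # Compare names only, so an existing value is left alone.
            [ "${word%%=*}" = "${arg%%=*}" ] && found=yes
        done
        if [ "$found" = no ]; then
            sed -i "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"\$|GRUB_CMDLINE_LINUX=\"\1 ${arg}\"|" "$file"
        fi
    done
}

# Demonstration on a scratch file, not on the real /etc/default/grub.
scratch="$(mktemp)"
echo 'GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto"' > "$scratch"
add_kernel_args "$scratch" rootdelay=30 scsi_mod.scan=sync
cat "$scratch"
rm -f "$scratch"
```

On a BIOS-booted RHEL7 node, a change to /etc/default/grub only takes effect after regenerating the configuration with `grub2-mkconfig -o /boot/grub2/grub.cfg`.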

Comments

  1. Funny. I was just talking about this today. I came from HP-UX to Linux, so we have always set up /opt, /var, /tmp, etc. as separate logical volumes. I cannot tell you how many times something like rabbitmq has filled up /var in a matter of days.
