Sunday, 1 July 2012


Finding and fixing a corrupt ODM install

Good day. Recently we discovered that one of the AIX servers had an issue with a multitude of PowerPath devices. Issuing lsdev | grep hdiskpower | wc -l returned over 3000 devices, yet lspv | grep power showed that only about half a dozen were actually in use.
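As a quick sanity check, something along these lines compares how many hdiskpower devices the ODM knows about with how many actually belong to a volume group (a rough sketch; lspv output formatting can vary slightly between AIX levels):

lsdev -Cc disk | grep hdiskpower | wc -l        # hdiskpower devices defined in the ODM
lspv | grep hdiskpower | grep -v None | wc -l   # hdiskpower devices assigned to a VG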
Upgrading the ODM to a newer version didn’t help much. It took over 2.5 hours to remove all of the hdiskpower devices and then install 3 additional ones, and a reboot of the AIX system didn’t help either. After scouring the web, I found a few places that suggest the following procedure should fix the issue (at the moment this is untested). I’ll be validating it within the next week.
* Shut down the application(s), database(s), etc., and vary off all volume groups (VGs) except rootvg. This can be confirmed with lsvg -o (the varyoff and fibre steps are sketched as loops after this list)
* If EMC Solutions Enabler is running, disable with stordaemon shutdown all -immediate
* Remove paths from the PowerPath Configuration –> powermt remove hba=all
* Delete all Symmetrix Disks –> lsdev -Ct SYMM* -F name | xargs -n1 rmdev -dl
* Delete all hdiskpower devices –> rmdev -dl powerpath0
* Confirm they’re gone with –> lsdev -Cc disk (no Symmetrix nor hdiskpower devices should exist)
* Remove all fibre devices instances -> rmdev -Rdl fscsi0 (repeat for others like fscsi1 etc)
* Verify fibre adapters are gone –> lsdev -Cc adapter (no fscsi should exist)
* Put the hba devices into a defined state –> rmdev -l fcsX (replace X with 0, 1, etc.)
* Scan the bus –> emc_cfgmgr or cfgmgr -vl fcsX NOTE: emc_cfgmgr is a script downloadable from EMC’s website
* Configure all of the EMC devices into PowerPath –> powermt config
* Some final checks –> powermt display, powermt display dev=all, and lsdev -Cc disk
* Finally save your changes with –> powermt save
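For reference, the varyoff and fibre-device steps above can be scripted roughly like this (an untested sketch; adapter and VG names differ per system, so double-check the device list before feeding anything into rmdev):

for vg in $(lsvg -o | grep -v rootvg); do varyoffvg $vg; done             # vary off everything except rootvg
for f in $(lsdev -C -F name | grep "^fscsi"); do rmdev -Rdl $f; done      # remove all fscsi instances
for a in $(lsdev -Cc adapter -F name | grep "^fcs"); do rmdev -l $a; done # put the fcs adapters into a Defined state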
MPIO settings (if applicable) may have to be put in again. If so, they can be changed like so:
chdev -l fscsiX -a dyntrk=yes -a fc_err_recov=fast_fail (repeat for other adapters)
A reboot should NOT be necessary. However, I’ll confirm and update within a week.
Varying degrees of success
No issues up until "rmdev -dl powerpath0". Instead, I got this response:
rmdev -dl powerpath0
Method error (/etc/methods/ucfgpower):
0514-062 Cannot perform the requested function because the
specified device is busy.
So I ran lsdev -Cc disk. It listed the two local SAS drives and the 3000+ hdiskpower devices (all of the hdiskpower devices were in a Defined state). I then attempted a manual removal of those with the following line:
lsdev -Cc disk | grep hdiskpower | awk '{print "rmdev -dl " $1}' | sh
This slowly started to delete each of them one at a time. Time for a coffee break apparently!
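In hindsight, a slightly safer variant (untested here) would only target the devices that are already in a Defined state, rather than piping everything straight into sh:

lsdev -Cc disk | awk '/hdiskpower/ && /Defined/ {print $1}' | while read d; do rmdev -dl $d; done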
Once the 3135 hdiskpower devices were deleted, the rmdev -dl powerpath0 command worked as expected. The rest of the procedure went as planned. Lastly, I set the MPIO settings with the commands:
chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P
chdev -l fscsi1 -a dyntrk=yes -a fc_err_recov=fast_fail -P
MPIO settings took effect after reboot.
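To confirm the attributes stuck after the reboot, something like:

lsattr -El fscsi0 -a dyntrk -a fc_err_recov
lsattr -El fscsi1 -a dyntrk -a fc_err_recov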

Creating LPAR from command line from HMC



Create a new LPAR using the command line:

mksyscfg -r lpar -m MACHINE -i "name=LPARNAME,profile_name=normal,lpar_env=aixlinux,shared_proc_pool_util_auth=1,min_mem=512,desired_mem=2048,max_mem=4096,proc_mode=shared,min_proc_units=0.2,desired_proc_units=0.5,max_proc_units=2.0,min_procs=1,desired_procs=2,max_procs=2,sharing_mode=uncap,uncap_weight=128,boot_mode=norm,conn_monitoring=1"


Note: use man mksyscfg for a description of all the attributes.
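To confirm the LPAR was created (MACHINE and LPARNAME as in the example above), a listing along these lines should show it:

lssyscfg -r lpar -m MACHINE --filter "lpar_names=LPARNAME" -F name,lpar_id,state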

Another method is to create LPARs from a configuration file; this is useful when more than one LPAR needs to be created at the same time.

Here is an example for 2 LPARs, with each definition starting on a new line:

name=LPAR1,profile_name=normal,lpar_env=aixlinux,all_resources=0,min_mem=1024,desired_mem=9216,max_mem=9216,proc_mode=shared,min_proc_units=0.3,desired_proc_units=1.0,max_proc_units=3.0,min_procs=1,desired_procs=3,max_procs=3,sharing_mode=uncap,uncap_weight=128,lpar_io_pool_ids=none,max_virtual_slots=10,"virtual_scsi_adapters=6/client/4/vio1a/11/1,7/client/9/vio2a/11/1","virtual_eth_adapters=4/0/3//0/1,5/0/4//0/1",boot_mode=norm,conn_monitoring=1,auto_start=0,power_ctrl_lpar_ids=none,work_group_id=none,shared_proc_pool_util_auth=1
name=LPAR2,profile_name=normal,lpar_env=aixlinux,all_resources=0,min_mem=1024,desired_mem=9216,max_mem=9216,proc_mode=shared,min_proc_units=0.3,desired_proc_units=1.0,max_proc_units=3.0,min_procs=1,desired_procs=3,max_procs=3,sharing_mode=uncap,uncap_weight=128,lpar_io_pool_ids=none,max_virtual_slots=10,"virtual_scsi_adapters=6/client/4/vio1a/12/1,7/client/9/vio2a/12/1","virtual_eth_adapters=4/0/3//0/1,5/0/4//0/1",boot_mode=norm,conn_monitoring=1,auto_start=0,power_ctrl_lpar_ids=none,work_group_id=none,shared_proc_pool_util_auth=1

Copy this file to HMC and run:

mksyscfg -r lpar -m SERVERNAME -f /tmp/profiles.txt

where /tmp/profiles.txt contains all of the LPAR definitions as shown above.
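Afterwards, a quick listing should show the new partitions (a sketch):

lssyscfg -r lpar -m SERVERNAME -F name,lpar_id,state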

To change the settings of your LPAR, use the chsyscfg command as shown below.

Virtual SCSI creation and slot mapping
#chsyscfg -m Server-9117-MMA-SNXXXXX -r prof -i 'name=server_name,lpar_id=xx,"virtual_scsi_adapters=301/client/4/vio01_server/301/0,303/client/4/vio02/303/0,305/client/4/vio01_server/305/0,307/client/4/vio02_server/307/0"'

The command above creates the virtual SCSI adapters for the client LPAR and maps their slots to the VIO servers. In this scenario there are two VIO servers for redundancy.


Slot Mapping

VIO01_server (VSCSI server slot)    Client (VSCSI client slot)
Slot 301                            Slot 301
Slot 303                            Slot 303

VIO02_server (VSCSI server slot)    Client (VSCSI client slot)
Slot 305                            Slot 305
Slot 307                            Slot 307


These slots are mapped so that any disk or logical volume assigned to the matching server adapter on the VIO server (via the VIO mkvdev command) is presented to the client on the corresponding client slot.
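For example, on vio01_server the padmin commands below would attach a backing device to the server adapter owning slot 301 (hdisk5, vhost0 and the client_disk1 name are just illustrative assumptions; lsmap shows which vhost sits in which slot via its physical location code, e.g. ...-C301):

$ lsmap -all                                               # find the vhost whose physical location ends in -C301
$ mkvdev -vdev hdisk5 -vadapter vhost0 -dev client_disk1   # map the backing disk to that server adapter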

Syntax for a virtual SCSI adapter


virtual-slot-number/client-or-server/remote-lpar-ID/remote-lpar-name/remote-slot-number/is-required


For example, the entry "virtual_scsi_adapters=301/client/4/vio01_server/301/0" in the mksyscfg command above

means

301 - virtual-slot-number
client - client-or-server (this is the client side of the adapter, on the AIX LPAR)
4 - remote-lpar-ID (the partition ID of the vio01_server LPAR)
vio01_server - remote-lpar-name
301 - remote-slot-number (the virtual SCSI server slot on the VIO server)
0 - is-required; 0 means desired (it can be removed by DLPAR operations), 1 means required (it cannot be removed by DLPAR operations)


To add a virtual Ethernet adapter and slot mapping to the profile created above:

#chsyscfg -m Server-9117-MMA-SNxxxxx -r prof -i 'name=server_name,lpar_id=xx,"virtual_eth_adapters=596/1/596//0/1,506/1/506//0/1"'

Syntax for a virtual Ethernet adapter


slot_number/is_ieee/port_vlan_id/"additional_vlan_id,additional_vlan_id"/is_trunk(number=priority)/is_required

So an adapter with the setting 596/1/596//0/1 is in slot_number 596, is IEEE 802.1Q capable (is_ieee=1), has a port_vlan_id of 596, has no additional VLAN IDs assigned, is not a trunk adapter, and is required.
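Once the profile change is in place, the virtual Ethernet slots can be reviewed from the HMC with something like the following (field names as I recall them; check the lshwres man page):

lshwres -r virtualio --rsubtype eth --level lpar -m Server-9117-MMA-SNxxxxx -F lpar_name,slot_num,port_vlan_id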

Recovering a Failed VIO Disk




Here is a recovery procedure for replacing a failed client disk on a Virtual IO 
server. It assumes the client partitions have mirrored (virtual) disks. The 
recovery involves both the VIO server and its client partitions. However, 
it is non-disruptive for the client partitions (no downtime), and may be 
non-disruptive on the VIO server (depending on disk configuration). This
procedure does not apply to RAID5 or SAN disk failures.

The test system had two VIO servers and an AIX client. The AIX client had two 
virtual disks (one disk from each VIO server). The two virtual disks 
were mirrored in the client using AIX's mirrorvg. (The procedure would be 
the same on a single VIO server with two disks.) 

The software levels were:


p520: Firmware SF230_145 VIO Version 1.2.0 Client: AIX 5.3 ML3 


We had simulated the disk failure by removing the client LV on one VIO server. The 
padmin commands to simulate the failure were:


#rmdev -dev vtscsi01 # The virtual scsi device for the LV (lsmap -all)
#rmlv -f aix_client_lv # Remove the client LV


This caused "hdisk1" on the AIX client to go "missing" (as shown by "lsvg -p rootvg"; note that "lspv" will not show the disk failure, only the disk status at the last boot).

The recovery steps included:

VIO Server 


Fix the disk failure, and restore the VIOS operating system (if necessary), then:

mklv -lv aix_client_lv rootvg 10G             # recreate the client LV
mkvdev -vdev aix_client_lv -vadapter vhost1   # connect the client LV to the appropriate vhost
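To confirm the new virtual target device was created (vhost1 as in the example above):

$ lsmap -vadapter vhost1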


AIX Client 


# cfgmgr                             # discover the new virtual hdisk2
# replacepv hdisk1 hdisk2            # rebuild the mirror copy on hdisk2
# bosboot -ad /dev/hdisk2            # add the boot image to hdisk2
# bootlist -m normal hdisk0 hdisk2   # add the new disk to the bootlist
# rmdev -dl hdisk1                   # remove the failed hdisk1


The "replacepv" command assigns hdisk2 to the volume group, rebuilds the mirror, and 
then removes hdisk1 from the volume group. 
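A couple of quick checks on the client before calling it done (a sketch):

# lsvg rootvg | grep -i stale   # should drop to 0 stale PPs once the resync completes
# bootlist -m normal -o         # confirm hdisk2 is now in the boot list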

As always, be sure to test this procedure before using it in production.

Virtual SCSI Server Adapter and Virtual Target Device
The mkvdev command will error out if the same name is used for both.

$ mkvdev -vdev hdiskpower0 -vadapter vhost0 -dev hdiskpower0
Method error (/usr/lib/methods/define -g -d):
0514-013 Logical name is required.

The reserve attribute is named differently for an EMC device than for an ESS or FAStT storage device: it is "reserve_lock".

Run the following command as padmin for checking the value of the attribute.
lsdev -dev hdiskpower# -attr reserve_lock

Run the following command as padmin for changing the value of the attribute.
chdev -dev hdiskpower# -attr reserve_lock=no

To change the Fibre Channel adapter attributes, set the following attributes on the fscsi# device: fc_err_recov to "fast_fail" and dyntrk to "yes".


$ chdev -dev fscsi# -attr fc_err_recov=fast_fail dyntrk=yes -perm

The reason for changing the fc_err_recov to “fast_fail” is that if the Fibre
Channel adapter driver detects a link event such as a lost link between a storage
device and a switch, then any new I/O or future retries of the failed I/Os will be
failed immediately by the adapter until the adapter driver detects that the device
has rejoined the fabric. The default setting for this attribute is "delayed_fail".
Setting the dyntrk attribute to “yes” makes AIX tolerate cabling changes in the
SAN.

The VIOS needs to be rebooted for fscsi# attributes to take effect.
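After the reboot, the settings can be verified as padmin (fscsi0 is just an example device):

$ lsdev -dev fscsi0 -attr fc_err_recov
$ lsdev -dev fscsi0 -attr dyntrk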