Corrupted ASM disk group block on a virtual-machine RAC forces a database rebuild

news Source: original 2024/5/18 20:03:44

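2011.11.23 Corrupted ASM disk group block on a virtual-machine RAC forces a database rebuild
I had just built a 10gR2 RAC environment with VMware Workstation 8 on a PC at the office, using raw devices plus ASM. After the installation succeeded, the host was accidentally rebooted directly. The next time I started the virtual machines, a message reported a damaged disk, and I ignored it. Starting RAC then ran into trouble. The first symptom was that the following resources would not start along with the others: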
ora.node1.LISTENER_NODE1.lsnr
ora.node2.LISTENER_NODE2.lsnr
ora.RAC.RAC1.inst
ora.RAC.RAC2.inst
ora.RAC.db
Here is the startup sequence in detail:
[oracle@node1 bin]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....C1.inst application OFFLINE OFFLINE
ora....C2.inst application OFFLINE OFFLINE
ora.RAC.db application OFFLINE OFFLINE
ora....SM1.asm application OFFLINE OFFLINE
ora....E1.lsnr application OFFLINE OFFLINE
ora.node1.gsd application OFFLINE OFFLINE
ora.node1.ons application OFFLINE OFFLINE
ora.node1.vip application OFFLINE OFFLINE
ora....SM2.asm application OFFLINE OFFLINE
ora....E2.lsnr application OFFLINE OFFLINE
ora.node2.gsd application OFFLINE OFFLINE
ora.node2.ons application OFFLINE OFFLINE
ora.node2.vip application OFFLINE OFFLINE
[oracle@node1 bin]$ crs_start -all
Attempting to start `ora.node1.vip` on member `node1`
Attempting to start `ora.node2.vip` on member `node2`
Start of `ora.node1.vip` on member `node1` succeeded.
Start of `ora.node2.vip` on member `node2` succeeded.
Attempting to start `ora.node1.ASM1.asm` on member `node1`
Attempting to start `ora.node2.ASM2.asm` on member `node2`
Start of `ora.node2.ASM2.asm` on member `node2` succeeded.
Attempting to start `ora.RAC.RAC2.inst` on member `node2`
Start of `ora.RAC.RAC2.inst` on member `node2` failed.
node1 : CRS-1018: Resource ora.node2.vip (application) is already running on node2

node1 : CRS-1018: Resource ora.node2.vip (application) is already running on node2

Start of `ora.node1.ASM1.asm` on member `node1` succeeded.
Attempting to start `ora.RAC.RAC1.inst` on member `node1`
Start of `ora.RAC.RAC1.inst` on member `node1` failed.
node2 : CRS-1018: Resource ora.node1.vip (application) is already running on node1

node2 : CRS-1018: Resource ora.node1.vip (application) is already running on node1

CRS-1002: Resource 'ora.node1.ons' is already running on member 'node1'

CRS-1002: Resource 'ora.node2.ons' is already running on member 'node2'

Attempting to start `ora.node1.gsd` on member `node1`
Attempting to start `ora.RAC.db` on member `node1`
Attempting to start `ora.node2.gsd` on member `node2`
Start of `ora.node1.gsd` on member `node1` succeeded.
Start of `ora.node2.gsd` on member `node2` succeeded.
Start of `ora.RAC.db` on member `node1` failed.
Attempting to start `ora.RAC.db` on member `node2`
Start of `ora.RAC.db` on member `node2` failed.
CRS-1006: No more members to consider

CRS-0215: Could not start resource 'ora.RAC.RAC1.inst'.

CRS-0215: Could not start resource 'ora.RAC.RAC2.inst'.

CRS-0215: Could not start resource 'ora.RAC.db'.

CRS-0223: Resource 'ora.node1.LISTENER_NODE1.lsnr' has placement error.

CRS-0223: Resource 'ora.node1.ons' has placement error.

CRS-0223: Resource 'ora.node2.LISTENER_NODE2.lsnr' has placement error.

CRS-0223: Resource 'ora.node2.ons' has placement error.

[oracle@node1 bin]$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....C1.inst application ONLINE OFFLINE
ora....C2.inst application ONLINE OFFLINE
ora.RAC.db application ONLINE OFFLINE
ora....SM1.asm application ONLINE ONLINE node1
ora....E1.lsnr application OFFLINE OFFLINE
ora.node1.gsd application ONLINE ONLINE node1
ora.node1.ons application ONLINE ONLINE node1
ora.node1.vip application ONLINE ONLINE node1
ora....SM2.asm application ONLINE ONLINE node2
ora....E2.lsnr application OFFLINE OFFLINE
ora.node2.gsd application ONLINE ONLINE node2
ora.node2.ons application ONLINE ONLINE node2
ora.node2.vip application ONLINE ONLINE node2
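The second status listing is easier to read once it is reduced to the resources that CRS was told to start but that are still down. The following is a minimal sketch, not something the original session ran; the function name `crs_failed_resources` is made up for illustration, and it assumes the five-column `crs_stat -t` layout shown above (Name, Type, Target, State, Host).

```shell
#!/bin/sh
# Hypothetical helper: read `crs_stat -t` output on stdin and print the names
# of resources whose Target is ONLINE but whose State is OFFLINE.
# Assumes the column order: Name Type Target State Host.
crs_failed_resources() {
    # The header line and the dashed separator never match the two
    # comparisons below, so they are skipped without special handling.
    awk 'NF >= 4 && $3 == "ONLINE" && $4 == "OFFLINE" { print $1 }'
}
```

Typical use would be `crs_stat -t | crs_failed_resources`. Against the listing above it would print the two instance resources and ora.RAC.db. The listeners are not reported, because their Target is still OFFLINE at that point.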
First, try starting the listeners:
[oracle@node1 bin]$ crs_start ora.node1.LISTENER_NODE1.lsnr
Attempting to start `ora.node1.LISTENER_NODE1.lsnr` on member `node1`
Start of `ora.node1.LISTENER_NODE1.lsnr` on member `node1` succeeded.
[oracle@node1 bin]$ crs_start ora.node2.LISTENER_NODE2.lsnr
Attempting to start `ora.node2.LISTENER_NODE2.lsnr` on member `node2`
Start of `ora.node2.LISTENER_NODE2.lsnr` on member `node2` succeeded.
Next, start the two instances. This is where it breaks: the instance cannot be brought up:
[oracle@node1 bin]$ crs_start ora.RAC.RAC1.inst
Attempting to start `ora.RAC.RAC1.inst` on member `node1`
Start of `ora.RAC.RAC1.inst` on member `node1` failed.
node2 : CRS-1018: Resource ora.node1.vip (application) is already running on node1

CRS-0215: Could not start resource 'ora.RAC.RAC1.inst'.
Check the relevant logs.
First, the ASM alert log:
alert_+ASM1.log
Wed Nov 23 15:14:12 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 eth1 192.168.91.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth0 192.168.88.0 configured from OCR for use as a public interface
Picked latch-free SCN scheme 2
Using LOG_ARCHIVE_DEST_1 parameter default value as /opt/app/product/10.2.0/db_1/dbs/arch
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
large_pool_size = 12582912
instance_type = asm
cluster_database = TRUE
instance_number = 1
remote_login_passwordfile= EXCLUSIVE
background_dump_dest = /opt/app/admin/+ASM/bdump
user_dump_dest = /opt/app/admin/+ASM/udump
core_dump_dest = /opt/app/admin/+ASM/cdump
asm_diskgroups = DATA1
Cluster communication is configured to use the following interface(s) for this instance
192.168.91.100
Wed Nov 23 15:14:13 2011
cluster interconnect IPC version:Oracle UDP/IP
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=25132
DIAG started with pid=3, OS id=25134
PSP0 started with pid=4, OS id=25136
LMON started with pid=5, OS id=25138
LMD0 started with pid=6, OS id=25140
LMS0 started with pid=7, OS id=25142
MMAN started with pid=8, OS id=25152
DBW0 started with pid=9, OS id=25154
LGWR started with pid=10, OS id=25156
CKPT started with pid=11, OS id=25158
SMON started with pid=12, OS id=25160
RBAL started with pid=13, OS id=25162
GMON started with pid=14, OS id=25164
Wed Nov 23 15:14:13 2011
lmon registered with NM - instance id 1 (internal mem no 0)
Wed Nov 23 15:14:13 2011
Reconfiguration started (old inc 0, new inc 1)
ASM instance
List of nodes:
0 1
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed Nov 23 15:14:14 2011
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Wed Nov 23 15:14:14 2011
LMS 0: 0 GCS shadows traversed, 0 replayed
Wed Nov 23 15:14:14 2011
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
LCK0 started with pid=15, OS id=25208
Wed Nov 23 15:14:15 2011
SQL> ALTER DISKGROUP ALL MOUNT
Wed Nov 23 15:14:15 2011
NOTE: cache registered group DATA1 number=1 incarn=0x6f877cd9
* allocate domain 1, invalid = TRUE
freeing rdom 1
Received dirty detach msg from node 1 for dom 1
Wed Nov 23 15:14:22 2011
Loaded ASM Library - Generic Linux, version 2.0.4 (KABI_V2) library for asmlib interface
Wed Nov 23 15:14:22 2011
ORA-15186: ASMLIB error function = [asm_open], error = [1], mesg = [Operation not permitted]
Wed Nov 23 15:14:22 2011
ORA-15186: ASMLIB error function = [asm_open], error = [1], mesg = [Operation not permitted]
Wed Nov 23 15:14:23 2011
NOTE: Hbeat: instance first (grp 1)
Wed Nov 23 15:14:27 2011
NOTE: start heartbeating (grp 1)
NOTE: cache opening disk 0 of grp 1: DATA1_0000 path:/dev/raw/raw3
Wed Nov 23 15:14:27 2011
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DATA1_0001 path:/dev/raw/raw4
NOTE: cache mounting (first) group 1/0x6F877CD9 (DATA1)
* allocate domain 1, invalid = TRUE
kjbdomatt send to node 1
Wed Nov 23 15:14:27 2011
NOTE: attached to recovery domain 1
Wed Nov 23 15:14:27 2011
NOTE: starting recovery of thread=1 ckpt=3.315
NOTE: starting recovery of thread=2 ckpt=3.50
WARNING: cache failed to read fn=3 indblk=0 from disk(s): 1
ORA-15196: invalid ASM block header [kfc.c:7910] [endian_kfbh] [3] [2147483648] [0 != 1]
NOTE: a corrupted block was dumped to the trace file
System State dumped to trace file /opt/app/admin/+ASM/udump/+asm1_ora_25219.trc
NOTE: cache initiating offline of disk 1 group 1
WARNING: offlining disk 1.3914828841 (DATA1_0001) with mask 0x3
NOTE: PST update: grp = 1, dsk = 1, mode = 0x6
Wed Nov 23 15:14:27 2011
ERROR: too many offline disks in PST (grp 1)
Wed Nov 23 15:14:27 2011
NOTE: halting all I/Os to diskgroup DATA1
NOTE: active pin found: 0x0x2427ccd0
NOTE: active pin found: 0x0x2427cc64
Abort recovery for domain 1
NOTE: crash recovery signalled OER-15130
ERROR: ORA-15130 signalled during mount of diskgroup DATA1
NOTE: cache dismounting group 1/0x6F877CD9 (DATA1)
Wed Nov 23 15:14:28 2011
kjbdomdet send to node 1
detach from dom 1, sending detach message to node 1
Wed Nov 23 15:14:28 2011
Dirty detach reconfiguration started (old inc 1, new inc 1)
List of nodes:
0 1
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE
0 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Wed Nov 23 15:14:28 2011
freeing rdom 1
Wed Nov 23 15:14:28 2011
WARNING: dirty detached from domain 1
Wed Nov 23 15:14:28 2011
ERROR: diskgroup DATA1 was not mounted
Wed Nov 23 15:14:28 2011
WARNING: PST-initiated MANDATORY DISMOUNT of group DATA1 not performed - group not mounted
Wed Nov 23 15:14:28 2011
Errors in file /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc:
ORA-15001: diskgroup "DATA1" does not exist or is not mounted
[oracle@node1 bdump]$

The following two excerpts show that ASM hit an error while mounting the disk group:
.....
WARNING: cache failed to read fn=3 indblk=0 from disk(s): 1
ORA-15196: invalid ASM block header [kfc.c:7910] [endian_kfbh] [3] [2147483648] [0 != 1]
NOTE: a corrupted block was dumped to the trace file
System State dumped to trace file /opt/app/admin/+ASM/udump/+asm1_ora_25219.trc
NOTE: cache initiating offline of disk 1 group 1
WARNING: offlining disk 1.3914828841 (DATA1_0001) with mask 0x3
.....
Wed Nov 23 15:14:28 2011
WARNING: dirty detached from domain 1
Wed Nov 23 15:14:28 2011
ERROR: diskgroup DATA1 was not mounted
Wed Nov 23 15:14:28 2011
WARNING: PST-initiated MANDATORY DISMOUNT of group DATA1 not performed - group not mounted
Wed Nov 23 15:14:28 2011
Errors in file /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc:
ORA-15001: diskgroup "DATA1" does not exist or is not mounted
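The errors called out in these excerpts can also be pulled out of the alert log mechanically, which helps when the log is long. This is a rough sketch rather than part of the original troubleshooting; the function name `ora_error_summary` is hypothetical, and it relies on `grep -o`, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct ORA- error code found in a log
# file together with how often it occurs, most frequent first.
# $1 = path to the log file, for example alert_+ASM1.log
ora_error_summary() {
    # grep -o emits one matched code per line; uniq -c counts the duplicates.
    grep -o 'ORA-[0-9][0-9]*' "$1" | sort | uniq -c | sort -rn
}
```

Called as `ora_error_summary alert_+ASM1.log`, it lists codes such as ORA-15196 and ORA-15001 with their counts. Lines that spell the code differently, such as `OER-15130`, are not matched.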
Look at the trace files:
cat /opt/app/admin/+ASM/udump/+asm1_ora_25219.trc | less
The following error appears:
******************************************************
*** 2011-11-23 15:14:28.703
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [723], [529336], [529336], [memory leak], [], [], [], []
Current SQL information unavailable - no SGA.

cat /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc| less
/opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /opt/app/product/10.2.0/db_1
System name: Linux
Node name: node1
Release: 2.6.18-164.el5
Version: #1 SMP Tue Aug 18 15:51:54 EDT 2009
Machine: i686
Instance name: +ASM1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 17
Unix process pid: 25521, image: oracle@node1 (B000)

*** SERVICE NAME:() 2011-11-23 15:14:28.679
*** SESSION ID:(33.1) 2011-11-23 15:14:28.679
ORA-15001: diskgroup "DATA1" does not exist or is not mounted

Everything points to the disk group not having been mounted, so try mounting it manually and see what happens:
[oracle@node1 bdump]$ sqlplus /nolog

SQL*Plus: Release 10.2.0.1.0 - Production on Wed Nov 23 15:33:36 2011

Copyright (c) 1982, 2005, Oracle. All rights reserved.

SQL> exit
[oracle@node1 bdump]$ export ORACLE_SID=+ASM1
[oracle@node1 bdump]$ sqlplus /nolog

SQL*Plus: Release 10.2.0.1.0 - Production on Wed Nov 23 15:33:41 2011

Copyright (c) 1982, 2005, Oracle. All rights reserved.

SQL> conn /as sysdba
Connected.
SQL> desc v$asm_diskgroup;
Name Null? Type
----------------------------------------- -------- ----------------------------
GROUP_NUMBER NUMBER
NAME VARCHAR2(30)
SECTOR_SIZE NUMBER
BLOCK_SIZE NUMBER
ALLOCATION_UNIT_SIZE NUMBER
STATE VARCHAR2(11)
TYPE VARCHAR2(6)
TOTAL_MB NUMBER
FREE_MB NUMBER
REQUIRED_MIRROR_FREE_MB NUMBER
USABLE_FILE_MB NUMBER
OFFLINE_DISKS NUMBER
UNBALANCED VARCHAR2(1)
COMPATIBILITY VARCHAR2(60)
DATABASE_COMPATIBILITY VARCHAR2(60)

SQL> set linesize 150
SQL> column name format a30;
SQL> column state format a10;
SQL> select name,state from v$asm_diskgroup;

NAME STATE
------------------------------ ----------
DATA1 DISMOUNTED

As expected, the disk group is not mounted. Try mounting it manually:
SQL> alter diskgroup data1 mount;
alter diskgroup data1 mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15130: diskgroup "DATA1" is being dismounted
ORA-15066: offlining disk "DATA1_0001" may result in a data loss

SQL>
That fails. Check the log:
[oracle@node1 bdump]$ tail -50 alert_+ASM1.log
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DATA1_0001 path:/dev/raw/raw4
NOTE: cache mounting (first) group 1/0x26277CDE (DATA1)
* allocate domain 1, invalid = TRUE
kjbdomatt send to node 1
Wed Nov 23 15:37:49 2011
NOTE: attached to recovery domain 1
Wed Nov 23 15:37:49 2011
NOTE: starting recovery of thread=1 ckpt=3.315
NOTE: starting recovery of thread=2 ckpt=3.50
WARNING: cache failed to read fn=3 indblk=0 from disk(s): 1
ORA-15196: invalid ASM block header [kfc.c:7910] [endian_kfbh] [3] [2147483648] [0 != 1]
NOTE: a corrupted block was dumped to the trace file
System State dumped to trace file /opt/app/admin/+ASM/udump/+asm1_ora_21931.trc
NOTE: cache initiating offline of disk 1 group 1
WARNING: offlining disk 1.3914828843 (DATA1_0001) with mask 0x3
NOTE: PST update: grp = 1, dsk = 1, mode = 0x6
Wed Nov 23 15:37:49 2011
ERROR: too many offline disks in PST (grp 1)
Wed Nov 23 15:37:49 2011
NOTE: halting all I/Os to diskgroup DATA1
NOTE: active pin found: 0x0x2427ccd0
NOTE: active pin found: 0x0x2427cc64
Abort recovery for domain 1
NOTE: crash recovery signalled OER-15130
ERROR: ORA-15130 signalled during mount of diskgroup DATA1
NOTE: cache dismounting group 1/0x26277CDE (DATA1)
Wed Nov 23 15:37:51 2011
kjbdomdet send to node 1
detach from dom 1, sending detach message to node 1
Wed Nov 23 15:37:51 2011
Dirty detach reconfiguration started (old inc 1, new inc 1)
List of nodes:
0 1
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE
0 GCS resources traversed, 0 cancelled
Wed Nov 23 15:37:51 2011
freeing rdom 1
Dirty Detach Reconfiguration complete
Wed Nov 23 15:37:51 2011
WARNING: dirty detached from domain 1
Wed Nov 23 15:37:51 2011
ERROR: diskgroup DATA1 was not mounted
Wed Nov 23 15:37:52 2011
WARNING: PST-initiated MANDATORY DISMOUNT of group DATA1 not performed - group not mounted
Wed Nov 23 15:37:52 2011
Errors in file /opt/app/admin/+ASM/bdump/+asm1_b000_25521.trc:
ORA-15001: diskgroup "DATA1" does not exist or is not mounted
ORA-15001: diskgroup "DATA1" does not exist or is not mounted

The fatal error is ORA-15196: invalid ASM block header, which indicates a corrupted block on the disk.
[oracle@node1 bdump]$ oerr ora 15196
15196, 00000, "invalid ASM block header [%s:%s] [%s] [%s] [%s] [%s != %s]"
// *Cause: ASM encountered an invalid metadata block.
// *Action: Contact Oracle Support Services.
//
[oracle@node1 bdump]$ oerr ora 15001
15001, 00000, "diskgroup \"%s\" does not exist or is not mounted"
// *Cause: An operation failed because the diskgroup specified does not
// exist or is not mounted by the current ASM instance.
// *Action: Verify that the diskgroup name used is valid, that the
// diskgroup exists, and that the diskgroup is mounted by
// the current ASM instance.
//
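There is no way around it. Fortunately this is a test environment, so rebuild:
First remove the database with dbca, then recreate the disk group, and finally recreate the database.
Run the following as root on both nodes. Note that raw3 and raw4 are the devices that the disk group will be created on: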
dd if=/dev/zero of=/dev/raw/raw3 bs=1024 count=4
dd if=/dev/zero of=/dev/raw/raw4 bs=1024 count=4
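The two dd commands overwrite the first 4 KB of each raw device, which is presumably where the ASM disk header lives, so that ASM no longer treats the disks as members of the old group. The wipe cannot be undone. A more cautious variant, which the original session did not use, copies that region to a file first. The sketch below illustrates the idea; the function name `wipe_header_with_backup` and both of its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: save the first 4 KB of a device (or file) to a backup
# file, then zero that region.
# $1 = device path, e.g. /dev/raw/raw3    $2 = backup file path
wipe_header_with_backup() {
    dev=$1
    backup=$2
    # Copy the header out first; stop if the copy fails.
    dd if="$dev" of="$backup" bs=1024 count=4 2>/dev/null || return 1
    # Overwrite only the first 4 KB. conv=notrunc keeps a regular file at its
    # original length; on a device it makes no difference.
    dd if=/dev/zero of="$dev" bs=1024 count=4 conv=notrunc 2>/dev/null
}
```

It would be called as, for example, `wipe_header_with_backup /dev/raw/raw3 /root/raw3.hdr` as root. Trying it on a scratch file before pointing it at a real device is a sensible precaution.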
Then recreate the disk group:
SQL> column header_status format a15;
SQL> column path format a30;
SQL> select header_status,path from v$asm_disk;

HEADER_STATUS PATH
--------------- ------------------------------
CANDIDATE /dev/raw/raw3
CANDIDATE /dev/raw/raw4
UNKNOWN ORCL:VOL2
FOREIGN /dev/raw/raw1
UNKNOWN ORCL:VOL1
FOREIGN /dev/raw/raw2

6 rows selected.

SQL> create diskgroup datadisk1 external redundancy disk '/dev/raw/raw3' name d1 disk '/dev/raw/raw4' name d2;

Diskgroup created.
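Finally, recreate the database with dbca.
After the rebuild I rebooted the virtual machines and the host several times, and the problem did not come back. OK.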
-The End-