http://fcp.surfsite.org/modules/smartfaq/faq.php?faqid=1255
http://www.redhatmagazine.com/2007/08/15/a-quick-overview-of-linux-kernel-crash-dump-analysis/
http://people.redhat.com/anderson/crash_whitepaper/
Saving crash dumps is often useful for postmortem analysis of Linux panics.
Linux is not particularly popular when we (read: kernel engineers) talk about kernel debugging. Linus always believed in writing perfect software, huh! Anyway, here is one way of dumping the memory so that it can be analyzed at a later point and something useful learned from it :)
We can configure diskdump to dump the memory when Linux crashes. The dumps generated by diskdump are kept in the /var/crash directory under the name vmcore, and these dumps can be analyzed with the crash utility. The only requirement of this whole setup is that crash needs the 'vmlinux' file of the kernel that panicked.
Setup diskdump:-
Install diskdump packages - ftp://rpmfind.net/linux/fedora/core/updates/3/i386/diskdumputils-1.1.7-3.i386.rpm
Compile the kernel with debug info ON (usually the CONFIG_DEBUG_INFO option; a web search gives the details)
Install the new kernel and save the vmlinux file in a known location
Boot with the compiled kernel.
modprobe diskdump <-------- load the diskdump module
Configure a dump device with the diskdumpfmt command
One can configure a swap partition for diskdump ...
$ diskdumpfmt -cv /dev/sda3 <------- Will check if device can be configured or not
$ diskdumpfmt -fv /dev/sda3 <-------- Will format the device for diskdump
$ vi /etc/sysconfig/diskdump and put DEVICE=/dev/sda3
$ service diskdump restart
$ cat /proc/diskdump <---- will show all parameters of diskdump configuration (it should show the device also) like -
# sample_rate: 8
# block_order: 2
# fallback_on_err: 1
# allow_risky_dumps: 1
# dump_level: 0
# compress: 0
# total_blocks: 259722
#
sda3 128520063 4016187
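A quick sanity check after this step is to compare the DEVICE= entry with what /proc/diskdump reports. The following is only a sketch: the function names are made up for illustration (they are not part of diskdumputils), and it assumes /proc/diskdump lists registered devices on lines that do not start with '#', as in the sample output above.

```shell
# Hypothetical helpers, not part of diskdumputils.

# Print the DEVICE= value from a sysconfig-style file.
configured_device() {
    sed -n 's/^DEVICE=//p' "$1" | tail -n 1
}

# Print the device names listed in /proc/diskdump-style text:
# non-empty lines whose first field does not start with '#'.
registered_devices() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Succeed only if the configured device (e.g. /dev/sda3) shows up
# in the registered list (e.g. sda3).
dump_device_registered() {
    local conf="${1:-/etc/sysconfig/diskdump}"
    local proc="${2:-/proc/diskdump}"
    local dev
    dev=$(configured_device "$conf")
    [ -n "$dev" ] || return 1
    registered_devices "$proc" | grep -qxF "${dev#/dev/}"
}
```

Usage: dump_device_registered && echo "dump device is registered"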
Now that diskdump is configured, the next step is to test whether a dump is really collected.
$ vi /etc/sysctl.conf
# This will enable the sysrq keys for forcibly panicing the system etc.
kernel.sysrq = 1
# following line will force the system to reboot after 5 seconds if the system crashes after some panic. By default the system halts after it panics.
kernel.panic=5
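Before crashing the box it is worth double-checking that both settings really are in the file. Here is a small sketch for that; sysctl_conf_value is a hypothetical helper name, and it only reads the file, it does not apply anything (running sysctl -p as root loads /etc/sysctl.conf without a reboot).

```shell
# Hypothetical helper: print the last value assigned to a key in a
# sysctl.conf-style file, ignoring comment lines.
sysctl_conf_value() {
    local key="$1"
    local file="${2:-/etc/sysctl.conf}"
    awk -F= -v key="$key" '
        /^[ \t]*[#;]/ { next }              # skip comments
        {
            k = $1; v = $2
            gsub(/[ \t]/, "", k)            # "kernel.sysrq " -> "kernel.sysrq"
            gsub(/^[ \t]+|[ \t]+$/, "", v)  # trim the value
            if (k == key) val = v
        }
        END { if (val != "") print val }
    ' "$file"
}
```

Usage: sysctl_conf_value kernel.panic should print 5 with the settings above.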
* Crash the system -
echo c > /proc/sysrq-trigger <-- This will crash the system forcibly
OR
press the sysrq key combination --
Alt + SysRq + c # all keys pressed together.
This will crash the system. Note that by putting “kernel.panic=5” the system will reboot automatically after 5 seconds.
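Since this really does take the machine down, one option is to wrap the trigger in a guard so it cannot fire by accident. A minimal sketch follows; the function name and the CONFIRM_CRASH variable are invented for this example, and the trigger path is a parameter only so the wrapper can be tried against an ordinary file first.

```shell
# Hypothetical wrapper: write 'c' to the sysrq trigger only when the
# caller has explicitly set CONFIRM_CRASH=yes (an illustrative name).
trigger_crash() {
    local trigger="${1:-/proc/sysrq-trigger}"
    if [ "${CONFIRM_CRASH:-no}" != "yes" ]; then
        echo "refusing to crash: set CONFIRM_CRASH=yes to proceed" >&2
        return 1
    fi
    sync                  # flush dirty buffers before the panic
    echo c > "$trigger"   # on the real trigger the system crashes here
}
```

Usage (as root): CONFIRM_CRASH=yes trigger_crash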
While booting up, since diskdump is configured, the system saves the crash dump stored on the dump device (swap in our case) and then comes up.
Analyzing the Crash
Install crash package -- ftp://rpmfind.net/linux/fedora/core/4/i386/os/Fedora/RPMS/crash-3.10-13.i386.rpm
Go to the /var/crash/ directory; you will find the vmcore file here.
file vmcore <-- will show that it is a core file
strings vmcore | grep 'Linux' <-- will show which vmlinux (kernel version) generated this dump
Run crash /boot/vmlinux /var/crash/vmcore
This will give you enough information about the core.
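If there are several dumps lying around, a tiny helper can list them and print the matching crash invocation. This is just a sketch: the function names are made up, the vmlinux path is whatever location you saved it in earlier, and the commands are only printed, not executed.

```shell
# Hypothetical helpers for locating dumps.

# Print every file named vmcore under a dump directory, one per line.
find_vmcores() {
    find "${1:-/var/crash}" -type f -name vmcore 2>/dev/null | sort
}

# Print (not run) a crash command line for each dump found.
crash_commands() {
    local vmlinux="$1"
    local dir="${2:-/var/crash}"
    find_vmcores "$dir" | while IFS= read -r core; do
        printf 'crash %s %s\n' "$vmlinux" "$core"
    done
}
```

Usage: crash_commands /boot/vmlinux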
Some commands are -
Commonly Used Crash Commands
There are many commands in crash. It is also possible to extend crash by adding new commands, by writing new code and compiling it into the crash executable, or creating a shared object library that can be dynamically loaded by using the extend command. The following are some commonly used crash commands that you will likely use:
help – get help
crash has readily available help information built into the utility, shown by typing help. Each command has its own man-like page, which can be viewed by typing help command-name.
crash> help
*              files          mod            runq           union
alias          foreach        mount          search         vm
ascii          fuser          net            set            vtop
bt             gdb            p              sig            waitq
btop           help           ps             struct         whatis
dev            irq            pte            swap           wr
dis            kmem           ptob           sym            q
eval           list           ptov           sys
exit           log            rd             task
extend         mach           repeat         timer
crash version: 4.0-3.3   gdb version: 6.1
For help on any command above, enter "help <command>".
For help on input options, enter "help input".
For help on output options, enter "help output".
Tip: all the crash commands can be piped to external programs or redirected to files:
crash> log > log.txt
This will send the in-kernel log to a local file called log.txt.
crash> ps | fgrep bash | wc -l
This will count the number of bash tasks that were running.
sys – system data
crash> sys
      KERNEL: /usr/lib/debug/lib/modules/2.6.9-22.EL/vmlinux
    DUMPFILE: /home/eteo/crash/127.0.0.1-2007-04-30-21:38/vmcore
        CPUS: 1
        DATE: Mon Apr 30 21:38:40 2007
      UPTIME: 00:04:04
LOAD AVERAGE: 0.36, 0.23, 0.08
       TASKS: 36
    NODENAME: localhost.localdomain
     RELEASE: 2.6.9-22.EL
     VERSION: #1 Mon Sep 19 18:20:28 EDT 2005
     MACHINE: i686 (1862 Mhz)
      MEMORY: 1 GB
       PANIC: "Oops: 0002 [#1]" (check log for details)
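Because crash output can be redirected to a file (sys > sys.txt), single fields from the output above are easy to pull out with standard tools. A sketch, assuming the "LABEL: value" layout shown above; sys_field is a hypothetical helper that reads the saved output on stdin.

```shell
# Hypothetical helper: print the value of one field (e.g. PANIC or
# RELEASE) from saved "crash> sys" output read on stdin.
sys_field() {
    awk -v name="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)            # labels are right-aligned
            prefix = name ":"
            if (index(line, prefix) == 1) {     # line starts with "NAME:"
                value = substr(line, length(prefix) + 1)
                sub(/^[ \t]+/, "", value)
                print value
                exit
            }
        }'
}
```

Usage: sys_field PANIC < sys.txt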
The sys output has information about the system (e.g. kernel release, kernel version, number of CPUs, amount of memory, etc.), the time the vmcore was taken, the operating period (uptime), and the panic (e.g. oops type, panic task/PID/command, etc.).
bt – backtrace
crash> bt
PID: 2857  TASK: f7b677f0  CPU: 0  COMMAND: "bash"
 #0 [f7191e04] start_disk_dump at f89d7bb3
 #1 [f7191e18] die at c010682e
 #2 [f7191e48] do_page_fault at c011ab00
 [...]
 #9 [f7191fc0] system_call at c030f918
    EAX: 00000004  EBX: 00000001  ECX: b7de7000  EDX: 00000002
    DS:  007b      ESI: 00000002  ES:  007b      EDI: b7de7000
    SS:  007b      ESP: bfe01650  EBP: bfe01670
    CS:  0073      EIP: 003297a2  ERR: 00000004  EFLAGS: 00000246
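When a backtrace is long, it can help to reduce it to just the chain of function names. A sketch, assuming frames look like "#N [address] function at address" as in the output above; bt_functions is a made-up name and reads saved bt output on stdin.

```shell
# Hypothetical helper: print the function name of each frame in saved
# "crash> bt" output, innermost frame first.
bt_functions() {
    awk '$1 ~ /^#[0-9]+$/ && $2 ~ /^\[/ { print $3 }'
}
```

Usage: redirect the backtrace with bt > bt.txt inside crash, then run bt_functions < bt.txt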
log – dump system message buffer
crash> log
[...]
SysRq : Crashing the kernel by request
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c0233fa7
*pde = 3e9f3067
Oops: 0002 [#1]
Modules linked in: md5 ipv6 autofs4 i2c_dev i2c_core sunrpc scsi_dump diskdump dm_mirror dm_mod button battery ac yenta_socket pcmcia_core uhci_hcd ehci_hcd shpchp snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore ipw2200 ieee80211 ieee80211_crypt tg3 floppy ext3 jbd ata_piix libata sd_mod scsi_mod
CPU:    0
EIP:    0060:[]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-22.EL)
EIP is at sysrq_handle_crash+0x0/0x8
eax: 00000063   ebx: c0370db4   ecx: 00000000   edx: 00000000
esi: 00000063   edi: 00000000   ebp: 00000000   esp: f7191f60
ds: 007b   es: 007b   ss: 0068
Process bash (pid: 2857, threadinfo=f7191000 task=f7b677f0)
Stack: c02342d8 c032dc4e c032f105 00000003 00000002 f7b6adc0 00000002 f7191fac
       c01a8a13 c0362740 c0168205 f7191fac b7de7000 f7b6adc0 fffffff7 b7de7000
       f7191000 c01682cf f7191fac 00000000 00000000 00000000 00000001 00000002
Call Trace:
 [] __handle_sysrq+0x58/0xc6
 [] write_sysrq_trigger+0x23/0x29
 [] vfs_write+0xb6/0xe2
 [] sys_write+0x3c/0x62
 [] syscall_call+0x7/0xb
Code: 4c 11 42 c0 05 00 00 00 c7 05 50 11 42 c0 2f cc 31 c0 c7 05 54 11 42 c0 00 00 00 00 c7 05 58 11 42 c0 00 00 00 00 e9 e5 0b f0 ff 05 00 00 00 00 00 c3 e9 e1 59 f3 ff e9 1e bc f3 ff 85 d2 89
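The number on the "Oops:" line is worth a second look. For a page fault on i386 it is the fault error code, whose low three bits say whether the page was missing or protected, whether the access was a read or a write, and whether it came from kernel or user mode. A small decoder sketch (decode_oops_code is a hypothetical name; only these three bits are handled):

```shell
# Hypothetical helper: decode the low three bits of an i386 page-fault
# error code given in hex, as printed on the "Oops: NNNN" line.
#   bit 0: 0 = page not present, 1 = protection fault
#   bit 1: 0 = read,             1 = write
#   bit 2: 0 = kernel mode,      1 = user mode
decode_oops_code() {
    local code=$((0x$1))
    local present access mode
    if [ $((code & 1)) -ne 0 ]; then present="protection fault"; else present="page not present"; fi
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $((code & 4)) -ne 0 ]; then mode="user mode"; else mode="kernel mode"; fi
    echo "$access, $mode, $present"
}
```

For the dump above, decode_oops_code 0002 reports a kernel-mode write to a page that is not present, which fits a NULL pointer write.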
The log command dumps the kernel log buffer contents in chronological order. This is similar to what you would see when you type dmesg on a running machine. This is useful when you want to look at the panic or oops message. An oops is triggered by some exception. It is a dump of the CPU registers' state and kernel stack at that instant. From the panic message, we can find hints as to how the panic was triggered (e.g. the function or process or pid or command or address that triggered the panic), the registers' information, the kernel module list, whether the kernel is tainted with proprietary kernel modules loaded, and so on. Let's walk through the panic message to see what we can learn from it. See the comments below each section within the log:
crash> log
[...]
SysRq : Crashing the kernel by request <-- this panic is intentional
Unable to handle kernel NULL pointer dereference at virtual address 00000000
This is the address to which reference was attempted.
printing eip: c0233fa7
This is the address at which the failure occurred.
*pde = 3e9f3067
Oops: 0002 [#1]
Often one oops will trigger more; only the first is reliable.
Modules linked in: md5 ipv6 autofs4 i2c_dev i2c_core sunrpc scsi_dump diskdump dm_mirror dm_mod button battery ac yenta_socket pcmcia_core uhci_hcd ehci_hcd shpchp snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore ipw2200 ieee80211 ieee80211_crypt tg3 floppy ext3 jbd ata_piix libata sd_mod scsi_mod
CPU:    0
EIP:    0060:[]    Not tainted VLI
The first part is the code segment and instruction address. If tainted, it will be followed by:
G – All modules loaded have a GPL or compatible license
P – Proprietary modules loaded
F – Module forcibly loaded
S – Oops on an SMP kernel running on hardware that is not certified as SMP capable
R – Module forcibly unloaded
M – Machine Check Exception (MCE) occurred
etc (see Further readings section).
EFLAGS: 00010246   (2.6.9-22.EL)
This line shows the program status flags; the register contents follow it.
f7191fac f7191000 c01682cf f7191fac 00000000 00000000 00000000 00000001 00000002
Call Trace:
This is the backtrace of function calls.
 [] __handle_sysrq+0x58/0xc6
 [] write_sysrq_trigger+0x23/0x29
 [] vfs_write+0xb6/0xe2
 [] sys_write+0x3c/0x62
 [] syscall_call+0x7/0xb
Code: 4c 11 42 c0 05 00 00 00 c7 05 50 11 42 c0 2f cc 31 c0 c7 05 54 11 42 c0 00 00 00 00 c7 05 58 11 42 c0 00 00 00 00 e9 e5 0b f0 ff 05 00 00 00 00 00 c3 e9 e1 59 f3 ff e9 1e bc f3 ff 85 d2 89
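The taint letters listed in the walkthrough above are terse, so a lookup function can spell them out when you meet a "Tainted:" string in a log. This sketch covers only the letters in that list; describe_taint is an illustrative name, and any other letter is reported as not covered rather than guessed at.

```shell
# Hypothetical helper: describe each taint flag letter in a string
# such as "PF". Only the letters from the list above are known.
describe_taint() {
    local flags="$1"
    local ch
    while [ -n "$flags" ]; do
        ch="${flags%"${flags#?}"}"   # first character
        flags="${flags#?}"           # remainder
        case "$ch" in
            G) echo "G: all loaded modules have a GPL or compatible license" ;;
            P) echo "P: proprietary module loaded" ;;
            F) echo "F: module forcibly loaded" ;;
            S) echo "S: SMP kernel on hardware not certified for SMP" ;;
            R) echo "R: module forcibly unloaded" ;;
            M) echo "M: machine check exception occurred" ;;
            " ") ;;                  # padding between flags
            *) echo "$ch: not in the list above" ;;
        esac
    done
}
```

Usage: describe_taint "PF"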
From the "printing eip" line (c0233fa7), we can see the address at which the failure occurred. Issuing the following command can give us more hints as to which function, source line or assembly statement in the kernel triggered it:
crash> dis -lr c0233fa7
/usr/src/build/614745-i686/BUILD/kernel-2.6.9/linux-2.6.9/drivers/char/sysrq.c: 115
0xc0233fa7 : movb $0x0,0x0
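If you have saved the log to a file (log > log.txt), the faulting address can be picked out automatically and turned into the dis command used above. A sketch, assuming the address follows the "printing eip:" marker either on the same line or on the next one; both function names are made up.

```shell
# Hypothetical helper: print the address that follows "printing eip:"
# in saved log text read on stdin.
oops_eip() {
    awk '
        BEGIN { marker = "printing eip:" }
        pending && NF { print $1; exit }      # address on the next line
        {
            idx = index($0, marker)
            if (idx > 0) {
                rest = substr($0, idx + length(marker))
                if (split(rest, w, " ") > 0) { print w[1]; exit }
                pending = 1
            }
        }'
}

# Print (not run) the crash command that disassembles up to that address.
dis_command() {
    local addr
    addr=$(oops_eip)
    [ -n "$addr" ] && echo "dis -lr $addr"
}
```

Usage: dis_command < log.txt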
ps – display process status information
crash> ps
   PID    PPID  CPU   TASK    ST  %MEM   VSZ   RSS  COMM
      0      0   0  c0358be0  RU   0.0     0     0  [swapper]
      1      0   0  f7e01770  IN   0.1  1680   684  init
[...]
   2380      1   0  f7ac2800  IN   0.0  1604   504  mingetty
   2769   2371   0  f7ac3970  IN   0.2  5740  1636  bash
   2852      1   0  f7b1a880  IN   0.2  4240  2012  sshd
   2855   2852   0  f7b66680  IN   0.3  8316  2756  sshd
>  2857   2855   0  f7b677f0  RU   0.2  6260  1628  bash
Sometimes it is useful to know which process belongs to which parent or vice versa. ps has -c and -p to show the child and parent processes.
crash> ps -p 2857
PID: 0      TASK: c0358be0  CPU: 0  COMMAND: "swapper"
PID: 1      TASK: f7e01770  CPU: 0  COMMAND: "init"
PID: 2852   TASK: f7b1a880  CPU: 0  COMMAND: "sshd"
PID: 2855   TASK: f7b66680  CPU: 0  COMMAND: "sshd"
PID: 2857   TASK: f7b677f0  CPU: 0  COMMAND: "bash"
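In the ps listing, the line that starts with ">" is the task that was running on a CPU when the dump was taken. With many tasks, filtering for that marker is quicker than scanning by eye. A sketch under that assumption (active_tasks is a hypothetical name; it reads saved ps output on stdin):

```shell
# Hypothetical helper: print "PID COMM" for every task marked with '>'
# in saved "crash> ps" output.
active_tasks() {
    awk '/^>/ {
        sub(/^>[ \t]*/, "")    # drop the marker; awk re-splits the fields
        print $1, $NF
    }'
}
```

Usage: run ps > ps.txt inside crash, then active_tasks < ps.txt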
files – open files
crash> files
PID: 2857  TASK: f7b677f0  CPU: 0  COMMAND: "bash"
ROOT: /    CWD: /root
 FD    FILE     DENTRY    INODE    TYPE  PATH
  0  f7a6e7c0  f7790198  f7b0fdcc  CHR   /dev/pts/0
  1  f7b6adc0  f7190130  f7b9ca4c  REG   /proc/sysrq-trigger
  2  f7a6e7c0  f7790198  f7b0fdcc  CHR   /dev/pts/0
 10  f7a6e7c0  f7790198  f7b0fdcc  CHR   /dev/pts/0
255  f7a6e7c0  f7790198  f7b0fdcc  CHR   /dev/pts/0
crash> files 2852
PID: 2852  TASK: f7b1a880  CPU: 0  COMMAND: "sshd"
ROOT: /    CWD: /
 FD    FILE     DENTRY    INODE    TYPE  PATH
  0  f7b336c0  f78001d8  f7cb1ba4  CHR   /dev/null
  1  f7b336c0  f78001d8  f7cb1ba4  CHR   /dev/null
  2  f7b336c0  f78001d8  f7cb1ba4  CHR   /dev/null
  3  f7b69600  f7bf5280  f7aadafc  SOCK  socket:/[6277]
dev – device data
crash> help dev
[...]
If no argument is entered, this command dumps the contents of the chrdevs and blkdevs arrays.
crash> dev
CHRDEV  NAME        OPERATIONS
   1    mem         (none)
   4    /dev/vc/0   (none)
   4    tty         (none)
[...]
BLKDEV  NAME        OPERATIONS
   1    ramdisk     c0376d08
   2    fd          (unknown)
   8    sd          f880e070
Read the crash white paper for more crash commands -- http://people.redhat.com/anderson/crash_whitepaper/