Basic Linux system troubleshooting

Because sometimes turning it on and off isn’t enough

There’s a saying I once heard that goes something like, “If you have to turn something on and off, it means you don’t understand what’s going on.” I’m not saying that turning something on and off never fixes it, because in some cases it does. But I’m going to be talking about those cases where a system has crashed and crashing it again isn’t going to magically make it better. So I want to review two common issues, and their resolutions, that every Linux admin should know.

Rotate your logs!

First we need to talk about the basics of filesystems in Linux. While there are a handful of root-level directories that each serve a specific and equally important purpose, we’re really only going to focus on /var for this article. For simplicity’s sake, /var is where the system keeps data that changes as it runs, and its /var/log subdirectory holds the system logs and most application logs.

A problem that tends to occur frequently is that /var runs out of space because logs are being written faster than they can be rotated (we’ll get to what this means in a second). From an application perspective, this happens when an application runs into an error and continually churns out error messages.
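A quick way to confirm that /var really is the culprit is to check how full its filesystem is and then look at the biggest entries under the log directory. This is only a minimal sketch: the helper name largest_entries is my own, and reading everything under /var/log usually requires root.

```shell
#!/bin/sh
# largest_entries DIR [N]: print the N largest entries directly under DIR,
# biggest first, with sizes in kilobytes (hypothetical helper name).
largest_entries() {
    dir=$1
    count=${2:-5}
    # du -sk prints "<size-in-KB><TAB><path>"; sort numerically, largest first.
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# How full is the filesystem that holds /var?
df -h /var

# Which entries under /var/log take the most space?
largest_entries /var/log 5
```

If one application's log directory dominates the list, that is the logrotate entry to look at first.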

I mentioned rotating, and what I mean by this is logrotate. This tool, which ships with most distributions, is essential to maintaining space in your log directory or any directory that files are written to. While many programs install a default logrotate configuration, you may have to tweak it depending on the load your application will be taking. Below we will examine what a logrotate entry looks like.

/var/log/nginx/*.log {
    hourly
    size 256M
    rotate 3
    copytruncate
    # dateext is required for dateformat to take effect
    dateext
    dateformat -%s
    compress
    missingok
}

This file is located in /etc/logrotate.d/nginx. What’s happening in this configuration is the following:

  • /var/log/nginx/*.log — These are the logs we want rotated. In this case we are using a wildcard to say: rotate everything in /var/log/nginx that ends in .log
  • hourly — Rotate the logs hourly. Note that logrotate is usually run once a day from cron, so it has to be scheduled to run every hour for this to take effect
  • size — The size the log file must exceed before it will be rotated (here 256 MB)
  • rotate # — The number of rotated logs to keep before the oldest is deleted
  • copytruncate — Instead of moving the log and creating a new file, this copies the original log file and then truncates it in place. This is for applications that cannot close the log file and must write to it continuously
  • dateformat — This sets the format of the date appended to the copied log file. In this case %s is the number of seconds since the Unix epoch. It only takes effect together with the dateext option
  • compress — This will gzip the rotated log
  • missingok — This tells logrotate to move on to the next file, without raising an error, if a log file is not present

If we were to write out the pseudo logic, it would go something like: “Every hour, check whether x.log exists. If it does and it is larger than 256M, copy and truncate the log, append the system time to the name of the copy, compress it, and delete the oldest copy if there are more than 3. If it doesn’t exist, move on to the next one.”
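To make that pseudo logic concrete, here is a rough shell imitation of it for a single file. This is not how logrotate is implemented, just a sketch of the same steps; the function name rotate_if_big and its arguments are made up for illustration.

```shell
#!/bin/sh
# rotate_if_big LOG MAX_BYTES KEEP
# Hypothetical helper that imitates what the logrotate entry does for one file.
rotate_if_big() {
    log=$1
    max_bytes=$2
    keep=$3

    # missingok: a missing log is not an error, just move on.
    [ -f "$log" ] || return 0

    # size: only rotate once the file has grown past the limit.
    size=$(wc -c < "$log" | tr -d ' ')
    [ "$size" -gt "$max_bytes" ] || return 0

    # dateformat -%s: suffix the copy with seconds since the Unix epoch.
    copy="$log-$(date +%s)"

    # copytruncate: copy the live file, then empty it in place so the
    # writing process can keep its file handle open.
    cp "$log" "$copy"
    : > "$log"

    # compress: gzip the copy (creates "$copy.gz").
    gzip "$copy"

    # rotate N: delete the oldest copies beyond the newest $keep.
    ls -1 "$log"-*.gz | sort -r | tail -n +"$(($keep + 1))" | while read -r old; do
        rm -f "$old"
    done
}

# Example call (commented out so nothing on the system is touched):
#   rotate_if_big /var/log/nginx/access.log $((256 * 1024 * 1024)) 3
```

Anything written between the cp and the truncate is lost, which is the trade-off of the copy-then-truncate approach compared with moving the file and reopening it.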

There are many more options that you can use in your configuration, but you can also use the one above and be okay. Check the man page for more information.
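Before trusting a new entry, I like to have logrotate parse it without touching any files. The -d (debug) flag does a dry run, and -f forces a rotation regardless of the size or time criteria. The wrapper below is a small sketch: check_entry is a name I made up, and the path in the comments assumes the entry lives where the example above does.

```shell
#!/bin/sh
# check_entry CONF: dry-run a logrotate config file (hypothetical helper).
check_entry() {
    conf=$1
    if ! command -v logrotate >/dev/null 2>&1; then
        echo "logrotate is not installed" >&2
        return 127
    fi
    # -d: debug mode, prints what would happen and changes nothing.
    logrotate -d "$conf"
}

# Typical use, as root:
#   check_entry /etc/logrotate.d/nginx
# To force an immediate real rotation once the dry run looks right:
#   logrotate -f /etc/logrotate.d/nginx
```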

Single user mode and fstab triage

If you’re a Linux admin, you’re eventually going to have to mount disks, and one of the most important files when it comes to filesystems is fstab. /etc/fstab tells the operating system about the disks and partitions that it can mount. If a filesystem declaration is wrong and you reboot, the system will not come back up normally. Before we break and recover our test system, let’s quickly review what an fstab entry looks like.

/dev/mapper/VolGroup-lv_root /               ext4   defaults  1 1
UUID=04720a62-47dd-4077-bd8a-6731e31b7abb /boot ext4 defaults 1 2

Each of these two entries has the six fields of an fstab entry, separated by spaces or tabs. Let’s break it down.

  • Disk location — This is the device that holds the filesystem. In this case it’s the logical volume named lv_root. You can also use the disk’s UUID, which we do in the second example for /boot. Best practice is to use the UUID.
  • Filesystem target — This is where the disk specified should be mounted.
  • Filesystem type — This describes the type of filesystem on the device. Examples are ext2/3/4, xfs and swap.
  • Mount options — In most cases you will select defaults, but in cases where you’re mounting a disk for a database you may want to specify options for read/write performance improvements (noatime, nodiratime; see the mount man page for more info).
  • Filesystem Freq — Specifies the dump level, which the dump utility uses when taking backups of the filesystems. If no value is set then it defaults to zero.
  • Filesystem Passno — Used by fsck to decide the order of filesystem checks on boot. The root filesystem should be set to 1 and all other filesystems to 2. If you do not want to run fsck on boot then set this to zero. If no value is given, it defaults to zero.
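If you want to see those six fields labeled for a real file, a few lines of awk will do it. This is just an illustrative sketch: show_fstab_fields is a hypothetical helper, and the labels are my own shorthand for the fields described above.

```shell
#!/bin/sh
# show_fstab_fields FILE: label the six fields of each fstab entry
# (hypothetical helper; FILE defaults to /etc/fstab, "-" reads stdin).
show_fstab_fields() {
    awk '
        # Skip blank lines and comments.
        NF == 0 || $1 ~ /^#/ { next }
        {
            # Fields 5 and 6 default to 0 when they are omitted.
            printf "device=%s mountpoint=%s type=%s options=%s dump=%s passno=%s\n",
                $1, $2, $3, $4, ($5 == "" ? 0 : $5), ($6 == "" ? 0 : $6)
        }
    ' "${1:-/etc/fstab}"
}

# Demo on the two entries from this article, fed in on stdin.
show_fstab_fields - <<'EOF'
/dev/mapper/VolGroup-lv_root /               ext4   defaults  1 1
UUID=04720a62-47dd-4077-bd8a-6731e31b7abb /boot ext4 defaults 1 2
EOF
```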

So that’s how an fstab entry is laid out. Now, if you add a disk that doesn’t exist or even make a typo in your entry, you will break the system when it reboots. Let’s walk through repairing this issue, as it is fairly common in the Linux sysadmin world.

For this example I’m going to add an fstab entry for a filesystem and disk that do not exist.

This is the line we added at the end of the file:

/dev/mapper/VolGroup-lv_breakme /new/dir      ext4 defaults 1 2

This is technically a valid entry; however, the device it names does not exist, so it will break the system on reboot.
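This kind of mistake can be caught before the reboot. On a live system, running mount -a as root right after editing fstab will surface the error, and newer versions of util-linux also ship findmnt --verify. As a rough illustration of the idea, the sketch below walks an fstab file and flags devices that are not there. The helper name is hypothetical, it only understands /dev/... paths and UUID= specs, and it assumes udev populates /dev/disk/by-uuid.

```shell
#!/bin/sh
# check_fstab_devices FILE: report entries whose device cannot be found.
# Hypothetical helper; only handles /dev/... paths and UUID= specs.
check_fstab_devices() {
    missing=0
    while read -r spec mountpoint fstype _rest; do
        case $spec in
            ''|'#'*) continue ;;                          # blank line or comment
            UUID=*)  dev=/dev/disk/by-uuid/${spec#UUID=} ;;
            /dev/*)  dev=$spec ;;
            *)       continue ;;                          # tmpfs, proc, NFS, LABEL=, ...
        esac
        if [ ! -e "$dev" ]; then
            echo "MISSING: $spec (mount point $mountpoint)"
            missing=1
        fi
    done < "${1:-/etc/fstab}"
    return "$missing"
}

# Example, before rebooting:
#   check_fstab_devices /etc/fstab || echo "fix fstab before you reboot"
```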

After saving the file, we reboot our test system and, as expected, the boot fails with an error.

The error is saying that it cannot find /dev/mapper/VolGroup-lv_breakme to mount to /new/dir. Moving forward, there are two ways to fix this issue. You can either enter the root password to drop into maintenance mode or, if you do not know the root password, force the operating system to drop into a shell via the GRUB boot loader. For most production systems you will not know the root password, as it’s probably locked down by security, so I will show you the second option.

First we need to reboot the system so we can access the GRUB boot loader. When the GRUB boot screen appears during the reboot, press any key to interrupt the boot.

Next we are going to edit our kernel options. Press ‘e’ on the OS entry, which is usually the most recent kernel and the top option if more than one is listed, then press ‘e’ again on the kernel line.

Doing so opens an edit line for the kernel command. Press space and append rw init=/bin/bash to the end of the line. When you finish, press enter and it will take you back to the previous screen.


Now continue with booting by pressing ‘b’. This way we can triage the fstab file.

If you did this correctly, the system will boot straight to a bash prompt.

Here’s what’s happening. When Linux boots, the first process the kernel spawns is init, which is responsible for running all of the init scripts. When we pass init=/bin/bash, we are telling the kernel to run a command shell instead of executing its startup scripts. By doing this we drop into a shell before the system attempts to mount the disks listed in fstab.
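To see which init a system was told to start, you can pull the init= parameter out of the kernel command line. The sketch below does that for a string you pass in. The function name is made up, and the sample command line is only an example of what the edited GRUB entry might contain.

```shell
#!/bin/sh
# init_from_cmdline "CMDLINE": print the value of the last init= parameter,
# or nothing if the command line does not set one (hypothetical helper).
init_from_cmdline() {
    found=
    # Unquoted on purpose: split the command line into words on whitespace.
    for word in $1; do
        case $word in
            init=*) found=${word#init=} ;;
        esac
    done
    if [ -n "$found" ]; then
        echo "$found"
    fi
}

# On a running system the real command line is in /proc/cmdline:
#   init_from_cmdline "$(cat /proc/cmdline)"
init_from_cmdline "ro root=/dev/mapper/VolGroup-lv_root rw init=/bin/bash"
# prints /bin/bash
```

On a normally booted system this prints nothing, because the kernel falls back to its default init when no init= is given.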

Note: If you did not add the ‘rw’ flag, the root filesystem will be mounted read-only. To fix this, run the following command to remount it as read/write: mount -o remount,rw /

Now we can run vi /etc/fstab and either fix our typo or comment out the bad entry.
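If you would rather not hand-edit the file in a bare rescue shell, the same fix can be scripted. This is a minimal sketch with a made-up helper name: it keeps a backup next to the original and comments out every entry whose first field matches the device you give it. It assumes awk is available on the root filesystem.

```shell
#!/bin/sh
# comment_out_entry FILE DEVICE: keep a backup of FILE, then comment out
# every entry whose first field is DEVICE (hypothetical helper).
comment_out_entry() {
    file=$1
    device=$2
    # Keep a copy we can fall back to if the edit goes wrong.
    cp -p "$file" "$file.bak" || return 1
    # Read from the backup and rewrite the original, prefixing matches with "#".
    awk -v dev="$device" '$1 == dev { print "#" $0; next } { print }' \
        "$file.bak" > "$file"
}

# In the rescue shell this would be:
#   comment_out_entry /etc/fstab /dev/mapper/VolGroup-lv_breakme
```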

Now we will reboot the system (you might have to hard reset it, as typing reboot/shutdown/exit might cause a kernel panic) and voilà, we are back in business! Fun fact: if you do forget your root password, you can use this method to change it.

These are just two common issues that can occur when managing Linux systems, and I’ll cover more in a future write-up. If you have any questions or if I missed something, leave a comment!
