How “earlycon” saved the day

While bringing up a new platform recently, I was surprised by what I saw (really what I didn’t see) on the serial console – which was nothing. I saw BIOS printouts, I saw the usual GRUB mumblings, but absolutely nothing after GRUB handed control over to the kernel. Being an embedded system, connecting to the BMC remote console wasn’t an option, there were no display ports, and the network wasn’t up. How could one know whether the kernel bootup up, or if hit any problems?

One suspicion, which I was quiet quick to rule out, was that the USB stick was malformed or malfunctioning somehow. The stick worked well on other systems, and it contained a popular distribution (Ubuntu).

Some research seemed to suggest that earlycon may help, I reviewed the kernel docs, which stated the following:

earlycon=       [KNL] Output early console device and options.

                    When used with no options, the early console is determined by the stdout-path property in device tree's chosen node.

            cdns,<addr>[,options]
                    Start an early, polled-mode console on a Cadence (xuartps) serial port at the specified address. Only supported option is baud rate. If baud rate is not specified, the serial port must already be setup and configured.

            uart[8250],io,<addr>[,options]
            uart[8250],mmio,<addr>[,options]
            uart[8250],mmio32,<addr>[,options]
            uart[8250],mmio32be,<addr>[,options]
            uart[8250],0x<addr>[,options]
                    Start an early, polled-mode console on the 8250/16550 UART at the specified I/O port or MMIO address. MMIO inter-register address stride is either 8-bit (mmio) or 32-bit (mmio32 or mmio32be). If none of [io|mmio|mmio32|mmio32be], <addr> is assumed to be equivalent to 'mmio'. 'options' are specified in the same format described for "console=ttyS<n>"; if unspecified, the h/w is not initialized.

<... snip ...>
https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html

I wasn’t sure what earlycon options were needed for my platform, so I tried my luck without any options. The results were encouraging, I got a part of kernel bootup on the serial console, but not the whole thing. At some point the kernel switches over from boot console to the real console – it was after this switch that I lost the boot messages.

Luckily, looking through the code/docs, I learned about another kernel command line option that is specifically there to debug these problems. It is, keep_bootcon, the kernel docs state that:

keep_bootcon [KNL] Do not unregister boot console at start. This is only useful for debugging when something happens in the window between unregistering the boot console and initializing the real console.

I quickly appended this option to the kernel command line, which resulted in the kernel spitting out more messages to the boot console. With that, I was able to peel another layer of this onion, and I was finally able to see what was happening during kernel bootup. I saw the kernel complaining about an “attempt to kill init“, followed by a kernel panic. I knew that an attempt to kill init just meant that the very first PID (which is a special process) was dying for some or the other reason. Since, this was happening in initramfs, it meant that the init process was dying. Adding more debugging to initramfs did not help, so I did the natural thing, which was to override the first process that is launched in initramfs (init) with a bash shell. Luckily, there is also an option for that!

init= [KNL] Format: <full_path> Run specified binary instead of /sbin/init as init process.

I appended init=/bin/bash to the kernel command line, but even that resulted in the kernel complaining about an “attempt to kill init“. It made me a little happier because I could conclude that there was nothing wrong with init or any scripts in my initramfs (not that I needed that confirmation because this stick worked on other computers). This lead me to conclude that something fundamental was broken which was causing processes to die.

My suspicion was it had something to do with the kernel console because it is the most basic device required by almost every service, and because I had been struggling to get anything on mine. I started combing through the kernel messages especially looking for serial console related messages. Upon close examination of which, I found the root cause of all problems. The kernel 8250 serial console driver was creating a ttyS4 serial console, instead of the expected ttyS0. Why could this be happening? Review of the kernel code lead me to the following piece of console allocation code (drivers/tty/serial/8250/8250_core.c):

static struct uart_8250_port *serial8250_find_match_or_unused(const struct uart_port *port)
{
int i;

/*
* First, find a port entry which matches.
*/
for (i = 0; i < nr_uarts; i++)
if (uart_match_port(&serial8250_ports[i].port, port))
return &serial8250_ports[i];

/* try line number first if still available */
i = port->line;
if (i < nr_uarts && serial8250_ports[i].port.type == PORT_UNKNOWN &&
serial8250_ports[i].port.iobase == 0)
return &serial8250_ports[i];
/*
* We didn't find a matching entry, so look for the first
* free entry. We look for one which hasn't been previously
* used (indicated by zero iobase).
*/
for (i = 0; i < nr_uarts; i++)
if (serial8250_ports[i].port.type == PORT_UNKNOWN &&
serial8250_ports[i].port.iobase == 0)
return &serial8250_ports[i];

/*
* That also failed. Last resort is to find any entry which
* doesn't have a real port associated with it.
*/
for (i = 0; i < nr_uarts; i++)
if (serial8250_ports[i].port.type == PORT_UNKNOWN)
return &serial8250_ports[i];

return NULL;
}

It seemed as if first 4 uarts were skipped over because the serial8250_ports[i].port.type was not PORT_UNKNOWN. But what were those first 4 ports that were skipped over? I reviewed how nr_uarts and serial8250_ports get set. Which lead me to the following lines (arch/x86/include/asm/serial.h):

#define SERIAL_PORT_DFNS \
/* UART CLK PORT IRQ FLAGS */ \
{ .uart = 0, BASE_BAUD, 0x3F8, 4, STD_COMX_FLAGS }, /* ttyS0 */ \
{ .uart = 0, BASE_BAUD, 0x2F8, 3, STD_COMX_FLAGS }, /* ttyS1 */ \
{ .uart = 0, BASE_BAUD, 0x3E8, 4, STD_COMX_FLAGS }, /* ttyS2 */ \
{ .uart = 0, BASE_BAUD, 0x2E8, 3, STD_COM4_FLAGS }, /* ttyS3 */

This is where the first 4 serial consoles were coming from leading to my real serial console getting the wrong index (ttyS4). The obvious solution was to force the “last resort” in serial8250_find_match_or_unused by changing a kernel compilation option, which was, CONFIG_SERIAL_8250_RUNTIME_UARTS. Setting this to any number <= 4 ensured that the predefined consoles in arch/x86/include/asm/serial.h get ignored by the 8250 allocation code, and that my real console gets the desired index.

Author’s note: These “quick and dirty” articles are written to help a forlorn and shipwrecked brother. It is hoped that any spelling or grammatical errors will be generously overlooked by the gracious reader.


Posted

in

by

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *