ChibiOS Debug guide

One of the most important features of an operating system is the support it gives developers for debugging application code. ChibiOS RT and NIL offer several mechanisms that can help in the debug phase of the development cycle.

Problems with debugging Embedded Code

Debugging embedded code is, in general, difficult because of the inherent nature of the task:

  • Often the system dies together with the application under debug.
  • Problems are usually intermittent and hard to catch.
  • A problem does not necessarily have an obvious cause; more likely there is a hidden reason.

Kinds of Malfunctions

There are several ways for a system to fail. First, let's define some categories of system anomalies.

System Crashed

A crash is defined as the system going into an exception vector where it is usually stopped.

RT and NIL are static, which means that they are inherently robust. If you are experiencing a crash then the cause is probably external to the system. In a static system the most common causes of a crash are:

  • Memory Violations. This is the most obvious cause: if you access invalid addresses or use an invalid memory alignment then the CPU goes directly into an exception.
  • Stack Overflows. The system has a stack for each thread and a stack for exceptions, and all stacks must be sized correctly. An overflowing stack can corrupt critical data structures or other stacks, which can make the CPU jump to unmapped addresses and get trapped in an exception.
  • Invalid Calls. Calling an OS function from the wrong context can lead to a crash. Not all OS functions can be invoked from any context, for example wait-like functions cannot be called from an ISR. Each OS has rules to follow. Calling functions from an invalid context is a very common cause of errors, which are often random and hard to catch.
  • Function Pointers. If you have function pointers in the code, for example callbacks, then this can be a way for the system to go out of control. A pointer could contain NULL or an invalid address.
  • Undefined ISRs. If an interrupt is triggered for which there isn't an ISR then the system considers it an exception and goes into the unhandled exceptions code (a closed loop).
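
As an illustration of the function pointer case above, here is a minimal sketch of a guarded callback call. The names event_cb_t, dispatch_event and record_event are hypothetical and exist only for this example. Note that the guard catches a NULL pointer only; a pointer holding some other invalid address would still get through.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type, used only for this illustration. */
typedef void (*event_cb_t)(int event);

/* Sample callback: remembers the last event it received. */
static int last_event = -1;

static void record_event(int event) {
  last_event = event;
}

/* Invokes the callback only if one has been registered.
   Returns true if the callback was called, false if it was NULL. */
static bool dispatch_event(event_cb_t cb, int event) {
  if (cb == NULL) {
    /* In a debug build this is a natural place for an assertion or a halt. */
    return false;
  }
  cb(event);
  return true;
}
```

Calling dispatch_event(NULL, 3) returns false without jumping anywhere, while dispatch_event(record_event, 7) runs the callback and returns true.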

System Stuck

This condition occurs when the system (application and/or OS) executes the same code over and over without exiting. For example you could see the system traversing the threads ready list without ever exiting the loop because of a list corruption.

If you observe such a condition within the system code then one of the following causes should be considered.

  • Invalid Calls. Calling OS functions in an improper way can lead to corruption of the internal data structures. The effect is a deadlock in the system code but the cause is external.
  • Infinite Interrupts. When writing an ISR it is important to reset the IRQ source, otherwise the IRQ will re-trigger the ISR as soon as it exits, blocking the system.
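
The infinite interrupt case can be sketched on a host machine with a simulated peripheral. Everything here is an assumption made for illustration: fake_status_reg stands in for a memory-mapped status register, FAKE_RX_PENDING for its pending flag, and run_pending() for the interrupt controller. Real peripherals differ in how a source is cleared (writing a flag, reading a data register and so on), so check the reference manual of your device.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped status register (hypothetical). */
static volatile uint32_t fake_status_reg;

#define FAKE_RX_PENDING (1U << 0)

static unsigned rx_events;

/* Sketch of an ISR body: service the event, then clear its source. */
static void fake_rx_isr(void) {
  if ((fake_status_reg & FAKE_RX_PENDING) != 0U) {
    rx_events++;
    /* Acknowledge the source. Without this line the flag stays set and
       the ISR would be entered again immediately after returning. */
    fake_status_reg &= ~FAKE_RX_PENDING;
  }
}

/* Models the interrupt controller: keeps entering the ISR while the
   source is pending, up to max_entries. Returns the number of entries. */
static unsigned run_pending(unsigned max_entries) {
  unsigned entries = 0;

  while (((fake_status_reg & FAKE_RX_PENDING) != 0U) &&
         (entries < max_entries)) {
    fake_rx_isr();
    entries++;
  }
  return entries;
}
```

With the acknowledge in place a single pending event causes a single ISR entry. If the acknowledge line is removed, run_pending() only stops because of its max_entries bound, which real hardware does not have.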

System Halted

Halting is a voluntary action of the OS or the application. When an anomalous condition is detected the system calls a special handler that enters the halt state. This is done by calling chSysHalt() in RT or NIL.

Halts are a good thing because:

  • It means that an error has been detected and has not gone unnoticed.
  • From the halt handler it is possible to examine the stack trace and the various other system structures and get a good idea about the nature of the problem.

All ChibiOS debug mechanisms trigger halts in order to signal the developer that a problem has been detected.

System Misbehaving

This happens when the system does not die but behaves in an unexpected way. This could be either a good or a bad thing. Good because it is possible to use the debugger and try to understand the issue. Bad because the system itself has not found a problem.

Debug First Actions

During development you are supposed to configure the OS with debug options enabled and compile the code with optimizations disabled (-O0 for GCC users).

Debug Configuration Settings

Now let's see all the available debug options and how they are supposed to help us. All options are located in the file chconf.h, which is usually found in the ./cfg directory of your project source tree.
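
As a starting point, a debug-oriented chconf.h could set the options described in this section as follows. This is only a sketch: the choice of trace mask is an assumption, and the settings should be edited in place in your existing chconf.h rather than appended to it.

```c
/* chconf.h excerpt: a possible debug baseline during development. */
#define CH_DBG_SYSTEM_STATE_CHECK           TRUE
#define CH_DBG_ENABLE_CHECKS                TRUE
#define CH_DBG_ENABLE_ASSERTS               TRUE
#define CH_DBG_TRACE_MASK                   CH_DBG_TRACE_MASK_ALL  /* RT only */
#define CH_DBG_ENABLE_STACK_CHECK           TRUE
#define CH_DBG_FILL_THREADS                 TRUE
```

All of these checks cost code space and execution time, so they are normally turned off again for release builds.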

CH_DBG_SYSTEM_STATE_CHECK

This is probably the most important debug option: it makes sure that all the RTOS functions are called from the proper context. If an invalid function call is detected then the system is stopped and the global variable ch.dbg.panic_msg points to an error string. The possible error codes are:

  • SV#1. The function chSysDisable() has been called from an ISR or from within a critical section.
  • SV#2. The function chSysSuspend() has been called from an ISR or from within a critical section.
  • SV#3. The function chSysEnable() has been called from an ISR or from within a critical section.
  • SV#4. The function chSysLock() has been called from an ISR or from within a critical section. This can happen, for example, if you call chSysLock() twice.
  • SV#5. The function chSysUnlock() has been called from an ISR or from within a critical section. This can happen, for example, if you call chSysUnlock() twice or without calling chSysLock() beforehand.
  • SV#6. The function chSysLockFromISR() has not been called from an ISR or has been called from within a critical section. This can happen, for example, if you call chSysLockFromISR() twice.
  • SV#7. The function chSysUnlockFromISR() has not been called from an ISR or it has been called outside a critical section. This can happen, for example, if you call chSysUnlockFromISR() twice or without calling chSysLockFromISR() beforehand.
  • SV#8. The macro CH_IRQ_PROLOGUE() has not been placed at the very beginning of an ISR or it has been placed within a critical section.
  • SV#9. The macro CH_IRQ_EPILOGUE() has not been placed at the very end of an ISR, or it has been placed within a critical section, or CH_IRQ_PROLOGUE() is missing from the ISR.
  • SV#10. An I-class function has been called from outside a critical section.
  • SV#11. An S-class function has been called from outside a critical section or has been called from an ISR.

You can see that this option is extremely useful because it allows you to catch very common usage errors very early in the development process. Note that “SV#” stands for “State Violation”. System states are checked very strictly in ChibiOS RT and NIL.
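
To make the idea behind these checks more concrete, here is a toy model of a state checker that can be run on a host machine. It is not the ChibiOS implementation: all the names (toy_lock, toy_unlock, toy_check_class_i, toy_halt) and the two counters are hypothetical, and only the SV#4, SV#5 and SV#10 rules from the list above are modeled.

```c
#include <stddef.h>

/* Toy model only: names and layout are hypothetical, not the ChibiOS sources. */
static int isr_cnt;             /* ISR nesting depth, always 0 in this sketch */
static int lock_cnt;            /* 1 while inside a critical section          */
static const char *panic_msg;   /* set on the first detected violation        */

static void toy_halt(const char *msg) {
  if (panic_msg == NULL) {
    panic_msg = msg;            /* a real halt would also stop the system */
  }
}

/* Thread-level lock: illegal from an ISR or when already locked. */
static void toy_lock(void) {
  if ((isr_cnt != 0) || (lock_cnt != 0)) {
    toy_halt("SV#4");
  }
  lock_cnt = 1;
}

/* Thread-level unlock: illegal from an ISR or when not locked. */
static void toy_unlock(void) {
  if ((isr_cnt != 0) || (lock_cnt <= 0)) {
    toy_halt("SV#5");
  }
  lock_cnt = 0;
}

/* Check placed at the start of every I-class function in this model. */
static void toy_check_class_i(void) {
  if (lock_cnt <= 0) {
    toy_halt("SV#10");
  }
}
```

In this model the sequence toy_lock(), toy_check_class_i(), toy_unlock() leaves panic_msg untouched, while calling toy_check_class_i() outside the lock records "SV#10" and calling toy_lock() twice records "SV#4".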

Also see the article RT State Checker.

CH_DBG_ENABLE_CHECKS

This option enables the checking of API function parameters. In case of error the system is halted and ch.dbg.panic_msg points to a message.

CH_DBG_ENABLE_ASSERTS

This option enables system assertions. Assertions are checks placed in critical places and able to detect anomalous conditions. In case of error the system is halted and ch.dbg.panic_msg points to a message.

CH_DBG_TRACE_MASK

This option is an “or” of various option flags, each option selects an event to be traced:

  • CH_DBG_TRACE_MASK_NONE. No events traced by default (but tracing can be activated at runtime for any event).
  • CH_DBG_TRACE_MASK_SWITCH. Context switches are traced by default.
  • CH_DBG_TRACE_MASK_ISR. ISR enter and leave events are traced by default.
  • CH_DBG_TRACE_MASK_HALT. The halt event is traced, of course it is the last event recorded.
  • CH_DBG_TRACE_MASK_USER. User events are recorded. Application code can trace events using the chDbgWriteTrace() API.
  • CH_DBG_TRACE_MASK_SLOW. All events are enabled except IRQ-related ones.
  • CH_DBG_TRACE_MASK_ALL. All events are enabled.
  • CH_DBG_TRACE_MASK_DISABLED. The trace subsystem is removed entirely from the OS image.

The trace buffer stores the last N traced events. It can be used to determine the sequence of operations that led to a halt condition. The trace buffer is a memory structure but the ChibiOS/RT Eclipse Debug plugin is able to show its content in a table. Note: this option is only present in RT, not in NIL.
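
The "last N" behavior is that of a circular buffer: once the buffer is full, each new record overwrites the oldest one. The sketch below shows the principle only. The type trace_buf_t, the functions trace_write() and trace_read(), and the size of 4 entries are hypothetical and do not reflect the layout of the real ChibiOS trace buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_SIZE 4U             /* hypothetical N */

typedef struct {
  uint32_t events[TRACE_SIZE];    /* recorded event codes                 */
  size_t   next;                  /* slot that will be written next       */
  size_t   count;                 /* valid entries, at most TRACE_SIZE    */
} trace_buf_t;

/* Instance used by the example, zero-initialized as a static object. */
static trace_buf_t demo_trace;

/* Records an event, overwriting the oldest one when the buffer is full. */
static void trace_write(trace_buf_t *tb, uint32_t ev) {
  tb->events[tb->next] = ev;
  tb->next = (tb->next + 1U) % TRACE_SIZE;
  if (tb->count < TRACE_SIZE) {
    tb->count++;
  }
}

/* Returns the i-th retained event, 0 being the oldest.
   The caller must ensure that i < tb->count. */
static uint32_t trace_read(const trace_buf_t *tb, size_t i) {
  size_t oldest = (tb->next + TRACE_SIZE - tb->count) % TRACE_SIZE;

  return tb->events[(oldest + i) % TRACE_SIZE];
}
```

After writing the events 1 to 6 into demo_trace, only 3, 4, 5 and 6 are retained, with 3 being the oldest. When examining a halt, the most recent entries are the ones closest to the failure.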

CH_DBG_ENABLE_STACK_CHECK

This option enables port-defined stack checking. Note that the check is usually performed at context switch time and does not necessarily catch all overflow conditions.

CH_DBG_FILL_THREADS

Stacks are filled with a fixed pattern (0x55555555) before running threads. The pattern allows you to determine how much of a stack area has actually been used.
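
The measurement itself is a simple scan for the first byte that no longer holds the fill value. The sketch below runs on a host machine and assumes a stack that grows downward, so the untouched part sits at the lowest addresses. The names stack_unused_bytes(), demo_stack and demo_prepare() are hypothetical, and the 0xAA bytes merely simulate stack usage.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL_BYTE 0x55U           /* one byte of the 0x55555555 pattern */

/* Counts the untouched fill bytes starting from the lowest address.
   Assumes a stack growing downward, so this is the unused part. */
static size_t stack_unused_bytes(const uint8_t *base, size_t size) {
  size_t n = 0;

  while ((n < size) && (base[n] == FILL_BYTE)) {
    n++;
  }
  return n;
}

/* Simulated stack area for the example. */
static uint8_t demo_stack[64];

/* Fills the area, then overwrites "used" bytes at the top of the stack.
   The caller must ensure that used <= sizeof demo_stack. */
static void demo_prepare(size_t used) {
  memset(demo_stack, FILL_BYTE, sizeof demo_stack);
  memset(demo_stack + sizeof demo_stack - used, 0xAA, used);
}
```

After demo_prepare(20) the scan reports 44 unused bytes out of 64. The result is an estimate: a thread that happens to store the value 0x55 at its deepest stack location would make the area look slightly less used than it really was.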

CH_CFG_ST_TIMEDELTA

Setting this option to zero disables the tickless mode and falls back to the classic periodic tick. This could be useful in case your problem is caused by virtual timers being used at a rate that exceeds the system's ability to serve them.

Generic Suggestions

For the less expert users, there are several things you may do in order to minimize the need for debugging:

  • Read the documentation carefully first.
  • Enable the various debug options while developing your application.
  • Try to find code examples for the things you are going to do. Good sources are:
    • The documentation.
    • The kernel test code, under “./test” you will find examples for almost any API in the ChibiOS RT and NIL kernels and for the most common RTOS-related tasks.
    • The HAL test code, under “./testhal” there are examples regarding the various device drivers.
    • Demo applications.
  • Start your application from an existing demo, add things one at a time and test often. If you add too many things at once then finding a small problem can become a debugging nightmare. Follow the cycle: think, implement, test, repeat.
  • If you are stuck for too long then consider asking for advice.
  • Report bugs and problems. Bugs can be fixed, and problems can become new articles in the documentation (this and other documentation articles were spawned from questions in the forum or in the tracker).
  • Never give up :-)