Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

calling ChipLogProgress from FreeRTOS timer thread causes stack overflow. (CON-1508) #1239

Open
jonsmirl opened this issue Jan 12, 2025 · 6 comments

Comments

@jonsmirl
Copy link
Contributor

jonsmirl commented Jan 12, 2025

I was not using the debugger when I hit this:
ESP-IDF v5.2.3
JTAG serial console was runnning.

cluBLE 8 d4 ae f1 34 a4 
cluBLE 8 d4 ae f1 34 a4 
Guru Meditation Error: Core  1 panic'ed (Unhandled debug exception). 
Debug exception reason: BREAK instr 
Core  1 register dump:
PC      : 0x403743c0  PS      : 0x00040536  A0      : 0x00000000  A1      : 0x3fcc55b0  
--- 0x403743c0: _DoubleExceptionVector at /home/jonsmirl/esp/esp-idf/components/xtensa/xtensa_vectors.S:564

A2      : 0x00000000  A3      : 0x00000000  A4      : 0x00000001  A5      : 0x3fca531c  
A6      : 0x00000001  A7      : 0x3fcc5754  A8      : 0x821b25e1  A9      : 0x3fcc5590  
A10     : 0x00060023  A11     : 0x00000003  A12     : 0x00000000  A13     : 0x00000008  
A14     : 0x00000001  A15     : 0x3fca50a8  SAR     : 0x00000000  EXCCAUSE: 0x00000001  
EXCVADDR: 0xffffffe0  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  


Backtrace: 0x403743bd:0x3fcc55b0
--- 0x403743bd: _UserExceptionVector at ??:?





ELF file SHA256: 7ac7a9cfbb8a5ad0

Rebooting...
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0xc (RTC_SW_CPU_RST),boot:0x29 (SPI_FAST_FLASH_BOOT)
Saved PC:0x42156ce8
--- 0x42156ce8: usb_serial_jtag_ll_ena_intr_mask at /home/jonsmirl/esp/esp-idf/components/hal/esp32s3/include/hal/usb_serial_jtag_ll.h:50
 (inlined by) usb_serial_jtag_write_bytes at /home/jonsmirl/esp/esp-idf/components/driver/usb_serial_jtag/usb_serial_jtag.c:228

SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3820,len:0x19fc
load:0x403c9700,len:0x4
load:0x403c9704,len:0xdd0
load:0x403cc700,len:0x3120
entry 0x403c992c

It is likely related to me redirecting the log

#define LOG_BUFFER_SIZE 8192
#define LOG_LINE_MAX 1024U
static vprintf_like_t old_logprintf;
RingbufHandle_t log_handle;
static char *log_line_buffer;
/* FIXME 
https://github.com/espressif/esp-matter/issues/999
*/
static int log_callback(const char *format, va_list args)
{
    size_t n = 0;
    size_t len = vsnprintf(log_line_buffer, LOG_LINE_MAX, format, args);
    if (len != 0) {
        if (xRingbufferSend(log_handle, log_line_buffer, len, 0) == pdFALSE) {
            void *discard = xRingbufferReceiveUpTo(log_handle, &n, 1, len);
            if (discard == NULL) {
                discard = xRingbufferReceiveUpTo(log_handle, &n, 1, n);
                discard = xRingbufferReceiveUpTo(log_handle, &n, 1, len);
            }
            vRingbufferReturnItem(log_handle, discard);
            xRingbufferSend(log_handle, log_line_buffer, len, 0);
        }
    }
    return old_logprintf(format, args);
}

extern "C" void app_main()
{
    //Allocate ring buffer data structure and storage area into external RAM
    log_line_buffer = (char *)heap_caps_malloc(LOG_LINE_MAX, MALLOC_CAP_SPIRAM);
    StaticRingbuffer_t *buffer_struct = (StaticRingbuffer_t *)heap_caps_malloc(sizeof(StaticRingbuffer_t), MALLOC_CAP_SPIRAM);
    uint8_t *buffer_storage = (uint8_t *)heap_caps_malloc(sizeof(uint8_t)*LOG_BUFFER_SIZE, MALLOC_CAP_SPIRAM);

    //Create a ring buffer with manually allocated memory
    log_handle = xRingbufferCreateStatic(LOG_BUFFER_SIZE, RINGBUF_TYPE_BYTEBUF, buffer_storage, buffer_struct);
    old_logprintf = esp_log_set_vprintf(log_callback);

    StackType_t* mem_task_stack = (StackType_t*)heap_caps_calloc(1, 4096, MALLOC_CAP_SPIRAM);
    static StaticTask_t mem_task;
    xTaskCreateStatic(memTask, "mem", 4096, NULL, 0, mem_task_stack, &mem_task);

    app_matter();
}
#define LOG_CHUNK_SIZE 1024
extern RingbufHandle_t log_handle;
static esp_err_t log_handler(httpd_req_t *req)
{
    esp_err_t err = ESP_OK;
    httpd_resp_set_type(req, "text/plain");
    httpd_resp_set_hdr(req, "Access-Control-Allow-Origin", "*");
    size_t n = 0;
    char *entries;
    while (true)
    {
        entries = (char *)xRingbufferReceiveUpTo(log_handle, &n, 0, LOG_CHUNK_SIZE);
        if (entries == NULL)
            break;
        err = httpd_resp_send_chunk(req, entries, n);
        vRingbufferReturnItem(log_handle, (void *)entries);
        if (err != ESP_OK)
            break;
    }
    httpd_resp_send_chunk(req, NULL, 0);
    return err;
}

@github-actions github-actions bot changed the title _DoubleExceptionVector _DoubleExceptionVector (CON-1508) Jan 12, 2025
@jonsmirl
Copy link
Contributor Author

Another variation on same fault

log_handler3
Guru Meditation Error: Core  1 panic'ed (Unhandled debug exception). 
Debug exception reason: BREAK instr 
Core  1 register dump:
PC      : 0x403743c0  PS      : 0x00040236  A0      : 0x00000000  A1      : 0x3fcc55b0  
--- 0x403743c0: _DoubleExceptionVector at /home/jonsmirl/esp/esp-idf/components/xtensa/xtensa_vectors.S:564

A2      : 0x00000000  A3      : 0x00000000  A4      : 0x00000001  A5      : 0x3fca531c  
A6      : 0x00000001  A7      : 0x3fcc5754  A8      : 0x821b261d  A9      : 0x3fcc5590  
A10     : 0x00060023  A11     : 0x00000003  A12     : 0x00000000  A13     : 0x00000008  
A14     : 0x00000001  A15     : 0x3fca50a8  SAR     : 0x00000000  EXCCAUSE: 0x00000001  
EXCVADDR: 0xffffffe0  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  


Backtrace: 0x403743bd:0x3fcc55b0
--- 0x403743bd: _UserExceptionVector at ??:?





ELF file SHA256: c471ed8de2220af2

Rebooting...
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0xc (RTC_SW_CPU_RST),boot:0x8 (SPI_FAST_FLASH_BOOT)
Saved PC:0x403848c6
--- 0x403848c6: xPortEnterCriticalTimeout at /home/jonsmirl/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:488

SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3820,len:0x19fc
load:0x403c9700,len:0x4
load:0x403c9704,len:0xdd0
load:0x403cc700,len

@jonsmirl
Copy link
Contributor Author

I am still testing, but my preliminary assessment is that this is due to the HTTP server overrunning the FreeRTOS timer stack.

@jonsmirl
Copy link
Contributor Author

jonsmirl commented Jan 14, 2025

This is very suspicious, is that NVS set causing the timer stack to overflow? I set my timer stack size up to 10K and I am rerunning the test. It is very suspicious because my timer overflow almost exactly at (3601073) each time.

I (3601073) chip[DL]: NVS set: chip-counters/total-hours = 316 (0x13C)

***ERROR*** A stack overflow in task Tmr Svc has been detected.


Backtrace: 0x40375dd6:0x3fcc5b60 0x40384399:0x3fcc5b80 0x403849b1:0x3fcc5ba0 0x4038525a:0x3fcc5c20 0x40384498:0x3fcc5c50 0x4038448e:0x00060023 |<-CORRUPTED
--- 0x40375dd6: panic_abort at /home/jonsmirl/esp/esp-idf/components/esp_system/panic.c:466
0x40384399: esp_system_abort at /home/jonsmirl/esp/esp-idf/components/esp_system/port/esp_system_chip.c:93
0x403849b1: vApplicationStackOverflowHook at /home/jonsmirl/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:558
0x4038525a: vTaskSwitchContext at /home/jonsmirl/esp/esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:3664 (discriminator 7)
0x40384498: _frxt_dispatch at /home/jonsmirl/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/portasm.S:451
0x4038448e: _frxt_int_exit at /home/jonsmirl/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/portasm.S:246


@jonsmirl
Copy link
Contributor Author

jonsmirl commented Jan 14, 2025

This is an old error which I have hit previously. You can't do ChipLogProgress() from an interrupt routine; it sometimes causes a stack overflow. I commented out the NVS Set debug messages in this file:
https://github.com/project-chip/connectedhomeip/blob/master/src/platform/ESP32/ESP32Config.cpp

Well.... you can do it, but then I have to make 10K timer stacks.

I (3601073) chip[DL]: NVS set: chip-counters/total-hours = 316 (0x13C)
ERROR A stack overflow in task Tmr Svc has been detected.

That timer service routine has to switch off from the FreeRTOS timer stack and onto the CHIP stack. Why it is using a FreeRTOS timer instead of a ChipTimer? please search around for more of these, this is the second one I have hit.

For sure that is the bug, with 10K timer stack I can get pass the progress log message....

cluBLE 8 d4 ae f1 34 a4 
cluBLE 8 d4 ae f1 34 a4 
log_handler
log_handler1 0
log_handler2 0
I (3601076) chip[DL]: NVS set: chip-counters/total-hours = 317 (0x13D)
log_handler
log_handler1 96
log_handler1 96
log_handler2 96

@jonsmirl jonsmirl changed the title _DoubleExceptionVector (CON-1508) calling ChipLogProgress from FreeRTOS timer thread causes stack overflow. (CON-1508) Jan 14, 2025
@shubhamdp
Copy link
Contributor

@jonsmirl Thanks for raising and drilling down on the problem. I did not hit this during my tests so I assumed the logs would go-through.

Since total-operational-hours is a critical information, it intentionally uses the FreeRTOS timer to increment the values, so that it should not be a victim of PostEvent failures.

Here is the PR for the fix: project-chip/connectedhomeip#37058

@jonsmirl
Copy link
Contributor Author

You can print with ESP_LOG you just can't use ChipLogProgress since it is a stack pig.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants