Assertion failure with new image and eng+chi_tra fast #4362

Open
marcreichman-pfi opened this issue Nov 25, 2024 · 5 comments

marcreichman-pfi commented Nov 25, 2024

Current Behavior

On recent main (9f17a3fd), I receive a SIGABRT in Release builds (SIGILL in Debug builds) with the eng and chi_tra languages. Both are the official fast models.

(gdb) set args ~/dev/testimages/ACCDEE72E33B2C425E597A4411009466.jpg - --tessdata-dir <snip>/tessdata/ -l eng+chi_tra
(gdb) r
Starting program: /root/dev/tesseract/build-debug/bin/tesseract ~/dev/testimages/ACCDEE72E33B2C425E597A4411009466.jpg - --tessdata-dir <snip>/tessdata/ -l eng+chi_tra
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Estimating resolution as 261
Detected 12 diacritics
[New Thread 0x7ffff73c6640 (LWP 5374)]
[New Thread 0x7ffff6bc5640 (LWP 5375)]
[New Thread 0x7ffff63c4640 (LWP 5376)]
!w_it.cycled_list():Error:Assert failed:in file /root/dev/tesseract/src/ccstruct/pageres.cpp, line 1502

Thread 1 "tesseract" received signal SIGILL, Illegal instruction.
tesseract::ERRCODE::error (this=this@entry=0x5555558a1340 <tesseract::ASSERT_FAILED>, caller=caller@entry=0x5555557f9123 "!w_it.cycled_list()", action=action@entry=tesseract::ABORT, format=format@entry=0x5555557f8900 "in file %s, line %d") at /root/dev/tesseract/src/ccutil/errcode.cpp:78
78            __builtin_trap();
(gdb) bt
#0  tesseract::ERRCODE::error (this=this@entry=0x5555558a1340 <tesseract::ASSERT_FAILED>, caller=caller@entry=0x5555557f9123 "!w_it.cycled_list()", action=action@entry=tesseract::ABORT,
    format=format@entry=0x5555557f8900 "in file %s, line %d") at /root/dev/tesseract/src/ccutil/errcode.cpp:78
#1  0x000055555558485c in tesseract::PAGE_RES_IT::DeleteCurrentWord (this=this@entry=0x7fffffffdc00) at /root/dev/tesseract/src/ccstruct/pageres.cpp:1502
#2  0x000055555561a972 in tesseract::Tesseract::recog_all_words (this=0x7ffff73c7010, page_res=0x5555558e18e0, monitor=monitor@entry=0x0, target_word_box=target_word_box@entry=0x0,
    word_config=word_config@entry=0x0, dopasses=dopasses@entry=0) at /root/dev/tesseract/src/ccmain/control.cpp:446
#3  0x00005555555d5553 in tesseract::TessBaseAPI::Recognize (this=this@entry=0x7fffffffe2d0, monitor=monitor@entry=0x0) at /root/dev/tesseract/src/api/baseapi.cpp:833
#4  0x00005555555d57e3 in tesseract::TessBaseAPI::ProcessPage (this=this@entry=0x7fffffffe2d0, pix=0x5555558e2230, page_index=page_index@entry=0,
    filename=filename@entry=0x7fffffffe774 "/root/dev/testimages/ACCDEE72E33B2C425E597A4411009466.jpg", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
    renderer=0x5555558d2740) at /root/dev/tesseract/src/api/baseapi.cpp:1218
#5  0x00005555555d68e4 in tesseract::TessBaseAPI::ProcessPagesInternal (this=this@entry=0x7fffffffe2d0,
    filename=0x7fffffffe774 "/root/dev/testimages/ACCDEE72E33B2C425E597A4411009466.jpg", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
    renderer=0x5555558d2740) at /root/dev/tesseract/src/api/baseapi.cpp:1181
#6  0x00005555555d69ea in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffffffe2d0, filename=<optimized out>, retry_config=retry_config@entry=0x0,
    timeout_millisec=timeout_millisec@entry=0, renderer=<optimized out>) at /root/dev/tesseract/src/api/baseapi.cpp:998
#7  0x000055555556d6c3 in main (argc=<optimized out>, argv=<optimized out>) at /usr/include/c++/11/bits/unique_ptr.h:173

Expected Behavior

No SIGABRT.

Suggested Fix

No response

tesseract -v

tesseract 5.5.0-26-g9f17a
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX
 Found SSE4.1
 Found OpenMP 201511

Operating System

Ubuntu 22.04 Jammy

Other Operating System

WSL

uname -a

Linux hostname 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Compiler

GCC 11.4

CPU

Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz

Virtualization / Containers

No response

Other Information

I'm sure this is related to the series of issues around the random generator (#4361, #4146, #4148, #4270). This is also reproducible in 5.5.0, unlike #4361, which worked in 5.5.0.

@marcreichman-pfi
Author

ACCDEE72E33B2C425E597A4411009466

Here is the image for this one, sorry.

@stweil
Member

stweil commented Nov 25, 2024

There is a heap-use-after-free before the assertion:

Estimating resolution as 261
Detected 12 diacritics
=================================================================
==31201==ERROR: AddressSanitizer: heap-use-after-free on address 0x6080000034b8 at pc 0x55a73474bd12 bp 0x7fffbe0cdab0 sp 0x7fffbe0cdaa8
READ of size 8 at 0x6080000034b8 thread T0
    #0 0x55a73474bd11 in std::__cxx1998::_Base_bitset<1ul>::_M_getword(unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bitset:415:16
    #1 0x55a73474bc82 in std::__cxx1998::bitset<16ul>::_Unchecked_test(unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bitset:1066:24
    #2 0x55a73474bc00 in std::__cxx1998::bitset<16ul>::operator[](unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bitset:1168:16
    #3 0x55a73474bba2 in std::__debug::bitset<16ul>::operator[](unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/debug/bitset:282:16
    #4 0x55a73474b2df in tesseract::WERD::flag(tesseract::WERD_FLAGS) const /tesseract/build/../src/ccstruct/werd.h:129:12
    #5 0x55a7349c0280 in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) /tesseract/build/../src/ccmain/control.cpp:350:37
    #6 0x55a7346a24af in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) /tesseract/build/../src/api/baseapi.cpp:833:21
    #7 0x55a7346a4b99 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1218:14
    #8 0x55a7346a92b8 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1181:16
    #9 0x55a7346a61f1 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:998:17
    #10 0x55a7346262f2 in main /tesseract/build/../src/tesseract.cpp:867:24
    #11 0x7f8a62f23249 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #12 0x7f8a62f23304 in __libc_start_main csu/../csu/libc-start.c:360:3
    #13 0x55a734563450 in _start (/tesseract/build/tesseract+0x17d9450) (BuildId: 76aacbbd0f98892a9872e3f978f3ed72519cf4ee)

0x6080000034b8 is located 24 bytes inside of 96-byte region [0x6080000034a0,0x608000003500)
freed by thread T0 here:
    #0 0x55a7346218cd in operator delete(void*) (/tesseract/build/tesseract+0x18978cd) (BuildId: 76aacbbd0f98892a9872e3f978f3ed72519cf4ee)
    #1 0x55a734fb773e in tesseract::WERD_RES::Clear() /tesseract/build/../src/ccstruct/pageres.cpp:1130:5
    #2 0x55a734fcb438 in tesseract::WERD_RES::~WERD_RES() /tesseract/build/../src/ccstruct/pageres.cpp:1125:3
    #3 0x55a734fd0bee in tesseract::PAGE_RES_IT::ReplaceCurrentWord(tesseract::PointerVector<tesseract::WERD_RES>*) /tesseract/build/../src/ccstruct/pageres.cpp:1483:3
    #4 0x55a7349b840b in tesseract::Tesseract::classify_word_and_language(int, tesseract::PAGE_RES_IT*, tesseract::WordData*) /tesseract/build/../src/ccmain/control.cpp:1367:14
    #5 0x55a7349bbe84 in tesseract::Tesseract::RecogAllWordsPassN(int, tesseract::ETEXT_DESC*, tesseract::PAGE_RES_IT*, std::__debug::vector<tesseract::WordData, std::allocator<tesseract::WordData> >*) /tesseract/build/../src/ccmain/control.cpp:255:5
    #6 0x55a7349c0125 in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) /tesseract/build/../src/ccmain/control.cpp:345:10
    #7 0x55a7346a24af in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) /tesseract/build/../src/api/baseapi.cpp:833:21
    #8 0x55a7346a4b99 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1218:14
    #9 0x55a7346a92b8 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1181:16
    #10 0x55a7346a61f1 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:998:17
    #11 0x55a7346262f2 in main /tesseract/build/../src/tesseract.cpp:867:24
    #12 0x7f8a62f23249 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16

previously allocated by thread T0 here:
    #0 0x55a73462106d in operator new(unsigned long) (/tesseract/build/tesseract+0x189706d) (BuildId: 76aacbbd0f98892a9872e3f978f3ed72519cf4ee)
    #1 0x55a734fb5302 in tesseract::ROW_RES::ROW_RES(bool, tesseract::ROW*) /tesseract/build/../src/ccstruct/pageres.cpp:171:21
    #2 0x55a734fb3c97 in tesseract::BLOCK_RES::BLOCK_RES(bool, tesseract::BLOCK*) /tesseract/build/../src/ccstruct/pageres.cpp:109:31
    #3 0x55a734fb32aa in tesseract::PAGE_RES::PAGE_RES(bool, tesseract::BLOCK_LIST*, tesseract::WERD_CHOICE**) /tesseract/build/../src/ccstruct/pageres.cpp:84:13
    #4 0x55a73469f93e in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) /tesseract/build/../src/api/baseapi.cpp:783:13
    #5 0x55a7346a4b99 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1218:14
    #6 0x55a7346a92b8 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1181:16
    #7 0x55a7346a61f1 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:998:17
    #8 0x55a7346262f2 in main /tesseract/build/../src/tesseract.cpp:867:24
    #9 0x7f8a62f23249 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16

SUMMARY: AddressSanitizer: heap-use-after-free /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bitset:415:16 in std::__cxx1998::_Base_bitset<1ul>::_M_getword(unsigned long) const
Shadow bytes around the buggy address:
  0x0c107fff8640: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 fa
  0x0c107fff8650: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c107fff8660: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 05
  0x0c107fff8670: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c107fff8680: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 06
=>0x0c107fff8690: fa fa fa fa fd fd fd[fd]fd fd fd fd fd fd fd fd
  0x0c107fff86a0: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c107fff86b0: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c107fff86c0: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c107fff86d0: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 02
  0x0c107fff86e0: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==31201==ABORTING
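
Reading the report above: the memory is released inside `PAGE_RES_IT::ReplaceCurrentWord` (via `WERD_RES::~WERD_RES`) and is later read through `WERD::flag` from `recog_all_words`. The following is only a minimal sketch of that general class of bug and one way to avoid it. `Word`, `WordResult`, `ReplaceCurrent` and `FlagAfterReplace` are hypothetical names for illustration; they are not Tesseract's types and this is not a proposed fix.

```cpp
#include <cassert>
#include <list>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT Tesseract's WERD / WERD_RES types.
struct Word {
  bool fuzzy_space = false;
};

struct WordResult {
  std::unique_ptr<Word> word;
};

using WordList = std::list<std::unique_ptr<WordResult>>;

// Replaces the element at `it` with `replacements` and returns an iterator to
// the first replacement (or to the following element if there are none).
// The old WordResult and the Word it owns are destroyed here, so any raw
// pointer to them that was taken before this call is dangling afterwards.
WordList::iterator ReplaceCurrent(WordList &words, WordList::iterator it,
                                  std::vector<std::unique_ptr<WordResult>> replacements) {
  assert(it != words.end());
  WordList::iterator next = words.erase(it);  // frees the old word
  WordList::iterator first = next;
  bool have_first = false;
  for (auto &r : replacements) {
    WordList::iterator inserted = words.insert(next, std::move(r));
    if (!have_first) {
      first = inserted;
      have_first = true;
    }
  }
  return first;
}

// Safe pattern: read the flag through the iterator returned by the
// replacement, never through a pointer cached before it.
bool FlagAfterReplace(WordList &words, WordList::iterator it,
                      std::vector<std::unique_ptr<WordResult>> replacements) {
  WordList::iterator current = ReplaceCurrent(words, it, std::move(replacements));
  return current != words.end() && (*current)->word && (*current)->word->fuzzy_space;
}
```

The point of the sketch is only that the caller re-fetches the current element after the replacement; whether the real code path needs the same treatment would have to be checked against `control.cpp`.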

@stweil
Member

stweil commented Nov 25, 2024

The current code uses random values to add noise outside of the image. Using a constant instead of the random values might work better (I have not yet tried this with the other cases):

diff --git a/src/lstm/networkio.cpp b/src/lstm/networkio.cpp
index 3cb068c6..83347260 100644
--- a/src/lstm/networkio.cpp
+++ b/src/lstm/networkio.cpp
@@ -417,7 +417,7 @@ void NetworkIO::Randomize(int t, int offset, int num_features, TRand *randomizer
   if (int_mode_) {
     int8_t *line = i_[t] + offset;
     for (int i = 0; i < num_features; ++i) {
-      line[i] = IntCastRounded(randomizer->SignedRand(INT8_MAX));
+      line[i] = 0;
     }
   } else {
     // float mode.
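
To make the difference between the two fill strategies concrete, here is a small standalone sketch. It does not use Tesseract's `NetworkIO` or `TRand`; `FillWithNoise` and `FillWithConstant` are hypothetical helpers, and `std::mt19937` is assumed here purely as a stand-in random source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper, not Tesseract's NetworkIO::Randomize.
// Fills `line` with uniform signed noise in [-amplitude, amplitude].
void FillWithNoise(std::vector<std::int8_t> &line, std::mt19937 &rng,
                   int amplitude = INT8_MAX) {
  assert(amplitude >= 0 && amplitude <= INT8_MAX);
  // uniform_int_distribution is not defined for 8-bit types, so draw ints
  // and narrow them afterwards.
  std::uniform_int_distribution<int> dist(-amplitude, amplitude);
  for (auto &v : line) {
    v = static_cast<std::int8_t>(dist(rng));
  }
}

// The alternative from the diff: a constant fill, which makes the padding
// independent of any random state.
void FillWithConstant(std::vector<std::int8_t> &line, std::int8_t value = 0) {
  std::fill(line.begin(), line.end(), value);
}
```

With a constant fill the padding is identical on every run, so a crash that depends on particular noise values would no longer be triggered. That also means it could hide the underlying defect rather than remove it.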

@egorpugin
Contributor

Still, it would be better to understand what is wrong with how the lists are used.
I guess the list usage is incorrect somewhere.
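
For context, the failed assertion `!w_it.cycled_list()` suggests a search through a circular list that wrapped around without finding the word it was looking for. Below is a minimal sketch of that idiom, assuming a simplified circular iterator. `CyclicIt` and `FindBeforeCycle` are hypothetical and only loosely modelled on the mark-cycle-point / cycled-list pattern; they are not Tesseract's `WERD_IT`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical circular iterator over a vector; not Tesseract's list classes.
class CyclicIt {
 public:
  explicit CyclicIt(const std::vector<int> &items) : items_(items) {}

  // Remember the current position as the start of a full cycle.
  void MarkCyclePoint() { steps_ = 0; }

  // True once every element has been visited since the cycle point
  // (or immediately, if the list is empty).
  bool CycledList() const { return items_.empty() || steps_ >= items_.size(); }

  // Advance to the next element, wrapping around at the end.
  void Forward() {
    assert(!items_.empty());
    pos_ = (pos_ + 1) % items_.size();
    ++steps_;
  }

  int Data() const { return items_[pos_]; }

 private:
  const std::vector<int> &items_;
  std::size_t pos_ = 0;
  std::size_t steps_ = 0;
};

// Returns true if `target` is found before the iterator wraps around.
// A caller that asserts on this result fails exactly when the element it
// expected to be in the list is no longer there.
bool FindBeforeCycle(const std::vector<int> &items, int target) {
  CyclicIt it(items);
  for (it.MarkCyclePoint(); !it.CycledList(); it.Forward()) {
    if (it.Data() == target) {
      return true;
    }
  }
  return false;
}
```

In this model, a failed search means the list and the caller disagree about which elements exist, which is consistent with the suspicion that a list is being modified or referenced incorrectly somewhere upstream.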

@egorpugin
Contributor

Or, more generally, fix all the other issues around random values and the crashes they expose.
