Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fileformat/elf: Prevent creation of duplicate imports while parsing #1210

Merged
merged 2 commits into from
Sep 25, 2024

Conversation

PeterMatula
Copy link
Collaborator

@PeterMatula PeterMatula commented Sep 23, 2024

  • Problematic file 6dae046fe2268eba225be5916013b7799ba042c7ca87cd49cda77ae59a34d29f.
  • Objectdump and readelf say there are no symbols, but we found ~50k of them.
  • First ~30k looks ok, so its not all garbage.
  • In the next ~20k, there are a lot of (but not all) same symbols with empty name, zero section index, zero address, etc.
  • However, we have relocations for these values anyway, and a lot of them (~170k).
  • So, we end up with ~20K same symbols each with ~170K relocations -> 3.4B Imports.
  • The problem is, that from program's PoV, nothing is obviously wrong. The addresses are valid in the binary, relocation are into valid sections/segments, etc.
  • We could make some heuristic to fix this concrete situation, e.g. cap that multiplication result, trim it down/skip it for empty-named symbols, etc. But, it would be ugly and who knows what would break. However, I realized that this could be solved more elegantly.
  • The same huge amount of Imports was being created for every symbol, as the values were the same. But we did not catch this and happily created them all, as Imports are not indexed in any way, its just a vector of unique pointers.
  • I added a cache to keep track of the created imports and skip recreating them. This will still create a bunch of empty/garbage symbols and Imports, but it does not break anything, and a huge amount of entries is created only once.
  • To optimize it further the cache works only with string references, which is however a bit error prone. Please check the logic, as my C++ is kinda rusty, and maybe there could be less error-prone solution. However, I don't think the underlying code (vector of unique pointer in Import Table) will change in the future, so it should not really break that easy.

@PeterMatula
Copy link
Collaborator Author

I run regression tests locally and all seems fine. I will add looking into not-working CI into my TODO list, but will merge this in the meantime. Also, unless you insist, I won't add regression test for this, as it is super rare case, I don't feel like manufacturing a binary to test for it, and the one in question is a 80 MB ELF possibly tagged as malware.

@PeterMatula PeterMatula merged commit 04df6de into master Sep 25, 2024
1 of 7 checks passed
@PeterMatula PeterMatula deleted the fix-elf-sym-load branch September 25, 2024 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants