Big files was parsed slowly #94

PavelFil · 2024-09-10T06:13:27Z

I have a huge HTML file (2 MB):

<!DOCTYPE html>
<html>
<head>
</head>
<body>
    <div><div>dnbfkjsb asdhfjkashjkfhalkshdfljkhaskdj fhkajsdfkjaslflkjashdlfkhaskldfhaklsj hdflkasdfkjlhasdflkashdklfj hasdk</div></div>
    <!-- Repeat the row above 19000 times -->
</body>
</script>
</html>

And the request below takes 78 seconds:

    hQuery::fromHTML($html)->find('script,style');

In a browser, the equivalent query takes less than 0.2 seconds.
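For anyone who wants to reproduce this, here is a minimal sketch that builds the document described above and times the call. The helper name `buildTestDocument` is made up for illustration, and the script assumes hQuery is installed and autoloaded (for example via Composer); without it, only the document size is printed.

```php
<?php
// Build the test document from the report: one nested <div> row repeated many times.
// buildTestDocument() is a hypothetical helper, not part of hQuery.
function buildTestDocument(int $rows = 19000): string
{
    $row = '    <div><div>dnbfkjsb asdhfjkashjkfhalkshdfljkhaskdj fhkajsdfkjaslflkjashdlfkhaskldfhaklsj '
         . 'hdflkasdfkjlhasdflkashdklfj hasdk</div></div>' . "\n";

    return "<!DOCTYPE html>\n<html>\n<head>\n</head>\n<body>\n"
         . str_repeat($row, $rows)
         . "</body>\n</script>\n</html>\n"; // stray </script> kept exactly as in the report
}

$html = buildTestDocument();
printf("Document size: %.2f MB\n", strlen($html) / 1048576);

// Assumption: hQuery is available through an autoloader; skipped otherwise.
if (class_exists('hQuery')) {
    $start = microtime(true);
    hQuery::fromHTML($html)->find('script,style');
    printf("fromHTML + find: %.2f s\n", microtime(true) - $start);
}
```

Timings will vary with the PHP version and the machine, so the 78 seconds above should be read as one data point rather than a fixed figure.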

duzun · 2024-09-12T07:34:39Z

I was able to reproduce this synthetic test.
It turns out that hQuery/Parser/HTML::parse() does not scale linearly with the number of tags in the document 🤔.
In other words, hQuery::fromHTML($html) is affected, but ->find('script,style') is not.
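One way to check this kind of claim is to time the two steps separately at two document sizes and estimate the growth exponent from the ratio. The sketch below is only an illustration: `growthExponent` and `timeIt` are hypothetical helpers, the row content and sizes are arbitrary, and the hQuery part runs only if the library is autoloaded.

```php
<?php
// Estimate k in "time ~ size^k" from two measurements (slope on a log-log plot).
function growthExponent(float $n1, float $t1, float $n2, float $t2): float
{
    return log($t2 / $t1) / log($n2 / $n1);
}

// Wall-clock time of a single call, in seconds.
function timeIt(callable $fn): float
{
    $start = hrtime(true);
    $fn();
    return (hrtime(true) - $start) / 1e9;
}

// Assumption: hQuery is installed and autoloaded; the block is skipped otherwise.
if (class_exists('hQuery')) {
    $row = "<div><div>lorem ipsum</div></div>\n";
    $times = [];
    foreach ([5000, 10000] as $rows) {
        $html = '<html><body>' . str_repeat($row, $rows) . '</body></html>';
        $doc = null;
        // Time parsing and querying separately to see which one grows faster.
        $parse = timeIt(function () use ($html, &$doc) { $doc = hQuery::fromHTML($html); });
        $find  = timeIt(function () use ($doc) { $doc->find('script,style'); });
        $times[$rows] = $parse;
        printf("%6d rows: parse %.3f s, find %.3f s\n", $rows, $parse, $find);
    }
    printf("parse exponent ~ %.2f\n", growthExponent(5000, $times[5000], 10000, $times[10000]));
}
```

An exponent close to 1 would indicate linear behavior, while a value close to 2 would point to quadratic growth in the parser. Single runs are noisy, so repeating the measurement a few times is advisable.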

I'll try to analyze the code and improve it.

Thank you for the challenge!

duzun · 2024-10-03T21:37:43Z

I have an intuition that the issue is in the heavy use of strspn and strcspn for parsing HTML. I had assumed that they are very fast. But after reading the implementation code, I realized that each call initializes an array of 256 bytes, even for a small character list. This doesn't scale well.
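To illustrate the pattern in question, here is a simplified sketch of a scanner that leans on strcspn and strspn. It is not hQuery's actual parser code; `tagNames` is a hypothetical function that only shows how such a loop makes two short calls per tag, so any fixed per-call setup cost is paid once for every tag in the document.

```php
<?php
// Collect opening tag names using strcspn/strspn with an offset.
// Illustrative only: closing tags, comments and attributes are ignored.
function tagNames(string $html): array
{
    $nameChars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    $names = [];
    $len = strlen($html);
    $pos = 0;

    while ($pos < $len) {
        // Call 1: length of the run that contains no '<' (skips text).
        $pos += strcspn($html, '<', $pos);
        if ($pos >= $len) {
            break;
        }
        $pos++; // step over '<'

        // Call 2: length of the run made only of tag-name characters.
        $n = strspn($html, $nameChars, $pos);
        if ($n > 0) {
            $names[] = substr($html, $pos, $n);
        }
        $pos += $n;
    }

    return $names;
}

print_r(tagNames('<div><p>hi</p></div>'));
```

For a single-character mask such as '<', strpos with an offset is one possible alternative that needs no character table. Whether that would actually help here is untested and would need profiling.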

duzun added the enhancement label Sep 12, 2024
