Big files was parsed slowly #94

Open
PavelFil opened this issue Sep 10, 2024 · 2 comments
@PavelFil

I have a huge HTML document, about 2 MB:

<!DOCTYPE html>
<html>
<head>
</head>
<body>
    <div><div>dnbfkjsb asdhfjkashjkfhalkshdfljkhaskdj fhkajsdfkjaslflkjashdlfkhaskldfhaklsj hdflkasdfkjlhasdflkashdklfj hasdk</div></div>
    <!--Repeat the row above 19000 times-->
</body>
</script>
</html>

The call below takes 78 seconds:

    hQuery::fromHTML($html)->find('script,style');

In a browser, the equivalent query takes less than 0.2 seconds.
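
For anyone who wants to reproduce the report, the document can be generated rather than pasted. The following is a minimal sketch, not the reporter's actual script: `buildSyntheticHtml()` is a hypothetical helper, the Composer autoload path and the `duzun\hQuery` class name are assumptions about the local setup, and the stray `</script>` is kept exactly as it appears in the sample above.

```php
<?php
// Hypothetical helper: rebuild the sample document with the row repeated $rows times.
function buildSyntheticHtml(int $rows): string
{
    $row = '    <div><div>dnbfkjsb asdhfjkashjkfhalkshdfljkhaskdj '
         . 'fhkajsdfkjaslflkjashdlfkhaskldfhaklsj hdflkasdfkjlhasdflkashdklfj hasdk'
         . "</div></div>\n";

    // The stray </script> is copied verbatim from the report.
    return "<!DOCTYPE html>\n<html>\n<head>\n</head>\n<body>\n"
        . str_repeat($row, $rows)
        . "</body>\n</script>\n</html>\n";
}

$html = buildSyntheticHtml(19000);
printf("Document size: %.2f MB\n", strlen($html) / (1024 * 1024));

// Assumption: hQuery was installed with Composer next to this script.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require $autoload;
}

if (class_exists('duzun\\hQuery')) {
    $start  = microtime(true);
    $doc    = \duzun\hQuery::fromHTML($html);   // parsing step
    $parsed = microtime(true);
    $doc->find('script,style');                 // query step
    $done   = microtime(true);

    printf("fromHTML: %.2f s, find: %.2f s\n", $parsed - $start, $done - $parsed);
} else {
    echo "hQuery not found; only the document was generated.\n";
}
```

Timing the two calls separately shows which step dominates; the absolute numbers will differ by machine and PHP version.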

@duzun
Owner

duzun commented Sep 12, 2024

I was able to reproduce this with the synthetic test.
It turns out that hQuery/Parser/HTML::parse() is not linear in the number of tags in the document 🤔.
In other words, hQuery::fromHTML($html) is affected, but ->find('script,style') is not.
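
One way to check the non-linearity is to time the parse step at several document sizes: if the cost were linear, doubling the number of rows would roughly double the time, while quadratic growth would roughly quadruple it. The sketch below is illustrative only. `makeRows()`, `timeCall()` and `measureScaling()` are hypothetical helpers, the row is a shortened stand-in for the one in the report, and the autoload path and `duzun\hQuery` class name are assumptions.

```php
<?php
// Hypothetical helper: a small document with $rows nested-div rows.
function makeRows(int $rows): string
{
    return "<!DOCTYPE html>\n<html>\n<body>\n"
        . str_repeat("<div><div>lorem ipsum dolor sit amet</div></div>\n", $rows)
        . "</body>\n</html>\n";
}

// Wall-clock seconds spent in $fn($input).
function timeCall(callable $fn, string $input): float
{
    $start = hrtime(true);
    $fn($input);
    return (hrtime(true) - $start) / 1e9;
}

// Time $parse once per row count; returns [rows => seconds].
function measureScaling(callable $parse, array $rowCounts): array
{
    $timings = [];
    foreach ($rowCounts as $rows) {
        $timings[$rows] = timeCall($parse, makeRows($rows));
    }
    return $timings;
}

// Assumption: hQuery was installed with Composer next to this script.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require $autoload;
}

$parse = class_exists('duzun\\hQuery')
    ? fn(string $html) => \duzun\hQuery::fromHTML($html)
    : fn(string $html) => strlen($html); // placeholder so the script still runs

foreach (measureScaling($parse, [1000, 2000, 4000, 8000]) as $rows => $seconds) {
    printf("%6d rows: %.4f s\n", $rows, $seconds);
}
```

A single run per size is noisy; repeating each measurement and taking the minimum gives a steadier picture.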

I'll try to analyze the code and improve it.

Thank you for the challenge!

@duzun
Owner

duzun commented Oct 3, 2024

My intuition is that the issue lies in the heavy use of strspn and strcspn for parsing the HTML. I had assumed they were very fast, but after reading the implementation I realized that each call initializes a 256-byte array, even for a short character list. This doesn't scale well.
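
The per-call cost can be measured directly with a micro-benchmark. This is a rough sketch under stated assumptions, not a statement about how any particular PHP version implements these functions: `nsPerCall()` is a hypothetical helper, and the subject string and masks are made up to resemble what an HTML tokenizer might pass.

```php
<?php
// Average nanoseconds per call of $fn over $iterations calls.
function nsPerCall(callable $fn, int $iterations): float
{
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / $iterations;
}

// Made-up inputs: a short tag fragment and typical small masks.
$subject    = 'div class="row">text</div>';
$iterations = 200000;

$results = [
    'strcspn, 1-char mask'  => nsPerCall(fn() => strcspn($subject, '>'), $iterations),
    'strcspn, 6-char mask'  => nsPerCall(fn() => strcspn($subject, " \t\r\n/>"), $iterations),
    'strspn, 26-char mask'  => nsPerCall(fn() => strspn($subject, 'abcdefghijklmnopqrstuvwxyz'), $iterations),
    'strpos, 1-char needle' => nsPerCall(fn() => strpos($subject, '>'), $iterations),
];

foreach ($results as $label => $ns) {
    printf("%-22s %8.1f ns/call\n", $label, $ns);
}
```

The loop and closure overhead is included in every row, so only the differences between rows are meaningful. The benchmark measures per-call cost only; whether that cost accounts for the non-linear growth is a separate question that these timings alone do not answer.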
