This repository is a Go implementation of the paper "DOM-based Content Extraction via Text Density". It extracts the main content of a web page by analyzing text density, and it has been tested mainly on Korean web pages.
Extracts main content from web pages by removing unnecessary elements (e.g., ads, navigation bars, sidebars). Analyzes text density to identify relevant content blocks. Uses goquery for efficient DOM traversal, HTML parsing, and manipulation.
# Clone the repository
gh repo clone minarc/godensity
# Change to the project directory
cd godensity
# Run tests
go test -v .
Feel free to open issues or submit pull requests if you find any bugs or have suggestions for improvement. Contributions are always welcome!