This repository is an implementation of DOM-based content extraction via text density, tested on Korean web pages.

minarc/godensity

godensity

This repository implements DOM-based Content Extraction via Text Density in Go. It extracts the main content of a web page by analyzing text density, and it has been tested mainly on Korean web pages.

Features

- Extracts the main content from web pages by removing unnecessary elements (e.g., ads, navigation bars, sidebars).
- Analyzes text density to determine relevant content blocks.
- Traverses the DOM efficiently, using goquery for HTML parsing and manipulation.


How to run?

# Clone the repository
gh repo clone minarc/godensity

# Change to the project directory
cd godensity

# Run tests
go test -v .

Contribution

Contributions are always welcome. Feel free to open an issue or submit a pull request if you find a bug or have a suggestion for improvement.
