Automatic summarization with semi-automatic pre-processing of long documents
Long Document Summarization (LDS) is an NLP task that arises when source documents exceed a model's context length. As there is no commonly agreed-upon solution to this problem, LDS remains an active research area (Tunstall et al., 2022).
Vig et al. (2021) report two-step extractive-abstractive frameworks as a main category of approaches to the LDS task. Such a framework consists of:
- Extracting a subset of the text.
  - Applied here through a regex-based approach for automatic segmentation, reduction and cleaning of the input document.
- Feeding it to an abstractive [or extractive] summarization model. You may:
  - Use one of the re-imported HuggingFace abstractive summarization models
  - Use the provided stand-alone implementation of TextRank (Mihalcea and Tarau, 2004)
  - Pipe your own summarization model
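The two steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's implementation: `split_into_chapters`, `lead_summarizer` and `summarize_document` are hypothetical names, and the "summarizer" simply keeps the first sentence of each chapter so that the example stays self-contained.

```python
import re
from typing import Callable, List


def split_into_chapters(text: str, chapter_marker: str) -> List[str]:
    """Step 1 (hypothetical helper): regex-based segmentation and cleaning.

    A line matching `chapter_marker` opens a new chapter; anything before
    the first marker and all blank lines are dropped.
    """
    pattern = re.compile(chapter_marker)
    chapters: List[List[str]] = []
    for line in text.splitlines():
        if pattern.search(line):
            chapters.append([])  # a header line opens a new chapter
            continue             # the header itself is not kept
        if chapters and line.strip():
            chapters[-1].append(line.strip())
    return [" ".join(lines) for lines in chapters if lines]


def lead_summarizer(chapter: str, n_sentences: int = 1) -> str:
    """Step 2 stand-in: keep the first `n_sentences` sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", chapter)
    return " ".join(sentences[:n_sentences])


def summarize_document(
    text: str, chapter_marker: str, summarizer: Callable[[str], str]
) -> List[str]:
    """Run any callable summarizer over each extracted chapter."""
    return [summarizer(ch) for ch in split_into_chapters(text, chapter_marker)]


document = """Preface to skip
Chapitre 1 /
Stress has many causes. Some are personal.

Chapitre 2 /
Coping strategies vary. Support helps.
"""

summaries = summarize_document(document, r"^Chapitre \d+ /$", lead_summarizer)
print(summaries)
```

Because `summarize_document` only expects a callable taking a string and returning a string, any model can be swapped in for `lead_summarizer`.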
Lewis Tunstall, Leandro von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers, chapter 6. O'Reilly Media, Inc.
Jesse Vig, Alexander R Fabbri, and Wojciech Kryściński. 2021. Exploring neural models for query-focused summarization. arXiv preprint arXiv:2112.07637.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404-411, Barcelona, Spain. Association for Computational Linguistics.
curl -sSL https://install.python-poetry.org | python3 -
git clone https://github.com/Ayenem/LDS.git
cd LDS/
poetry install
from LDS.textrank import TextRank
from sentence_transformers import SentenceTransformer

# `text` must be defined beforehand: it holds the input to summarize
summarizer = TextRank(
    sentence_encoder=SentenceTransformer("Sahajtomar/french_semantic"),
)
summary = summarizer(text, n_sentences=5)
print(summary)
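For readers unfamiliar with TextRank, the idea behind the summarizer used above can be sketched as follows. This is a minimal illustration, not the code shipped in `LDS.textrank`: it assumes bag-of-words cosine similarity in place of a sentence encoder, uses the normalized PageRank update, and `textrank_sketch` is a hypothetical name.

```python
import math
import re
from collections import Counter
from typing import List


def _bag_of_words(sentence: str) -> Counter:
    """Lowercased word counts; a crude stand-in for sentence embeddings."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank_sketch(
    sentences: List[str],
    n_sentences: int = 2,
    damping: float = 0.85,
    n_iter: int = 50,
) -> List[str]:
    """Rank sentences by centrality in a similarity graph, keep the top ones."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = [_bag_of_words(s) for s in sentences]
    # Weighted, undirected similarity graph without self-loops
    sim = [
        [_cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    # Power iteration of the weighted PageRank update
    for _ in range(n_iter):
        scores = [
            (1 - damping) / n
            + damping
            * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]  # restore document order


sentences = [
    "Stress affects health at work.",
    "Work stress affects performance and health.",
    "Bananas are yellow.",
    "Health and performance at work suffer under stress.",
]
print(textrank_sketch(sentences, n_sentences=2))
```

A sentence that shares no vocabulary with the rest of the text, such as the third one here, receives only the baseline score and is never selected ahead of connected sentences.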
import re

from LDS.book_loader import BookLoader
from LDS.textrank import TextRank
from sentence_transformers import SentenceTransformer

# Assumption: RE_ALPHA is not defined in this snippet; this stand-in matches any single letter
RE_ALPHA = r"[^\W\d_]"

book = BookLoader(
    doc_path="data/D5627-Dolan.docx",  # Word documents are handled
    markers={  # Refer to the BookLoader class docstrings for the role of markers
        "slice": [r"^Introduction$", r"Annexe /$"],
        "chapter": r"^Chapitre \d+ /$|^Conclusion$",
        "headers": r"^Chapitre \d+ /.+"
                   r"|^Introduction$"
                   r"|^Stress, santé et performance au travail$"
                   r"|^Conclusion$",
        "footnotes": re.compile(
            r""".+?[A-Z]\.  # At least one character + a capital letter + a dot
                \s.*?       # + whitespace + any number of characters
                \(\d{4}\)   # + 4 digits within parens
            """, re.VERBOSE),  # e.g. "12 Zuckerman, M. (1971). Dimensions of ..."
        # Literal spaces are escaped below because re.VERBOSE ignores unescaped whitespace
        "undesirables": re.compile(
            r"""^CONFUCIUS$
                |^Matière\ à\ réFlexion$
                |^/\tPost-scriptum$
                |^<www\.pbs\.org/bodyandsoul/218/meditation\.htm>.+?\.$
                |^Source\s:\s
            """, re.VERBOSE),
        # Braces of the regex quantifier are doubled so the f-string leaves them intact
        "citing": re.compile(
            rf"""((?:{RE_ALPHA}){{3,}}?)  # Capture at least 3 alphabetic characters
                 \d+                      # + at least one digit
            """, re.VERBOSE),  # e.g. "cited1"
        "na_span": [
            # Starts with this:
            r"^exerCiCe \d\.\d /$",
            # Ends with any of these:
            r"^Chapitre \d+ /$"
            r"|^Conclusion$"
            r"|^Les caractéristiques personnelles\."
            r"|/\tLocus de contrôle$"
            r"|^L'observation de sujets a amené Rotter"
            r"|^Lorsqu'une personne souffre de stress"],
    },
)
chapters_to_summarize = book.get_chapters(1, 3)
summarizer = TextRank(
    sentence_encoder=SentenceTransformer("Sahajtomar/french_semantic"),
)
chapter_summaries = [summarizer(chapter, n_sentences=10)
                     for chapter in chapters_to_summarize]
print(chapter_summaries)
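To make the role of a marker such as "citing" more concrete, here is a small stand-alone sketch of how such a pattern could strip citation numbers fused to words. How `BookLoader` actually applies its markers is not shown in this README, so the `strip_citation_numbers` helper and the `RE_ALPHA` definition below are assumptions made for illustration only.

```python
import re

# Assumption: a pattern matching any single Unicode letter
RE_ALPHA = r"[^\W\d_]"

# The quantifier braces are doubled ({{3,}}) because this is an f-string;
# single braces would be read as a replacement field, not as a regex quantifier.
citing = re.compile(
    rf"""((?:{RE_ALPHA}){{3,}}?)  # capture at least 3 letters
         \d+                      # followed by one or more digits
    """,
    re.VERBOSE,
)


def strip_citation_numbers(text: str) -> str:
    """Hypothetical helper: keep the word, drop the citation number stuck to it."""
    return citing.sub(r"\1", text)


cleaned = strip_citation_numbers("According to Zuckerman12, stress1 varied in 1971.")
print(cleaned)  # According to Zuckerman, stress varied in 1971.
```

Stand-alone numbers such as the year are left untouched because the pattern requires at least three letters directly before the digits.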
- Write a roadmap
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
Ahmed Moubtahij - @TheAyenem - moub.ahmed@hotmail.com
Project Link: https://github.com/Ayenem/LDS