Releases: pszemraj/textsum
v0.2.2 - new default model
The default model loaded when creating a Summarizer is now BEE-spoke-data/pegasus-x-base-synthsumm_open-16k, which was trained on diverse data (including some code) to be a better generalist than the previous booksum-based default.
As before, you can load the old model (or any model on the Hub) with:

```python
from textsum.summarize import Summarizer

summarizer = Summarizer(model_name_or_path="pszemraj/long-t5-tglobal-base-16384-book-summary")
```
What's Changed
Full Changelog: v0.2.1...v0.2.2
batch processing improvements
A small release with improvements to the Summarizer class for batch-processing use cases.
Let's say you've loaded your Summarizer:

```python
from textsum.summarize import Summarizer

model_name = "pszemraj/pegasus-x-large-book_synthsumm-bf16"  # a recent model
summarizer = Summarizer(model_name)
```

New features/improvements:
Smart __call__ for the Summarizer class:
- Added a smart __call__ method that automatically distinguishes between raw text input and file paths for summarization, allowing easier integration into batch processing and .map() tasks.
```python
# directly passing text to be summarized
summary_text = summarizer("This is a sample text to summarize.")
print(summary_text)

# passing a file path to be summarized
output_filepath = summarizer(
    "/path/to/textfile.extension",
    output_dir="./my-summary-stash",
)
print(output_filepath)
```
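The dispatch logic can be sketched roughly as follows (a hypothetical simplification for illustration, not the library's actual implementation):

```python
import os

def smart_call(source: str) -> str:
    """Hypothetical sketch of a smart __call__ dispatch:
    if the argument is an existing file path, summarize the file;
    otherwise, treat it as raw text to summarize."""
    if os.path.isfile(source):
        return f"summarize file: {source}"
    return f"summarize text: {source}"

print(smart_call("This is raw text."))  # summarize text: This is raw text.
```

This kind of dispatch is what lets the same callable work both for direct strings and inside `.map()` pipelines over file lists.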
Enhanced batch processing controls:
- Introduced disable_progress_bar and batch_delimiter options to improve control over batch processing and output formatting.
```python
from datasets import load_dataset

dataset = load_dataset("Trelis/tiny-shakespeare")
dataset = dataset.map(
    lambda x: {"summary": summarizer(x["text"], disable_progress_bar=True)},
    batched=False,
)  # doesn't spam you with multiple progress bars!!
print(dataset)
```
Note: you can also pass disable_progress_bar=True when instantiating the Summarizer() for cleaner inference.
You can now set the string used to join chunk summaries via the batch_delimiter argument when running inference:

```python
summary_output = summarizer(text, batch_delimiter="<I AM A DELIMITER>")
print(summary_output)
# "Summary of first chunk.<I AM A DELIMITER>Summary of second chunk.<I AM A DELIMITER>Summary of third chunk."
```
By default, the delimiter is "\n\n".
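Because long inputs are chunked and summarized chunk by chunk, a distinctive delimiter makes it trivial to recover the per-chunk summaries afterward. A plain-Python sketch of the idea (not textsum code):

```python
# join per-chunk summaries the way a custom batch_delimiter would
chunk_summaries = ["Summary of first chunk.", "Summary of second chunk."]
delimiter = "<I AM A DELIMITER>"
summary_output = delimiter.join(chunk_summaries)

# split the combined output back into the individual chunk summaries
recovered = summary_output.split(delimiter)
print(recovered)  # ['Summary of first chunk.', 'Summary of second chunk.']
```

Picking a delimiter that cannot appear in model output is what makes the round trip lossless.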
Misc
- Default parameter update: the length_penalty for inference is now 1.0 (was 0.8).
- Code cleanup across modules, mostly for readability and maintainability.
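For intuition on what this change does (this is the standard Hugging Face length-normalization formula for beam search, not textsum-specific code): a hypothesis's cumulative log-probability is divided by length ** length_penalty, so moving from 0.8 to 1.0 removes the mild bias toward shorter outputs:

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    # Hugging Face-style length normalization for beam search hypotheses
    return sum_logprobs / (length ** length_penalty)

# two hypotheses with identical per-token log-probability (-0.5 per token)
short_hyp = beam_score(-10.0, 20, length_penalty=0.8)
long_hyp = beam_score(-20.0, 40, length_penalty=0.8)
print(short_hyp > long_hyp)  # True: penalty < 1.0 favors the shorter hypothesis

# at length_penalty=1.0, the two score identically
print(beam_score(-10.0, 20, 1.0) == beam_score(-20.0, 40, 1.0))  # True
```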
What's Changed
Full Changelog: v0.2.0...v0.2.1
inference optimization ⚗
🦿 This release adds support for some features that can make inference faster:
- support for torch compile & optimum ONNX¹
- improved the textsum-dir command: more options, streamlined, etc.; added the fire package to help with that
- the saved config JSON files are now better structured to keep track of parameters, etc.
- some small adjustments to the Summarizer class
Next up: the UI app will finally get an overhaul.
¹ Please note that "support for" is not equivalent to "I have tested every long-context model with ONNX max quantization and sign off guaranteeing they will all provide accurate results". I've had some good results, but also some strange ones (with Long-T5 specifically). Test beforehand, and file an issue on the Optimum repo as needed 🙏 ↩
support for LLM.int8
On GPU, you can now load models with LLM.int8 quantization to use less memory:
```python
from textsum.summarize import Summarizer

# loads the default model in LLM.int8, taking 1/4 of the memory
summarizer = Summarizer(load_in_8bit=True)
```
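The "1/4 of the memory" figure follows from storing weights as 8-bit integers instead of 32-bit floats. A back-of-the-envelope check (the parameter count below is illustrative, not the real model's):

```python
# rough weight-memory comparison: fp32 vs LLM.int8
params = 250_000_000            # illustrative parameter count
fp32_bytes = params * 4         # float32: 4 bytes per weight
int8_bytes = params * 1         # int8: 1 byte per weight
print(fp32_bytes / int8_bytes)  # 4.0
```

Note this covers weight storage only; activations and the int8 outlier handling add some overhead in practice.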
What's Changed
Full Changelog: v0.1.3...v0.1.5
minor doc & logging updates
Improves docs and logging, and makes it easier to set the inference params from JSON.
What's Changed
Full Changelog: v0.1.2...v0.1.3
pip install textsum
Updated docs reflecting that the package is now on PyPI!

```shell
pip install textsum
```
What's Changed
Full Changelog: v0.1.1...v0.1.2
pypi
Summarizer class object
Easy-to-use Python API courtesy of a class object:
```python
from textsum.summarize import Summarizer

summarizer = Summarizer()  # loads default model and parameters
out_str = summarizer.summarize_string("This is a long string of text that will be summarized.")
print(out_str)
```
What's Changed
Full Changelog: v0.0.5...v0.1
v0.0.5
- Adds a CLI summarization workflow to summarize all text files in a directory: textsum-dir
- The gradio UI demo CLI was updated to textsum-ui
What's Changed
Full Changelog: v0.0.1...v0.0.5