protectai · asofter · Aug 12, 2023
@@ -41,10 +41,9 @@ jobs:
         run: |
           python -m pip install --upgrade pip
           python -m pip install -U -r requirements.txt -r requirements-dev.txt
-          python -m spacy download en_core_web_lg
       - name: Download spacy dictionary
         run: |
-          python -m spacy download en_core_web_lg
+          python -m spacy download en_core_web_trf
       - name: Test with pytest
         run: |
           python -m pytest --exitfirst --verbose --failed-first --cov=.
@@ -10,6 +10,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added
 -
 
+### Fixed
+-
+
+### Changed
+- [Anonymize prompt scanner] Using the transformer-based spaCy model `en_core_web_trf` ([reference](https://microsoft.github.io/presidio/analyzer/nlp_engines/spacy_stanza/))
+- [Anonymize prompt scanner] Supporting faker for applicable entities instead of placeholders (`use_faker` parameter)
+
+### Removed
+-
+
 ## [0.0.3] - 2023-08-10
 
 ### Added

@@ -17,12 +17,12 @@ jailbreak attacks, LLM-Guard ensures that your interactions with LLMs remain saf
 
 ## Installation
 
-Begin your journey with LLM Guard by downloading the package and acquiring the `en_core_web_lg` spaCy model (essential
+Begin your journey with LLM Guard by downloading the package and acquiring the `en_core_web_trf` spaCy model (essential
 for the [Anonymize](./docs/input_scanners/anonymize.md) scanner):
 
 ```sh
 pip install llm-guard
-python -m spacy download en_core_web_lg
+python -m spacy download en_core_web_trf
 ```
 
 ## Getting Started

@@ -24,8 +24,8 @@ cd llm-guard
 pip install -r requirements.txt
 python setup.py install
 
-# download `en_core_web_lg` from SpaCy
-python -m spacy download en_core_web_lg
+# download the spaCy model
+python -m spacy download en_core_web_trf
 ```
 
 ## Testing

@@ -18,6 +18,9 @@ When you use the Anonymize scanner, you can talk to LLMs without worrying about
 The scanner uses a tool called the [Presidio Analyzer](https://github.com/microsoft/presidio/) library. This tool, built
 with Python's spaCy, is really good at finding private info in text.
 
+It uses the transformer-based model `en_core_web_trf`, which has a more modern deep-learning architecture but is
+generally slower than the default `en_core_web_lg` model.
+
 On top of that, the Anonymize scanner can also understand special patterns to catch anything the Presidio Analyzer might
 miss.
 
@@ -54,5 +57,6 @@ Here's what those options do:
 - `hidden_names` are names we change to something like `[REDACTED_CUSTOM_1]`.
 - You can also choose specific types of info to hide using `entity_types`.
 - If you want, you can use your own patterns by giving the path in `regex_pattern_groups_path`.
+- `use_faker` replaces applicable entities with fake values instead of placeholders.
 
-- If you want to see the original info again, you can use the [Deanonymizer](../output_scanners/deanonymize.md).
+If you want to see the original info again, you can use the [Deanonymizer](../output_scanners/deanonymize.md).
@@ -20,6 +20,9 @@ To combat this, it's vital to integrate data sanitization and adopt strict user
 It takes advantage of the [Presidio Analyzer Engine](https://github.com/microsoft/presidio/). Coupled
 with predefined internal patterns, the tool offers robust scanning capabilities.
 
+It uses the transformer-based model `en_core_web_trf`, which has a more modern deep-learning architecture but is
+generally slower than the default `en_core_web_lg` model.
+
 When running, the scanner inspects the model's output for specific entity types that may be considered sensitive. If no
 types are chosen, the tool defaults to scanning for all known entity types, offering comprehensive coverage.
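
Because this change swaps the required model from `en_core_web_lg` to `en_core_web_trf`, existing installs may still have only the old model. spaCy pipelines install as ordinary Python packages, so a minimal check along the following lines could confirm the new model is present before the scanners run. This is a sketch rather than part of LLM Guard: the helper names `is_model_installed` and `download_hint` are hypothetical, and only the standard library is used.

```python
import importlib.util

# Model name taken from this diff; spaCy pipelines install as regular Python packages.
REQUIRED_MODEL = "en_core_web_trf"


def is_model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found (hypothetical helper)."""
    return importlib.util.find_spec(package_name) is not None


def download_hint(package_name: str) -> str:
    """Build the download command shown in the installation docs (hypothetical helper)."""
    return f"python -m spacy download {package_name}"


if __name__ == "__main__":
    if is_model_installed(REQUIRED_MODEL):
        print(f"{REQUIRED_MODEL} is installed")
    else:
        print(f"{REQUIRED_MODEL} is missing; run: {download_hint(REQUIRED_MODEL)}")
```

The check only looks the package up without importing it, so it stays fast even though the transformer model itself is slow to load.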