
SpacyPreprocessor can't load en_core_web_sm #1496

Closed · rjurney opened this issue Oct 18, 2019 · 7 comments

rjurney commented Oct 18, 2019

Issue description

I am having the following problem while running snorkel master/0.9.2+dev in a Jupyter Notebook (link to the notebook in nbviewer; it won't render on GitHub):

# Download the spaCy English model
! python -m spacy download en_core_web_sm

from collections import OrderedDict

from snorkel.preprocess.nlp import SpacyPreprocessor
from snorkel.labeling import LabelingFunction

ABSTAIN = -1

spacy_processor = SpacyPreprocessor(
    text_field='_Lower_Text',
    doc_field='_Doc',
    memoize=True,
)

def keyword_lookup(x, keywords, label):
    if any(word.lower() in x._Doc for word in keywords if len(word) > 2):
        return label
    return ABSTAIN

def make_keyword_lf(keywords, label=ABSTAIN):
    return LabelingFunction(
        name=f"keyword_{keywords}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
        pre=[spacy_processor]
    )


# For each keyword, split on hyphen and create an LF that detects if that tag is present in the data
keyword_lfs = OrderedDict()
for label_set, index in zip(df['_Tag'].unique(), df['_Index'].unique()):
    for label in label_set.split('-'):
        keyword_lfs[label] = make_keyword_lf(label, label=index)

list(keyword_lfs.items())[:5]

I get the following error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-50-03576fa20006> in <module>
      7     text_field='_Lower_Text',
      8     doc_field='_Doc',
----> 9     memoize=True,
     10 )
     11 

~/anaconda3/envs/weak/lib/python3.7/site-packages/snorkel/preprocess/nlp.py in __init__(self, text_field, doc_field, language, disable, pre, memoize)
     60             memoize=memoize,
     61         )
---> 62         self._nlp = spacy.load(language, disable=disable or [])
     63 
     64     def run(self, text: str) -> FieldMap:  # type: ignore

~/anaconda3/envs/weak/lib/python3.7/site-packages/spacy/__init__.py in load(name, **overrides)
     25     if depr_path not in (True, False, None):
     26         deprecation_warning(Warnings.W001.format(path=depr_path))
---> 27     return util.load_model(name, **overrides)
     28 
     29 

~/anaconda3/envs/weak/lib/python3.7/site-packages/spacy/util.py in load_model(name, **overrides)
    169     elif hasattr(name, "exists"):  # Path or Path-like to model data
    170         return load_model_from_path(name, **overrides)
--> 171     raise IOError(Errors.E050.format(name=name))
    172 
    173 

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Code example/repro steps

See Jupyter Notebook for full code.

Expected behavior

I expect the SpacyPreprocessor to instantiate successfully, but it does not.

System info

  • How you installed Snorkel:

requirements.txt reads:

git+git://github.com/snorkel-team/snorkel@master#egg=snorkel

pip freeze | grep snorkel reads: 0.9.2+dev

  • Build command you used (if compiling from source): pip install -r requirements.txt
  • OS: Ubuntu 18.04 LTS bionic Linux
  • Python version: 3.7.4
  • Snorkel version: master / 0.9.2+dev
  • Versions of any other relevant libraries:
absl-py==0.8.0
argh==0.26.2
asn1crypto==0.24.0
astor==0.8.0
attrs==19.2.0
backcall==0.1.0
beautifulsoup4==4.8.1
bert-for-tf2==0.6.0
bleach==3.1.0
blessings==1.7
blis==0.4.1
boto==2.49.0
boto3==1.9.244
botocore==1.12.244
Bottleneck==1.2.1
bz2file==0.98
certifi==2019.9.11
cffi==1.12.3
chardet==3.0.4
Click==7.0
cloudpickle==1.2.2
configparser==4.0.2
cryptography==2.7
cupy-cuda100==6.4.0
cycler==0.10.0
cymem==2.0.2
cytoolz==0.9.0.1
dask==2.5.0
decorator==4.4.0
defusedxml==0.6.0
dill==0.3.1.1
docker-pycreds==0.4.0
docutils==0.15.2
en-core-web-sm==2.2.0
entrypoints==0.3
fast-bert==1.4.2
fastai==1.0.58
fastparquet==0.3.2
fastprogress==0.1.21
fastrlock==0.4
frozendict==1.2
fsspec==0.5.2
gast==0.2.2
gensim==3.8.1
gitdb2==2.0.6
GitPython==3.0.3
google-pasta==0.1.7
gql==0.1.0
graphql-core==2.2.1
grpcio==1.24.1
h5py==2.10.0
idna==2.8
imageio==2.6.0
ipykernel==5.1.2
ipython==7.8.0
ipython-genutils==0.2.0
ipywidgets==7.5.1
iso8601==0.1.12
jedi==0.15.1
Jinja2==2.10.3
jmespath==0.9.4
joblib==0.14.0
jsonschema==3.0.2
jupyter==1.0.0
jupyter-client==5.3.3
jupyter-console==6.0.0
jupyter-core==4.5.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
llvmlite==0.30.0
lxml==4.4.1
Markdown==3.1.1
MarkupSafe==1.1.1
matplotlib==3.1.1
mistune==0.8.4
mkl-fft==1.0.14
mkl-random==1.1.0
mkl-service==2.3.0
mock==3.0.5
msgpack==0.6.1
msgpack-numpy==0.4.3.2
murmurhash==1.0.2
nbconvert==5.6.0
nbformat==4.4.0
nbstripout==0.3.6
networkx==2.3
nltk==3.4.5
notebook==6.0.1
numba==0.46.0
numexpr==2.7.0
numpy==1.17.2
nvidia-ml-py==375.53.1
nvidia-ml-py3==7.352.0
olefile==0.46
opt-einsum==3.1.0
packaging==19.2
pandas==0.25.1
pandocfilters==1.4.2
params-flow==0.7.0
parso==0.5.1
pathtools==0.1.2
patsy==0.5.1
pexpect==4.7.0
pickleshare==0.7.5
Pillow==6.2.0
pip-tools==4.1.0
plac==0.9.6
preshed==3.0.2
prometheus-client==0.7.1
promise==2.2.1
prompt-toolkit==2.0.10
protobuf==3.10.0
psutil==5.6.3
ptyprocess==0.6.0
py-params==0.6.4
py4j==0.10.7
pyarrow==0.14.1
pycparser==2.19
Pygments==2.4.2
pyOpenSSL==19.0.0
pyparsing==2.4.2
pyrsistent==0.15.4
PySocks==1.7.1
pyspark==2.4.4
python-dateutil==2.8.0
pytorch-lamb==1.0.0
pytz==2019.3
PyWavelets==1.0.3
PyYAML==5.1.2
pyzmq==18.1.0
qtconsole==4.5.5
regex==2019.8.19
requests==2.22.0
Rx==1.6.1
s3fs==0.3.5
s3transfer==0.2.1
sacremoses==0.0.35
scikit-image==0.15.0
scikit-learn==0.21.3
scipy==1.3.1
seaborn==0.9.0
Send2Trash==1.5.0
sentencepiece==0.1.83
sentry-sdk==0.12.3
shap==0.31.0
shortuuid==0.5.0
six==1.12.0
smart-open==1.8.4
smmap2==2.0.5
snorkel==0.9.2+dev
soupsieve==1.9.4
spacy==2.2.1
srsly==0.1.0
statsmodels==0.10.1
subprocess32==3.5.4
tensorboard==2.0.0
tensorboardX==1.9
tensorflow==2.0.0
tensorflow-estimator==2.0.0
tensorflow-gpu==2.0.0
tensorflow-hub==0.5.0
termcolor==1.1.0
terminado==0.8.2
testpath==0.4.2
textblob==0.15.3
texttable==1.6.2
thinc==7.1.1
thrift==0.11.0
toolz==0.10.0
torch==1.1.0
torchvision==0.4.0
tornado==6.0.3
tqdm==4.36.1
traitlets==4.3.3
transformers==2.0.0
ujson==1.35
urllib3==1.25.6
wandb==0.8.12
wasabi==0.2.2
watchdog==0.9.0
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.16.0
widgetsnbextension==3.5.1
wrapt==1.11.2

Additional context

Halp :)

rjurney (Author) commented Oct 18, 2019

See also #1458

rjurney (Author) commented Oct 18, 2019

OK, this seems to be a Jupyter problem. The bug occurred when I re-ran a cell that previously had not used/imported SpacyPreprocessor and then did, but after restarting the kernel and running all cells it works fine.

Don't know who to blame but thought I'd let you know :) I'll let you close it.

rjurney (Author) commented Oct 18, 2019

Btw, to reproduce: change the PATH_SET variable to s3 and the notebook should run. Then remove the SpacyPreprocessor import and remove it from pre=, then add it back and re-run.

bhancock8 (Member) commented

Ah, thanks for posting this! I'm sure others will find it helpful. Yep, I believe that after downloading a spaCy model, you need to restart the kernel to have it recognized (i.e., it needs to be present when spaCy is first imported?). If anyone else discovers more here, please leave another comment. And when #1458 gets addressed, there should at least be a more helpful error message when the model isn't present or isn't being recognized.
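
A minimal diagnostic sketch (not part of the original comment) for checking whether the running kernel can actually import the model package; it assumes the en_core_web_sm model from the report above:

import importlib.util

# spaCy models are installed as ordinary Python packages, so spacy.load()
# can only succeed if the running interpreter can import that package.
# If this prints None right after `python -m spacy download en_core_web_sm`,
# the current kernel has not picked up the newly installed package yet.
print(importlib.util.find_spec("en_core_web_sm"))

# Refreshing the import system's caches can make a freshly installed package
# visible without a restart; restarting the kernel remains the sure fix.
importlib.invalidate_caches()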

bhancock8 (Member) commented

Love the notebook, btw @rjurney!

rjurney (Author) commented Oct 19, 2019

@bhancock8 thanks! I found that you can use spacy.cli.download(language) and it works fine. I'm thinking of a second ticket to catch the missing-model exception and download the model automatically.
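
A minimal sketch of that idea, assuming spaCy 2.x; load_spacy_model is a hypothetical helper for illustration, not the actual fix that later landed for #1458:

import importlib

import spacy


def load_spacy_model(name="en_core_web_sm"):
    """Load a spaCy model, downloading it first if it is missing."""
    try:
        return spacy.load(name)
    except OSError:
        # E050: the model is not installed. Download it, refresh the import
        # caches, and retry. Depending on the spaCy version, the freshly
        # installed package may still not be visible until the interpreter
        # restarts, in which case spacy.load() raises E050 again.
        spacy.cli.download(name)
        importlib.invalidate_caches()
        return spacy.load(name)


nlp = load_spacy_model()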

rjurney (Author) commented Oct 19, 2019

Actually, I will fix #1458 myself, since I know just how.
