Extracting Data #1

Open
Mayar2009 opened this issue Feb 8, 2020 · 1 comment

@Mayar2009

When I run

from nltk.tokenize import word_tokenize

# dataset, N and preprocess are defined earlier in the notebook;
# i[0] is the path to a document's text file, i[1] is its title.
processed_text = []
processed_title = []

for i in dataset[:N]:
    file = open(i[0], 'r', encoding="utf8", errors='ignore')
    text = file.read().strip()
    file.close()

    processed_text.append(word_tokenize(str(preprocess(text))))
    processed_title.append(word_tokenize(str(preprocess(i[1]))))
I get the error:


LookupError                               Traceback (most recent call last)
in <module>()
      8     file.close()
      9
---> 10     processed_text.append(word_tokenize((preprocess(text))))
     11     processed_title.append(word_tokenize((preprocess(i[1]))))

6 frames
in preprocess(data)
      3     data = remove_punctuation(data) #remove comma seperately
      4     data = remove_apostrophe(data)
----> 5     data = remove_stop_words(data)
      6     data = convert_numbers(data)
      7     data = stemming(data)

in remove_stop_words(data)
      1 def remove_stop_words(data):
      2     stop_words = stopwords.words('english')
----> 3     words = word_tokenize(str(data))
      4     new_text = ""
      5     for w in words:

/usr/local/lib/python3.6/dist-packages/nltk/tokenize/__init__.py in word_tokenize(text, language, preserve_line)
    126     :type preserver_line: bool
    127     """
--> 128     sentences = [text] if preserve_line else sent_tokenize(text, language)
    129     return [token for sent in sentences
    130             for token in _treebank_word_tokenizer.tokenize(sent)]

/usr/local/lib/python3.6/dist-packages/nltk/tokenize/__init__.py in sent_tokenize(text, language)
     92     :param language: the model name in the Punkt corpus
     93     """
---> 94     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
     95     return tokenizer.tokenize(text)
     96

/usr/local/lib/python3.6/dist-packages/nltk/data.py in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
    832
    833     # Load the resource.
--> 834     opened_resource = _open(resource_url)
    835
    836     if format == 'raw':

/usr/local/lib/python3.6/dist-packages/nltk/data.py in _open(resource_url)
    950
    951     if protocol is None or protocol.lower() == 'nltk':
--> 952         return find(path_, path_ + ['']).open()
    953     elif protocol.lower() == 'file':
    954         # urllib might not use mode='rb', so handle this one ourselves:

/usr/local/lib/python3.6/dist-packages/nltk/data.py in find(resource_name, paths)
    671     sep = '*' * 70
    672     resource_not_found = '\n%s\n%s\n%s\n' % (sep, msg, sep)
--> 673     raise LookupError(resource_not_found)
    674
    675

LookupError:


**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/nltk_data'
    - '/usr/lib/nltk_data'
    - ''
**********************************************************************

Even though I imported nltk and ran nltk.download('punkt'), I still get this error. What should I do?
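
Since the traceback shows the LookupError coming from nltk.data.find, a quick way to confirm whether punkt is actually visible to the runtime that runs the loop is a check like this sketch (it assumes you run it in the same environment, e.g. the same Colab session, that produced the error):

import nltk
from nltk.tokenize import word_tokenize

# Check whether the Punkt models are present in any of the search paths
# listed in the error; download them into the default path if not.
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

# If the resource is available, this tokenizes without raising LookupError.
print(word_tokenize("Punkt is available in this runtime."))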

@ddebrup

ddebrup commented Sep 28, 2020

This thread on Stack Overflow might be of some help:
https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed
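
For reference, the workaround usually cited in that thread targets an "SSL: CERTIFICATE_VERIFY_FAILED" failure during nltk.download; a sketch of it, only worth trying if the download step itself is failing for that reason:

import ssl
import nltk

# Fall back to an unverified HTTPS context so nltk.download can fetch the
# resource when certificate verification fails; affects only this process.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Older Python builds lack this attribute and verify certificates anyway.
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('punkt')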
