-
- import sys
- import nltk
- reload(sys)
- sys.setdefaultencoding('utf-8')
- #nltk.download('punkt')
-
- for line in sys.stdin:
-     for sentence in nltk.sent_tokenize(line.decode('utf8')):
-         print(' '.join(nltk.word_tokenize(sentence)).lower())
-
-
- cat wiki.10k.txt | python process.py | wc
- 26651 434379 3183291
-
- cat wiki.10k.txt | python process.py | ./kenlm/bin/lmplz -o 3 > bible.arpa
-
-
- === 1/5 Counting and sorting n-grams ===
- File stdin isn't normal. Using slower read() instead of mmap(). No progress bar.
- util/scoped.cc:18 in void* util::{anonymous}::InspectAddr(void*, std::size_t, const char*) threw MallocException because `!addr && requested'.
- Cannot allocate memory for 614873440 bytes in malloc
- Try rerunning with a more conservative -S setting than 80%
- Traceback (most recent call last):
-   File "process.py", line 11, in <module>
-     print(' '.join(nltk.word_tokenize(sentence)).lower())
- IOError: [Errno 32] Broken pipe
-
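The MallocException in the log suggests rerunning with a smaller `-S` value, and the broken pipe in `process.py` is most likely just a consequence of `lmplz` exiting early. A minimal sketch of such a rerun, driven from Python, might look like the following. It assumes the tokenized text has first been saved to a file (the name `wiki.10k.tok.txt` used in the note below is hypothetical), that `40%` is low enough for this machine (an untested guess), and that the helper names `lmplz_command` and `build_arpa` are illustrative only, not part of KenLM.

```python
import subprocess


def lmplz_command(order=3, memory="40%", binary="./kenlm/bin/lmplz"):
    """Build the lmplz argument list; -S caps the memory lmplz uses for sorting."""
    return [binary, "-o", str(order), "-S", memory]


def build_arpa(text_path, arpa_path, order=3, memory="40%"):
    """Run lmplz on already-tokenized text and write the ARPA model to arpa_path."""
    with open(text_path, "rb") as src, open(arpa_path, "wb") as dst:
        # check_call raises CalledProcessError if lmplz fails, so an
        # allocation failure is reported here rather than as a broken pipe.
        subprocess.check_call(lmplz_command(order, memory), stdin=src, stdout=dst)
```

A call such as `build_arpa("wiki.10k.tok.txt", "bible.arpa", memory="40%")` would then replace the piped command. `-S` also accepts absolute sizes such as `1G`, which may be easier to reason about on a machine with little free memory.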