Paste #h0wjAx11sxviOW2q53C1 at spacepaste

import sys
import nltk
reload(sys)
sys.setdefaultencoding('utf-8')
#nltk.download('punkt')
for line in sys.stdin:
    for sentence in nltk.sent_tokenize(line.decode('utf8')):
        print(' '.join(nltk.word_tokenize(sentence)).lower())
cat wiki.10k.txt | python process.py | wc
26651 434379 3183291
cat wiki.10k.txt | python process.py |./kenlm/bin/lmplz -o 3 > bible.arpa
=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal. Using slower read() instead of mmap(). No progress bar.
util/scoped.cc:18 in void* util::{anonymous}::InspectAddr(void*, std::size_t, const char*) threw MallocException because `!addr && requested'.
Cannot allocate memory for 614873440 bytes in malloc
Try rerunning with a more conservative -S setting than 80%
Traceback (most recent call last):
  File "process.py", line 11, in <module>
    print(' '.join(nltk.word_tokenize(sentence)).lower())
IOError: [Errno 32] Broken pipe
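
The log above suggests rerunning with a more conservative -S setting. A minimal sketch of one way to do that follows, assuming the tokenized text is first saved to a file rather than piped. The helper names, the 40% value and the wiki.10k.tok.txt file name are assumptions for illustration, not taken from the paste; -o, -S, --text and --arpa are lmplz options. Whether 40% is low enough depends on the machine.

```python
import subprocess


def build_lmplz_command(text_path, arpa_path, order=3, memory="40%",
                        lmplz="./kenlm/bin/lmplz"):
    """Build the lmplz argument list with an explicit sorting-memory cap.

    memory is passed to -S; "40%" is an arbitrary, more conservative value
    than the 80% mentioned in the error message (assumption).
    """
    return [lmplz, "-o", str(order), "-S", memory,
            "--text", text_path, "--arpa", arpa_path]


def run_lmplz(text_path, arpa_path, **kwargs):
    """Run lmplz on an already tokenized file; raises CalledProcessError on failure."""
    subprocess.check_call(build_lmplz_command(text_path, arpa_path, **kwargs))
```

For example, after saving the tokenizer output with `cat wiki.10k.txt | python process.py > wiki.10k.tok.txt`, calling `run_lmplz("wiki.10k.tok.txt", "bible.arpa")` would run the same order-3 estimation. Reading from a regular file should also avoid the Broken pipe error in process.py, which appears because lmplz exited while the script was still writing to the pipe.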
