import sys
import nltk

reload(sys)
sys.setdefaultencoding('utf-8')

#nltk.download('punkt')

for line in sys.stdin:
    for sentence in nltk.sent_tokenize(line.decode('utf8')):
        print(' '.join(nltk.word_tokenize(sentence)).lower())

cat wiki.10k.txt | python process.py | wc
26651 434379 3183291

cat wiki.10k.txt | python process.py |./kenlm/bin/lmplz -o 3 > bible.arpa
=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal. Using slower read() instead of mmap(). No progress bar.
util/scoped.cc:18 in void* util::{anonymous}::InspectAddr(void*, std::size_t, const char*) threw MallocException because `!addr && requested'.
Cannot allocate memory for 614873440 bytes in malloc
Try rerunning with a more conservative -S setting than 80%
Traceback (most recent call last):
  File "process.py", line 11, in <module>
    print(' '.join(nltk.word_tokenize(sentence)).lower())
IOError: [Errno 32] Broken pipe
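
The Python traceback at the end of the log looks like a secondary symptom. lmplz aborted after the failed allocation, so process.py was left printing into a pipe with no reader. The root cause is the memory error, and the log itself suggests a lower -S setting for that. The sketch below only shows one way to make the writer side stop quietly when the reader goes away. It assumes Python 3, and `write_lines` is a hypothetical helper name that is not part of the original script.

```python
import errno
import sys


def write_lines(lines, out=sys.stdout):
    """Write each line to `out`, stopping quietly if the reader has exited.

    Returns the number of lines written before the pipe closed.
    `write_lines` is an illustrative helper, not part of process.py.
    """
    written = 0
    try:
        for line in lines:
            out.write(line + "\n")
            written += 1
        out.flush()
    except IOError as exc:
        # A closed pipe surfaces as errno EPIPE (32), the same
        # "[Errno 32] Broken pipe" seen in the log. Re-raise anything else.
        if exc.errno != errno.EPIPE:
            raise
    return written
```

In process.py, the tokenized sentences could be passed to `write_lines` instead of being printed one by one. This hides only the broken-pipe traceback. The lmplz memory error still has to be fixed separately.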