spacepaste

import sys
import nltk

# Python 2 only: force UTF-8 as the default string encoding.
reload(sys)
sys.setdefaultencoding('utf-8')
#nltk.download('punkt')

# Read raw text from stdin; print one lowercased, tokenized sentence per line.
for line in sys.stdin:
    for sentence in nltk.sent_tokenize(line.decode('utf8')):
        print(' '.join(nltk.word_tokenize(sentence)).lower())

$ cat wiki.10k.txt | python process.py | wc
26651 434379 3183291
$ cat wiki.10k.txt | python process.py | ./kenlm/bin/lmplz -o 3 > bible.arpa
=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal. Using slower read() instead of mmap(). No progress bar.
util/scoped.cc:18 in void* util::{anonymous}::InspectAddr(void*, std::size_t, const char*) threw MallocException because `!addr && requested'.
Cannot allocate memory for 614873440 bytes in malloc
Try rerunning with a more conservative -S setting than 80%
Traceback (most recent call last):
  File "process.py", line 11, in <module>
    print(' '.join(nltk.word_tokenize(sentence)).lower())
IOError: [Errno 32] Broken pipe
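
The Python traceback above is a secondary symptom: lmplz exited after its malloc failure, so the next write to the pipe failed with EPIPE. A minimal sketch of how the writing side could treat a closed pipe as a normal stop condition is shown below. The helper name `write_lines` is hypothetical and not part of process.py, and the sketch does nothing about the memory error itself.

```python
import errno


def write_lines(lines, out):
    """Write each line to `out`; return False if the reader has gone away.

    `write_lines` is an illustrative helper, not part of the original script.
    """
    try:
        for line in lines:
            out.write(line + "\n")
        out.flush()
    except IOError as exc:
        # Python 2 raises IOError with errno 32 (EPIPE) on a broken pipe;
        # on Python 3, BrokenPipeError is a subclass of IOError/OSError.
        if exc.errno != errno.EPIPE:
            raise
        return False
    return True
```

In process.py, the tokenized sentences would be passed as `lines` and `sys.stdout` as `out`; a False return means the downstream command closed the pipe, so its own error output is the place to look.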