Byte-Pair-Encoding-for-Text-Tokenization
PublicThis project implements a Byte Pair Encoding (BPE) algorithm for text tokenization, training it on NLTK's Gutenberg Corpus and evaluating its accuracy, coverage, and F1-score against NLTK's standard punkt tokenizer.