package bytesegmentencoding
Greedy contraction of consecutive n-grams
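The idea of greedy contraction can be illustrated with a small standalone sketch. It does not call this package and is not its implementation. The names `GreedyContractionSketch`, `greedyEncode`, and `decodeTokens` are hypothetical, and the longest-match-first rule is an assumption about what "greedy" means here.

```scala
// Hypothetical sketch of greedy segment contraction; not this package's code.
object GreedyContractionSketch {
  // Same shape as the `encoding` parameter used in this package: byte segment -> token.
  type Encoding = Vector[(Vector[Byte], Char)]

  // At each position, emit the token of the longest matching segment (assumed rule).
  // If no segment matches, emit unknownToken and advance by one byte.
  def greedyEncode(corpus: Array[Byte], encoding: Encoding, unknownToken: Char): Array[Char] = {
    // Longer segments are tried first so that the match is greedy.
    val byLength = encoding.sortBy { case (segment, _) => -segment.length }
    val out = Array.newBuilder[Char]
    var i = 0
    while (i < corpus.length) {
      val hit = byLength.find { case (segment, _) =>
        segment.nonEmpty && corpus.slice(i, i + segment.length).toVector == segment
      }
      hit match {
        case Some((segment, token)) =>
          out += token
          i += segment.length
        case None =>
          out += unknownToken
          i += 1
      }
    }
    out.result()
  }

  // Expand each token back to its segment; tokens without a segment become the unknown byte.
  def decodeTokens(encoded: Array[Char], encoding: Encoding, unknown: Byte): Array[Byte] = {
    val segmentOf = encoding.map { case (segment, token) => token -> segment }.toMap
    encoded.flatMap(token => segmentOf.getOrElse(token, Vector(unknown)))
  }
}
```

In this sketch a byte that matches no segment is lost on encoding and comes back as the unknown byte on decoding, so the round trip is only exact for corpora fully covered by the encoding.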
Type Members
- case class ByteSegmentCodec(trained: Vector[(Vector[Byte], Char)], unknownToken: Char, unknownByte: Byte) extends Codec with Product with Serializable
- case class ByteSegmentCodecFactory(vocabularyMin: Char, vocabularyMax: Char, maxMergedSegmentLength: Int, unknownToken: Char, unknownByte: Byte) extends CodecFactory[ByteSegmentCodec] with Product with Serializable
Value Members
- def decode(encoded: Array[Char], encoding: Vector[(Vector[Byte], Char)], unknown: Byte): Array[Byte]
- def encode(corpus: Array[Byte], encoding: Vector[(Vector[Byte], Char)], unknownToken: Char): Array[Char]
- def readEncodingFromFile(file: File): ByteSegmentEncoding
- def saveEncodingToFile(file: File, encoding: Vector[(Vector[Byte], Char)], unknownToken: Char, unknownByte: Byte): Unit
- def train(corpus: Array[Byte], vocabularyMin: Char, vocabularyMax: Char, maxMergedSegmentLength: Int): Vector[(Vector[Byte], Char)]
Trains a BPE encoding.
Char is used here as an unsigned 16-bit integer.
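For orientation, one merge step of textbook byte pair encoding can be sketched as follows. This is a generic illustration, not the body of `train`: the object `BpeTrainSketch` and its helpers are hypothetical, and how `train` breaks ties, applies `maxMergedSegmentLength`, or assigns tokens between `vocabularyMin` and `vocabularyMax` is not documented here.

```scala
// Hypothetical sketch of a single BPE merge step; not this package's `train`.
object BpeTrainSketch {
  type Segment = Vector[Byte]

  // Start from one segment per byte of the corpus.
  def initialSegments(corpus: Array[Byte]): Vector[Segment] =
    corpus.toVector.map(b => Vector(b))

  // Count adjacent segment pairs and return the most frequent one, if any.
  // Ties are broken arbitrarily in this sketch.
  def mostFrequentPair(segments: Vector[Segment]): Option[(Segment, Segment)] =
    if (segments.length < 2) None
    else {
      val counts = segments.zip(segments.tail)
        .groupBy(identity)
        .map { case (pair, occurrences) => pair -> occurrences.size }
      Some(counts.maxBy(_._2)._1)
    }

  // Replace each non-overlapping occurrence of the pair, left to right, with the merged segment.
  def mergePair(segments: Vector[Segment], pair: (Segment, Segment)): Vector[Segment] = {
    val out = Vector.newBuilder[Segment]
    var i = 0
    while (i < segments.length) {
      if (i + 1 < segments.length && segments(i) == pair._1 && segments(i + 1) == pair._2) {
        out += (pair._1 ++ pair._2)
        i += 2
      } else {
        out += segments(i)
        i += 1
      }
    }
    out.result()
  }
}
```

Repeating these two steps until the token budget is used up yields a list of merged segments; each one could then be paired with a `Char` token, whose unsigned 16-bit range allows at most 65536 distinct values.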