tokenizer

case class KoreanChunk (text: String, offset: Int, length: Int) extends Product with Serializable
case class ParsedChunk (posNodes: Seq[KoreanToken], words: Int, profile: TokenizerProfile = TokenizerProfile.defaultProfile) extends Product with Serializable
A candidate parse for a chunk.
posNodes
Sequence of KoreanTokens.
words
Number of words in this candidate parse.
case class Sentence (text: String, start: Int, end: Int) extends Product with Serializable
case class TokenizerProfile (tokenCount: Float = 0.18f, unknown: Float = 0.3f, wordCount: Float = 0.3f, freq: Float = 0.2f, unknownCoverage: Float = 0.5f, exactMatch: Float = 0.5f, allNoun: Float = 0.1f, unknownPosCount: Float = 10.0f, determinerPosCount: Float = 0.01f, exclamationPosCount: Float = 0.01f, initialPostPosition: Float = 0.2f, haVerb: Float = 0.3f, preferredPattern: Float = 0.6f, preferredPatterns: Seq[Seq[Any]] = ..., spaceGuide: Set[Int] = Set[Int](), spaceGuidePenalty: Float = 3.0f, josaUnmatchedPenalty: Float = 3.0f) extends Product with Serializable
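Because every TokenizerProfile parameter has a default, a profile can be built with no arguments and then adjusted field by field. The sketch below uses a hypothetical stand-in class that copies only four of the documented fields and their defaults, so that it runs without the library on the classpath; the override values are illustrative and are not tuning recommendations.

```scala
// Hypothetical stand-in: mirrors four documented fields and their defaults.
// The real TokenizerProfile in this package has many more parameters.
case class TokenizerProfile(
  tokenCount: Float = 0.18f,
  unknown: Float = 0.3f,
  spaceGuide: Set[Int] = Set[Int](),
  spaceGuidePenalty: Float = 3.0f
)

object ProfileExample {
  // All defaults, as listed in the signature.
  val default: TokenizerProfile = TokenizerProfile()

  // Override selected fields by name; the rest keep their defaults.
  val adjusted: TokenizerProfile =
    default.copy(unknown = 0.5f, spaceGuidePenalty = 5.0f)

  def main(args: Array[String]): Unit = {
    println(default)
    println(adjusted)
  }
}
```

Using copy with named arguments keeps the untouched weights in sync with the defaults, which is usually easier to review than restating every parameter.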

object KoreanChunker
Split input text into Korean Chunks (어절)
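To illustrate how the text, offset and length fields of KoreanChunk relate to the input, here is a minimal sketch that assumes a chunk is simply a maximal run of non-whitespace characters. That is a simplification chosen for this example, not the actual KoreanChunker logic; the case class is redefined locally (without the library) and splitOnWhitespace is a hypothetical helper.

```scala
// Local stand-in with the same fields as the documented KoreanChunk.
case class KoreanChunk(text: String, offset: Int, length: Int)

object ChunkSketch {
  // Assumption: a chunk is any maximal run of non-whitespace characters.
  // offset is the start index in the input; length is the chunk's size.
  def splitOnWhitespace(input: String): Seq[KoreanChunk] =
    "\\S+".r.findAllMatchIn(input).map { m =>
      KoreanChunk(m.matched, m.start, m.end - m.start)
    }.toList

  def main(args: Array[String]): Unit =
    splitOnWhitespace("사랑하는사람을 만났다").foreach(println)
}
```

With this input, the first chunk starts at offset 0 with length 7 and the second starts at offset 8 with length 3, so input.substring(offset, offset + length) recovers each chunk's text.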
object KoreanDetokenizer
Detokenizes a list of tokenized words into a readable sentence.
object KoreanSentenceSplitter
Sentence Splitter
object KoreanTokenizer
Provides Korean tokenization.
Chunk (어절): a unit delimited by whitespace (사랑하는사람을).
Word (단어): a single constituent of a sentence (사랑하는, 사람을).
Token (토큰): a unit similar to a morpheme, but not necessarily grammatically exact (사랑, 하는, 사람, 을).
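The three units nest: a chunk contains words, and each word contains tokens. The sketch below writes out the documentation's own example as plain data to make that nesting explicit; UnitsExample and its field names are hypothetical and are not part of the library.

```scala
object UnitsExample {
  // The chunk from the documentation's example.
  val chunk: String = "사랑하는사람을"

  // Each word paired with its tokens, as given in the example.
  val words: Seq[(String, Seq[String])] = Seq(
    "사랑하는" -> Seq("사랑", "하는"),
    "사람을" -> Seq("사람", "을")
  )

  // Concatenating the words, or all the tokens, gives back the chunk.
  val fromWords: String = words.map(_._1).mkString
  val fromTokens: String = words.flatMap(_._2).mkString

  def main(args: Array[String]): Unit = {
    println(fromWords == chunk)
    println(fromTokens == chunk)
  }
}
```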
Whenever the behavior of KoreanParser changes, the initial cache has to be regenerated by running tools.CreateInitialCache.
object ParsedChunk extends Serializable
object TokenizerProfile extends Serializable
