package tokenizer
Type Members
- case class KoreanChunk (text: String, offset: Int, length: Int) extends Product with Serializable
- case class ParsedChunk (posNodes: Seq[KoreanToken], words: Int, profile: TokenizerProfile = TokenizerProfile.defaultProfile) extends Product with Serializable
A candidate parse for a chunk.
posNodes: Sequence of KoreanTokens.
words: Number of words in this candidate parse.
- case class Sentence (text: String, start: Int, end: Int) extends Product with Serializable
- case class TokenizerProfile (tokenCount: Float = 0.18f, unknown: Float = 0.3f, wordCount: Float = 0.3f, freq: Float = 0.2f, unknownCoverage: Float = 0.5f, exactMatch: Float = 0.5f, allNoun: Float = 0.1f, unknownPosCount: Float = 10.0f, determinerPosCount: Float = 0.01f, exclamationPosCount: Float = 0.01f, initialPostPosition: Float = 0.2f, haVerb: Float = 0.3f, preferredPattern: Float = 0.6f, preferredPatterns: Seq[Seq[Any]] = ..., spaceGuide: Set[Int] = Set[Int](), spaceGuidePenalty: Float = 3.0f, josaUnmatchedPenalty: Float = 3.0f) extends Product with Serializable
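Because every TokenizerProfile field listed above has a default value, a caller would normally override only the weights it cares about. The sketch below illustrates that pattern with a reduced, hypothetical stand-in (`ProfileSketch`) that copies four of the documented fields and their defaults; it is not the library's class, and this page does not state how the weights are combined into a score.

```scala
// Hypothetical, reduced stand-in for TokenizerProfile: only four of the
// documented fields are reproduced, with the defaults listed on this page.
final case class ProfileSketch(
    tokenCount: Float = 0.18f,
    unknown: Float = 0.3f,
    spaceGuide: Set[Int] = Set[Int](),
    spaceGuidePenalty: Float = 3.0f
)

object ProfileSketchDemo {
  // Every field takes its default value.
  val default: ProfileSketch = ProfileSketch()

  // Override a single field by name; the others keep their defaults.
  val strictSpacing: ProfileSketch = ProfileSketch(spaceGuidePenalty = 5.0f)

  // Derive a variant from an existing instance with copy.
  val guided: ProfileSketch = default.copy(spaceGuide = Set(2, 5))

  def main(args: Array[String]): Unit = {
    println(default)
    println(strictSpacing)
    println(guided)
  }
}
```

The same named-argument and `copy` pattern should apply to the real case class, since both are generated for any Scala case class with default parameters.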
Value Members
- object KoreanChunker
Split input text into Korean Chunks (어절)
- object KoreanDetokenizer
Detokenizes a list of tokenized words into a readable sentence.
- object KoreanSentenceSplitter
Sentence Splitter
- object KoreanTokenizer
Provides Korean tokenization.
Chunk (어절): a unit delimited by whitespace (사랑하는사람을).
Word (단어): a single sentence constituent (사랑하는, 사람을).
Token (토큰): a unit similar to a morpheme, though not grammatically exact (사랑, 하는, 사람, 을).
Whenever the behavior of KoreanParser changes, the initial cache has to be updated by running tools.CreateInitialCache.
- object ParsedChunk extends Serializable
- object TokenizerProfile extends Serializable
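The KoreanTokenizer notes define a chunk as a unit delimited by whitespace, and KoreanChunk carries `text`, `offset` and `length` fields. The sketch below shows one way such values could be produced. It uses a hypothetical `ChunkSketch` case class and assumes that `offset` is the start index in the input and `length` is the number of characters; the real KoreanChunker may well split on more than whitespace, so treat this as an illustration of the definition only.

```scala
// Hypothetical stand-in mirroring the documented KoreanChunk fields.
final case class ChunkSketch(text: String, offset: Int, length: Int)

object ChunkSketchDemo {
  // A run of non-whitespace characters is treated as one chunk (assumption).
  private val NonSpace = "\\S+".r

  // Returns each whitespace-delimited run with its start index and length.
  def chunk(input: String): Seq[ChunkSketch] =
    NonSpace.findAllMatchIn(input).map { m =>
      ChunkSketch(m.matched, m.start, m.end - m.start)
    }.toList

  def main(args: Array[String]): Unit = {
    // Two chunks: "사랑하는사람을" at offset 0 and "만났다" at offset 8.
    chunk("사랑하는사람을 만났다").foreach(println)
  }
}
```

Splitting a chunk further into words (사랑하는, 사람을) and tokens (사랑, 하는, 사람, 을) requires dictionary-based parsing and is outside the scope of this sketch.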