kadot package¶
Submodules¶
kadot.bot_engine module¶
kadot.fuzzy module¶
Inspired by FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy
kadot.fuzzy.extract(query: kadot.tokenizers.Tokens, choices: Sequence[kadot.tokenizers.Tokens], best: Optional[int] = None, ratio_function: Callable[[...], float] = <function ratio>) → List[Tuple[str, int]]¶
Find the best (most similar) choices to the query.
Parameters:
- query – a Tokens object.
- choices – a list of Tokens objects.
- best – the number of most similar choices to return. If None (default), the function returns all choices.
- ratio_function – a function computing the similarity between two Tokens objects.
Returns: a list of tuples containing the plain text of each extracted choice and its similarity with the query.
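A minimal usage sketch (the similarity score shown is illustrative, not an exact output):
>>> from kadot.tokenizers import regex_tokenizer, corpus_tokenizer
>>> from kadot.fuzzy import extract
>>> query = regex_tokenizer('I ate the apple')
>>> choices = corpus_tokenizer(['I ate the pear', 'The weather is nice'])
>>> extract(query, choices, best=1)
[('I ate the pear', 0.759)]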
kadot.fuzzy.ratio(s1: Union[str, kadot.tokenizers.Tokens], s2: Union[str, kadot.tokenizers.Tokens]) → float¶
Compute the similarity ratio between two (tokenized or not) strings.
>>> ratio('I ate the apple', 'I ate the pear')
0.759
kadot.fuzzy.token_ratio(s1: kadot.tokenizers.Tokens, s2: kadot.tokenizers.Tokens) → float¶
Compute the similarity ratio between two tokenized strings.
>>> token_ratio(regex_tokenizer('I ate the apple'), regex_tokenizer('the apple I ate'))
0.5
kadot.fuzzy.vocabulary_ratio(s1: kadot.tokenizers.Tokens, s2: kadot.tokenizers.Tokens) → float¶
Compute the similarity ratio of the vocabulary of two tokenized strings.
>>> vocabulary_ratio(regex_tokenizer('I ate the apple'), regex_tokenizer('the apple I ate'))
1.0
kadot.models module¶
kadot.preprocessing module¶
kadot.tokenizers module¶
class kadot.tokenizers.Tokens(text: str, tokens: Sequence[str], delimiters: Optional[Sequence[str]] = None, starts_with_token: Optional[bool] = None, exclude: Optional[Sequence[str]] = None)¶
Bases: kadot.utils.SavedObject
An object representing a tokenized text.
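In practice, Tokens instances are produced by the tokenizers documented below rather than constructed by hand:
>>> from kadot.tokenizers import regex_tokenizer
>>> regex_tokenizer('Hello bob !')
Tokens(['Hello', 'bob'])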
ngrams(n: int) → list¶
Return the n-grams of the text. Based on code found on locallyoptimal.com.
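A sketch of the expected output; that grams come back as tuples of tokens is an assumption (the classic recipe from locallyoptimal.com yields tuples), not something this page documents:
>>> regex_tokenizer('I ate the apple').ngrams(2)
[('I', 'ate'), ('ate', 'the'), ('the', 'apple')]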
rebuild(tokens: Union[Tokens, Sequence[str]]) → str¶
Reconstruct a modified raw text using the word delimiters of the original text.
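A sketch assuming the replacement tokens line up one-to-one with the originals, so the original delimiters can be reused:
>>> tokens = whitespace_tokenizer("Let's try this example.")
>>> tokens.rebuild(['You', 'try', 'that', 'example.'])
'You try that example.'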
kadot.tokenizers.corpus_tokenizer(corpus: List[str], lower: bool = False, exclude: Optional[Sequence[str]] = None, tokenizer: Callable[[...], kadot.tokenizers.Tokens] = <function regex_tokenizer>) → List[kadot.tokenizers.Tokens]¶
Tokenize a whole list of documents (a corpus) using the same tokenizer.
>>> corpus_tokenizer(['Hello bob !', 'Hi John !'])
[Tokens(['Hello', 'bob']), Tokens(['Hi', 'John'])]
kadot.tokenizers.ngram_tokenizer(text: str, n: int = 2, separator: str = '-', lower: bool = False, exclude: Optional[Sequence[str]] = None, tokenizer: Callable[[...], kadot.tokenizers.Tokens] = <function regex_tokenizer>) → kadot.tokenizers.Tokens¶
A “meta” tokenizer that returns n-grams as tokens.
Parameters:
- text – the text to tokenize.
- n – the size of each gram.
- separator – the separator joining words together in a gram.
- lower – if True, the text will be lowercased before it is tokenized.
- exclude – a list of words and/or grams that should not be included after tokenization.
- tokenizer – the word tokenizer to use.
Returns: a Tokens object.
>>> ngram_tokenizer("This is another example.")
Tokens(['This-is', 'is-another', 'another-example'])
kadot.tokenizers.regex_tokenizer(text: str, lower: bool = False, exclude: Optional[Sequence[str]] = None, delimiter: Pattern[AnyStr] = re.compile('[., !?:;()[\\]{}><+\\-*/\\= "\'\r\t\n\x0b\x0c@^¨`~_|]+')) → kadot.tokenizers.Tokens¶
Tokenize using regular expressions.
Parameters:
- text – the text to tokenize.
- lower – if True, the text will be lowercased before it is tokenized.
- exclude – a list of words that should not be included after tokenization.
- delimiter – the regex defining the delimiters between words.
Returns: a Tokens object.
>>> regex_tokenizer("Let's try this example.")
Tokens(['Let', 's', 'try', 'this', 'example'])
kadot.tokenizers.whitespace_tokenizer(text: str, lower: bool = False, exclude: Optional[Sequence[str]] = None) → kadot.tokenizers.Tokens¶
Tokenize assuming that words are separated by whitespace characters.
Parameters:
- text – the text to tokenize.
- lower – if True, the text will be lowercased before it is tokenized.
- exclude – a list of words that should not be included after tokenization.
Returns: a Tokens object.
>>> whitespace_tokenizer("Let's try this example.")
Tokens(["Let's", 'try', 'this', 'example.'])