kadot package

Submodules

kadot.bot_engine module

kadot.fuzzy module

Inspired by FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy

kadot.fuzzy.extract(query: kadot.tokenizers.Tokens, choices: Sequence[kadot.tokenizers.Tokens], best: Optional[int] = None, ratio_function: Callable[[...], float] = <function ratio>) → List[Tuple[str, int]]

Find the choices most similar to the query.

Parameters:
  • query – a Tokens object.
  • choices – a list of Tokens objects.
  • best – the number of most similar choices to return. If None (default), the function returns all choices.
  • ratio_function – a function calculating the similarity between two Tokens objects.
Returns:

a list of tuples, each containing the plain text of an extracted choice and its similarity to the query.
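
A minimal usage sketch, using corpus_tokenizer and regex_tokenizer documented below; the exact similarity score is left as an ellipsis since it depends on the ratio function used:

>>> from kadot.tokenizers import corpus_tokenizer, regex_tokenizer
>>> from kadot.fuzzy import extract
>>> choices = corpus_tokenizer(['I ate the apple', 'The cat sleeps all day'])
>>> extract(regex_tokenizer('I ate an apple'), choices, best=1)
[('I ate the apple', ...)]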

kadot.fuzzy.ratio(s1: Union[str, kadot.tokenizers.Tokens], s2: Union[str, kadot.tokenizers.Tokens]) → float

Compute the similarity ratio between two strings, tokenized or not.

>>> ratio('I ate the apple', 'I ate the pear')
0.759
kadot.fuzzy.token_ratio(s1: kadot.tokenizers.Tokens, s2: kadot.tokenizers.Tokens) → float

Compute the similarity ratio between two tokenized strings.

>>> token_ratio(regex_tokenizer('I ate the apple'), regex_tokenizer('the apple I ate'))
0.5
kadot.fuzzy.vocabulary_ratio(s1: kadot.tokenizers.Tokens, s2: kadot.tokenizers.Tokens) → float

Compute the similarity ratio of the vocabulary of two tokenized strings.

>>> vocabulary_ratio(regex_tokenizer('I ate the apple'), regex_tokenizer('the apple I ate'))
1.0

kadot.models module

kadot.preprocessing module

kadot.tokenizers module

class kadot.tokenizers.Tokens(text: str, tokens: Sequence[str], delimiters: Optional[Sequence[str]] = None, starts_with_token: Optional[bool] = None, exclude: Optional[Sequence[str]] = None)

Bases: kadot.utils.SavedObject

An object representing a tokenized text.

ngrams(n: int) → list

Return the n-grams of the text. Based on code found on locallyoptimal.com.

rebuild(tokens: Union[Tokens, Sequence[str]]) → str

Reconstruct a raw text from (possibly modified) tokens, reusing the word delimiters of the original text.
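
A short usage sketch of these two methods, assuming regex_tokenizer (documented below); since this page does not spell out the exact representation of the returned n-grams or of the rebuilt text, the expected results are only described in comments:

>>> from kadot.tokenizers import regex_tokenizer
>>> tokens = regex_tokenizer("Let's try this example.")
>>> tokens.ngrams(2)  # the text's 2-grams, built from ['Let', 's', 'try', 'this', 'example']
>>> tokens.rebuild(['let', 's', 'try', 'this', 'example'])  # lowercased tokens re-joined using the delimiters of the original text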

kadot.tokenizers.corpus_tokenizer(corpus: List[str], lower: bool = False, exclude: Optional[Sequence[str]] = None, tokenizer: Callable[[...], kadot.tokenizers.Tokens] = <function regex_tokenizer>) → List[kadot.tokenizers.Tokens]

Tokenize a whole list of documents (corpus) using the same tokenizer.

>>> corpus_tokenizer(['Hello bob !', 'Hi John !'])
[Tokens(['Hello', 'bob']), Tokens(['Hi', 'John'])]
kadot.tokenizers.ngram_tokenizer(text: str, n: int = 2, separator: str = '-', lower: bool = False, exclude: Optional[Sequence[str]] = None, tokenizer: Callable[[...], kadot.tokenizers.Tokens] = <function regex_tokenizer>) → kadot.tokenizers.Tokens

A “meta” tokenizer that returns n-grams as tokens.

Parameters:
  • text – the text to tokenize.
  • n – the size of the gram.
  • separator – the separator joining words together in a gram.
  • lower – if True, the text will be written in lowercase before it is tokenized.
  • exclude – a list of words and/or grams that should not be included after tokenization.
  • tokenizer – the word tokenizer to use.
Returns:

a Tokens object.

>>> ngram_tokenizer("This is another example.")
Tokens(['This-is', 'is-another', 'another-example'])
kadot.tokenizers.regex_tokenizer(text: str, lower: bool = False, exclude: Optional[Sequence[str]] = None, delimiter: Pattern[AnyStr] = re.compile('[., !?:;()[\\]{}><+\\-*/\\= "\'\r\t\n\x0b\x0c@^¨`~_|]+')) → kadot.tokenizers.Tokens

Tokenize using regular expressions.

Parameters:
  • text – the text to tokenize.
  • lower – if True, the text will be written in lowercase before it is tokenized.
  • exclude – a list of words that should not be included after tokenization.
  • delimiter – the regex defining the delimiters between words.
Returns:

a Tokens object.

>>> regex_tokenizer("Let's try this example.")
Tokens(['Let', 's', 'try', 'this', 'example'])
kadot.tokenizers.whitespace_tokenizer(text: str, lower: bool = False, exclude: Optional[Sequence[str]] = None) → kadot.tokenizers.Tokens

Tokenize assuming that words are separated by whitespace characters.

Parameters:
  • text – the text to tokenize.
  • lower – if True, the text will be written in lowercase before it is tokenized.
  • exclude – a list of words that should not be included after tokenization.
Returns:

a Tokens object.

>>> whitespace_tokenizer("Let's try this example.")
Tokens(["Let's", 'try', 'this', 'example.'])

kadot.utils module

class kadot.utils.SavedObject

Bases: object

A class whose instances can be saved to a file.

save(filename)
kadot.utils.load_object(filename)
kadot.utils.unique_words(words)
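
A minimal hedged sketch of the save/load round trip, using a Tokens object (a SavedObject subclass, see kadot.tokenizers) and a hypothetical filename:

>>> from kadot.tokenizers import regex_tokenizer
>>> from kadot.utils import load_object
>>> tokens = regex_tokenizer('Hello bob !')
>>> tokens.save('tokens.pickle')             # 'tokens.pickle' is a hypothetical filename
>>> restored = load_object('tokens.pickle')  # restored is an equivalent Tokens object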

kadot.vectorizers module

Module contents