mangoes.vocabulary module
Class to manage the words to be represented in embeddings or used as contexts.
-
class mangoes.vocabulary.Bigram(first, second)
Bases: tuple
Methods
count(value, /): Return number of occurrences of value.
index(value[, start, stop]): Return first index of value.
-
property first
Alias for field number 0
-
property second
Alias for field number 1
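The named-field access described above can be sketched with a stand-in named tuple. This is only an illustration that runs without mangoes installed: building `Bigram` through `collections.namedtuple` is an assumption, not necessarily how the library defines the class.

```python
import collections

# Stand-in for mangoes.vocabulary.Bigram (assumption: a two-field named tuple).
Bigram = collections.namedtuple("Bigram", ["first", "second"])

bigram = Bigram("new", "york")

# Fields are reachable by name or by position.
first_by_name = bigram.first
first_by_position = bigram[0]

# count() and index() are the methods inherited from tuple.
occurrences = bigram.count("york")
position = bigram.index("york")
```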
-
class mangoes.vocabulary.Vocabulary(source, language=None, entity=None)
Bases: object
List of words.
Vocabulary encapsulates a mapping between words and their ids. A Vocabulary can be created from a collection of words.
- Parameters
- source: list or dict
List of words or dict where keys are words and values are their indices
- language: str (optional)
- entity: str or tuple (optional)
if the words are annotated, attribute(s) of each word
- Attributes
- entity
- language
- params
words
Returns the words of the vocabulary as a list
Methods
append(word): Append the word to the vocabulary
extend(other[, inplace, return_map]): Extend the vocabulary with words in other
index(word): Returns the index associated with the word
indices(sentence): Convert words of the sentence to indices
load(path, name): Load the vocabulary from its associated file.
save(path[, name]): Save the vocabulary in a file.
copy()
get_bigrams()
-
FILE_HEADER_PREFIX = '_$'
-
copy()
-
property language
-
property entity
-
property params
-
property words
Returns the words of the vocabulary as a list
-
get_bigrams()
-
index(word)
Returns the index associated with the word
-
indices(sentence)
Convert words of the sentence to indices
If a word isn’t in the vocabulary, its index is replaced with -1
- Parameters
- sentence: list of str
- Returns
- list of int
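The -1 convention for unknown words can be sketched in plain Python. This is a minimal illustration of the documented behaviour, not the library's code; the helper `to_indices` and the sample `word_to_index` mapping are hypothetical.

```python
def to_indices(word_to_index, sentence):
    """Map each word to its index, using -1 for words not in the mapping."""
    return [word_to_index.get(word, -1) for word in sentence]


# Hypothetical vocabulary: words mapped to their indices.
word_to_index = {"the": 0, "cat": 1, "sat": 2}

# "dog" is not in the mapping, so it is encoded as -1.
encoded = to_indices(word_to_index, ["the", "dog", "sat"])
```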
-
append(word)
Append the word to the vocabulary
-
extend(other, inplace=True, return_map=False)
Extend the vocabulary with words in other
- Parameters
- other: list or Vocabulary
- inplace: boolean
If False, create a new Vocabulary
- return_map: boolean
If True, the mapping between the indices of the words from original to merged is returned
- Returns
- Vocabulary or (Vocabulary, dict)
Returns the merged vocabulary and, if return_map is True, the mapping between the indices of the words from original to merged
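One possible reading of `inplace` and `return_map` is sketched below on plain lists. The helper `extend_words` is hypothetical, and the direction of the map (positions in `other` to positions in the merged list) is an assumption, since the description above does not say which vocabulary counts as the original.

```python
def extend_words(words, other, inplace=True, return_map=False):
    """Append the words of `other` that are missing from `words`.

    Assumption: the returned map goes from positions in `other`
    to positions in the merged list.
    """
    merged = words if inplace else list(words)
    positions = {word: i for i, word in enumerate(merged)}
    index_map = {}
    for i, word in enumerate(other):
        if word not in positions:
            positions[word] = len(merged)
            merged.append(word)
        index_map[i] = positions[word]
    return (merged, index_map) if return_map else merged


base = ["a", "b", "c"]
# inplace=False leaves `base` untouched and returns a new list.
merged, index_map = extend_words(base, ["c", "d"], inplace=False, return_map=True)
```

With `inplace=True`, the same call would grow `base` itself instead of returning a copy.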
-
save(path, name='vocabulary')
Save the vocabulary in a file.
- Parameters
- path: str
Local path to the directory where the vocabulary should be written
- name: str
Name of the file to create (without extension)
Warning
If the file already exists, it will be overwritten.
-
classmethod load(path, name)
Load the vocabulary from its associated file.
- Parameters
- path: str
Local path to the directory where the vocabulary file is located
- name: str
Name of the file (without extension)
- Returns
- Vocabulary
-
class mangoes.vocabulary.DynamicVocabulary(source=None, *args, **kwargs)
Bases: mangoes.vocabulary.Vocabulary
Extensible list of words.
A DynamicVocabulary can be created from a collection of words or empty, and each newly encountered word is added to it either explicitly (with
add()
) or implicitly, when testing whether the word is in the vocabulary (which always returns True) or when getting its index.
- Parameters
- source: list or dict (optional)
List of words or dict where keys are words and values are their indices
- language: str (optional)
- entity: str or tuple (optional)
if the words are annotated, attribute(s) of each word
Examples
>>> v = mangoes.vocabulary.DynamicVocabulary()
>>> print(v.words)
[]
>>> v.append('a')
0
>>> print(v.words)
['a']
>>> 'b' in v
True
>>> v.index('b')
1
>>> v.index('c')
2
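The add-on-lookup behaviour shown in the example can be sketched with a small stand-in class. `GrowingWordList` is a hypothetical name and this is not the library's implementation; it only mirrors the documented effects of `index()` and the `in` test.

```python
class GrowingWordList:
    """Hypothetical stand-in: unknown words are added on lookup."""

    def __init__(self, words=None):
        self.words = list(words or [])
        self._index = {word: i for i, word in enumerate(self.words)}

    def index(self, word):
        # Add the word if it is missing, then return its position.
        if word not in self._index:
            self._index[word] = len(self.words)
            self.words.append(word)
        return self._index[word]

    def __contains__(self, word):
        # Membership tests always succeed and register the word as a side effect.
        self.index(word)
        return True


v = GrowingWordList()
first = v.index("a")   # "a" is added at position 0
seen = "b" in v        # "b" is added by the membership test
third = v.index("c")   # "c" is added at position 2
```

Because a membership test mutates the vocabulary, this design suits a single pass over a corpus where every token should receive an id, but it cannot be used to check whether a word was seen before.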
- Attributes
- entity
- language
- params
words
Returns the words of the vocabulary as a list
Methods
append(word): Append the word to the vocabulary
extend(other[, inplace, return_map]): Extend the vocabulary with words in other
index(word): Returns the index associated with the word, adding it to the vocabulary if it is not yet present
indices(sentence): Convert words of the sentence to indices
load(path, name): Load the vocabulary from its associated file.
save(path[, name]): Save the vocabulary in a file.
copy()
get_bigrams()
-
index(word)
Returns the index associated with the word, adding it to the vocabulary if it is not yet present
-
mangoes.vocabulary.merge(*vocabularies, keys=None, concat=<function <lambda>>, return_map=False)
Merge a list of Vocabulary objects into one
- Parameters
- vocabularies: list of vocabularies
- keys: None (default), or bool or list of str
If None or False, words that are common to several vocabularies are considered the same and appear only once in the resulting Vocabulary. If keys is a list of strings, of the same size as vocabularies, all words are prefixed with these keys. If keys is True, the languages of the vocabularies are used as keys.
- concat: callable, optional
Function that takes a key and a word (or a token) as input and returns a new word (or token). If keys are given, this function is called to create the words of the merged vocabulary from the given keys and the original words. The default is '{key}_{word}', which prefixes each word with its key and is only valid for vocabularies of simple string words. Bigrams are transformed by applying this function to both of their parts.
- Returns
- Vocabulary
Examples
>>> import mangoes.vocabulary
>>> v1 = mangoes.vocabulary.Vocabulary(['a', 'b', 'c'], language='l1')
>>> v2 = mangoes.vocabulary.Vocabulary(['a', 'd'], language='l2')
>>> mangoes.vocabulary.merge(v1, v2)
Vocabulary(['a', 'b', 'c', 'd'])
>>> mangoes.vocabulary.merge(v1, v2, keys=True)
Vocabulary(['l1_a', 'l1_b', 'l1_c', 'l2_a', 'l2_d'])
>>> mangoes.vocabulary.merge(v1, v2, keys=['v1', 'v2'])
Vocabulary(['v1_a', 'v1_b', 'v1_c', 'v2_a', 'v2_d'])
With tokens:
>>> import collections
>>> Token = collections.namedtuple('Token', 'lemma POS')
>>> v3 = mangoes.vocabulary.Vocabulary([Token('a', 'X'), Token('b', 'Y'), Token('c', 'X')], language='l1')
>>> v4 = mangoes.vocabulary.Vocabulary([Token('a', 'X'), Token('d', 'Y')], language='l2')
>>> LangToken = collections.namedtuple('LangToken', 'lemma POS lang')
>>> mangoes.vocabulary.merge(v3, v4, keys=True, concat=lambda lang, token: LangToken(*token, lang))  # doctest: +ELLIPSIS
Vocabulary([LangToken(lemma='a', POS='X', lang='l1'), ..., LangToken(lemma='d', POS='Y', lang='l2')])
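The effect of `keys` and `concat` can be sketched on plain lists of words. The helper `merge_words` is hypothetical and handles only `keys=None` or a list of keys (not `keys=True`, which needs the vocabularies' languages); it is not the library's implementation.

```python
def merge_words(word_lists, keys=None,
                concat=lambda key, word: "{}_{}".format(key, word)):
    """Merge lists of words, keeping the first occurrence of each word."""
    merged, seen = [], set()
    for i, words in enumerate(word_lists):
        for word in words:
            # With keys, each word is rewritten through concat before merging.
            new_word = concat(keys[i], word) if keys else word
            if new_word not in seen:
                seen.add(new_word)
                merged.append(new_word)
    return merged


lists = [["a", "b", "c"], ["a", "d"]]
plain = merge_words(lists)                         # shared "a" appears once
prefixed = merge_words(lists, keys=["v1", "v2"])   # "a" is kept per key
```

Prefixing keeps words from different sources distinct, which is why the shared word appears twice in the prefixed result.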
-
mangoes.vocabulary.create_token_filter(fields)
Returns a function to filter the given fields from a token
- Parameters
- fields: str or tuple
name of the field(s) to keep
- Returns
- callable
Examples
>>> Token = mangoes.corpus.BROWN.Token
>>> cat_token = Token(form="cat", lemma="cat", POS="NOUN")
>>> mangoes.vocabulary.create_token_filter("lemma")(cat_token)
'cat'
>>> mangoes.vocabulary.create_token_filter(("lemma", "POS"))(cat_token)
Token(lemma='cat', POS='NOUN')
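A filter with the same observable behaviour as the example can be sketched with `collections.namedtuple`. The helper `make_field_filter` and the locally defined `Token` type are hypothetical stand-ins so the example runs without mangoes; the library may build its filter differently.

```python
import collections


def make_field_filter(fields):
    """Return a function keeping only `fields` from a namedtuple token."""
    if isinstance(fields, str):
        # A single field name yields the bare value.
        return lambda token: getattr(token, fields)
    # Several field names yield a smaller namedtuple with those fields.
    Reduced = collections.namedtuple("Token", fields)
    return lambda token: Reduced(*(getattr(token, f) for f in fields))


# Stand-in token type with the fields used in the example above.
Token = collections.namedtuple("Token", ["form", "lemma", "POS"])
cat_token = Token(form="cat", lemma="cat", POS="NOUN")

lemma = make_field_filter("lemma")(cat_token)
lemma_pos = make_field_filter(("lemma", "POS"))(cat_token)
```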
-
mangoes.vocabulary.create_tokens_filter(fields)
Returns a function to filter the given fields from a list of tokens
-
mangoes.vocabulary.create_bigrams_filter(bigrams=None)
Returns a function to find expected bigrams within a sentence