mangoes.vocabulary module¶

Class to manage the words to be represented in embeddings or used as contexts.

class mangoes.vocabulary.Bigram(first, second)¶

Bases: tuple

Attributes

first: Alias for field number 0
second: Alias for field number 1

Methods

`count`(value, /)	Return number of occurrences of value.
`index`(value[, start, stop])	Return first index of value.

property first¶: Alias for field number 0

property second¶: Alias for field number 1

class mangoes.vocabulary.Vocabulary(source, language=None, entity=None)¶

Bases: object

List of words.

Vocabulary encapsulates a mapping between words and their ids. A Vocabulary can be created from a collection of words.

Parameters

source: list or dict: List of words or dict where keys are words and values are their indices
language: str (optional)
entity: str or tuple (optional): if the words are annotated, attribute(s) of each word

See also

mangoes.corpus.Corpus.create_vocabulary()

Attributes

entity
language
params
words: Returns the words of the vocabulary as a list

Methods

`append`(word)	Append the word to the vocabulary
`extend`(other[, inplace, return_map])	Extend the vocabulary with words in other
`index`(word)	Returns the index associated to the word
`indices`(sentence)	Convert words of the sentence to indices
`load`(path, name)	Load the vocabulary from its associated file.
`save`(path[, name])	Save the vocabulary in a file.

copy
get_bigrams

FILE_HEADER_PREFIX = '_$'¶

copy()¶

property language¶

property entity¶

property params¶

property words¶: Returns the words of the vocabulary as a list

get_bigrams()¶

index(word)¶: Returns the index associated to the word

indices(sentence)¶

Convert words of the sentence to indices

If a word isn’t in the vocabulary, its index is replaced with -1

Parameters

sentence: list of str

Returns

list of int
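
The out-of-vocabulary rule described above can be illustrated with a minimal pure-Python sketch. It does not use mangoes itself: `sentence_to_indices` and the `word_to_index` dict are hypothetical stand-ins for a vocabulary, and only the "-1 for unknown words" behaviour is taken from the description.

```python
def sentence_to_indices(word_to_index, sentence):
    """Map each word to its index, using -1 for words missing from the mapping."""
    return [word_to_index.get(word, -1) for word in sentence]


# Hypothetical vocabulary, stored as word -> index
word_to_index = {"the": 0, "cat": 1, "sat": 2}

result = sentence_to_indices(word_to_index, ["the", "dog", "sat"])
print(result)  # [0, -1, 2]  ('dog' is not in the vocabulary)
```

Callers that feed these indices into an array lookup would need to filter out or handle the -1 entries first, since -1 is a valid (last-element) index in Python sequences.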

append(word)¶: Append the word to the vocabulary

extend(other, inplace=True, return_map=False)¶

Extend the vocabulary with words in other

Parameters

other: list or Vocabulary
inplace: boolean: If False, create a new Vocabulary
return_map: boolean: If True, the mapping between the indices of the words from original to merged is returned

Returns

Vocabulary or (Vocabulary, dict): Returns the merged vocabulary and, if return_map is True, the mapping between the indices of the words from original to merged
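
To make the role of return_map concrete, here is a small sketch on plain lists rather than on Vocabulary objects. The helper `extend_words` is hypothetical, and the direction of the map (position in `other` to position in the merged list) is an assumption made for illustration; the description above does not spell it out.

```python
def extend_words(words, other, return_map=False):
    """Return a new list with the unseen words of `other` appended.

    If return_map is True, also return {index in other: index in merged}.
    The direction of this map is an assumption made for this sketch.
    """
    merged = list(words)
    position = {word: i for i, word in enumerate(merged)}
    index_map = {}
    for i, word in enumerate(other):
        if word not in position:
            # New word: it goes to the end of the merged list
            position[word] = len(merged)
            merged.append(word)
        index_map[i] = position[word]
    return (merged, index_map) if return_map else merged


merged, index_map = extend_words(["a", "b", "c"], ["a", "d"], return_map=True)
print(merged)     # ['a', 'b', 'c', 'd']
print(index_map)  # {0: 0, 1: 3}
```

A map of this kind is useful when data indexed by the old vocabulary (for example rows of a matrix) has to be realigned with the merged one.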

save(path, name='vocabulary')¶

Save the vocabulary in a file.

Parameters

path: str: Local path to the directory where vocabulary should be written
name: str: Name of the file to create (without extension)

Warning

If the file already exists, it will be overwritten.

classmethod load(path, name)¶

Load the vocabulary from its associated file.

Parameters

path: str: Local path to the directory where vocabulary file is located
name: str: Name of the file (without extension)

Returns

Vocabulary

class mangoes.vocabulary.DynamicVocabulary(source=None, *args, **kwargs)¶

Bases: mangoes.vocabulary.Vocabulary

Extensible list of words.

A DynamicVocabulary can be created from a collection of words or empty. Each newly encountered word is added to it, either explicitly (with append()) or implicitly, when testing whether the word is in the vocabulary (which always returns True) or when getting its index.

Parameters

source: list or dict (optional): List of words or dict where keys are words and values are their indices
language: str (optional)
entity: str or tuple (optional): if the words are annotated, attribute(s) of each word

See also

mangoes.corpus.Corpus.create_vocabulary()

Examples

>>> v = mangoes.vocabulary.DynamicVocabulary()
>>> print(v.words)
[]
>>> v.append('a')
0
>>> print(v.words)
['a']
>>> 'b' in v
True
>>> v.index('b')
1
>>> v.index('c')
2
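
The behaviour shown in the example can be mimicked with a few lines of plain Python. The class below is a hypothetical stand-in written for illustration, not the mangoes implementation; it only reproduces the documented effects (membership tests always succeed, and lookups add unknown words).

```python
class GrowingVocabulary:
    """Hypothetical stand-in: unknown words are added on first lookup."""

    def __init__(self, words=()):
        self._index = {}  # word -> index, in insertion order
        for word in words:
            self.append(word)

    @property
    def words(self):
        return list(self._index)

    def append(self, word):
        # setdefault leaves an existing word at its current index
        return self._index.setdefault(word, len(self._index))

    def index(self, word):
        # Looking up an unknown word adds it
        return self.append(word)

    def __contains__(self, word):
        # Testing membership also adds the word, so the answer is always True
        self.append(word)
        return True


v = GrowingVocabulary()
v.append("a")
print("b" in v)      # True, and 'b' is now stored
print(v.index("c"))  # 2
print(v.words)       # ['a', 'b', 'c']
```

One consequence of this design is that a membership test has a side effect, so a typo in a queried word silently becomes a vocabulary entry.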

Attributes

entity
language
params
words: Returns the words of the vocabulary as a list

Methods

`append`(word)	Append the word to the vocabulary
`extend`(other[, inplace, return_map])	Extend the vocabulary with words in other
`index`(word)	Returns the index associated to the word, adding it to the vocabulary if it is not already present
`indices`(sentence)	Convert words of the sentence to indices
`load`(path, name)	Load the vocabulary from its associated file.
`save`(path[, name])	Save the vocabulary in a file.

copy
get_bigrams

index(word)¶: Returns the index associated to the word, adding it to the vocabulary if it is not already present

mangoes.vocabulary.merge(*vocabularies, keys=None, concat=<function <lambda>>, return_map=False)¶

Merge a list of Vocabulary into one

Parameters

vocabularies: list of vocabularies
keys: None (default), or bool or list of str: If None or False, words that are common to several vocabularies are considered the same and appear only once in the resulting Vocabulary. If keys is a list of strings, of the same size as the list of vocabularies, all words are prefixed with these keys. If keys is True, the languages of the vocabularies are used as keys.
concat: callable, optional: Function that takes a key and a word (or a token) as input and returns a new word (or token). If keys are given, this function is called to create the words of the merged vocabulary from the given keys and the original words. The default is '{key}_{word}', which prefixes each word with its key and is only valid for vocabularies of simple string words. Bigrams are transformed by applying this function to both of their parts.

Returns

Vocabulary

Examples

>>> import mangoes.vocabulary
>>> from mangoes.vocabulary import merge
>>> v1 = mangoes.Vocabulary(['a', 'b', 'c'], language='l1')
>>> v2 = mangoes.Vocabulary(['a', 'd'], language='l2')
>>> merge(v1, v2)
Vocabulary(['a', 'b', 'c', 'd'])
>>> merge(v1, v2, keys=True)
Vocabulary(['l1_a', 'l1_b', 'l1_c', 'l2_a', 'l2_d'])
>>> merge(v1, v2, keys=['v1', 'v2'])
Vocabulary(['v1_a', 'v1_b', 'v1_c', 'v2_a', 'v2_d'])

With tokens:

>>> import collections
>>> Token = collections.namedtuple('Token', 'lemma POS')
>>> v3 = mangoes.Vocabulary([Token('a', 'X'), Token('b', 'Y'), Token('c', 'X')], language='l1')
>>> v4 = mangoes.Vocabulary([Token('a', 'X'), Token('d', 'Y')], language='l2')
>>> LangToken = collections.namedtuple('LangToken', 'lemma POS lang')
>>> merge(v3, v4, keys=True, concat=lambda lang, token: LangToken(*token, lang)) #doctest: +ELLIPSIS
Vocabulary([LangToken(lemma='a', POS='X', lang='l1'), ..., LangToken(lemma='d', POS='Y', lang='l2')])
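
The treatment of bigrams when keys are given is not covered by the examples above, so here is a rough sketch on plain lists. Everything in it is hypothetical and written for illustration: `merge_words`, `apply_key`, `default_concat` and the local `Bigram` namedtuple stand in for the library objects, and only the documented rules (deduplication without keys, prefixing with keys, concat applied to both parts of a bigram) are taken from the description.

```python
import collections

# Local stand-in for a two-field bigram tuple
Bigram = collections.namedtuple("Bigram", "first second")


def default_concat(key, word):
    """Prefix a simple string word with its key."""
    return "{}_{}".format(key, word)


def apply_key(key, word, concat=default_concat):
    """Apply concat to a word; for a Bigram, apply it to both parts."""
    if isinstance(word, Bigram):
        return Bigram(concat(key, word.first), concat(key, word.second))
    return concat(key, word)


def merge_words(word_lists, keys=None, concat=default_concat):
    """Merge lists of words, keeping the first occurrence of each word."""
    merged, seen = [], set()
    for i, words in enumerate(word_lists):
        for word in words:
            new_word = apply_key(keys[i], word, concat) if keys else word
            if new_word not in seen:
                seen.add(new_word)
                merged.append(new_word)
    return merged


v1 = ["a", Bigram("a", "b")]
v2 = ["a", "d"]

no_keys = merge_words([v1, v2])
with_keys = merge_words([v1, v2], keys=["l1", "l2"])
print(no_keys)    # ['a', Bigram(first='a', second='b'), 'd']
print(with_keys)  # ['l1_a', Bigram(first='l1_a', second='l1_b'), 'l2_a', 'l2_d']
```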

mangoes.vocabulary.create_token_filter(fields)¶

Returns a function to filter the given fields from a token

Parameters

fields: str or tuple: name of the field(s) to keep

Returns

callable

Examples

>>> Token = mangoes.corpus.BROWN.Token
>>> cat_token = Token(form="cat", lemma="cat", POS="NOUN")
>>> mangoes.vocabulary.create_token_filter("lemma")(cat_token)
'cat'
>>> mangoes.vocabulary.create_token_filter(("lemma", "POS"))(cat_token)
Token(lemma='cat', POS='NOUN')

mangoes.vocabulary.create_tokens_filter(fields)¶: Returns a function to filter the given fields from a list of tokens
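
As an illustration of what a list-level filter could look like, the sketch below builds one on top of a single-token filter. The names `make_token_filter` and `make_tokens_filter` and the local `Token` type are hypothetical, and returning a list is an assumption; the single-token behaviour follows the create_token_filter example (one field gives the bare value, several fields give a reduced named tuple).

```python
import collections

# Local stand-in for an annotated token type
Token = collections.namedtuple("Token", "form lemma POS")


def make_token_filter(fields):
    """Keep one field (bare value) or several (smaller named tuple)."""
    if isinstance(fields, str):
        return lambda token: getattr(token, fields)
    Reduced = collections.namedtuple("Token", fields)
    return lambda token: Reduced(*(getattr(token, f) for f in fields))


def make_tokens_filter(fields):
    """Apply the single-token filter to every token of a list."""
    token_filter = make_token_filter(fields)
    return lambda tokens: [token_filter(t) for t in tokens]


sentence = [Token("cats", "cat", "NOUN"), Token("sleep", "sleep", "VERB")]

lemmas = make_tokens_filter("lemma")(sentence)
pairs = make_tokens_filter(("lemma", "POS"))(sentence)
print(lemmas)    # ['cat', 'sleep']
print(pairs[0])  # Token(lemma='cat', POS='NOUN')
```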

mangoes.vocabulary.create_bigrams_filter(bigrams=None)¶: Returns a function to find expected bigrams within a sentence