mangoes.vocabulary module

Classes to manage the words to be represented in embeddings or used as contexts.

class mangoes.vocabulary.Bigram(first, second)

Bases: tuple

Attributes
first

Alias for field number 0

second

Alias for field number 1

Methods

count(value, /)

Return number of occurrences of value.

index(value[, start, stop])

Return first index of value.

property first

Alias for field number 0

property second

Alias for field number 1
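
The field aliases and inherited tuple methods listed above are what a two-field named tuple provides. The following is a minimal sketch using a stand-in built with `collections.namedtuple`; it is assumed to behave like `Bigram` and is not taken from the library source.

```python
import collections

# Stand-in for mangoes.vocabulary.Bigram: a two-field named tuple (assumption).
Bigram = collections.namedtuple("Bigram", ["first", "second"])

b = Bigram("New", "York")
print(b.first, b.second)   # fields accessed by name
print(b[0], b[1])          # the same fields accessed by position
print(b.count("New"))      # tuple method: number of occurrences of a value
print(b.index("York"))     # tuple method: first index of a value
```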

class mangoes.vocabulary.Vocabulary(source, language=None, entity=None)

Bases: object

List of words.

Vocabulary encapsulates a mapping between words and their ids. A Vocabulary can be created from a collection of words.

Parameters
source: list or dict

List of words or dict where keys are words and values are their indices

language: str (optional)
entity: str or tuple (optional)

if the words are annotated, attribute(s) of each word

Attributes
entity
language
params
words

Returns the words of the vocabulary as a list

Methods

append(word)

Append the word to the vocabulary

extend(other[, inplace, return_map])

Extend the vocabulary with words in other

index(word)

Returns the index associated with the word

indices(sentence)

Convert words of the sentence to indices

load(path, name)

Load the vocabulary from its associated file.

save(path[, name])

Save the vocabulary in a file.

copy

get_bigrams

FILE_HEADER_PREFIX = '_$'
copy()
property language
property entity
property params
property words

Returns the words of the vocabulary as a list

get_bigrams()
index(word)

Returns the index associated with the word

indices(sentence)

Convert words of the sentence to indices

If a word isn’t in the vocabulary, its index is replaced with -1

Parameters
sentence: list of str
Returns
list of int
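
The -1 convention for unknown words can be illustrated without the library. This sketch uses a plain dict in place of a Vocabulary; the helper name `indices` and the sample words are hypothetical.

```python
def indices(word_to_index, sentence):
    """Map each word of `sentence` to its index, or -1 if the word is unknown."""
    return [word_to_index.get(word, -1) for word in sentence]


# A plain dict stands in for the word-to-id mapping (assumption).
word_to_index = {"the": 0, "cat": 1, "sat": 2}
result = indices(word_to_index, ["the", "dog", "sat"])
print(result)  # [0, -1, 2]
```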
append(word)

Append the word to the vocabulary

extend(other, inplace=True, return_map=False)

Extend the vocabulary with words in other

Parameters
other: list or Vocabulary
inplace: boolean

If False, create a new Vocabulary

return_map: boolean

If True, the mapping between the indices of the words from original to merged is returned

Returns
Vocabulary or (Vocabulary, dict)

Returns the merged vocabulary and, if return_map is True, the mapping between the indices of the words from original to merged
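
As a rough illustration of what an index mapping can look like, the sketch below extends a plain list of words and records where each word of `other` ends up. The helper `extend_words` is hypothetical, and the direction of the map (indices in `other` to indices in the merged list) is an assumption about the wording "from original to merged".

```python
def extend_words(words, other, return_map=False):
    """Return a new list with the unseen words of `other` appended.

    If return_map is True, also return {index in other: index in merged}.
    """
    merged = list(words)
    position = {word: i for i, word in enumerate(merged)}
    index_map = {}
    for i, word in enumerate(other):
        if word not in position:
            # New word: it takes the next free index.
            position[word] = len(merged)
            merged.append(word)
        index_map[i] = position[word]
    return (merged, index_map) if return_map else merged


merged, index_map = extend_words(["a", "b", "c"], ["c", "d", "a"], return_map=True)
print(merged)     # ['a', 'b', 'c', 'd']
print(index_map)  # {0: 2, 1: 3, 2: 0}
```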

save(path, name='vocabulary')

Save the vocabulary in a file.

Parameters
path: str

Local path to the directory where vocabulary should be written

name: str

Name of the file to create (without extension)

Warning

If the file already exists, it will be overwritten.

classmethod load(path, name)

Load the vocabulary from its associated file.

Parameters
path: str

Local path to the directory where vocabulary file is located

name: str

Name of the file (without extension)

Returns
Vocabulary
class mangoes.vocabulary.DynamicVocabulary(source=None, *args, **kwargs)

Bases: mangoes.vocabulary.Vocabulary

Extensible list of words.

A DynamicVocabulary can be created from a collection of words or empty. Each newly encountered word is added to it, either explicitly (with append()) or implicitly, when testing whether the word is in the vocabulary (which always returns True) or when getting its index.

Parameters
source: list or dict (optional)

List of words or dict where keys are words and values are their indices

language: str (optional)
entity: str or tuple (optional)

if the words are annotated, attribute(s) of each word

Examples

>>> v = mangoes.vocabulary.DynamicVocabulary()
>>> print(v.words)
[]
>>> v.append('a')
0
>>> print(v.words)
['a']
>>> 'b' in v
True
>>> v.index('b')
1
>>> v.index('c')
2
Attributes
entity
language
params
words

Returns the words of the vocabulary as a list

Methods

append(word)

Append the word to the vocabulary

extend(other[, inplace, return_map])

Extend the vocabulary with words in other

index(word)

Returns the index associated with the word, adding the word to the vocabulary if it is not yet present

indices(sentence)

Convert words of the sentence to indices

load(path, name)

Load the vocabulary from its associated file.

save(path[, name])

Save the vocabulary in a file.

copy

get_bigrams

index(word)

Returns the index associated with the word, adding the word to the vocabulary if it is not yet present
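
The add-on-lookup behaviour can be sketched in a few lines of plain Python. The class `DynamicWordList` below is a hypothetical stand-in written for illustration, not the mangoes implementation; it reproduces the sequence shown in the class example.

```python
class DynamicWordList:
    """Hypothetical stand-in for a vocabulary that grows on lookup."""

    def __init__(self, source=None):
        self.words = list(source or [])
        self._index = {word: i for i, word in enumerate(self.words)}

    def append(self, word):
        """Add `word` if it is missing and return its index."""
        if word not in self._index:
            self._index[word] = len(self.words)
            self.words.append(word)
        return self._index[word]

    def __contains__(self, word):
        # A membership test registers the word, so it always succeeds.
        self.append(word)
        return True

    def index(self, word):
        # Looking up an unknown word adds it first.
        return self.append(word)


v = DynamicWordList()
v.append("a")
print("b" in v)      # True
print(v.index("b"))  # 1
print(v.index("c"))  # 2
print(v.words)       # ['a', 'b', 'c']
```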

mangoes.vocabulary.merge(*vocabularies, keys=None, concat=<function <lambda>>, return_map=False)

Merge a list of Vocabulary into one

Parameters
vocabularies: list of vocabularies
keys: None (default), or bool or list of str

If None or False, words that are common to several vocabularies are considered the same and appear only once in the resulting Vocabulary. If keys is a list of strings, of the same size as vocabularies, all words are prefixed with these keys. If keys is True, the languages of the vocabularies are used as keys.

concat: callable, optional

Function that takes a key and a word (or a token) as input and returns a new word (or token). If keys are given, this function is called to create the words of the merged vocabulary from the given keys and the original words. The default is '{key}_{word}', which prefixes each word with its key and is only valid for vocabularies of simple string words. Bigrams are transformed by applying this function to both of their parts.

Returns
Vocabulary

Examples

>>> import mangoes.vocabulary
>>> from mangoes.vocabulary import merge
>>> v1 = mangoes.Vocabulary(['a', 'b', 'c'], language='l1')
>>> v2 = mangoes.Vocabulary(['a', 'd'], language='l2')
>>> merge(v1, v2)
Vocabulary(['a', 'b', 'c', 'd'])
>>> merge(v1, v2, keys=True)
Vocabulary(['l1_a', 'l1_b', 'l1_c', 'l2_a', 'l2_d'])
>>> merge(v1, v2, keys=['v1', 'v2'])
Vocabulary(['v1_a', 'v1_b', 'v1_c', 'v2_a', 'v2_d'])

With tokens:

>>> import collections
>>> Token = collections.namedtuple('Token', 'lemma POS')
>>> v3 = mangoes.Vocabulary([Token('a', 'X'), Token('b', 'Y'), Token('c', 'X')], language='l1')
>>> v4 = mangoes.Vocabulary([Token('a', 'X'), Token('d', 'Y')], language='l2')
>>> LangToken = collections.namedtuple('LangToken', 'lemma POS lang')
>>> merge(v3, v4, keys=True, concat=lambda lang, token: LangToken(*token, lang)) #doctest: +ELLIPSIS
Vocabulary([LangToken(lemma='a', POS='X', lang='l1'), ..., LangToken(lemma='d', POS='Y', lang='l2')])
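
To make the roles of keys and concat concrete, here is a minimal sketch that merges plain lists of words. The helper `merge_words` is hypothetical and only handles keys given as None or as a list; it does not cover keys=True, bigrams, or return_map.

```python
def merge_words(*word_lists, keys=None, concat=lambda key, word: f"{key}_{word}"):
    """Merge word lists, dropping duplicates and keeping first-seen order."""
    merged, seen = [], set()
    for position, words in enumerate(word_lists):
        for word in words:
            if keys is not None:
                # Build the merged word from the list's key and the original word.
                word = concat(keys[position], word)
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


print(merge_words(["a", "b", "c"], ["a", "d"]))
# ['a', 'b', 'c', 'd']
print(merge_words(["a", "b", "c"], ["a", "d"], keys=["v1", "v2"]))
# ['v1_a', 'v1_b', 'v1_c', 'v2_a', 'v2_d']
```

Without keys, the shared word 'a' appears once; with keys, each copy gets its own prefix and both are kept.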

mangoes.vocabulary.create_token_filter(fields)

Returns a function to filter the given fields from a token

Parameters
fields: str or tuple

Name of the field(s) to keep

Returns
callable

Examples

>>> Token = mangoes.corpus.BROWN.Token
>>> cat_token = Token(form="cat", lemma="cat", POS="NOUN")
>>> mangoes.vocabulary.create_token_filter("lemma")(cat_token)
'cat'
>>> mangoes.vocabulary.create_token_filter(("lemma", "POS"))(cat_token)
Token(lemma='cat', POS='NOUN')
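
One way such a filter could be written for named-tuple tokens is sketched below. The function `make_token_filter` and the locally defined `Token` type are hypothetical stand-ins, so the sketch runs without the library or the BROWN corpus reader.

```python
import collections


def make_token_filter(fields):
    """Return a callable that keeps only `fields` of a named-tuple token."""
    if isinstance(fields, str):
        # Single field: return the bare value.
        return lambda token: getattr(token, fields)
    # Several fields: return a smaller named tuple with just those fields.
    Filtered = collections.namedtuple("Token", fields)
    return lambda token: Filtered(*(getattr(token, f) for f in fields))


# Local stand-in for an annotated token type (assumption).
Token = collections.namedtuple("Token", ["form", "lemma", "POS"])
cat_token = Token(form="cat", lemma="cat", POS="NOUN")

print(make_token_filter("lemma")(cat_token))           # cat
print(make_token_filter(("lemma", "POS"))(cat_token))  # Token(lemma='cat', POS='NOUN')
```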
mangoes.vocabulary.create_tokens_filter(fields)

Returns a function to filter the given fields from a list of tokens

mangoes.vocabulary.create_bigrams_filter(bigrams=None)

Returns a function to find expected bigrams within a sentence
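
The description above does not specify the returned function's exact behaviour. As one possible reading, the sketch below scans adjacent word pairs and keeps those found in the expected set. The name `make_bigrams_filter`, the local `Bigram` stand-in, and the choice to keep every adjacent pair when bigrams is None are all assumptions.

```python
import collections

# Local stand-in for mangoes.vocabulary.Bigram (assumption).
Bigram = collections.namedtuple("Bigram", ["first", "second"])


def make_bigrams_filter(bigrams=None):
    """Return a function that lists the expected bigrams found in a sentence."""
    expected = None if bigrams is None else {tuple(b) for b in bigrams}

    def find(sentence):
        # Pair each word with its right-hand neighbour.
        pairs = zip(sentence, sentence[1:])
        return [Bigram(*p) for p in pairs if expected is None or p in expected]

    return find


find = make_bigrams_filter([("New", "York")])
found = find(["I", "love", "New", "York"])
print(found)  # [Bigram(first='New', second='York')]
```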