mangoes.base module

Base classes to define and create word representations.

This module provides the abstract base class Representation, its two implementations (Embeddings and CountBasedRepresentation), and the main function create_representation to construct one.

class mangoes.base.Representation(words, matrix, hyperparameters=None)

Bases: abc.ABC

Abstract base class for a Representation.

Parameters

words: mangoes.Vocabulary: words represented as vectors (rows of the matrix)
matrix: mangoes.utils.arrays.Matrix: vectors representing words

Attributes

params: Dict of the parameters used to build this matrix, if available
shape

Methods

`distance`(word1, word2[, metric])	Returns the distance between two words
`get_closest_words`(word[, nb, metric])	Returns the closest words and their distances from the given word
`load`(path)	Load a Representation
`pairwise_distances`(words[, other_words, metric])	Compute the distance matrix from one or two list(s) of words.
`pprint`([display])	Pretty print the matrix with labels for rows and columns
`save`(path)	Save the Representation
`to_df`()	Returns a pandas DataFrame representation of this matrix

analogy
word_mover_distance

property shape

property params: Dict of the parameters used to build this matrix, if available

pprint(display=print): Pretty print the matrix with labels for rows and columns

abstract to_df(): Returns a pandas DataFrame representation of this matrix

classmethod load(path)

Load a Representation

Depending on the files found in the path, the returned Representation is either a CountBasedRepresentation object or an Embeddings object.

Parameters

path: str: Path to a folder or an archive

save(path)

Save the Representation

Parameters

path: str: Path to a folder or an archive. It is created if it does not exist.

distance(word1, word2, metric='cosine', **kwargs)

Returns the distance between two words

This function relies on the sklearn.metrics.pairwise_distances function, so you can use any metric it supports.

Parameters

word1: str: First word
word2: str: Second word
metric: str: The metric to use
**kwargs: optional keyword parameters: Any further parameters are passed directly to the distance function.
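
As a rough illustration of what the default 'cosine' metric measures, the sketch below computes the cosine distance between two word vectors in plain Python. The toy vectors and the helper names (cosine_distance, vectors, distance) are made up for this example; the library itself delegates the computation to scikit-learn rather than using code like this.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors invented for illustration: one row per word.
vectors = {
    "cat": [1.0, 0.0],
    "dog": [1.0, 1.0],
    "car": [0.0, 2.0],
}


def distance(word1, word2):
    """Look up both words, then compare their vectors."""
    return cosine_distance(vectors[word1], vectors[word2])


print(distance("cat", "dog"))  # vectors at 45 degrees: small distance
print(distance("cat", "car"))  # orthogonal vectors: distance of 1.0
```

Given the signature above, the corresponding call on a loaded representation would presumably look like representation.distance('cat', 'dog', metric='cosine').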

pairwise_distances(words, other_words=None, metric='cosine', **kwargs)

Compute the distance matrix from one or two list(s) of words.

This method takes a list of words and returns the matrix of distances between them. If other_words is given (default is None), the returned matrix instead holds the pairwise distances between the words in words and those in other_words.

This function relies on the sklearn.metrics.pairwise_distances function, so you can use any metric it supports.

Valid values for metric are:

From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]. These metrics support sparse matrix inputs.
From scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

See the documentation for sklearn.metrics.pairwise_distances for details on these metrics.

Parameters

words: list of str: List of words
other_words: list of str, optional: An optional second list.
metric: string or callable: The metric to use when calculating the distance between word vectors.
**kwargs: optional keyword parameters: Any further parameters are passed directly to the distance function.

Returns

array [len(words), len(words)] or [len(words), len(other_words)]: A distance matrix D such that D_{i, j} is the distance between the ith and jth words of the words list, if other_words is None. If other_words is not None, then D_{i, j} is the distance between the ith word from words and the jth word from other_words.
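
The two possible shapes of the returned matrix can be sketched in plain Python as follows. This is only an illustration of the indexing convention described above: the toy vectors, the euclidean helper, and the pairwise function are assumptions of this example, and nested lists stand in for the array the library returns.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy vectors invented for illustration.
vectors = {"cat": [0.0, 0.0], "dog": [3.0, 4.0], "car": [6.0, 8.0]}


def pairwise(words, other_words=None, metric=euclidean):
    """Return D where D[i][j] compares words[i] with other_words[j].

    When other_words is None, words is compared with itself,
    which gives a square matrix.
    """
    if other_words is None:
        other_words = words
    return [[metric(vectors[w], vectors[o]) for o in other_words] for w in words]


square = pairwise(["cat", "dog", "car"])  # 3 x 3
rect = pairwise(["cat", "dog"], ["car"])  # 2 x 1

print(square)
print(rect)
```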

word_mover_distance(sentence1, sentence2, stopwords=None, metric='euclidean', emd=None)

get_closest_words(word, nb=10, metric='cosine')

Returns the closest words and their distances from the given word

Returns

list: list of tuples (word, distance) sorted by distance

Parameters

word: str, Token or vector: a word or a vector representing a word
nb: int: the number of neighbors to return
metric: str or callable: the metric to use (see pairwise_distances method)
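
A nearest-neighbour lookup of this kind can be sketched as: compute the distance from the query vector to every other vector, sort, and keep the first nb results. The code below is a minimal plain-Python version under stated assumptions: the toy vectors and the closest_words helper are hypothetical, and excluding the query word from its own neighbours is a choice made for this example, not documented behaviour.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


# Toy vectors invented for illustration.
vectors = {"cat": [1.0, 0.0], "dog": [1.0, 1.0], "car": [0.0, 1.0]}


def closest_words(word, nb=10, metric=cosine_distance):
    """Return up to nb (word, distance) tuples sorted by increasing distance."""
    target = vectors[word]
    scored = [
        (other, metric(target, vector))
        for other, vector in vectors.items()
        if other != word  # assumption: the query word is left out
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:nb]


neighbours = closest_words("cat", nb=2)
print(neighbours)
```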

analogy(question)

class mangoes.base.Embeddings(words, matrix, hyperparameters=None)

Bases: mangoes.base.Representation

Base class for a Word Embedding.

Parameters

words: mangoes.Vocabulary: words represented as vectors (rows of the matrix)
matrix: mangoes.utils.arrays.Matrix: vectors representing words

Attributes

params: Dict of the parameters used to build this matrix, if available
shape

Methods

`distance`(word1, word2[, metric])	Returns the distance between two words
`get_closest_words`(word[, nb, metric])	Returns the closest words and their distances from the given word
`load`(path)	Load an Embeddings
`load_from_gensim`(path)	Load an Embeddings from gensim
`load_from_pickle_files`(matrix_file_path[, …])	Load an Embeddings instance from a matrix and vocabulary pickle file(s).
`load_from_text_file`(file_path[, sep])	Load an embedding from a text file, where there is one word and its corresponding list of embedding values per line.
`pairwise_distances`(words[, other_words, metric])	Compute the distance matrix from one or two list(s) of words.
`pprint`([display])	Pretty print the matrix with labels for rows and columns
`save`(path)	Save the Representation
`save_as_text_file`(file_path[, compress, sep])	Save the embedding as a text file, with one word and its corresponding list of embedding values per line.
`to_df`()	Returns a pandas DataFrame representation of this matrix with labels for rows

analogy
word_mover_distance

to_df(): Returns a pandas DataFrame representation of this matrix with labels for rows

classmethod load(path)

Load an Embeddings

This loader expects to find in path:

a file named ‘matrix.npy’ or ‘matrix.npz’ for the matrix
a text file ‘words.txt’ with the words represented as vectors in the matrix

Parameters

path: str: Path to a folder or an archive

classmethod load_from_gensim(path)

Load an Embeddings from gensim

Parameters

path: str

save_as_text_file(file_path, compress=False, sep='\t')

Save the embedding as a text file, with one word and its corresponding list of embedding values per line.

Parameters

file_path: str: path where the Embeddings instance is stored as a text file
compress: boolean: whether or not to compress the output text file (default = False). If True, the file is compressed using ‘gz’ and named accordingly.
sep: str: the string that acts as the delimiter between the word and the numbers on a line (default = '\t')

static load_from_text_file(file_path, sep='\t')

Load an embedding from a text file, where there is one word and its corresponding list of embedding values per line.

The text file may be in a compressed format, such as ‘.gz’.

Parameters

file_path: str: path to the text file containing the data of the Embeddings instance
sep: str: the string that acts as the delimiter between the word and the numbers on a line (default = '\t')

Returns

Embeddings
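
The text format used by save_as_text_file and load_from_text_file (one word followed by its values per line, optionally gzip-compressed) can be sketched with the standard library alone. The write_text and read_text helpers below are hypothetical stand-ins, not the library's code, and they assume the word comes first on each line and that the caller supplies the '.gz' file name.

```python
import gzip
import os
import tempfile


def write_text(path, rows, sep="\t", compress=False):
    """Write one 'word<sep>v1<sep>v2...' line per (word, values) pair."""
    opener = gzip.open if compress else open
    with opener(path, "wt", encoding="utf-8") as handle:
        for word, values in rows:
            handle.write(sep.join([word] + [repr(v) for v in values]) + "\n")


def read_text(path, sep="\t"):
    """Read the format back; a '.gz' suffix selects gzip decompression."""
    opener = gzip.open if path.endswith(".gz") else open
    rows = []
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            word, *values = line.rstrip("\n").split(sep)
            rows.append((word, [float(v) for v in values]))
    return rows


# Toy data invented for illustration.
rows = [("cat", [0.1, 0.2]), ("dog", [0.3, 0.4])]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "embeddings.txt.gz")
    write_text(path, rows, compress=True)
    loaded = read_text(path)

print(loaded)
```

With the library itself, the round trip would presumably be embeddings.save_as_text_file(file_path, compress=True) followed by Embeddings.load_from_text_file(file_path), with the same sep in both calls.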

classmethod load_from_pickle_files(matrix_file_path, vocabulary_file_path=None)

Load an Embeddings instance from a matrix and vocabulary pickle file(s).

Parameters

matrix_file_path: str: path to the pickle file that stores the matrix. If vocabulary_file_path is None, this file must also contain the vocabulary.
vocabulary_file_path: str, optional (default=None): path to the pickle file that stores the vocabulary, if the matrix and the vocabulary are in separate files

Returns

Embeddings

class mangoes.base.CountBasedRepresentation(words_vocabulary, contexts_vocabulary, matrix, hyperparameters=None)

Bases: mangoes.base.Representation

Base class for a cooccurrence count matrix.

Parameters

words_vocabulary: mangoes.Vocabulary: words represented as vectors (rows of the matrix)
contexts_vocabulary: mangoes.Vocabulary: words used as features (columns of the matrix)
matrix: cooccurrence counts

Attributes

params: Dict of the parameters used to build this matrix, if available
shape

Methods

`distance`(word1, word2[, metric])	Returns the distance between two words
`get_closest_words`(word[, nb, metric])	Returns the closest words and their distances from the given word
`load`(path)	Load a CountBasedRepresentation
`pairwise_distances`(words[, other_words, metric])	Compute the distance matrix from one or two list(s) of words.
`pprint`([display])	Pretty print the matrix with labels for rows and columns
`save`(path)	Save the Representation
`to_df`()	Returns a pandas DataFrame representation of this matrix with labels for rows and columns

analogy
word_mover_distance

to_df(): Returns a pandas DataFrame representation of this matrix with labels for rows and columns

classmethod load(path)

Load a CountBasedRepresentation

This loader expects to find in path:

a file named ‘matrix.npy’ or ‘matrix.npz’ for the matrix
two text files ‘words.txt’ and ‘contexts_words.txt’ with the words used in rows and columns of the matrix, respectively

Parameters

path: str: Path to a folder or an archive
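
The two folder layouts documented for Embeddings.load and CountBasedRepresentation.load differ only by the ‘contexts_words.txt’ file, which suggests how a loader could tell them apart. The sketch below is a guess at such a check, not the rule Representation.load actually applies: guess_kind is a hypothetical helper, and it handles folders only, not archives.

```python
import os
import tempfile


def guess_kind(folder):
    """Guess which class a saved folder holds, from the file names alone."""
    files = set(os.listdir(folder))
    has_matrix = bool(files & {"matrix.npy", "matrix.npz"})
    if not has_matrix or "words.txt" not in files:
        return None  # neither documented layout matches
    if "contexts_words.txt" in files:
        return "CountBasedRepresentation"  # rows and columns are labelled
    return "Embeddings"  # only rows are labelled


# Build an empty folder with the count-based layout to try the check.
with tempfile.TemporaryDirectory() as folder:
    for name in ("matrix.npz", "words.txt", "contexts_words.txt"):
        open(os.path.join(folder, name), "w").close()
    kind = guess_kind(folder)

print(kind)
```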

class mangoes.base.Transformation

Bases: object

Base callable class defining a transformation to be applied to a Matrix.
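
A "callable class" here means an object that is configured once and then applied to a matrix like a function. The sketch below shows that pattern with a standalone, hypothetical LogTransformation class working on nested lists; it does not subclass mangoes.base.Transformation, whose exact interface is not described in this section.

```python
import math


class LogTransformation:
    """Hypothetical example: replace each value x with log(x + shift)."""

    def __init__(self, shift=1.0):
        # The shift keeps zero counts finite: log(0 + 1) == 0.
        self.shift = shift

    def __call__(self, matrix):
        """Apply the transformation and return a new matrix."""
        return [[math.log(value + self.shift) for value in row] for row in matrix]


# Toy cooccurrence counts invented for illustration.
counts = [[0, 1], [3, 7]]

transform = LogTransformation()
transformed = transform(counts)

print(transformed)
```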