mangoes.base module

Base classes to define and create word representations.

This module provides the abstract base class Representation, its two implementations (Embeddings and CountBasedRepresentation), and the main function create_representation to construct one.

class mangoes.base.Representation(words, matrix, hyperparameters=None)

Bases: abc.ABC

Abstract base class for a Representation.

Parameters
words: mangoes.Vocabulary

words represented as vectors (rows of the matrix)

matrix: mangoes.utils.arrays.Matrix

vectors representing words

Attributes
params

Dict of the parameters used to build this matrix, if available

shape

Methods

distance(word1, word2[, metric])

Returns the distance between two words

get_closest_words(word[, nb, metric])

Returns the closest words and their distances from the given word

load(path)

Load a Representation

pairwise_distances(words[, other_words, metric])

Compute the distance matrix from one or two list(s) of words.

pprint([display])

Pretty print the matrix with labels for rows and columns

save(path)

Save the Representation

to_df()

Returns a pandas DataFrame representation of this matrix

analogy

word_mover_distance

property shape
property params

Dict of the parameters used to build this matrix, if available

pprint(display=<built-in function print>)

Pretty print the matrix with labels for rows and columns

abstract to_df()

Returns a pandas DataFrame representation of this matrix

classmethod load(path)

Load a Representation

According to files found in the path, the Representation is either a CountBasedRepresentation object or an Embeddings object.

Parameters
path: str

Path to a folder or an archive

save(path)

Save the Representation

Parameters
path: str

Path to a folder or an archive. It will be created if it does not exist.

distance(word1, word2, metric='cosine', **kwargs)

Returns the distance between two words

This method relies on the sklearn.metrics.pairwise_distances function, so you can use any metric it supports.

Parameters
word1: str

First word

word2: str

Second word

metric: str

The metric to use

**kwargs: optional keyword parameters

Any further parameters are passed directly to the distance function.
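To make the default metric concrete, here is a minimal sketch of cosine distance between two word vectors in plain Python. The toy vectors and the cosine_distance helper are hypothetical and only illustrate the computation; they are not part of mangoes.

```python
import math

# Toy data (hypothetical): each word maps to a small dense vector.
vectors = {
    "cat": [1.0, 0.0, 1.0],
    "dog": [1.0, 0.0, 0.5],
    "car": [0.0, 1.0, 0.0],
}


def cosine_distance(word1, word2):
    """Cosine distance, i.e. 1 - cosine similarity, between two word vectors."""
    u, v = vectors[word1], vectors[word2]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(cosine_distance("cat", "dog"))  # close to 0: similar directions
print(cosine_distance("cat", "car"))  # 1.0: orthogonal vectors
```

A distance near 0 means the two vectors point in almost the same direction, regardless of their lengths.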

pairwise_distances(words, other_words=None, metric='cosine', **kwargs)

Compute the distance matrix from one or two list(s) of words.

This method takes a list of words and returns the matrix of distances between them. If other_words is given (default is None), the returned matrix instead holds the pairwise distances between the words in words and those in other_words.

This method relies on the sklearn.metrics.pairwise_distances function, so you can use any metric it supports.

Valid values for metric are:

  • From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]. These metrics support sparse matrix inputs.

  • From scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]

See the documentation for sklearn.metrics.pairwise_distances for details on these metrics.

Parameters
words: list of str

List of words

other_words: list of str, optional

An optional second list.

metric: str or callable

The metric to use when calculating the distance between word vectors.

**kwargs: optional keyword parameters

Any further parameters are passed directly to the distance function.

Returns
array [len(words), len(words)] or [len(words), len(other_words)]

A distance matrix D such that D_{i, j} is the distance between the ith and jth words of the words list, if other_words is None. If other_words is not None, then D_{i, j} is the distance between the ith word from words and the jth word from other_words.
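The shape rule above can be sketched in plain Python. This is only an illustration of how D is indexed, using a hypothetical distance_matrix helper and toy vectors; the library itself delegates the computation to scikit-learn.

```python
import math

# Toy data (hypothetical): each word maps to a small dense vector.
vectors = {
    "cat": [1.0, 0.0, 1.0],
    "dog": [1.0, 0.0, 0.5],
    "car": [0.0, 1.0, 0.0],
}


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(words, other_words=None, metric=euclidean):
    """Return D where D[i][j] is the distance from words[i] to the j-th column word."""
    # Without other_words, rows and columns are both indexed by `words`.
    columns = words if other_words is None else other_words
    return [[metric(vectors[w], vectors[o]) for o in columns] for w in words]


square = distance_matrix(["cat", "dog", "car"])
rect = distance_matrix(["cat", "dog"], ["car"])
print(len(square), len(square[0]))  # 3 3
print(len(rect), len(rect[0]))      # 2 1
```

With a single list the matrix is square and has zeros on its diagonal; with two lists it has one row per word of the first list and one column per word of the second.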

word_mover_distance(sentence1, sentence2, stopwords=None, metric='euclidean', emd=None)

get_closest_words(word, nb=10, metric='cosine')

Returns the closest words and their distances from the given word

Returns
list

list of tuples (word, distance) sorted by distance

Parameters
word: str, Token or vector

a word or a vector representing a word

nb: int

the number of neighbors to return

metric: str or callable

the metric to use (see pairwise_distances method)
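The return value described above, a list of (word, distance) tuples sorted by distance, can be sketched as follows. The closest_words helper and the toy vectors are hypothetical, and the sketch assumes the query word itself is excluded from its own neighbors, which this page does not state.

```python
import heapq
import math

# Toy data (hypothetical): each word maps to a small dense vector.
vectors = {
    "cat": [1.0, 0.0, 1.0],
    "dog": [1.0, 0.2, 0.5],
    "car": [0.0, 1.0, 0.0],
    "bus": [0.0, 1.0, 0.2],
}


def cosine_distance(u, v):
    """Cosine distance, i.e. 1 - cosine similarity, between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def closest_words(word, nb=10):
    """Return up to nb (word, distance) tuples, nearest first."""
    query = vectors[word]
    # Assumption: the query word is not returned as its own neighbor.
    candidates = (
        (other, cosine_distance(query, vector))
        for other, vector in vectors.items()
        if other != word
    )
    return heapq.nsmallest(nb, candidates, key=lambda pair: pair[1])


neighbours = closest_words("car", nb=2)
print([word for word, _ in neighbours])  # ['bus', 'dog']
```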

analogy(question)

class mangoes.base.Embeddings(words, matrix, hyperparameters=None)

Bases: mangoes.base.Representation

Base class for a Word Embedding.

Parameters
words: mangoes.Vocabulary

words represented as vectors (rows of the matrix)

matrix: mangoes.utils.arrays.Matrix

vectors representing words

Attributes
params

Dict of the parameters used to build this matrix, if available

shape

Methods

distance(word1, word2[, metric])

Returns the distance between two words

get_closest_words(word[, nb, metric])

Returns the closest words and their distances from the given word

load(path)

Load an Embeddings

load_from_gensim(path)

Load an Embeddings from gensim

load_from_pickle_files(matrix_file_path[, …])

Load an Embeddings instance from a matrix and vocabulary pickle file(s).

load_from_text_file(file_path[, sep])

Load an embedding from a text file, where there is one word and its corresponding list of embedding values per line.

pairwise_distances(words[, other_words, metric])

Compute the distance matrix from one or two list(s) of words.

pprint([display])

Pretty print the matrix with labels for rows and columns

save(path)

Save the Representation

save_as_text_file(file_path[, compress, sep])

Save the embedding as a text file, with one word and its corresponding list of embedding values per line.

to_df()

Returns a pandas DataFrame representation of this matrix with labels for rows

analogy

word_mover_distance

to_df()

Returns a pandas DataFrame representation of this matrix with labels for rows

classmethod load(path)

Load an Embeddings

This loader expects to find in path:

  • a file named ‘matrix.npy’ or ‘matrix.npz’ for the matrix

  • a text file ‘words.txt’ with the words represented as vectors in the matrix

Parameters
path: str

Path to a folder or an archive

classmethod load_from_gensim(path)

Load an Embeddings from gensim

Parameters
path: str

save_as_text_file(file_path, compress=False, sep='\t')

Save the embedding as a text file, with one word and its corresponding list of embedding values per line.

Parameters
file_path: str

path where the Embeddings instance will be stored as a text file

compress: boolean

whether or not to compress the output text file (default = False). If True, it will be compressed using ‘gz’, and be named accordingly.

sep: str

the string that acts as the delimiter between the word and the numbers on a line (default = ‘\t’)

static load_from_text_file(file_path, sep='\t')

Load an embedding from a text file, where there is one word and its corresponding list of embedding values per line.

The text file may be in a compressed format, such as ‘.gz’.

Parameters
file_path: str

path to the text file containing the data of the Embeddings instance

sep: str

the string that acts as the delimiter between the word and the numbers on a line (default = ‘\t’)

Returns
Embeddings
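The text layout described for save_as_text_file and load_from_text_file (one word followed by its values per line, optionally gzip-compressed) can be sketched with the standard library alone. The write_vectors and read_vectors helpers are hypothetical, and details such as number formatting or a possible header line in the files mangoes actually writes are assumptions not confirmed by this page.

```python
import gzip
import os
import tempfile


def write_vectors(path, vectors, sep="\t", compress=False):
    """Write one 'word<sep>v1<sep>v2...' line per word; gzip the file if requested."""
    if compress:
        path += ".gz"  # assumption: the compressed file just gets a '.gz' suffix
    opener = gzip.open if compress else open
    with opener(path, "wt", encoding="utf-8") as handle:
        for word, values in vectors.items():
            handle.write(sep.join([word] + [repr(v) for v in values]) + "\n")
    return path


def read_vectors(path, sep="\t"):
    """Read the layout written by write_vectors back into a dict."""
    opener = gzip.open if path.endswith(".gz") else open
    vectors = {}
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            word, *values = line.rstrip("\n").split(sep)
            vectors[word] = [float(v) for v in values]
    return vectors


original = {"cat": [1.0, 0.0, 1.0], "dog": [1.0, 0.2, 0.5]}
with tempfile.TemporaryDirectory() as folder:
    saved_path = write_vectors(os.path.join(folder, "vectors.txt"), original, compress=True)
    restored = read_vectors(saved_path)

print(restored == original)  # True
```

The same sep must be used for writing and reading, and it should not occur inside any word, otherwise the line cannot be split back unambiguously.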
classmethod load_from_pickle_files(matrix_file_path, vocabulary_file_path=None)

Load an Embeddings instance from a matrix and vocabulary pickle file(s).

Parameters
matrix_file_path: str

path to the pickle file that stores the matrix and, if vocabulary_file_path is None, the vocabulary as well

vocabulary_file_path: str, optional (default=None)

path to the pickle file, where the vocabulary is stored, if the matrix and the vocabulary are in separate files

Returns
Embeddings

class mangoes.base.CountBasedRepresentation(words_vocabulary, contexts_vocabulary, matrix, hyperparameters=None)

Bases: mangoes.base.Representation

Base class for a cooccurrence count matrix.

Parameters
words_vocabulary: mangoes.Vocabulary

words represented as vectors (rows of the matrix)

contexts_vocabulary: mangoes.Vocabulary

words used as features (columns of the matrix)

matrix

numbers of cooccurrence

Attributes
params

Dict of the parameters used to build this matrix, if available

shape

Methods

distance(word1, word2[, metric])

Returns the distance between two words

get_closest_words(word[, nb, metric])

Returns the closest words and their distances from the given word

load(path)

Load a CountBasedRepresentation

pairwise_distances(words[, other_words, metric])

Compute the distance matrix from one or two list(s) of words.

pprint([display])

Pretty print the matrix with labels for rows and columns

save(path)

Save the Representation

to_df()

Returns a pandas DataFrame representation of this matrix with labels for rows and columns

analogy

word_mover_distance

to_df()

Returns a pandas DataFrame representation of this matrix with labels for rows and columns

classmethod load(path)

Load a CountBasedRepresentation

This loader expects to find in path:

  • a file named ‘matrix.npy’ or ‘matrix.npz’ for the matrix

  • two text files ‘words.txt’ and ‘contexts_words.txt’ with the words used in rows and columns of the matrix, respectively

Parameters
path: str

Path to a folder or an archive

class mangoes.base.Transformation

Bases: object

Base callable class to define a transformation to be applied to a Matrix

See also

mangoes.create_representation()
mangoes.weighting
mangoes.reduction

Attributes
params

Methods

__call__(matrix)

Apply the transformation and return the transformed matrix

property params

mangoes.base.create_representation(source, weighting=None, reduction=None)

Create a Representation from a CountBasedRepresentation

Applies the function(s) passed in the weighting and reduction parameters and returns a mangoes.Representation.

Parameters
source: mangoes.CountBasedRepresentation
weighting: mangoes.Transformation

weighting function to apply to the source (see mangoes.weighting)

reduction: mangoes.Transformation

reduction to apply to the (weighted) source matrix (see mangoes.reduction)

Returns
Embeddings or CountBasedRepresentation

Examples

>>> import mangoes
>>> embedding = mangoes.create_representation(cooccurrence_matrix,
...                                           reduction=mangoes.reduction.pca(dimensions=50))
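The order of operations described for create_representation (weighting first, then reduction on the weighted matrix, each step optional) can be sketched with plain callables. RowNormalization, KeepFirstColumns and apply_transformations are hypothetical stand-ins that follow the callable-with-params pattern of Transformation; they are not mangoes classes and do not reproduce its weighting or reduction algorithms.

```python
class RowNormalization:
    """Hypothetical weighting: divide each row by its sum."""

    @property
    def params(self):
        return {"name": "row_normalization"}

    def __call__(self, matrix):
        normalized = []
        for row in matrix:
            total = sum(row)
            normalized.append([value / total if total else 0.0 for value in row])
        return normalized


class KeepFirstColumns:
    """Hypothetical stand-in for a reduction: keep the first `dimensions` columns."""

    def __init__(self, dimensions):
        self.dimensions = dimensions

    @property
    def params(self):
        return {"name": "keep_first_columns", "dimensions": self.dimensions}

    def __call__(self, matrix):
        return [row[: self.dimensions] for row in matrix]


def apply_transformations(matrix, weighting=None, reduction=None):
    """Apply weighting first, then reduction, skipping any step left as None."""
    for transformation in (weighting, reduction):
        if transformation is not None:
            matrix = transformation(matrix)
    return matrix


counts = [[2, 1, 1], [0, 3, 1]]  # toy cooccurrence counts (hypothetical)
result = apply_transformations(counts, RowNormalization(), KeepFirstColumns(2))
print(result)  # [[0.5, 0.25], [0.0, 0.75]]
```

Keeping each step as an independent callable lets either one be omitted or swapped without touching the other.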