mangoes.base module
Base classes to define and create word representations.
This module provides the abstract base class Representation with two implementations, and the main function create_representation to construct one.
-
class mangoes.base.Representation(words, matrix, hyperparameters=None)
Bases: abc.ABC
Abstract base class for a Representation.
- Parameters
- words: mangoes.Vocabulary
words represented as vectors (rows of the matrix)
- matrix: mangoes.utils.arrays.Matrix
vectors representing words
- Attributes
- params
Dict of the parameters used to build this matrix, if available
- shape
Methods
- distance(word1, word2[, metric])
Returns the distance between two words
- get_closest_words(word[, nb, metric])
Returns the closest words and their distances from the given word
- load(path)
Load a Representation
- pairwise_distances(words[, other_words, metric])
Compute the distance matrix from one or two list(s) of words.
- pprint([display])
Pretty print the matrix with labels for rows and columns
- save(path)
Save the Representation
- to_df()
Returns a pandas DataFrame representation of this matrix
- analogy
- word_mover_distance
-
property shape
-
property params
Dict of the parameters used to build this matrix, if available
-
pprint(display=<built-in function print>)
Pretty print the matrix with labels for rows and columns
-
abstract to_df()
Returns a pandas DataFrame representation of this matrix
-
classmethod load(path)
Load a Representation
According to files found in the path, the Representation is either a CountBasedRepresentation object or an Embeddings object.
- Parameters
- path: str
Path to a folder or an archive
-
save(path)
Save the Representation
- Parameters
- path: str
Path to a folder or an archive. It will be created if it doesn’t exist.
-
distance(word1, word2, metric='cosine', **kwargs)
Returns the distance between two words
This method relies on the sklearn.metrics.pairwise_distances function, so you can use any metric it supports.
- Parameters
- word1: str
First word
- word2: str
Second word
- metric: str
The metric to use
- **kwargs: optional keyword parameters
Any further parameters are passed directly to the distance function.
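To make the default metric concrete, here is a minimal standalone sketch of what a cosine distance between two word vectors computes. It does not use mangoes or scikit-learn: the `cosine_distance` helper and the toy `vectors` dictionary are hypothetical stand-ins for rows of the representation matrix.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy vectors, standing in for rows of the matrix.
vectors = {
    "cat": [1.0, 0.0],
    "dog": [1.0, 1.0],
    "car": [0.0, 1.0],
}

d_cat_dog = cosine_distance(vectors["cat"], vectors["dog"])  # 45 degree angle
d_cat_car = cosine_distance(vectors["cat"], vectors["car"])  # orthogonal vectors
```

Orthogonal vectors end up at distance 1, and vectors pointing in the same direction at distance 0, whatever their lengths.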
-
pairwise_distances(words, other_words=None, metric='cosine', **kwargs)
Compute the distance matrix from one or two list(s) of words.
This method takes a list of words and returns the matrix of the distances between them. If other_words is given (default is None), the returned matrix instead holds the pairwise distances between the words from words and the ones from other_words.
This method relies on the sklearn.metrics.pairwise_distances function, so you can use any metric it supports.
Valid values for metric are:
- From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]. These metrics support sparse matrix inputs.
- From scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’]
See the documentation for sklearn.metrics.pairwise_distances for details on these metrics.
- Parameters
- words: list of str
List of words
- other_words: list of str, optional
An optional second list.
- metric: str or callable
The metric to use when calculating the distance between word vectors.
- **kwargs: optional keyword parameters
Any further parameters are passed directly to the distance function.
- Returns
- array [len(words), len(words)] or [len(words), len(other_words)]
A distance matrix D such that D_{i, j} is the distance between the ith and jth words of the words list, if other_words is None. If other_words is not None, then D_{i, j} is the distance between the ith word from words and the jth word from other_words.
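The two possible shapes of the returned matrix can be illustrated with a small standalone sketch. It is not the library's implementation: the `pairwise` helper, the `euclidean` function and the toy `vectors` are hypothetical, and plain nested lists stand in for the returned array.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise(vectors, words, other_words=None, metric=euclidean):
    """Return D such that D[i][j] = metric(words[i], targets[j]).

    targets is `words` itself when other_words is None.
    """
    targets = words if other_words is None else other_words
    return [[metric(vectors[w], vectors[t]) for t in targets] for w in words]


# Hypothetical toy vectors, standing in for rows of the matrix.
vectors = {"cat": [0.0, 0.0], "dog": [3.0, 4.0], "car": [6.0, 8.0]}

# One list: square matrix, len(words) x len(words).
square = pairwise(vectors, ["cat", "dog", "car"])

# Two lists: rectangular matrix, len(words) x len(other_words).
rect = pairwise(vectors, ["cat", "dog"], other_words=["car"])
```

With a single list the diagonal is zero, since each word is compared with itself; with two lists the rows follow words and the columns follow other_words.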
-
word_mover_distance(sentence1, sentence2, stopwords=None, metric='euclidean', emd=None)
-
get_closest_words(word, nb=10, metric='cosine')
Returns the closest words and their distances from the given word
- Returns
- list
list of tuples (word, distance) sorted by distance
- Parameters
- word: str, Token or vector
a word or a vector representing a word
- nb: int
the number of neighbors to return
- metric: str or callable
the metric to use (see pairwise_distances method)
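The shape of the return value, a list of (word, distance) tuples sorted by distance, can be sketched without the library as follows. The `closest_words` helper and the toy `vectors` are hypothetical, and excluding the query word from its own neighbors is an assumption of this sketch, not something the documentation states.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def closest_words(vectors, word, nb=10, metric=cosine_distance):
    """Return up to nb (word, distance) tuples, nearest first.

    Assumption: the query word itself is left out of the result.
    """
    query = vectors[word]
    candidates = [
        (other, metric(query, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(candidates, key=lambda pair: pair[1])[:nb]


# Hypothetical toy vectors, standing in for rows of the matrix.
vectors = {
    "cat": [1.0, 0.0],
    "kitten": [2.0, 0.1],
    "dog": [1.0, 1.0],
    "car": [0.0, 1.0],
}

neighbors = closest_words(vectors, "cat", nb=2)
```

Here `neighbors` holds two tuples, with "kitten" ahead of "dog" because its vector points in nearly the same direction as the one for "cat".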
-
analogy(question)
-
class mangoes.base.Embeddings(words, matrix, hyperparameters=None)
Bases: mangoes.base.Representation
Base class for a Word Embedding.
- Parameters
- words: mangoes.Vocabulary
words represented as vectors (rows of the matrix)
- matrix: mangoes.utils.arrays.Matrix
vectors representing words
- Attributes
- params
Dict of the parameters used to build this matrix, if available
- shape
Methods
- distance(word1, word2[, metric])
Returns the distance between two words
- get_closest_words(word[, nb, metric])
Returns the closest words and their distances from the given word
- load(path)
Load an Embeddings
- load_from_gensim(path)
Load an Embeddings from gensim
- load_from_pickle_files(matrix_file_path[, …])
Load an Embeddings instance from a matrix and vocabulary pickle file(s).
- load_from_text_file(file_path[, sep])
Load an embedding from a text file, where there is one word and its corresponding list of embedding values per line.
- pairwise_distances(words[, other_words, metric])
Compute the distance matrix from one or two list(s) of words.
- pprint([display])
Pretty print the matrix with labels for rows and columns
- save(path)
Save the Representation
- save_as_text_file(file_path[, compress, sep])
Save the embedding as a text file, with one word and its corresponding list of embedding values per line.
- to_df()
Returns a pandas DataFrame representation of this matrix with labels for rows
- analogy
- word_mover_distance
-
to_df()
Returns a pandas DataFrame representation of this matrix with labels for rows
-
classmethod load(path)
Load an Embeddings
This loader expects to find in path:
a file named ‘matrix.npy’ or ‘matrix.npz’ for the matrix
a text file ‘words.txt’ with the words represented as vectors in the matrix
- Parameters
- path: str
Path to a folder or an archive
-
classmethod load_from_gensim(path)
Load an Embeddings from gensim
- Parameters
- path: str
-
save_as_text_file(file_path, compress=False, sep='\t')
Save the embedding as a text file, with one word and its corresponding list of embedding values per line.
- Parameters
- file_path: str
path to the location where to store the Embeddings instance as a text file
- compress: boolean
whether or not to compress the output text file (default = False). If True, it will be compressed using ‘gz’, and be named accordingly.
- sep: str
the string that acts as the delimiter between words and/or between numbers on a line (default = ‘\t’)
-
static load_from_text_file(file_path, sep='\t')
Load an embedding from a text file, where there is one word and its corresponding list of embedding values per line.
The text file may be in a compressed format, such as ‘.gz’.
- Parameters
- file_path: str
path to the text file containing the Embeddings instance’s data
- sep: str
the string that acts as the delimiter between words and/or between numbers on a line (default = ‘\t’)
- Returns
- Embeddings
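The text format described for save_as_text_file and load_from_text_file (one word followed by its values on each line, separated by sep, optionally gzip-compressed) can be sketched with the standard library alone. The `write_vectors` and `read_vectors` helpers are hypothetical and are not the mangoes implementation; appending ‘.gz’ to the file name when compressing is an assumption about what "named accordingly" means.

```python
import gzip
import os
import tempfile


def write_vectors(file_path, vectors, sep="\t", compress=False):
    """Write one word and its values per line; return the path actually written."""
    if compress:
        file_path += ".gz"  # assumption: compressed output gets a '.gz' suffix
    opener = gzip.open if compress else open
    with opener(file_path, "wt", encoding="utf-8") as f:
        for word, values in vectors.items():
            f.write(sep.join([word] + [repr(v) for v in values]) + "\n")
    return file_path


def read_vectors(file_path, sep="\t"):
    """Read back a file written by write_vectors, compressed or not."""
    opener = gzip.open if file_path.endswith(".gz") else open
    vectors = {}
    with opener(file_path, "rt", encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip("\n").split(sep)
            vectors[word] = [float(v) for v in values]
    return vectors


original = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
with tempfile.TemporaryDirectory() as folder:
    path = write_vectors(os.path.join(folder, "vectors.txt"), original, compress=True)
    restored = read_vectors(path)
```

Writing values with `repr` keeps the round trip exact for Python floats; a tab separator also keeps words that contain spaces on a single field.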
-
classmethod load_from_pickle_files(matrix_file_path, vocabulary_file_path=None)
Load an Embeddings instance from a matrix and vocabulary pickle file(s).
- Parameters
- matrix_file_path: str
path to the pickle file that stores the matrix. If vocabulary_file_path is None, this file must also contain the vocabulary.
- vocabulary_file_path: str, optional (default=None)
path to the pickle file, where the vocabulary is stored, if the matrix and the vocabulary are in separate files
- Returns
- Embeddings
-
class mangoes.base.CountBasedRepresentation(words_vocabulary, contexts_vocabulary, matrix, hyperparameters=None)
Bases: mangoes.base.Representation
Base class for a cooccurrence count matrix.
- Parameters
- words_vocabulary: mangoes.Vocabulary
words represented as vectors (rows of the matrix)
- contexts_vocabulary: mangoes.Vocabulary
words used as features (columns of the matrix)
- matrix
numbers of cooccurrence
- Attributes
- params
Dict of the parameters used to build this matrix, if available
- shape
Methods
- distance(word1, word2[, metric])
Returns the distance between two words
- get_closest_words(word[, nb, metric])
Returns the closest words and their distances from the given word
- load(path)
Load a CooccurrenceCount
- pairwise_distances(words[, other_words, metric])
Compute the distance matrix from one or two list(s) of words.
- pprint([display])
Pretty print the matrix with labels for rows and columns
- save(path)
Save the Representation
- to_df()
Returns a pandas DataFrame representation of this matrix with labels for rows and columns
- analogy
- word_mover_distance
-
to_df()
Returns a pandas DataFrame representation of this matrix with labels for rows and columns
-
classmethod load(path)
Load a CooccurrenceCount
This loader expects to find in path:
a file named ‘matrix.npy’ or ‘matrix.npz’ for the matrix
two text files ‘words.txt’ and ‘contexts_words.txt’ with the words used in rows and columns of the matrix, respectively
- Parameters
- path: str
Path to a folder or an archive
-
class mangoes.base.Transformation
Bases: object
Base callable class defining a transformation to be applied to a Matrix
See also
mangoes.create_representation()
mangoes.weighting
mangoes.reduction
- Attributes
- params
Methods
- __call__(matrix)
Apply the transformation and return the transformed matrix
-
property params
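As an illustration of the callable contract described for Transformation (a params attribute plus a __call__ that takes a matrix and returns the transformed matrix), here is a standalone sketch. The `RowNormalize` class is hypothetical and does not subclass the real mangoes.base.Transformation; the contents of `params` and the use of nested lists as the matrix are assumptions made for the example.

```python
class RowNormalize:
    """Hypothetical transformation: scale each row so that it sums to 1."""

    def __init__(self):
        # Assumption: params is a dict describing how the transformation was set up.
        self._params = {"name": "row_normalize"}

    @property
    def params(self):
        return self._params

    def __call__(self, matrix):
        """Return a new matrix; all-zero rows are left as zeros."""
        result = []
        for row in matrix:
            total = sum(row)
            result.append([value / total if total else 0.0 for value in row])
        return result


counts = [[2, 2], [1, 3], [0, 0]]
transform = RowNormalize()
weighted = transform(counts)
```

Because the object is callable, it can be passed wherever a function of the matrix is expected, while `params` keeps a record of how it was configured.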
-
mangoes.base.create_representation(source, weighting=None, reduction=None)
Create an Embeddings object from a CooccurrenceCount
Applies the function(s) passed in the weighting and reduction parameters and returns a mangoes.Representation.
- Parameters
- source: mangoes.CountBasedRepresentation
- weighting: mangoes.Transformation
weighting function to apply to the source (see: mangoes.weighting)
- reduction: mangoes.Transformation
reduction to apply to the (weighted) source matrix (see: mangoes.reduction)
- Returns
- Embeddings or CountBasedRepresentation
Examples
>>> embedding = mangoes.create_representation(cooccurrence_matrix,
...                                           reduction=mangoes.reduction.pca(dimensions=50))
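The order in which the two optional transformations are applied (weighting first, then reduction on the weighted matrix) can be sketched independently of the library. The `apply_transformations` function and the two toy transformations below are hypothetical; the real function also wraps the result in an Embeddings or CountBasedRepresentation, which this sketch leaves out.

```python
def apply_transformations(matrix, weighting=None, reduction=None):
    """Apply weighting, then reduction; skip whichever is None."""
    if weighting is not None:
        matrix = weighting(matrix)
    if reduction is not None:
        matrix = reduction(matrix)
    return matrix


def double(matrix):
    """Hypothetical weighting: multiply every count by 2."""
    return [[2 * value for value in row] for row in matrix]


def keep_first_column(matrix):
    """Hypothetical reduction: keep only the first column."""
    return [row[:1] for row in matrix]


counts = [[1, 2], [3, 4]]
unchanged = apply_transformations(counts)
reduced = apply_transformations(counts, weighting=double, reduction=keep_first_column)
```

With both arguments left at None the matrix comes back untouched, which is consistent with the documented return type covering both Embeddings and CountBasedRepresentation.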