Mangoes documentation

Mangoes is a toolbox for constructing and evaluating static and contextual token vector representations (aka embeddings). The main functionalities are:

Contextual embeddings:

  • Easy-to-use interface for accessing, using, and fine-tuning pretrained transformer models through a wrapper around Hugging Face's transformers library.

  • Extract contextual embeddings from thousands of pretrained transformer language models with minimal code.

  • Fine-tune any transformer model on a number of extrinsic NLP tasks, with multiple levels of training/inference customization.

  • Access “enhanced” models pretrained with linguistic or encyclopedic knowledge.

Static embeddings:

  • Process textual data and compute vocabularies and co-occurrence matrices. Input data can be raw text or annotated text.

  • Compute static word embeddings with different state-of-the-art unsupervised methods.

  • Provide statistical and intrinsic evaluation methods, as well as visualization tools.

Mangoes is developed as part of the IMPRESS (Improving Embeddings with Semantic Knowledge) project: https://project.inria.fr/impress/

Installation

Mangoes is available on PyPI and can be installed with pip:

pip install mangoes

Quickstart

Contextual Language Models

Fine-tuning classes make it easy to fine-tune or obtain representations from any pretrained transformer model on the Hugging Face model hub (https://huggingface.co/models). Available fine-tuning classes include sequence classification, token classification, question answering, multiple choice answering, and coreference resolution. Accessing a pretrained model, in this case a BERT model:

>>> from mangoes.modeling import TransformerForFeatureExtraction
>>> model = TransformerForFeatureExtraction("bert-base-uncased", "bert-base-uncased", device=None)
>>> input_text = "This is a test sentence" # could also be a list of sentences
>>> outputs = model.generate_outputs(input_text, pre_tokenized=False, output_hidden_states=True, output_attentions=False, word_embeddings=False)
>>> # outputs is a dict whose "hidden_states" entry contains the hidden states of all layers for each token.
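
The hidden states can then be used directly as contextual token vectors. Below is a minimal sketch of mean-pooling them into sentence vectors, assuming the "hidden_states" entry follows the usual Hugging Face layout (one tensor per layer, each of shape batch x sequence length x hidden size):

>>> last_layer = outputs["hidden_states"][-1]  # final layer: (batch, sequence length, hidden size)
>>> sentence_vectors = last_layer.mean(1)      # simple mean pooling over the token dimension
>>> sentence_vectors.shape                     # one 768-dimensional vector per sentence for bert-base-uncased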

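The task-specific fine-tuning classes follow the same constructor pattern as the extraction example above. The sketch below is illustrative only: the class name TransformerForSequenceClassification and the train() arguments are assumptions, so check the use-case notebooks for the exact signatures.

>>> from mangoes.modeling import TransformerForSequenceClassification  # assumed class name
>>> classifier = TransformerForSequenceClassification("bert-base-uncased", "bert-base-uncased",
...                                                   labels=["negative", "positive"])  # assumed arguments
>>> # the training arguments below (texts, targets, epochs) are illustrative, not verified API
>>> classifier.train(train_text=["a great movie", "a terrible movie"],
...                  train_targets=["positive", "negative"],
...                  max_epochs=1)
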
See more code examples in the contextual language model use cases.

Static Word Embeddings

From corpus to word embeddings:

>>> import mangoes
>>> import string
>>>
>>> path_to_corpus = "notebooks/data/wiki_article_en"
>>>
>>> corpus = mangoes.Corpus(path_to_corpus, lower=True)
>>> vocabulary = corpus.create_vocabulary(filters=[mangoes.corpus.remove_elements(string.punctuation)])
>>> cooccurrences = mangoes.counting.count_cooccurrence(corpus, vocabulary, vocabulary)
>>> embeddings = mangoes.create_representation(cooccurrences,
...                                            weighting=mangoes.weighting.PPMI(),
...                                            reduction=mangoes.reduction.SVD(dimensions=200))
>>> print(embeddings.get_closest_words("september", 3))
[('august', 5.803186132007723e-15), ('attracting', 2.7974552300044038), ('july', 2.7974552300044038)]
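
The resulting representation can be kept for later reuse. The save/load calls below are only a sketch: the method names and the mangoes.Embeddings class used here are assumptions, so check the API reference for the actual persistence interface.

>>> embeddings.save("output/wiki_en_embeddings")                      # assumed method
>>> reloaded = mangoes.Embeddings.load("output/wiki_en_embeddings")   # assumed class method
>>> reloaded.get_closest_words("september", 3)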

Evaluation:

>>> import mangoes.evaluation.similarity
>>> ws_evaluation = mangoes.evaluation.similarity.Evaluation(embeddings, *mangoes.evaluation.similarity.ALL_DATASETS)
>>> print(ws_evaluation.get_report())
                                                                           pearson       spearman
                                                       Nb questions        (p-val)        (p-val)
 ================================================================================================
 WS353                                                       32/353  -0.252(2e-01)  -0.158(4e-01)
 ------------------------------------------------------------------------------------------------
 WS353 relatedness                                           26/252  -0.317(1e-01) -0.0486(8e-01)
 ------------------------------------------------------------------------------------------------
 WS353 similarity                                            21/203  -0.137(6e-01)  -0.254(3e-01)
 ------------------------------------------------------------------------------------------------
 MEN                                                        32/3000   0.262(1e-01) -0.0312(9e-01)
 ------------------------------------------------------------------------------------------------
 M. Turk                                                     15/287 -0.0791(8e-01)    0.25(4e-01)
 ------------------------------------------------------------------------------------------------
 Rareword                                                   24/2034   0.452(3e-02)   0.407(5e-02)
 ------------------------------------------------------------------------------------------------
 RG65                                                          0/65       nan(nan)       nan(nan)
 ------------------------------------------------------------------------------------------------
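
The same Evaluation class can also be run on a single dataset instead of ALL_DATASETS. In the sketch below, WS353 is assumed to be one of the dataset constants exposed by mangoes.evaluation.similarity (consistent with the row names in the report above):

>>> ws353_evaluation = mangoes.evaluation.similarity.Evaluation(embeddings, mangoes.evaluation.similarity.WS353)
>>> print(ws353_evaluation.get_report())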

See more code examples in the static word embeddings use cases.

Resources

You can download some resources created with Mangoes.
