Mangoes documentation¶
Mangoes is a toolbox for constructing and evaluating static and contextual token vector representations (aka embeddings). The main functionalities are:
Contextual embeddings:
Easy-to-use interface for accessing, using and fine-tuning pretrained transformer models through a wrapper around Huggingface’s transformer library.
Extract contextual embeddings from thousands of pretrained transformer language models with minimal code.
Fine-tune any transformer model on a number of extrinsic NLP tasks, with multiple levels of training/inference customization.
Access “enhanced” models pretrained with linguistic or encyclopedic knowledge.
Static embeddings:
Process textual data and compute vocabularies and co-occurrence matrices. Input data should be raw text or annotated text.
Compute static word embeddings with different state-of-the art unsupervised methods.
Propose statistical and intrinsic evaluation methods, as well as some visualization tools.
Mangoes is developed as part of the IMPRESS (Improving Embeddings with Semantic Knowledge) project: https://project.inria.fr/impress/
Installation¶
Mangoes is available on PyPi and thus can be installed using pip:
>>> pip install mangoes
Quickstart¶
Contextual Language Models¶
Fine-tuning classes make it easy to fine-tune or obtain representations from any pretrained transformer model on the Huggingface model hub ( https://huggingface.co/models ). Available fine-tuning classes include sequence classification, token classification, question answering, multiple choice answering, and coreference resolution. Accessing pretrained models, in this case a BERT model:
>>> from mangoes.modeling import TransformerForFeatureExtraction
>>> model = TransformerForFeatureExtraction("bert-base-uncased", "bert-base-uncased", device=None)
>>> input_text = "This is a test sentence" # could also be a list of sentences
>>> outputs = model.generate_outputs(input_text, pre_tokenized=False, output_hidden_states=True, output_attentions=False, word_embeddings=False)
>>> # outputs is dict containing "hidden_states": hidden states of all layers for each token.
See more code examples in the contextual language model use cases.
Static Word Embeddings¶
From corpus to word embeddings :
>>> import mangoes
>>> import string
>>>
>>> path_to_corpus = "notebooks/data/wiki_article_en"
>>>
>>> corpus = mangoes.Corpus(path_to_corpus, lower=True)
>>> vocabulary = corpus.create_vocabulary(filters=[mangoes.corpus.remove_elements(string.punctuation)])
>>> cooccurrences = mangoes.counting.count_cooccurrence(corpus, vocabulary, vocabulary)
>>> embeddings = mangoes.create_representation(cooccurrences,
... weighting=mangoes.weighting.PPMI(),
... reduction=mangoes.reduction.SVD(dimensions=200))
>>> print(embeddings.get_closest_words("september", 3))
[('august', 5.803186132007723e-15), ('attracting', 2.7974552300044038), ('july', 2.7974552300044038)]
Evaluation :
>>> import mangoes.evaluation.similarity
>>> ws_evaluation = mangoes.evaluation.similarity.Evaluation(embeddings, *mangoes.evaluation.similarity.ALL_DATASETS)
>>> print(ws_evaluation.get_report())
pearson spearman
Nb questions (p-val) (p-val)
================================================================================================
WS353 32/353 -0.252(2e-01) -0.158(4e-01)
------------------------------------------------------------------------------------------------
WS353 relatedness 26/252 -0.317(1e-01) -0.0486(8e-01)
------------------------------------------------------------------------------------------------
WS353 similarity 21/203 -0.137(6e-01) -0.254(3e-01)
------------------------------------------------------------------------------------------------
MEN 32/3000 0.262(1e-01) -0.0312(9e-01)
------------------------------------------------------------------------------------------------
M. Turk 15/287 -0.0791(8e-01) 0.25(4e-01)
------------------------------------------------------------------------------------------------
Rareword 24/2034 0.452(3e-02) 0.407(5e-02)
------------------------------------------------------------------------------------------------
RG65 0/65 nan(nan) nan(nan)
------------------------------------------------------------------------------------------------
See more code examples in the static word embeddings use cases.