Mangoes documentation
=======================

Mangoes is a toolbox for constructing and evaluating static and contextual token vector representations (aka embeddings).

The main functionalities are:

Contextual embeddings:

* Easy-to-use interface for accessing, using and fine-tuning pretrained transformer models through a wrapper around Hugging Face's transformers library.
* Extract contextual embeddings from thousands of pretrained transformer language models with minimal code.
* Fine-tune any transformer model on a number of extrinsic NLP tasks, with multiple levels of training/inference customization.
* Access "enhanced" models pretrained with linguistic or encyclopedic knowledge.

Static embeddings:

* Process textual data and compute vocabularies and co-occurrence matrices. Input data should be raw text or annotated text.
* Compute static word embeddings with different state-of-the-art unsupervised methods.
* Evaluate embeddings with statistical and intrinsic evaluation methods, and inspect them with some visualization tools.

Mangoes is developed as part of the IMPRESS (Improving Embeddings with Semantic Knowledge) project.

Installation
============

Mangoes is available on PyPI and can be installed with pip from a shell::

    pip install mangoes

Quickstart
==========

Contextual Language Models
--------------------------

Fine-tuning classes make it easy to fine-tune, or obtain representations from, any pretrained transformer model on the Hugging Face model hub.

Available fine-tuning classes include sequence classification, token classification, question answering, multiple choice answering, and coreference resolution.

To access a pretrained model, in this case a BERT model:

>>> from mangoes.modeling import TransformerForFeatureExtraction
>>> model = TransformerForFeatureExtraction("bert-base-uncased", "bert-base-uncased", device=None)
>>> input_text = "This is a test sentence"  # could also be a list of sentences
>>> outputs = model.generate_outputs(input_text, pre_tokenized=False, output_hidden_states=True, output_attentions=False, word_embeddings=False)
>>> # outputs is a dict containing "hidden_states": the hidden states of all layers for each token.

See more code examples in the contextual language model use cases.

Static Word Embeddings
----------------------

From corpus to word embeddings:

>>> import mangoes
>>> import string
>>>
>>> path_to_corpus = "notebooks/data/wiki_article_en"
>>>
>>> corpus = mangoes.Corpus(path_to_corpus, lower=True)
>>> vocabulary = corpus.create_vocabulary(filters=[mangoes.corpus.remove_elements(string.punctuation)])
>>> cooccurrences = mangoes.counting.count_cooccurrence(corpus, vocabulary, vocabulary)
>>> embeddings = mangoes.create_representation(cooccurrences,
...                                             weighting=mangoes.weighting.PPMI(),
...                                             reduction=mangoes.reduction.SVD(dimensions=200))
>>> print(embeddings.get_closest_words("september", 3))
[('august', 5.803186132007723e-15), ('attracting', 2.7974552300044038), ('july', 2.7974552300044038)]
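
To make the pipeline above more concrete, here is a minimal pure-Python sketch of its first two steps: counting co-occurrences in a symmetric window and weighting the counts with positive pointwise mutual information (PPMI). It is an illustration only, not Mangoes' implementation: the helper names ``count_cooccurrences`` and ``ppmi``, the window size and the toy sentences are all assumptions, and the SVD reduction step is left out.

```python
import math
from collections import Counter


def count_cooccurrences(sentences, window=2):
    """Count (word, context) pairs found within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def ppmi(counts):
    """Weight each pair with max(0, log(p(w, c) / (p(w) * p(c))))."""
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    weights = {}
    for (word, context), n in counts.items():
        pmi = math.log(n * total / (word_totals[word] * context_totals[context]))
        weights[(word, context)] = max(0.0, pmi)  # clip negative associations
    return weights


# Hypothetical toy corpus, already tokenized and lowercased.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
counts = count_cooccurrences(sentences, window=1)
weights = ppmi(counts)
```

In a full pipeline, the weighted matrix would then be factorized (for example with a truncated SVD) so that each word gets a dense, low-dimensional vector.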

Evaluation:

>>> import mangoes.evaluation.similarity
>>> ws_evaluation = mangoes.evaluation.similarity.Evaluation(embeddings, *mangoes.evaluation.similarity.ALL_DATASETS)
>>> print(ws_evaluation.get_report())
                                 pearson         spearman        Nb questions
                                (p-val)         (p-val)
================================================================================================
WS353                           -0.252(2e-01)   -0.158(4e-01)   32/353
------------------------------------------------------------------------------------------------
WS353 relatedness               -0.317(1e-01)   -0.0486(8e-01)  26/252
------------------------------------------------------------------------------------------------
WS353 similarity                -0.137(6e-01)   -0.254(3e-01)   21/203
------------------------------------------------------------------------------------------------
MEN                             0.262(1e-01)    -0.0312(9e-01)  32/3000
------------------------------------------------------------------------------------------------
M. Turk                         -0.0791(8e-01)  0.25(4e-01)     15/287
------------------------------------------------------------------------------------------------
Rareword                        0.452(3e-02)    0.407(5e-02)    24/2034
------------------------------------------------------------------------------------------------
RG65                            nan(nan)        nan(nan)        0/65
------------------------------------------------------------------------------------------------
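
The report above lists Pearson and Spearman correlations for each word-similarity dataset. As a rough sketch of what such an evaluation computes, the snippet below correlates cosine similarities between word vectors with human ratings, skipping pairs that contain out-of-vocabulary words (which is presumably what the "Nb questions" column counts). The function names, toy vectors and ratings are hypothetical, p-values are omitted, and this is not Mangoes' own code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def evaluate(vectors, dataset):
    """Return (pearson, spearman, number of covered pairs) for one dataset."""
    model_scores, human_scores = [], []
    for w1, w2, rating in dataset:
        if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    return (pearson(model_scores, human_scores),
            spearman(model_scores, human_scores),
            len(human_scores))


# Hypothetical 2-dimensional vectors and human similarity ratings.
vectors = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "truck": [0.2, 0.8]}
dataset = [("cat", "dog", 9.0), ("car", "truck", 8.5), ("cat", "car", 1.0),
           ("dog", "truck", 2.0), ("cat", "plane", 3.0)]
r, rho, covered = evaluate(vectors, dataset)
```

With very few covered pairs, as in the report above, the correlations are noisy, and a dataset with no covered pair at all cannot be scored.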

See more code examples in the static word embeddings use cases.

Resources
=========

You can download some resources created with Mangoes.