Mangoes documentation
=====================

Mangoes is a toolbox for constructing and evaluating static and contextual token vector representations (aka embeddings). The main functionalities are:

Contextual embeddings:

* Easy-to-use interface for accessing, using and fine-tuning pretrained transformer models, through a wrapper around Huggingface's Transformers library.
* Extract contextual embeddings from thousands of pretrained transformer language models with minimal code.
* Fine-tune any transformer model on a number of extrinsic NLP tasks, with multiple levels of training/inference customization.
* Access "enhanced" models pretrained with linguistic or encyclopedic knowledge.

Static embeddings:

* Process textual data and compute vocabularies and co-occurrence matrices. Input data can be raw text or annotated text.
* Compute static word embeddings with different state-of-the-art unsupervised methods.
* Propose statistical and intrinsic evaluation methods, as well as some visualization tools.

Mangoes is developed as part of the IMPRESS (Improving Embeddings with Semantic Knowledge) project: https://project.inria.fr/impress/

Installation
============

Mangoes is available on PyPI and can therefore be installed using pip:

>>> pip install mangoes

Quickstart
==========

Contextual Language Models
--------------------------

Fine-tuning classes make it easy to fine-tune or obtain representations from any pretrained transformer model on the Huggingface model hub (https://huggingface.co/models). Available fine-tuning classes include sequence classification, token classification, question answering, multiple choice answering, and coreference resolution.

Accessing a pretrained model, in this case a BERT model:

>>> from mangoes.modeling import TransformerForFeatureExtraction
>>> model = TransformerForFeatureExtraction("bert-base-uncased", "bert-base-uncased", device=None)
>>> input_text = "This is a test sentence"  # could also be a list of sentences
>>> outputs = model.generate_outputs(input_text, pre_tokenized=False, output_hidden_states=True,
...                                  output_attentions=False, word_embeddings=False)
>>> # outputs is a dict containing "hidden_states": the hidden states of all layers for each token
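The hidden states can then be pooled into fixed-size vectors. The sketch below mean-pools the last layer into one vector per sentence; it assumes that ``outputs["hidden_states"]`` follows the usual Huggingface convention of one tensor per layer with shape ``(batch, sequence_length, hidden_size)``, so check the use cases for the exact structure returned by your Mangoes version.

>>> import torch
>>> last_layer = torch.as_tensor(outputs["hidden_states"][-1])  # assumed shape: (batch, seq_len, hidden_size)
>>> sentence_vectors = last_layer.mean(dim=1)  # average over tokens: one vector per input sentence
>>> sentence_vectors.shape  # e.g. (1, 768) for bert-base-uncased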
See more code examples in the contextual language model use cases.

Static Word Embeddings
----------------------

From corpus to word embeddings:

>>> import mangoes
>>> import string
>>>
>>> path_to_corpus = "notebooks/data/wiki_article_en"
>>>
>>> corpus = mangoes.Corpus(path_to_corpus, lower=True)
>>> vocabulary = corpus.create_vocabulary(filters=[mangoes.corpus.remove_elements(string.punctuation)])
>>> cooccurrences = mangoes.counting.count_cooccurrence(corpus, vocabulary, vocabulary)
>>> embeddings = mangoes.create_representation(cooccurrences,
...                                            weighting=mangoes.weighting.PPMI(),
...                                            reduction=mangoes.reduction.SVD(dimensions=200))
>>> print(embeddings.get_closest_words("september", 3))
[('august', 5.803186132007723e-15), ('attracting', 2.7974552300044038), ('july', 2.7974552300044038)]

Evaluation:

>>> import mangoes.evaluation.similarity
>>> ws_evaluation = mangoes.evaluation.similarity.Evaluation(embeddings, *mangoes.evaluation.similarity.ALL_DATASETS)
>>> print(ws_evaluation.get_report())
                                                            pearson           spearman
                                      Nb questions          (p-val)            (p-val)
================================================================================================
WS353                                       32/353    -0.252(2e-01)     -0.158(4e-01)
------------------------------------------------------------------------------------------------
WS353 relatedness                           26/252    -0.317(1e-01)    -0.0486(8e-01)
------------------------------------------------------------------------------------------------
WS353 similarity                            21/203    -0.137(6e-01)     -0.254(3e-01)
------------------------------------------------------------------------------------------------
MEN                                        32/3000     0.262(1e-01)    -0.0312(9e-01)
------------------------------------------------------------------------------------------------
M. Turk                                     15/287   -0.0791(8e-01)      0.25(4e-01)
------------------------------------------------------------------------------------------------
Rareword                                   24/2034     0.452(3e-02)      0.407(5e-02)
------------------------------------------------------------------------------------------------
RG65                                          0/65         nan(nan)          nan(nan)
------------------------------------------------------------------------------------------------

See more code examples in the static word embeddings use cases.

Resources
=========

You can download some `resources `_ created with Mangoes.

Documentation
=============

.. toctree::
   :maxdepth: 1

   use_cases_contextual/index
   enhanced_models/index
   use_cases_static/index
   parameters
   api

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`