mangoes.counting module

Functions to count word co-occurrences within a corpus.

This module provides the main function count_cooccurrence to construct a CountBasedRepresentation.

mangoes.counting.count_cooccurrence(corpus, words, context=<mangoes.context.Window object>, subsampling=False, nb_workers=None, batch=1000000)

Build a CountBasedRepresentation whose rows correspond to the words in words, counting their co-occurrences in the corpus.

Parameters
corpus: mangoes.Corpus
words: mangoes.Vocabulary

words represented as vectors (rows of the matrix)

context: mangoes.context.Context or mangoes.Vocabulary

A Vocabulary or a context-defining function such as those defined in the mangoes.context module. Default is a symmetric window of size 1 (one word on each side): count the co-occurrences between the words in words and the words surrounding them. If context is a Vocabulary, only the words of this vocabulary are considered within the window.

subsampling: boolean or dict

whether to apply subsampling to frequent words. The value can be False (default), True, or a frequency threshold. If True, the default threshold of the create_subsampler() function is used

nb_workers: int

number of subprocesses to use

Returns
mangoes.CountBasedRepresentation

Examples

>>> import mangoes.counting
>>> window_5 = mangoes.context.Window(window_half_size=5)
>>> counts_matrix = mangoes.counting.count_cooccurrence(corpus, vocabulary, context=window_5)
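
A sketch combining the documented parameters above, assuming the same corpus and vocabulary as in the previous example (subsampling makes the resulting counts non-deterministic):

>>> counts_matrix = mangoes.counting.count_cooccurrence(corpus, vocabulary,
...                                                     context=window_5,
...                                                     subsampling=True,
...                                                     nb_workers=4)
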
mangoes.counting.create_subsampler(corpus, threshold=1e-05)

Compute probabilities of removal of frequent words

For each word whose frequency in the corpus is higher than the threshold, a probability of removal is computed following the formula:

p = 1 - \sqrt{\frac{t}{f}}

where t is the threshold and f the frequency of the word in the corpus.
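
For instance, with the default threshold t = 10^{-5}, a word appearing with frequency f = 10^{-3} gets a removal probability of 1 - \sqrt{10^{-2}} = 0.9. The small computation below simply restates the formula; the frequency is a made-up value for illustration:

>>> import math
>>> t, f = 1e-5, 1e-3
>>> round(1 - math.sqrt(t / f), 3)
0.9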

Parameters
corpus: mangoes.Corpus

Frequencies come from corpus.words_count

threshold: float, optional

Words whose frequency is higher than this threshold are included in the subsampler (default: 10^{-5})

Returns
dict

a dictionary associating each frequent word with a removal probability
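
A usage sketch, assuming corpus and vocabulary are defined as in the count_cooccurrence example above. Since the subsampling parameter of count_cooccurrence is documented as a boolean or a dict, passing it the returned dictionary directly is shown here as an assumption rather than confirmed behaviour:

>>> subsampler = mangoes.counting.create_subsampler(corpus, threshold=1e-5)
>>> counts_matrix = mangoes.counting.count_cooccurrence(corpus, vocabulary,
...                                                     subsampling=subsampler)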

mangoes.counting.merge(*counts, word_keys=None, context_keys=None, concat=<function <lambda>>)

Merge several cooccurrence counts into one, with parameters to control how words and contexts are merged

Parameters
counts: list of mangoes.CountBasedRepresentation

List of cooccurrence counts to be merged

word_keys: None (default), or bool or list of str

If None or False, words that are common to several vocabularies are considered the same and their counts are summed. If word_keys is a list of strings of the same size as counts, words are prefixed with these keys (prefixing is the default behaviour, but it can be changed with the concat parameter). If word_keys is True, the languages of the vocabularies are used as keys.

context_keys: None (default), or bool or list of str

If None or False, context words that are common to several context vocabularies are considered the same and their counts are summed. If context_keys is a list of strings of the same size as counts, context words are prefixed with these keys. If context_keys is True, the languages of the context vocabularies are used as keys.

concat: callable, optional

Function that takes a key and a word (or a token) as input and returns a new word (or token). If keys are given, this function is called to create the words of the merged vocabulary from the given keys and the original words. The default is ‘{key}_{word}’, which prefixes each word with its key and is only valid for vocabularies of simple string words. Bigrams are transformed by applying this function to both of their parts.

Returns
mangoes.CountBasedRepresentation

a cooccurrence count with merged vocabulary, context and counts

Examples

The first example is a use case where you count co-occurrences of words and POS tags from different languages:

>>> import mangoes
>>> import numpy as np
>>> english_words = mangoes.Vocabulary(['can', 'car', 'cap'], language='en')
>>> french_words = mangoes.Vocabulary(['car', 'cap'], language='fr')
>>> pos_contexts = mangoes.Vocabulary(['ADJ', 'NOUN', 'VERB'])
>>> en_count = mangoes.CountBasedRepresentation(english_words, pos_contexts, np.array(range(9)).reshape((3,3)))
>>> print(en_count.to_df())  # doctest: +NORMALIZE_WHITESPACE
         ADJ  NOUN  VERB
    can    0     1     2
    car    3     4     5
    cap    6     7     8

>>> fr_count = mangoes.CountBasedRepresentation(french_words, pos_contexts, np.array(range(6)).reshape((2,3)))
>>> print(fr_count.to_df()) 
         ADJ  NOUN  VERB
    car    0     1     2
    cap    3     4     5
>>> print(mangoes.counting.merge(en_count, fr_count, word_keys=True).to_df()) 
           ADJ  NOUN  VERB
    en_can   0     1     2
    en_car   3     4     5
    en_cap   6     7     8
    fr_car   0     1     2
    fr_cap   3     4     5

A second example where contexts are keyed:

>>> import mangoes
>>> words = mangoes.Vocabulary(['a', 'b', 'c'])
>>> contexts1 = mangoes.Vocabulary(['x', 'y', 'z'])
>>> contexts2 = mangoes.Vocabulary(['x', 'y'])
>>> count1 = mangoes.CountBasedRepresentation(words, contexts1, np.array(range(9)).reshape((3,3)))
>>> print(count1.to_df())  # doctest: +NORMALIZE_WHITESPACE
         x     y     z
    a    0     1     2
    b    3     4     5
    c    6     7     8

>>> count2 = mangoes.CountBasedRepresentation(words, contexts2, np.array(range(6)).reshape((3, 2)))
>>> print(count2.to_df()) 
         x     y
    a    0     1
    b    2     3
    c    4     5
>>> print(mangoes.counting.merge(count1, count2, context_keys=['c1', 'c2']).to_df()) 
        c1_x    c1_y    c1_z    c2_x    c2_y
    a      0       1       2       0       1
    b      3       4       5       2       3
    c      6       7       8       4       5
>>> print(mangoes.counting.merge(count1, count2, context_keys=['c1', 'c2'],
...                              concat=lambda k, w: '{}({})'.format(w, k)).to_df())
         x(c1)   y(c1)   z(c1)   x(c2)   y(c2)
    a      0       1       2       0       1
    b      3       4       5       2       3
    c      6       7       8       4       5

And a third example with tokens:

>>> import collections
>>> Token = collections.namedtuple('Token', 'lemma POS')
>>> english_tokens = mangoes.Vocabulary([Token('can', 'N'), Token('can', 'V'), Token('car', 'N'), Token('cap', 'N')], language='en')
>>> french_tokens = mangoes.Vocabulary([Token('car', 'C'), Token('cap', 'N')], language='fr')
>>> en_tok_count = mangoes.CountBasedRepresentation(english_tokens, pos_contexts, np.array(range(12)).reshape((4,3)))
>>> print(en_tok_count.to_df())  # doctest: +NORMALIZE_WHITESPACE
           ADJ  NOUN  VERB
    can N    0     1     2
        V    3     4     5
    car N    6     7     8
    cap N    9    10    11

>>> fr_tok_count = mangoes.CountBasedRepresentation(french_tokens, pos_contexts, np.array(range(6)).reshape((2,3)))
>>> print(fr_tok_count.to_df()) 
           ADJ  NOUN  VERB
    car C    0     1     2
    cap N    3     4     5
>>> print(mangoes.counting.merge(en_tok_count, fr_tok_count, word_keys=True,
...                              concat=lambda k, w: (*w, k)).to_df())
               ADJ  NOUN  VERB
    can N en    0     1     2
        V en    3     4     5
    car N en    6     7     8
    cap N en    9    10    11
    car C fr    0     1     2
    cap N fr    3     4     5
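
For completeness, a sketch of the default behaviour described for word_keys and context_keys: with no keys given, words and contexts common to several counts are merged and their counts summed. Reusing count1 and count2 from the second example, the merged matrix keeps rows a, b, c and columns x, y, z, with the x and y columns holding the element-wise sums of both counts. The output below is indicative only, derived from that description rather than verified:

>>> print(mangoes.counting.merge(count1, count2).to_df())  # doctest: +SKIP
         x     y     z
    a    0     2     2
    b    5     7     5
    c   10    12     8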