mangoes.counting module

Functions to count word co-occurrences within a corpus.

This module provides the main function count_cooccurrence to construct a CountBasedRepresentation.

mangoes.counting.count_cooccurrence(corpus, words, context=<mangoes.context.Window object>, subsampling=False, nb_workers=None, batch=1000000)

Build a CountBasedRepresentation whose rows correspond to the words in words, counting their co-occurrences in the corpus.

Parameters
corpus: mangoes.Corpus
words: mangoes.Vocabulary

words represented as vectors (rows of the matrix)

context: mangoes.context.Context or mangoes.Vocabulary

A Vocabulary or a context-defining function such as those defined in the mangoes.context module. Default is a symmetric window of size 1 (one word on each side): count the co-occurrences between the words in words and the words surrounding them. If context is a Vocabulary, only the words of this vocabulary are considered within the window.

subsampling: boolean or dict

whether to apply subsampling to frequent words. The value can be False (default), True, or a frequency threshold. If True, the default threshold of the create_subsampler() function is used

nb_workers: int

number of subprocesses to use

Returns
mangoes.CountBasedRepresentation

Examples

>>> import mangoes.counting
>>> window_5 = mangoes.context.Window(window_half_size=5)
>>> counts_matrix = mangoes.counting.count_cooccurrence(corpus, vocabulary, context=window_5)
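
A sketch combining the documented parameters above, assuming the same corpus and vocabulary as in the previous example (subsampling makes the resulting counts non-deterministic):

>>> counts_matrix = mangoes.counting.count_cooccurrence(corpus, vocabulary,
...                                                     context=window_5,
...                                                     subsampling=True,
...                                                     nb_workers=4)
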
mangoes.counting.create_subsampler(corpus, threshold=1e-05)

Compute probabilities of removal of frequent words

For each word whose frequency in the corpus is higher than the threshold, a probability of removal is computed following the formula:

p = 1 - \sqrt{\frac{t}{f}}

where t is the threshold and f the frequency of the word in the corpus.
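
For instance, with the default threshold t = 10^{-5}, a word appearing with frequency f = 10^{-3} gets a removal probability of 1 - \sqrt{10^{-2}} = 0.9. The small computation below simply restates the formula; the frequency is a made-up value for illustration:

>>> import math
>>> t, f = 1e-5, 1e-3
>>> round(1 - math.sqrt(t / f), 3)
0.9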

Parameters
corpus: mangoes.Corpus

Frequencies come from corpus.words_count

threshold: float, optional

Words whose frequency is higher than this threshold are included in the subsampler (default: 10^{-5})

Returns
dict

a dictionary associating each frequent word with a removal probability
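
A usage sketch, assuming corpus and vocabulary are defined as in the count_cooccurrence example above. Since the subsampling parameter of count_cooccurrence is documented as a boolean or a dict, passing it the returned dictionary directly is shown here as an assumption rather than confirmed behaviour:

>>> subsampler = mangoes.counting.create_subsampler(corpus, threshold=1e-5)
>>> counts_matrix = mangoes.counting.count_cooccurrence(corpus, vocabulary,
...                                                     subsampling=subsampler)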

mangoes.counting.merge(*counts, word_keys=None, context_keys=None, concat=<function <lambda>>)

Merge several cooccurrence counts into one, with parameters to control how words and contexts are merged

Parameters
counts: list of mangoes.CountBasedRepresentation

List of cooccurrence counts to be merged

word_keys: None (default), or bool or list of str

If None or False, words that are common to several vocabularies are considered the same and their counts are summed. If word_keys is a list of strings of the same size as counts, words are prefixed with these keys (prefixing is the default behaviour, but it can be changed with the concat parameter). If word_keys is True, the languages of the vocabularies are used as keys.

context_keys: None (default), or bool or list of str

If None or False, context words that are common to several context vocabularies are considered the same and their counts are summed. If context_keys is a list of strings of the same size as counts, context words are prefixed with these keys. If context_keys is True, the languages of the context vocabularies are used as keys.

concat: callable, optional

Function that takes a key and a word (or a token) as input and returns a new word (or token). If keys are given, this function is called to create the words of the merged vocabulary from the given keys and the original words. The default is ‘{key}_{word}’, which prefixes each word with its key and is only valid for vocabularies of simple string words. Bigrams are transformed by applying this function to both of their parts.

Returns
mangoes.CountBasedRepresentation

a cooccurrence count with merged vocabulary, context and counts

Examples

The first example is a use case where you count co-occurrences of words and POS tags from different languages:

>>> import mangoes
>>> import numpy as np
>>> english_words = mangoes.Vocabulary(['can', 'car', 'cap'], language='en')
>>> french_words = mangoes.Vocabulary(['car', 'cap'], language='fr')
>>> pos_contexts = mangoes.Vocabulary(['ADJ', 'NOUN', 'VERB'])
>>> en_count = mangoes.CountBasedRepresentation(english_words, pos_contexts, np.array(range(9)).reshape((3,3)))
>>> print(en_count.to_df())  # doctest: +NORMALIZE_WHITESPACE
         ADJ  NOUN  VERB
    can    0     1     2
    car    3     4     5
    cap    6     7     8

>>> fr_count = mangoes.CountBasedRepresentation(french_words, pos_contexts, np.array(range(6)).reshape((2,3)))
>>> print(fr_count.to_df()) 
         ADJ  NOUN  VERB
    car    0     1     2
    cap    3     4     5
>>> print(mangoes.counting.merge(en_count, fr_count, word_keys=True).to_df()) 
           ADJ  NOUN  VERB
    en_can   0     1     2
    en_car   3     4     5
    en_cap   6     7     8
    fr_car   0     1     2
    fr_cap   3     4     5

A second example where contexts are keyed:

>>> import mangoes
>>> words = mangoes.Vocabulary(['a', 'b', 'c'])
>>> contexts1 = mangoes.Vocabulary(['x', 'y', 'z'])
>>> contexts2 = mangoes.Vocabulary(['x', 'y'])
>>> count1 = mangoes.CountBasedRepresentation(words, contexts1, np.array(range(9)).reshape((3,3)))
>>> print(count1.to_df())  # doctest: +NORMALIZE_WHITESPACE
         x     y     z
    a    0     1     2
    b    3     4     5
    c    6     7     8

>>> count2 = mangoes.CountBasedRepresentation(words, contexts2, np.array(range(6)).reshape((3, 2)))
>>> print(count2.to_df()) 
         x     y
    a    0     1
    b    2     3
    c    4     5
>>> print(mangoes.counting.merge(count1, count2, context_keys=['c1', 'c2']).to_df()) 
        c1_x    c1_y    c1_z    c2_x    c2_y
    a      0       1       2       0       1
    b      3       4       5       2       3
    c      6       7       8       4       5
>>> print(mangoes.counting.merge(count1, count2, context_keys=['c1', 'c2'],
...                              concat=lambda k, w: '{}({})'.format(w, k)).to_df())
         x(c1)   y(c1)   z(c1)   x(c2)   y(c2)
    a      0       1       2       0       1
    b      3       4       5       2       3
    c      6       7       8       4       5

And a third example with tokens:

>>> import collections
>>> Token = collections.namedtuple('Token', 'lemma POS')
>>> english_tokens = mangoes.Vocabulary([Token('can', 'N'), Token('can', 'V'), Token('car', 'N'), Token('cap', 'N')], language='en')
>>> french_tokens = mangoes.Vocabulary([Token('car', 'C'), Token('cap', 'N')], language='fr')
>>> en_tok_count = mangoes.CountBasedRepresentation(english_tokens, pos_contexts, np.array(range(12)).reshape((4,3)))
>>> print(en_tok_count.to_df())  # doctest: +NORMALIZE_WHITESPACE
           ADJ  NOUN  VERB
    can N    0     1     2
        V    3     4     5
    car N    6     7     8
    cap N    9    10    11

>>> fr_tok_count = mangoes.CountBasedRepresentation(french_tokens, pos_contexts, np.array(range(6)).reshape((2,3)))
>>> print(fr_tok_count.to_df()) 
           ADJ  NOUN  VERB
    car C    0     1     2
    cap N    3     4     5
>>> print(mangoes.counting.merge(en_tok_count, fr_tok_count, word_keys=True,
...                              concat=lambda k, w: (*w, k)).to_df())
               ADJ  NOUN  VERB
    can N en    0     1     2
        V en    3     4     5
    car N en    6     7     8
    cap N en    9    10    11
    car C fr    0     1     2
    cap N fr    3     4     5
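
For completeness, a sketch of the default behaviour described for word_keys and context_keys: with no keys given, words and contexts common to several counts are merged and their counts summed. Reusing count1 and count2 from the second example, the merged matrix keeps rows a, b, c and columns x, y, z, with the x and y columns holding the element-wise sums of both counts. The output below is indicative only, derived from that description rather than verified:

>>> print(mangoes.counting.merge(count1, count2).to_df())  # doctest: +SKIP
         x     y     z
    a    0     2     2
    b    5     7     5
    c   10    12     8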