mangoes.counting module
Functions to count word co-occurrences within a corpus.
This module provides the main function count_cooccurrence to construct a CountBasedRepresentation.
mangoes.counting.count_cooccurrence(corpus, words, context=<mangoes.context.Window object>, subsampling=False, nb_workers=None, batch=1000000)
Build a CountBasedRepresentation where rows are the words in words, counting co-occurrences from the corpus.
- Parameters
- corpus: mangoes.Corpus
- words: mangoes.Vocabulary
words represented as vectors (rows of the matrix)
- context: mangoes.context.Context or mangoes.Vocabulary
A Vocabulary or a context-defining function such as those defined in the mangoes.context module. Default is a window of size 1-x-1: count the co-occurrences between each word in words and the words immediately surrounding it. If context is a Vocabulary, only the words of this vocabulary are considered in the window.
- nb_workers: int
number of subprocesses to use
- subsampling: boolean or dict
whether to apply subsampling to frequent words. Value can be False (default), True or a frequency threshold. If True, the default value of the create_subsampler() function is used.
- Returns
- mangoes.CountBasedRepresentation
Examples
>>> import mangoes.counting
>>> window_5 = mangoes.context.Window(window_half_size=5)
>>> counts_matrix = mangoes.counting.count_cooccurrence(corpus, vocabulary, context=window_5)
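To make the windowing behaviour concrete, the following is a minimal pure-Python sketch of symmetric-window co-occurrence counting. It does not use mangoes and is not its implementation: the function name count_window_cooccurrences, the list-of-token-lists input and the Counter output are all assumptions made for illustration, and filtering context words after the window is taken is only one possible reading of the description above.

```python
from collections import Counter


def count_window_cooccurrences(sentences, words, half_size=1, contexts=None):
    """Count (word, context) pairs inside a symmetric window.

    Hypothetical helper for illustration only, not part of mangoes.
    """
    words = set(words)
    contexts = None if contexts is None else set(contexts)
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            if word not in words:
                continue
            # Window bounds, clipped to the sentence boundaries.
            start = max(0, i - half_size)
            stop = min(len(sentence), i + half_size + 1)
            for j in range(start, stop):
                if j == i:
                    continue  # a word is not its own context
                neighbour = sentence[j]
                # Assumption: a context vocabulary filters the window content.
                if contexts is None or neighbour in contexts:
                    counts[(word, neighbour)] += 1
    return counts


sentences = [["the", "cat", "sat", "on", "the", "mat"]]
counts = count_window_cooccurrences(sentences, words=["cat", "mat"], half_size=1)
filtered = count_window_cooccurrences(
    sentences, words=["cat", "mat"], half_size=1, contexts=["the"]
)
```

With half_size=1, "cat" is counted with "the" and "sat", and "mat" with the preceding "the". Passing contexts=["the"] keeps only the pairs whose context word is "the".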
mangoes.counting.create_subsampler(corpus, threshold=1e-05)
Compute the removal probabilities of frequent words.
For each word appearing in the corpus with a frequency higher than the threshold, a probability of removal is computed with the formula:
p = 1 - \sqrt{\frac{t}{f}}
where t is the threshold and f is the frequency of the word in the corpus.
- Parameters
- corpus: mangoes.Corpus
Frequencies come from corpus.words_count
- threshold: float, optional
Words whose frequency exceeds this threshold are included in the subsampler (default: 10^{-5})
- Returns
- dict
a dictionary associating each frequent word with a removal probability
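The formula p = 1 - \sqrt{t/f} can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the mangoes implementation: the helper name removal_probabilities is hypothetical, the input is an ordinary dict of raw counts, and the frequency f is assumed to be the word count divided by the total number of tokens.

```python
import math


def removal_probabilities(word_counts, threshold=1e-5):
    """Map each frequent word to p = 1 - sqrt(threshold / frequency).

    Hypothetical helper; assumes frequency = count / total token count.
    """
    total = sum(word_counts.values())
    probabilities = {}
    for word, count in word_counts.items():
        frequency = count / total
        # Only words above the threshold get a removal probability.
        if frequency > threshold:
            probabilities[word] = 1 - math.sqrt(threshold / frequency)
    return probabilities


# Toy counts with a deliberately large threshold so the effect is visible.
word_counts = {"the": 900, "cat": 99, "zygote": 1}
probs = removal_probabilities(word_counts, threshold=0.01)
```

In this toy case "zygote" has a frequency of 0.001, below the threshold, so it is absent from the result, while "the" receives a higher removal probability than "cat" because it is more frequent.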
mangoes.counting.merge(*counts, word_keys=None, context_keys=None, concat=<function <lambda>>)
Merge co-occurrence counts into one, with parameters that control how words and contexts are merged.
- Parameters
- counts: list of mangoes.CountBasedRepresentation
List of cooccurrence counts to be merged
- word_keys: None (default), or bool or list of str
If None or False, words that are common to several vocabularies are considered the same and their counts are summed. If word_keys is a list of strings of the same size as counts, words are prefixed with these keys (prefixing is the default, but it can be changed with the concat parameter). If word_keys is True, the languages of the vocabularies are used as keys.
- context_keys: None (default), or bool or list of str
If None or False, context words that are common to several context vocabularies are considered the same and their counts are summed. If context_keys is a list of strings of the same size as counts, context words are prefixed with these keys. If context_keys is True, the languages of the context vocabularies are used as keys.
- concat: callable, optional
Function that takes a key and a word (or a token) as input and returns a new word (or token). If keys are given, this function is called to create each word of the merged vocabulary from the given key and the original word. The default is '{key}_{word}', which prefixes each word with its key and is only valid for vocabularies of simple string words. Bigrams are transformed by applying this function to both of their parts.
- Returns
- mangoes.CountBasedRepresentation
a cooccurrence count with merged vocabulary, context and counts
Examples
The first example is a use case where you count co-occurrences of words and POS tags in different languages:
>>> import numpy as np
>>> import mangoes
>>> english_words = mangoes.Vocabulary(['can', 'car', 'cap'], language='en')
>>> french_words = mangoes.Vocabulary(['car', 'cap'], language='fr')
>>> pos_contexts = mangoes.Vocabulary(['ADJ', 'NOUN', 'VERB'])
>>> en_count = mangoes.CountBasedRepresentation(english_words, pos_contexts, np.array(range(9)).reshape((3,3)))
>>> print(en_count.to_df()) # doctest: +NORMALIZE_WHITESPACE
     ADJ  NOUN  VERB
can    0     1     2
car    3     4     5
cap    6     7     8
>>> fr_count = mangoes.CountBasedRepresentation(french_words, pos_contexts, np.array(range(6)).reshape((2,3)))
>>> print(fr_count.to_df())
     ADJ  NOUN  VERB
car    0     1     2
cap    3     4     5
>>> print(mangoes.counting.merge(en_count, fr_count, word_keys=True).to_df())
        ADJ  NOUN  VERB
en_can    0     1     2
en_car    3     4     5
en_cap    6     7     8
fr_car    0     1     2
fr_cap    3     4     5
A second example where contexts are keyed:
>>> import mangoes
>>> words = mangoes.Vocabulary(['a', 'b', 'c'])
>>> contexts1 = mangoes.Vocabulary(['x', 'y', 'z'])
>>> contexts2 = mangoes.Vocabulary(['x', 'y'])
>>> count1 = mangoes.CountBasedRepresentation(words, contexts1, np.array(range(9)).reshape((3,3)))
>>> print(count1.to_df()) # doctest: +NORMALIZE_WHITESPACE
   x  y  z
a  0  1  2
b  3  4  5
c  6  7  8
>>> count2 = mangoes.CountBasedRepresentation(words, contexts2, np.array(range(6)).reshape((3, 2)))
>>> print(count2.to_df())
   x  y
a  0  1
b  2  3
c  4  5
>>> print(mangoes.counting.merge(count1, count2, context_keys=['c1', 'c2']).to_df())
   c1_x  c1_y  c1_z  c2_x  c2_y
a     0     1     2     0     1
b     3     4     5     2     3
c     6     7     8     4     5
>>> print(mangoes.counting.merge(count1, count2, context_keys=['c1', 'c2'], concat=lambda k,w : '{}({})'.format(w, k)).to_df())
   x(c1)  y(c1)  z(c1)  x(c2)  y(c2)
a      0      1      2      0      1
b      3      4      5      2      3
c      6      7      8      4      5
And a third example with tokens:
>>> import collections
>>> Token = collections.namedtuple('Token', 'lemma POS')
>>> english_tokens = mangoes.Vocabulary([Token('can', 'N'), Token('can', 'V'), Token('car', 'N'), Token('cap', 'N')], language='en')
>>> french_tokens = mangoes.Vocabulary([Token('car', 'C'), Token('cap', 'N')], language='fr')
>>> en_tok_count = mangoes.CountBasedRepresentation(english_tokens, pos_contexts, np.array(range(12)).reshape((4,3)))
>>> print(en_tok_count.to_df()) # doctest: +NORMALIZE_WHITESPACE
       ADJ  NOUN  VERB
can N    0     1     2
    V    3     4     5
car N    6     7     8
cap N    9    10    11
>>> fr_tok_count = mangoes.CountBasedRepresentation(french_tokens, pos_contexts, np.array(range(6)).reshape((2,3)))
>>> print(fr_tok_count.to_df())
       ADJ  NOUN  VERB
car C    0     1     2
cap N    3     4     5
>>> print(mangoes.counting.merge(en_tok_count, fr_tok_count, word_keys=True, concat=lambda k, w: (*w, k)).to_df())
          ADJ  NOUN  VERB
can N en    0     1     2
    V en    3     4     5
car N en    6     7     8
cap N en    9    10    11
car C fr    0     1     2
cap N fr    3     4     5
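The difference between summing shared words and keying them, as described for word_keys, can also be sketched without mangoes. The code below is an illustration only: merge_counts and prefix_concat are hypothetical helpers, and the count tables are plain dicts keyed by (word, context) pairs instead of CountBasedRepresentation objects.

```python
from collections import Counter


def prefix_concat(key, word):
    """Mirror the documented default behaviour: prefix the word with its key."""
    return "{}_{}".format(key, word)


def merge_counts(tables, word_keys=None, concat=prefix_concat):
    """Merge {(word, context): count} tables.

    Hypothetical helper for illustration, not the mangoes API.
    """
    merged = Counter()
    for index, table in enumerate(tables):
        for (word, context), value in table.items():
            if word_keys is not None:
                # Keyed mode: the same word from two tables stays distinct.
                word = concat(word_keys[index], word)
            # Unkeyed mode: identical (word, context) pairs are summed.
            merged[(word, context)] += value
    return merged


en_counts = {("car", "NOUN"): 4, ("can", "VERB"): 2}
fr_counts = {("car", "NOUN"): 1}

summed = merge_counts([en_counts, fr_counts])
keyed = merge_counts([en_counts, fr_counts], word_keys=["en", "fr"])
```

Without keys, the two "car" rows collapse into a single entry whose count is the sum. With word_keys=["en", "fr"], they remain separate as "en_car" and "fr_car".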