mangoes.corpus module

Classes and functions to manage the documents used as a corpus.

class mangoes.corpus.Token(form, lemma, POS)

Bases: tuple

Attributes
POS

Alias for field number 2

form

Alias for field number 0

lemma

Alias for field number 1

Methods

count(value, /)

Return number of occurrences of value.

index(value[, start, stop])

Return first index of value.

property POS

Alias for field number 2

property form

Alias for field number 0

property lemma

Alias for field number 1
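
Since Token is a namedtuple, its fields can be read by name or by position, and the inherited tuple methods count and index operate on the field values. A minimal sketch using an equivalent plain namedtuple (without importing mangoes):

```python
from collections import namedtuple

# Equivalent stand-in for mangoes.corpus.Token: a (form, lemma, POS) namedtuple.
Token = namedtuple("Token", ("form", "lemma", "POS"))

token = Token(form="cats", lemma="cat", POS="NOUN")

# Fields are aliases for tuple positions 0, 1 and 2.
print(token.form)    # same as token[0]
print(token.lemma)   # same as token[1]
print(token.POS)     # same as token[2]

# Inherited tuple methods work on the field values.
print(token.index("cat"))   # first index of the value 'cat'
print(token.count("cat"))   # number of occurrences of the value 'cat'
```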

class mangoes.corpus.Corpus(content, name=None, language=None, reader=<class 'mangoes.utils.reader.TextSentenceGenerator'>, lower=False, digit=False, ignore_punctuation=False, nb_sentences=None, lazy=False)

Bases: object

Class to access the source used as a Corpus

The Corpus class creates a sentence generator from documents.

Parameters
content: string or an iterable

An iterable of sentences or a path to a file or a repository

name: str

A name for your corpus. If no name is given and content is a path, the name will be this path.

reader: class

A class deriving from mangoes.utils.reader.SentenceGenerator. Some shortcuts are defined in this module: TEXT (default), BROWN, XML and CONLL

lower: boolean, optional

If True, converts sentences to lower case. Default: False

digit: boolean, optional

If True, replaces numeric values with the value of DIGIT_TOKEN in sentences. Default: False

ignore_punctuation: boolean, optional

If True, the punctuation will be ignored when reading the corpus. Default: False

nb_sentences: int, optional

Expected number of sentences in the Corpus, if known. This number is only used to improve the output of the progress bar; the actual value is computed during initialization.

lazy: boolean, optional

If False (default), words and sentences are counted when the Corpus is created (which can take a while); if True, they are only counted when needed.

Attributes
words_count: collections.Counter

Occurrences of each word in the corpus

nb_sentences

Number of sentences in this corpus or None if unknown.

size

Number of words in the Corpus

annotated: boolean

Whether or not the corpus is annotated
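
To make the attributes above concrete, here is a plain-Python sketch (not using mangoes itself) of what words_count, nb_sentences and size represent for a small corpus of tokenized sentences:

```python
from collections import Counter

# A corpus is an iterable of sentences, each sentence a list of words.
sentences = [["the", "cat", "sat"],
             ["the", "dog", "sat"]]

# words_count: occurrences of each word in the corpus.
words_count = Counter(word for sentence in sentences for word in sentence)

nb_sentences = len(sentences)       # Corpus.nb_sentences
size = sum(words_count.values())    # Corpus.size: total number of words

print(words_count)   # Counter({'the': 2, 'sat': 2, 'cat': 1, 'dog': 1})
print(nb_sentences)  # 2
print(size)          # 6
```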

Methods

create_vocabulary([attributes, filters])

Create a vocabulary from the corpus

describe()

Print properties of this corpus

load_from_metadata(file_path)

Create a Corpus instance from previously saved metadata

peek([size])

Print the first sentences of the corpus

save_metadata(file_path)

Save metadata of this Corpus in a pickle file

property nb_sentences

Number of sentences in this corpus or None if unknown.

Returns
int or None

Notes

If Corpus is created with parameter lazy=True, this value is evaluated only when words_count is called. The value may also be set manually (if it is known).

property name

Name of the corpus

property language

Language of the corpus

property lower

If True, converts sentences to lower case.

property digit

If True, replaces numeric values with the value of DIGIT_TOKEN in sentences.

property ignore_punctuation

If True, the punctuation will be ignored when reading the corpus.

property params

Parameters of the corpus

property words_count

Occurrences of each word in the corpus

Returns
collections.Counter

a Counter with words as keys and number of occurrences as value

property bigrams_count

Occurrences of each bigram in the corpus

Returns
collections.Counter

a Counter with bigrams as keys and number of occurrences as value

property size

Number of words in the Corpus

Returns
int

describe()

Print properties of this corpus

peek(size=5)

Print the first sentences of the corpus

Parameters
size: int

Number of sentences to display (default: 5)

create_vocabulary(attributes=None, filters=None)

Create a vocabulary from the corpus

Parameters
attributes: string or tuple of strings, optional

If the Corpus is annotated, attribute(s) to get for each token. If None (default), all attributes are kept.

filters: list of callables, optional

A filter is a parametrized function that filters values from the Corpus’s words_count. This module provides 6 filters: truncate, remove_least_frequent, remove_most_frequent, remove_elements, filter_by_attribute and filter_attributes.

Returns
mangoes.Vocabulary

The words are sorted by frequency

Notes

You can also write and use your own filters. A filter is a parametrized function that takes a collections.Counter() as input and returns a collections.Counter(). It should be decorated with mangoes.utils.decorators.counter_filter()
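
As a sketch of the pattern described above, the stand-in decorator below mimics the assumed behaviour of mangoes.utils.decorators.counter_filter (parametrize the filter first, apply it to a Counter later); the real decorator's implementation may differ:

```python
import functools
from collections import Counter

def counter_filter(func):
    """Minimal stand-in for mangoes.utils.decorators.counter_filter:
    lets a filter be parametrized first and applied to a Counter later."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # If no Counter has been given yet, return a partially applied filter.
        if not args or not isinstance(args[-1], Counter):
            return functools.partial(func, *args, **kwargs)
        return func(*args, **kwargs)
    return wrapper

@counter_filter
def remove_short_words(min_length, words_count):
    """Custom filter: keep only words of at least `min_length` characters."""
    return Counter({word: count for word, count in words_count.items()
                    if len(word) >= min_length})

counts = Counter({"a": 5, "the": 3, "elephant": 2})
keep_long = remove_short_words(3)   # parametrized filter, no Counter yet
print(keep_long(counts))            # Counter({'the': 3, 'elephant': 2})
```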

save_metadata(file_path)

Save metadata of this Corpus in a pickle file

Save path to corpus, words_count, number of sentences, … in a pickle file

Parameters
file_path: string

static load_from_metadata(file_path)

Create a Corpus instance from previously saved metadata

Parameters
file_path: string

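
A plain-pickle sketch of the save/load round trip; the actual keys stored by save_metadata are mangoes-internal, so the dictionary below is purely illustrative:

```python
import os
import pickle
import tempfile
from collections import Counter

# Illustrative metadata: path to corpus, words_count, number of sentences, …
metadata = {"content": "/path/to/corpus.txt",   # hypothetical path
            "words_count": Counter({"the": 2, "cat": 1}),
            "nb_sentences": 1}

path = os.path.join(tempfile.mkdtemp(), "corpus.metadata")
with open(path, "wb") as f:
    pickle.dump(metadata, f)       # what Corpus.save_metadata(file_path) does, conceptually

with open(path, "rb") as f:
    restored = pickle.load(f)      # what load_from_metadata(file_path) reads back

print(restored["nb_sentences"])    # 1
```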
mangoes.corpus.truncate(max_nb, words_count)

Filter to apply to a counter to keep at most ‘max_nb’ elements.

Elements with higher counts are preferred over elements with lower counts. Elements with equal counts are selected arbitrarily during truncation, if necessary.

Parameters
max_nb: positive int

Maximal number of elements to keep

words_count: collections.Counter

The counter to filter

Returns
collections.Counter
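
The behaviour of truncate can be sketched with Counter.most_common, which also prefers higher counts and breaks ties arbitrarily (this is an illustration, not the actual mangoes implementation):

```python
from collections import Counter

def truncate_sketch(max_nb, words_count):
    """Keep at most `max_nb` elements, preferring higher counts."""
    return Counter(dict(words_count.most_common(max_nb)))

counts = Counter({"the": 10, "cat": 4, "sat": 4, "on": 1})
# 'the' is kept for sure; one of the two count-4 ties fills the second slot.
print(truncate_sketch(2, counts))
```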
mangoes.corpus.remove_least_frequent(min_frequency, words_count)

Filter to apply to a counter to keep the elements with a high enough frequency.

Parameters
min_frequency: positive int or float

If >= 1, it is interpreted as a ‘count’ value (a positive integer); otherwise, it is interpreted as a frequency.

words_count: collections.Counter

The counter to filter

Returns
collections.Counter

Examples

>>> vocabulary = mangoes.Vocabulary(corpus,
...                                 filters=[mangoes.vocabulary.remove_least_frequent(min_frequency)])
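
The count-versus-frequency convention for min_frequency can be sketched in plain Python (not the actual mangoes implementation):

```python
from collections import Counter

def remove_least_frequent_sketch(min_frequency, words_count):
    """Keep elements whose count is high enough: min_frequency >= 1 is an
    absolute count; min_frequency < 1 is a fraction of the corpus size."""
    if min_frequency >= 1:
        threshold = min_frequency                              # absolute count
    else:
        threshold = min_frequency * sum(words_count.values())  # relative frequency
    return Counter({w, }.__class__()) if False else Counter(
        {w: c for w, c in words_count.items() if c >= threshold})

counts = Counter({"the": 8, "cat": 2})   # 10 words in total
print(remove_least_frequent_sketch(3, counts))     # absolute threshold: 3
print(remove_least_frequent_sketch(0.25, counts))  # relative threshold: 0.25 * 10 = 2.5
```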
mangoes.corpus.remove_most_frequent(max_frequency, words_count)

Filter to apply to a counter to only keep the elements with a low enough frequency.

Parameters
max_frequency: positive int or float

If >= 1, it is interpreted as a ‘count’ value (a positive integer); otherwise, it is interpreted as a frequency.

words_count: collections.Counter

The counter to filter

Returns
collections.Counter

Examples

>>> vocabulary = mangoes.Vocabulary(corpus,
...                                 filters=[mangoes.vocabulary.remove_most_frequent(max_frequency)])
mangoes.corpus.remove_elements(stopwords, words_count=None, attribute=None)

Filter to apply to a counter to remove the elements of the ‘stopwords’ set-like object.

Parameters
stopwords: list or set or string

Collection of words to remove from the words_count (e.g. nltk.corpus.stopwords.words('english') or string.punctuation)

attribute: str or tuple, optional

If the keys in words_count are annotated tokens, attribute to consider

words_count: collections.Counter

The counter to filter

Returns
collections.Counter
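
A plain-Python sketch of the stopword-removal behaviour (ignoring the optional attribute handling for annotated tokens):

```python
import string
from collections import Counter

def remove_elements_sketch(stopwords, words_count):
    """Drop every key of `words_count` found in the stopwords collection."""
    stopwords = set(stopwords)
    return Counter({w: c for w, c in words_count.items() if w not in stopwords})

counts = Counter({"the": 5, "cat": 3, ",": 2, ".": 1})
print(remove_elements_sketch({"the"}, counts))             # drop a stopword
print(remove_elements_sketch(string.punctuation, counts))  # drop punctuation marks
```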
mangoes.corpus.filter_by_attribute(attribute, value, words_count=None)

Filter to apply to a counter to only keep certain tokens, based on the value of an attribute.

This filter can only be applied to an annotated Corpus

Parameters
attribute: str

If the keys in words_count are annotated tokens, attribute to consider

value: string or set of strings

Value or set of values to keep for the attribute

words_count: collections.Counter

The counter to filter

Returns
collections.Counter
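
A plain-Python sketch of the selection behaviour, using a namedtuple token as in the Token class above (not the actual mangoes implementation):

```python
from collections import Counter, namedtuple

Token = namedtuple("Token", ("form", "lemma", "POS"))

def filter_by_attribute_sketch(attribute, value, words_count):
    """Keep the tokens whose `attribute` is (or is among) `value`."""
    values = {value} if isinstance(value, str) else set(value)
    return Counter({token: count for token, count in words_count.items()
                    if getattr(token, attribute) in values})

counts = Counter({Token("can", "can", "NOUN"): 5,
                  Token("can", "can", "VBZ"): 3})
# Only the NOUN token survives the filter.
print(filter_by_attribute_sketch("POS", "NOUN", counts))
```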
mangoes.corpus.filter_attributes(attributes, words_count=None)

Filter to apply to a counter to only keep certain attributes of the tokens.

This filter can only be applied to an annotated Corpus

Parameters
attributes: str or tuple of str

If the keys in words_count are annotated tokens, attributes to keep

words_count: collections.Counter

The counter to filter

Returns
collections.Counter

Examples

>>> import collections
>>> Token = collections.namedtuple('Token', ('form', 'lemma', 'POS'))
>>> words_count = {Token('can', 'can', 'NOUN'): 5, Token('cans', 'can', 'NOUN'): 2, Token('can', 'can', 'VBZ'): 3}
>>> filter_attributes('lemma', words_count)
Counter({'can': 10})
>>> filter_attributes(('lemma', 'POS'), words_count)
Counter({Token(lemma='can', POS='NOUN'): 7, Token(lemma='can', POS='VBZ'): 3})