mangoes.corpus module

Classes and functions to manage the documents used as a corpus.

class mangoes.corpus.Token(form, lemma, POS)

Bases: tuple

Attributes
POS

Alias for field number 2

form

Alias for field number 0

lemma

Alias for field number 1

Methods

count(value, /)

Return number of occurrences of value.

index(value[, start, stop])

Return first index of value.

property POS

Alias for field number 2

property form

Alias for field number 0

property lemma

Alias for field number 1
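
Since Token is a namedtuple, its fields can be read by name or by position, and the inherited tuple methods count and index operate on the field values. A minimal sketch using an equivalent plain namedtuple (without importing mangoes):

```python
from collections import namedtuple

# Equivalent stand-in for mangoes.corpus.Token: a (form, lemma, POS) namedtuple.
Token = namedtuple("Token", ("form", "lemma", "POS"))

token = Token(form="cats", lemma="cat", POS="NOUN")

# Fields are aliases for tuple positions 0, 1 and 2.
print(token.form)    # same as token[0]
print(token.lemma)   # same as token[1]
print(token.POS)     # same as token[2]

# Inherited tuple methods work on the field values.
print(token.index("cat"))   # first index of the value 'cat'
print(token.count("cat"))   # number of occurrences of the value 'cat'
```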

class mangoes.corpus.Corpus(content, name=None, language=None, reader=<class 'mangoes.utils.reader.TextSentenceGenerator'>, lower=False, digit=False, ignore_punctuation=False, nb_sentences=None, lazy=False)

Bases: object

Class to access the source used as a Corpus

The Corpus class creates a sentence generator from documents.

Parameters
content: string or an iterable

An iterable of sentences or a path to a file or a repository

name: str

A name for your corpus. If no name is given and content is a path, the name will be this path.

reader: class

A class deriving from mangoes.utils.reader.SentenceGenerator. Some shortcuts are defined in this module: TEXT (default), BROWN, XML and CONLL

lower: boolean, optional

If True, converts sentences to lower case. Default: False

digit: boolean, optional

If True, replaces numeric values with the value of DIGIT_TOKEN in sentences. Default: False

ignore_punctuation: boolean, optional

If True, the punctuation will be ignored when reading the corpus. Default: False

nb_sentences: int, optional

Expected number of sentences in the Corpus, if known. This number is only used to improve the output of the progress bar; the actual value is computed during initialization.

lazy: boolean, optional

If False (default), words and sentences are counted when the Corpus is created (which can take a while); if True, they are only counted when needed.

Attributes
words_count: collections.Counter

Occurrences of each word in the corpus

nb_sentences

Number of sentences in this corpus or None if unknown.

size

Number of words in the Corpus

annotated: boolean

Whether or not the corpus is annotated
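
To make the attributes above concrete, here is a plain-Python sketch (not using mangoes itself) of what words_count, nb_sentences and size represent for a small corpus of tokenized sentences:

```python
from collections import Counter

# A corpus is an iterable of sentences, each sentence a list of words.
sentences = [["the", "cat", "sat"],
             ["the", "dog", "sat"]]

# words_count: occurrences of each word in the corpus.
words_count = Counter(word for sentence in sentences for word in sentence)

nb_sentences = len(sentences)       # Corpus.nb_sentences
size = sum(words_count.values())    # Corpus.size: total number of words

print(words_count)   # Counter({'the': 2, 'sat': 2, 'cat': 1, 'dog': 1})
print(nb_sentences)  # 2
print(size)          # 6
```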

Methods

create_vocabulary([attributes, filters])

Create a vocabulary from the corpus

describe()

Print properties of this corpus

load_from_metadata(file_path)

Create a Corpus instance from previously saved metadata

peek([size])

Print the first sentences of the corpus

save_metadata(file_path)

Save metadata of this Corpus in a pickle file

property nb_sentences

Number of sentences in this corpus or None if unknown.

Returns
int or None

Notes

If Corpus is created with parameter lazy=True, this value is evaluated only when words_count is called. The value may also be set manually (if it is known).

property name

Name of the corpus

property language

Language of the corpus

property lower

If True, converts sentences to lower case.

property digit

If True, replaces numeric values with the value of DIGIT_TOKEN in sentences.

property ignore_punctuation

If True, the punctuation will be ignored when reading the corpus.

property params

Parameters of the corpus

property words_count

Occurrences of each word in the corpus

Returns
collections.Counter

a Counter with words as keys and number of occurrences as value

property bigrams_count

Occurrences of each bigram in the corpus

Returns
collections.Counter

a Counter with bigrams as keys and number of occurrences as value

property size

Number of words in the Corpus

Returns
int

describe()

Print properties of this corpus

peek(size=5)

Print the first sentences of the corpus

Parameters
size: int

Number of sentences to display (default: 5)

create_vocabulary(attributes=None, filters=None)

Create a vocabulary from the corpus

Parameters
attributes: string or tuple of strings, optional

If the Corpus is annotated, attribute(s) to get for each token. If None (default), all attributes are kept.

filters: list of callables, optional

A filter is a parametrized function that filters values from the Corpus’s words_count. This module provides 6 filters: truncate, remove_least_frequent, remove_most_frequent, remove_elements, filter_by_attribute and filter_attributes.

Returns
mangoes.Vocabulary

The words are sorted by frequency

Notes

You can also write and use your own filters. A filter is a parametrized function that takes a collections.Counter() as input and returns a collections.Counter(). It should be decorated with mangoes.utils.decorators.counter_filter()
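
As a sketch of the pattern described above, the stand-in decorator below mimics the assumed behaviour of mangoes.utils.decorators.counter_filter (parametrize the filter first, apply it to a Counter later); the real decorator's implementation may differ:

```python
import functools
from collections import Counter

def counter_filter(func):
    """Minimal stand-in for mangoes.utils.decorators.counter_filter:
    lets a filter be parametrized first and applied to a Counter later."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # If no Counter has been given yet, return a partially applied filter.
        if not args or not isinstance(args[-1], Counter):
            return functools.partial(func, *args, **kwargs)
        return func(*args, **kwargs)
    return wrapper

@counter_filter
def remove_short_words(min_length, words_count):
    """Custom filter: keep only words of at least `min_length` characters."""
    return Counter({word: count for word, count in words_count.items()
                    if len(word) >= min_length})

counts = Counter({"a": 5, "the": 3, "elephant": 2})
keep_long = remove_short_words(3)   # parametrized filter, no Counter yet
print(keep_long(counts))            # Counter({'the': 3, 'elephant': 2})
```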

save_metadata(file_path)

Save metadata of this Corpus in a pickle file

Save path to corpus, words_count, number of sentences, … in a pickle file

Parameters
file_path: string

static load_from_metadata(file_path)

Create a Corpus instance from previously saved metadata

Parameters
file_path: string

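
A plain-pickle sketch of the save/load round trip; the actual keys stored by save_metadata are mangoes-internal, so the dictionary below is purely illustrative:

```python
import os
import pickle
import tempfile
from collections import Counter

# Illustrative metadata: path to corpus, words_count, number of sentences, …
metadata = {"content": "/path/to/corpus.txt",   # hypothetical path
            "words_count": Counter({"the": 2, "cat": 1}),
            "nb_sentences": 1}

path = os.path.join(tempfile.mkdtemp(), "corpus.metadata")
with open(path, "wb") as f:
    pickle.dump(metadata, f)       # what Corpus.save_metadata(file_path) does, conceptually

with open(path, "rb") as f:
    restored = pickle.load(f)      # what load_from_metadata(file_path) reads back

print(restored["nb_sentences"])    # 1
```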
mangoes.corpus.truncate(max_nb, words_count)

Filter to apply to a counter to keep at most ‘max_nb’ elements.

Elements with higher counts are preferred over elements with lower counts. Elements with equal counts are selected arbitrarily during truncation, if necessary.

Parameters
max_nb: positive int

Maximal number of elements to keep

words_count: collections.Counter

The counter to filter

Returns
collections.Counter
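
The behaviour of truncate can be sketched with Counter.most_common, which also prefers higher counts and breaks ties arbitrarily (this is an illustration, not the actual mangoes implementation):

```python
from collections import Counter

def truncate_sketch(max_nb, words_count):
    """Keep at most `max_nb` elements, preferring higher counts."""
    return Counter(dict(words_count.most_common(max_nb)))

counts = Counter({"the": 10, "cat": 4, "sat": 4, "on": 1})
# 'the' is kept for sure; one of the two count-4 ties fills the second slot.
print(truncate_sketch(2, counts))
```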
mangoes.corpus.remove_least_frequent(min_frequency, words_count)

Filter to apply to a counter to keep the elements with a high enough frequency.

Parameters
min_frequency: positive int or float

If >= 1, it is interpreted as a ‘count’ value (a positive integer); otherwise, it is interpreted as a frequency.

words_count: collections.Counter

The counter to filter

Returns
collections.Counter

Examples

>>> vocabulary = mangoes.Vocabulary(corpus,
...                                 filters=[mangoes.vocabulary.remove_least_frequent(min_frequency)])
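
The count-versus-frequency convention for min_frequency can be sketched in plain Python (not the actual mangoes implementation):

```python
from collections import Counter

def remove_least_frequent_sketch(min_frequency, words_count):
    """Keep elements whose count is high enough: min_frequency >= 1 is an
    absolute count; min_frequency < 1 is a fraction of the corpus size."""
    if min_frequency >= 1:
        threshold = min_frequency                              # absolute count
    else:
        threshold = min_frequency * sum(words_count.values())  # relative frequency
    return Counter({w, }.__class__()) if False else Counter(
        {w: c for w, c in words_count.items() if c >= threshold})

counts = Counter({"the": 8, "cat": 2})   # 10 words in total
print(remove_least_frequent_sketch(3, counts))     # absolute threshold: 3
print(remove_least_frequent_sketch(0.25, counts))  # relative threshold: 0.25 * 10 = 2.5
```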
mangoes.corpus.remove_most_frequent(max_frequency, words_count)

Filter to apply to a counter to only keep the elements with a low enough frequency.

Parameters
max_frequency: positive int or float

If >= 1, it is interpreted as a ‘count’ value (a positive integer); otherwise, it is interpreted as a frequency.

words_count: collections.Counter

The counter to filter

Returns
collections.Counter

Examples

>>> vocabulary = mangoes.Vocabulary(corpus,
...                                 filters=[mangoes.vocabulary.remove_most_frequent(max_frequency)])
mangoes.corpus.remove_elements(stopwords, words_count=None, attribute=None)

Filter to apply to a counter to remove the elements of the ‘stopwords’ set-like object.

Parameters
stopwords: list or set or string

Collection of words to remove from the words_count (e.g. nltk.corpus.stopwords.words('english') or string.punctuation)

attribute: str or tuple, optional

If the keys in words_count are annotated tokens, attribute to consider

words_count: collections.Counter

The counter to filter

Returns
collections.Counter
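
A plain-Python sketch of the stopword-removal behaviour (ignoring the optional attribute handling for annotated tokens):

```python
import string
from collections import Counter

def remove_elements_sketch(stopwords, words_count):
    """Drop every key of `words_count` found in the stopwords collection."""
    stopwords = set(stopwords)
    return Counter({w: c for w, c in words_count.items() if w not in stopwords})

counts = Counter({"the": 5, "cat": 3, ",": 2, ".": 1})
print(remove_elements_sketch({"the"}, counts))             # drop a stopword
print(remove_elements_sketch(string.punctuation, counts))  # drop punctuation marks
```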
mangoes.corpus.filter_by_attribute(attribute, value, words_count=None)

Filter to apply to a counter to only keep certain tokens, based on the value of an attribute.

This filter can only be applied to an annotated Corpus

Parameters
attribute: str

If the keys in words_count are annotated tokens, attribute to consider

value: string or set of strings

Value or set of values to keep for the attribute

words_count: collections.Counter

The counter to filter

Returns
collections.Counter
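
A plain-Python sketch of the selection behaviour, using a namedtuple token as in the Token class above (not the actual mangoes implementation):

```python
from collections import Counter, namedtuple

Token = namedtuple("Token", ("form", "lemma", "POS"))

def filter_by_attribute_sketch(attribute, value, words_count):
    """Keep the tokens whose `attribute` is (or is among) `value`."""
    values = {value} if isinstance(value, str) else set(value)
    return Counter({token: count for token, count in words_count.items()
                    if getattr(token, attribute) in values})

counts = Counter({Token("can", "can", "NOUN"): 5,
                  Token("can", "can", "VBZ"): 3})
# Only the NOUN token survives the filter.
print(filter_by_attribute_sketch("POS", "NOUN", counts))
```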
mangoes.corpus.filter_attributes(attributes, words_count=None)

Filter to apply to a counter to only keep certain attributes of the tokens.

This filter can only be applied to an annotated Corpus

Parameters
attributes: str or tuple of str

If the keys in words_count are annotated tokens, attributes to keep

words_count: collections.Counter

The counter to filter

Returns
collections.Counter

Examples

>>> import collections
>>> Token = collections.namedtuple('Token', ('form', 'lemma', 'POS'))
>>> words_count = {Token('can', 'can', 'NOUN'): 5, Token('cans', 'can', 'NOUN'): 2, Token('can', 'can', 'VBZ'): 3}
>>> filter_attributes('lemma', words_count)
Counter({'can': 10})
>>> filter_attributes(('lemma', 'POS'), words_count)
Counter({Token(lemma='can', POS='NOUN'): 7, Token(lemma='can', POS='VBZ'): 3})