mangoes.corpus module
Classes and functions to manage the documents used as a corpus.
class mangoes.corpus.Token(form, lemma, POS)
    Bases: tuple

    Methods

    count(value, /)
        Return the number of occurrences of value.
    index(value[, start, stop])
        Return the first index of value.
    property POS
        Alias for field number 2

    property form
        Alias for field number 0

    property lemma
        Alias for field number 1
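The Token type above behaves like a plain namedtuple over the fields form, lemma and POS. A minimal stdlib sketch of its documented behavior (a stand-in for illustration, not the mangoes class itself, which may add more behavior):

```python
from collections import namedtuple

# Stand-in with the same fields as mangoes.corpus.Token:
# form (field 0), lemma (field 1) and POS (field 2).
Token = namedtuple("Token", ("form", "lemma", "POS"))

token = Token(form="cats", lemma="cat", POS="NOUN")
print(token.form)           # 'cats'  (alias for field number 0)
print(token.lemma)          # 'cat'   (alias for field number 1)
print(token.POS)            # 'NOUN'  (alias for field number 2)
print(token.count("cat"))   # 1, inherited from tuple
print(token.index("cat"))   # 1, inherited from tuple
```

Because it derives from tuple, a Token is immutable and hashable, which is what lets it serve as a key in the words_count counters used throughout this module.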
class mangoes.corpus.Corpus(content, name=None, language=None, reader=<class 'mangoes.utils.reader.TextSentenceGenerator'>, lower=False, digit=False, ignore_punctuation=False, nb_sentences=None, lazy=False)
    Bases: object

    Class to access the source used as a corpus.

    The Corpus class creates a sentence generator from documents.
    Parameters

    content : str or iterable
        An iterable of sentences, or a path to a file or a repository.
    name : str, optional
        A name for your corpus. If no name is given and content is a path, the name will be this path.
    language : str, optional
        Language of the corpus.
    reader : class, optional
        A class deriving from mangoes.utils.reader.SentenceGenerator. Some shortcuts are defined in this module: TEXT (default), BROWN, XML and CONLL.
    lower : bool, optional
        If True, converts sentences to lower case. Default: False.
    digit : bool, optional
        If True, replaces numeric values with the value of DIGIT_TOKEN in sentences. Default: False.
    ignore_punctuation : bool, optional
        If True, punctuation is ignored when reading the corpus. Default: False.
    nb_sentences : int, optional
        Expected number of sentences in the Corpus, if known. This number is only used to improve the output of the progress bar; the real value is computed at initialization.
    lazy : bool, optional
        If False (default), count words and sentences when creating the Corpus (which can take a while); if True, only count words and sentences when needed.
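The lower and digit options describe a per-token normalization applied while reading sentences. The stdlib sketch below shows the intended effect; the DIGIT_TOKEN value, the whitespace tokenization, and the normalize helper are illustrative assumptions, not mangoes' actual implementation:

```python
import collections
import re

DIGIT_TOKEN = "0"  # illustrative placeholder; mangoes defines its own value

def normalize(sentence, lower=False, digit=False):
    """Apply the documented lower/digit options to a whitespace-tokenized sentence."""
    tokens = sentence.split()
    if lower:
        tokens = [t.lower() for t in tokens]
    if digit:
        tokens = [DIGIT_TOKEN if re.fullmatch(r"\d+", t) else t for t in tokens]
    return tokens

# Counting words over an iterable of sentences, as a non-lazy Corpus would:
sentences = ["The cat sat", "The 2 cats sat"]
words_count = collections.Counter()
for sentence in sentences:
    words_count.update(normalize(sentence, lower=True, digit=True))
print(words_count)  # word counts with lower + digit normalization applied
```

With lazy=True, the same counting would simply be deferred until words_count is first needed.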
    Attributes

    words_count : collections.Counter
        Occurrences of each word in the corpus.
    nb_sentences : int or None
        Number of sentences in this corpus, or None if unknown.
    size : int
        Number of words in the Corpus.
    annotated : bool
        Whether or not the corpus is annotated.
    Methods

    create_vocabulary([attributes, filters])
        Create a vocabulary from the corpus.
    describe()
        Print properties of this corpus.
    load_from_metadata(file_path)
        Create a Corpus instance from previously saved metadata.
    peek([size])
        Print the first sentences of the corpus.
    save_metadata(file_path)
        Save metadata of this Corpus in a pickle file.
property nb_sentences
    Number of sentences in this corpus, or None if unknown.

    Returns
        int or None

    Notes
        If the Corpus was created with lazy=True, this value is only evaluated when words_count is called. The value may also be set manually, if it is known.
property name
    Name of the corpus.

property language
    Language of the corpus.

property lower
    If True, sentences are converted to lower case.

property digit
    If True, numeric values are replaced with the value of DIGIT_TOKEN in sentences.

property ignore_punctuation
    If True, punctuation is ignored when reading the corpus.

property params
    Parameters of the corpus.
property words_count
    Occurrences of each word in the corpus.

    Returns
        collections.Counter
            A Counter with words as keys and numbers of occurrences as values.

property bigrams_count
    Occurrences of each bigram in the corpus.

    Returns
        collections.Counter
            A Counter with bigrams as keys and numbers of occurrences as values.

property size
    Number of words in the Corpus.

    Returns
        int
describe()
    Print properties of this corpus.

peek(size=5)
    Print the first sentences of the corpus.

    Parameters

    size : int, optional
        Number of sentences to display (default: 5).
create_vocabulary(attributes=None, filters=None)
    Create a vocabulary from the corpus.

    Parameters

    attributes : str or tuple of str, optional
        If the Corpus is annotated, attribute(s) to get for each token. If None (default), all attributes are kept.
    filters : list of callables, optional
        A filter is a parametrized function that filters values from the Corpus' words_count. This module provides 6 filters: truncate, remove_least_frequent, remove_most_frequent, remove_elements, filter_by_attribute and filter_attributes.

    Returns
        mangoes.Vocabulary
            The words are sorted by frequency.

    Notes
        You can also write and use your own filters. A filter is a parametrized function that takes a collections.Counter as input and returns a collections.Counter. It should be decorated with mangoes.utils.decorators.counter_filter().
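As a sketch of what such a custom filter can look like, the example below parametrizes a plain Counter-to-Counter function with functools.partial; the keep_longer_than helper is hypothetical, and in real mangoes code the function should be decorated with mangoes.utils.decorators.counter_filter() instead:

```python
import collections
import functools

def keep_longer_than(min_length, words_count):
    """Keep only the words strictly longer than min_length characters."""
    return collections.Counter(
        {word: count for word, count in words_count.items() if len(word) > min_length})

# Parametrize the filter, then apply it to a Counter:
my_filter = functools.partial(keep_longer_than, 3)
print(my_filter(collections.Counter({"cat": 2, "tiger": 5})))  # Counter({'tiger': 5})
```

The parametrized callable can then be passed in the filters list of create_vocabulary.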
save_metadata(file_path)
    Save metadata of this Corpus in a pickle file.

    Saves the path to the corpus, words_count, number of sentences, … in a pickle file.

    Parameters

    file_path : str

static load_from_metadata(file_path)
    Create a Corpus instance from previously saved metadata.

    Parameters

    file_path : str
mangoes.corpus.truncate(max_nb, words_count)
    Filter to apply to a counter to keep at most 'max_nb' elements.

    Elements with higher counts are preferred over elements with lower counts. Elements with equal counts are selected arbitrarily during truncation, if necessary.

    Parameters

    max_nb : positive int
        Maximal number of elements to keep.
    words_count : collections.Counter
        The counter to filter.

    Returns
        collections.Counter
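A stdlib sketch of the documented truncate semantics (an illustrative reimplementation, not mangoes' code, which is parametrized via the counter_filter decorator):

```python
import collections

def truncate(max_nb, words_count):
    """Keep at most max_nb elements, preferring higher counts; ties are broken arbitrarily."""
    return collections.Counter(dict(words_count.most_common(max_nb)))

counts = collections.Counter({"the": 10, "cat": 4, "sat": 2, "mat": 1})
print(truncate(2, counts))  # Counter({'the': 10, 'cat': 4})
```

Counter.most_common already returns elements ordered by decreasing count, which is exactly the preference described above.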
mangoes.corpus.remove_least_frequent(min_frequency, words_count)
    Filter to apply to a counter to keep only the elements with a high enough frequency.

    Parameters

    min_frequency : positive int or float
        If >= 1, interpreted as a 'count' value (a positive integer); otherwise, interpreted as a frequency.
    words_count : collections.Counter
        The counter to filter.

    Returns
        collections.Counter

    Examples

    >>> vocabulary = mangoes.Vocabulary(corpus,
    ...                                 filters=[mangoes.vocabulary.remove_least_frequent(min_frequency)])
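The count-vs-frequency interpretation of min_frequency can be sketched as follows; this is an illustrative stdlib reimplementation (the inclusive threshold is an assumption), not mangoes' code:

```python
import collections

def remove_least_frequent(min_frequency, words_count):
    """Keep elements whose count (if min_frequency >= 1) or relative
    frequency (if min_frequency < 1) reaches the threshold."""
    if min_frequency >= 1:
        threshold = min_frequency
    else:
        threshold = min_frequency * sum(words_count.values())
    return collections.Counter(
        {word: count for word, count in words_count.items() if count >= threshold})

counts = collections.Counter({"the": 8, "cat": 2})
print(remove_least_frequent(3, counts))    # count threshold: 'the' survives
print(remove_least_frequent(0.5, counts))  # frequency threshold: 0.5 * 10 = 5, 'the' survives
```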
mangoes.corpus.remove_most_frequent(max_frequency, words_count)
    Filter to apply to a counter to keep only the elements with a low enough frequency.

    Parameters

    max_frequency : positive int or float
        If >= 1, interpreted as a 'count' value (a positive integer); otherwise, interpreted as a frequency.
    words_count : collections.Counter
        The counter to filter.

    Returns
        collections.Counter

    Examples

    >>> vocabulary = mangoes.Vocabulary(corpus,
    ...                                 filters=[mangoes.vocabulary.remove_most_frequent(max_frequency)])
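This filter mirrors remove_least_frequent with an upper bound; a stdlib sketch under the same assumptions (inclusive threshold, illustrative reimplementation):

```python
import collections

def remove_most_frequent(max_frequency, words_count):
    """Keep elements whose count (if max_frequency >= 1) or relative
    frequency (if max_frequency < 1) stays within the threshold."""
    if max_frequency >= 1:
        threshold = max_frequency
    else:
        threshold = max_frequency * sum(words_count.values())
    return collections.Counter(
        {word: count for word, count in words_count.items() if count <= threshold})

counts = collections.Counter({"the": 8, "cat": 2})
print(remove_most_frequent(3, counts))    # count threshold: 'cat' survives
print(remove_most_frequent(0.5, counts))  # frequency threshold: 0.5 * 10 = 5, 'cat' survives
```

A typical use is discarding very common function words before building a vocabulary.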
mangoes.corpus.remove_elements(stopwords, words_count=None, attribute=None)
    Filter to apply to a counter to remove the elements found in the 'stopwords' set-like object.

    Parameters

    stopwords : list or set or str
        Collection of words to remove from the words_count (e.g. nltk.corpus.stopwords.words("english") or string.punctuation).
    words_count : collections.Counter
        The counter to filter.
    attribute : str or tuple, optional
        If the keys in words_count are annotated tokens, the attribute to consider.

    Returns
        collections.Counter
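A stdlib sketch of the basic semantics, combining a stopword list with string.punctuation as suggested above (the attribute handling for annotated tokens is omitted, and this is an illustrative reimplementation, not mangoes' code):

```python
import collections
import string

def remove_elements(stopwords, words_count):
    """Drop every key found in the stopwords collection."""
    stopwords = set(stopwords)
    return collections.Counter(
        {word: count for word, count in words_count.items() if word not in stopwords})

counts = collections.Counter({"cat": 3, "the": 9, ".": 4})
print(remove_elements({"the"} | set(string.punctuation), counts))  # Counter({'cat': 3})
```

Passing a str as stopwords works because iterating a string yields its characters, which is what makes string.punctuation usable directly.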
mangoes.corpus.filter_by_attribute(attribute, value, words_count=None)
    Filter to apply to a counter to keep only certain tokens, based on the value of an attribute.

    This filter can only be applied to an annotated Corpus.

    Parameters

    attribute : str
        If the keys in words_count are annotated tokens, the attribute to consider.
    value : str or set of str
        Value(s) to keep for the attribute.
    words_count : collections.Counter
        The counter to filter.

    Returns
        collections.Counter
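A stdlib sketch of the documented behavior over namedtuple tokens (an illustrative reimplementation, not mangoes' code):

```python
import collections

Token = collections.namedtuple("Token", ("form", "lemma", "POS"))

def filter_by_attribute(attribute, value, words_count):
    """Keep only tokens whose given attribute takes one of the accepted values."""
    values = {value} if isinstance(value, str) else set(value)
    return collections.Counter(
        {token: count for token, count in words_count.items()
         if getattr(token, attribute) in values})

counts = collections.Counter({Token("can", "can", "NOUN"): 5,
                              Token("can", "can", "VBZ"): 3})
print(filter_by_attribute("POS", "NOUN", counts))  # keeps only the NOUN token
```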
mangoes.corpus.filter_attributes(attributes, words_count=None)
    Filter to apply to a counter to keep only certain attributes of the tokens.

    This filter can only be applied to an annotated Corpus.

    Parameters

    attributes : str or tuple of str
        If the keys in words_count are annotated tokens, the attributes to keep.
    words_count : collections.Counter
        The counter to filter.

    Returns
        collections.Counter

    Examples

    >>> import collections
    >>> Token = collections.namedtuple('Token', ('form', 'lemma', 'POS'))
    >>> words_count = {Token('can', 'can', 'NOUN'): 5,
    ...                Token('cans', 'can', 'NOUN'): 2,
    ...                Token('can', 'can', 'VBZ'): 3}
    >>> filter_attributes('lemma', words_count)
    Counter({'can': 10})
    >>> filter_attributes(('lemma', 'POS'), words_count)
    Counter({Token(lemma='can', POS='NOUN'): 7, Token(lemma='can', POS='VBZ'): 3})