mangoes.utils.reader module

class mangoes.utils.reader.SentenceGenerator(source, lower=False, digit=False, ignore_punctuation=False)

Bases: object

Base class for sentence generators

A sentence generator yields sentences from a source, which can be an iterable or a set of files.

Parameters
source: a string or an iterable

An iterable of sentences, or a path to a file or a directory

lower: boolean, optional

If True, converts sentences to lower case. Default: False

digit: boolean, optional

If True, replaces numeric values with DIGIT_TOKEN in sentences. Default: False

ignore_punctuation: boolean, optional

If True, punctuation is ignored when reading the corpus. Default: False

Warning

This class should not be used directly. Use derived classes instead.

See also

TextSentenceGenerator
BrownSentenceGenerator
XmlSentenceGenerator
ConllSentenceGenerator

Methods

sentences()

Yields sentences from the source

abstract sentences()

Yields sentences from the source

Yields
list of str
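
The contract described above (an abstract sentences() method that derived classes implement, with the lower, digit and ignore_punctuation options applied to each sentence) can be sketched with the standard library alone. This is a minimal illustration, not the mangoes implementation: the class names SketchSentenceGenerator and SketchTextGenerator are hypothetical, the value of DIGIT_TOKEN is a placeholder, and the whitespace tokenization and punctuation test are assumptions.

```python
import re
from abc import ABC, abstractmethod

DIGIT_TOKEN = "0"  # assumption: placeholder value, not taken from mangoes


class SketchSentenceGenerator(ABC):
    """Hypothetical stand-in mirroring the documented constructor signature."""

    def __init__(self, source, lower=False, digit=False, ignore_punctuation=False):
        self.source = source
        self.lower = lower
        self.digit = digit
        self.ignore_punctuation = ignore_punctuation

    @abstractmethod
    def sentences(self):
        """Yield each sentence as a list of str."""


class SketchTextGenerator(SketchSentenceGenerator):
    """Hypothetical derived class reading an iterable of raw strings."""

    def sentences(self):
        for line in self.source:
            words = line.split()  # assumption: whitespace tokenization
            if self.ignore_punctuation:
                # Drop tokens that contain no letter or digit.
                words = [w for w in words if any(c.isalnum() for c in w)]
            if self.lower:
                words = [w.lower() for w in words]
            if self.digit:
                # Replace purely numeric tokens with the placeholder.
                words = [DIGIT_TOKEN if re.fullmatch(r"\d+([.,]\d+)*", w) else w
                         for w in words]
            yield words


corpus = ["The cat has 2 kittens .", "Mangoes are sweet !"]
generator = SketchTextGenerator(corpus, lower=True, digit=True, ignore_punctuation=True)
result = list(generator.sentences())
```

Because sentences() is declared abstract on the base class, trying to instantiate SketchSentenceGenerator directly raises TypeError, which matches the warning that the base class should not be used directly.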
class mangoes.utils.reader.TextSentenceGenerator(source, lower=False, digit=False, ignore_punctuation=False)

Bases: mangoes.utils.reader.SentenceGenerator

Sentence generator for a simple text source

Methods

sentences()

Yields sentences from the source

sentences()

Yields sentences from the source

Yields
list of str
class mangoes.utils.reader.AnnotatedSentenceGenerator(source, lower=False, digit=False, ignore_punctuation=True)

Bases: mangoes.utils.reader.SentenceGenerator

Base class for sentence generators from an annotated source

A sentence generator yields sentences from a source, which can be an iterable or a set of files.

Warning

This class should not be used directly. Use derived classes instead.

Methods

Token(form, lemma, POS)

sentences()

Yields sentences from the source

FIELDS = ('form', 'lemma', 'POS')
NUM_TAG = 'NUM'
PUNCTUATION_TAG = 'PUNCT'
class Token(form, lemma, POS)

Bases: mangoes.utils.reader.Token

Attributes
POS

Alias for field number 2

form

Alias for field number 0

lemma

Alias for field number 1

Methods

count(value, /)

Return number of occurrences of value.

index(value[, start, stop])

Return first index of value.

lower

replace

lower()
replace(value)
abstract sentences()

Yields sentences from the source

Yields
list of str
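
The "Alias for field number N" entries of Token follow the usual named-tuple pattern: each field name is a shortcut for a tuple position. The sketch below reproduces that pattern with collections.namedtuple, using the FIELDS, NUM_TAG and PUNCTUATION_TAG values listed above. The Token defined here is a stand-in (the documented class derives from mangoes.utils.reader.Token and also provides lower() and replace()), the sample tags are invented, and the punctuation filter is only one possible use of the constant.

```python
from collections import namedtuple

FIELDS = ("form", "lemma", "POS")
NUM_TAG = "NUM"
PUNCTUATION_TAG = "PUNCT"

# Stand-in for the documented Token class, built from the same field names.
Token = namedtuple("Token", FIELDS)

sentence = [
    Token("Cats", "cat", "NOUN"),
    Token("eat", "eat", "VERB"),
    Token("3", "3", NUM_TAG),
    Token("mice", "mouse", "NOUN"),
    Token(".", ".", PUNCTUATION_TAG),
]

first = sentence[0]
# Field names are aliases for positions 0, 1 and 2.
aliases_match = (first.form == first[0], first.lemma == first[1], first.POS == first[2])

# count() and index() are the ordinary tuple methods.
lemma_position = first.index("cat")
noun_count = first.count("NOUN")

# Assumption: one way to skip punctuation is to filter on the POS tag.
kept_forms = [t.form for t in sentence if t.POS != PUNCTUATION_TAG]
```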
class mangoes.utils.reader.BrownSentenceGenerator(source, lower=False, digit=False, ignore_punctuation=True)

Bases: mangoes.utils.reader.AnnotatedSentenceGenerator

Sentence generator for a text source annotated in Brown format

Methods

Token(form, lemma, POS)

sentences()

Yields sentences from the source

sentences()

Yields sentences from the source

Yields
list of str
class mangoes.utils.reader.XmlSentenceGenerator(source, lower=False, digit=False, ignore_punctuation=False)

Bases: mangoes.utils.reader.AnnotatedSentenceGenerator

Sentence generator for an XML source

Methods

Token(id, form, lemma, POS, features, head, …)

sentences()

Yields sentences from the source

FIELDS = ('id', 'form', 'lemma', 'POS', 'features', 'head', 'dependency_relation')
class Token(id, form, lemma, POS, features, head, dependency_relation)

Bases: mangoes.utils.reader.Token

Attributes
POS

Alias for field number 3

dependency_relation

Alias for field number 6

features

Alias for field number 4

form

Alias for field number 1

head

Alias for field number 5

id

Alias for field number 0

lemma

Alias for field number 2

Methods

count(value, /)

Return number of occurrences of value.

index(value[, start, stop])

Return first index of value.

lower

replace

lower()
replace(value)
class mangoes.utils.reader.ConllSentenceGenerator(source, lower=False, digit=False, ignore_punctuation=True)

Bases: mangoes.utils.reader.AnnotatedSentenceGenerator

Sentence generator for a source annotated in CoNLL format

Methods

Token(id, form, lemma, POS, NER, head, …)

sentences()

Yields sentences from the source

FIELDS = ('id', 'form', 'lemma', 'POS', 'NER', 'head', 'dependency_relation')
class Token(id, form, lemma, POS, NER, head, dependency_relation)

Bases: mangoes.utils.reader.Token

Attributes
NER

Alias for field number 4

POS

Alias for field number 3

dependency_relation

Alias for field number 6

form

Alias for field number 1

head

Alias for field number 5

id

Alias for field number 0

lemma

Alias for field number 2

Methods

count(value, /)

Return number of occurrences of value.

index(value[, start, stop])

Return first index of value.

lower

replace

lower()
replace(value)
sentences()

Yields sentences from the source

Yields
list of str
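
To illustrate how the seven FIELDS above could map onto an annotated line, here is a minimal sketch that groups token lines into sentences. It rests on assumptions not stated in this reference: that columns are tab-separated and appear in FIELDS order, and that a blank line ends a sentence. The function read_sentences and the sample data are hypothetical, and Token is a plain namedtuple stand-in.

```python
from collections import namedtuple

FIELDS = ("id", "form", "lemma", "POS", "NER", "head", "dependency_relation")
Token = namedtuple("Token", FIELDS)  # stand-in for the documented Token


def read_sentences(lines):
    """Group tab-separated token lines into sentences; a blank line ends one."""
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:
                yield current
                current = []
            continue
        # Raises TypeError if the line does not have exactly seven columns.
        current.append(Token(*line.split("\t")))
    if current:
        yield current


sample = [
    "1\tParis\tParis\tNNP\tLOCATION\t2\tnsubj\n",
    "2\tshines\tshine\tVBZ\tO\t0\troot\n",
    "\n",
    "1\tHello\thello\tUH\tO\t0\troot\n",
]
parsed = list(read_sentences(sample))
```

All values stay strings in this sketch; converting id and head to integers is left out on purpose, since the reference does not say how the library treats them.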
class mangoes.utils.reader.ConllUSentenceGenerator(source, lower=False, digit=False, ignore_punctuation=True)

Bases: mangoes.utils.reader.ConllSentenceGenerator

Methods

Token(id, form, lemma, POS, xpostag, feats, …)

sentences()

Yields sentences from the source

FIELDS = ('id', 'form', 'lemma', 'POS', 'xpostag', 'feats', 'head', 'dependency_relation', 'deps', 'misc')
class Token(id, form, lemma, POS, xpostag, feats, head, dependency_relation, deps, misc)

Bases: mangoes.utils.reader.Token

Attributes
POS

Alias for field number 3

dependency_relation

Alias for field number 7

deps

Alias for field number 8

feats

Alias for field number 5

form

Alias for field number 1

head

Alias for field number 6

id

Alias for field number 0

lemma

Alias for field number 2

misc

Alias for field number 9

xpostag

Alias for field number 4

Methods

count(value, /)

Return number of occurrences of value.

index(value[, start, stop])

Return first index of value.

lower

replace

lower()
replace(value)
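
The ten FIELDS of this Token line up with the ten tab-separated columns of a CoNLL-U token line. The sketch below shows that mapping on a small invented sample, skipping comment lines (which start with "#") and decoding the feats column. The helpers parse_token_lines and parse_feats are hypothetical and are not part of mangoes; Token is a plain namedtuple stand-in.

```python
from collections import namedtuple

FIELDS = ("id", "form", "lemma", "POS", "xpostag", "feats", "head",
          "dependency_relation", "deps", "misc")
Token = namedtuple("Token", FIELDS)  # stand-in for the documented Token


def parse_feats(feats):
    """Turn 'Mood=Ind|Tense=Pres' into a dict; '_' means no value."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_token_lines(lines):
    """Yield a Token per token line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        yield Token(*line.split("\t"))


sample = [
    "# text = Dogs bark",
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\tVBP\tMood=Ind|Tense=Pres\t0\troot\t_\tSpaceAfter=No",
]
tokens = list(parse_token_lines(sample))
verb_feats = parse_feats(tokens[1].feats)
```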