mangoes.modeling.finetuning module

This module provides an interface to the transformers library's pretrained models for fine-tuning. For each fine-tuning class, any eligible model may be fine-tuned; see the links in each class's documentation for the list of eligible models.

class mangoes.modeling.finetuning.TransformerForFeatureExtraction(pretrained_model, pretrained_tokenizer, device=None, **keyword_args)

Bases: mangoes.modeling.transformer_base.PipelineMixin, mangoes.modeling.transformer_base.TransformerModel

Class for using a pretrained transformer model and tokenizer. Because it uses AutoModel (and not, for example, AutoModelForQuestionAnswering), it can only be used with base pretrained models, not for fine-tuning tasks.

Parameters
pretrained_model: str, transformers.PreTrainedModel subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated transformer model, either a transformers.AutoModel compatible class or a mangoes enhanced language model.

pretrained_tokenizer: str, transformers.PreTrainedTokenizerBase subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.

device: int, optional, defaults to None

If -1, use the CPU; if >= 0, use that CUDA device number. If None, a GPU will be used if available.

**keyword_args include arguments passed to transformers.AutoConfig
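For orientation, a minimal usage sketch is shown below. The bert-base-uncased shortcut name is only the example used in the parameter description above; any eligible model and tokenizer can be substituted.

    from mangoes.modeling.finetuning import TransformerForFeatureExtraction

    # Load a base pretrained model and matching tokenizer (sketch, not a definitive recipe).
    model = TransformerForFeatureExtraction("bert-base-uncased", "bert-base-uncased")

    # predict() runs the feature extraction pipeline and returns hidden states as nested lists of floats.
    features = model.predict(["The sky is blue.", "Mangoes are sweet."])
    print(len(features))  # one entry per input sequence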

Methods

generate_outputs(text[, pre_tokenized, …])

Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.

predict(inputs, **kwargs)

Run input text through the feature extraction pipeline, extracting the hidden states of each layer.

save(output_directory[, save_tokenizer])

Method to save transformers model and optionally save tokenizer.

train([output_dir, train_dataset, …])

This function does nothing; use a task-specific class to pre-train or fine-tune.

predict(inputs, **kwargs)

Run input text through the feature extraction pipeline, extracting the hidden states of each layer.

Parameters
inputs: str or list of strs

inputs to extract features from

Returns
nested list of float, hidden states.
train(output_dir=None, train_dataset=None, eval_dataset=None, collator=None, trainer=None, **training_args)

This function does nothing; use a task-specific class to pre-train or fine-tune.

generate_outputs(text, pre_tokenized=False, output_attentions=False, output_hidden_states=False, word_embeddings=False, **tokenizer_inputs)

Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.

Parameters
text: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]

the text to compute features for.

pre_tokenized: Boolean

whether or not the input text is pre-tokenized (ie, split on spaces)

output_attentions: Boolean, optional, defaults to False

Whether or not to return the attentions tensors of all attention layers.

output_hidden_states: Boolean, optional, defaults to False

Whether or not to return the hidden states of all layers.

word_embeddings: Boolean

whether or not to filter special token embeddings and average sub-word embeddings (hidden states) into word embeddings. If the inputs are pre-tokenized, the sub-word embeddings are averaged into the tokens passed as inputs. If pre_tokenized=False, the text is split on whitespace and the sub-word embeddings are averaged back into the words produced by that split. Only used if output_hidden_states=True. If False, the number of output embeddings can be greater than (number of words + special tokens). If True, the number of output embeddings equals the number of words: sub-word embeddings are averaged together to create word-level embeddings and special token embeddings are excluded.

tokenizer_inputs: additional arguments passed to the tokenizer, such as presaved entity annotations for enhanced models.

Returns
Dict containing (note that if a single text sequence is passed as input, the batch size will be 1):
hidden_states: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)

Hidden-states of the model at the output of each layer plus the initial embedding outputs. Only returned if output_hidden_states is True. If word_embeddings, the sequence length will be the number of words in the longest sentence, i.e. the maximum number of words; shorter sequences are padded with zeros.

attentions: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, num_heads, sequence_length, sequence_length)

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True.

offset_mappings: Tensor of shape (batch_size, sequence_length, 2)

Tensor containing (char_start, char_end) for each token, giving the index into the input strings of the start and end character of each token. If the input is pre-tokenized, the start and end indices map into the associated word. Note that special tokens are included with 0 for start and end indices, as they do not map into the input text (they are added inside the function). This output is only available for tokenizers that inherit from transformers.PreTrainedTokenizerFast; this includes most common tokenizers, but not all tokenizers in the library. If the tokenizer does not inherit from this class, this output value will be None.

if TransformerForFeatureExtraction:
last_hidden_state: torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the model.

pooler_output: torch.FloatTensor of shape (batch_size, hidden_size)

Last layer hidden-state of the first token of the sequence (classification token), further processed by a Linear layer and a Tanh activation function.

if TransformerForSequenceClassification:
logits: torch.FloatTensor of shape (batch_size, config.num_labels)

Classification scores (before softmax).

if TransformerForTokenClassification:
logits: torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)

Classification scores (before softmax).
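As an illustration of the return values described above, a hedged sketch of requesting word-level embeddings from generate_outputs (the keys and shapes follow the description above; the model name is only an example):

    from mangoes.modeling.finetuning import TransformerForFeatureExtraction

    model = TransformerForFeatureExtraction("bert-base-uncased", "bert-base-uncased")
    outputs = model.generate_outputs(
        "Mangoes are a tropical fruit.",
        output_hidden_states=True,
        word_embeddings=True,
    )
    # One tensor per layer; with word_embeddings=True the sequence dimension is the number of words.
    last_layer = outputs["hidden_states"][-1]
    offsets = outputs["offset_mappings"]  # None if the tokenizer is not a fast tokenizer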

save(output_directory, save_tokenizer=False)

Method to save the transformers model and optionally the tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method will save it again alongside the model if needed. Both the tokenizer files and the model files are written to the output directory. The output directory can then be passed as the model and tokenizer arguments to the load() method of the inheriting classes.

Parameters
output_directory: str

path to directory to save model

save_tokenizer: Boolean

whether to save tokenizer in directory or not, defaults to False

class mangoes.modeling.finetuning.TransformerForSequenceClassification(pretrained_model, pretrained_tokenizer, labels=None, label2id=None, device=None, **keyword_args)

Bases: mangoes.modeling.transformer_base.PipelineMixin, mangoes.modeling.transformer_base.TransformerModel

Transformers model with Sequence classification head (linear layer on top of pooled output). For a list of eligible model classes, see: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification.from_config.config

Parameters
pretrained_model: str, transformers.PreTrainedModel subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated transformer model, either a transformers.AutoModelForSequenceClassification compatible class or a mangoes enhanced language model.

pretrained_tokenizer: str, transformers.PreTrainedTokenizerBase subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.

labels: list of str

List of class names. An error is raised if both labels and label2id are None.

label2id: dict of str -> int

dict mapping each class name to an index of the output layer. If None, it will be created from labels.

device: int, optional, defaults to None

If -1, use the CPU; if >= 0, use that CUDA device number. If None, a GPU will be used if available.

**keyword_args include arguments passed to transformers.AutoConfig
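A minimal construction sketch, assuming a hypothetical two-class sentiment task (the label names are illustrative only):

    from mangoes.modeling.finetuning import TransformerForSequenceClassification

    # labels defines the size of the classification head; label2id is derived from it when omitted.
    classifier = TransformerForSequenceClassification(
        "bert-base-uncased",
        "bert-base-uncased",
        labels=["negative", "positive"],
    )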

Methods

generate_outputs(text[, pre_tokenized, …])

Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.

predict(inputs[, return_all_scores])

Predicts classes for input texts.

save(output_directory[, save_tokenizer])

Method to save transformers model and optionally save tokenizer.

train([output_dir, train_text, …])

Fine tune a transformer model on a text classification dataset

predict(inputs, return_all_scores=False)

Predicts classes for input texts.

Parameters
inputs: str or list of strs

inputs to classify

return_all_scores: Boolean (default=False)

Whether to return all scores or just the predicted class score

Returns
list of dict, or list of list of dict if return_all_scores=True

If one sequence is passed as input, a list with one element (either a dict, or a list of dicts if return_all_scores=True) will be returned.

For each input, a dict containing:

label (str) – The predicted label.
score (float) – The corresponding probability.

If return_all_scores=True, a dict is returned for each class, for each input.
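A hedged sketch of predict for the hypothetical sentiment classifier above (with a head that has not been fine-tuned, the scores are essentially arbitrary):

    from mangoes.modeling.finetuning import TransformerForSequenceClassification

    classifier = TransformerForSequenceClassification(
        "bert-base-uncased", "bert-base-uncased", labels=["negative", "positive"]
    )
    predictions = classifier.predict(["I loved this film.", "The plot made no sense."])
    for prediction in predictions:  # one dict per input when return_all_scores=False
        print(prediction["label"], prediction["score"])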
train(output_dir=None, train_text=None, train_targets=None, eval_text=None, eval_targets=None, max_len=None, freeze_base=False, task_learn_rate=None, collator=None, train_dataset=None, eval_dataset=None, compute_metrics=None, trainer=None, **training_args)

Fine tune a transformer model on a text classification dataset

Parameters
output_dir: str

Path to the output directory where the model predictions and checkpoints will be written. Used to instantiate Trainer if trainer argument is None.

train_text: List[str]

list of training texts

train_targets: List[int] or List [str]

corresponding list of classes for each training text. If strings, will use label2id to convert to output indices, else will assume already converted.

eval_text: (Optional) str or List[str]

list of evaluation texts

eval_targets: List[int]

corresponding list of classes for each evaluation text

max_len: int

max length of input sequence. Will default to self.tokenizer.max_length() if None

freeze_base: Boolean

Whether to freeze the weights of the base model, so training only changes the task head weights. If true, the requires_grad flag for parameters of the base model will be set to false before training.

task_learn_rate: float

Learning rate to be used for the task-specific parameters; the base parameters will use the normal learning rate already defined in **training_args. If None, all parameters will use the same normal learning rate.

collator: Transformers.DataCollator

custom collator to use

train_dataset, eval_dataset: torch.Dataset

instantiated custom dataset object

compute_metrics: function

The function that will be used to compute metrics at evaluation. Must return a dictionary mapping metric names (strings) to metric values. Used by the trainer; see https://huggingface.co/transformers/training.html#trainer for more info.

trainer: Transformers.Trainer

custom instantiated trainer to use

training_args:

keyword arguments for training. For complete list, see https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
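A hedged end-to-end training sketch; the texts, targets, and output directory are illustrative, and num_train_epochs is one of the transformers.TrainingArguments keywords that this method forwards:

    from mangoes.modeling.finetuning import TransformerForSequenceClassification

    classifier = TransformerForSequenceClassification(
        "bert-base-uncased", "bert-base-uncased", labels=["negative", "positive"]
    )
    # Integer class indices are used here; label strings are also accepted for train_targets.
    classifier.train(
        output_dir="./seq_clf_output",
        train_text=["I loved this film.", "The plot made no sense."],
        train_targets=[1, 0],
        eval_text=["A delightful surprise."],
        eval_targets=[1],
        freeze_base=True,       # only the task head weights are updated
        num_train_epochs=1,     # forwarded to transformers.TrainingArguments
    )
    classifier.save("./seq_clf_output", save_tokenizer=True)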

generate_outputs(text, pre_tokenized=False, output_attentions=False, output_hidden_states=False, word_embeddings=False, **tokenizer_inputs)

Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.

Parameters
text: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]

the text to compute features for.

pre_tokenized: Boolean

whether or not the input text is pre-tokenized (ie, split on spaces)

output_attentions: Boolean, optional, defaults to False

Whether or not to return the attentions tensors of all attention layers.

output_hidden_states: Boolean, optional, defaults to False

Whether or not to return the hidden states of all layers.

word_embeddings: Boolean

whether or not to filter special token embeddings and average sub-word embeddings (hidden states) into word embeddings. If the inputs are pre-tokenized, the sub-word embeddings are averaged into the tokens passed as inputs. If pre_tokenized=False, the text is split on whitespace and the sub-word embeddings are averaged back into the words produced by that split. Only used if output_hidden_states=True. If False, the number of output embeddings can be greater than (number of words + special tokens). If True, the number of output embeddings equals the number of words: sub-word embeddings are averaged together to create word-level embeddings and special token embeddings are excluded.

tokenizer_inputs: additional arguments passed to the tokenizer, such as presaved entity annotations for enhanced models.

Returns
Dict containing (note that if a single text sequence is passed as input, the batch size will be 1):
hidden_states: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)

Hidden-states of the model at the output of each layer plus the initial embedding outputs. Only returned if output_hidden_states is True. If word_embeddings, the sequence length will be the number of words in the longest sentence, i.e. the maximum number of words; shorter sequences are padded with zeros.

attentions: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, num_heads, sequence_length, sequence_length)

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True.

offset_mappings: Tensor of shape (batch_size, sequence_length, 2)

Tensor containing (char_start, char_end) for each token, giving the index into the input strings of the start and end character of each token. If the input is pre-tokenized, the start and end indices map into the associated word. Note that special tokens are included with 0 for start and end indices, as they do not map into the input text (they are added inside the function). This output is only available for tokenizers that inherit from transformers.PreTrainedTokenizerFast; this includes most common tokenizers, but not all tokenizers in the library. If the tokenizer does not inherit from this class, this output value will be None.

if TransformerForFeatureExtraction:
last_hidden_state: torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the model.

pooler_output: torch.FloatTensor of shape (batch_size, hidden_size)

Last layer hidden-state of the first token of the sequence (classification token), further processed by a Linear layer and a Tanh activation function.

if TransformerForSequenceClassification:
logits: torch.FloatTensor of shape (batch_size, config.num_labels)

Classification scores (before softmax).

if TransformerForTokenClassification:
logits: torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)

Classification scores (before softmax).

save(output_directory, save_tokenizer=False)

Method to save the transformers model and optionally the tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method will save it again alongside the model if needed. Both the tokenizer files and the model files are written to the output directory. The output directory can then be passed as the model and tokenizer arguments to the load() method of the inheriting classes.

Parameters
output_directory: str

path to directory to save model

save_tokenizer: Boolean

whether to save tokenizer in directory or not, defaults to False

class mangoes.modeling.finetuning.TransformerForTokenClassification(pretrained_model, pretrained_tokenizer, labels=None, label2id=None, device=None, **keyword_args)

Bases: mangoes.modeling.transformer_base.PipelineMixin, mangoes.modeling.transformer_base.TransformerModel

Transformer model with Token classification head. For a list of eligible model classes, see: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForTokenClassification.from_config.config

Parameters
pretrained_model: str, transformers.PreTrainedModel subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated transformer model, either a transformers.AutoModelForTokenClassification compatible class or a mangoes enhanced language model.

pretrained_tokenizer: str, transformers.PreTrainedTokenizerBase subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.

labels: list of str

List of class names. An error is raised if both labels and label2id are None.

label2id: dict of str -> int

dict mapping each class name to an index of the output layer. If None, it will be created from labels.

device: int, optional, defaults to None

If -1, use the CPU; if >= 0, use that CUDA device number. If None, a GPU will be used if available.

**keyword_args include arguments passed to transformers.AutoConfig

Methods

generate_outputs(text[, pre_tokenized, …])

Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.

predict(inputs)

Predicts classes for input texts

save(output_directory[, save_tokenizer])

Method to save transformers model and optionally save tokenizer.

train([output_dir, train_text, …])

Fine tune a transformers model on a token classification dataset

predict(inputs)

Predicts classes for input texts

Parameters
inputs: str or list of strs

inputs to classify

Returns
list of list of dict. For each sequence, a list of token prediction dicts. If a single sequence is passed as input, the output will be a list containing one list of dictionaries.

For each token, a dict containing:

word (str) – The token/word classified.
score (float) – The corresponding probability for the entity.
entity (str) – The entity predicted for that token/word.
index (int, only present when self.grouped_entities=False) – The index of the corresponding token in the sentence.
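A hedged sketch of token-level prediction, assuming a hypothetical BIO-style label set (a head that has not been fine-tuned will not give meaningful entities):

    from mangoes.modeling.finetuning import TransformerForTokenClassification

    tagger = TransformerForTokenClassification(
        "bert-base-uncased", "bert-base-uncased", labels=["O", "B-PER", "I-PER"]
    )
    predictions = tagger.predict("Alice met Bob in Paris.")
    for token_prediction in predictions[0]:  # one list of dicts per input sequence
        print(token_prediction["word"], token_prediction["entity"], token_prediction["score"])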

train(output_dir=None, train_text=None, train_targets=None, eval_text=None, eval_targets=None, max_len=None, freeze_base=False, task_learn_rate=None, collator=None, train_dataset=None, eval_dataset=None, compute_metrics=None, trainer=None, **training_args)

Fine tune a transformers model on a token classification dataset

Parameters
output_dir: str

Path to the output directory where the model predictions and checkpoints will be written. Used to instantiate Trainer if trainer argument is None.

train_text: List[str]

list of training texts

train_targets: List[List[int]]

corresponding list of classes for each token in each training text

eval_text: str or List[str]

list of evaluation texts

eval_targets: List[List[int]]

corresponding list of classes for each token in each evaluation text

max_len: int

max length of input sequence. Will default to self.tokenizer.max_length() if None

freeze_base: Boolean

Whether to freeze the weights of the base model, so training only changes the task head weights. If true, the requires_grad flag for parameters of the base model will be set to false before training.

task_learn_rate: float

Learning rate to be used for the task-specific parameters; the base parameters will use the normal learning rate already defined in **training_args. If None, all parameters will use the same normal learning rate.

collator: Transformers.DataCollator

custom collator to use. If none, will use transformers.DataCollatorForTokenClassification.

train_dataset, eval_dataset: torch.Dataset

instantiated custom dataset object

compute_metrics: function

The function that will be used to compute metrics at evaluation. Must return a dictionary mapping metric names (strings) to metric values. Used by the trainer; see https://huggingface.co/transformers/training.html#trainer for more info.

trainer: Transformers.Trainer

custom instantiated trainer to use

training_args:

keyword arguments for training. For complete list, see https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments

generate_outputs(text, pre_tokenized=False, output_attentions=False, output_hidden_states=False, word_embeddings=False, **tokenizer_inputs)

Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.

Parameters
text: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]

the text to compute features for.

pre_tokenized: Boolean

whether or not the input text is pre-tokenized (ie, split on spaces)

output_attentions: Boolean, optional, defaults to False

Whether or not to return the attentions tensors of all attention layers.

output_hidden_states: Boolean, optional, defaults to False

Whether or not to return the hidden states of all layers.

word_embeddings: Boolean

whether or not to filter special token embeddings and average sub-word embeddings (hidden states) into word embeddings. If the inputs are pre-tokenized, the sub-word embeddings are averaged into the tokens passed as inputs. If pre_tokenized=False, the text is split on whitespace and the sub-word embeddings are averaged back into the words produced by that split. Only used if output_hidden_states=True. If False, the number of output embeddings can be greater than (number of words + special tokens). If True, the number of output embeddings equals the number of words: sub-word embeddings are averaged together to create word-level embeddings and special token embeddings are excluded.

tokenizer_inputs: additional arguments passed to the tokenizer, such as presaved entity annotations for enhanced models.

Returns
Dict containing (note that if a single text sequence is passed as input, the batch size will be 1):
hidden_states: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)

Hidden-states of the model at the output of each layer plus the initial embedding outputs. Only returned if output_hidden_states is True. If word_embeddings, the sequence length will be the number of words in the longest sentence, i.e. the maximum number of words; shorter sequences are padded with zeros.

attentions: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, num_heads, sequence_length, sequence_length)

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True.

offset_mappings: Tensor of shape (batch_size, sequence_length, 2)

Tensor containing (char_start, char_end) for each token, giving the index into the input strings of the start and end character of each token. If the input is pre-tokenized, the start and end indices map into the associated word. Note that special tokens are included with 0 for start and end indices, as they do not map into the input text (they are added inside the function). This output is only available for tokenizers that inherit from transformers.PreTrainedTokenizerFast; this includes most common tokenizers, but not all tokenizers in the library. If the tokenizer does not inherit from this class, this output value will be None.

if TransformerForFeatureExtraction:
last_hidden_state: torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)

Sequence of hidden-states at the output of the last layer of the model.

pooler_output: torch.FloatTensor of shape (batch_size, hidden_size)

Last layer hidden-state of the first token of the sequence (classification token), further processed by a Linear layer and a Tanh activation function.

if TransformerForSequenceClassification:
logits: torch.FloatTensor of shape (batch_size, config.num_labels)

Classification scores (before softmax).

if TransformerForTokenClassification:
logits: torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)

Classification scores (before softmax).

save(output_directory, save_tokenizer=False)

Method to save the transformers model and optionally the tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method will save it again alongside the model if needed. Both the tokenizer files and the model files are written to the output directory. The output directory can then be passed as the model and tokenizer arguments to the load() method of the inheriting classes.

Parameters
output_directory: str

path to directory to save model

save_tokenizer: Boolean

whether to save tokenizer in directory or not, defaults to False

class mangoes.modeling.finetuning.TransformerForQuestionAnswering(pretrained_model, pretrained_tokenizer, device=None, **keyword_args)

Bases: mangoes.modeling.transformer_base.PipelineMixin, mangoes.modeling.transformer_base.TransformerModel

Transformer model with span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). For a list of eligible model classes, see: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForQuestionAnswering.from_config.config

Parameters
pretrained_model: str, transformers.PreTrainedModel subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated transformer model, either a transformers.AutoModelForQuestionAnswering compatible class or a mangoes enhanced language model.

pretrained_tokenizer: str, transformers.PreTrainedTokenizerBase subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.

device: int, optional, defaults to None

If -1, use the CPU; if >= 0, use that CUDA device number. If None, a GPU will be used if available.

**keyword_args include arguments passed to transformers.AutoConfig

Methods

generate_outputs(question, context[, …])

Tokenize questions and context and pass them through the transformer model and QA head, optionally outputting hidden states or attention matrices.

predict([inputs, question, context])

Answer the question(s) given as inputs by using the context(s).

save(output_directory[, save_tokenizer])

Method to save transformers model and optionally save tokenizer.

train([output_dir, train_question_texts, …])

Fine tune a transformers model on a question answering dataset

generate_outputs(question, context, pre_tokenized=False, output_attentions=False, output_hidden_states=False, word_embeddings=False, doc_stride=128, **tokenizer_inputs)

Tokenize questions and context and pass them through the transformer model and QA head, optionally outputting hidden states or attention matrices. Works for single question/context or batch. If a single question/context is given, a batch of size 1 will be created. Note that offset_mappings are not returned for this class method: only tokenizers that inherit from transformers.PreTrainedTokenizerFast have this functionality, and these tokenizers are not compatible with QuestionAnswering functionality (see https://github.com/huggingface/transformers/issues/7735, and https://github.com/huggingface/transformers/issues/8787#issuecomment-779213050 for more details).

Parameters
question: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]

The question text

context: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]

The context text

pre_tokenized: Boolean

Whether or not the input text is pretokenized (ie, split on spaces)

output_attentions: Boolean, optional, defaults to False

Whether or not to return the attentions tensors of all attention layers.

output_hidden_states: Boolean, optional, defaults to False

Whether or not to return the hidden states of all layers.

word_embeddings: Boolean

Whether or not to filter special token embeddings and average subword embeddings (hidden states) into word embeddings. This functionality is not available for this task class. Use the feature extraction class instead.

doc_stride: int, defaults to 128

If the context is too long to fit with the question for the model, it will be split in several chunks with some overlap. This argument controls the size of that overlap.

tokenizer_inputs: additional arguments passed to the tokenizer, such as presaved entity annotations for enhanced models.

Returns
Dict containing (note that if a single text sequence is passed as input, the batch size will be 1):
start_logits: torch.FloatTensor of shape (batch_size, sequence_length)

Span-start scores (before SoftMax).

end_logits: torch.FloatTensor of shape (batch_size, sequence_length)

Span-end scores (before SoftMax).

offset_mappings: Tensor of shape (batch_size, sequence_length, 2)

Note that unlike other fine tuning classes and the base class, offset mappings are not returned for this class. See the class documentation for more details.

hidden_states: Tuple (one for each layer) of torch.FloatTensor (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs. Only returned if output_hidden_states is True

attentions: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, num_heads, sequence_length, sequence_length)

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True.

predict(inputs=None, question=None, context=None, **kwargs)

Answer the question(s) given as inputs by using the context(s). Takes either transformers.SquadExample (or list of them) or lists of strings, see argument documentation.

Parameters
inputs: transformers.SquadExample or a list of SquadExample

One or several SquadExample containing the question and context.

question: str or List[str]

One or several question(s) (must be used in conjunction with the context argument).

context: str or List[str]

One or several context(s) associated with the question(s) (must be used in conjunction with the question argument).

kwargs include:
topk: int, defaults to 1

The number of answers to return (will be chosen by order of likelihood).

doc_stride: int, defaults to 128

If the context is too long to fit with the question for the model, it will be split in several chunks with some overlap. This argument controls the size of that overlap.

max_answer_len: int, defaults to 15

The maximum length of predicted answers (e.g., only answers with a shorter length are considered).

max_seq_len: int, defaults to 384

The maximum length of the total sentence (context + question) after tokenization. The context will be split in several chunks (using doc_stride) if needed.

max_question_len: int, defaults to 64

The maximum length of the question after tokenization. It will be truncated if needed.

handle_impossible_answer: bool, defaults to False

Whether or not we accept impossible as an answer.

Returns
Returns a list of answers, one for each question. If only one question has been passed as input, returns a list with one dictionary.
Each answer is a dict with:
score: float

The probability associated to the answer.

start: int

The start index of the answer (in the tokenized version of the input).

end: int

The end index of the answer (in the tokenized version of the input).

answer: str

The answer to the question.
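A hedged sketch of extractive question answering with predict; the question and context are illustrative, and a span head that has not been fine-tuned will return arbitrary answers:

    from mangoes.modeling.finetuning import TransformerForQuestionAnswering

    qa = TransformerForQuestionAnswering("bert-base-uncased", "bert-base-uncased")
    answers = qa.predict(
        question="Where are mangoes cultivated?",
        context="Mangoes are cultivated in tropical regions such as India and Mexico.",
        topk=1,
    )
    print(answers[0]["answer"], answers[0]["score"])  # one dict per question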

train(output_dir=None, train_question_texts=None, eval_question_texts=None, train_context_texts=None, eval_context_texts=None, train_answer_texts=None, eval_answer_texts=None, train_start_indices=None, eval_start_indices=None, max_seq_length=384, doc_stride=128, max_query_length=64, freeze_base=False, task_learn_rate=None, collator=None, train_dataset=None, eval_dataset=None, compute_metrics=None, trainer=None, **training_args)

Fine tune a transformers model on a question answering dataset

Parameters
output_dir: str

Path to the output directory where the model predictions and checkpoints will be written. Used to instantiate Trainer if trainer argument is None.

train_question_texts, eval_question_texts: list of str

The texts corresponding to the questions

train_context_texts, eval_context_texts: list of str

The texts corresponding to the contexts

train_answer_texts, eval_answer_texts: list of str

The texts corresponding to the answers

train_start_indices, eval_start_indices: list of int

The character positions of the start of the answers

max_seq_length:int

The maximum total input sequence length after tokenization.

doc_stride: int

When splitting up a long document into chunks, how much stride to take between chunks.

max_query_length: int

The maximum number of tokens for the question.

freeze_base: Boolean

Whether to freeze the weights of the base model, so training only changes the task head weights. If true, the requires_grad flag for parameters of the base model will be set to false before training.

task_learn_rate: float

Learning rate to be used for the task-specific parameters; the base parameters will use the normal learning rate already defined in **training_args. If None, all parameters will use the same normal learning rate.

collator: Transformers.DataCollator

custom collator to use

train_dataset, eval_dataset: torch.Dataset

instantiated custom dataset object

compute_metrics: function

The function that will be used to compute metrics at evaluation. Must return a dictionary mapping metric names (strings) to metric values. Used by the trainer; see https://huggingface.co/transformers/training.html#trainer for more info.

trainer: Transformers.Trainer

custom instantiated trainer to use

training_args:

keyword arguments for training. For complete list, see https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
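A hedged training sketch for question answering; the single training example and output directory are illustrative, and num_train_epochs is one of the forwarded transformers.TrainingArguments keywords:

    from mangoes.modeling.finetuning import TransformerForQuestionAnswering

    qa = TransformerForQuestionAnswering("bert-base-uncased", "bert-base-uncased")
    context = "Mangoes are cultivated in tropical regions such as India and Mexico."
    answer = "tropical regions"
    qa.train(
        output_dir="./qa_output",
        train_question_texts=["Where are mangoes cultivated?"],
        train_context_texts=[context],
        train_answer_texts=[answer],
        train_start_indices=[context.index(answer)],  # character position of the answer start
        num_train_epochs=1,
    )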

save(output_directory, save_tokenizer=False)

Method to save the transformers model and optionally the tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method will save it again alongside the model if needed. Both the tokenizer files and the model files are written to the output directory. The output directory can then be passed as the model and tokenizer arguments to the load() method of the inheriting classes.

Parameters
output_directory: str

path to directory to save model

save_tokenizer: Boolean

whether to save tokenizer in directory or not, defaults to False

class mangoes.modeling.finetuning.TransformerForMultipleChoice(pretrained_model, pretrained_tokenizer, device=None, **keyword_args)

Bases: mangoes.modeling.transformer_base.TransformerModel

Pretrained model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax) e.g. for RocStories/SWAG tasks.

For a list of eligible model classes, see: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForMultipleChoice.from_config.config

For information on how multiple choice datasets should be formatted for fine-tuning, see this explanation: https://github.com/google-research/bert/issues/38

And this link for explanation of Huggingface’s multiple choice models: https://github.com/huggingface/transformers/issues/7701#issuecomment-707149546

Parameters
pretrained_model: str, transformers.PreTrainedModel subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated transformer model, either a transformers.AutoModelForMultipleChoice compatible class or a mangoes enhanced language model.

pretrained_tokenizer: str, transformers.PreTrainedTokenizerBase subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.

device: int, optional, defaults to None

If -1, use the CPU; if >= 0, use that CUDA device number. If None, a GPU will be used if available.

**keyword_args include arguments passed to transformers.AutoConfig

Methods

generate_outputs(questions, choices[, …])

Tokenize context and choices and pass them through the model and MC head, optionally outputting hidden states or attention matrices.

predict(questions, choices[, pre_tokenized])

Predicts the answer to the question(s) out of the possible choices.

save(output_directory[, save_tokenizer])

Method to save transformers model and optionally save tokenizer.

train([output_dir, train_question_texts, …])

Fine tune a pretrained model on a multiple choice dataset

generate_outputs(questions, choices, pre_tokenized=False, output_attentions=False, output_hidden_states=False, word_embeddings=False, **tokenizer_inputs)

Tokenize context and choices and pass them through the model and MC head, optionally outputting hidden states or attention matrices. Works for a single question/set of choices or a batch. If a single question/set of choices is given, a batch of size 1 will be created.

Follows these explanations for packing MC data and sending it through the model:

https://github.com/google-research/bert/issues/38 https://github.com/huggingface/transformers/issues/7701#issuecomment-707149546

Parameters
questions: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]

The question text. This can include the context together with a question, or (in the case of some datasets such as the SWAG dataset) just the context if there is no direct question. Can be a single question or list of questions.

choices: List[str] or List[List[str]] if pre_tokenized=False, else List[List[str]] or List[List[List[str]]]

The choices text. One instance of choices should be a list of strings (if not pre-tokenized) or a list of list of strings (if pre-tokenized). Can be a single choice instance or multiple. If batch is passed in (ie more than one question), assumes all questions have same number of choices.

pre_tokenized: Boolean

Whether or not the input text is pre-tokenized (ie, split on spaces)

output_attentions: Boolean, optional, defaults to False

Whether or not to return the attentions tensors of all attention layers.

output_hidden_states: Boolean, optional, defaults to False

Whether or not to return the hidden states of all layers.

word_embeddings: Boolean

Whether or not to filter special token embeddings and average subword embeddings (hidden states) into word embeddings. This functionality is not available for this task class. Use the feature extraction class instead.

tokenizer_inputs: additional arguments passed to the tokenizer, such as presaved entity annotations for enhanced models.

Returns
Dict containing (note that if single question/set of choices is passed as input, batch size will be 1):
logits: Tensor of shape (batch_size, num_choices)

Classification scores (before SoftMax).

offset_mappings: Tensor of shape (batch_size, num_choices, sequence_length, 2)

Tensor containing (char_start, char_end) for each token, giving index into input strings of start and end character for each token. If input is pre-tokenized, start and end index maps into associated word. Note that special tokens are included with 0 for start and end indices, as these don’t map into input text because they are added inside the function. Offset mappings for both questions and choices are merged to one row, but indices still align to them separately.

hidden_states: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, num_choices, sequence_length, hidden_size)

Hidden-states of the model at the output of each layer plus the initial embedding outputs. Only returned if output_hidden_states is True.

attentions: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, num_choices, num_heads, sequence_length, sequence_length)

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True.

predict(questions, choices, pre_tokenized=False)

Predicts the answer to the question(s) out of the possible choices.

Parameters
questions: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]

The question text. This can include the context together with a question, or (in the case of some datasets such as the SWAG dataset) just the context if there is no direct question. Can be a single question or list of questions.

choices: List[str] or List[List[str]] if pre_tokenized=False, else List[List[str]] or List[List[List[str]]]

The choices text. One instance of choices should be a list of strings (if not pre-tokenized) or a list of list of strings (if pre-tokenized). Can be a single choice instance or multiple. If batch is passed in (ie more than one question), assumes all questions have same number of choices.

pre_tokenized: Boolean

Whether or not the input text is pre-tokenized (ie, split on spaces)

Returns
List of answer prediction dicts, one for each question (returns a list of length 1 if a single question is passed as input). Answer prediction dicts include:
score: float

The probability associated to the answer.

answer_index: int

The index of the predicted choice.

answer_text: str (if not pre_tokenized) or List[str] (if pre_tokenized)

The text corresponding to the predicted answer.
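A hedged sketch of multiple choice prediction; the question and choices are illustrative (a SWAG-style continuation task), and a head that has not been fine-tuned will score them arbitrarily:

    from mangoes.modeling.finetuning import TransformerForMultipleChoice

    mc = TransformerForMultipleChoice("bert-base-uncased", "bert-base-uncased")
    results = mc.predict(
        questions="She poured the batter into the pan and",
        choices=["baked it in the oven.", "parked it in the garage."],
    )
    best = results[0]  # one dict per question
    print(best["answer_index"], best["answer_text"], best["score"])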

train(output_dir=None, train_question_texts=None, eval_question_texts=None, train_choices_texts=None, eval_choices_texts=None, train_labels=None, eval_labels=None, max_len=None, freeze_base=False, task_learn_rate=None, collator=None, train_dataset=None, eval_dataset=None, compute_metrics=None, trainer=None, **training_args)

Fine tune a pretrained model on a multiple choice dataset

Parameters
output_dir: str

Path to the output directory where the model predictions and checkpoints will be written. Used to instantiate Trainer if trainer argument is None.

train_question_texts, eval_question_texts: list of str

The texts corresponding to the questions/contexts.

train_choices_texts, eval_choices_texts: list of str

The texts corresponding to the answer choices

train_labels, eval_labels: list of int

The indices of the correct answers

max_len:int

The maximum total input sequence length after tokenization. Note that if a question answer pair sequence is longer than this length, it will be truncated.

freeze_base: Boolean

Whether to freeze the weights of the base model, so training only changes the task head weights. If true, the requires_grad flag for parameters of the base model will be set to false before training.

task_learn_rate: float

Learning rate to be used for the task-specific parameters; the base parameters will use the normal learning rate already defined in **training_args. If None, all parameters will use the same normal learning rate.

collator: Transformers.DataCollator

custom collator to use

train_dataset, eval_dataset: torch.Dataset

instantiated custom dataset object

compute_metrics: function

The function that will be used to compute metrics at evaluation. Must return a dictionary mapping metric names (strings) to metric values. Used by the trainer; see https://huggingface.co/transformers/training.html#trainer for more info.

trainer: Transformers.Trainer

custom instantiated trainer to use

training_args:

keyword arguments for training. For complete list, see https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments

save(output_directory, save_tokenizer=False)

Method to save the transformers model and optionally the tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method will save it again alongside the model if needed. Both the tokenizer files and the model files are written to the output directory. The output directory can then be passed as the model and tokenizer arguments to the load() method of the inheriting classes.

Parameters
output_directory: str

path to directory to save model

save_tokenizer: Boolean

whether to save tokenizer in directory or not, defaults to False

class mangoes.modeling.finetuning.TransformerForCoreferenceResolution(pretrained_model, pretrained_tokenizer, device=None, max_span_width=30, ffnn_hidden_size=1000, top_span_ratio=0.4, max_top_antecendents=50, use_metadata=False, metadata_feature_size=20, genres=('bc', 'bn', 'mz', 'nw', 'pt', 'tc', 'wb'), max_training_segments=5, coref_depth=1, coref_dropout=0.3, pretrained=False, **keyword_args)

Bases: mangoes.modeling.transformer_base.TransformerModel

Class for fine tuning a transformer model for the coreference resolution task.

The base model is an implementation of the independent variant of https://arxiv.org/pdf/1908.09091.pdf, which uses the fine tuning procedure described in https://arxiv.org/pdf/1804.05392.pdf

Parameters
pretrained_model: str, transformers.PreTrainedModel subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated transformer model, either a transformers compatible class or a mangoes enhanced language model.

pretrained_tokenizer: str, transformers.PreTrainedTokenizerBase subclass.
Either:
  • A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.

  • A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.

  • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.

  • An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.

device: int, optional, defaults to None

If -1, use the CPU; if >= 0, use that CUDA device number. If None, a GPU will be used if available.

max_span_width: int, defaults to 30

Maximum width (consecutive tokens) of candidate span.

ffnn_hidden_size: int, defaults to 1000

Size of hidden layers in dense mention scorer and slow antecedent scorer heads.

top_span_ratio: float, defaults to 0.4

Ratio of spans to consider after first sort on mention score.

max_top_antecendents: int, defaults to 50

Max number of antecedents to consider for each span after fast antecedent scorer.

use_metadata: Boolean, defaults to False

Whether to use metadata (speaker and genre information) in forward pass.

metadata_feature_size: int, defaults to 20

Size of metadata features (if using metadata)

genres: List of string, defaults to (“bc”, “bn”, “mz”, “nw”, “pt”, “tc”, “wb”)

List of possible genres (if using metadata). Defaults to genres in Ontonotes dataset.

max_training_segments: int, defaults to 5

Maximum number of segments in one document (aka one batch).

coref_depth: int, defaults to 1

Depth of higher order (aka slow) antecedent scoring.

coref_dropout: float, defaults to 0.3

Dropout probability for head layers.

**keyword_args

Methods

generate_outputs(text[, pre_tokenized, …])

Pass one batch (document) worth of text through the co-reference model, outputting mention scores and token indices for possible spans, the indices and scores of the top antecedents for the top mention scored spans, and optionally the hidden states or attention matrices from the base model.

predict(text[, pre_tokenized, speaker_ids, …])

Predict the co-reference clusters in text.

save(output_directory[, save_tokenizer])

Method to save transformers model and optionally save tokenizer.

train([output_dir, max_segment_len, …])

Fine tune a pretrained model on a co-reference resolution dataset.

save(output_directory, save_tokenizer=False)

Method to save the transformers model and optionally the tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method will save it again alongside the model if needed. Both the tokenizer files and the model files are written to the output directory. The output directory can then be passed as the model and tokenizer arguments to the load() method of the inheriting classes.

Parameters
output_directory: str

path to directory to save model

save_tokenizer: Boolean

whether to save tokenizer in directory or not, defaults to False

predict(text, pre_tokenized=False, speaker_ids=None, genre=None, max_segment_len=256, max_segments=5)

Predict the co-reference clusters in text. Takes one document at a time. Internally calls generate_outputs and then processes the outputs.

Parameters
text: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]

The text to predict co-references for. If pre_tokenized, the text can be one sentence (list of words) or list of sentences (list of list of words). If not pre_tokenized, text can be one sentence (str) or list of sentences (list of strings).

pre_tokenized: Boolean

Whether or not the input text is pretokenized (ie, split on spaces). This method will still pass it through the tokenizer, in order to get subtokens, special characters, and attention masks.

speaker_ids: int or List[int] if pre_tokenized=False, else List[int] or List[List[int]]

Speaker ids for input text. If pre_tokenized, speaker_ids should be for each word in each sentence of input text (ie, list of int if one sentence, or list of list of int if multiple). If not pre_tokenized, speaker ids should be on a sentence basis, ie one int if one input sentence, or list of ints if multiple. Optional, needed only if the model has been trained/instantiated to accept metadata.

genre: Int or String

Genre of text. If string, will attempt to use the genre id mapping constructed by the model parameter to this object. Optional, needed only if the model has been trained/instantiated to accept metadata.

max_segment_len: int, defaults to 256

maximum number of sub-tokens for one segment

max_segments: int, defaults to 5

Maximum number of segments to return per document

Returns
List of dicts.
For each found co-reference cluster, a dict with the following keys:
cluster_tokens: List[List[str]]

The text spans associated with the cluster. Spans are represented by the list of tokens.

cluster_ids: List[List[int]]

The id spans associated with the cluster. Spans are represented by the list of token ids.
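A hedged sketch of coreference prediction on a two-sentence document; the text is illustrative, and without fine-tuning of the coreference head the clusters are not meaningful:

    from mangoes.modeling.finetuning import TransformerForCoreferenceResolution

    coref = TransformerForCoreferenceResolution("bert-base-uncased", "bert-base-uncased")
    clusters = coref.predict(
        ["Alice went to the market.", "She bought some mangoes."],
        pre_tokenized=False,
    )
    for cluster in clusters:
        print(cluster["cluster_tokens"])  # token spans that belong to the same cluster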

generate_outputs(text, pre_tokenized=False, speaker_ids=None, genre=None, output_attentions=False, output_hidden_states=False, word_embeddings=False, max_segment_len=256, max_segments=5, **tokenizer_inputs)

Pass one batch (document) worth of text through the co-reference model, outputting mention scores and token indices for possible spans, the indices and scores of the top antecedents for the top mention scored spans, and optionally the hidden states or attention matrices from the base model.

Note that this function does not return "offset_mappings"; instead it returns the flattened ids and text.

Parameters
text: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]

The text to predict co-references for. If pre_tokenized, the text can be one sentence (list of words) or list of sentences (list of list of words). If not pre_tokenized, text can be one sentence (str) or list of sentences (list of strings).

pre_tokenized: Boolean

Whether or not the input text is pretokenized (ie, split on spaces). This method will still pass it through the tokenizer, in order to get subtokens, special characters, and attention masks.

speaker_ids: int or List[int] if pre_tokenized=False, else List[int] or List[List[int]]

Speaker ids for input text. If pre_tokenized, speaker_ids should be for each word in each sentence of input text (ie, list of int if one sentence, or list of list of int if multiple). If not pre_tokenized, speaker ids should be on a sentence basis, ie one int if one input sentence, or list of ints if multiple. Optional, needed only if the model has been trained/instantiated to accept metadata.

genre: Int or String

Genre of text. If string, will attempt to use the genre id mapping constructed by the model parameter to this object. Optional, needed only if the model has been trained/instantiated to accept metadata.

output_attentions: Boolean, optional, defaults to False

Whether or not to return the attentions tensors of all attention layers.

output_hidden_states: Boolean, optional, defaults to False

Whether or not to return the hidden states of all layers.

word_embeddings: Boolean

Whether or not to filter special token embeddings and average subword embeddings (hidden states) into word embeddings. Note: this functionality is not available for this class because of consolidation of input in the forward pass of the model. Consider using mangoes.modeling.TransformerForFeatureExtraction class for word-level feature extractions.

max_segment_len: int, defaults to 256

maximum number of sub-tokens for one segment

max_segments: int, defaults to 5

Maximum number of segments to return per document

tokenizer_inputs: additional arguments passed to the tokenizer, such as presaved entity annotations for enhanced models.

Returns
Dict containing:
candidate_starts: tensor of size (num_spans)

start token indices in flattened document of candidate spans

candidate_ends: tensor of size (num_spans)

end token indices in flattened document of candidate spans

candidate_mention_scores: tensor of size (num_spans)

mention scores for each candidate span

top_span_starts: tensor of size (num_top_spans)

start token indices in flattened document of candidate spans with top mention scores

top_span_ends: tensor of size (num_top_spans)

end token indices in flattened document of candidate spans with top mention scores

top_antecedents: tensor of shape (num_top_spans, antecedent_candidates)

indices in top span candidates of top antecedents for each mention

top_antecedent_scores: tensor of shape (num_top_spans, 1 + antecedent_candidates)

final antecedent scores of top antecedents for each mention. The dummy score (for not being a co-reference) is inserted at the start of each row; thus, the score for top_antecedents[0][0] is top_antecedent_scores[0][1]. The span for the first candidate is top_span_starts[0] to top_span_ends[0]. The span for the first top antecedent of the first candidate is top_span_starts[top_antecedents[0][0]] to top_span_ends[top_antecedents[0][0]] (see the sketch after this Returns list).

flattened_ids: tensor of shape (num_tokens)

flattened ids of input sentences. The start and end candidate and span indices map into this tensor.

flattened_text: tensor of shape (num_tokens)

flattened tokens of input sentences. The start and end candidate and span indices map into this tensor.

hidden_states: Tuple (one for each layer) of torch.FloatTensor (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs. Only returned if output_hidden_states is True

attentions: Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True
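
The index bookkeeping described above can be resolved as in the sketch below (reusing the model instance assumed in the predict example; this decoding loop only follows the Returns description here and is not the library's own routine):

    import torch

    # Sketch of reading generate_outputs results for a pre-tokenized toy document.
    outputs = model.generate_outputs(
        [["Alice", "met", "Bob", "."], ["She", "greeted", "him", "."]],
        pre_tokenized=True,
    )

    starts = outputs["top_span_starts"]
    ends = outputs["top_span_ends"]
    antecedents = outputs["top_antecedents"]       # (num_top_spans, antecedent_candidates)
    scores = outputs["top_antecedent_scores"]      # (num_top_spans, 1 + antecedent_candidates)
    tokens = outputs["flattened_text"]             # span indices map into this

    for i in range(len(starts)):
        best = int(torch.argmax(scores[i]))
        if best == 0:
            continue                               # dummy column: no antecedent predicted
        j = int(antecedents[i][best - 1])          # index into the top spans
        mention = tokens[int(starts[i]): int(ends[i]) + 1]
        antecedent = tokens[int(starts[j]): int(ends[j]) + 1]
        print(mention, "->", antecedent)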

train(output_dir=None, max_segment_len=256, max_segments=5, freeze_base=False, task_learn_rate=None, task_weight_decay=None, train_documents=None, train_cluster_ids=None, train_speaker_ids=None, train_genres=None, eval_documents=None, eval_cluster_ids=None, eval_speaker_ids=None, eval_genres=None, train_dataset=None, eval_dataset=None, compute_metrics=None, trainer=None, **training_args)

Fine tune a pretrained model on a co-reference resolution dataset. Users can input raw coreference data (a torch dataset will be created from it), an already instantiated dataset (or datasets), or an already instantiated trainer.

Note that this implementation is based on the “independent” variant of the method introduced in https://arxiv.org/pdf/1908.09091.pdf, thus the batch size will be 1 (1 document per batch, with multiple segments per document), and a specific collator function will be used.

Parameters
output_dir: str

Path to the output directory where the model predictions and checkpoints will be written. Used to instantiate Trainer if trainer argument is None. Optional: needed if trainer is not provided

max_segment_len: int, defaults to 256

maximum number of sub-tokens for one segment

max_segments: int, defaults to 5

Maximum number of segments to return per document

freeze_base: Boolean

Whether to freeze the weights of the base model, so training only changes the task head weights. If true, the requires_grad flag for parameters of the base model will be set to false before training.

task_learn_rate: float

Learning rate to be used for task-specific parameters (base parameters will use the normal learning rate, ie the one already defined in **training_args). If None, all parameters will use the same normal learning rate.

task_weight_decay: float

Weight decay to be used for task-specific parameters (base parameters will use the normal weight decay, ie the one already defined in **training_args). If None, all parameters will use the same normal weight decay.

train_documents: List of Lists of Lists of strings

Optional: needed if train_dataset or trainer is not provided. Text for each document. As cluster ids are labeled by word, a document is a list of sentences. One sentence is a list of words (ie already split on whitespace/punctuation)

train_cluster_ids: List of Lists of Lists of (ints or Tuple(int, int))

Optional: needed if train_dataset or trainer is not provided. Cluster ids for each word in the documents argument. Assumes words that aren’t mentions have either None or -1 as id. In the case where a word belongs to two different spans (with different cluster ids), the cluster id for the word should be a tuple of ints corresponding to the different cluster ids.

train_speaker_ids: List of Lists of Lists of ints

Optional: needed if train_dataset or trainer is not provided and the model is using metadata. Speaker id for each word in documents. Assumes positive ids (special tokens such as [CLS] and [SEP], added at the beginning and end of segments, will be assigned speaker ids of -1)

train_genres: List of ints

Optional: needed if train_dataset or trainer is not provided and the model is using metadata. Genre id for each document

eval_documents: List of Lists of Lists of strings

Optional: needed if eval_dataset or trainer is not provided. Text for each document. As cluster ids are labeled by word, a document is a list of sentences. One sentence is a list of words (ie already split on whitespace/punctuation)

eval_cluster_ids: List of Lists of Lists of (ints or Tuple(int, int))

Optional: needed if eval_dataset or trainer is not provided. Cluster ids for each word in the documents argument. Assumes words that aren’t mentions have either None or -1 as id. In the case where a word belongs to two different spans (with different cluster ids), the cluster id for the word should be a tuple of ints corresponding to the different cluster ids.

eval_speaker_ids: List of Lists of Lists of ints

Optional: needed if eval_dataset or trainer is not provided and the model is using metadata. Speaker id for each word in documents. Assumes positive ids (special tokens such as [CLS] and [SEP], added at the beginning and end of segments, will be assigned speaker ids of -1)

eval_genres: List of ints

Optional: needed if eval_dataset or trainer is not provided and the model is using metadata. Genre id for each document

train_dataset, eval_dataset: torch.utils.data.Dataset

instantiated custom dataset object. Note that the model implementation and default trainer (mangoes.modeling.training_utils.CoreferenceFineTuneTrainer) are set up to work with mangoes.modeling.training_utils.MangoesCoreferenceDataset datasets, so take care when sending custom dataset arguments.

compute_metrics: function

The function that will be used to compute metrics at evaluation. Must return a dictionary mapping metric-name strings to metric values. Used by the trainer; see https://huggingface.co/transformers/training.html#trainer for more info.

trainer: Transformers.Trainer

custom instantiated trainer to use

training_args:

keyword arguments for training. For complete list, see https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
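
A rough end-to-end sketch (same assumed class and model instance as in the earlier examples; the toy documents and cluster ids follow the raw-data parameter formats above, and num_train_epochs is simply forwarded to transformers.TrainingArguments):

    # Fine-tuning sketch with raw coreference data: documents are lists of
    # sentences, sentences are lists of words, and cluster ids mark mentions
    # (None for words that are not mentions).
    train_documents = [
        [["Alice", "met", "Bob", "."], ["She", "greeted", "him", "."]],
    ]
    train_cluster_ids = [
        [[0, None, 1, None], [0, None, 1, None]],
    ]

    model.train(
        output_dir="./coref_finetuned",
        train_documents=train_documents,
        train_cluster_ids=train_cluster_ids,
        max_segment_len=256,
        max_segments=5,
        num_train_epochs=1,   # forwarded to transformers.TrainingArguments
    )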