mangoes.modeling.finetuning module
This module provides an interface to the transformers pretrained models for fine-tuning. Each fine-tuning class accepts any eligible model; see the links in each class's documentation for the list of eligible models.
-
class mangoes.modeling.finetuning.TransformerForFeatureExtraction(pretrained_model, pretrained_tokenizer, device=None, **keyword_args)
Bases: mangoes.modeling.transformer_base.PipelineMixin, mangoes.modeling.transformer_base.TransformerModel
Class for using a pretrained transformer model and tokenizer. Because it uses AutoModel (and not AutoModelForQuestionAnswering, for example), it can only be used with base pretrained models, not for fine-tuning tasks.
- Parameters
- pretrained_model: str or transformers.PreTrainedModel subclass
- Either:
A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained model uploaded by a user to the Hugging Face model hub, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated transformer model, either a transformers.AutoModel compatible class or a mangoes enhanced language model.
- pretrained_tokenizer: str or transformers.PreTrainedTokenizerBase subclass
- Either:
A string with the shortcut name of a pretrained tokenizer to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained tokenizer uploaded by a user to the Hugging Face model hub, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing tokenizer files saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.
- device: int, optional, defaults to None
If -1, use the CPU; if >= 0, use the CUDA device with that number. If None, use a GPU if one is available.
- **keyword_args: arguments passed to transformers.AutoConfig
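The device convention described above can be sketched as a small helper. This is only an illustration, not the mangoes implementation: the function name resolve_device, the cuda_available flag (standing in for a runtime check such as torch.cuda.is_available()), and the choice of the first CUDA device when device is None are all assumptions.

```python
from typing import Optional


def resolve_device(device: Optional[int], cuda_available: bool) -> str:
    """Map the documented `device` argument to a device string.

    `cuda_available` is a hypothetical stand-in for a runtime GPU check,
    passed in so that the sketch has no third-party dependency.
    """
    if device is None:
        # Assumption: "use GPU if available" means the first CUDA device.
        return "cuda:0" if cuda_available else "cpu"
    if device == -1:
        return "cpu"
    if device >= 0:
        return f"cuda:{device}"
    raise ValueError(f"device must be None, -1, or >= 0, got {device}")
```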
Methods
generate_outputs(text[, pre_tokenized, …]): Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.
predict(inputs, **kwargs): Run input text through the feature extraction pipeline, extracting the hidden states of each layer.
save(output_directory[, save_tokenizer]): Save the transformers model and, optionally, the tokenizer.
train([output_dir, train_dataset, …]): Does nothing; use a task-specific class to pre-train or fine-tune.
-
predict(inputs, **kwargs)
Run input text through the feature extraction pipeline, extracting the hidden states of each layer.
- Parameters
- inputs: str or list of str
Input text(s) to extract features from.
- Returns
- Nested list of float containing the hidden states.
-
train(output_dir=None, train_dataset=None, eval_dataset=None, collator=None, trainer=None, **training_args)
This method does nothing; use a task-specific class to pre-train or fine-tune.
-
generate_outputs(text, pre_tokenized=False, output_attentions=False, output_hidden_states=False, word_embeddings=False, **tokenizer_inputs)
Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.
- Parameters
- text: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]
The text to compute features for.
- pre_tokenized: Boolean
Whether or not the input text is pre-tokenized (i.e., split on spaces).
- output_attentions: Boolean, optional, defaults to False
Whether or not to return the attention tensors of all attention layers.
- output_hidden_states: Boolean, optional, defaults to False
Whether or not to return the hidden states of all layers.
- word_embeddings: Boolean
Whether or not to filter out special token embeddings and average sub-word embeddings (hidden states) into word embeddings. Only used if output_hidden_states=True. If the input is pre-tokenized, the sub-word embeddings are averaged into the tokens passed as input. If pre_tokenized=False, the text is split on whitespace and the sub-word embeddings are averaged back into the resulting words. If False, the number of output embeddings can be greater than (number of words + special tokens). If True, the number of output embeddings equals the number of words: sub-word embeddings are averaged into word-level embeddings and special token embeddings are excluded.
- tokenizer_inputs: arguments passed to the tokenizer, such as presaved entity annotations for enhanced models.
- Returns
- Dict containing (note that if a single text sequence is passed as input, the batch size will be 1):
- hidden_states: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)
Hidden states of the model at the output of each layer, plus the initial embedding outputs. Only returned if output_hidden_states is True. If word_embeddings is True, the sequence length is the number of words in the longest sentence, and shorter sequences are padded with zeros.
- attentions: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, num_heads, sequence_length, sequence_length)
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True.
- offset_mappings: Tensor of shape (batch_size, sequence_length, 2)
Tensor containing (char_start, char_end) for each token, giving the start and end character indices of the token in the input string. If the input is pre-tokenized, the start and end indices map into the associated word. Special tokens are included with 0 for both indices, because they are added inside the function and do not map into the input text. This output is only available for tokenizers that inherit from transformers.PreTrainedTokenizerFast, which covers most common tokenizers but not every tokenizer in the library. If the tokenizer does not inherit from this class, this value will be None.
- if TransformerForFeatureExtraction:
- last_hidden_state: torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)
Sequence of hidden states at the output of the last layer of the model.
- pooler_output: torch.FloatTensor of shape (batch_size, hidden_size)
Last-layer hidden state of the first token of the sequence (the classification token), further processed by a Linear layer and a Tanh activation function.
- if TransformerForSequenceClassification:
- logits: torch.FloatTensor of shape (batch_size, config.num_labels)
Classification scores, before softmax.
- if TransformerForTokenClassification:
- logits: torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)
Classification scores, before softmax.
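To make the word_embeddings option more concrete, here is a minimal, framework-free sketch of averaging sub-word vectors into word vectors and zero-padding shorter sequences. It is not the mangoes implementation: the function names, the plain-list representation of hidden states, and the word_ids convention (None for special tokens) are assumptions made for illustration.

```python
from typing import List, Optional


def average_subwords(
    token_vectors: List[List[float]],
    word_ids: List[Optional[int]],
) -> List[List[float]]:
    """Average sub-word vectors into one vector per word.

    word_ids[i] is the index of the word that token i belongs to, or None
    for special tokens, which are dropped from the output.
    """
    sums = {}
    counts = {}
    for vector, word_id in zip(token_vectors, word_ids):
        if word_id is None:
            continue  # special token: excluded from word embeddings
        if word_id not in sums:
            sums[word_id] = [0.0] * len(vector)
            counts[word_id] = 0
        sums[word_id] = [s + v for s, v in zip(sums[word_id], vector)]
        counts[word_id] += 1
    return [[s / counts[w] for s in sums[w]] for w in sorted(sums)]


def pad_with_zeros(batch: List[List[List[float]]], hidden_size: int):
    """Pad every sequence of word vectors to the longest one with zero vectors."""
    longest = max(len(seq) for seq in batch)
    return [
        seq + [[0.0] * hidden_size for _ in range(longest - len(seq))]
        for seq in batch
    ]
```

For a hypothetical tokenization such as [CLS] play ##ing chess [SEP], the two sub-word vectors of the first word are averaged and the special tokens are dropped, leaving one vector per whitespace-separated word.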
-
save(output_directory, save_tokenizer=False)
Save the transformers model and, optionally, the tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method can save it as well if needed. Both the tokenizer files and the model files are saved to the output directory. The output directory can then be passed to the “load()” method of the inheriting classes (as the model and tokenizer arguments).
- Parameters
- output_directory: str
Path to the directory in which to save the model.
- save_tokenizer: Boolean
Whether to save the tokenizer in the directory as well. Defaults to False.
-
class mangoes.modeling.finetuning.TransformerForSequenceClassification(pretrained_model, pretrained_tokenizer, labels=None, label2id=None, device=None, **keyword_args)
Bases: mangoes.modeling.transformer_base.PipelineMixin, mangoes.modeling.transformer_base.TransformerModel
Transformer model with a sequence classification head (a linear layer on top of the pooled output). For a list of eligible model classes, see: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification.from_config.config
- Variables
self.tokenizer – transformers.AutoTokenizerFast object, see https://huggingface.co/docs/transformers/model_doc/auto for more documentation
self.model – transformers.AutoModelForSequenceClassification object, see https://huggingface.co/docs/transformers/model_doc/auto
- Parameters
- pretrained_model: str or transformers.PreTrainedModel subclass
- Either:
A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained model uploaded by a user to the Hugging Face model hub, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated transformer model, either a transformers.AutoModelForSequenceClassification compatible class or a mangoes enhanced language model.
- pretrained_tokenizer: str or transformers.PreTrainedTokenizerBase subclass
- Either:
A string with the shortcut name of a pretrained tokenizer to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained tokenizer uploaded by a user to the Hugging Face model hub, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing tokenizer files saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.
- labels: list of str
List of class names. An error is raised if both labels and label2id are None.
- label2id: dict of str -> int
Dict mapping each class name to its index in the output layer. If None, it is created from labels.
- device: int, optional, defaults to None
If -1, use the CPU; if >= 0, use the CUDA device with that number. If None, use a GPU if one is available.
- **keyword_args: arguments passed to transformers.AutoConfig
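The relationship between labels and label2id can be illustrated as follows. This is a sketch under stated assumptions rather than the library's code: the helper name build_label2id is hypothetical, and assigning indices in list order is an assumption about how the mapping is derived from labels.

```python
def build_label2id(labels=None, label2id=None):
    """Return a class-name -> output-index mapping.

    An explicit label2id wins; otherwise the mapping is derived from
    labels (assumption: indices follow the order of the list).
    """
    if label2id is not None:
        return dict(label2id)
    if labels is None:
        raise ValueError("Provide either labels or label2id.")
    return {label: index for index, label in enumerate(labels)}


def invert(label2id):
    """Build the reverse index -> class-name mapping used when decoding predictions."""
    return {index: label for label, index in label2id.items()}
```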
Methods
generate_outputs(text[, pre_tokenized, …]): Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.
predict(inputs[, return_all_scores]): Predict classes for input texts.
save(output_directory[, save_tokenizer]): Save the transformers model and, optionally, the tokenizer.
train([output_dir, train_text, …]): Fine-tune a transformer model on a text classification dataset.
-
predict(inputs, return_all_scores=False)
Predict classes for input texts.
- Parameters
- inputs: str or list of str
Input text(s) to classify.
- return_all_scores: Boolean, defaults to False
Whether to return the scores of all classes or only the score of the predicted class.
- Returns
- List of dict, or list of list of dict if return_all_scores=True.
- If one sequence is passed as input, a list with one element (either a dict, or a list of dicts if return_all_scores=True) is returned.
- For each input, a dict containing:
label (str) – The predicted label.
score (float) – The corresponding probability.
- If return_all_scores=True, one dict is returned for each class, for each input.
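The two return shapes can be sketched from raw logits as below. This is an illustrative reconstruction, not the pipeline's actual code: the helper names and the plain-list logits are assumptions, and the only documented facts used are the label and score keys and the effect of return_all_scores.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def format_predictions(batch_logits, id2label, return_all_scores=False):
    """Build one dict per input, or one list of dicts per input if all scores are requested."""
    results = []
    for logits in batch_logits:
        probs = softmax(logits)
        scored = [{"label": id2label[i], "score": p} for i, p in enumerate(probs)]
        if return_all_scores:
            results.append(scored)
        else:
            results.append(max(scored, key=lambda d: d["score"]))
    return results
```

Note that even a single input yields a list with one element, matching the return description above.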
-
train(output_dir=None, train_text=None, train_targets=None, eval_text=None, eval_targets=None, max_len=None, freeze_base=False, task_learn_rate=None, collator=None, train_dataset=None, eval_dataset=None, compute_metrics=None, trainer=None, **training_args)
Fine-tune a transformer model on a text classification dataset.
- Parameters
- output_dir: str
Path to the output directory where the model predictions and checkpoints will be written. Used to instantiate the Trainer if the trainer argument is None.
- train_text: List[str]
List of training texts.
- train_targets: List[int] or List[str]
Corresponding list of classes, one for each training text. If strings, label2id is used to convert them to output indices; otherwise they are assumed to be already converted.
- eval_text: str or List[str], optional
List of evaluation texts.
- eval_targets: List[int]
Corresponding list of classes, one for each evaluation text.
- max_len: int
Maximum length of the input sequence. Defaults to self.tokenizer.max_length() if None.
- freeze_base: Boolean
Whether to freeze the weights of the base model, so that training only changes the task head weights. If True, the requires_grad flag of the base model's parameters is set to False before training.
- task_learn_rate: float
Learning rate to use for the task-specific parameters (base parameters use the normal learning rate, i.e., the one defined in **training_args). If None, all parameters use the same normal learning rate.
- collator: Transformers.DataCollator
Custom collator to use.
- train_dataset, eval_dataset: torch.Dataset
Instantiated custom dataset objects.
- compute_metrics: function
The function used to compute metrics at evaluation. Must return a dictionary mapping strings to metric values. Used by the trainer, see https://huggingface.co/transformers/training.html#trainer for more info.
- trainer: Transformers.Trainer
Custom instantiated trainer to use.
- training_args:
Keyword arguments for training. For a complete list, see https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
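As an illustration of the compute_metrics contract (a function returning a dictionary from metric names to values), here is a minimal accuracy sketch. It assumes the evaluation object unpacks into (logits, labels) with one row of class scores per example; the metric name "accuracy" is a choice made for this example.

```python
def compute_metrics(eval_pred):
    """Return accuracy as a {metric name: value} dictionary.

    Assumption: eval_pred unpacks into (logits, labels), where logits
    holds one row of class scores per example.
    """
    logits, labels = eval_pred
    # Index of the highest score in each row is the predicted class.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}
```

Such a function would be passed as the compute_metrics argument of train(); the sketch only iterates and indexes its inputs, so nested lists are enough to try it out.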
-
generate_outputs(text, pre_tokenized=False, output_attentions=False, output_hidden_states=False, word_embeddings=False, **tokenizer_inputs)
Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.
- Parameters
- text: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]
The text to compute features for.
- pre_tokenized: Boolean
Whether or not the input text is pre-tokenized (i.e., split on spaces).
- output_attentions: Boolean, optional, defaults to False
Whether or not to return the attention tensors of all attention layers.
- output_hidden_states: Boolean, optional, defaults to False
Whether or not to return the hidden states of all layers.
- word_embeddings: Boolean
Whether or not to filter out special token embeddings and average sub-word embeddings (hidden states) into word embeddings. Only used if output_hidden_states=True. If the input is pre-tokenized, the sub-word embeddings are averaged into the tokens passed as input. If pre_tokenized=False, the text is split on whitespace and the sub-word embeddings are averaged back into the resulting words. If False, the number of output embeddings can be greater than (number of words + special tokens). If True, the number of output embeddings equals the number of words: sub-word embeddings are averaged into word-level embeddings and special token embeddings are excluded.
- tokenizer_inputs: arguments passed to the tokenizer, such as presaved entity annotations for enhanced models.
- Returns
- Dict containing (note that if a single text sequence is passed as input, the batch size will be 1):
- hidden_states: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)
Hidden states of the model at the output of each layer, plus the initial embedding outputs. Only returned if output_hidden_states is True. If word_embeddings is True, the sequence length is the number of words in the longest sentence, and shorter sequences are padded with zeros.
- attentions: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, num_heads, sequence_length, sequence_length)
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True.
- offset_mappings: Tensor of shape (batch_size, sequence_length, 2)
Tensor containing (char_start, char_end) for each token, giving the start and end character indices of the token in the input string. If the input is pre-tokenized, the start and end indices map into the associated word. Special tokens are included with 0 for both indices, because they are added inside the function and do not map into the input text. This output is only available for tokenizers that inherit from transformers.PreTrainedTokenizerFast, which covers most common tokenizers but not every tokenizer in the library. If the tokenizer does not inherit from this class, this value will be None.
- if TransformerForFeatureExtraction:
- last_hidden_state: torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)
Sequence of hidden states at the output of the last layer of the model.
- pooler_output: torch.FloatTensor of shape (batch_size, hidden_size)
Last-layer hidden state of the first token of the sequence (the classification token), further processed by a Linear layer and a Tanh activation function.
- if TransformerForSequenceClassification:
- logits: torch.FloatTensor of shape (batch_size, config.num_labels)
Classification scores, before softmax.
- if TransformerForTokenClassification:
- logits: torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)
Classification scores, before softmax.
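The offset_mappings output can be used to recover the surface text of each token. The sketch below works on one sequence's offsets as plain (char_start, char_end) pairs; it assumes char_end is exclusive, as in Python slicing, and treats every (0, 0) pair as a special token. Both points are assumptions to verify against the tokenizer in use, and the function name is hypothetical.

```python
def token_texts(text, offsets):
    """Recover the substring of `text` covered by each token.

    Assumptions: `offsets` holds (char_start, char_end) pairs with an
    exclusive end, and (0, 0) marks a special token to skip.
    """
    pieces = []
    for start, end in offsets:
        if start == 0 and end == 0:
            continue  # special token: does not map into the input text
        pieces.append(text[start:end])
    return pieces
```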
-
save(output_directory, save_tokenizer=False)
Save the transformers model and, optionally, the tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method can save it as well if needed. Both the tokenizer files and the model files are saved to the output directory. The output directory can then be passed to the “load()” method of the inheriting classes (as the model and tokenizer arguments).
- Parameters
- output_directory: str
Path to the directory in which to save the model.
- save_tokenizer: Boolean
Whether to save the tokenizer in the directory as well. Defaults to False.
-
class mangoes.modeling.finetuning.TransformerForTokenClassification(pretrained_model, pretrained_tokenizer, labels=None, label2id=None, device=None, **keyword_args)
Bases: mangoes.modeling.transformer_base.PipelineMixin, mangoes.modeling.transformer_base.TransformerModel
Transformer model with a token classification head. For a list of eligible model classes, see: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForTokenClassification.from_config.config
- Variables
self.tokenizer – transformers.AutoTokenizerFast object, see https://huggingface.co/docs/transformers/model_doc/auto for more documentation
self.model – transformers.AutoModelForTokenClassification object, see https://huggingface.co/docs/transformers/model_doc/auto
- Parameters
- pretrained_model: str or transformers.PreTrainedModel subclass
- Either:
A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained model uploaded by a user to the Hugging Face model hub, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated transformer model, either a transformers.AutoModelForTokenClassification compatible class or a mangoes enhanced language model.
- pretrained_tokenizer: str or transformers.PreTrainedTokenizerBase subclass
- Either:
A string with the shortcut name of a pretrained tokenizer to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained tokenizer uploaded by a user to the Hugging Face model hub, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing tokenizer files saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.
- labels: list of str
List of class names. An error is raised if both labels and label2id are None.
- label2id: dict of str -> int
Dict mapping each class name to its index in the output layer. If None, it is created from labels.
- device: int, optional, defaults to None
If -1, use the CPU; if >= 0, use the CUDA device with that number. If None, use a GPU if one is available.
- **keyword_args: arguments passed to transformers.AutoConfig
Methods
generate_outputs(text[, pre_tokenized, …]): Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.
predict(inputs): Predict classes for each token of the input texts.
save(output_directory[, save_tokenizer]): Save the transformers model and, optionally, the tokenizer.
train([output_dir, train_text, …]): Fine-tune a transformer model on a token classification dataset.
-
predict(inputs)
Predict classes for each token of the input texts.
- Parameters
- inputs: str or list of str
Input text(s) to classify.
- Returns
- List of list of dict. For each sequence, a list of token prediction dicts. If a single sequence is passed as input, the output is a list containing one list of dicts.
- For each token, a dict containing:
word (str) – The token/word classified.
score (float) – The corresponding probability for entity.
entity (str) – The entity predicted for that token/word.
index (int, only present when self.grouped_entities=False) – The index of the corresponding token in the sentence.
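The per-token dictionaries described above can be sketched from per-token logits as follows. This is an illustration only: the function name, the plain-list inputs, and the use of the list position as index are assumptions; the actual pipeline may number tokens differently (for example, counting special tokens).

```python
import math


def token_predictions(words, sequence_logits, id2label):
    """Build one prediction dict per token from its row of class scores."""
    results = []
    for index, (word, logits) in enumerate(zip(words, sequence_logits)):
        # Softmax over the classes for this token (numerically stable form).
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        best = max(range(len(logits)), key=lambda i: logits[i])
        results.append(
            {
                "word": word,
                "score": exps[best] / total,
                "entity": id2label[best],
                "index": index,  # assumption: position in the input list
            }
        )
    return results
```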
-
train(output_dir=None, train_text=None, train_targets=None, eval_text=None, eval_targets=None, max_len=None, freeze_base=False, task_learn_rate=None, collator=None, train_dataset=None, eval_dataset=None, compute_metrics=None, trainer=None, **training_args)
Fine-tune a transformer model on a token classification dataset.
- Parameters
- output_dir: str
Path to the output directory where the model predictions and checkpoints will be written. Used to instantiate the Trainer if the trainer argument is None.
- train_text: List[str]
List of training texts.
- train_targets: List[List[int]]
Corresponding list of classes for each token in each training text.
- eval_text: str or List[str]
List of evaluation texts.
- eval_targets: List[List[int]]
Corresponding list of classes for each token in each evaluation text.
- max_len: int
Maximum length of the input sequence. Defaults to self.tokenizer.max_length() if None.
- freeze_base: Boolean
Whether to freeze the weights of the base model, so that training only changes the task head weights. If True, the requires_grad flag of the base model's parameters is set to False before training.
- task_learn_rate: float
Learning rate to use for the task-specific parameters (base parameters use the normal learning rate, i.e., the one defined in **training_args). If None, all parameters use the same normal learning rate.
- collator: Transformers.DataCollator
Custom collator to use. If None, transformers.DataCollatorForTokenClassification is used.
- train_dataset, eval_dataset: torch.Dataset
Instantiated custom dataset objects.
- compute_metrics: function
The function used to compute metrics at evaluation. Must return a dictionary mapping strings to metric values. Used by the trainer, see https://huggingface.co/transformers/training.html#trainer for more info.
- trainer: Transformers.Trainer
Custom instantiated trainer to use.
- training_args:
Keyword arguments for training. For a complete list, see https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
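How freeze_base and task_learn_rate interact can be sketched without any deep learning framework. The Param class, the head_prefix argument, and the "classifier" parameter name below are hypothetical stand-ins; the group layout (a list of dicts with "params" and "lr" keys) mirrors the common optimizer convention and is an assumption about how such a split could be expressed, not a description of the mangoes internals.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for a model parameter."""
    name: str
    requires_grad: bool = True


def build_param_groups(params, head_prefix, learn_rate,
                       freeze_base=False, task_learn_rate=None):
    """Split parameters into base and task-head groups with their learning rates."""
    base = [p for p in params if not p.name.startswith(head_prefix)]
    head = [p for p in params if p.name.startswith(head_prefix)]
    if freeze_base:
        for p in base:
            p.requires_grad = False  # base weights stay fixed during training
    # Without a task-specific rate, the head uses the normal learning rate.
    head_lr = learn_rate if task_learn_rate is None else task_learn_rate
    groups = []
    trainable_base = [p for p in base if p.requires_grad]
    if trainable_base:
        groups.append({"params": trainable_base, "lr": learn_rate})
    groups.append({"params": head, "lr": head_lr})
    return groups
```

With freeze_base=True only the head group remains, so training changes the task head weights alone; with task_learn_rate set, the head trains at its own rate while the base keeps the normal one.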
-
generate_outputs(text, pre_tokenized=False, output_attentions=False, output_hidden_states=False, word_embeddings=False, **tokenizer_inputs)
Tokenize input text and pass it through the model, optionally outputting hidden states or attention matrices.
- Parameters
- text: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]
The text to compute features for.
- pre_tokenized: Boolean
Whether or not the input text is pre-tokenized (i.e., split on spaces).
- output_attentions: Boolean, optional, defaults to False
Whether or not to return the attention tensors of all attention layers.
- output_hidden_states: Boolean, optional, defaults to False
Whether or not to return the hidden states of all layers.
- word_embeddings: Boolean
Whether or not to filter out special token embeddings and average sub-word embeddings (hidden states) into word embeddings. Only used if output_hidden_states=True. If the input is pre-tokenized, the sub-word embeddings are averaged into the tokens passed as input. If pre_tokenized=False, the text is split on whitespace and the sub-word embeddings are averaged back into the resulting words. If False, the number of output embeddings can be greater than (number of words + special tokens). If True, the number of output embeddings equals the number of words: sub-word embeddings are averaged into word-level embeddings and special token embeddings are excluded.
- tokenizer_inputs: arguments passed to the tokenizer, such as presaved entity annotations for enhanced models.
- Returns
- Dict containing (note that if a single text sequence is passed as input, the batch size will be 1):
- hidden_states: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)
Hidden states of the model at the output of each layer, plus the initial embedding outputs. Only returned if output_hidden_states is True. If word_embeddings is True, the sequence length is the number of words in the longest sentence, and shorter sequences are padded with zeros.
- attentions: Tuple (one for each layer) of torch.FloatTensor of shape (batch_size, num_heads, sequence_length, sequence_length)
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True.
- offset_mappings: Tensor of shape (batch_size, sequence_length, 2)
Tensor containing (char_start, char_end) for each token, giving the start and end character indices of the token in the input string. If the input is pre-tokenized, the start and end indices map into the associated word. Special tokens are included with 0 for both indices, because they are added inside the function and do not map into the input text. This output is only available for tokenizers that inherit from transformers.PreTrainedTokenizerFast, which covers most common tokenizers but not every tokenizer in the library. If the tokenizer does not inherit from this class, this value will be None.
- if TransformerForFeatureExtraction:
- last_hidden_state: torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)
Sequence of hidden states at the output of the last layer of the model.
- pooler_output: torch.FloatTensor of shape (batch_size, hidden_size)
Last-layer hidden state of the first token of the sequence (the classification token), further processed by a Linear layer and a Tanh activation function.
- if TransformerForSequenceClassification:
- logits: torch.FloatTensor of shape (batch_size, config.num_labels)
Classification scores, before softmax.
- if TransformerForTokenClassification:
- logits: torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)
Classification scores, before softmax.
-
save(output_directory, save_tokenizer=False)
Save the transformers model and, optionally, the tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method can save it as well if needed. Both the tokenizer files and the model files are saved to the output directory. The output directory can then be passed to the “load()” method of the inheriting classes (as the model and tokenizer arguments).
- Parameters
- output_directory: str
Path to the directory in which to save the model.
- save_tokenizer: Boolean
Whether to save the tokenizer in the directory as well. Defaults to False.
-
class mangoes.modeling.finetuning.TransformerForQuestionAnswering(pretrained_model, pretrained_tokenizer, device=None, **keyword_args)
Bases: mangoes.modeling.transformer_base.PipelineMixin, mangoes.modeling.transformer_base.TransformerModel
Transformer model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). For a list of eligible model classes, see: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForQuestionAnswering.from_config.config
- Variables
self.tokenizer – transformers.AutoTokenizerFast object, see https://huggingface.co/docs/transformers/model_doc/auto for more documentation
self.model – transformers.AutoModelForQuestionAnswering object, see https://huggingface.co/docs/transformers/model_doc/auto
- Parameters
- pretrained_model: str or transformers.PreTrainedModel subclass
- Either:
A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained model uploaded by a user to the Hugging Face model hub, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated transformer model, either a transformers.AutoModelForQuestionAnswering compatible class or a mangoes enhanced language model.
- pretrained_tokenizer: str or transformers.PreTrainedTokenizerBase subclass
- Either:
A string with the shortcut name of a pretrained tokenizer to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained tokenizer uploaded by a user to the Hugging Face model hub, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing tokenizer files saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.
- device: int, optional, defaults to None
If -1, use the CPU; if >= 0, use the CUDA device with that number. If None, use a GPU if one is available.
- **keyword_args: arguments passed to transformers.AutoConfig
Methods
generate_outputs(question, context[, …]): Tokenize questions and contexts and pass them through the transformer model and QA head, optionally outputting hidden states or attention matrices.
predict([inputs, question, context]): Answer the question(s) given as inputs by using the context(s).
save(output_directory[, save_tokenizer]): Save the transformers model and, optionally, the tokenizer.
train([output_dir, train_question_texts, …]): Fine-tune a transformer model on a question answering dataset.
-
generate_outputs
(question, context, pre_tokenized=False, output_attentions=False, output_hidden_states=False, word_embeddings=False, doc_stride=128, **tokenizer_inputs)¶ Tokenize questions and context and pass them through the transformer model and QA head, optionally outputting hidden states or attention matrices. Works for single question/context or batch. If a single question/context is given, a batch of size 1 will be created. Note that offset_mappings are not returned for this class method: only tokenizers that inherit from transformers.PreTrainedTokenizerFast have this functionality, and these tokenizers are not compatible with QuestionAnswering functionality (see https://github.com/huggingface/transformers/issues/7735, and https://github.com/huggingface/transformers/issues/8787#issuecomment-779213050 for more details).
- Parameters
- question: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]
The question text
- context: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]
The context text
- pre_tokenized: Boolean
Whether or not the input text is pretokenized (ie, split on spaces)
- output_attentions: Boolean, optional, defaults to False
Whether or not to return the attentions tensors of all attention layers.
- output_hidden_states: Boolean, optional, defaults to False
Whether or not to return the hidden states of all layers.
- word_embeddings: Boolean
Whether or not to filter special token embeddings and average subword embeddings (hidden states) into word embeddings. This functionality is not available for this task class. Use the feature extraction class instead.
- doc_stride: int, defaults to 128
If the context is too long to fit with the question for the model, it will be split in several chunks with some overlap. This argument controls the size of that overlap.
- tokenizer_inputs: tokenizer_inputs include arguments passed to tokenizer, such as presaved entity annotations
for enhanced models.
- Returns
- Dict containing (note that if a single text sequence is passed as input, batch size will be 1):
- start_logits: torch.FloatTensor of shape (batch_size, sequence_length)
Span-start scores (before SoftMax).
- end_logits: torch.FloatTensor of shape (batch_size, sequence_length)
Span-end scores (before SoftMax).
- offset_mappings: Tensor of shape (batch_size, sequence_length, 2)
Note that unlike other fine tuning classes and the base class, offset mappings are not returned for this class. See the class documentation for more details.
- hidden_states: Tuple (one for each layer) of torch.FloatTensor (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs. Only returned if output_hidden_states is True
- attentions: Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length,
sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True
-
predict
(inputs=None, question=None, context=None, **kwargs)¶ Answer the question(s) given as inputs by using the context(s). Takes either transformers.SquadExample (or list of them) or lists of strings, see argument documentation.
- Parameters
- inputs: transformers.SquadExample or a list of SquadExample
One or several SquadExample containing the question and context.
- question: str or List[str]
One or several question(s) (must be used in conjunction with the context argument).
- context: str or List[str]
One or several context(s) associated with the question(s) (must be used in conjunction with the question argument).
- kwargs include:
- topk: int, defaults to 1
The number of answers to return (will be chosen by order of likelihood).
- doc_stride: int, defaults to 128
If the context is too long to fit with the question for the model, it will be split in several chunks with some overlap. This argument controls the size of that overlap.
- max_answer_len: int, defaults to 15
The maximum length of predicted answers (i.e., only answers with a shorter length are considered).
- max_seq_len: int, defaults to 384
The maximum length of the total sentence (context + question) after tokenization. The context will be split in several chunks (using doc_stride) if needed.
- max_question_len: int, defaults to 64
The maximum length of the question after tokenization. It will be truncated if needed.
- handle_impossible_answer: bool, defaults to False
Whether or not we accept impossible as an answer.
- Returns
- Returns a list of answers, one for each question. If only one question has been passed as input, returns a list with one dictionary.
- Each answer is a dict with:
- score: float
The probability associated with the answer.
- start: int
The start index of the answer (in the tokenized version of the input).
- end: int
The end index of the answer (in the tokenized version of the input).
- answer: str
The answer to the question.
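The doc_stride and max_seq_len arguments describe splitting a long context into overlapping chunks. The sketch below shows that idea on a plain token list. It is an illustration only: split_with_overlap and max_context_len are hypothetical names, and in a real pipeline the room left for the context would be max_seq_len minus the question length and any special tokens.

```python
def split_with_overlap(context_tokens, max_context_len, doc_stride):
    """Split tokens into windows of at most max_context_len tokens.

    Consecutive windows share doc_stride tokens, so an answer that
    straddles a boundary is still fully contained in some window.
    """
    if not 0 <= doc_stride < max_context_len:
        raise ValueError("doc_stride must be >= 0 and smaller than max_context_len")
    step = max_context_len - doc_stride
    chunks = []
    start = 0
    while True:
        chunks.append(context_tokens[start:start + max_context_len])
        if start + max_context_len >= len(context_tokens):
            break  # the last window reaches the end of the context
        start += step
    return chunks

tokens = [f"tok{i}" for i in range(10)]
chunks = split_with_overlap(tokens, max_context_len=4, doc_stride=2)
for chunk in chunks:
    print(chunk)
```

A larger doc_stride means more overlap and therefore more chunks per context, which costs more compute but makes it less likely that an answer is cut in half.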
-
train
(output_dir=None, train_question_texts=None, eval_question_texts=None, train_context_texts=None, eval_context_texts=None, train_answer_texts=None, eval_answer_texts=None, train_start_indices=None, eval_start_indices=None, max_seq_length=384, doc_stride=128, max_query_length=64, freeze_base=False, task_learn_rate=None, collator=None, train_dataset=None, eval_dataset=None, compute_metrics=None, trainer=None, **training_args)¶ Fine tune a transformers model on a question answering dataset
- Parameters
- output_dir: str
Path to the output directory where the model predictions and checkpoints will be written. Used to instantiate Trainer if trainer argument is None.
- train_question_texts, eval_question_texts: list of str
The texts corresponding to the questions
- train_context_texts, eval_context_texts: list of str
The texts corresponding to the contexts
- train_answer_texts, eval_answer_texts: list of str
The texts corresponding to the answers
- train_start_indices, eval_start_indices: list of int
The character positions of the start of the answers
- max_seq_length:int
The maximum total input sequence length after tokenization.
- doc_stride: int
When splitting up a long document into chunks, how much stride to take between chunks.
- max_query_length: int
The maximum number of tokens for the question.
- freeze_base: Boolean
Whether to freeze the weights of the base model, so training only changes the task head weights. If true, the requires_grad flag for parameters of the base model will be set to false before training.
- task_learn_rate: float
Learning rate for the task-specific parameters. Base parameters use the regular learning rate, i.e. the one defined in **training_args. If None, all parameters use that regular learning rate.
- collator: Transformers.DataCollator
custom collator to use
- train_dataset, eval_dataset: torch.Dataset
instantiated custom dataset object
- compute_metrics: function
The function that will be used to compute metrics at evaluation. Must return a dictionary mapping metric names (strings) to metric values. Used by the trainer, see https://huggingface.co/transformers/training.html#trainer for more info.
- trainer: Transformers.Trainer
custom instantiated trainer to use
- training_args:
keyword arguments for training. For complete list, see https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
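The question, context, answer and start-index arguments are parallel lists, with each start index giving the character position of the answer inside its context. A minimal sketch of preparing such lists follows. The helper build_qa_training_lists is hypothetical, and locating the answer with str.find (first occurrence) is an assumption: datasets such as SQuAD usually ship the offset directly.

```python
def build_qa_training_lists(records):
    """Turn (question, context, answer) records into four parallel lists."""
    questions, contexts, answers, start_indices = [], [], [], []
    for question, context, answer in records:
        start = context.find(answer)  # character offset of the first occurrence
        if start == -1:
            raise ValueError(f"answer {answer!r} not found in context")
        questions.append(question)
        contexts.append(context)
        answers.append(answer)
        start_indices.append(start)
    return questions, contexts, answers, start_indices

records = [
    ("Where is the Eiffel Tower?", "The Eiffel Tower is in Paris.", "Paris"),
    ("Who wrote Hamlet?", "Hamlet was written by Shakespeare.", "Shakespeare"),
]
questions, contexts, answers, starts = build_qa_training_lists(records)

# Sanity check: each start index really points at the answer text.
for context, answer, start in zip(contexts, answers, starts):
    assert context[start:start + len(answer)] == answer
```

The resulting lists would then correspond to train_question_texts, train_context_texts, train_answer_texts and train_start_indices respectively.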
-
save
(output_directory, save_tokenizer=False)¶ Method to save transformers model and optionally save tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method will save the tokenizer as well if needed. Both the tokenizer files and model files will be saved to the output directory. The output directory can then be passed as an argument to the “load()” method of the inheriting classes (for the model and tokenizer arguments).
- Parameters
- output_directory: str
path to directory to save model
- save_tokenizer: Boolean
whether to save tokenizer in directory or not, defaults to False
-
class
mangoes.modeling.finetuning.
TransformerForMultipleChoice
(pretrained_model, pretrained_tokenizer, device=None, **keyword_args)¶ Bases:
mangoes.modeling.transformer_base.TransformerModel
Pretrained model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax) e.g. for RocStories/SWAG tasks.
For a list of eligible model classes, see: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForMultipleChoice.from_config.config
For information on how multiple choice datasets should be formatted for fine-tuning, see this explanation: https://github.com/google-research/bert/issues/38
And this link for explanation of Huggingface’s multiple choice models: https://github.com/huggingface/transformers/issues/7701#issuecomment-707149546
- Variables
self.tokenizer – transformers.AutoTokenizerFast object, see https://huggingface.co/docs/transformers/model_doc/auto for more documentation
self.model – transformers.AutoModelForMultipleChoice object, see https://huggingface.co/docs/transformers/model_doc/auto
- Parameters
- pretrained_model: str, transformers.PretrainedModel subclass.
- Either:
A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated transformer model, either a transformers.AutoModelForMultipleChoice compatible class or a mangoes enhanced language model.
- pretrained_tokenizer: str, transformers.PreTrainedTokenizerBase subclass.
- Either:
A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.
- device: int, optional, defaults to None
if -1, use cpu, if >= 0, use CUDA device number. If None, will use GPU if available
- **keyword_args include arguments passed to transformers.AutoConfig
Methods
generate_outputs
(questions, choices[, …])Tokenize context and choices and pass them through the model and MC head, optionally outputting hidden states or attention matrices.
predict
(questions, choices[, pre_tokenized])Predicts the answer to the question(s) out of the possible choices.
save
(output_directory[, save_tokenizer])Method to save transformers model and optionally save tokenizer.
train
([output_dir, train_question_texts, …])Fine tune a pretrained model on a multiple choice dataset
-
generate_outputs
(questions, choices, pre_tokenized=False, output_attentions=False, output_hidden_states=False, word_embeddings=False, **tokenizer_inputs)¶ Tokenize context and choices and pass them through the model and MC head, optionally outputting hidden states or attention matrices. Works for a single question/set of choices or a batch. If a single question/set of choices is given, a batch of size 1 will be created.
- Follows these explanations for packing MC data and sending it through the model:
https://github.com/google-research/bert/issues/38 https://github.com/huggingface/transformers/issues/7701#issuecomment-7071495
- Parameters
- questions: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]
The question text. This can include the context together with a question, or (in the case of some datasets such as the SWAG dataset) just the context if there is no direct question. Can be a single question or list of questions.
- choices: List[str] or List[List[str]] if pre_tokenized=False, else List[List[str]] or List[List[List[str]]]
The choices text. One instance of choices should be a list of strings (if not pre-tokenized) or a list of list of strings (if pre-tokenized). Can be a single choice instance or multiple. If batch is passed in (ie more than one question), assumes all questions have same number of choices.
- pre_tokenized: Boolean
Whether or not the input text is pre-tokenized (ie, split on spaces)
- output_attentions: Boolean, optional, defaults to False
Whether or not to return the attentions tensors of all attention layers.
- output_hidden_states: Boolean, optional, defaults to False
Whether or not to return the hidden states of all layers.
- word_embeddings: Boolean
Whether or not to filter special token embeddings and average subword embeddings (hidden states) into word embeddings. This functionality is not available for this task class. Use the feature extraction class instead.
- tokenizer_inputs: tokenizer_inputs include arguments passed to tokenizer, such as presaved entity annotations
for enhanced models.
- Returns
- Dict containing (note that if single question/set of choices is passed as input, batch size will be 1):
- logits: Tensor of shape (batch_size, num_choices)
Classification scores (before SoftMax).
- offset_mappings: Tensor of shape (batch_size, num_choices, sequence_length, 2)
Tensor containing (char_start, char_end) for each token, giving index into input strings of start and end character for each token. If input is pre-tokenized, start and end index maps into associated word. Note that special tokens are included with 0 for start and end indices, as these don’t map into input text because they are added inside the function. Offset mappings for both questions and choices are merged to one row, but indices still align to them separately.
- hidden_states: Tuple (one for each layer) of torch.FloatTensor of size
(batch_size, num_choices, sequence_length, hidden_size) Hidden-states of the model at the output of each layer plus the initial embedding outputs. Only returned if output_hidden_states is True
- attentions: Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_choices, num_heads,
sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True
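The (batch_size, num_choices, ...) shapes above come from pairing every question with each of its choices. The sketch below illustrates that packing with plain lists. The helper names pack_multiple_choice and unflatten are hypothetical, and the per-pair scores are made up: in practice they would come from the model.

```python
def pack_multiple_choice(questions, choices):
    """Pair every question with each of its choices and flatten the pairs.

    Returns two flat lists of length batch_size * num_choices (the first
    and second sequence of each pair) together with num_choices.
    """
    num_choices = len(choices[0])
    if any(len(c) != num_choices for c in choices):
        raise ValueError("all questions must have the same number of choices")
    first = [q for q, cs in zip(questions, choices) for _ in cs]
    second = [c for cs in choices for c in cs]
    return first, second, num_choices

def unflatten(flat, num_choices):
    """Regroup a flat list of per-pair values into rows of num_choices."""
    return [flat[i:i + num_choices] for i in range(0, len(flat), num_choices)]

questions = ["She opened the door and", "He picked up the guitar and"]
choices = [["walked in.", "ate the moon."], ["began to play.", "flew away."]]
first, second, num_choices = pack_multiple_choice(questions, choices)

flat_scores = [2.0, -1.0, 1.5, 0.5]  # hypothetical score for each (question, choice) pair
logits = unflatten(flat_scores, num_choices)
print(logits)
```

Each row of the regrouped result holds the scores of one question's choices, which matches the (batch_size, num_choices) layout of the logits entry.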
-
predict
(questions, choices, pre_tokenized=False)¶ Predicts the answer to the question(s) out of the possible choices.
- Parameters
- questions: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]
The question text. This can include the context together with a question, or (in the case of some datasets such as the SWAG dataset) just the context if there is no direct question. Can be a single question or list of questions.
- choices: List[str] or List[List[str]] if pre_tokenized=False, else List[List[str]] or List[List[List[str]]]
The choices text. One instance of choices should be a list of strings (if not pre-tokenized) or a list of list of strings (if pre-tokenized). Can be a single choice instance or multiple. If batch is passed in (ie more than one question), assumes all questions have same number of choices.
- pre_tokenized: Boolean
Whether or not the input text is pre-tokenized (ie, split on spaces)
- Returns
- List of answer prediction dicts, one for each question (returns a list of length 1 if a single question is passed as input). Answer prediction dicts include:
- score: float
The probability associated with the answer.
- answer_index: int
The index of the predicted choice.
- answer_text: str (if not pre_tokenized) or List[str] (if pre_tokenized)
The text corresponding to the predicted answer.
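To make the shape of the prediction dicts concrete, the following sketch derives score, answer_index and answer_text from a row of choice logits. It is an illustration under assumptions, not the method's actual code: predict_from_logits is a hypothetical helper, the logits are invented, and taking the softmax probability of the best choice as the score is assumed.

```python
import math

def predict_from_logits(logits, choices):
    """Build one prediction dict per question from its row of choice logits."""
    results = []
    for row, options in zip(logits, choices):
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]  # stable softmax
        total = sum(exps)
        probs = [e / total for e in exps]
        index = max(range(len(probs)), key=probs.__getitem__)
        results.append({
            "score": probs[index],
            "answer_index": index,
            "answer_text": options[index],
        })
    return results

choices = [["walked in.", "ate the moon."], ["flew away.", "began to play."]]
logits = [[2.0, -1.0], [0.5, 1.5]]  # hypothetical model outputs
predictions = predict_from_logits(logits, choices)
for prediction in predictions:
    print(prediction)
```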
-
train
(output_dir=None, train_question_texts=None, eval_question_texts=None, train_choices_texts=None, eval_choices_texts=None, train_labels=None, eval_labels=None, max_len=None, freeze_base=False, task_learn_rate=None, collator=None, train_dataset=None, eval_dataset=None, compute_metrics=None, trainer=None, **training_args)¶ Fine tune a pretrained model on a multiple choice dataset
- Parameters
- output_dir: str
Path to the output directory where the model predictions and checkpoints will be written. Used to instantiate Trainer if trainer argument is None.
- train_question_texts, eval_question_texts: list of str
The texts corresponding to the questions/contexts.
- train_choices_texts, eval_choices_texts: list of str
The texts corresponding to the answer choices
- train_labels, eval_labels: list of int
The indices of the correct answers
- max_len:int
The maximum total input sequence length after tokenization. Note that if a question answer pair sequence is longer than this length, it will be truncated.
- freeze_base: Boolean
Whether to freeze the weights of the base model, so training only changes the task head weights. If true, the requires_grad flag for parameters of the base model will be set to false before training.
- task_learn_rate: float
Learning rate for the task-specific parameters. Base parameters use the regular learning rate, i.e. the one defined in **training_args. If None, all parameters use that regular learning rate.
- collator: Transformers.DataCollator
custom collator to use
- train_dataset, eval_dataset: torch.Dataset
instantiated custom dataset object
- compute_metrics: function
The function that will be used to compute metrics at evaluation. Must return a dictionary mapping metric names (strings) to metric values. Used by the trainer, see https://huggingface.co/transformers/training.html#trainer for more info.
- trainer: Transformers.Trainer
custom instantiated trainer to use
- training_args:
keyword arguments for training. For complete list, see https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
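A compute_metrics function receives the evaluation predictions and returns a dictionary of named metric values. Below is a minimal accuracy sketch for the multiple choice setting. The namedtuple is only a stand-in so the example runs without any library installed: it assumes the trainer passes an object exposing predictions and label_ids, as transformers.EvalPrediction does. The name compute_accuracy is hypothetical.

```python
from collections import namedtuple

# Stand-in for the object handed to compute_metrics (assumed to expose
# `predictions` and `label_ids`, like transformers.EvalPrediction).
EvalPrediction = namedtuple("EvalPrediction", ["predictions", "label_ids"])

def compute_accuracy(eval_prediction):
    """Return {"accuracy": share of rows whose highest logit matches the label}."""
    predictions = eval_prediction.predictions
    labels = eval_prediction.label_ids
    correct = 0
    for row, label in zip(predictions, labels):
        row = list(row)
        predicted = row.index(max(row))  # index of the best-scoring choice
        correct += int(predicted == label)
    return {"accuracy": correct / len(labels) if len(labels) else 0.0}

example = EvalPrediction(
    predictions=[[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]],
    label_ids=[1, 0, 1],
)
metrics = compute_accuracy(example)
print(metrics)
```

Such a function would be passed as the compute_metrics argument of train.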
-
save
(output_directory, save_tokenizer=False)¶ Method to save transformers model and optionally save tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method will save the tokenizer as well if needed. Both the tokenizer files and model files will be saved to the output directory. The output directory can then be passed as an argument to the “load()” method of the inheriting classes (for the model and tokenizer arguments).
- Parameters
- output_directory: str
path to directory to save model
- save_tokenizer: Boolean
whether to save tokenizer in directory or not, defaults to False
-
class
mangoes.modeling.finetuning.
TransformerForCoreferenceResolution
(pretrained_model, pretrained_tokenizer, device=None, max_span_width=30, ffnn_hidden_size=1000, top_span_ratio=0.4, max_top_antecendents=50, use_metadata=False, metadata_feature_size=20, genres=('bc', 'bn', 'mz', 'nw', 'pt', 'tc', 'wb'), max_training_segments=5, coref_depth=1, coref_dropout=0.3, pretrained=False, **keyword_args)¶ Bases:
mangoes.modeling.transformer_base.TransformerModel
Class for fine tuning a transformer model for the coreference resolution task.
The base model is an implementation of the independent variant of https://arxiv.org/pdf/1908.09091.pdf, which uses the fine tuning procedure described in https://arxiv.org/pdf/1804.05392.pdf
- Variables
self.tokenizer – transformers.AutoTokenizerFast object, see https://huggingface.co/docs/transformers/model_doc/auto for more documentation
self.model – transformers.AutoModel object, see https://huggingface.co/docs/transformers/model_doc/auto
- Parameters
- pretrained_model: str, transformers.PretrainedModel subclass.
- Either:
A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated transformer model, either a transformers compatible class or a mangoes enhanced language model.
- pretrained_tokenizer: str, transformers.PreTrainedTokenizerBase subclass.
- Either:
A string with the shortcut name of a pretrained model to load from cache or download, e.g., bert-base-uncased.
A string with the identifier name of a pretrained model that was user-uploaded to our S3, e.g., dbmdz/bert-base-german-cased.
A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
An already instantiated tokenizer, either a transformers.AutoTokenizer compatible class or a mangoes enhanced language model tokenizer.
- device: int, optional, defaults to None
If -1, use cpu, if >= 0, use CUDA device number. If None, will use GPU if available.
- max_span_width: int, defaults to 30
Maximum width (consecutive tokens) of candidate span.
- ffnn_hidden_size: int, defaults to 1000
Size of hidden layers in dense mention scorer and slow antecedent scorer heads.
- top_span_ratio: float, defaults to 0.4
Ratio of spans to consider after first sort on mention score.
- max_top_antecendents: int, defaults to 50
Max number of antecedents to consider for each span after fast antecedent scorer.
- use_metadata: Boolean, defaults to False
Whether to use metadata (speaker and genre information) in forward pass.
- metadata_feature_size: int, defaults to 20
Size of metadata features (if using metadata)
- genres: List of string, defaults to (“bc”, “bn”, “mz”, “nw”, “pt”, “tc”, “wb”)
List of possible genres (if using metadata). Defaults to genres in Ontonotes dataset.
- max_training_segments: int, defaults to 5
Maximum number of segments in one document (aka one batch).
- coref_depth: int, defaults to 1
Depth of higher order (aka slow) antecedent scoring.
- coref_dropout: float, defaults to 0.3
Dropout probability for head layers.
- **keyword_args
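To give a feel for what max_span_width and top_span_ratio control, here is a small sketch of candidate span enumeration and pruning. It is not the model's implementation: both helper names are hypothetical, the mention scores are invented, and computing the pruning budget as top_span_ratio times the number of tokens is an assumption.

```python
def enumerate_candidate_spans(num_tokens, max_span_width=30):
    """List every (start, end) span, end inclusive, of at most max_span_width tokens."""
    return [
        (start, end)
        for start in range(num_tokens)
        for end in range(start, min(num_tokens, start + max_span_width))
    ]

def keep_top_spans(spans, mention_scores, num_tokens, top_span_ratio=0.4):
    """Keep the best-scoring spans, at most top_span_ratio * num_tokens of them."""
    budget = max(1, int(num_tokens * top_span_ratio))  # assumed budget rule
    ranked = sorted(zip(spans, mention_scores), key=lambda pair: pair[1], reverse=True)
    return sorted(span for span, _ in ranked[:budget])

spans = enumerate_candidate_spans(num_tokens=5, max_span_width=2)
scores = [float(len(spans) - i) for i in range(len(spans))]  # hypothetical mention scores
top = keep_top_spans(spans, scores, num_tokens=5)
print(len(spans), top)
```

Raising max_span_width grows the number of candidates roughly linearly with the width, while top_span_ratio bounds how many of them reach the antecedent scoring stage.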
Methods
generate_outputs
(text[, pre_tokenized, …])Pass one batch (document) worth of text through the co-reference model, outputting mention scores and token indices for possible spans, the indices and scores of the top antecedents for the top mention scored spans, and optionally the hidden states or attention matrices from the base model.
predict
(text[, pre_tokenized, speaker_ids, …])Predict the co-reference clusters in text.
save
(output_directory[, save_tokenizer])Method to save transformers model and optionally save tokenizer.
train
([output_dir, max_segment_len, …])Fine tune a pretrained model on a co-reference resolution dataset.
-
save
(output_directory, save_tokenizer=False)¶ Method to save transformers model and optionally save tokenizer. The tokenizer is already saved (the input to this class includes a pretrained tokenizer), but this method will save the tokenizer as well if needed. Both the tokenizer files and model files will be saved to the output directory. The output directory can then be passed as an argument to the “load()” method of the inheriting classes (for the model and tokenizer arguments).
- Parameters
- output_directory: str
path to directory to save model
- save_tokenizer: Boolean
whether to save tokenizer in directory or not, defaults to False
-
predict
(text, pre_tokenized=False, speaker_ids=None, genre=None, max_segment_len=256, max_segments=5)¶ Predict the co-reference clusters in text. Takes one document at a time. Internally calls generate_outputs and then processes the outputs.
- Parameters
- text: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]
The text to predict co-references for. If pre_tokenized, the text can be one sentence (list of words) or list of sentences (list of list of words). If not pre_tokenized, text can be one sentence (str) or list of sentences (list of strings).
- pre_tokenized: Boolean
Whether or not the input text is pretokenized (ie, split on spaces). This method will still pass it through the tokenizer, in order to get subtokens, special characters, and attention masks.
- speaker_ids: int or List[int] if pre_tokenized=False, else List[int] or List[List[int]]
Speaker ids for input text. If pre_tokenized, speaker_ids should be for each word in each sentence of input text (ie, list of int if one sentence, or list of list of int if multiple). If not pre_tokenized, speaker ids should be on a sentence basis, ie one int if one input sentence, or list of ints if multiple. Optional, needed only if the model has been trained/instantiated to accept metadata.
- genre: Int or String
Genre of text. If string, will attempt to use the genre id mapping constructed by the model parameter to this object. Optional, needed only if the model has been trained/instantiated to accept metadata.
- max_segment_len: int, defaults to 256
maximum number of sub-tokens for one segment
- max_segments: int, defaults to 5
Maximum number of segments to return per document
- Returns
- List of dicts.
- For each found co-reference cluster, a dict with the following keys:
- cluster_tokens: List[List[str]]
The text spans associated with the cluster. Spans are represented by the list of tokens.
- cluster_ids: List[List[int]]
The id spans associated with the cluster. Spans are represented by the list of token ids.
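The max_segment_len and max_segments arguments bound how much of a document is processed. The sketch below shows one way such segmentation could work on sentences that are already split into sub-tokens. It is only an illustration: split_into_segments is a hypothetical helper, packing whole sentences greedily is an assumption, and the room normally reserved for special tokens is ignored.

```python
def split_into_segments(sentences, max_segment_len=256, max_segments=5):
    """Greedily pack whole sentences into segments of at most max_segment_len tokens.

    Only the first max_segments segments are kept.
    """
    segments, current = [], []
    for sentence in sentences:
        if len(sentence) > max_segment_len:
            raise ValueError("a single sentence exceeds max_segment_len")
        if current and len(current) + len(sentence) > max_segment_len:
            segments.append(current)  # close the segment before it overflows
            current = []
        current = current + list(sentence)
    if current:
        segments.append(current)
    return segments[:max_segments]

sentences = [["a"] * 3, ["b"] * 2, ["c"] * 4, ["d"], ["e"] * 2]
segments = split_into_segments(sentences, max_segment_len=5, max_segments=2)
print(segments)
```

In this toy run the third segment is dropped because max_segments is 2, which mirrors how text beyond the segment limit would not contribute to the predicted clusters.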
-
generate_outputs
(text, pre_tokenized=False, speaker_ids=None, genre=None, output_attentions=False, output_hidden_states=False, word_embeddings=False, max_segment_len=256, max_segments=5, **tokenizer_inputs)¶ Pass one batch (document) worth of text through the co-reference model, outputting mention scores and token indices for possible spans, the indices and scores of the top antecedents for the top mention scored spans, and optionally the hidden states or attention matrices from the base model.
Note that this function does not return “offset_mappings”, and instead returns the flattened ids and text.
- Parameters
- text: str or List[str] if pre_tokenized=False, else List[str] or List[List[str]]
The text to predict co-references for. If pre_tokenized, the text can be one sentence (list of words) or list of sentences (list of list of words). If not pre_tokenized, text can be one sentence (str) or list of sentences (list of strings).
- pre_tokenized: Boolean
Whether or not the input text is pretokenized (ie, split on spaces). This method will still pass it through the tokenizer, in order to get subtokens, special characters, and attention masks.
- speaker_ids: int or List[int] if pre_tokenized=False, else List[int] or List[List[int]]
Speaker ids for input text. If pre_tokenized, speaker_ids should be for each word in each sentence of input text (ie, list of int if one sentence, or list of list of int if multiple). If not pre_tokenized, speaker ids should be on a sentence basis, ie one int if one input sentence, or list of ints if multiple. Optional, needed only if the model has been trained/instantiated to accept metadata.
- genre: Int or String
Genre of text. If string, will attempt to use the genre id mapping constructed by the model parameter to this object. Optional, needed only if the model has been trained/instantiated to accept metadata.
- output_attentions: Boolean, optional, defaults to False
Whether or not to return the attentions tensors of all attention layers.
- output_hidden_states: Boolean, optional, defaults to False
Whether or not to return the hidden states of all layers.
- word_embeddings: Boolean
Whether or not to filter special token embeddings and average subword embeddings (hidden states) into word embeddings. Note: this functionality is not available for this class because of consolidation of input in the forward pass of the model. Consider using mangoes.modeling.TransformerForFeatureExtraction class for word-level feature extractions.
- max_segment_len: int, defaults to 256
maximum number of sub-tokens for one segment
- max_segments: int, defaults to 5
Maximum number of segments to return per document
- tokenizer_inputs: tokenizer_inputs include arguments passed to tokenizer, such as presaved entity annotations
for enhanced models.
- Returns
- Dict containing:
- candidate_starts: tensor of size (num_spans)
start token indices in flattened document of candidate spans
- candidate_ends: tensor of size (num_spans)
end token indices in flattened document of candidate spans
- candidate_mention_scores: tensor of size (num_spans)
mention scores for each candidate span
- top_span_starts: tensor of size (num_top_spans)
start token indices in flattened document of candidate spans with top mention scores
- top_span_ends: tensor of size (num_top_spans)
end token indices in flattened document of candidate spans with top mention scores
- top_antecedents: tensor of shape (num_top_spans, antecedent_candidates)
indices in top span candidates of top antecedents for each mention
- top_antecedent_scores: tensor of shape (num_top_spans, 1 + antecedent_candidates)
final antecedent scores of top antecedents for each mention. The dummy score (for not a co-reference) is inserted at the start of each row. Thus, the score for top_antecedents[0][0] is top_antecedent_scores[0][1]. The span for first candidate is top_span_starts[0] to top_span_ends[0]. The span for the first top antecedent for the first candidate is top_span_starts[top_antecedents[0][0]] to top_span_ends[top_antecedents[0][0]].
- flattened_ids: tensor of shape (num_tokens)
flattened ids of input sentences. The start and end candidate and span indices map into this tensor.
- flattened_text: tensor of shape (num_tokens)
flattened tokens of input sentences. The start and end candidate and span indices map into this tensor.
- hidden_states: Tuple (one for each layer) of torch.FloatTensor (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs. Only returned if output_hidden_states is True
- attentions: Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length,
sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. Only returned if output_attentions is True
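The indexing convention described for top_antecedents and top_antecedent_scores (dummy score in column 0, antecedent scores shifted by one) can be sketched as follows. This is a hedged illustration of greedy decoding, not the code used by predict: decode_clusters is a hypothetical helper, it works on plain Python lists (for example tensors converted with .tolist()), and it assumes every antecedent precedes the span that points to it.

```python
def decode_clusters(top_span_starts, top_span_ends, top_antecedents, top_antecedent_scores):
    """Link each top span to its best antecedent and group linked spans into clusters."""
    cluster_of = {}  # top-span index -> index into clusters
    clusters = []    # each cluster is a list of (start, end) token spans
    for i, scores in enumerate(top_antecedent_scores):
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == 0:
            continue  # column 0 is the dummy score: no antecedent
        antecedent = top_antecedents[i][best - 1]  # scores are shifted by one
        if antecedent not in cluster_of:
            cluster_of[antecedent] = len(clusters)
            clusters.append([(top_span_starts[antecedent], top_span_ends[antecedent])])
        cluster_of[i] = cluster_of[antecedent]
        clusters[cluster_of[i]].append((top_span_starts[i], top_span_ends[i]))
    return clusters

# Toy outputs for four top spans with two antecedent candidates each.
top_span_starts = [0, 3, 5, 8]
top_span_ends = [1, 3, 6, 8]
top_antecedents = [[0, 0], [0, 0], [0, 1], [1, 2]]
top_antecedent_scores = [
    [0.0, -5.0, -5.0],  # span 0: dummy wins
    [0.0, 2.0, -5.0],   # span 1: links to span 0
    [0.0, -1.0, -2.0],  # span 2: dummy wins
    [0.0, 3.0, 1.0],    # span 3: links to span 1
]
clusters = decode_clusters(top_span_starts, top_span_ends, top_antecedents, top_antecedent_scores)
print(clusters)
```

The span indices in each cluster point into flattened_ids and flattened_text, so the corresponding tokens can be looked up there.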
-
train
(output_dir=None, max_segment_len=256, max_segments=5, freeze_base=False, task_learn_rate=None, task_weight_decay=None, train_documents=None, train_cluster_ids=None, train_speaker_ids=None, train_genres=None, eval_documents=None, eval_cluster_ids=None, eval_speaker_ids=None, eval_genres=None, train_dataset=None, eval_dataset=None, compute_metrics=None, trainer=None, **training_args)¶ Fine tune a pretrained model on a co-reference resolution dataset. Users can input the raw coreference data and a torch dataset will be created, or they can input already instantiated dataset(s), or an already instantiated trainer.
Note that this implementation is based on the “independent” variant of the method introduced in https://arxiv.org/pdf/1908.09091.pdf, thus the batch size will be 1 (1 document per batch, with multiple segments per document), and a specific collator function will be used.
- Parameters
- output_dir: str
Path to the output directory where the model predictions and checkpoints will be written. Used to instantiate Trainer if trainer argument is None. Optional: needed if trainer is not provided
- max_segment_len: int, defaults to 256
maximum number of sub-tokens for one segment
- max_segments: int, defaults to 5
Maximum number of segments to return per document
- freeze_base: Boolean
Whether to freeze the weights of the base model, so training only changes the task head weights. If true, the requires_grad flag for parameters of the base model will be set to false before training.
- task_learn_rate: float
Learning rate for the task-specific parameters. Base parameters use the regular learning rate, i.e. the one defined in **training_args. If None, all parameters use that regular learning rate.
- task_weight_decay: float
Weight decay for the task-specific parameters. Base parameters use the regular weight decay, i.e. the one defined in **training_args. If None, all parameters use that regular weight decay.
- train_documents: List of Lists of Lists of strings
Optional: needed if train_dataset or trainer is not provided. Text for each document. As cluster ids are labeled by word, a document is a list of sentences. One sentence is a list of words (ie already split on whitespace/punctuation).
- train_cluster_ids: List of Lists of Lists of (ints or Tuple(int, int))
Optional: needed if train_dataset or trainer is not provided. Cluster ids for each word in documents argument. Assumes words that aren’t mentions have either None or -1 as id. In the case where a word belongs to two different spans (with different cluster ids), the cluster id for word should be a tuple of ints corresponding to the different cluster ids.
- train_speaker_ids: List of Lists of Lists of ints
Optional: needed if train_dataset or trainer is not provided and model is using metadata. Speaker id for each word in documents. Assumes positive ids (special tokens, such as [CLS] and [SEP], that are added at beginning and end of segments will be assigned speaker ids of -1).
- train_genres: List of ints
Optional: needed if train_dataset or trainer is not provided and model is using metadata. Genre id for each document.
- eval_documents: List of Lists of Lists of strings
Optional: needed if train_dataset or trainer is not provided. Text for each document. As cluster ids are labeled by word, a document is a list of sentences. One sentence is a list of words (ie already split on whitespace/punctuation).
- eval_cluster_ids: List of Lists of Lists of (ints or Tuple(int, int))
Optional: needed if train_dataset or trainer is not provided. Cluster ids for each word in documents argument. Assumes words that aren’t mentions have either None or -1 as id. In the case where a word belongs to two different spans (with different cluster ids), the cluster id for word should be a tuple of ints corresponding to the different cluster ids.
- eval_speaker_ids: List of Lists of Lists of ints
Optional: needed if train_dataset or trainer is not provided and model is using metadata. Speaker id for each word in documents. Assumes positive ids (special tokens, such as [CLS] and [SEP], that are added at beginning and end of segments will be assigned speaker ids of -1).
- eval_genres: List of ints
Optional: needed if train_dataset or trainer is not provided and model is using metadata. Genre id for each document.
- train_dataset, eval_dataset: torch.Dataset
instantiated custom dataset object. Note that the model implementation and default trainer (mangoes.modeling.training_utils.CoreferenceFineTuneTrainer) are set up to work with mangoes.modeling.training_utils.MangoesCoreferenceDataset datasets, so take care when sending custom dataset arguments.
- compute_metrics: function
The function that will be used to compute metrics at evaluation. Must return a dictionary mapping metric names (strings) to metric values. Used by the trainer, see https://huggingface.co/transformers/training.html#trainer for more info.
- trainer: Transformers.Trainer
custom instantiated trainer to use
- training_args:
keyword arguments for training. For complete list, see https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
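The nested-list format of train_documents and train_cluster_ids is easier to see on a tiny example. The sketch below builds per-word cluster ids for one document from a list of mention spans. The helper cluster_ids_from_mentions and the mention tuple layout are hypothetical and only serve to illustrate the documented format (-1 for non-mentions, a tuple when a word belongs to two spans with different cluster ids).

```python
def cluster_ids_from_mentions(document, mentions):
    """Build per-word cluster ids for one document.

    document: list of sentences, each a list of words.
    mentions: list of (sentence_index, first_word, last_word, cluster_id),
              with last_word inclusive (a layout assumed for this sketch).
    """
    labels = [[-1] * len(sentence) for sentence in document]
    for sentence_index, first, last, cluster_id in mentions:
        for word_index in range(first, last + 1):
            current = labels[sentence_index][word_index]
            if current == -1:
                labels[sentence_index][word_index] = cluster_id
            elif current == cluster_id:
                continue  # already labeled with the same cluster
            elif isinstance(current, tuple):
                raise ValueError("more than two clusters on one word is not covered here")
            else:
                labels[sentence_index][word_index] = (current, cluster_id)
    return labels

document = [
    ["Alice", "said", "her", "dog", "barked", "."],
    ["It", "was", "loud", "."],
]
mentions = [
    (0, 0, 0, 0),  # "Alice"   -> cluster 0
    (0, 2, 2, 0),  # "her"     -> cluster 0
    (0, 2, 3, 1),  # "her dog" -> cluster 1
    (1, 0, 0, 1),  # "It"      -> cluster 1
]
cluster_ids = cluster_ids_from_mentions(document, mentions)

# One entry per document: lists of sentences, each a list of words / ids.
train_documents = [document]
train_cluster_ids = [cluster_ids]
print(train_cluster_ids)
```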