Hyperparameters
Both corpus preprocessing and the construction of word representations expose parameters that can be tuned with mangoes.
Reference: LEVY, Omer, GOLDBERG, Yoav, and DAGAN, Ido. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 2015, vol. 3, p. 211-225.

Description | Params | Values | Effect |
---|---|---|---|
CORPUS | | | |
Text normalisation | lower | boolean (default = False) | Convert the input corpus to lower case |
 | digit | boolean (default = False) | Replace all numeric values with 0 |
 | ignore_punctuation | boolean (default = False) | Ignore punctuation |
COUNTING | | | |
Vocabulary and features selection | words | Words to represent | |
 | context or vocabulary param of the context param | Words to use as features | |
If vocabulary is extracted from the corpus: | | | |
Vocabulary filters | filters | function (default = None) | Filter most or least frequent words, remove punctuation, … |
Context definition | context | callable class (default = …) | From a sentence, return the words to be considered as co-occurring with each word in the sentence |
If using window-like contexts: | | | |
Size of the window | window_half_size | int (default = 1) | Half-size of the window |
Fixed size or dynamic | dynamic | boolean (default = False) | Use a fixed window size, or a random size between 1 and window_half_size |
Symmetric or asymmetric | symmetric | boolean (default = True) | Center the window on the word (symmetric) or make it asymmetric |
Clean or dirty | dirty | boolean (default = False) | If dirty, remove ignored words before creating the window |
Subsampling | subsampling | boolean or float defining the threshold (default = False) | Downsample the words that are more frequent than the threshold |
EMBEDDING | | | |
Transformations applied to the co-occurrence matrix | transformations | list of functions (default = None) | Apply weighting and dimensionality reduction to the counts |
Dimension of the vectors | dimensions | int | Size of the vectors |
If using PMI or a variant: | | | |
Context Distribution Smoothing | alpha | float (default = 1 for not smoothed) | Raise context counts to the power of alpha to “smooth” the contexts’ distribution |
Shift | shift | int >= 1 (default = 1 for no shift) | Shift the matrix values by log(shift) |
If using SVD: | | | |
Eigenvalue weighting | weight | int (default = 1) | Weighting exponent to apply to the eigenvalues |
Add context vectors | add_context_vectors | boolean (default = False) | Use the context vectors in addition to the word vectors |
Symmetric weighting | symmetric | boolean (default = False) | Way to compute the context vectors |
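
To make the window, `alpha` and `shift` rows above more concrete, here is a minimal pure-Python sketch of how these hyperparameters are commonly defined, following the formulation in Levy et al. (2015). It is not the mangoes implementation and does not use the mangoes API: the function names `window_cooccurrences` and `shifted_ppmi` are hypothetical, and the sketch assumes a symmetric, fixed-size, clean window and a shift applied by subtracting log(shift) before clipping negative values to zero.

```python
import math
from collections import Counter


def window_cooccurrences(sentences, window_half_size=1):
    """Count (word, context) pairs in a symmetric, fixed-size window.

    Hypothetical helper for illustration, not part of mangoes.
    """
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window_half_size)
            end = min(len(sentence), i + window_half_size + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    counts[(word, sentence[j])] += 1
    return counts


def shifted_ppmi(counts, alpha=1.0, shift=1):
    """Positive PMI with context distribution smoothing and shift.

    PMI(w, c) = log(P(w, c) / (P(w) * P_alpha(c))), where
    P_alpha(c) = count(c)**alpha / sum over c' of count(c')**alpha.
    Each value is then max(PMI(w, c) - log(shift), 0).
    Hypothetical helper for illustration, not part of mangoes.
    """
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    total = sum(counts.values())
    smoothed_total = sum(n ** alpha for n in context_totals.values())

    ppmi = {}
    for (word, context), n in counts.items():
        p_joint = n / total
        p_word = word_totals[word] / total
        # alpha < 1 raises the probability of rare contexts
        p_context = context_totals[context] ** alpha / smoothed_total
        value = math.log(p_joint / (p_word * p_context)) - math.log(shift)
        ppmi[(word, context)] = max(value, 0.0)
    return ppmi


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = window_cooccurrences(sentences, window_half_size=1)
plain = shifted_ppmi(counts)                         # alpha = 1, shift = 1
shifted = shifted_ppmi(counts, alpha=0.75, shift=5)  # smoothed and shifted
```

With `alpha = 1` and `shift = 1` the function reduces to plain positive PMI. Lowering `alpha` reduces the bias of PMI toward rare contexts, and raising `shift` zeroes out weakly associated pairs.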