Bases: OptiModule
Momentum gradient-acceleration module.
This module implements the following algorithm:

    Init(beta):
        velocity = 0
    Step(grads):
        velocity = beta * velocity + grads
        if nesterov:
            grads = beta * velocity + grads
        else:
            grads = velocity
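For concreteness, here is a minimal runnable sketch of one `Step`, using plain NumPy arrays in place of the declearn `Vector` objects the module actually operates on:

```python
import numpy as np

def momentum_step(velocity, grads, beta=0.9, nesterov=False):
    """One step of the algorithm above, on raw NumPy arrays."""
    velocity = beta * velocity + grads                  # accumulate velocity
    output = beta * velocity + grads if nesterov else velocity
    return velocity, output

velocity = np.zeros(2)                                  # Init: velocity = 0
velocity, out = momentum_step(velocity, np.array([1.0, -2.0]))
# first step: out == grads, since velocity started at zero
```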
Note that this contrasts with the canonical implementation of momentum
by Sutskever et al. [1]: the learning rate is applied to the whole output
of the algorithm above, in the Optimizer class, rather than only to the
gradient part of it, following the [PyTorch implementation](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html).
The nesterov variant's implementation is equivalently adapted.
This formulation is equivalent to the canonical one for a constant learning
rate (eta), with both approaches outputting:
$$ w_{t+1} = w_t - \eta \sum_{k=1}^t \beta^{t-k} \nabla_k $$
It may however yield differences when \(\eta\) changes through training, as illustrated below:
- (can.) $$ w_{t+1} = w_t - \sum_{k=1}^t \eta_k \beta^{t-k} \nabla_k $$
- (ours) $$ w_{t+1} = w_t - \eta_t \sum_{k=1}^t \beta^{t-k} \nabla_k $$
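For instance, with \(\beta = 0.9\), unit gradients at two steps, and a learning rate decayed from 0.1 to 0.01 between them, the two recursions land on different weights. A minimal sketch (plain Python, illustrative values only):

```python
beta = 0.9
grads = [1.0, 1.0]              # unit gradient at both steps
etas = [0.1, 0.01]              # learning rate decays between steps

w_can = w_ours = 0.0
v_can = v_ours = 0.0
for eta, grad in zip(etas, grads):
    v_can = beta * v_can + eta * grad    # canonical: lr folded into velocity
    w_can -= v_can
    v_ours = beta * v_ours + grad        # ours: lr applied to the output
    w_ours -= eta * v_ours

print(w_can, w_ours)            # ≈ -0.2 vs ≈ -0.119
```

With a constant learning rate, the same loop yields identical `w_can` and `w_ours`, matching the equivalence stated above.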
References
[1] Sutskever et al., 2013.
On the importance of initialization and momentum in deep learning
https://proceedings.mlr.press/v28/sutskever13.pdf
Source code in declearn/optimizer/modules/_momentum.py
class MomentumModule(OptiModule):
    """Momentum gradient-acceleration module.

    This module implements the following algorithm:

        Init(beta):
            velocity = 0
        Step(grads):
            velocity = beta * velocity + grads
            if nesterov:
                grads = beta * velocity + grads
            else:
                grads = velocity

    Note that this contrasts with the canonical implementation of momentum
    by Sutskever et al. [1]: the learning rate is applied to the whole
    output of the algorithm above, in the Optimizer class, rather than only
    to the gradient part of it, following the [PyTorch implementation](\
    https://pytorch.org/docs/stable/generated/torch.optim.SGD.html).
    The nesterov variant's implementation is equivalently adapted.

    This formulation is equivalent to the canonical one for a constant
    learning rate (eta), with both approaches outputting:
    $$ w_{t+1} = w_t - \\eta \\sum_{k=1}^t \\beta^{t-k} \\nabla_k $$

    It may however yield differences when $\\eta$ changes through training:
    - (can.) $$ w_{t+1} = w_t - \\sum_{k=1}^t \\eta_k \\beta^{t-k} \\nabla_k $$
    - (ours) $$ w_{t+1} = w_t - \\eta_t \\sum_{k=1}^t \\beta^{t-k} \\nabla_k $$

    References
    ----------
    [1] Sutskever et al., 2013.
        On the importance of initialization and momentum in deep learning
        https://proceedings.mlr.press/v28/sutskever13.pdf
    """

    name: ClassVar[str] = "momentum"

    def __init__(
        self,
        beta: float = 0.9,
        nesterov: bool = False,
    ) -> None:
        """Instantiate the Momentum gradients-adaptation module.

        Parameters
        ----------
        beta: float, default=0.9
            Momentum coefficient parameterizing the weight of the velocity.
        nesterov : bool, default=False
            Whether to use Nesterov-accelerated momentum.
        """
        if not isinstance(beta, float):
            raise TypeError("'beta' should be of type float.")
        if not 0 <= beta < 1:
            raise ValueError("'beta' value should be in [0, 1[.")
        self.beta = beta
        self.nesterov = nesterov
        self.velocity = 0.0  # type: Union[Vector, float]

    def get_config(
        self,
    ) -> Dict[str, Any]:
        return {"beta": self.beta, "nesterov": self.nesterov}

    def run(
        self,
        gradients: Vector,
    ) -> Vector:
        self.velocity = (self.beta * self.velocity) + gradients
        if self.nesterov:
            return (self.beta * self.velocity) + gradients
        return self.velocity

    def get_state(
        self,
    ) -> Dict[str, Any]:
        return {"velocity": self.velocity}

    def set_state(
        self,
        state: Dict[str, Any],
    ) -> None:
        if "velocity" not in state:
            raise KeyError("Missing required state variable 'velocity'.")
        self.velocity = state["velocity"]
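A hedged usage sketch, exercising the public methods above; plain NumPy arrays stand in for the declearn `Vector` gradients the signatures expect (the elementwise arithmetic is the same):

```python
import numpy as np

module = MomentumModule(beta=0.9, nesterov=False)

grads = np.array([0.5, -1.0])
out1 = module.run(grads)        # velocity == grads on the first call
out2 = module.run(grads)        # velocity == 0.9 * out1 + grads afterwards

# The Optimizer is expected to scale this output by the learning rate:
lrate = 0.01
updates = -lrate * out2

state = module.get_state()      # {"velocity": ...}, e.g. for checkpointing
module.set_state(state)         # restore the velocity on a fresh instance
```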
__init__(beta=0.9, nesterov=False)
Instantiate the Momentum gradients-adaptation module.
Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `beta` | `float` | Momentum coefficient parameterizing the weight of the velocity. | `0.9` |
| `nesterov` | `bool` | Whether to use Nesterov-accelerated momentum. | `False` |
Source code in declearn/optimizer/modules/_momentum.py
def __init__(
    self,
    beta: float = 0.9,
    nesterov: bool = False,
) -> None:
    """Instantiate the Momentum gradients-adaptation module.

    Parameters
    ----------
    beta: float, default=0.9
        Momentum coefficient parameterizing the weight of the velocity.
    nesterov : bool, default=False
        Whether to use Nesterov-accelerated momentum.
    """
    if not isinstance(beta, float):
        raise TypeError("'beta' should be of type float.")
    if not 0 <= beta < 1:
        raise ValueError("'beta' value should be in [0, 1[.")
    self.beta = beta
    self.nesterov = nesterov
    self.velocity = 0.0  # type: Union[Vector, float]
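The checks above reject non-float and out-of-range `beta` values (note that `nesterov` itself is not type-checked). A quick illustration:

```python
for bad in (1, 1.0, -0.1):
    try:
        MomentumModule(beta=bad)
    except (TypeError, ValueError) as exc:
        print(type(exc).__name__, exc)
# TypeError 'beta' should be of type float.
# ValueError 'beta' value should be in [0, 1[.
# ValueError 'beta' value should be in [0, 1[.
```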