
declearn.optimizer.modules.MomentumModule

Bases: OptiModule

Momentum gradient-acceleration module.

This module implements the following algorithm:

Init(beta):
    velocity = 0
Step(grads):
    velocity = beta * velocity + grads
    if nesterov:
        grads = beta * velocity + grads
    else:
        grads = velocity
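
As an illustration, here is a minimal plain-Python trace of the update rule above; the scalar values are purely illustrative stand-ins for the declearn Vector gradients the module actually processes:

    beta = 0.9
    velocity = 0.0
    for grad in (1.0, 1.0, 1.0):           # constant unit gradient, for illustration
        velocity = beta * velocity + grad  # velocity: 1.0, then 1.9, then ≈2.71
        output = velocity                  # nesterov=False: the velocity itself is output
    # With nesterov=True, each output would be beta * velocity + grad instead,
    # i.e. 1.9, then ≈2.71, then ≈3.44.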

Note that this contrasts with the canonical implementation of momentum by Sutskever et al. [1]: the learning rate is applied, in the Optimizer class, to the whole output of the algorithm above rather than only to its gradient part, following the PyTorch implementation (https://pytorch.org/docs/stable/generated/torch.optim.SGD.html). The Nesterov variant's implementation is adapted in the same way.

This formulation is equivalent to the canonical one for a constant learning rate \(\eta\), with both approaches outputting: $$ w_{t+1} = w_t - \eta \sum_{k=1}^t \beta^{t-k} \nabla_k $$

It may however yield differences when \(\eta\) changes through training:

  • (can.) $$ w_{t+1} = w_t - \sum_{k=1}^t \eta_k \beta^{t-k} \nabla_k $$
  • (ours) $$ w_{t+1} = w_t - \eta_t \sum_{k=1}^t \beta^{t-k} \nabla_k $$
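
The sketch below illustrates that difference with scalar gradients and a hypothetical two-step schedule in which the learning rate is halved after the first step; the names and values are illustrative only:

    beta = 0.9
    grads = [1.0, 1.0]   # gradients at steps 1 and 2 (illustrative)
    etas = [0.1, 0.05]   # learning rate halved after the first step

    # Canonical (Sutskever et al.): eta_k scales each gradient as it enters the velocity.
    w_can, vel = 0.0, 0.0
    for eta, grad in zip(etas, grads):
        vel = beta * vel + eta * grad
        w_can -= vel
    # w_can = -(0.1 + (0.9 * 0.1 + 0.05)) ≈ -0.24

    # This module (as in PyTorch): the current eta_t scales the whole velocity.
    w_ours, vel = 0.0, 0.0
    for eta, grad in zip(etas, grads):
        vel = beta * vel + grad
        w_ours -= eta * vel
    # w_ours = -(0.1 * 1.0 + 0.05 * 1.9) ≈ -0.195

    # With a constant learning rate (e.g. 0.1 at both steps), both variants
    # yield ≈ -0.29, matching the equivalence stated above.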

References

[1] Sutskever et al., 2013. On the importance of initialization and momentum in deep learning. https://proceedings.mlr.press/v28/sutskever13.pdf

Source code in declearn/optimizer/modules/_momentum.py
class MomentumModule(OptiModule):
    """Momentum gradient-acceleration module.

    This module implements the following algorithm:

        Init(beta):
            velocity = 0
        Step(grads):
            velocity = beta * velocity + grads
            if nesterov:
                grads = beta * velocity + grads
            else:
                grads = velocity

    Note that this contrasts with the canonical implementation of momentum
    by Sutskever et al. [1]: the learning rate is applied, in the Optimizer
    class, to the whole output of the algorithm above rather than only to
    its gradient part, following the [PyTorch implementation](\
    https://pytorch.org/docs/stable/generated/torch.optim.SGD.html).
    The Nesterov variant's implementation is adapted in the same way.

    This formulation is equivalent to the canonical one for a constant
    learning rate (eta), with both approaches outputting:
        $$ w_{t+1} = w_t - \\eta \\sum_{k=1}^t \\beta^{t-k} \\nabla_k $$

    It may however yield differences when $\\eta$ changes through training:

    - (can.) $$ w_{t+1} = w_t - \\sum_{k=1}^t \\eta_k \\beta^{t-k} \\nabla_k $$
    - (ours) $$ w_{t+1} = w_t - \\eta_t \\sum_{k=1}^t \\beta^{t-k} \\nabla_k $$

    References
    ----------
    [1] Sutskever et al., 2013.
        On the importance of initialization and momentum in deep learning
        https://proceedings.mlr.press/v28/sutskever13.pdf
    """

    name: ClassVar[str] = "momentum"

    def __init__(
        self,
        beta: float = 0.9,
        nesterov: bool = False,
    ) -> None:
        """Instantiate the Momentum gradients-adaptation module.

        Parameters
        ----------
        beta: float, default=0.9
            Momentum coefficient parameterizing the weight of the velocity.
        nesterov : bool, default=False
            Whether to use Nesterov-accelerated momentum.
        """
        if not isinstance(beta, float):
            raise TypeError("'beta' should be of type float.")
        if not 0 <= beta < 1:
            raise ValueError("'beta' value should be in [0, 1[.")
        self.beta = beta
        self.nesterov = nesterov
        self.velocity = 0.0  # type: Union[Vector, float]

    def get_config(
        self,
    ) -> Dict[str, Any]:
        return {"beta": self.beta, "nesterov": self.nesterov}

    def run(
        self,
        gradients: Vector,
    ) -> Vector:
        self.velocity = (self.beta * self.velocity) + gradients
        if self.nesterov:
            return (self.beta * self.velocity) + gradients
        return self.velocity

    def get_state(
        self,
    ) -> Dict[str, Any]:
        return {"velocity": self.velocity}

    def set_state(
        self,
        state: Dict[str, Any],
    ) -> None:
        if "velocity" not in state:
            raise KeyError("Missing required state variable 'velocity'.")
        self.velocity = state["velocity"]
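
A minimal standalone usage sketch, using scalar floats as stand-ins for the declearn Vector gradients that run() normally receives (in real use the module is plugged into a declearn Optimizer, which applies the learning rate to its output as described above):

    from declearn.optimizer.modules import MomentumModule

    module = MomentumModule(beta=0.9, nesterov=False)
    print(module.get_config())        # {'beta': 0.9, 'nesterov': False}

    for grad in (1.0, 1.0, 1.0):      # scalar stand-ins for Vector gradients
        accel = module.run(grad)      # velocity: 1.0, then 1.9, then ≈2.71

    state = module.get_state()        # holds the current velocity (≈2.71)
    restored = MomentumModule(beta=0.9)
    restored.set_state(state)         # resume from the saved velocity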

__init__(beta=0.9, nesterov=False)

Instantiate the Momentum gradients-adaptation module.

Parameters:

Name      Type   Description                                                       Default
beta      float  Momentum coefficient parameterizing the weight of the velocity.   0.9
nesterov  bool   Whether to use Nesterov-accelerated momentum.                      False
Source code in declearn/optimizer/modules/_momentum.py
def __init__(
    self,
    beta: float = 0.9,
    nesterov: bool = False,
) -> None:
    """Instantiate the Momentum gradients-adaptation module.

    Parameters
    ----------
    beta: float, default=0.9
        Momentum coefficient parameterizing the weight of the velocity.
    nesterov : bool, default=False
        Whether to use Nesterov-accelerated momentum.
    """
    if not isinstance(beta, float):
        raise TypeError("'beta' should be of type float.")
    if not 0 <= beta < 1:
        raise ValueError("'beta' value should be in [0, 1[.")
    self.beta = beta
    self.nesterov = nesterov
    self.velocity = 0.0  # type: Union[Vector, float]
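
Per the validation logic in this constructor, out-of-range or mis-typed beta values are rejected at instantiation; a short illustrative check:

    from declearn.optimizer.modules import MomentumModule

    MomentumModule(beta=0.9)        # OK: beta is a float in [0, 1)
    try:
        MomentumModule(beta=1.0)    # outside [0, 1)
    except ValueError as exc:
        print(exc)                  # 'beta' value should be in [0, 1[.
    try:
        MomentumModule(beta=1)      # an int, not a float
    except TypeError as exc:
        print(exc)                  # 'beta' should be of type float.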