
declearn.optimizer.modules.MomentumModule

Bases: OptiModule

Momentum gradient-acceleration module.

This module implements the following algorithm:

Init(beta):
    velocity = 0
Step(grads):
    velocity = beta * velocity + grads
    if nesterov:
        grads = beta * velocity + grads
    else:
        grads = velocity
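
As an illustration, here is a minimal plain-Python trace of the update rule above; the scalar values are purely illustrative stand-ins for the declearn Vector gradients the module actually processes:

    beta = 0.9
    velocity = 0.0
    for grad in (1.0, 1.0, 1.0):           # constant unit gradient, for illustration
        velocity = beta * velocity + grad  # velocity: 1.0, then 1.9, then ≈2.71
        output = velocity                  # nesterov=False: the velocity itself is output
    # With nesterov=True, each output would be beta * velocity + grad instead,
    # i.e. 1.9, then ≈2.71, then ≈3.44.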

Note that this contrasts with the canonical implementation of momentum by Sutskever et al. [1]: the learning rate is applied, in the Optimizer class, to the whole output of the algorithm above rather than only to its gradient part, following the PyTorch implementation (https://pytorch.org/docs/stable/generated/torch.optim.SGD.html). The Nesterov variant's implementation is adapted in the same way.

This formulation is equivalent to the canonical one for a constant learning rate \(\eta\), with both approaches outputting: $$ w_{t+1} = w_t - \eta \sum_{k=1}^t \beta^{t-k} \nabla_k $$

It may however yield differences when \(\eta\) changes through training:

  • (can.) $$ w_{t+1} = w_t - \sum_{k=1}^t \eta_k \beta^{t-k} \nabla_k $$
  • (ours) $$ w_{t+1} = w_t - \eta_t \sum_{k=1}^t \beta^{t-k} \nabla_k $$
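
The sketch below illustrates that difference with scalar gradients and a hypothetical two-step schedule in which the learning rate is halved after the first step; the names and values are illustrative only:

    beta = 0.9
    grads = [1.0, 1.0]   # gradients at steps 1 and 2 (illustrative)
    etas = [0.1, 0.05]   # learning rate halved after the first step

    # Canonical (Sutskever et al.): eta_k scales each gradient as it enters the velocity.
    w_can, vel = 0.0, 0.0
    for eta, grad in zip(etas, grads):
        vel = beta * vel + eta * grad
        w_can -= vel
    # w_can = -(0.1 + (0.9 * 0.1 + 0.05)) ≈ -0.24

    # This module (as in PyTorch): the current eta_t scales the whole velocity.
    w_ours, vel = 0.0, 0.0
    for eta, grad in zip(etas, grads):
        vel = beta * vel + grad
        w_ours -= eta * vel
    # w_ours = -(0.1 * 1.0 + 0.05 * 1.9) ≈ -0.195

    # With a constant learning rate (e.g. 0.1 at both steps), both variants
    # yield ≈ -0.29, matching the equivalence stated above.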

References

[1] Sutskever et al., 2013. On the importance of initialization and momentum in deep learning. https://proceedings.mlr.press/v28/sutskever13.pdf

Source code in declearn/optimizer/modules/_momentum.py
class MomentumModule(OptiModule):
    """Momentum gradient-acceleration module.

    This module implements the following algorithm:

        Init(beta):
            velocity = 0
        Step(grads):
            velocity = beta * velocity + grads
            if nesterov:
                grads = beta * velocity + grads
            else:
                grads = velocity

    Note that this contrasts with the canonical implementation of momentum
    by Sutskever et al. [1]: the learning rate is applied, in the Optimizer
    class, to the whole output of the algorithm above rather than only to
    its gradient part, following the [PyTorch implementation](\
    https://pytorch.org/docs/stable/generated/torch.optim.SGD.html).
    The Nesterov variant's implementation is adapted in the same way.

    This formulation is equivalent to the canonical one for a constant
    learning rate (eta), with both approaches outputting:
        $$ w_{t+1} = w_t - \\eta \\sum_{k=1}^t \\beta^{t-k} \\nabla_k $$

    It may however yield differences when $\\eta$ changes through training:

    - (can.) $$ w_{t+1} = w_t - \\sum_{k=1}^t \\eta_k \\beta^{t-k} \\nabla_k $$
    - (ours) $$ w_{t+1} = w_t - \\eta_t \\sum_{k=1}^t \\beta^{t-k} \\nabla_k $$

    References
    ----------
    [1] Sutskever et al., 2013.
        On the importance of initialization and momentum in deep learning
        https://proceedings.mlr.press/v28/sutskever13.pdf
    """

    name: ClassVar[str] = "momentum"

    def __init__(
        self,
        beta: float = 0.9,
        nesterov: bool = False,
    ) -> None:
        """Instantiate the Momentum gradients-adaptation module.

        Parameters
        ----------
        beta: float, default=0.9
            Momentum coefficient parameterizing the weight of the velocity.
        nesterov : bool, default=False
            Whether to use Nesterov-accelerated momentum.
        """
        if not isinstance(beta, float):
            raise TypeError("'beta' should be of type float.")
        if not 0 <= beta < 1:
            raise ValueError("'beta' value should be in [0, 1[.")
        self.beta = beta
        self.nesterov = nesterov
        self.velocity = 0.0  # type: Union[Vector, float]

    def get_config(
        self,
    ) -> Dict[str, Any]:
        return {"beta": self.beta, "nesterov": self.nesterov}

    def run(
        self,
        gradients: Vector,
    ) -> Vector:
        self.velocity = (self.beta * self.velocity) + gradients
        if self.nesterov:
            return (self.beta * self.velocity) + gradients
        return self.velocity

    def get_state(
        self,
    ) -> Dict[str, Any]:
        return {"velocity": self.velocity}

    def set_state(
        self,
        state: Dict[str, Any],
    ) -> None:
        if "velocity" not in state:
            raise KeyError("Missing required state variable 'velocity'.")
        self.velocity = state["velocity"]
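
A minimal standalone usage sketch, using scalar floats as stand-ins for the declearn Vector gradients that run() normally receives (in real use the module is plugged into a declearn Optimizer, which applies the learning rate to its output as described above):

    from declearn.optimizer.modules import MomentumModule

    module = MomentumModule(beta=0.9, nesterov=False)
    print(module.get_config())        # {'beta': 0.9, 'nesterov': False}

    for grad in (1.0, 1.0, 1.0):      # scalar stand-ins for Vector gradients
        accel = module.run(grad)      # velocity: 1.0, then 1.9, then ≈2.71

    state = module.get_state()        # holds the current velocity (≈2.71)
    restored = MomentumModule(beta=0.9)
    restored.set_state(state)         # resume from the saved velocity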

__init__(beta=0.9, nesterov=False)

Instantiate the Momentum gradients-adaptation module.

Parameters:

Name      Type   Description                                                       Default
beta      float  Momentum coefficient parameterizing the weight of the velocity.   0.9
nesterov  bool   Whether to use Nesterov-accelerated momentum.                      False
Source code in declearn/optimizer/modules/_momentum.py
def __init__(
    self,
    beta: float = 0.9,
    nesterov: bool = False,
) -> None:
    """Instantiate the Momentum gradients-adaptation module.

    Parameters
    ----------
    beta: float, default=0.9
        Momentum coefficient parameterizing the weight of the velocity.
    nesterov : bool, default=False
        Whether to use Nesterov-accelerated momentum.
    """
    if not isinstance(beta, float):
        raise TypeError("'beta' should be of type float.")
    if not 0 <= beta < 1:
        raise ValueError("'beta' value should be in [0, 1[.")
    self.beta = beta
    self.nesterov = nesterov
    self.velocity = 0.0  # type: Union[Vector, float]
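
Per the validation logic in this constructor, out-of-range or mis-typed beta values are rejected at instantiation; a short illustrative check:

    from declearn.optimizer.modules import MomentumModule

    MomentumModule(beta=0.9)        # OK: beta is a float in [0, 1)
    try:
        MomentumModule(beta=1.0)    # outside [0, 1)
    except ValueError as exc:
        print(exc)                  # 'beta' value should be in [0, 1[.
    try:
        MomentumModule(beta=1)      # an int, not a float
    except TypeError as exc:
        print(exc)                  # 'beta' should be of type float.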