`declearn.optimizer.modules.AdamModule`

Bases: OptiModule

Adaptive Moment Estimation (Adam) module.

This module implements the following algorithm:

Init(beta_1, beta_2, eps):
    state_m = 0
    state_v = 0
Step(grads, step):
    state_m = beta_1*state_m + (1-beta_1)*grads
    state_v = beta_2*state_v + (1-beta_2)*(grads**2)
    m_hat = state_m / (1 - beta_1**step)
    v_hat = state_v / (1 - beta_2**step)
    grads = m_hat / (sqrt(v_hat) + eps)

In other words, exponentially-weighted moving averages of the gradients and of the squared gradients are maintained. Both are bias-corrected, then the former is scaled down by the square root of the latter, AdaGrad-style (indirectly adapting the learning rate), and returned. This is the Adam [1] algorithm.
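As an illustration, a single step of this algorithm can be sketched in plain Python on lists of floats. This is a minimal sketch rather than the module's implementation: the `adam_step` function and its signature are hypothetical, the states are passed in and returned explicitly instead of being stored on an object, and the final division uses the bias-corrected first moment, as the `run` method in the source code does.

```python
import math
from typing import List, Tuple


def adam_step(
    grads: List[float],
    state_m: List[float],
    state_v: List[float],
    step: int,
    beta_1: float = 0.9,
    beta_2: float = 0.99,
    eps: float = 1e-7,
) -> Tuple[List[float], List[float], List[float]]:
    """Run one Adam step (hypothetical helper; `step` is 1-indexed).

    Return the adapted gradients and the two updated states.
    """
    # Update the moving averages of the gradients and squared gradients.
    state_m = [beta_1 * m + (1 - beta_1) * g for m, g in zip(state_m, grads)]
    state_v = [beta_2 * v + (1 - beta_2) * g**2 for v, g in zip(state_v, grads)]
    # Bias-correct both averages.
    m_hat = [m / (1 - beta_1**step) for m in state_m]
    v_hat = [v / (1 - beta_2**step) for v in state_v]
    # Scale the corrected first moment by the root of the second one.
    adapted = [m / (math.sqrt(v) + eps) for m, v in zip(m_hat, v_hat)]
    return adapted, state_m, state_v


# First step, starting from zero-valued states.
grads = [0.5, -2.0, 0.0]
adapted, state_m, state_v = adam_step(grads, [0.0] * 3, [0.0] * 3, step=1)
```

On the first step the bias correction exactly undoes the `(1 - beta)` factors, so each output is `g / (|g| + eps)`: close to the sign of the gradient, and zero where the gradient is zero.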

Optionally, the AMSGrad [2] algorithm may be used instead, with a similar formula but using the element-wise maximum of present and past v_hat values as the scaling factor. This guarantees that the learning rate never increases over time, at least from the point of view of this module (a warm-up schedule might, for example, counteract this).
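The running maximum can be sketched as follows, again in plain Python on lists of floats. The `amsgrad_scaling` helper and the sample `v_hat_history` values are hypothetical and only illustrate the idea; the module itself relies on the `maximum` method of its `Vector` inputs.

```python
import math
from typing import List, Optional


def amsgrad_scaling(
    v_hat: List[float],
    v_max: Optional[List[float]],
) -> List[float]:
    """Return the element-wise maximum of current and past v_hat values."""
    if v_max is None:  # first step: nothing to compare with yet
        return list(v_hat)
    return [max(new, old) for new, old in zip(v_hat, v_max)]


# Made-up bias-corrected second-moment values over three steps.
v_hat_history = [[4.0, 0.25], [1.0, 0.5], [0.25, 0.75]]
v_max = None
denominators = []
for v_hat in v_hat_history:
    v_max = amsgrad_scaling(v_hat, v_max)
    # Divisor that would be applied to the bias-corrected first moment.
    denominators.append([math.sqrt(v) + 1e-7 for v in v_max])
```

In this example the first coordinate keeps its initial maximum even though its `v_hat` value decays, so the divisor for each coordinate never decreases from one step to the next.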

References

[1] Kingma and Ba, 2014. Adam: A Method for Stochastic Optimization. https://arxiv.org/abs/1412.6980
[2] Reddi et al., 2018. On the Convergence of Adam and Beyond. https://arxiv.org/abs/1904.09237

Source code in declearn/optimizer/modules/_adaptive.py

class AdamModule(OptiModule):
    """Adaptive Moment Estimation (Adam) module.

    This module implements the following algorithm:

        Init(beta_1, beta_2, eps):
            state_m = 0
            state_v = 0
        Step(grads, step):
            state_m = beta_1*state_m + (1-beta_1)*grads
            state_v = beta_2*state_v + (1-beta_2)*(grads**2)
            m_hat = state_m / (1 - beta_1**step)
            v_hat = state_v / (1 - beta_2**step)
            grads = m_hat / (sqrt(v_hat) + eps)

    In other words, exponentially-weighted moving averages of
    the gradients and of the squared gradients are maintained.
    Both are bias-corrected, then the former is scaled down by
    the square root of the latter, AdaGrad-style (indirectly
    adapting the learning rate), and returned. This is the
    Adam [1] algorithm.

    Optionally, the AMSGrad [2] algorithm may be used instead,
    with a similar formula but using the element-wise maximum
    of present and past v_hat values as the scaling factor.
    This guarantees that the learning rate never increases
    over time, at least from the point of view of this module
    (a warm-up schedule might, for example, counteract this).

    References
    ----------
    - [1]
        Kingma and Ba, 2014.
        Adam: A Method for Stochastic Optimization.
        https://arxiv.org/abs/1412.6980
    - [2]
        Reddi et al., 2018.
        On the Convergence of Adam and Beyond.
        https://arxiv.org/abs/1904.09237
    """

    name: ClassVar[str] = "adam"

    def __init__(
        self,
        beta_1: float = 0.9,
        beta_2: float = 0.99,
        amsgrad: bool = False,
        eps: float = 1e-7,
    ) -> None:
        """Instantiate the Adam gradients-adaptation module.

        Parameters
        ----------
        beta_1: float, default=0.9
            Beta parameter for the momentum correction
            applied to the input gradients.
        beta_2: float, default=0.99
            Beta parameter for the momentum correction
            applied to the adaptive scaling term.
        amsgrad: bool, default=False
            Whether to implement the AMSGrad algorithm
            rather than the base Adam one.
        eps: float, default=1e-7
            Numerical-stability improvement term, added
            to the (divisor) adaptive scaling term.
        """
        self.ewma_1 = EWMAModule(beta=beta_1)
        self.ewma_2 = EWMAModule(beta=beta_2)
        self.steps = 0
        self.eps = eps
        self.amsgrad = amsgrad
        self.vmax = None  # type: Optional[Vector]

    def get_config(
        self,
    ) -> Dict[str, Any]:
        return {
            "beta_1": self.ewma_1.beta,
            "beta_2": self.ewma_2.beta,
            "amsgrad": self.amsgrad,
            "eps": self.eps,
        }

    def run(
        self,
        gradients: Vector,
    ) -> Vector:
        # Compute momentum-corrected state variables.
        m_t = self.ewma_1.run(gradients)
        v_t = self.ewma_2.run(gradients**2)
        # Apply bias correction to the previous terms.
        m_h = m_t / (1 - (self.ewma_1.beta ** (self.steps + 1)))
        v_h = v_t / (1 - (self.ewma_2.beta ** (self.steps + 1)))
        # Optionally implement the AMSGrad algorithm.
        if self.amsgrad:
            if self.vmax is not None:
                v_h = v_h.maximum(self.vmax)
            self.vmax = v_h
        # Compute and return the adapted gradients.
        gradients = m_h / ((v_h**0.5) + self.eps)
        self.steps += 1
        return gradients

    def get_state(
        self,
    ) -> Dict[str, Any]:
        state = {
            "steps": self.steps,
            "vmax": self.vmax,
        }  # type: Dict[str, Any]
        state["momentum"] = self.ewma_1.get_state()
        state["velocity"] = self.ewma_2.get_state()
        return state

    def set_state(
        self,
        state: Dict[str, Any],
    ) -> None:
        for key in ("momentum", "velocity", "steps", "vmax"):
            if key not in state:
                raise KeyError(f"Missing required state variable '{key}'.")
        self.ewma_1.set_state(state["momentum"])
        self.ewma_2.set_state(state["velocity"])
        self.steps = state["steps"]
        self.vmax = state["vmax"]

`__init__(beta_1=0.9, beta_2=0.99, amsgrad=False, eps=1e-07)`

Instantiate the Adam gradients-adaptation module.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `beta_1` | `float` | Beta parameter for the momentum correction applied to the input gradients. | `0.9` |
| `beta_2` | `float` | Beta parameter for the momentum correction applied to the adaptive scaling term. | `0.99` |
| `amsgrad` | `bool` | Whether to implement the AMSGrad algorithm rather than the base Adam one. | `False` |
| `eps` | `float` | Numerical-stability improvement term, added to the (divisor) adaptive scaling term. | `1e-07` |

Source code in declearn/optimizer/modules/_adaptive.py

def __init__(
    self,
    beta_1: float = 0.9,
    beta_2: float = 0.99,
    amsgrad: bool = False,
    eps: float = 1e-7,
) -> None:
    """Instantiate the Adam gradients-adaptation module.

    Parameters
    ----------
    beta_1: float, default=0.9
        Beta parameter for the momentum correction
        applied to the input gradients.
    beta_2: float, default=0.99
        Beta parameter for the momentum correction
        applied to the adaptive scaling term.
    amsgrad: bool, default=False
        Whether to implement the AMSGrad algorithm
        rather than the base Adam one.
    eps: float, default=1e-7
        Numerical-stability improvement term, added
        to the (divisor) adaptive scaling term.
    """
    self.ewma_1 = EWMAModule(beta=beta_1)
    self.ewma_2 = EWMAModule(beta=beta_2)
    self.steps = 0
    self.eps = eps
    self.amsgrad = amsgrad
    self.vmax = None  # type: Optional[Vector]
