
declearn.optimizer.modules.YogiModule

Bases: AdamModule

Yogi additive adaptive moment estimation module.

This module implements the following algorithm:

Init(beta_1, beta_2, eps):
    state_m = 0
    state_v = 0
Step(grads, step):
    state_m = beta_1*state_m + (1-beta_1)*grads
    sign_uv = sign(state_v - grads**2)
    state_v = state_v - sign_uv*(1-beta_2)*(grads**2)
    m_hat = state_m / (1 - beta_1**step)
    v_hat = state_v / (1 - beta_2**step)
    grads = m_hat / (sqrt(v_hat) + eps)

In other words, Yogi [1] implements the Adam [2] algorithm, but modifies the update rule of the 'v' state variable that is used to scale the learning rate.

Note that this implementation allows combining the Yogi modification of Adam with the AMSGrad [3] one.
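For illustration, below is a minimal standalone NumPy sketch of one such step, with the Adam counterpart of the second-moment update noted in a comment. The helper name yogi_step and the standalone setting are illustrative assumptions and do not reflect declearn's internal API, which operates on declearn's Vector abstraction rather than raw NumPy arrays.

    import numpy as np

    def yogi_step(grads, state_m, state_v, step,
                  beta_1=0.9, beta_2=0.99, eps=1e-7):
        """Run one Yogi step; return (adapted_grads, state_m, state_v)."""
        # First-moment EWMA, identical to Adam.
        state_m = beta_1 * state_m + (1 - beta_1) * grads
        # Yogi second-moment update: an additive, sign-controlled correction
        # that moves state_v towards grads**2. Adam would instead compute
        # state_v = beta_2*state_v + (1-beta_2)*grads**2.
        sign_uv = np.sign(state_v - grads ** 2)
        state_v = state_v - sign_uv * (1 - beta_2) * grads ** 2
        # Zero-debiasing of both states, as in Adam.
        m_hat = state_m / (1 - beta_1 ** step)
        v_hat = state_v / (1 - beta_2 ** step)
        # Adapted gradients, to be multiplied by the base learning rate.
        return m_hat / (np.sqrt(v_hat) + eps), state_m, state_v

    # Toy usage: iterate over gradient estimates, carrying the states along.
    state_m = state_v = np.zeros(3)
    for step, grads in enumerate(np.random.randn(5, 3), start=1):
        updates, state_m, state_v = yogi_step(grads, state_m, state_v, step)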

References

  • [1] Zaheer and Reddi et al., 2018. Adaptive Methods for Nonconvex Optimization.
  • [2] Kingma and Ba, 2014. Adam: A Method for Stochastic Optimization. https://arxiv.org/abs/1412.6980
  • [3] Reddi et al., 2018. On the Convergence of Adam and Beyond. https://arxiv.org/abs/1904.09237
Source code in declearn/optimizer/modules/_adaptive.py
class YogiModule(AdamModule):
    """Yogi additive adaptive moment estimation module.

    This module implements the following algorithm:

        Init(beta_1, beta_2, eps):
            state_m = 0
            state_v = 0
        Step(grads, step):
            state_m = beta_1*state_m + (1-beta_1)*grads
            sign_uv = sign(state_v - grads**2)
            state_v = state_v - sign_uv*(1-beta_2)*(grads**2)
            m_hat = state_m / (1 - beta_1**step)
            v_hat = state_v / (1 - beta_2**step)
            grads = m_hat / (sqrt(v_hat) + eps)

    In other words, Yogi [1] implements the Adam [2] algorithm,
    but modifies the update rule of the 'v' state variable that
    is used to scale the learning rate.

    Note that this implementation allows combining the Yogi
    modification of Adam with the AMSGrad [3] one.

    References
    ----------
    - [1]
        Zaheer and Reddi et al., 2018.
        Adaptive Methods for Nonconvex Optimization.
    - [2]
        Kingma and Ba, 2014.
        Adam: A Method for Stochastic Optimization.
        https://arxiv.org/abs/1412.6980
    - [3]
        Reddi et al., 2018.
        On the Convergence of Adam and Beyond.
        https://arxiv.org/abs/1904.09237
    """

    name: ClassVar[str] = "yogi"

    def __init__(
        self,
        beta_1: float = 0.9,
        beta_2: float = 0.99,
        amsgrad: bool = False,
        eps: float = 1e-7,
    ) -> None:
        """Instantiate the Yogi gradients-adaptation module.

        Parameters
        ----------
        beta_1: float
            Beta parameter for the momentum correction
            applied to the input gradients.
        beta_2: float
            Beta parameter for the (Yogi-specific) momentum
            correction applied to the adaptive scaling term.
        amsgrad: bool, default=False
            Whether to implement the Yogi modification on top
            of the AMSGrad algorithm rather than the Adam one.
        eps: float, default=1e-7
            Numerical-stability improvement term, added
            to the (divisor) adaptive scaling term.
        """
        super().__init__(beta_1, beta_2, amsgrad=amsgrad, eps=eps)
        self.ewma_2 = YogiMomentumModule(beta=beta_2)

__init__(beta_1=0.9, beta_2=0.99, amsgrad=False, eps=1e-07)

Instantiate the Yogi gradients-adaptation module.

Parameters:

  • beta_1 (float, default=0.9):
    Beta parameter for the momentum correction applied to the input gradients.
  • beta_2 (float, default=0.99):
    Beta parameter for the (Yogi-specific) momentum correction applied to the adaptive scaling term.
  • amsgrad (bool, default=False):
    Whether to implement the Yogi modification on top of the AMSGrad algorithm rather than the Adam one.
  • eps (float, default=1e-07):
    Numerical-stability improvement term, added to the (divisor) adaptive scaling term.
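For example, using only the signature documented above, the module can be instantiated with its defaults or combined with AMSGrad as mentioned in the class description:

    from declearn.optimizer.modules import YogiModule

    # Default hyper-parameters, as documented above.
    yogi = YogiModule()

    # Yogi modification applied on top of AMSGrad rather than plain Adam.
    yogi_amsgrad = YogiModule(beta_1=0.9, beta_2=0.99, amsgrad=True, eps=1e-7)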
Source code in declearn/optimizer/modules/_adaptive.py
def __init__(
    self,
    beta_1: float = 0.9,
    beta_2: float = 0.99,
    amsgrad: bool = False,
    eps: float = 1e-7,
) -> None:
    """Instantiate the Yogi gradients-adaptation module.

    Parameters
    ----------
    beta_1: float
        Beta parameter for the momentum correction
        applied to the input gradients.
    beta_2: float
        Beta parameter for the (Yogi-specific) momentum
        correction applied to the adaptive scaling term.
    amsgrad: bool, default=False
        Whether to implement the Yogi modification on top
        of the AMSGrad algorithm rather than the Adam one.
    eps: float, default=1e-7
        Numerical-stability improvement term, added
        to the (divisor) adaptive scaling term.
    """
    super().__init__(beta_1, beta_2, amsgrad=amsgrad, eps=eps)
    self.ewma_2 = YogiMomentumModule(beta=beta_2)