Available Optimizers

AccSGD

class torch_optimizer.AccSGD(params, lr=0.001, kappa=1000.0, xi=10.0, small_const=0.7, weight_decay=0)[source]

Implements AccSGD algorithm.

It has been proposed in On the insufficiency of existing momentum schemes for Stochastic Optimization and Accelerating Stochastic Gradient Descent For Least Squares Regression.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • kappa (float) – ratio of long to short step (default: 1000)

  • xi (float) – statistical advantage parameter (default: 10)

  • small_const (float) – any value <=1 (default: 0.7)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.AccSGD(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
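
A minimal sketch passing the AccSGD-specific arguments explicitly; kappa, xi, and small_const restate the documented defaults, while the weight_decay value is purely illustrative:

>>> optimizer = optim.AccSGD(
...     model.parameters(),
...     lr=1e-3,
...     kappa=1000.0,       # ratio of long to short step
...     xi=10.0,            # statistical advantage parameter
...     small_const=0.7,
...     weight_decay=1e-4,  # illustrative L2 penalty, not a default
... )
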
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

AdaBound

class torch_optimizer.AdaBound(params, lr=0.001, betas=(0.9, 0.999), final_lr=0.1, gamma=0.001, eps=1e-08, weight_decay=0, amsbound=False)[source]

Implements AdaBound algorithm.

It has been proposed in Adaptive Gradient Methods with Dynamic Bound of Learning Rate.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • betas (Tuple[float, float]) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • final_lr (float) – final (SGD) learning rate (default: 0.1)

  • gamma (float) – convergence speed of the bound functions (default: 1e-3)

  • eps (float) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

  • amsbound (bool) – whether to use the AMSBound variant of this algorithm (default: False)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.AdaBound(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
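
A sketch that sets the bound-related arguments explicitly; final_lr and gamma restate the documented defaults, and amsbound=True is shown purely for illustration:

>>> optimizer = optim.AdaBound(
...     model.parameters(),
...     lr=1e-3,
...     final_lr=0.1,   # final (SGD) learning rate the bounds converge to
...     gamma=1e-3,     # convergence speed of the bound functions
...     amsbound=True,  # illustrative: enable the AMSBound variant
... )
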

Note

Reference code: https://github.com/Luolc/AdaBound

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

AdaMod

class torch_optimizer.AdaMod(params, lr=0.001, betas=(0.9, 0.999), beta3=0.999, eps=1e-08, weight_decay=0)[source]

Implements AdaMod algorithm.

It has been proposed in Adaptive and Momental Bounds for Adaptive Learning Rate Methods.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • betas (Tuple[float, float]) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • beta3 (float) – smoothing coefficient for adaptive learning rates (default: 0.999)

  • eps (float) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.AdaMod(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
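
A sketch overriding the smoothing coefficient beta3; the non-default values are illustrative, not recommendations:

>>> optimizer = optim.AdaMod(
...     model.parameters(),
...     lr=1e-3,
...     betas=(0.9, 0.999),
...     beta3=0.9999,       # illustrative: heavier smoothing of the adaptive rates
...     weight_decay=1e-4,  # illustrative L2 penalty
... )
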

Note

Reference code: https://github.com/lancopku/AdaMod

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

Adafactor

class torch_optimizer.Adafactor(params, lr=None, eps2=(1e-30, 0.001), clip_threshold=1.0, decay_rate=-0.8, beta1=None, weight_decay=0.0, scale_parameter=True, relative_step=True, warmup_init=False)[source]

Implements Adafactor algorithm.

It has been proposed in: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (Optional[float]) – external learning rate (default: None)

  • eps2 (Tuple[float, float]) – regularization constants for the square gradient and the parameter scale, respectively (default: (1e-30, 1e-3))

  • clip_threshold (float) – threshold of root mean square of final gradient update (default: 1.0)

  • decay_rate (float) – coefficient used to compute running averages of square gradient (default: -0.8)

  • beta1 (Optional[float]) – coefficient used for computing running averages of gradient (default: None)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

  • scale_parameter (bool) – if true, learning rate is scaled by root mean square of parameter (default: True)

  • relative_step (bool) – if true, time-dependent learning rate is computed instead of external learning rate (default: True)

  • warmup_init (bool) – whether to use warm-up initialization for the time-dependent learning rate (default: False)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.Adafactor(model.parameters())
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
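
With the defaults above (lr=None, relative_step=True) the learning rate is computed internally. A sketch that instead supplies a fixed external learning rate, assuming the usual convention of disabling the relative-step schedule and parameter scaling when doing so:

>>> # assumption: a fixed lr requires turning off the internal schedule
>>> optimizer = optim.Adafactor(
...     model.parameters(),
...     lr=1e-3,
...     scale_parameter=False,
...     relative_step=False,
...     warmup_init=False,
... )
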
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

AdamP

class torch_optimizer.AdamP(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, delta=0.1, wd_ratio=0.1, nesterov=False)[source]

Implements AdamP algorithm.

It has been proposed in Slowing Down the Weight Norm Increase in Momentum-based Optimizers.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • betas (Tuple[float, float]) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

  • delta (float) – threshold that determines whether a set of parameters is scale invariant or not (default: 0.1)

  • wd_ratio (float) – relative weight decay applied on scale-invariant parameters compared to that applied on scale-variant parameters (default: 0.1)

  • nesterov (bool) – enables Nesterov momentum (default: False)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.AdamP(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
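
A sketch passing the projection-related arguments explicitly; delta and wd_ratio restate the documented defaults, while the weight decay value and nesterov=True are illustrative choices:

>>> optimizer = optim.AdamP(
...     model.parameters(),
...     lr=1e-3,
...     betas=(0.9, 0.999),
...     weight_decay=1e-2,  # illustrative value
...     delta=0.1,          # scale-invariance threshold
...     wd_ratio=0.1,       # reduced decay on scale-invariant parameters
...     nesterov=True,      # illustrative: enable Nesterov momentum
... )
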

Note

Reference code: https://github.com/clovaai/AdamP

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

AggMo

class torch_optimizer.AggMo(params, lr=0.001, betas=(0.0, 0.9, 0.99), weight_decay=0)[source]

Implements Aggregated Momentum Gradient Descent.

It has been proposed in Aggregated Momentum: Stability Through Passive Damping.

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.AggMo(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
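
Judging from the signature, betas is a sequence of momentum coefficients, one per aggregated velocity buffer; a sketch passing it explicitly (the tuple restates the documented default, the weight decay value is illustrative):

>>> optimizer = optim.AggMo(
...     model.parameters(),
...     lr=0.1,
...     betas=(0.0, 0.9, 0.99),  # one momentum coefficient per velocity buffer
...     weight_decay=1e-4,       # illustrative L2 penalty
... )
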
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

DiffGrad

class torch_optimizer.DiffGrad(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0)[source]

Implements DiffGrad algorithm.

It has been proposed in DiffGrad: An Optimization Method for Convolutional Neural Networks.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • betas (Tuple[float, float]) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.DiffGrad(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

Lamb

class torch_optimizer.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, clamp_value=10, adam=False, debias=False)[source]

Implements Lamb algorithm.

It has been proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • betas (Tuple[float, float]) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float) – term added to the denominator to improve numerical stability (default: 1e-6)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

  • clamp_value (float) – clamp weight_norm to the range (0, clamp_value) (default: 10); set a high value (e.g. 10e3) to effectively disable clamping

  • adam (bool) – always use trust ratio = 1, which turns this into Adam. Useful for comparison purposes. (default: False)

  • debias (bool) – debias adam by (1 - beta**step) (default: False)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.Lamb(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
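
A sketch exercising the Lamb-specific switches; values other than the documented defaults are illustrative:

>>> optimizer = optim.Lamb(
...     model.parameters(),
...     lr=1e-3,
...     weight_decay=1e-2,  # illustrative value
...     clamp_value=10,     # weight-norm clamp; e.g. 10e3 effectively disables it
...     adam=False,         # True forces trust ratio = 1 (plain Adam), for comparison
...     debias=True,        # illustrative: apply the (1 - beta**step) debiasing
... )
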
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

NovoGrad

class torch_optimizer.NovoGrad(params, lr=0.001, betas=(0.95, 0), eps=1e-08, weight_decay=0, grad_averaging=False, amsgrad=False)[source]

Implements Novograd optimization algorithm.

It has been proposed in Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • betas (Tuple[float, float]) – coefficients used for computing running averages of gradient and its square (default: (0.95, 0))

  • eps (float) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

  • grad_averaging (bool) – gradient averaging (default: False)

  • amsgrad (bool) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)

Example

>>> import torch_optimizer as optim
>>> from torch.optim.lr_scheduler import StepLR
>>> optimizer = optim.NovoGrad(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> scheduler = StepLR(optimizer, step_size=1, gamma=0.7)
>>> optimizer.step()
>>> scheduler.step()
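
A sketch passing the NovoGrad-specific flags; the non-default choices below are illustrative, not recommendations:

>>> optimizer = optim.NovoGrad(
...     model.parameters(),
...     lr=1e-3,
...     betas=(0.95, 0.98),   # illustrative second beta; the default is (0.95, 0)
...     weight_decay=1e-3,    # illustrative L2 penalty
...     grad_averaging=True,  # illustrative: enable gradient averaging
...     amsgrad=False,
... )
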
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

PID

class torch_optimizer.PID(params, lr=0.001, momentum=0.0, dampening=0, weight_decay=0.0, integral=5.0, derivative=10.0)[source]

Implements PID optimization algorithm.

It has been proposed in A PID Controller Approach for Stochastic Optimization of Deep Networks.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • momentum (float) – momentum factor (default: 0.0)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0.0)

  • dampening (float) – dampening for momentum (default: 0.0)

  • derivative (float) – D part of the PID (default: 10.0)

  • integral (float) – I part of the PID (default: 5.0)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.PID(model.parameters(), lr=0.001, momentum=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
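
A sketch setting the P/I/D terms explicitly; integral and derivative restate the documented defaults, while the other values are illustrative:

>>> optimizer = optim.PID(
...     model.parameters(),
...     lr=1e-3,
...     momentum=0.9,       # illustrative momentum (P) factor
...     weight_decay=1e-4,  # illustrative L2 penalty
...     integral=5.0,       # I part of the PID
...     derivative=10.0,    # D part of the PID
... )
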
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

QHAdam

class torch_optimizer.QHAdam(params, lr=0.001, betas=(0.9, 0.999), nus=(1.0, 1.0), weight_decay=0.0, decouple_weight_decay=False, eps=1e-08)[source]

Implements the QHAdam optimization algorithm.

It has been proposed in Quasi-hyperbolic momentum and Adam for deep learning.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • betas (Tuple[float, float]) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • nus (Tuple[float, float]) – immediate discount factors used to estimate the gradient and its square (default: (1.0, 1.0))

  • eps (float) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

  • decouple_weight_decay (bool) – whether to decouple the weight decay from the gradient-based optimization step (default: False)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.QHAdam(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
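
A sketch with non-default discount factors and decoupled weight decay; all non-default values are illustrative, not recommendations:

>>> optimizer = optim.QHAdam(
...     model.parameters(),
...     lr=1e-3,
...     betas=(0.9, 0.999),
...     nus=(0.7, 1.0),              # illustrative immediate discount factors
...     weight_decay=1e-2,           # illustrative value
...     decouple_weight_decay=True,  # apply decay directly, not via the gradient
... )
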
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

QHM

class torch_optimizer.QHM(params, lr=0.001, momentum=0.0, nu=0.7, weight_decay=0.0, weight_decay_type='grad')[source]

Implements quasi-hyperbolic momentum (QHM) optimization algorithm.

It has been proposed in Quasi-hyperbolic momentum and Adam for deep learning.

Parameters
  • params – iterable of parameters to optimize or dicts defining parameter groups

  • lr – learning rate (default: 1e-3)

  • momentum – momentum factor (\(\beta\) from the paper)

  • nu – immediate discount factor (\(\nu\) from the paper)

  • weight_decay – weight decay (L2 regularization coefficient, times two) (default: 0.0)

  • weight_decay_type – method of applying the weight decay: "grad" for accumulation in the gradient (same as torch.optim.SGD) or "direct" for direct application to the parameters (default: "grad")

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.QHM(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
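
A sketch that selects direct weight decay instead of the default gradient-based accumulation; the weight decay value is illustrative:

>>> optimizer = optim.QHM(
...     model.parameters(),
...     lr=0.1,
...     momentum=0.9,
...     nu=0.7,                      # immediate discount factor
...     weight_decay=1e-4,           # illustrative value
...     weight_decay_type='direct',  # apply decay directly to the parameters
... )
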
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

RAdam

class torch_optimizer.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)[source]

Implements RAdam optimization algorithm.

It has been proposed in On the Variance of the Adaptive Learning Rate and Beyond.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • betas (Tuple[float, float]) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.RAdam(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

SGDP

class torch_optimizer.SGDP(params, lr=0.001, momentum=0, dampening=0, eps=1e-08, weight_decay=0, delta=0.1, wd_ratio=0.1, nesterov=False)[source]

Implements SGDP algorithm.

It has been proposed in Slowing Down the Weight Norm Increase in Momentum-based Optimizers.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • momentum (float) – momentum factor (default: 0)

  • dampening (float) – dampening for momentum (default: 0)

  • eps (float) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

  • delta (float) – threshold that determines whether a set of parameters is scale invariant or not (default: 0.1)

  • wd_ratio (float) – relative weight decay applied on scale-invariant parameters compared to that applied on scale-variant parameters (default: 0.1)

  • nesterov (bool) – enables Nesterov momentum (default: False)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.SGDP(model.parameters(), lr=0.1)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
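
A sketch passing the projection-related arguments explicitly; delta and wd_ratio restate the documented defaults, while the momentum, weight decay, and nesterov choices are illustrative:

>>> optimizer = optim.SGDP(
...     model.parameters(),
...     lr=0.1,
...     momentum=0.9,       # illustrative momentum factor
...     weight_decay=1e-4,  # illustrative L2 penalty
...     delta=0.1,          # scale-invariance threshold
...     wd_ratio=0.1,       # reduced decay on scale-invariant parameters
...     nesterov=True,      # illustrative: enable Nesterov momentum
... )
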

Note

Reference code: https://github.com/clovaai/AdamP

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

SGDW

class torch_optimizer.SGDW(params, lr=0.001, momentum=0.0, dampening=0.0, weight_decay=0.0, nesterov=False)[source]

Implements SGDW algorithm.

It has been proposed in Decoupled Weight Decay Regularization.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • momentum (float) – momentum factor (default: 0)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

  • dampening (float) – dampening for momentum (default: 0)

  • nesterov (bool) – enables Nesterov momentum (default: False)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.SGDW(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
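
Since SGDW decouples the weight decay from the gradient update, weight_decay here is the decoupled decay coefficient; a sketch with illustrative values:

>>> optimizer = optim.SGDW(
...     model.parameters(),
...     lr=0.1,
...     momentum=0.9,
...     weight_decay=1e-4,  # decoupled decay coefficient, illustrative value
...     nesterov=True,      # illustrative: enable Nesterov momentum
... )
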
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

Shampoo

class torch_optimizer.Shampoo(params, lr=0.1, momentum=0.0, weight_decay=0.0, epsilon=0.0001, update_freq=1)[source]

Implements Shampoo Optimizer Algorithm.

It has been proposed in Shampoo: Preconditioned Stochastic Tensor Optimization.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-1)

  • momentum (float) – momentum factor (default: 0)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

  • epsilon (float) – epsilon added to each mat_gbar_j for numerical stability (default: 1e-4)

  • update_freq (int) – update frequency to compute inverse (default: 1)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.Shampoo(model.parameters(), lr=0.01)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
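
Because recomputing the preconditioner inverses is relatively expensive, update_freq controls how often they are refreshed; a sketch with illustrative values:

>>> optimizer = optim.Shampoo(
...     model.parameters(),
...     lr=1e-1,
...     momentum=0.9,    # illustrative momentum factor
...     epsilon=1e-4,    # numerical-stability term added to each mat_gbar_j
...     update_freq=50,  # illustrative: recompute inverses every 50 steps
... )
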
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

SWATS

class torch_optimizer.SWATS(params, lr=0.001, betas=(0.9, 0.999), eps=0.001, weight_decay=0, amsgrad=False, nesterov=False)[source]

Implements SWATS Optimizer Algorithm. It has been proposed in Improving Generalization Performance by Switching from Adam to SGD.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-3)

  • betas (Tuple[float, float]) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float) – term added to the denominator to improve numerical stability (default: 1e-3)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

  • amsgrad (bool) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)

  • nesterov (bool) – enables Nesterov momentum (default: False)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.SWATS(model.parameters(), lr=0.01)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
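
A sketch enabling the optional variants; both flags and the weight decay value are illustrative choices, not recommendations:

>>> optimizer = optim.SWATS(
...     model.parameters(),
...     lr=1e-3,
...     betas=(0.9, 0.999),
...     weight_decay=1e-4,  # illustrative L2 penalty
...     amsgrad=True,       # illustrative: use the AMSGrad variant
...     nesterov=True,      # illustrative: enable Nesterov momentum
... )
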
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]

Yogi

class torch_optimizer.Yogi(params, lr=0.01, betas=(0.9, 0.999), eps=0.001, initial_accumulator=1e-06, weight_decay=0)[source]

Implements Yogi Optimizer Algorithm. It has been proposed in Adaptive methods for Nonconvex Optimization.

Parameters
  • params (Union[Iterable[Tensor], Iterable[Dict[str, Any]]]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate (default: 1e-2)

  • betas (Tuple[float, float]) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float) – term added to the denominator to improve numerical stability (default: 1e-3)

  • initial_accumulator (float) – initial values for first and second moments (default: 1e-6)

  • weight_decay (float) – weight decay (L2 penalty) (default: 0)

Example

>>> import torch_optimizer as optim
>>> optimizer = optim.Yogi(model.parameters(), lr=0.01)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
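
A sketch overriding the accumulator initialization; the initial_accumulator value restates the documented default, while the weight decay value is illustrative:

>>> optimizer = optim.Yogi(
...     model.parameters(),
...     lr=1e-2,
...     betas=(0.9, 0.999),
...     initial_accumulator=1e-6,  # initial value for the first and second moments
...     weight_decay=1e-4,         # illustrative L2 penalty
... )
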
step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (Optional[Callable[[], float]]) – A closure that reevaluates the model and returns the loss.

Return type

Optional[float]