zae_engine.trainer.addons package

Submodules

zae_engine.trainer.addons.core module

class zae_engine.trainer.addons.core.AddOnBase[source]

Bases: ABC

Base class for defining add-ons for the Trainer class.

Add-ons allow you to extend the functionality of the Trainer class dynamically. By inheriting from AddOnBase, subclasses can implement custom functionality that can be integrated into the Trainer workflow.

apply(cls, base_cls)[source]

Apply the add-on to the specified base class, modifying its behavior or adding new features. This method must be implemented by subclasses.

Notes

  • Add-ons are designed to be composable, meaning you can apply multiple add-ons to a single Trainer class; see the sketch after this list.

  • Subclasses must override the apply method to specify how the add-on modifies the behavior of the Trainer class.
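
A minimal sketch of such composition, using two add-ons documented later on this page (the resulting class still takes each add-on's constructor arguments, e.g. save_path, at instantiation):

>>> from zae_engine.trainer import Trainer
>>> from zae_engine.trainer.addons import StateManagerAddon, WandBLoggerAddon
>>> # Each add-on wraps the class produced by the previous one.
>>> ComposedTrainer = Trainer.add_on(StateManagerAddon, WandBLoggerAddon)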

Examples

Creating a custom add-on:

>>> from zae_engine.trainer.addons import AddOnBase
>>> class CustomLoggerAddon(AddOnBase):
...     @classmethod
...     def apply(cls, base_cls):
...         class TrainerWithCustomLogger(base_cls):
...             def __init__(self, *args, custom_param=None, **kwargs):
...                 super().__init__(*args, **kwargs)
...                 self.custom_param = custom_param
...             def logging(self, step_dict):
...                 super().logging(step_dict)
...                 print(f"Custom logging: {step_dict}")
...         return TrainerWithCustomLogger

Applying the custom add-on to a Trainer:

>>> from zae_engine.trainer import Trainer
>>> MyTrainer = Trainer.add_on(CustomLoggerAddon)
>>> trainer = MyTrainer(
...     model=my_model,
...     device='cuda',
...     optimizer=my_optimizer,
...     scheduler=my_scheduler,
...     custom_param="Log everything"
... )
>>> trainer.run(n_epoch=10, loader=train_loader)

abstract classmethod apply(base_cls: Type[T]) → Type[T][source]

Apply the add-on modifications to a base class.

This method is used to inject additional functionality or modify the behavior of the Trainer class. It returns a new class that combines the original Trainer with the behavior defined in the add-on.

Parameters:

base_cls (Type[T]) – The base class to which the add-on modifications will be applied. This is typically the Trainer class or a subclass of it.

Returns:

A new class that includes the functionality of the base class along with the additional behavior defined by the add-on.

Return type:

Type[T]

Notes

  • Subclasses of AddOnBase must implement this method to define how the add-on modifies the base class.

  • This method is typically called indirectly through the Trainer.add_on method.

Examples

Custom implementation in an add-on:

>>> class CustomAddon(AddOnBase):
...     @classmethod
...     def apply(cls, base_cls):
...         class TrainerWithCustomAddon(base_cls):
...             def custom_method(self):
...                 print("Custom method called")
...         return TrainerWithCustomAddon

Adding the custom add-on to a Trainer:

>>> from zae_engine.trainer import Trainer
>>> MyTrainer = Trainer.add_on(CustomAddon)
>>> trainer = MyTrainer(model=my_model, device='cuda', optimizer=my_optimizer)
>>> trainer.custom_method()  # Output: Custom method called

zae_engine.trainer.addons.mix_precision module

class zae_engine.trainer.addons.mix_precision.PrecisionMixerAddon[source]

Bases: AddOnBase

Add-on for mixed precision training.

This add-on enables mixed precision training to improve computational efficiency and reduce memory usage. It supports automatic precision selection or user-defined precision settings (e.g., ‘fp32’, ‘fp16’, ‘bf16’).

Parameters:

precision (Union[str, list], optional) – The precision setting for training. Default is “auto”.

  • “auto”: Automatically selects the best precision based on hardware capabilities.

  • “fp32”: Uses full precision (the PyTorch default).

  • “fp16”: Uses half precision for accelerated computation.

  • “bf16”: Uses Brain Float 16 precision on supported hardware.

  • list: Specifies a priority order for precision (e.g., [“bf16”, “fp16”]).

run_batch(batch, **kwargs)

Override the batch processing method to apply mixed precision.

Notes

  • Mixed precision improves training speed and reduces memory usage by performing certain operations in lower precision (e.g., FP16) while maintaining stability in others (e.g., FP32 for loss calculation).

  • Automatically handles loss scaling via torch.cuda.amp.GradScaler to prevent overflow issues when using FP16; the sketch after this list shows the underlying pattern.
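
For reference, the standard PyTorch autocast/GradScaler pattern that mixed-precision training is built on looks roughly like this. It is a sketch of the general technique, not this add-on's actual source; model, criterion, optimizer, and loader are assumed to be defined:

>>> import torch
>>> scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 gradient underflow
>>> for x, y in loader:
...     optimizer.zero_grad()
...     with torch.autocast(device_type='cuda', dtype=torch.float16):
...         loss = criterion(model(x.cuda()), y.cuda())  # forward pass in reduced precision
...     scaler.scale(loss).backward()  # backward on the scaled loss
...     scaler.step(optimizer)         # unscales gradients, then steps the optimizer
...     scaler.update()                # adjusts the scale factor for the next step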

Examples

Using PrecisionMixerAddon with auto precision:

>>> from zae_engine.trainer import Trainer
>>> from zae_engine.trainer.addons import PrecisionMixerAddon
>>> MyTrainer = Trainer.add_on(PrecisionMixerAddon)
>>> trainer = MyTrainer(
...     model=my_model,
...     device='cuda',
...     optimizer=my_optimizer,
...     scheduler=my_scheduler,
...     precision='auto'  # Automatically selects best precision
... )
>>> trainer.run(n_epoch=10, loader=train_loader, valid_loader=valid_loader)

Using a priority list for precision:

>>> trainer = MyTrainer(
...     model=my_model,
...     device='cuda',
...     optimizer=my_optimizer,
...     scheduler=my_scheduler,
...     precision=["bf16", "fp16"]  # Tries bf16 first, falls back to fp16
... )
>>> trainer.run(n_epoch=10, loader=train_loader)

classmethod apply(base_cls: Type[T]) → Type[T][source]

Apply the add-on modifications to a base class, returning a subclass of base_cls whose run_batch applies mixed precision. See AddOnBase.apply() for the full contract and usage examples.

zae_engine.trainer.addons.mpu module

class zae_engine.trainer.addons.mpu.MultiGPUAddon[source]

Bases: AddOnBase

Add-on for distributed multi-GPU training.

This add-on enables distributed training across multiple GPUs using PyTorch’s Distributed Data Parallel (DDP). It handles process initialization, data distribution, and model synchronization for efficient multi-GPU training.

Parameters:

init_method (str, optional) – Initialization method for the distributed process group, typically a URL in the format tcp://hostname:port. Default is ‘tcp://localhost:12355’.

run(n_epoch, loader, valid_loader=None, **aux_run_kwargs)

Run the distributed training or testing process across multiple GPUs.

train_process(rank, device_list, init_method, n_epoch, loader, valid_loader, aux_run_kwargs)

Train the model on a specific GPU in the distributed setup.

Notes

This add-on requires multiple GPUs to be available and properly configured.
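
For context, the per-GPU process setup that DDP training normally requires, and which this add-on is described as handling for you, looks roughly like the following standard-PyTorch sketch (train_worker is an illustrative name, not part of this package):

>>> import torch.distributed as dist
>>> from torch.nn.parallel import DistributedDataParallel as DDP
>>> def train_worker(rank, world_size, model):
...     # One process per GPU; all processes rendezvous at init_method.
...     dist.init_process_group("nccl", init_method="tcp://localhost:12355",
...                             rank=rank, world_size=world_size)
...     model = DDP(model.to(rank), device_ids=[rank])  # wrap for gradient synchronization
...     # ... run the training loop; gradients are all-reduced across ranks ...
...     dist.destroy_process_group()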

Examples

Using MultiGPUAddon for distributed training:

>>> import torch
>>> from zae_engine.trainer import Trainer
>>> from zae_engine.trainer.addons import MultiGPUAddon
>>> MyTrainer = Trainer.add_on(MultiGPUAddon)
>>> trainer = MyTrainer(
...     model=my_model,
...     device=[torch.device('cuda:0'), torch.device('cuda:1')],
...     optimizer=my_optimizer,
...     scheduler=my_scheduler
... )
>>> trainer.run(n_epoch=10, loader=train_loader)

classmethod apply(base_cls: Type[T]) → Type[T][source]

Apply the add-on modifications to a base class, returning a subclass of base_cls whose run method distributes training across the configured GPUs. See AddOnBase.apply() for the full contract and usage examples.

zae_engine.trainer.addons.state_manager module

class zae_engine.trainer.addons.state_manager.StateManagerAddon(save_path: str, save_format: str = 'ckpt')[source]

Bases: AddOnBase

Add-on to manage model, optimizer, and scheduler state.

This add-on provides functionality for saving and loading the state of the model, optimizer, and scheduler during training. It supports .ckpt and .safetensor formats for storing model weights.

Parameters:
  • save_path (str) – Path to the directory where the model, optimizer, and scheduler states will be saved.

  • save_format (str, optional) – Format to save the model state, either ‘ckpt’ or ‘safetensor’. Default is ‘ckpt’.

save_state()

Save the state of the model, optimizer, and scheduler.

load_state()

Load the state of the model, optimizer, and scheduler.

save_model(filename: str)

Save the model’s state dictionary to a file.

save_optimizer()

Save the optimizer’s state dictionary.

save_scheduler()

Save the scheduler’s state dictionary.

Notes

This add-on automatically saves the model state whenever training reaches a new best (lowest) loss.
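
For reference, persisting and restoring the three state dictionaries in plain PyTorch looks roughly like this; the single-file layout shown is an illustrative assumption, not necessarily the add-on's actual on-disk format:

>>> import torch
>>> state = {"model": model.state_dict(),  # model, optimizer, scheduler assumed defined
...          "optimizer": optimizer.state_dict(),
...          "scheduler": scheduler.state_dict()}
>>> torch.save(state, "./checkpoints/state.ckpt")
>>> restored = torch.load("./checkpoints/state.ckpt")
>>> model.load_state_dict(restored["model"])

For the ‘safetensor’ format, model weights can be written with the safetensors package, e.g. save_file(model.state_dict(), path) from safetensors.torch.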

Examples

Adding StateManagerAddon to a Trainer:

>>> from zae_engine.trainer import Trainer
>>> from zae_engine.trainer.addons import StateManagerAddon
>>> MyTrainer = Trainer.add_on(StateManagerAddon)
>>> trainer = MyTrainer(
...     model=my_model,
...     device='cuda',
...     optimizer=my_optimizer,
...     scheduler=my_scheduler,
...     save_path='./checkpoints'
... )
>>> trainer.run(n_epoch=10, loader=train_loader)

classmethod apply(base_cls: Type[T]) → Type[T][source]

Apply the add-on modifications to a base class, returning a subclass of base_cls that saves and restores model, optimizer, and scheduler state during training. See AddOnBase.apply() for the full contract and usage examples.

zae_engine.trainer.addons.web_logger module

class zae_engine.trainer.addons.web_logger.NeptuneLoggerAddon[source]

Bases: AddOnBase

Add-on for real-time logging with Neptune.

This add-on integrates Neptune into the training process, enabling real-time logging of metrics and other training details. It also provides functionality to monitor and track experiments remotely.

Parameters:

web_logger (dict, optional) – Configuration dictionary for initializing Neptune. Must include a key ‘neptune’ with Neptune initialization parameters, such as ‘project_name’ and ‘api_tkn’.

logging(step_dict: Dict[str, torch.Tensor])

Log metrics to Neptune during each step.

init_neptune(params: dict)

Initialize a Neptune run with the given parameters.

Notes

This add-on requires Neptune to be installed and a valid API token to be available. Ensure your Neptune project is properly set up to track experiments.
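
For orientation, the underlying Neptune client calls look like this (standard neptune package API; presumably the add-on's project_name and api_tkn keys map onto these arguments, and the metric name is illustrative):

>>> import neptune
>>> run = neptune.init_run(project="my_workspace/my_project", api_token="your_api_token")
>>> run["train/loss"].append(0.42)  # log one scalar value per training step
>>> run.stop()                      # flush buffered data and close the run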

Examples

Using NeptuneLoggerAddon for real-time logging:

>>> from zae_engine.trainer import Trainer
>>> from zae_engine.trainer.addons import NeptuneLoggerAddon
>>> MyTrainer = Trainer.add_on(NeptuneLoggerAddon)
>>> trainer = MyTrainer(
...     model=my_model,
...     device='cuda',
...     optimizer=my_optimizer,
...     scheduler=my_scheduler,
...     web_logger={"neptune": {"project_name": "my_workspace/my_project", "api_tkn": "your_api_token"}}
... )
>>> trainer.run(n_epoch=10, loader=train_loader)

Adding multiple loggers, including Neptune:

>>> from zae_engine.trainer.addons import WandBLoggerAddon
>>> MyTrainerWithLoggers = Trainer.add_on(WandBLoggerAddon, NeptuneLoggerAddon)
>>> trainer_with_loggers = MyTrainerWithLoggers(
...     model=my_model,
...     device='cuda',
...     optimizer=my_optimizer,
...     scheduler=my_scheduler,
...     web_logger={
...         "wandb": {"project": "my_wandb_project"},
...         "neptune": {"project_name": "my_workspace/my_neptune_project", "api_tkn": "your_api_token"}
...     }
... )
>>> trainer_with_loggers.run(n_epoch=10, loader=train_loader)

classmethod apply(base_cls: Type[T]) → Type[T][source]

Apply the add-on modifications to a base class, returning a subclass of base_cls that logs metrics to Neptune during training. See AddOnBase.apply() for the full contract and usage examples.

class zae_engine.trainer.addons.web_logger.WandBLoggerAddon[source]

Bases: AddOnBase

Add-on for real-time logging with Weights & Biases (WandB).

This add-on integrates WandB into the training process, allowing users to log metrics and monitor training progress in real-time.

Parameters:

web_logger (dict, optional) – Configuration dictionary for initializing WandB. Must include a key ‘wandb’ with WandB initialization parameters.

logging(step_dict: Dict[str, torch.Tensor])

Log metrics to WandB during each step.

init_wandb(params: dict)

Initialize WandB with the given parameters.

Notes

This add-on requires WandB to be installed and a valid API key to be available.
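
For orientation, the direct WandB calls that per-step logging reduces to look like this (standard wandb API; the metric name is illustrative):

>>> import wandb
>>> run = wandb.init(project="my_project")  # reads the API key from the local wandb login
>>> wandb.log({"train/loss": 0.42})         # log metrics once per step
>>> run.finish()                            # mark the run as complete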

Examples

Using WandBLoggerAddon for real-time logging:

>>> from zae_engine.trainer import Trainer
>>> from zae_engine.trainer.addons import WandBLoggerAddon
>>> MyTrainer = Trainer.add_on(WandBLoggerAddon)
>>> trainer = MyTrainer(
...     model=my_model,
...     device='cuda',
...     optimizer=my_optimizer,
...     scheduler=my_scheduler,
...     web_logger={"wandb": {"project": "my_project"}}
... )
>>> trainer.run(n_epoch=10, loader=train_loader)

classmethod apply(base_cls: Type[T]) → Type[T][source]

Apply the add-on modifications to a base class, returning a subclass of base_cls that logs metrics to WandB during training. See AddOnBase.apply() for the full contract and usage examples.

Module contents