zae_engine.data.dataset package

Submodules

zae_engine.data.dataset.parquet module

class zae_engine.data.dataset.parquet.ParquetDataset(parquet_paths: List[str] | Tuple[str, ...], fs, columns: List[str] = None, shuffle: bool = False)

Bases: Dataset

Custom PyTorch Dataset for loading and accessing data from multiple Parquet files efficiently.

This dataset handles multiple Parquet files by caching them and provides indexing to access individual samples. It supports shuffling of data and selecting specific columns for use.

Parameters:

parquet_paths (List[str] | Tuple[str, ...]) – List or tuple of paths to Parquet files.
fs – Filesystem object (e.g., fsspec filesystem) to handle file operations.
columns (List[str], optional) – Columns to read from the Parquet files. Defaults to None, which reads all columns.
shuffle (bool, optional) – Whether to shuffle the dataset indices. Defaults to False.
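
The description above mentions indexing individual samples across several files, optional shuffling, and column selection. The following is a minimal, standard-library sketch of one common way to do that kind of indexing: cumulative row counts plus a binary search. It is not the implementation of ParquetDataset. The class name MultiTableIndex, the seed argument, and the use of in-memory lists of dicts in place of Parquet files are all assumptions made for illustration, and the file reading and caching are left out entirely.

```python
import bisect
import random
from itertools import accumulate
from typing import Any, Dict, List, Optional, Sequence


class MultiTableIndex:
    """Hypothetical stand-in: maps a global sample index to (table, row).

    Each "table" is a list of row dicts standing in for one Parquet file.
    """

    def __init__(
        self,
        tables: Sequence[List[Dict[str, Any]]],
        columns: Optional[List[str]] = None,
        shuffle: bool = False,
        seed: int = 0,  # assumption: fixed seed so the sketch is reproducible
    ) -> None:
        self.tables = tables
        self.columns = columns
        # Cumulative row counts: table sizes [3, 2] give offsets [3, 5].
        self.offsets = list(accumulate(len(t) for t in tables))
        total = self.offsets[-1] if self.offsets else 0
        # Global indices, optionally permuted once at construction time.
        self.indices = list(range(total))
        if shuffle:
            random.Random(seed).shuffle(self.indices)

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, i: int) -> Dict[str, Any]:
        g = self.indices[i]
        # The first offset strictly greater than g identifies the table.
        table_id = bisect.bisect_right(self.offsets, g)
        start = self.offsets[table_id - 1] if table_id > 0 else 0
        row = self.tables[table_id][g - start]
        if self.columns is None:
            return dict(row)  # all columns
        return {c: row[c] for c in self.columns}


# Two in-memory "files" with 3 and 2 rows.
tables = [
    [{"x": 0, "y": "a"}, {"x": 1, "y": "b"}, {"x": 2, "y": "c"}],
    [{"x": 3, "y": "d"}, {"x": 4, "y": "e"}],
]
ds = MultiTableIndex(tables, columns=["x"])
first_row_of_second_table = ds[3]  # global index 3 is row 0 of table 1
```

In this sketch, shuffling permutes only the list of global indices, so the tables themselves are never reordered and a lookup costs one binary search over the offsets.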

Module contents