Data Preparation
Data Loading
The Image ML Pod uses Hugging Face's datasets library for image data handling. Specifically, HFImageFolderDataSet is a convenient wrapper around the ImageFolder dataset from that library, which loads datasets stored in a folder structure.
Key Features:
- Compatibility with Hugging Face's datasets library.
- Easy handling of standard folder-based datasets.
- Read-Only: The dataset does not support saving changes.
Directory Structure
Organize your dataset with the following structure:
data_dir/
├── train/
│   ├── class_a/
│   │   ├── image1.jpg
│   │   └── image2.jpg
│   └── ...
├── validation/
│   ├── class_a/
│   │   ├── image3.jpg
│   │   └── image4.jpg
│   └── ...
└── test/
    ├── class_a/
    │   ├── image5.jpg
    │   └── image6.jpg
    └── ...
- Splits: train, validation, and test directories.
- Labels: Subdirectories represent class labels (e.g., class_a, class_b).
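Before pointing a loader at data_dir, it can help to confirm that the folder actually matches the layout above. The following is a minimal sketch using only the standard library. The function name `summarize_image_folder` and the set of accepted file extensions are assumptions made for illustration; neither is part of the Image ML Pod.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}


def summarize_image_folder(data_dir, splits=("train", "validation", "test")):
    """Return {split: {class_name: image_count}} for an ImageFolder-style tree.

    Raises FileNotFoundError if an expected split directory is missing.
    """
    root = Path(data_dir)
    summary = {}
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"Missing split directory: {split_dir}")
        counts = {}
        # Each subdirectory of a split is treated as one class label.
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
        summary[split] = counts
    return summary
```

Calling `summarize_image_folder("data/01_raw/images")` returns a nested dictionary of image counts per split and class. An empty inner dictionary indicates a split with no class subdirectories.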
Using HFImageFolderDataSet
To integrate HFImageFolderDataSet into your Kedro project, add the following entry to your catalog.yml:
my_image_dataset:
  type: image_ml_pod.datasets.HFImageFolderDataSet
  data_dir: data/01_raw/images
Reference: Hugging Face ImageFolder Dataset
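To illustrate the read-only behavior listed under Key Features, here is a minimal sketch of what such a loader might look like. It is not the pod's actual implementation: the class name `ReadOnlyImageFolder` is hypothetical, and the sketch deliberately leaves out Kedro's dataset base class. The only library call it relies on is `datasets.load_dataset("imagefolder", data_dir=...)`, imported lazily so the class can be defined without the library installed.

```python
class ReadOnlyImageFolder:
    """Hypothetical read-only loader for an ImageFolder-style directory."""

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def load(self):
        # Imported lazily: requires the Hugging Face `datasets` package.
        from datasets import load_dataset

        # With train/validation/test subfolders this typically yields a
        # DatasetDict whose examples carry "image" and "label" columns.
        return load_dataset("imagefolder", data_dir=self.data_dir)

    def save(self, data):
        # Mirrors the read-only contract: writing back is not supported.
        raise NotImplementedError(
            f"{type(self).__name__} is read-only; cannot save to {self.data_dir}"
        )
```

Raising in `save` makes accidental writes to the raw image folder fail loudly rather than silently doing nothing.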
Processed Dataset Handling
For saving and loading processed datasets, use HFDatasetWrapper. This ensures datasets are stored in a format suitable for training pipelines or inference workflows.
To include it in your Kedro project:
my_huggingface_dataset:
  type: image_ml_pod.datasets.HFDatasetWrapper
  dataset_path: data/processed/my_dataset
Reference: Hugging Face Dataset Documentation
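Saving and reloading a processed dataset can be sketched with `save_to_disk` and `load_from_disk` from the `datasets` library. Whether HFDatasetWrapper uses exactly these calls is an assumption; the class name `DiskBackedDataset` below is hypothetical and only stands in for the real wrapper.

```python
class DiskBackedDataset:
    """Hypothetical save/load wrapper around a Hugging Face dataset on disk."""

    def __init__(self, dataset_path):
        self.dataset_path = dataset_path

    def save(self, dataset):
        # `Dataset` and `DatasetDict` both expose `save_to_disk`.
        dataset.save_to_disk(self.dataset_path)

    def load(self):
        # Imported lazily: requires the Hugging Face `datasets` package.
        from datasets import load_from_disk

        return load_from_disk(self.dataset_path)
```

A node that returns a processed dataset would hand it to `save`, and a downstream training or inference node would receive the result of `load`.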
Data Preprocessing
The data_preprocessing pipeline provides a modular approach for preparing datasets for training. It includes nodes for:
- Loading Data: Reads images from the source directory or dataset.
- Applying Transforms: Incorporates torchvision.transforms for preprocessing tasks such as resizing, normalization, and augmentation.
- Saving Processed Data: Stores processed datasets for use in downstream tasks.
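The transform step in the list above can be sketched as a small helper that lifts a per-image transform to a whole batch. This is an illustrative outline only: the function name `make_batch_transform` and the `"image"` and `"pixel_values"` keys are assumptions, not the pod's actual node definitions.

```python
def make_batch_transform(transform, image_key="image", output_key="pixel_values"):
    """Wrap a per-image `transform` so it operates on a batch dict of lists."""

    def apply(batch):
        out = dict(batch)  # shallow copy so the input batch is not mutated
        out[output_key] = [transform(img) for img in batch[image_key]]
        return out

    return apply
```

A function of this shape (batch dict in, batch dict out) is what `Dataset.with_transform` and `Dataset.map(..., batched=True)` in the `datasets` library accept, so `dataset.with_transform(make_batch_transform(transform))` would apply the transform lazily as examples are accessed.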
Customization
- Template Nodes: Edit template nodes for specific preprocessing tasks such as custom augmentation or data splitting.
- Dataset Format: Use the set_format method of the Dataset class to convert data into formats like torch.Tensor or numpy.ndarray. Note: By default, images loaded from ImageFolder are in the PIL format.
- Single-Image Support: Modify nodes to accept single-image inputs, enabling reusability for inference workflows.
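One way to approach single-image support is to let the same function accept either one image or a sequence of images. The sketch below assumes the transform is passed in as a callable; the name `preprocess_images` is hypothetical and does not refer to an existing node in the pod.

```python
def preprocess_images(images, transform):
    """Apply `transform` to one image or to a list/tuple of images.

    Returns a single result for a single input and a list for a sequence,
    so the same function can serve batch training and single-image inference.
    """
    if isinstance(images, (list, tuple)):
        return [transform(img) for img in images]
    return transform(images)
```

Dispatching on the input type keeps the preprocessing logic in one place, so training and inference cannot drift apart by using different transforms.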
Example: Adding a custom resize and normalize transform:
from torchvision import transforms

# Resize to 224x224, convert to a tensor in [0, 1], then normalize per channel.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
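To make the effect of `transforms.Normalize` concrete: for each channel it computes `(value - mean) / std`. The pure-Python sketch below reproduces that arithmetic for a single RGB pixel already scaled to [0, 1], as `ToTensor` does for 8-bit images. The helper name `normalize_pixel` is hypothetical and is not a torchvision function.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (value - mean) / std to each channel of one RGB pixel."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(MEAN))  # [0.0, 0.0, 0.0]
```

These mean and standard deviation values are the ones commonly used with models pretrained on ImageNet. If your model was trained with different statistics, substitute those instead.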