PyTorch DataLoader Guide: Basics, Advanced Tips & Error Fixes

1. Introduction

PyTorch is one of the most popular deep learning frameworks and is widely used in research and industry. In particular, it provides a tool called “DataLoader” to streamline data preprocessing and mini‑batch management. This article provides an in‑depth look at the role and usage of PyTorch’s DataLoader, as well as how to create custom datasets. It also covers common errors and their solutions, making it useful for everyone from beginners to intermediate users.

By reading this article, you will learn:

  • The basic role and usage examples of PyTorch’s DataLoader
  • How to create custom datasets and apply them
  • Common errors and their solutions
If you’re planning to start using PyTorch or are already using it but struggling with data management, please read on to the end.

2. What is DataLoader? Its Role and Importance

What is DataLoader?
PyTorch’s DataLoader is a tool that efficiently extracts data from a dataset and supplies it in a format suitable for model training. Its main features include the following points.
  • Mini-batch processing: Splits a large dataset into small batches that fit in GPU memory.
  • Shuffle functionality: Randomly reorders the data each epoch so the model does not learn the order of the samples.
  • Parallel processing: Loads data with multiple worker processes to reduce training time.
Why DataLoader is needed
In machine learning, data preprocessing and batching occur constantly. Managing all of this manually is time-consuming and makes the code cumbersome. Using DataLoader provides the following benefits (a minimal usage sketch follows the list).
  1. Efficient data management: Automate batch division and order control of data.
  2. Flexible customization: Easily implement preprocessing and transformations tailored to specific tasks.
  3. High versatility: Works with diverse datasets regardless of data type or format.
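Below is a minimal sketch of these three features on toy data: mini-batching, shuffling, and multi-process loading configured in a single DataLoader call.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 256 samples with 4 features each and binary labels
toy_dataset = TensorDataset(torch.randn(256, 4), torch.randint(0, 2, (256,)))

# Mini-batching, shuffling, and multi-process loading configured in one place
loader = DataLoader(toy_dataset, batch_size=32, shuffle=True, num_workers=2)

for features, targets in loader:
    print(features.shape, targets.shape)  # torch.Size([32, 4]) torch.Size([32])
    break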

3. Relationship between Dataset and DataLoader

Role of the Dataset class
The Dataset class serves as the foundation for data management in PyTorch, making it easy to load and customize datasets.
Key features of Dataset:
  1. Data storage: Efficiently stores data in memory or on disk.
  2. Access capability: Provides data retrieval via indexing.
  3. Customizable: Supports creating user-defined datasets.
Below is an example using a dataset built into torchvision (MNIST).
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor()])
train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
Integration with DataLoader
The Dataset defines the data itself, while the DataLoader is responsible for feeding that data to the model. For example, the code below processes the MNIST dataset with a DataLoader.
from torch.utils.data import DataLoader

train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)
Thus, the DataLoader provides a convenient interface for retrieving data from a Dataset and supplying it to the model in batches.

4. Basic Usage of DataLoader

Here we explain how to use PyTorch’s DataLoader in detail. By understanding the basic syntax and configuration options, you can acquire practical skills.

1. Basic Syntax of DataLoader

Below is a basic code example of DataLoader.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Sample data
data = torch.randn(100, 10)  # 100 samples, each sample is 10-dimensional
labels = torch.randint(0, 2, (100,))  # Labels of 0 or 1

# Create dataset with TensorDataset
dataset = TensorDataset(data, labels)

# DataLoader configuration
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
Key points:
  1. TensorDataset: Used to handle data and labels as pairs.
  2. batch_size=32: Sets the mini-batch size to 32.
  3. shuffle=True: Randomly shuffles data to prevent training bias.

2. Main Arguments and Settings of DataLoader

DataLoader has the following important arguments.
  • batch_size: The number of samples retrieved per iteration. Example: batch_size=64
  • shuffle: Whether to randomly reorder the data each epoch. Default is False. Example: shuffle=True
  • num_workers: The number of worker processes used to load data in parallel. Default is 0 (loading in the main process). Example: num_workers=4
  • drop_last: Whether to discard the last batch if it contains fewer than batch_size samples. Example: drop_last=True
  • pin_memory: Loads batches into pinned (page-locked) memory to speed up transfer to the GPU. Example: pin_memory=True (effective when using a GPU)
Example: The following code creates a DataLoader with parallel processing and pinned memory enabled.
dataloader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)

3. Example of Data Retrieval with DataLoader

Let’s see how to retrieve data from a DataLoader.
for batch_idx, (inputs, targets) in enumerate(dataloader):
    print(f"Batch {batch_idx+1}")
    print("Inputs:", inputs.shape)  # Display the shape of data within the batch
    print("Targets:", targets.shape)  # Display the shape of labels within the batch
This code loops over each batch’s index and data.
  • inputs.shape: Shows the shape of each batch, e.g., (32, 10) for 32 samples with 10 features each.
  • targets.shape: The number and shape of the labels can be inspected in the same way.

4. Why Shuffle the Dataset

The DataLoader option shuffle=True randomizes the order of data. This provides the following benefits.
  • Prevent bias: Seeing the data in the same order every epoch lets the model latch onto order-specific patterns; shuffling removes that dependence (a short sketch follows this list).
  • Improved generalization: Randomizing the order gives each mini-batch a more varied mix of samples, which helps the model generalize.
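As a quick check, the sketch below reuses the dataloader defined above and prints the labels of the first batch over two passes; with shuffle=True the contents differ between passes, with shuffle=False they would not.
# Each new iterator over the DataLoader reshuffles the data when shuffle=True
for pass_idx in range(2):
    first_inputs, first_targets = next(iter(dataloader))
    print(f"Pass {pass_idx + 1}: first batch labels -> {first_targets[:8].tolist()}")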

5. How to Create a Custom Dataset

In PyTorch, you may need to use data that isn’t included in the standard datasets. In such cases, you create a custom Dataset and use it together with a DataLoader. This section explains the steps for creating a custom Dataset in detail.

1. When a Custom Dataset Is Needed

A custom Dataset is required in situations such as the following.
  • Data in a proprietary format: Images, text, CSV files, or other formats not covered by standard datasets.
  • When you want to automate data preprocessing: Applying specific preprocessing steps such as scaling or filtering.
  • Complex label structures: Cases where labels consist of multiple values or where data pairs images with text.

2. Basic Structure of a Custom Dataset

To create a custom Dataset in PyTorch, subclass torch.utils.data.Dataset and implement the following three methods (a bare-bones skeleton is sketched after the list).
  1. __init__: Initialize the dataset. Define file loading and preprocessing.
  2. __len__: Return the number of samples in the dataset.
  3. __getitem__: Return the data and label for a given index.
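Here is a bare-bones skeleton of that structure, with in-memory lists standing in for real data; the CSV example in the next subsection fills it in concretely.
from torch.utils.data import Dataset

class MinimalDataset(Dataset):
    def __init__(self, samples, labels):
        # Store (or load) the data; a real dataset would read files here
        self.samples = samples
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset
        return len(self.samples)

    def __getitem__(self, idx):
        # Return one (sample, label) pair for the given index
        return self.samples[idx], self.labels[idx]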

3. Concrete Example of a Custom Dataset

Below is an example that handles data stored in a CSV file.
Example: Custom Dataset Using a CSV File
import torch
from torch.utils.data import Dataset
import pandas as pd

class CustomDataset(Dataset):
    def __init__(self, csv_file):
        # Load data
        self.data = pd.read_csv(csv_file)
        # Split features and labels
        self.features = self.data.iloc[:, :-1].values  # Use all columns except the last as features
        self.labels = self.data.iloc[:, -1].values     # Use the last column as labels

    def __len__(self):
        # Return the number of samples
        return len(self.features)

    def __getitem__(self, idx):
        # Return data and label for the given index
        sample = torch.tensor(self.features[idx], dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        return sample, label
Key Points:
  1. __init__: Loads the CSV file and splits features and labels.
  2. __len__: Returns the number of samples so the DataLoader knows the size.
  3. __getitem__: Returns the data and label accessed by index as tensors.

4. Integration with DataLoader

Here is an example of using the custom Dataset with a DataLoader.
# Instantiate the dataset
dataset = CustomDataset(csv_file='data.csv')

# Configure the DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Example of retrieving data
for inputs, labels in dataloader:
    print("Inputs:", inputs.shape)
    print("Labels:", labels.shape)
Key Points:
  • batch_size=32: Sets the mini-batch size to 32.
  • shuffle=True: Randomizes the order of the data.
In this way, you can manage custom datasets flexibly.

5. Example Application: Custom Dataset for Image Data

Below is an example of a custom Dataset that handles image data and labels.
from PIL import Image
import os

class ImageDataset(Dataset):
    def __init__(self, image_dir, transform=None):
        self.image_dir = image_dir
        self.image_files = os.listdir(image_dir)
        self.transform = transform

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        img_path = os.path.join(self.image_dir, self.image_files[idx])
        image = Image.open(img_path).convert('RGB')  # ensure a consistent 3-channel format

        # Apply the transformation if one was provided
        if self.transform:
            image = self.transform(image)

        label = 1 if 'dog' in img_path else 0  # Labeling based on the filename
        return image, label
Key Points:
  • Image transformation: The transform parameter lets you apply preprocessing such as resizing or normalization.
  • Labeling by filename: A simple example of generating labels. A usage sketch with a transform pipeline and DataLoader follows.
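The sketch below shows one way to use this class; the './images' directory and its file names are assumptions for illustration.
from torchvision import transforms
from torch.utils.data import DataLoader

# Preprocessing pipeline: resize every image and convert it to a tensor
transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])

# './images' is a hypothetical folder containing files such as 'dog_001.jpg'
dataset = ImageDataset(image_dir='./images', transform=transform)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)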

6. Advanced Techniques and Best Practices

In this section, we introduce advanced techniques and best practices for using PyTorch’s DataLoader more efficiently. By incorporating these techniques, you can significantly improve the speed and flexibility of data processing.

1. Speeding Up Data Loading with Parallel Processing

Problem: When the dataset becomes large, loading data with a single process is inefficient. In particular, data such as images or audio takes time to load, which can slow down training.
Solution: Set the num_workers argument to load data concurrently across multiple worker processes, improving throughput.
Example: DataLoader with Multiple Worker Processes
from torch.utils.data import DataLoader

# DataLoader configuration
dataloader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for batch in dataloader:
    # Data processing
    pass
Key Points:
  • num_workers=4: Sets the number of data-loading worker processes to 4. Tune this to your CPU core count and how heavy per-sample loading is.
  • Note: On Windows, multiprocessing needs special care; wrapping DataLoader creation and the loop over it in an if __name__ == '__main__': guard prevents errors (a sketch follows).
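Below is a minimal sketch of that guard. On platforms that use the 'spawn' start method (Windows, and macOS by default), worker processes re-import the script, so the DataLoader and the loop over it should live inside the guard.
from torch.utils.data import DataLoader

if __name__ == '__main__':
    # Created inside the guard so worker processes can be started safely
    dataloader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
    for batch in dataloader:
        pass  # training step goes here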

2. Optimizing Memory Usage Efficiency

Problem: When using a GPU, transferring data from the CPU to the GPU can become a bottleneck.
Solution: Setting pin_memory=True places batches in pinned (page-locked) host memory, enabling faster transfers to the GPU.
Example: Fast Transfer Configuration
dataloader = DataLoader(dataset, batch_size=64, shuffle=True, pin_memory=True)
Key Points:
  • Especially effective when using a GPU; it provides no benefit in CPU-only environments. It pairs well with non-blocking transfers, as sketched below.
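Below is a sketch of that pairing, assuming a CUDA device and the dataloader configured above; when the source tensors are pinned, non_blocking=True lets the host-to-GPU copy overlap with computation.
import torch

device = torch.device('cuda')
for inputs, targets in dataloader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward pass, loss, backward pass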

3. Controlling Data with Samplers

Problem: When classes are imbalanced or you only want to draw samples that satisfy certain conditions, plain shuffling is not enough.
Solution: Use a sampler to control which samples are drawn and how often.
Example: Handling Imbalanced Data with WeightedRandomSampler
from torch.utils.data import WeightedRandomSampler

# Per-sample weights (assumes dataset.labels holds class IDs 0 and 1): class 1 is drawn more often
weights = [0.1 if label == 0 else 0.9 for label in dataset.labels]
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

# DataLoader configuration
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
Key Points:
  • Handling imbalanced data: Adjust how often each class is drawn to improve training balance.
  • Random sampling: Samples are drawn with probability proportional to their weights (with replacement here). The sketch below derives the weights from class counts instead of hard-coding them.
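Instead of hard-coding the weights, they can be derived from the class frequencies. Below is a sketch that assumes dataset.labels holds integer class IDs, as in the custom CSV Dataset above.
from collections import Counter
from torch.utils.data import WeightedRandomSampler

# Count how often each class appears
counts = Counter(int(label) for label in dataset.labels)

# Inverse-frequency weight per class, then one weight per sample
class_weight = {cls: 1.0 / n for cls, n in counts.items()}
weights = [class_weight[int(label)] for label in dataset.labels]

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)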

4. Improving Training Accuracy with Data Augmentation

Problem: Small datasets can lead to poor generalization.
Solution: Apply data augmentation to image or text data to increase its diversity.
Example: Image Augmentation with torchvision.transforms
from torchvision import transforms

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),   # Random horizontal flip
    transforms.RandomRotation(10),      # Random rotation up to 10 degrees
    transforms.ToTensor(),              # Convert to tensor
    transforms.Normalize((0.5,), (0.5,)) # Normalization
])
Key Points:
  • Data augmentation is effective for preventing overfitting and improving accuracy.
  • Augmentation can be applied flexibly by passing the pipeline to a custom Dataset or a torchvision dataset, as sketched below.
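The pipeline above plugs into any Dataset that accepts a transform argument. The sketch below reuses the ImageDataset from section 5 with a hypothetical './images' directory.
from torch.utils.data import DataLoader

# Each __getitem__ call applies the augmentation pipeline to the loaded image
augmented_dataset = ImageDataset(image_dir='./images', transform=transform)
augmented_loader = DataLoader(augmented_dataset, batch_size=32, shuffle=True)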

5. Batch Processing and Distributed Training for Large Datasets

Problem: With large datasets, the memory and compute of a single device can reach their limits.
Solution: Use batch processing and distributed training to train efficiently across multiple devices.
Example: Distributed Loading with torch.utils.data.DistributedSampler
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
Key Points:
  • In a distributed training environment, the computational load can be spread across multiple GPUs or nodes.
  • Combining the sampler with DataLoader keeps data handling efficient; when shuffling, call set_epoch every epoch so each epoch sees a different ordering (see the sketch below).
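A sketch of that per-epoch call is shown below, reusing the sampler and dataloader configured above.
for epoch in range(10):
    sampler.set_epoch(epoch)  # re-seed the sampler so the shuffle order differs each epoch
    for batch in dataloader:
        pass  # training step goes here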

7. Common Errors and Their Solutions

PyTorch’s DataLoader is a handy tool, but errors can occur during real-world usage. This section provides a detailed explanation of common errors, their causes, and how to address them.

1. Error 1: Out‑of‑Memory Error

Error Message:
RuntimeError: CUDA out of memory. 
Cause:
  • Batch size is too large.
  • Attempting to process high‑resolution images or a large dataset all at once.
  • GPU memory cache has not been released.
Solution:
  1. Reduce the batch size.
   dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
  2. Lighten the model’s data type (switch to half-precision floating point).
   model.half()
   inputs = inputs.half()
  3. Explicitly free cached GPU memory.
   import torch
   torch.cuda.empty_cache()
  4. Use pin_memory=True to speed up host-to-GPU transfers (this improves throughput but does not reduce GPU memory usage).
   dataloader = DataLoader(dataset, batch_size=16, shuffle=True, pin_memory=True)

2. Error 2: Data Loading Parallelism Error

Error Message:
RuntimeError: DataLoader worker (pid 12345) is killed by signal: 9
Cause:
  • The value of num_workers is too high, exceeding system resource limits.
  • Memory shortage or data contention is occurring.
Solution:
  1. Reduce num_workers.
   dataloader = DataLoader(dataset, batch_size=32, num_workers=2)
  2. If per-sample loading is very heavy, lighten the preprocessing or split the workload.
  3. On Windows, wrap the DataLoader creation in a main guard.
   if __name__ == '__main__':
       dataloader = DataLoader(dataset, batch_size=32, num_workers=2)

3. Error 3: Data Format Error

Error Message:
IndexError: list index out of range
Cause:
  • The custom Dataset’s __getitem__ method accesses a non‑existent index.
  • Accessing beyond the dataset’s index range.
Solution:
  1. Verify that the __len__ method returns the correct length.
   def __len__(self):
       return len(self.data)
  2. Add code to check the index range.
   def __getitem__(self, idx):
       if idx >= len(self.data):
           raise IndexError("Index out of range")
       return self.data[idx]

4. Error 4: Type Error

Error Message:
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'str'>
Cause:
  • The data returned by the custom Dataset is not a Tensor but an incompatible type such as a string.
Solution:
  1. Convert the data type to a Tensor.
   import torch

   def __getitem__(self, idx):
       feature = torch.tensor(self.features[idx], dtype=torch.float32)
       label = torch.tensor(self.labels[idx], dtype=torch.long)
       return feature, label
  2. Create a custom collate function. For complex data formats, define a function like the one below and pass it to the DataLoader via collate_fn.
   def custom_collate(batch):
       inputs, labels = zip(*batch)
       return torch.stack(inputs), torch.tensor(labels)

   dataloader = DataLoader(dataset, batch_size=32, collate_fn=custom_collate)

5. Error 5: Shuffle and Seed Fixing Issue

Symptom:
Shuffling produces different results on every run, making experiments hard to reproduce.
Cause:
  • The random seed is not fixed, compromising experiment reproducibility.
Solution:
  1. Fix the seed to obtain consistent results.
   import torch
   import numpy as np
   import random

   def seed_everything(seed=42):
       random.seed(seed)
       np.random.seed(seed)
       torch.manual_seed(seed)
       torch.cuda.manual_seed_all(seed)

   seed_everything(42)
   dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
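  2. For shuffling that is reproducible without relying on global state, a seeded torch.Generator can also be passed directly to the DataLoader (a minimal sketch).
   g = torch.Generator()
   g.manual_seed(42)
   dataloader = DataLoader(dataset, batch_size=32, shuffle=True, generator=g)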

8. Practical Example: Applying Data Preprocessing and Model Training

In this section, we present a concrete example of using PyTorch’s DataLoader to preprocess data while training a model. As an example, we use the well-known CIFAR-10 dataset for image classification tasks and explain the training process of a neural network model.

1. Preparing and Preprocessing the Dataset

First, download the CIFAR-10 dataset and perform preprocessing.
import torch
import torchvision
import torchvision.transforms as transforms

# Data preprocessing
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # Randomly flip images
    transforms.RandomCrop(32, padding=4),  # Random crop
    transforms.ToTensor(),  # Convert to tensor
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # Normalize
])

# Download and apply dataset
train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
Key Points:
  1. Data augmentation: Add diversity with random flips and crops to prevent overfitting.
  2. Normalization: Scales pixel values to the range [-1, 1] using a per-channel mean and standard deviation of 0.5, which stabilizes training.
  3. CIFAR-10: A small image classification dataset consisting of 10 classes.

2. Configuring the DataLoader

Next, we use DataLoader to batch-process the dataset.
from torch.utils.data import DataLoader

# Configure DataLoader
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False, num_workers=4, pin_memory=True)
Key Points:
  • Batch size: Supply data in mini-batch units. Process 64 samples at a time during training.
  • shuffle=True: Randomly shuffle training data while keeping test data order.
  • Parallel processing: Improves data loading speed with num_workers=4.

3. Building the Model

We create a simple convolutional neural network (CNN).
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(64 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 64 * 8 * 8)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
Key Points:
  1. Convolutional layer (Conv2d): Performs feature extraction and learns important patterns.
  2. Pooling layer (MaxPooling): Reduces feature dimensionality and provides translational invariance.
  3. Fully connected layer (Linear): The final layer that performs class classification.

4. Training the Model

We train the model on the training data.
import torch.optim as optim

# Prepare model and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(10):  # Set number of epochs to 10
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()  # Initialize gradients
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader)}")
Key Points:
  • Device configuration: Run computations on GPU if CUDA is available.
  • Adam optimizer: Adapts the learning rate per parameter, usually giving fast convergence with little tuning.
  • Loss function: Uses cross-entropy loss for class classification.

5. Evaluating the Model

We evaluate the model’s accuracy on the test data.
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Test Accuracy: {100 * correct / total:.2f}%")
Key Points:
  • Evaluation mode: model.eval() switches layers such as dropout and batch normalization to inference behavior, and torch.no_grad() disables gradient computation.
  • Accuracy calculation: Compute classification accuracy from the number of correct predictions and total samples.

9. Summary and Next Steps

In the sections so far, we have provided a detailed explanation of PyTorch’s DataLoader from basics to advanced usage. In this final section, we will review what we have covered and suggest the next steps to try.

1. Article Review

Chapters 1 – 4:
  • DataLoader Basics: We learned how PyTorch’s DataLoader works and how it streamlines data management and preprocessing.
  • Integration with Datasets: We confirmed that combining standard and custom datasets enables flexible data handling.
Chapters 5 – 6:
  • Creating Custom Datasets: We learned how to build custom Datasets to handle proprietary data formats, showcasing examples with images and CSV files using concrete code snippets.
  • Advanced Techniques and Best Practices: We mastered parallel processing, memory optimization, and the use of samplers for flexible data management to improve performance.
Chapters 7 – 8:
  • Errors and Troubleshooting: We presented common error causes and solutions, strengthening our ability to handle issues.
  • Practical Example: Using the CIFAR-10 dataset, we implemented an image classification task, practicing the full workflow from training to evaluation.

2. Advice for Applying in Production

1. Customize the Code
The code presented in the article is a basic version, but real projects often have more complex requirements. Keep the following points in mind when customizing.
  • Strengthen data augmentation to prevent overfitting.
  • Add learning rate scheduling and regularization to improve model generalization.
  • Incorporate distributed training for large datasets to boost processing efficiency.
2. Try Other Datasets
Besides MNIST and CIFAR-10, try the following datasets as well.
  • Image Classification: ImageNet and COCO datasets.
  • Natural Language Processing: Text datasets such as IMDB and SNLI.
  • Speech Recognition: Audio datasets like Librispeech.
3. Perform Hyperparameter Tuning
In DataLoader, batch_size and num_workers significantly affect training speed. Practice adjusting these values to find optimal settings.
4. Change the Model Architecture
Beyond CNNs, trying the following models can deepen your understanding.
  • RNN/LSTM: Apply to time-series data and NLP.
  • Transformer: Achieve powerful results with state-of-the-art NLP models.
  • ResNet and EfficientNet: Use as high-accuracy image classification models.

3. Next Steps

1. Leverage the Official PyTorch Documentation
You can find the latest features and detailed API references in the official documentation at pytorch.org.
2. Develop a Practical Project
Based on what you have learned, try tackling projects like the following.
  • Image Classification App: Implement image classification in mobile or web applications.
  • NLP Model: Build sentiment analysis or chatbot systems.
  • Reinforcement Learning Model: Apply to game AI or optimization tasks.
3. Publish and Share Your Code
Use GitHub or Kaggle to publish your code and exchange feedback with other developers. You will not only improve your own skills but also gain opportunities to learn from others.

4. In Closing

PyTorch’s DataLoader is a powerful tool essential for data handling and training efficiency. This article systematically covered fundamentals to advanced topics for beginners to intermediate users.
Key Takeaways:
  1. DataLoader streamlines data management and integrates seamlessly with datasets.
  2. Creating custom Datasets enables handling of any data format.
  3. Advanced techniques like acceleration and samplers achieve production‑grade processing efficiency.
  4. Practical code examples help you master the end‑to‑end workflow of model building and evaluation.
If you want to use PyTorch for machine learning or deep learning, start applying the knowledge from this article to real projects. By continuing to learn, you’ll acquire more advanced model design and data processing skills. As a next step, deepen your knowledge by tackling new projects.