Tensor size mismatch while training Qwen Image Edit 2509 with batch size > 1 #487

@pft-JoeyYang

Description

This is for bugs only

Did you already ask in the discord?

Yes/No

You verified that this is a bug and not a feature request or question by asking in the discord?

Yes/No

Describe the bug

I get the following error when training Qwen-Image-Edit-2509 with batch_size > 1:

========================================
Result:
 - 0 completed jobs
 - 1 failure
========================================
Traceback (most recent call last):
  File "/data/ai-toolkit/run.py", line 120, in <module>
    main()
  File "/data/ai-toolkit/run.py", line 108, in main
    raise e
  File "/data/ai-toolkit/run.py", line 96, in main
    job.run()
  File "/data/ai-toolkit/jobs/ExtensionJob.py", line 22, in run
    process.run()
  File "/data/ai-toolkit/jobs/process/BaseSDTrainProcess.py", line 2208, in run
    batch = next(dataloader_iterator)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 733, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1515, in _next_data
    return self._process_data(data, worker_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1550, in _process_data
    data.reraise()
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/_utils.py", line 750, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/toolkit/data_loader.py", line 642, in dto_collation
    batch = DataLoaderBatchDTO(
            ^^^^^^^^^^^^^^^^^^^
  File "/data/ai-toolkit/toolkit/data_transfer_object/data_loader.py", line 306, in __init__
    raise e
  File "/data/ai-toolkit/toolkit/data_transfer_object/data_loader.py", line 180, in __init__
    self.control_tensor = torch.cat([x.unsqueeze(0) for x in control_tensors])
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1851 but got size 1819 for tensor number 1 in the list.
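
To illustrate what the final error line is reporting: the collate step adds a leading batch dimension to each control tensor and concatenates them, which only works if every tensor in the batch has the same shape. Below is a minimal, dependency-free sketch of that shape rule. The function name `check_batch_shapes` and the example shapes are hypothetical; only the sizes 1851 and 1819 and the "tensor number 1" index come from the traceback, and which dimension they belong to is an assumption.

```python
def check_batch_shapes(shapes):
    """Mimic the shape rule applied when a batch is stacked along a new dim 0.

    `shapes` holds one shape tuple per control image in the batch and is
    assumed to be non-empty, with every shape having the same number of
    dimensions. Returns None if the batch can be stacked, otherwise
    (index, dim, expected, got) for the first mismatch, where `dim` is
    counted in the per-image tensor (before the batch dimension is added).
    """
    first = shapes[0]
    for index, shape in enumerate(shapes[1:], start=1):
        for dim, (expected, got) in enumerate(zip(first, shape)):
            if expected != got:
                return index, dim, expected, got
    return None


# Hypothetical (channels, height, width) shapes for two control images.
# Only the values 1851 and 1819 are taken from the error message.
batch = [(3, 1851, 1024), (3, 1819, 1024)]
print(check_batch_shapes(batch))  # (1, 1, 1851, 1819)
```

If this reading is right, the second control image in the batch (index 1) reaches the collate step with a different size than the first, which would explain why the error does not appear with a batch size of 1.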

The following is my training config:

job: "extension"
config:
  name: "qwen_image_edit_v3.7"
  process:
    - type: "diffusion_trainer"
      training_folder: "/data/ai-toolkit/output"
      sqlite_db_path: "./aitk_db.db"
      device: "cuda"
      trigger_word: null
      performance_log_every: 10
      network:
        type: "lora"
        linear: 32
        linear_alpha: 32
        conv: 16
        conv_alpha: 16
        lokr_full_rank: true
        lokr_factor: -1
        network_kwargs:
          ignore_if_contains: []
      save:
        dtype: "bf16"
        save_every: 250
        max_step_saves_to_keep: 4
        save_format: "diffusers"
        push_to_hub: false
      datasets:
        - folder_path: "/data/ai-toolkit/datasets/target_v2_0"
          mask_path: null
          mask_min_value: 0.1
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          cache_latents_to_disk: false
          is_reg: false
          network_weight: 1
          resolution:
            - 1024
            - 512
            - 768
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          do_i2v: true
          flip_x: false
          flip_y: false
          control_path_1: "/data/ai-toolkit/datasets/masked_v2_0"
      train:
        batch_size: 8
        bypass_guidance_embedding: false
        steps: 6000
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        timestep_type: "weighted"
        content_or_style: "balanced"
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: true
        lr: 0.0001
        ema_config:
          use_ema: false
          ema_decay: 0.99
        skip_first_sample: false
        force_first_sample: false
        disable_sampling: false
        dtype: "bf16"
        diff_output_preservation: false
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: "person"
        switch_boundary_every: 1
        loss_type: "mse"
      model:
        name_or_path: "Qwen/Qwen-Image-Edit-2509"
        quantize: true
        qtype: "qfloat8"
        quantize_te: true
        qtype_te: "qfloat8"
        arch: "qwen_image_edit_plus"
        low_vram: false
        model_kwargs:
          match_target_res: false
        layer_offloading: false
        layer_offloading_text_encoder_percent: 1
        layer_offloading_transformer_percent: 1
        neg: ""
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 25
        num_frames: 1
        fps: 1
meta:
  name: "[name]"
  version: "1.0"
