# Dataset Preparation

- [Dataset Preparation](#dataset-preparation)
  - [Supported Datasets](#supported-datasets)
  - [Resources](#resources)
  - [Interface of Dataloader](#interface-of-dataloader)
  - [Specific Datasets and Dataloader](#specific-datasets-and-dataloader)
  - [Test Dataloader](#test-dataloader)

## Supported Datasets

The following datasets can be loaded with the current codes after downloaded (see example [scripts](../options/example_benchmark_data_opts.yml)):

| FR Dataset | Description | NR Dataset       | Description        |
| ---------- | ----------- | ---------------- | ------------------ |
| PIPAL      | *2AFC*      | FLIVE(PaQ-2-PiQ) | *Tech & Aesthetic* |
| BAPPS      | *2AFC*      | SPAQ             | *Mobile*           |
| PieAPP     | *2AFC*      | AVA              | *Aesthetic*        |
| KADID-10k  |             | KonIQ-10k(++)    |                    |
| LIVEM      |             | LIVEChallenge    |                    |
| LIVE       |             | [PIQ2023](https://github.com/DXOMARK-Research/PIQ2023)| Portrait dataset   |
| TID2013    |             | [GFIQA](http://database.mmsp-kn.de/gfiqa-20k-database.html)| Face IQA Dataset   |
| TID2008    |             |                  |                    |
| CSIQ       |             |                  |                    |

Please see more details at [Awesome Image Quality Assessment](https://github.com/chaofengc/Awesome-Image-Quality-Assessment)

## Resources

Here are some other resources to download the dataset:
- [**Our huggingface archive 🤗**](https://huggingface.co/datasets/chaofengc/IQA-Toolbox-Datasets/tree/main)
- [**Waterloo Bayesian IQA project**](http://ivc.uwaterloo.ca/research/bayesianIQA/). [ [IQA-Dataset](https://github.com/icbcbicc/IQA-Dataset) | [download links](http://ivc.uwaterloo.ca/database/IQADataset) ]

## Interface of Dataloader

We create general interfaces for FR and NR datasets in `pyiqa/data/general_fr_dataset.py` and `pyiqa/data/general_nr_dataset.py`. The main arguments are

- `opt` contains all dataset options, including
    - `dataroot_target`: path of target image folder.
    - `dataroot_ref [optional]`: path of reference image folder.
    - `meta_info_file`: file containing meta information of images, including relative image paths, mos labels and other labels.
    - `augment [optional]` data augmentation transform list
        - `hflip`: flip input images or pairs
        - `random_crop`: int or tuple, random crop input images or pairs
    - `split_file [optional]`: `train/val/test` split file `*.pkl`. If not specified, will use the split information in meta csv file or load the whole dataset.
    - `split_index [optional]`: `str` or `int`, which split to use, valid when `split_file` is specified or corresponding split information exits in meta csv file.
    - `dmos max`: some dataset use difference of mos. Set this to non-zero will change dmos to mos with `mos = dmos_max - dmos`.
    - `phase`: phase labels [train, val, test]

The above interface requires the `meta_info_file` to provide the dataset information and the train/val/test split. The `meta_info_file` are `.csv` files, and has the following general format
```
- For NR datasets: name, mos(mean), std, split_name
    ```
    100.bmp   	32.56107532210109   	19.12472638223644   train/val/test
    ```

- For FR datasets: ref_name, dist_name, mos(mean), std, split_name 
    ```
    I01.bmp        I01_01_1.bmp   5.51429        0.13013 train/val/test

    ```
```

Note that we generate `train/val/test` splits follow the principles below:

- For datasets which has official splits, we follow their splits.
- For official split which has no `val` part, e.g., AVA dataset, we random separate 5% from training data as validation.
- For small datasets which requires n-split results, we use `train:val=8:2`  ratio.
- All random seeds are set to `123` when needed.

According to these rules, the `split_name` is named as follows:

- The official split is saved in a column named `official_split`.
- [if necessary] Ten random splits are generated and stored using the format `ratio[split_ratio]_seed[seed number]_split[split index:02d]`. For example, for a split ratio of `train/val/test=8:0:2`, a seed number of 123, and the first split, the entry would be `ratio802_seed123_split01`.
- You can also use other custom split names, such as the `ILGnet_split` for the AVA dataset.

### Using separate split file

You may also use the `split_file` to specify the split information. The `split_file` are `.pkl` files which contains the `train/val/test` information with python dictionary in the following format:
```
{
    train_index: {
        train: [train_index_list]
        val: [val_index_list] # blank if no validation split
        test: [test_index_list] # blank if no test split
    }
}
```
The train_index starts from `1`. And the sample indexes correspond to the row index of `meta_info_file`, starting from `0`. We already generate the files for mainstream public datasets with scripts in folder [./scripts/](./scripts/).

## Specific Datasets and Dataloader

Some of the supported datasets have different label formats and file organizations, and we create specific dataloader for them:

- Live Challenge. The first 7 samples are usually removed in the related works.
- AVA. Different label formats.
- PieAPP. Different label formats.
- BAPPS. Different label formats.

## Test Dataloader

You may use `tests/test_datasets.py` to test whether a dataset can be correctly loaded.