GitHub - adumrewal/iiit-5k-word-coco-dataset: IIIT5K dataset converted to coco format along with python readable original label files. Original dataset is in matlab format, which might have been an issue for some potential users, hence this repository.

Overview

This repository contains the IIIT5K dataset. The original dataset shared by IIIT is in MATLAB format. In this repository, the dataset has been converted to readable .csv and COCO formats so it can be loaded easily from Python code.

This dataset contains:

Cropped word images split into training and test sets
Ground truth annotation, small and medium sized lexicons
Lexicon with 0.5 million words (from Weinman et al. 2009)
Character bounding box level annotations

The lexicon used to compute language priors is in the file sample/og_labels/lexicon.txt. This lexicon was provided by Weinman et al. 2009. Please cite their article (see Citations) when using this lexicon.
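As a minimal sketch, the lexicon can be read into a Python set for fast membership checks. This assumes lexicon.txt stores one word per line, which this README does not confirm, so check the file before relying on it. The function name `load_lexicon` is hypothetical and not part of this repository.

```python
from pathlib import Path


def load_lexicon(path):
    """Return the set of non-empty lines in a lexicon file.

    Assumption: one word per line. Surrounding whitespace is stripped
    and duplicate entries collapse because a set is used.
    """
    words = set()
    # errors="replace" keeps the loader from failing on unexpected bytes.
    with Path(path).open(encoding="utf-8", errors="replace") as fh:
        for line in fh:
            word = line.strip()
            if word:
                words.add(word)
    return words


# Example call (path taken from the folder structure in this README):
# lexicon = load_lexicon("sample/og_labels/lexicon.txt")
```

A set is used here because lexicon lookups are usually membership tests; use a list instead if the original word order matters.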

Sample dataset

Train dataset

img_1		img_2		img_3		img_4

Test dataset

img_1		img_2

Folder structure

sample/ : contains sample dataset structure to help understand what you're downloading
- images/ : images folder with train/test split
- labels/ : labels folder with train/test split in coco format
- og_labels/ : original label files shared by the dataset authors, converted from MATLAB to csv format
  - lexicon.txt
  - testCharBound.csv
  - testdata.csv
  - trainCharBound.csv
  - traindata.csv
- test.txt : list of test image files (coco format)
- train.txt : list of train image files (coco format)
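The column layout of the csv files is not described in this README, so a reasonable first step is to look at the first few rows before writing a loader. The sketch below does that with the standard library only; `peek_csv` is a hypothetical helper name, and whether the first row is a header or a data row is something you need to check by eye.

```python
import csv
from itertools import islice
from pathlib import Path


def peek_csv(path, n_rows=3):
    """Return (first_row, following_rows) from a CSV file.

    No column names are assumed. The first row may be a header or
    already data; inspect the result to find out.
    """
    with Path(path).open(newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        first_row = next(reader, [])          # empty list for an empty file
        following = list(islice(reader, n_rows))
    return first_row, following


# Example call (file name taken from the folder structure above):
# first, rows = peek_csv("sample/og_labels/traindata.csv")
# print(first)
# for row in rows:
#     print(row)
```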

Steps to access complete dataset:

Clone this repo: git clone https://github.com/adumrewal/iiit-5k-word-coco-dataset.git
Set up git-lfs
- sudo apt-get install git-lfs or brew install git-lfs
- git lfs install (inside the cloned repo)
git lfs pull (pulls the .zip file onto your system)
unzip IIIT5K_coco.zip -d .
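If you prefer to do the last step from Python, the sketch below extracts the archive with the standard `zipfile` module. It first checks whether the file is still a Git LFS pointer (a small text file beginning with `version https://git-lfs.github.com/spec/v1`), which is what you get when `git lfs pull` has not been run. The function names are hypothetical and not part of this repository.

```python
import zipfile
from pathlib import Path

# Git LFS pointer files are small text files that start with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer, not real data."""
    with Path(path).open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def extract_dataset(zip_path="IIIT5K_coco.zip", dest="."):
    """Extract the dataset archive into dest and return the member names."""
    zip_path = Path(zip_path)
    if is_lfs_pointer(zip_path):
        raise RuntimeError(
            f"{zip_path} is a Git LFS pointer; run 'git lfs pull' first."
        )
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()


# Example call, run from inside the cloned repo:
# members = extract_dataset()
# print(len(members), "entries extracted")
```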

Post Script

Thanks to IIIT5K for open-sourcing the dataset.
In case you need the script to convert from csv to COCO format, please feel free to reach out.
If you have any comments/suggestions, please feel free to drop an e-mail or raise an issue in this repo.
If you like what I've provided here, it would be great if you could star this repo.

Citations

IIIT-5K Dataset

Please include the following citation if you plan on using this dataset. More details can be found on the original dataset webpage.

@InProceedings{MishraBMVC12,
 author   = "Mishra, A. and Alahari, K. and Jawahar, C.~V.",
 title    = "Scene Text Recognition using Higher Order Language Priors",
 booktitle= "BMVC",
 year     = "2012"
}

Lexicon

@article{Weinman09,
    author = {Jerod J. Weinman and Erik Learned-Miller and Allen Hanson},
    title  = {Scene Text Recognition using Similarity and a Lexicon with Sparse Belief Propagation},
    journal= {IEEE Trans. Pattern Analysis and Machine Intelligence},
    volume = {31},
    number = {10},
    pages  = {1733--1746},
    month  = {Oct},
    year   = {2009}
}
