In our work, "Inter-Media Hashing for Large-Scale Retrieval from Heterogeneous Data Sources" (SIGMOD 2013), two image datasets and one text collection are used for evaluation. The image datasets are the NUS-WIDE image collection and the ImageNet corpus; the text dataset is a collection of Web documents we crawled.
1. NUS-WIDE is a web image dataset containing 269,648 images downloaded from Flickr, with tagging ground truth for 81 semantic concepts provided for evaluation. After removing images without tags, 267,465 images remain. We randomly chose 10,000 images with their tags as the training data, which serve to bridge the image datasets and the text document dataset below. The remaining 257,465 images are used as the small image dataset for testing in the experiments.
2. ImageNet is a very large image database organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by a set of images (currently averaging over 500 images per node). We use the 5,018 tags provided with the NUS-WIDE dataset as keywords to search ImageNet, which yields 2,904 synsets (concepts). Using these synsets as queries, we collect 2,413,987 images from ImageNet in total. Three images from each synset are randomly selected as the training image data, and the rest form the large image dataset for testing.
3. Web Documents are crawled using the Google search engine, with the 2,904 concepts obtained from ImageNet as text queries. For each query, we download the top 100 web pages returned by Google, giving 240,903 documents in total. We then randomly select three web documents from each query's results as the training text data; the rest are used for testing.
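The per-concept train/test split used for both the ImageNet images and the Web documents (three items per concept for training, the rest for testing) can be sketched as follows. The data structure and function name here are hypothetical, not taken from the released code:

```python
import random

def split_per_concept(items_by_concept, n_train=3, seed=0):
    """Randomly pick n_train items per concept for training; the rest for testing.

    items_by_concept: dict mapping each concept (synset or query) to a list of
    item ids (e.g. image filenames or document URLs) -- a hypothetical layout.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for concept, items in items_by_concept.items():
        picked = rng.sample(items, min(n_train, len(items)))
        train.extend(picked)
        test.extend(item for item in items if item not in picked)
    return train, test
```

For example, applied to a mapping with 2,904 concepts, this would produce roughly 3 × 2,904 training items, matching the splits described above.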
This dataset, named UQ_IMH, is used to evaluate inter-media retrieval performance and includes the following files:
A. NUS-WIDE image features (cm_nus.mat): 267,465 × 150
B. ImageNet image features (cm_imgnet.mat): 2,413,987 × 150
C. All text features (text_all.mat): 511,272 × 144 = (2,904 + 267,465 + 240,903) × 144. The first 2,904 rows are the features of the 2,904 synsets (concepts), the next 267,465 rows are for the NUS-WIDE associated tags, and the last 240,903 rows are for the Google documents.
D. Queries for the three tasks on the NUS-WIDE dataset (seeds_nus.mat).
E. Ground truth for the NUS-WIDE dataset (gt_nus.mat): ground truth for text->image and image->text search is generated during the search; ground truth for image->image search is given by GT.
F. Queries for the three tasks on the ImageNet dataset (seeds_imgnet.mat).
G. Ground truth for the ImageNet dataset (gt_imgnet.mat).
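The stacked text-feature matrix in text_all.mat can be sliced into its three documented blocks as sketched below. The variable name stored inside the .mat file is an assumption; check it with scipy.io.whosmat before loading:

```python
import numpy as np

# Documented row counts: 2,904 synsets + 267,465 NUS-WIDE tag vectors
# + 240,903 Google documents, each a 144-dimensional feature vector.
N_SYNSETS, N_TAGS, N_DOCS, DIM = 2904, 267465, 240903, 144

def split_text_rows(text_all, n_synsets=N_SYNSETS, n_tags=N_TAGS):
    """Split the stacked matrix into (synset, tag, document) feature blocks."""
    synsets = text_all[:n_synsets]                       # rows 0 .. 2903
    tags = text_all[n_synsets:n_synsets + n_tags]        # NUS-WIDE tag rows
    docs = text_all[n_synsets + n_tags:]                 # Google document rows
    return synsets, tags, docs

# In practice (the key 'text_all' inside the .mat file is an assumption):
#   from scipy.io import loadmat
#   text_all = loadmat('text_all.mat')['text_all']       # 511272 x 144
#   synset_feats, tag_feats, doc_feats = split_text_rows(text_all)
```

The same row order (synsets first, then tags, then documents) is what item C above specifies, so the three blocks align with the two image feature matrices by dataset.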
Download the sample MATLAB code here (4 KB).
For any problems, please contact Prof. Heng Tao Shen at email@example.com.
Citation: Jingkuan Song, Yang Yang, Yi Yang, Zi Huang, Heng Tao Shen. Inter-Media Hashing for Large-Scale Retrieval from Heterogeneous Data Sources. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 785-796, 2013.