UQ_VIDEO dataset and code

  1. *************************************
    The dataset and code for 'Multiple Feature Hashing for Real-time Large Scale Near-duplicate Video Retrieval'.
    Citation: Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, Richang Hong: Multiple feature hashing for real-time large scale near-duplicate video retrieval. ACM Multimedia, pages 423-432, 2011.
    Updated: 18/11/2014

    1. The Combined Dataset, named UQ_VIDEO, is a video dataset that we built by combining CC_WEB_VIDEO with videos we downloaded from YouTube.
    We selected the 400 most popular queries from Google Zeitgeist and used them to search YouTube. Each year, Google examines the billions of queries that people around the world have typed into Google search to discover the Zeitgeist, which is saved in the Google Zeitgeist Archives. We collected the Google Zeitgeist Archives from 2004 to 2009 and chose the 400 most popular queries from them. Up to 1000 videos were downloaded for each query. In total, we crawled more than 200K YouTube videos from July 2010 to September 2010.

    After filtering out the videos whose sizes are greater than 10M, the Combined Dataset contains 169,952 videos in total. To the best of our knowledge, this is the largest Web video dataset for experimental purposes. We further extracted 3,305,525 keyframes from these videos. The dataset is released to the public so that other researchers can use it as a test bed.

    2. The Combined Dataset includes the following files:
    A. Query video file (idx_query.mat): The IDs of 24 query videos.
    B. Ground Truth File (groundTruth.mat): The ground truth for the 24 queries.
    C. HSV feature file for all the keyframes (hsvYouTubeNor.mat): The HSV features for all the keyframes.
    D. LBP feature file for all the keyframes (lbpYouTubeNor.mat): The LBP features for all the keyframes.
    E. Video to keyframe indexing file (kf_start_end_YouTube.mat, idx_wu.mat): The indexing positions of the starting and ending keyframes in the feature files for each video.
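
    As a rough illustration of how a start/end index can be used to pull one video's keyframes out of a feature file, here is a minimal Python sketch on toy data. It does not read the .mat files. The names 'features', 'kf_start_end' and 'keyframes_of' are hypothetical, and the 1-based, inclusive index convention is an assumption based on MATLAB usage, not something confirmed by the dataset.

```python
# Toy stand-in for a feature file: one row per keyframe.
# The real features are stored in hsvYouTubeNor.mat / lbpYouTubeNor.mat;
# the row-per-keyframe layout used here is an assumption.
features = [[float(k)] for k in range(1, 101)]  # keyframe k has feature [k]

# Hypothetical start/end table: video ID -> (first keyframe, last keyframe).
# Indices are assumed to be 1-based and inclusive, as is usual in MATLAB.
kf_start_end = {1: (51, 55), 2: (56, 58)}


def keyframes_of(video_id, start_end, feats):
    """Return the feature rows that belong to one video."""
    start, end = start_end[video_id]
    # Convert the 1-based inclusive range to a 0-based Python slice.
    return feats[start - 1:end]


rows = keyframes_of(1, kf_start_end, features)
print(len(rows), rows[0], rows[-1])
```

    The only step that needs care is the index conversion: a 1-based inclusive range (start, end) corresponds to the Python slice [start - 1:end].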

    3. Demo: demo_mfh.m is the main function. Download the dataset and type 'demo_mfh()' to run this program. Turn on the 'parallel' option in 'demo_mfh.m' if necessary.
    Explanation: The first seed video is video 1, and its keyframes are 51-55 in the feature files. We use these keyframes to search all the keyframes and return the corresponding near-duplicate videos. The groundTruth.mat file shows that there are 337 near-duplicate videos in this dataset. Finally, we use this ground truth file to evaluate the search results.
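
    The last two steps of this flow (mapping matched keyframes back to videos, then scoring against the ground truth) can be sketched in Python on toy data. This is an illustration only and is not the code in demo_mfh.m: the keyframe search itself is replaced by a made-up list of hits, all names and values are hypothetical, and precision/recall is used as a simple stand-in, since the measure the demo reports may differ.

```python
def videos_for_keyframes(hit_keyframes, start_end):
    """Map 1-based keyframe indices back to the videos that contain them."""
    found = set()
    for video_id, (start, end) in start_end.items():
        if any(start <= k <= end for k in hit_keyframes):
            found.add(video_id)
    return found


def precision_recall(retrieved, relevant):
    """Compute precision and recall of a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical start/end table (1-based, inclusive; an assumed convention).
start_end = {1: (51, 55), 2: (56, 58), 3: (59, 64), 4: (65, 66)}

# Made-up output of a keyframe search for some seed video.
hit_keyframes = [57, 60, 61, 66]

# Made-up ground truth: IDs of the true near-duplicate videos.
ground_truth = {2, 3, 5}

retrieved = videos_for_keyframes(hit_keyframes, start_end)
precision, recall = precision_recall(retrieved, ground_truth)
print(sorted(retrieved), precision, recall)
```

    A video counts as retrieved here as soon as any one of its keyframes is matched. That is a design choice of this sketch, and a stricter rule (for example, requiring several matched keyframes) would change the retrieved set.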

Download the files (both dataset and code) from Google Drive at https://drive.google.com/open?id=0B_ji3X99V5yzSkpIbkQ2QzRuYjA&authuser=0 . We also provide another dataset, which is used for cross-media retrieval. If you have any problems, please contact Prof. Heng Tao Shen at shenht@itee.uq.edu.au.