Description of the CHIME Chart Image Dataset


The CHIME Chart Image Dataset was created by the Center for Information Mining and Extraction (CHIME), School of Computing, National University of Singapore. The dataset is provided to promote research in chart image recognition and understanding. A public dataset with ground truth is important for evaluating and comparing the performance of different systems.

The dataset consists of two subsets:

  • 200 real-life chart images, with ground truth extracted using a semi-automatic system [paper]. (Thanks to Ms. Yang Li for implementing the system.)

    Link to the dataset: here

  • 3200 synthetic chart images, created by an automatic chart image generation system. The ground truth data were captured by the same system [paper]. (Thanks to Mr. Zhao Jiuzhou for implementing the system.)

    Link to the dataset: here

    Notes on the synthetic chart image dataset:

    All clean images were zipped into a single file.

    Degraded images were put into different zip files according to their types.

    Naming convention for degraded images: filename = "D" + chart_type + 3-digit degradation number + 2-digit sequence number. The three digits of the degradation number are:

  • Digit1: 0 = "no skew angle", 1 = "skew angle"

  • Digit2: 0 = "no shearing", 1 = "shearing"

  • Digit3: 0 = "no motion blur", 1 = "motion blur"
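    The naming convention above can be decoded mechanically. The following is a minimal sketch, not part of the dataset tooling. It assumes that any file extension is dropped first and that chart_type is whatever text lies between the leading "D" and the final five digits. The function name parse_degraded_name and the example chart type "Bar" are hypothetical.

```python
import re
from typing import NamedTuple


class DegradedName(NamedTuple):
    chart_type: str     # text between the leading "D" and the last five digits
    skew: bool          # Digit1 of the degradation number
    shearing: bool      # Digit2 of the degradation number
    motion_blur: bool   # Digit3 of the degradation number
    sequence: int       # 2-digit sequence number


# "D" + chart_type + three 0/1 flags + two-digit sequence number.
_NAME = re.compile(r"^D(?P<chart_type>.+?)(?P<flags>[01]{3})(?P<seq>\d{2})$")


def parse_degraded_name(filename: str) -> DegradedName:
    """Split a degraded-image file name into its documented parts."""
    # Assumption: everything after the last "." is an extension.
    stem = filename.rsplit(".", 1)[0] if "." in filename else filename
    match = _NAME.match(stem)
    if match is None:
        raise ValueError(f"not a degraded-image name: {filename!r}")
    skew, shearing, motion_blur = (digit == "1" for digit in match["flags"])
    return DegradedName(match["chart_type"], skew, shearing, motion_blur,
                        int(match["seq"]))
```

    For example, under these assumptions a file named "DBar10103.png" would be read as chart type "Bar" with skew and motion blur applied, no shearing, and sequence number 3.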

    The tool for automatically creating synthetic chart images is also available to try and can be found here. It is far from matching commercial charting software, but it does generate the major components of four commonly used chart types. It also records low-level graphical and textual details as ground truth data. Documentation on how to use the tool can be found here.

    The text bounding boxes were calculated using the GDI+ Graphics.MeasureString() function. However, because of the way the text was drawn, these are not tight bounding boxes. A method that improves the situation (although the resulting boxes are still not completely tight) is here.

    As the text strings are created randomly, two text strings may overlap when using the current tool. This happens when the space between two graphical symbols is too small (more often with pie charts) and the corresponding labels of the two symbols are placed too close to each other. This bug may be fixed by adding constraint checking and displacement calculation to the related function.
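    The kind of constraint checking and displacement calculation mentioned above could look roughly like the sketch below. It is an illustration only and is not taken from the tool. It assumes axis-aligned label boxes given as (x, y, width, height) with y increasing downwards, and it resolves a collision by moving one label straight down. The names Box, overlaps and vertical_shift_to_clear are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: float       # left edge
    y: float       # top edge (y grows downwards, as in screen coordinates)
    width: float
    height: float


def overlaps(a: Box, b: Box) -> bool:
    """Constraint check: True if the two boxes share any interior area."""
    return (a.x < b.x + b.width and b.x < a.x + a.width and
            a.y < b.y + b.height and b.y < a.y + a.height)


def vertical_shift_to_clear(fixed: Box, moving: Box, gap: float = 0.0) -> float:
    """Displacement: how far to move `moving` down so it clears `fixed`."""
    if not overlaps(fixed, moving):
        return 0.0
    return fixed.y + fixed.height + gap - moving.y


# Usage: push the second label down until it no longer collides.
first = Box(0, 0, 10, 5)
second = Box(5, 3, 10, 5)
dy = vertical_shift_to_clear(first, second)
second = second._replace(y=second.y + dy)
```

    A real fix would also need to keep the displaced label inside the image and close to its graphical symbol, which this sketch does not attempt.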

    Update on 17th September 2007: all images in the synthetic dataset are now in PNG format, which uses lossless compression. Although the total size of the ZIP files has not decreased, each individual image is now much smaller.

    Last Modified: 17th September 2007

    by Huang Weihua