Data Sampling in Multi-view and Multi-class Scatterplots via Set Cover Optimization

IEEE Transactions on Visualization and Computer Graphics (Proceedings of InfoVis 2019)


Ruizhen Hu¹    Tingkai Sha¹    Oliver van Kaick²    Oliver Deussen³    Hui Huang¹*

¹Shenzhen University     ²Carleton University     ³University of Konstanz




Fig. 1. Our method creates a sub-sampled point set that is optimized for different views of a SPLOM and per class. In contrast to blue noise scatterplot sampling (and other sampling methods), which create different point sets for different views, our method uses joint optimization to yield a single point set for multiple views. This way, it not only optimizes for multi-view and multi-class scatterplots simultaneously, but also presents results perceptually similar to the original data distributions while reducing overdraw.



Abstract

We present a method for data sampling in scatterplots by jointly optimizing point selection for different views or classes. Our method uses space-filling curves (Z-order curves) to partition a point set into subsets that, when each is covered by one sample, provide a sampling or coreset with good approximation guarantees in relation to the original point set. For scatterplot matrices with multiple views, different views provide different space-filling curves, leading to different partitions of the given point set. For multi-class scatterplots, the focus on either the per-class distribution or the global distribution provides two different partitions of the given point set that need to be considered in the selection of the coreset. For both cases, we convert the coreset selection problem into an Exact Cover Problem (ECP), and demonstrate with quantitative and qualitative evaluations that an approximate solution that solves the ECP efficiently is able to provide high-quality samplings.
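To make the idea in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes that each view's partition is obtained by sorting points along a Z-order (Morton) curve and cutting the order into consecutive runs of a fixed size, and it replaces the paper's ECP solver with a plain greedy covering heuristic that only requires every subset to contain at least one sample. The names `morton_code`, `zorder_partition`, `greedy_cover`, and the parameters `group_size` and `bits` are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def morton_code(ix, iy, bits):
    """Interleave the bits of two non-negative integer grid coordinates."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)
        code |= ((iy >> b) & 1) << (2 * b + 1)
    return code


def zorder_partition(points, dims, group_size, bits=10):
    """Partition point indices for one 2D view (assumed scheme).

    Points are projected onto the two dimensions in `dims`, quantized to a
    2^bits x 2^bits grid, sorted along the Z-order curve, and cut into
    consecutive runs of `group_size` indices.
    """
    if not points:
        return []
    i, j = dims
    xs = [p[i] for p in points]
    ys = [p[j] for p in points]
    cells = 1 << bits

    def quantize(v, lo, hi):
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * cells), cells - 1)

    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    keys = [
        morton_code(quantize(x, x_lo, x_hi), quantize(y, y_lo, y_hi), bits)
        for x, y in zip(xs, ys)
    ]
    order = sorted(range(len(points)), key=lambda k: (keys[k], k))
    return [set(order[s:s + group_size]) for s in range(0, len(order), group_size)]


def greedy_cover(partitions):
    """Greedy heuristic (an assumption, not the paper's solver): repeatedly
    pick the point that lies in the most still-uncovered subsets."""
    subsets = [s for part in partitions for s in part]
    membership = defaultdict(set)  # point index -> ids of subsets containing it
    for sid, subset in enumerate(subsets):
        for p in subset:
            membership[p].add(sid)
    uncovered = set(range(len(subsets)))
    chosen = []
    while uncovered:
        best = max(membership, key=lambda p: (len(membership[p] & uncovered), -p))
        chosen.append(best)
        uncovered -= membership[best]
    return chosen


# Usage: three 2D views of a small synthetic 3D point set.
points = [((i * 37) % 101 / 101, (i * 53) % 103 / 103, (i * 71) % 107 / 107)
          for i in range(200)]
views = [(0, 1), (0, 2), (1, 2)]
partitions = [zorder_partition(points, v, group_size=8) for v in views]
sample = greedy_cover(partitions)
```

Because the subsets within a single view are disjoint, the sample can never be smaller than the number of subsets in one view. The joint selection reduces the total by preferring points that cover subsets in several views at once.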



Fig. 4. Sampling with outlier inclusion. Original scatterplot (a) with two classes (blue and red), where outliers are marked with green stars. The samplings (b) and (c) both cover each subset of the individual views (per-class and global view), but (c) selects the outliers of the subsets.




Fig. 5. Example of view selection for sampling optimization. After comparing the clusterings of each view provided by the Z-order method, the three views marked with red squares in (a) are selected as representatives of the dataset. When sampling only with the selected views as in (c), the results are similar to sampling with all the views as in (b), both visually and in terms of approximation error. However, the sampling (c) can be computed in less time than (b) and requires fewer points.




Fig. 9. Comparison of different sampling methods for multi-class scatterplots on six selected datasets. For each dataset, we show on the left the scatterplot drawn with the full point set as reference, while on the right the boxplots present the distribution of KDE errors, indicating the median, minimum, maximum, first and third quartiles, and outliers. We show one boxplot for each of the four sampling methods, both when considering the global distribution of points ("V_global") and individual classes ("V_local"). Note that the lowest errors are consistently provided by our sampling method that considers the local and global views (blue boxplots), while all the Z-order-based samplings provide better results than the blue noise method.



Acknowledgement

We thank the reviewers for their valuable comments. This work was supported in part by NSFC (61872250, 61602311), GD Science and Technology Program (2015A030312015), GD Leading Talents Program (00201509), Shenzhen Innovation Program (JCYJ20170302153208613, KQJSCX20170727101233642), LHTD (20170003), NSERC (2015-05407), DFG (422037984), and the National Engineering Laboratory for Big Data System Computing Technology.


Bibtex
@article{Scatterplots19,
title = {Data Sampling in Multi-view and Multi-class Scatterplots via Set Cover Optimization},
author = {Ruizhen Hu and Tingkai Sha and Oliver van Kaick and Oliver Deussen and Hui Huang},
journal = {IEEE Transactions on Visualization and Computer Graphics (Proceedings of InfoVis 2019)},
volume = {},
number = {},
pages = {},
year = {2019},
}


