Is Open Access

Exploiting Features for Data Source Quality Estimation

Preprint

      Abstract

      We study the problem of estimating the quality of data sources in data fusion settings. In contrast to existing models that rely only on conflicting observations across sources to infer quality (internal signals), we propose a data fusion model, called FUSE, that combines internal signals with external data-source features. We show, both theoretically and empirically, that FUSE yields better quality estimates with rigorous guarantees; in contrast, models that use only internal signals have weaker or no guarantees. We study two approaches for learning FUSE's parameters: (i) empirical risk minimization (ERM), which uses ground truth and relies on fast convex optimization methods, and (ii) expectation maximization (EM), which assumes no ground truth and uses slow iterative optimization procedures. EM is the standard approach in most existing methods. An implication of our theoretical analysis is that features allow FUSE to obtain low-error estimates with limited ground truth on the correctness of source observations. We study the tradeoff between the statistical efficiency and the runtime of data fusion models along two directions: (i) whether or not the model uses features, and (ii) the amount of ground truth available. We empirically show that, with only a few training examples (in some cases as few as \(50\)), features allow FUSE with ERM to obtain estimates of similar or better quality than both feature-less models and FUSE with EM, while being much faster; in our experiments we observe speedups of \(27\times\). We evaluate FUSE on real data and show that it outperforms feature-less baselines, and can yield reductions of more than \(30\%\) in the source accuracy estimation error and improvements of more than \(10\%\) in the F1-score when resolving conflicts across sources.
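      The abstract does not give FUSE's functional form, so the following is only a minimal sketch of the general idea of learning source accuracy from features with ERM. It assumes, purely for illustration, that a source's accuracy is a logistic function of its feature vector and that the parameters are fit by gradient descent on the (convex) logistic loss over a small set of labeled observations. The function names, the single binary feature (a hypothetical "source is curated" flag), and the 50 synthetic labels are all assumptions, not taken from the article.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_source_accuracy(features, correct, lr=1.0, epochs=2000):
    """Fit weights by ERM with logistic loss (an assumed model form).

    features[i]: feature vector of the source behind labeled observation i.
    correct[i]:  1 if that observation agreed with ground truth, else 0.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, correct):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
            err = p - y  # gradient of the logistic loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Full-batch gradient step on the average loss.
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def estimate_accuracy(w, b, x):
    """Predicted probability that a source with features x is correct."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Synthetic ground truth (hypothetical): 50 labeled observations.
# Sources with the feature set were right 22 times out of 25,
# sources without it were right 15 times out of 25.
features = [[1.0]] * 25 + [[0.0]] * 25
correct = [1] * 22 + [0] * 3 + [1] * 15 + [0] * 10

w, b = fit_source_accuracy(features, correct)
print("estimated accuracy, feature = 1:", estimate_accuracy(w, b, [1.0]))
print("estimated accuracy, feature = 0:", estimate_accuracy(w, b, [0.0]))
```

      Because the learned weights are tied to features rather than to individual sources, the same fitted model can score a source that has no labeled observations of its own, which is one way to read the abstract's claim that features reduce the amount of ground truth needed.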

      Author and article information

      Journal
      1512.06474

      Databases
