      Is Open Access

      Exploiting Features for Data Source Quality Estimation





          We study the problem of estimating the quality of data sources in data fusion settings. In contrast to existing models that rely only on conflicting observations across sources to infer quality (internal signals), we propose a data fusion model, called FUSE, that combines internal signals with external data-source features. We show, both theoretically and empirically, that FUSE yields better quality estimates with rigorous guarantees; in contrast, models that use only internal signals have weaker guarantees or none at all. We study two approaches for learning FUSE's parameters: (i) empirical risk minimization (ERM), which uses ground truth and relies on fast convex optimization methods, and (ii) expectation maximization (EM), which assumes no ground truth and uses slow iterative optimization procedures. EM is the standard approach in most existing methods. An implication of our theoretical analysis is that features allow FUSE to obtain low-error estimates with limited ground truth on the correctness of source observations. We study the tradeoff between the statistical efficiency and the runtime of data fusion models along two directions: (i) whether the model uses features and (ii) the amount of ground truth available. We empirically show that features allow FUSE with ERM to obtain estimates of similar or better quality than both feature-less models and FUSE with EM, using only a few training examples (in some cases as few as \(50\)) while being much faster; in our experiments we observe speedups of \(27\times\). We evaluate FUSE on real data and show that it outperforms feature-less baselines, yielding reductions of more than \(30\%\) in the source accuracy estimation error and improvements of more than \(10\%\) in the F1-score when resolving conflicts across sources.
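The ERM idea described in the abstract can be illustrated with a minimal sketch. It is not the FUSE model itself. It assumes, purely for illustration, that a source's accuracy is a logistic function of its feature vector and that a small set of observations has been labeled correct or incorrect. The function names, the synthetic data, the learning rate, and the regularization strength are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def add_bias(features):
    # Append a constant column so the model can learn an intercept.
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_source_accuracy(features, source_ids, correct, lr=0.5, steps=2000, l2=1e-3):
    """Fit weights by gradient descent on an L2-regularized logistic loss.

    features:   (n_sources, d) array of per-source features (assumed given)
    source_ids: (n_obs,) index of the source behind each labeled observation
    correct:    (n_obs,) 1.0 if the observation matched ground truth, else 0.0
    """
    X = add_bias(features)[source_ids]
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - correct) / len(correct) + l2 * w
        w -= lr * grad
    return w


def predict_accuracy(features, w):
    # Estimated probability that each source reports a correct value.
    return sigmoid(add_bias(features) @ w)


# Synthetic illustration: 20 sources, 2 features, 50 labeled observations.
rng = np.random.default_rng(0)
n_sources, d, n_obs = 20, 2, 50
features = rng.normal(size=(n_sources, d))
true_w = np.array([1.5, -1.0, 0.5])  # hypothetical ground-truth weights
true_acc = predict_accuracy(features, true_w)

source_ids = rng.integers(0, n_sources, size=n_obs)
correct = (rng.random(n_obs) < true_acc[source_ids]).astype(float)

w_hat = fit_source_accuracy(features, source_ids, correct)
est_acc = predict_accuracy(features, w_hat)
print(np.round(est_acc[:5], 2))
```

Because the weights are shared across sources, every labeled observation informs the estimate for every source, including sources with no labeled observations at all. This is one intuition for why features can reduce the amount of ground truth needed.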
