      Snorkel: rapid training data creation with weak supervision




          Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models $2.8\times$ faster and increase predictive performance an average $45.5\%$ versus seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that gives up to $1.8\times$ speedup per pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides $132\%$ average improvements to predictive performance over prior heuristic approaches and comes within an average $3.60\%$ of the predictive performance of large hand-curated training sets.
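          The workflow the abstract describes can be made concrete with a minimal, library-free sketch: users write labeling functions — heuristics that vote a label or abstain — and a denoising step combines their noisy, overlapping votes into training labels. Snorkel fits a generative label model to estimate each function's unknown accuracy without ground truth; here a simple majority vote stands in for that step, and the spam-detection task and heuristics are purely illustrative.

```python
# Minimal weak-supervision sketch: labeling functions vote on each
# example; abstentions are allowed; votes are combined into labels.
import re
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):
    # Heuristic: messages containing URLs are often spam.
    return SPAM if re.search(r"https?://", text) else ABSTAIN

def lf_polite_closing(text):
    # Heuristic: a polite closing suggests a legitimate message.
    return HAM if "thanks" in text.lower() else ABSTAIN

LFS = [lf_contains_link, lf_polite_closing]

def label_matrix(texts):
    # One row per example, one column per labeling function.
    return [[lf(t) for lf in LFS] for t in texts]

def majority_vote(row):
    # Stand-in for Snorkel's label model: ignore abstentions and
    # take the most common remaining vote (ABSTAIN if none vote).
    votes = Counter(v for v in row if v != ABSTAIN)
    return votes.most_common(1)[0][0] if votes else ABSTAIN

texts = ["win big at https://example.test", "thanks for the report"]
L = label_matrix(texts)
labels = [majority_vote(row) for row in L]  # -> [SPAM, HAM], i.e. [1, 0]
```

          The resulting (possibly probabilistic) labels are then used to train a discriminative end model; in Snorkel itself, the majority vote above is replaced by a label model that weights each labeling function by its estimated accuracy and correlation structure.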

                Author and article information

                The VLDB Journal
                Springer Berlin Heidelberg (Berlin/Heidelberg)
                15 July 2019
                Volume 29, Issue 2, pp. 709–730
                [1] Stanford University, Stanford, CA, USA
                [2] Computer Science Department, Brown University, Providence, RI, USA
                © The Author(s) 2019

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

                Article history: 16 December 2018; 15 May 2019; 25 June 2019
                Funded by: Defense Advanced Research Projects Agency (FundRef http://dx.doi.org/10.13039/100000185); Award IDs: N66001-15-C-4043, FA8750-17-2-0095, FA8750-12-2-0335, FA8750-13-2-0039
                Funded by: US Department of Energy (FundRef http://dx.doi.org/10.13039/100000015); Award ID: 108845
                Funded by: National Institutes of Health (FundRef http://dx.doi.org/10.13039/100000002); Award ID: U54EB020405
                Funded by: Office of Naval Research (FundRef http://dx.doi.org/10.13039/100000006); Award IDs: N000141210041, N000141310129
                Funded by: Gordon and Betty Moore Foundation (FundRef http://dx.doi.org/10.13039/100000936)
                Special Issue Paper

                Keywords: machine learning, weak supervision, training data

