
      Large-scale Vietnamese point-of-interest classification using weak labeling

      research-article


          Abstract

          Points of interest (POIs) represent geographic locations grouped into categories (e.g., tourist attractions, amenities, or shops) and play a prominent role in many location-based applications. However, most POI category labels are crowd-sourced by the community and are therefore often of low quality. In this paper, we introduce the first annotated dataset for the POI category classification task in Vietnamese. A total of 750,000 POIs were collected from WeMap, a Vietnamese digital map. Because large-scale hand-labeling is inherently time-consuming and labor-intensive, we propose a new approach based on weak labeling. The resulting dataset covers 15 categories, with 275,000 weak-labeled POIs for training and 30,000 gold-standard POIs for testing, making it the largest among existing Vietnamese POI datasets. We conduct POI category classification experiments with a strong baseline (BERT-based fine-tuning) on our dataset and find that the approach is efficient and applicable at large scale. The proposed baseline achieves an F1 score of 90% on the test set and significantly improves the accuracy of WeMap POI data, by 37 percentage points (from 56% to 93%).
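
          The two headline figures in the abstract can be reproduced in form with a few lines of Python. This is an illustrative sketch only: the abstract does not say how the F1 score is averaged, so macro-averaging is an assumption here, and the function names and toy labels are hypothetical rather than taken from the article.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted category equals the gold category."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-category F1 scores (macro-averaging is an assumption)."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def improvement(before, after):
    """Return (absolute gain in percentage points, relative gain as a fraction)."""
    gain = after - before
    return gain, gain / before


# Toy labels, purely for illustration (not data from the article).
gold = ["shop", "shop", "cafe", "cafe"]
pred = ["shop", "cafe", "cafe", "cafe"]
print(accuracy(gold, pred), macro_f1(gold, pred))

# Accuracy figures quoted in the abstract: 56% before, 93% after.
points, relative = improvement(56, 93)
print(points, round(relative * 100, 1))
```

          The last two lines show why the gain is best read as 37 percentage points: relative to the starting accuracy of 56%, the same change is an increase of about 66%.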


          Most cited references (22)


          A Computer Movie Simulating Urban Growth in the Detroit Region

          W Tobler (1970)

            Citizens as sensors: the world of volunteered geography


              Snorkel: rapid training data creation with weak supervision

              Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.
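
              The labeling-function idea described above can be illustrated with a minimal sketch in plain Python. It does not use the Snorkel library, and it replaces Snorkel's learned denoising step with a simple majority vote. The keyword lists, category names, and function names are hypothetical and are not taken from the article or from Snorkel.

```python
import unicodedata
from collections import Counter

ABSTAIN = None  # a labeling function returns this when its heuristic does not apply


def _norm(text):
    """Normalize Unicode and case so Vietnamese diacritics compare reliably."""
    return unicodedata.normalize("NFC", text).lower()


def contains(name, *keywords):
    """True if any keyword occurs in the POI name (case-insensitive)."""
    text = _norm(name)
    return any(_norm(k) in text for k in keywords)


# Hypothetical labeling functions: each encodes one noisy keyword heuristic.
def lf_accommodation(name):
    return "accommodation" if contains(name, "khách sạn", "hotel") else ABSTAIN


def lf_cafe(name):
    return "cafe" if contains(name, "cà phê", "coffee") else ABSTAIN


def lf_restaurant(name):
    return "restaurant" if contains(name, "nhà hàng", "restaurant") else ABSTAIN


LABELING_FUNCTIONS = [lf_accommodation, lf_cafe, lf_restaurant]


def weak_label(name, lfs=LABELING_FUNCTIONS):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [v for v in (lf(name) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics with equal support
    return ranked[0][0]


for poi in ["Khách sạn Hà Nội", "Highlands Coffee", "Công viên Thống Nhất"]:
    print(poi, "->", weak_label(poi))
```

              Majority voting treats every heuristic as equally reliable. The abstract describes a different step, in which the accuracies of the labeling functions are estimated without ground truth, so this sketch should be read as a stand-in for that step and not as a description of it.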

                Author and article information

                Journal
                Frontiers in Artificial Intelligence (Front. Artif. Intell.)
                Publisher: Frontiers Media S.A.
                ISSN: 2624-8212
                Published: 09 December 2022
                Volume: 5
                Article number: 1020532
                Affiliations
                [1] Center of Multidisciplinary Integrated Technologies for Field Monitoring, Vietnam National University of Engineering and Technology, Hanoi, Vietnam
                [2] NTT Hi-Tech Institute, Nguyen Tat Thanh University, Ho Chi Minh City, Vietnam
                [3] Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam
                [4] FIMO, Hanoi, Vietnam
                Author notes

                Edited by: Mohammad Akbari, Amirkabir University of Technology, Iran

                Reviewed by: Rini Anggrainingsih, Sebelas Maret University, Indonesia; Suan Lee, Semyung University, South Korea

                *Correspondence: Viet Hung Luu, hunglv@fimo.vn

                This article was submitted to Natural Language Processing, a section of the journal Frontiers in Artificial Intelligence

                Article
                DOI: 10.3389/frai.2022.1020532
                PMC ID: 9780588
                PubMed ID: 36568578
                Copyright © 2022 Tran, Le, Pham, Luu and Bui.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

                History
                Received: 16 August 2022
                Accepted: 08 November 2022
                Page count
                Figures: 3, Tables: 3, Equations: 0, References: 22, Pages: 8, Words: 4028
                Categories
                Artificial Intelligence
                Original Research

                Keywords: crowd-sourcing, point-of-interest, weak labeling, Snorkel, BERT-based
