      Is Open Access

      ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation

      Preprint

          Abstract

          Web archives are a valuable resource for researchers in various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access, we identify five other objectives based on practical researcher needs, such as ease of use, extensibility and reusability. Towards these objectives, we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches without depending on any additional data stores, while improving usability by seamlessly integrating queries and derivations with external tools.
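
The abstract attributes the speed-ups to a metadata index that is consulted before the archived payloads are touched. The following is a toy Python sketch of that general idea only. It is not ArchiveSpark's API, which this record does not show. The names `IndexEntry`, `build_archive`, `select`, `load` and `derive_titles`, the index fields, and the in-memory "archive" of individually gzip-compressed records are all assumptions made for illustration.

```python
import gzip
import io
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class IndexEntry:
    """Hypothetical metadata index row: enough to filter and to locate a record."""
    url: str
    mime: str
    status: int
    offset: int  # byte offset of the compressed record in the archive
    length: int  # byte length of the compressed record


def build_archive(records: List[Tuple[str, str, int, str]]) -> Tuple[bytes, List[IndexEntry]]:
    """Concatenate individually gzip-compressed records and index them as we go."""
    blob = bytearray()
    index: List[IndexEntry] = []
    for url, mime, status, body in records:
        compressed = gzip.compress(body.encode("utf-8"))
        index.append(IndexEntry(url, mime, status, len(blob), len(compressed)))
        blob.extend(compressed)
    return bytes(blob), index


def select(index: List[IndexEntry], mime: str, status: int) -> List[IndexEntry]:
    """Step 1: filter on metadata only; no payload is read or decompressed."""
    return [e for e in index if e.mime == mime and e.status == status]


def load(archive: io.BytesIO, entry: IndexEntry) -> str:
    """Step 2: seek directly to one selected record and decompress just that one."""
    archive.seek(entry.offset)
    return gzip.decompress(archive.read(entry.length)).decode("utf-8")


def derive_titles(archive_bytes: bytes, index: List[IndexEntry]) -> Dict[str, str]:
    """Step 3: derive a small dataset (URL -> page title) from the selected records."""
    archive = io.BytesIO(archive_bytes)
    titles: Dict[str, str] = {}
    for entry in select(index, "text/html", 200):
        html = load(archive, entry)
        start, end = html.find("<title>"), html.find("</title>")
        if start != -1 and end > start:
            titles[entry.url] = html[start + len("<title>"):end]
    return titles


# Made-up sample data for the sketch.
RECORDS = [
    ("http://example.org/", "text/html", 200, "<html><title>Home</title></html>"),
    ("http://example.org/logo.png", "image/png", 200, "binary-placeholder"),
    ("http://example.org/missing", "text/html", 404, "<html><title>Not Found</title></html>"),
]
ARCHIVE, INDEX = build_archive(RECORDS)
TITLES = derive_titles(ARCHIVE, INDEX)
```

In this sketch only one of the three records is ever decompressed; the other two are rejected from the index alone, which is the kind of saving the abstract describes in general terms.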


                Author and article information

                Date: 2017-02-03
                DOI: 10.1145/2910896.2910902
                arXiv: 1702.01015
                License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
                Venue: JCDL 2016, Newark, NJ, USA
                arXiv categories: cs.DL cs.DB
                Subjects: Databases, Information & Library science
