      Is Open Access

      Detecting Malicious PowerShell Scripts Using Contextual Embeddings

      Preprint

          Abstract

          PowerShell is a command-line shell that is widely used in organizations for configuration management and task automation. Unfortunately, PowerShell is also increasingly used by cybercriminals to launch cyberattacks against organizations, mainly because it is pre-installed on Windows machines and exposes powerful functionality that attackers can leverage. This makes the problem of detecting malicious PowerShell scripts both urgent and challenging. We address this problem by presenting several novel deep-learning-based detectors of malicious PowerShell scripts. Our best model obtains a true positive rate of nearly 90% while maintaining a false positive rate below 0.1%, indicating that it can be of practical value. Our models employ pre-trained contextual embeddings of words from the PowerShell "language". A contextual word embedding projects semantically similar words to nearby vectors in the embedding space. A known problem in the cybersecurity domain is that labeled data is scarce relative to unlabeled data, which makes it difficult to devise effective supervised detection for many types of malicious activity. This is also the case with PowerShell scripts. Our work shows that this problem can be largely mitigated by pre-training a contextual embedding on unlabeled data. We trained our models' embedding layer on a script dataset enriched with a large corpus of unlabeled PowerShell scripts collected from public repositories. As established by our performance analysis, using unlabeled data for the embedding significantly improved the performance of our detectors. We expect that pre-trained contextual embeddings based on unlabeled data will find additional applications in improving classification accuracy in the cybersecurity domain.
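
          The abstract reports a true positive rate at a fixed, very low false positive rate. As a rough illustration of how such a figure can be computed from classifier scores, here is a minimal Python sketch. It is not the authors' evaluation code: the function name `tpr_at_fpr`, the convention that a higher score means "more likely malicious", and the toy scores are all assumptions made for this example.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Return (threshold, tpr) for the most permissive threshold whose
    false positive rate does not exceed max_fpr.

    Assumptions (hypothetical, not taken from the paper):
    - scores: higher means more likely malicious
    - labels: 1 = malicious, 0 = benign
    - a script is flagged when its score is strictly above the threshold
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")

    # Number of benign scripts we may flag without exceeding max_fpr.
    allowed = int(max_fpr * len(benign))

    # Using the (allowed + 1)-th highest benign score as the threshold means
    # at most `allowed` benign scores lie strictly above it.
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")

    tpr = sum(s > threshold for s in malicious) / len(malicious)
    return threshold, tpr


# Toy data, invented purely for illustration.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
threshold, tpr = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.0)
print(f"threshold={threshold}, TPR={tpr:.2f}")
```

          Fixing the false positive rate first and then reading off the true positive rate reflects the operating constraint the abstract emphasizes: benign scripts vastly outnumber malicious ones in practice, so even a small false positive rate can produce many false alarms.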

                Author and article information

                Date: 23 May 2019
                arXiv: 1905.09538
                License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
                Custom metadata: 17 pages, 7 figures
                Categories: cs.CR cs.LG

                Security & Cryptology, Artificial intelligence
