A Social Network Analysis Tool for Uncovering Cybersecurity Threats
Vlachos • Stamatiou • Tzamalis • Nikoletseas • Chantzi

INTRODUCTION
Today, the Internet archives a vast amount of cybersecurity-related data and information from a variety of sources. Yet, despite this exponentially increasing amount of information on cybercrime, cyberthreats appear to escalate, as recent studies and surveys demonstrate. More importantly, the most serious vulnerabilities, such as those unknown to the developer of a product, i.e. 0-day exploits, are traded in various underground black-hat hacker communities [3]. The fast-growing number of software vulnerabilities and the corresponding exploits support a complex ecosystem of buyers and sellers of toolkits for compromising the security of widely used software applications, operating systems and hardware devices [15]. The transactions between the discoverers and the buyers of the vulnerabilities range from totally illegal black markets to borderline (legal or semi-legal) grey markets. The buyers' profile, at least for the legal transactions, is usually that of law enforcement agencies, secret services and national security agencies. As expected, transaction information is rarely made public. Another approach, considered both legal and ethical, is to offer bug bounties to researchers for their work on discovering bugs and problems that could lead to security breaches [9].
As stated above, the amount of publicly available cybersecurity information is massive and growing exponentially. However, despite this wealth of information, the number of cybersecurity incidents is increasing at a fast pace. One plausible explanation for this seemingly contradictory fact is that a vast volume of information does not, by itself, constitute a solid and comprehensive cybersecurity defence knowledge base. Stated differently, the massive information openly available on the Internet exists in a rather fragmented and "noisy" form. It is fragmented because it comes from a tremendous number of heterogeneous sources with non-standardised formats and no semantic information or metadata that would aid processing and knowledge extraction. It is noisy because the useful part is usually buried in lengthy texts that may address, often in a wordy fashion, numerous other things, perhaps unrelated to cybersecurity, or that reflect the writer's views in a free-style discussion unsuitable for automated processing. Nevertheless, these unstructured and redundancy-laden information sources are still our primary means of obtaining data, knowledge and alerts about cybercrime and its financial aspects, which are non-negligible [2] alongside the technical ones. Of particular interest, in this context, are the direct financial consequences of software bugs for the stock prices of a firm [24] or for society in general [28].
In view of this complex cybersecurity information landscape, our goal is to describe the design, implementation and functionalities of an online, real-time system that utilises Social Networks to identify, archive and analyse cybersecurity-related discussions, posts and information exchanges. Through the information archiving and processing capabilities of the software, we can raise early-warning alerts and develop information and knowledge bases in a structured, standardised and automation-friendly format.
The rest of the paper is organised as follows. Section 2 presents recent related work. Section 3 provides an overview of the CSNA architecture. Section 4 focuses on the implementation aspects of the CSNA as well as examples of its analysis and data presentation capabilities. Finally, section 5 discusses our conclusions from designing and building the CSNA platform and provides directions for future work.

RELATED WORK
Online data from web services, social networks, and Open Source Intelligence (OSINT) feeds can provide important input in virtually all scientific disciplines and research domains [11]. This information can play a major role in extracting valuable insights into the latest developments in many human activities and physical processes. For instance, one of the most well-studied social-network-based forecasting frameworks concerns health epidemics. Ritterman et al. [20] were among the first to use Twitter to predict the evolution of the swine flu pandemic. In particular, they analysed fifty million tweets using Artificial Intelligence (AI) methods and Support Vector Machines (SVM) for Machine Learning (ML). Their encouraging results were also validated by other researchers. Successful monitoring algorithms for the influenza pandemic have also been documented in other research works [23]. Santillana et al. [21] further improved the accuracy of traditional approaches by combining social media and existing data sources in a single searching and processing framework. Nagar et al. [16] exploited geocoded Twitter data to improve the prediction of the 2012 New York City influenza epidemic. Other researchers [18] have used Wikipedia access logs to monitor and forecast six diseases. Their methods yielded promising results with high forecasting value. As our work also uses some basic epidemiological metrics to monitor the spread of malware and other cybersecurity threats, we plan to build on these approaches.
Social network analysis in support of stock market forecasts has attracted the attention of several researchers. Rao et al. [19] analysed stock market activity using Twitter sentiment analysis. The authors collected and processed four million tweets, identifying a strong correlation between stock prices and Twitter sentiments. Mao et al. [14] demonstrated a correlation of the S&P index with Twitter data which, according to the authors, renders predictions in the stock market domain feasible. Other researchers [17] focused on the information available in the StockTwits platform, using sentiment analysis and post volume indicators. The authors also highlight the volatility and complexity of stock markets, which poses limitations to such approaches. Interestingly, in previous works, researchers have tried to transfer [13] forecasting models and techniques from the financial world to the ICT security landscape. A conceptually similar approach [22] evaluates the risks stemming from remotely launched cyberattacks.
With respect to political studies, Tumasjan et al. [25] described one of the first efforts to use Twitter to estimate the German Federal election results. Their research showed that the distribution of tweets for the main parties was similar to the corresponding distribution of votes in the election.
Other researchers have tried to use Social Networks to predict election outcomes in numerous other cases. Kegan et al. [12] worked on the Pakistani and Indian elections of 2013 and 2014, respectively, using a diffusion estimation model. Ahmed and Skoric [1] emphasised the nature of the Twitter campaign characteristics in these elections as a decisive factor for the final result. Choy et al. [7] presented a platform based on sociodemographics and sentiment analysis for modelling the Singapore elections. The authors also tried to predict the 2012 US presidential elections in [8], with, however, only partial success. Bovet et al. [5] used a combination of statistical physics of complex systems and artificial intelligence to show that social network analysis can provide more timely and less expensive estimates of election results, as accurate as national polls. Likewise, Burnap et al. [6] used sentiment analysis of 13,899,073 tweets to predict the results of the 2015 General Election in the UK. A significant contribution of their work is the identification of the importance of the geolocation information of tweets for a successful election result prediction. On the other hand, using Twitter to predict elections has a number of limitations and shortcomings, which are summarised in Gayo-Avello's work [10].
Many researchers have tried to predict criminal activity based on social media, with encouraging results. Wang et al. [27] focused on "hit-and-run" crimes. They used automatic sentiment analysis and natural language processing on related Twitter posts. Bendler et al. [4] used geographically weighted regression analysis to improve the accuracy of delinquency prediction. A very interesting combination of techniques for improving crime predictions is presented in [26]. In this work, the author gathers data from Foursquare, Twitter and taxi trips to provide evidence for the importance and necessity of deploying hybrid approaches based on temporal and static analysis.
Although all crime types share some common patterns, our research focuses exclusively on cybercrime activity, as in [29]. In that work, the authors concentrate on analysing, understanding and predicting DDoS attacks with satisfactory accuracy. In contrast, in our work, we focus on evolving real-time cybersecurity threats using a dynamic, intelligent threat detection approach.

THE SAINT SOCIAL NETWORK ANALYSER
The primary target of the Cybersecurity Social Network Analyser (CSNA) platform is Twitter. Twitter is a very useful and massive source of up-to-date, real-time, publicly available information generated continuously by users of any background and educational level, on virtually any subject. In our case, the focus is on cybersecurity-related information.
However, simply obtaining raw data from the social medium's API (e.g. Twitter's API) does not provide much added value. The proposed software system adds value to the obtained data by processing and analysing it and by preparing results to be archived in standard data formats and visualised in intuitive graphical representations. Using Python, which has excellent string processing capabilities and great ease of use, it is possible to develop virtually any conceivable data processing procedure on any social medium's data, as long as one builds a suitable client application to interact with the target social network's API.
The general architecture of the CSNA platform is shown in Figure 1. The data flow and data handling procedures of the CSNA are divided into six steps:
(1) Setting up a search query: In this step, a query containing search keywords, i.e. hashtags, is created and passed to the crawler script as a parameter, so that a search is initiated for tweets that include these hashtags.
(2) User Authentication (OAuth): At this point, the crawler authenticates with the social network's API using the OAuth protocol.
The CSNA is one of the two primary subsystems of the SAINT platform. The Open Source Intelligence (OSINT) module is a complementary framework for monitoring ongoing threats. It is currently under development and will therefore be discussed in more detail in the Future Work section of this paper. The OSINT module collects related information from a number of publicly available web sites, services and datasets and stores it for further analysis in the SAINT database. One important aspect of this two-module approach, based on the CSNA and OSINT subsystems, is the possibility to correlate and cross-validate the findings of both. During its last 12 months of operation, SAINT managed to identify imminent threats based on the correlation and concurrency of spikes in the information received by both subsystems. In Section 4.3 we provide some relevant examples.
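Step (1) above can be sketched in a few lines. The hashtag lists and the "OR" query syntax below are illustrative assumptions for this sketch, not the exact keywords or configuration used by the CSNA crawler.

```python
# Minimal sketch of step (1): building the hashtag search query that is
# passed to the crawler script as a parameter. The hashtags themselves
# are hypothetical examples.
MARKETS_HASHTAGS = ["#bugbounty", "#hackerone", "#security"]
THREATS_HASHTAGS = ["#malware", "#ransomware", "#phishing"]

def build_query(hashtags):
    """Join hashtags with OR so a tweet matching any of them is returned."""
    return " OR ".join(hashtags)

markets_query = build_query(MARKETS_HASHTAGS)
threats_query = build_query(THREATS_HASHTAGS)
print(markets_query)  # #bugbounty OR #hackerone OR #security
```

Keeping the query outside the crawler code, as a parameter, makes it straightforward to retarget the same crawler at a different set of threats.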

Details on the Implementation and Operation of the System
Each tweet instance obtained by the CSNA is stored in the MongoDB database in a JSON-style format. The use of MongoDB as a data repository renders data manipulation and analysis more convenient. We use Python's pymongo library to connect each of our Twitter sub-crawlers to the MongoDB database system. Finally, each sub-crawler writes new records (tweets) to the corresponding Twitter collection (holding either cybersecurity markets or cyberattack related data) in the general database, which is named 'twitter'.
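The storage step can be sketched as follows. To keep the example self-contained, plain dictionaries stand in for MongoDB collections; in the actual system, pymongo's `MongoClient()["twitter"][collection_name]` with `insert_one` would take their place, and the collection names here are assumptions.

```python
import json
from collections import defaultdict

# In-memory stand-in for the 'twitter' MongoDB database: one list of
# JSON-style records per collection ("markets" and "threats" are
# illustrative collection names).
db = defaultdict(list)

def store_tweet(collection_name, tweet):
    """Store one tweet, as obtained from the API, in JSON-style form."""
    record = json.loads(json.dumps(tweet))  # ensure JSON-serialisable
    db[collection_name].append(record)

store_tweet("markets", {"id": 1, "text": "New #bugbounty announced"})
store_tweet("threats", {"id": 2, "text": "#ransomware spike observed"})
print(len(db["markets"]), len(db["threats"]))  # 1 1
```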

Entity Analysis
We performed frequency analysis on the data collected by the Twitter sub-crawlers into the two Twitter database collections. Slicing and dicing the data entities allows us to produce interesting statistics. The results of such processing are the tweet entity analyses of the contents of the collections.
We were interested in extracting the 10 most frequently appearing user mentions, hashtags, and word terms over the last 7 days in the tweets collected by the relevant sub-crawler. We searched over all the tweets stored in the categorised collections (markets-based, threats-based). The results of this analysis can be seen in Figures 2, 3, and 4, respectively.
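The core of this entity frequency analysis can be sketched with a counter over regex matches. The regular expressions below are simplified assumptions; the CSNA's actual patterns additionally handle emoticons, URLs and HTML remnants, as discussed in the next paragraph.

```python
import re
from collections import Counter

# Extract @mentions and #hashtags with simple regexes and rank the most
# frequent ones (simplified sketch of the entity frequency analysis).
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def top_entities(tweets, pattern, n=10):
    counts = Counter()
    for text in tweets:
        counts.update(token.lower() for token in pattern.findall(text))
    return counts.most_common(n)

tweets = [
    "New #malware campaign reported by @researcher",
    "#malware and #phishing on the rise says @researcher",
]
print(top_entities(tweets, HASHTAG_RE, n=2))
# [('#malware', 2), ('#phishing', 1)]
```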
We employ various aspects of text analytics and NLP methods, such as text pre-processing, encoding, normalisation and tokenisation, as well as statistical analysis of the tweets and various regular expression methods applied to the raw text. Tokenisation is a vital step of the process, in which a stream of text is broken down into individual units called tokens. In their simplest form, these units are words, but we also work on more complex tokenisations that deal with phrases, symbols, etc. Another useful pre-processing operation is stop word removal. Stop words are words that do not bear any content when taken in isolation, such as articles, prepositions, adverbs, and so on. Various regular expression (regex) methods are applied to avoid duplicate counts and to recognise emoticons, HTML tags, URLs, numbers and other characters, as well as targeted words. To complete the stop word list, we finally included the 'rt' and 'via' tokens as well as '…' (the single-character Unicode symbol for a horizontal ellipsis).
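The pre-processing pipeline described above can be sketched as follows. The stop word list here is a tiny illustrative subset of the platform's actual list, and the regexes are deliberately simplified.

```python
import re

# Sketch of the pre-processing pipeline: strip URLs and HTML remnants,
# tokenise (keeping # and @ prefixes), and drop stop words.
URL_RE = re.compile(r"https?://\S+")
HTML_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[#@]?\w+")

# Illustrative subset; the real list also covers articles, prepositions,
# adverbs, etc. "\u2026" is the single-character horizontal ellipsis.
STOP_WORDS = {"rt", "via", "the", "a", "an", "and", "of", "to", "\u2026"}

def preprocess(text):
    text = URL_RE.sub(" ", text)   # remove URLs
    text = HTML_RE.sub(" ", text)  # remove HTML tags
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("RT @user: the new #ransomware wave https://t.co/x"))
# ['@user', 'new', '#ransomware', 'wave']
```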
Applying the NLP techniques described above to the threats-based and markets-based collections of tweets, a categorical analysis step is subsequently performed, and two bar chart plots are created for each kind of entity analysis, in real time. All the figures given in this paper are screenshots of the outputs generated by our online CSNA platform. Figure 2 shows the most frequently mentioned users in the collected raw tweets stored in the two kinds of collections. From Figure 3, we can see that some of the most common hashtags mentioned in the markets-based tweets (upper bar plot) were: security, bugbounty, and Hackerone. This agrees with our intuition, since HackerOne is one of the most well-known organisations specialising in organising bug bounties on behalf of companies and, in general, in discovering the most serious security issues before they can be exploited by criminals. Additionally, in the lower bar plot of the same Figure, malicious software hashtags such as malware, ransomware and phishing are among the top mentioned threats. We can observe that some of the top attack methods that cybercriminals currently use, such as malware and ransomware, stir interest and ignite users' posts about IoT security, cyberattacks, hacking, artificial intelligence, cybercrime and cybersecurity in general. In the last category of bar plots (Figure 4), there are a few terms on the left-hand side of the plots with a very high frequency of appearance. For example, the most frequently appearing term is twice as frequent as the ones after index position 10. As we move towards the right-hand side of the plots, the curve becomes less steep, implying that the terms on the right-hand portion of the plots have similar frequencies of appearance.

Time Series Descriptive Analysis
In this section, we discuss another aspect of our analysis: the study of the distribution of tweets over time. A time series is a sequence of data points consisting of successive observations over a given time interval. As Twitter provides a 'created_at' field with a precise timestamp for each tweet, we can reorder tweets into "time buckets" so that we can examine how users react to events in real time, i.e. while an event is at its onset or still ongoing.
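The "time bucket" step can be sketched as follows. The timestamp format mirrors the classic Twitter API `created_at` format; the hourly granularity is an illustrative choice, not necessarily the one the platform uses.

```python
from collections import Counter
from datetime import datetime

# Group tweets into hourly buckets using their 'created_at' timestamp
# (classic Twitter API timestamp format).
FMT = "%a %b %d %H:%M:%S %z %Y"

def hourly_buckets(tweets):
    buckets = Counter()
    for tweet in tweets:
        ts = datetime.strptime(tweet["created_at"], FMT)
        buckets[ts.strftime("%Y-%m-%d %H:00")] += 1
    return buckets

tweets = [
    {"created_at": "Mon May 06 10:15:00 +0000 2019"},
    {"created_at": "Mon May 06 10:45:00 +0000 2019"},
    {"created_at": "Mon May 06 11:05:00 +0000 2019"},
]
print(hourly_buckets(tweets))
# Counter({'2019-05-06 10:00': 2, '2019-05-06 11:00': 1})
```

The resulting bucket counts form exactly the kind of time series that the descriptive analysis below operates on.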
Time series descriptive analysis can provide information on how users react to trending events, or data relevant to online reputation management. The dynamic and massive nature of Twitter makes it particularly well suited for identifying the public's reactions to events and breaking news, in our case with a focus on cybersecurity. Here, our goal is to observe how users react to various cyberattacks (threats-based collection) and bug bounty related news (markets-based collection) as time evolves, and to derive valuable information about potential cybersecurity risks and related data breach costs. We exploited pandas' capabilities for handling time series and used Highcharts for the visual presentation of the series.
The CSNA uses the Highcharts JavaScript library to create informative visualisations of the time series data. Highcharts is, in general, more verbose than other data visualisation libraries, but its outputs are uncluttered and easy to interpret. The charts are exported to the online CSNA platform, together with the corresponding function and query used for the time series descriptive analysis of each category. The generated CSV and JSON files are also available for download. Figure 5 shows a time series descriptive analysis outcome derived from the corresponding cybersecurity markets collection.
Currently, the SAINT platform intends to act as a comprehensive analytical framework assisting researchers, scientists and cybersecurity experts in identifying imminent threats. The collection, processing and analysis of the data is fully automated and can be adjusted and parameterised to target additional types of cyberattacks. Although much effort is put into also automating the detection of serious cybersecurity incidents, as we will discuss later, the final decision relies on the platform operator's visual inspection of the data and the chart plots. Our effort is to make the system self-intelligent by using outlier detection techniques to identify suspicious events. To make that decision less complicated, the SAINT system utilises a comparative framework: before taking any action, the operator can compare the data from the CSNA and the OSINT module. Generally, as observed in Figure 4, when a spike related to an issue appears in an OSINT indicator's time series descriptive results, a similar spike appears a few hours or a day later in the relevant statistical result of the CSNA, indicating that Twitter users are engaging with that issue too. Thus, the two subsystems can cross-validate each other.
Empirical results suggest that when a trend is observable in both subsystems, in other words in both the social networks and the open source intelligence feeds, there is a very high probability of an ongoing or imminent incident. The automation of this procedure is among our next steps.
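The cross-validation idea just described can be sketched as a simple rule over two daily-count series: flag a day when the OSINT series spikes and the CSNA series spikes on the same day or the next. The 2-sigma threshold and one-day lag are illustrative assumptions, not the platform's calibrated detection rule.

```python
from statistics import mean, stdev

def spike_days(series, k=2.0):
    """Indices where the value exceeds the series mean by > k std devs."""
    m, s = mean(series), stdev(series)
    return {i for i, v in enumerate(series) if v > m + k * s}

def concurrent_spikes(osint, csna, max_lag=1):
    """OSINT spike days matched by a CSNA spike within max_lag days."""
    o, c = spike_days(osint), spike_days(csna)
    return sorted(d for d in o if any(d + lag in c for lag in range(max_lag + 1)))

osint = [5, 6, 5, 40, 6, 5, 6, 5]   # hypothetical OSINT indicator counts
csna  = [3, 4, 3, 4, 35, 3, 4, 3]   # hypothetical CSNA tweet counts
print(concurrent_spikes(osint, csna))  # [3]
```

Here the OSINT spike on day 3 is echoed by the CSNA on day 4, so day 3 is flagged as a likely ongoing or imminent incident.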

Word Clouds
A Word Cloud is a visual representation of text data, typically used to display keyword metadata (tags) on websites or to visualise free form text. Tags are usually single words depicted in the word cloud using different font sizes and/or colors depending on their importance.
The word cloud format is useful for quickly identifying the most important terms in a body of text while, at the same time, comparing the terms according to their importance. Moreover, significant textual data items can be highlighted. Word clouds are also widely used for analysing data from social networks.
There are three main reasons behind our decision to use word clouds to present textual data:
1. They convey information with simplicity and clarity.
2. They are powerful communication tools that are easy to understand and interpret.
3. They are more visually engaging than a simple table of data.
In our case, we use the word cloud visualisation method to present the most important terms (tags) for each Twitter database collection over the past 30 days. The data analysis process is similar to the one described in Section 4.2. The Categorical Analysis part of the CSNA analysis framework is used to identify the 10 terms most frequently used by Twitter users. This data visualisation method is useful for understanding the trends in a topic in depth (in our case, cybersecurity markets and cyberattacks), as it facilitates the application of human intelligence to help stakeholders make more informed decisions.
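The essence of the word cloud rendering, mapping term frequencies to font sizes, can be sketched with a linear scale. The size range and the linear mapping are illustrative choices, not the exact parameters of the CSNA's word cloud component.

```python
from collections import Counter

def font_sizes(freqs, min_size=12, max_size=48):
    """Map each term's count to a font size, linearly between the extremes."""
    lo, hi = min(freqs.values()), max(freqs.values())
    span = hi - lo or 1  # avoid division by zero when all counts are equal
    return {
        term: min_size + (count - lo) * (max_size - min_size) / span
        for term, count in freqs.items()
    }

# Hypothetical 30-day term counts from one collection.
freqs = Counter({"malware": 40, "phishing": 20, "botnet": 10})
sizes = font_sizes(freqs)
print(sizes["malware"], sizes["botnet"])  # 48.0 12.0
```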

CONCLUSIONS AND DIRECTIONS FOR FUTURE WORK
Our work presented the design, implementation, and operation of a Cybersecurity Social Network Analyser (CSNA). We discussed the applicability of the CSNA in discovering, extracting, and analysing information contained in tweets, based on appropriate keywords or keyword combinations. The obtained information is stored in a database using a standardised electronic format for further processing and visualisation.
We focused on Twitter, since Social Networks can provide significant insights into various trends related to ongoing cybersecurity activities and cybersecurity breach incidents. For instance, the number and frequency of cybersecurity-relevant discussions in social media can be highly indicative of current malware epidemics and other types of cybercriminal activity. The identification of trends is achieved by the platform through a variety of approaches.
Hashtags act as a filter for selecting tweets of interest. The NLP methods deployed by the software are particularly effective in extracting relevant text from the main body of the raw tweet data. Twitter metadata items are also particularly effective in correlating structured and unstructured tweet information.
The visualisation of the information in time-series form provides valuable timing data related to the development and onset of these trends. The graphical display of the rates of appearance of specific topics in tweets highlights their significance at the time they were posted. Empirical results correlate the social network analysis outcomes with specific incidents, such as ongoing phishing campaigns and imminent ransomware attacks. Our findings are consistent with the work of other researchers regarding the deployment of social networks in monitoring and forecasting various phenomena in virtually any aspect of human activity. To increase the accuracy of our detection mechanisms, we also deploy some basic mathematical epidemiological models, such as the epidemic curve, to identify points in time where an escalation is taking place.
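The escalation detection on an epidemic-style curve can be sketched as a trailing-window rule: a day is flagged when its count exceeds the mean of the preceding window by more than k standard deviations. The window length and k below are illustrative assumptions, not the platform's calibrated values.

```python
from statistics import mean, stdev

def escalation_points(counts, window=5, k=2.0):
    """Flag indices whose count exceeds trailing mean + k * trailing stdev."""
    flagged = []
    for i in range(window, len(counts)):
        past = counts[i - window:i]
        if counts[i] > mean(past) + k * stdev(past):
            flagged.append(i)
    return flagged

# Hypothetical daily tweet counts for one threat topic.
daily = [4, 5, 6, 5, 4, 5, 6, 30, 28, 7]
print(escalation_points(daily))  # [7]
```

Day 7 is flagged as the onset of the escalation; day 8 is not re-flagged because the spike itself inflates the trailing statistics, which is the desired "onset only" behaviour.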
Our future work, partially implemented, focuses on improving the CSNA platform with more intelligent methods for locating and extracting information. This work in progress aims to create a dynamic, semi-autonomous system that will be able to automatically refine and improve its search methods and goals. Using the new version of the SAINT tools, we expect to be able to identify the correlation between the data diffused in Social Networks and Open Source Intelligence (OSINT) feeds. We are already in the process of extracting data from numerous information sources. In particular, we are able to collect public data from more than 10 different datasets covering threats related to malware attacks, ransomware operations, botnet activity, and spamming and phishing campaigns. The datasets are updated periodically, on an hourly or daily basis, and we are therefore able to process them, correlate them with the CSNA's output and identify common patterns. In the next version, a unified processing and analysis framework is expected to provide automatic detection of suspicious patterns in both subsystems. Finally, an experimental Deep Web Crawler (DWC) is also under heavy development. The goal is to make the DWC follow leads from the CSNA and the OSINT subsystem. As cyberattacks become more sophisticated, innovative and multidisciplinary approaches are worth exploring.