diff --git a/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.blg b/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.blg index a72913734acb9eb4ef6519b50a09ad4664a8ab05..d960ff43360c263f54c40eb1c511db64ddff8479 100644 --- a/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.blg +++ b/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.blg @@ -11,46 +11,48 @@ Database file #1: ../Thesis_Docs/sources/references.bib -- IEEEtran.bst version 1.14 (2015/08/26) by Michael Shell. -- http://www.michaelshell.org/tex/ieeetran/bibtex/ -- See the "IEEEtran_bst_HOWTO.pdf" manual for usage information. +Warning--empty journal in shalaginov2016malware Done. You've used 27 entries, 4087 wiz_defined-function locations, - 975 strings with 14010 characters, -and the built_in function-call counts, 24093 in all, are: -= -- 1844 -> -- 673 -< -- 204 -+ -- 368 -- -- 127 -* -- 1158 -:= -- 3343 -add.period$ -- 67 + 1016 strings with 16168 characters, +and the built_in function-call counts, 25552 in all, are: += -- 2012 +> -- 704 +< -- 193 ++ -- 378 +- -- 136 +* -- 1403 +:= -- 3724 +add.period$ -- 61 call.type$ -- 27 -change.case$ -- 29 -chr.to.int$ -- 498 -cite$ -- 27 -duplicate$ -- 1683 -empty$ -- 2002 -format.name$ -- 145 -if$ -- 5704 +change.case$ -- 41 +chr.to.int$ -- 474 +cite$ -- 28 +duplicate$ -- 1752 +empty$ -- 2027 +format.name$ -- 156 +if$ -- 5929 int.to.chr$ -- 0 int.to.str$ -- 27 -missing$ -- 306 -newline$ -- 106 -num.names$ -- 27 -pop$ -- 765 +missing$ -- 341 +newline$ -- 152 +num.names$ -- 31 +pop$ -- 824 preamble$ -- 1 purify$ -- 0 quote$ -- 2 -skip$ -- 1833 +skip$ -- 1868 stack$ -- 0 -substring$ -- 1235 -swap$ -- 1398 -text.length$ -- 38 +substring$ -- 1195 +swap$ -- 1500 +text.length$ -- 35 text.prefix$ -- 0 top$ -- 5 type$ -- 27 -warning$ -- 0 -while$ -- 117 +warning$ -- 1 +while$ -- 115 width$ -- 29 -write$ -- 278 +write$ -- 354 +(There was 1 warning) diff --git a/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.pdf b/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.pdf index f2a11669349e3f1ab4712df8cd5a22c8359653ab..5ab87aabb51dbbdea402f27a8db2b7a94908201e 100644 Binary files a/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.pdf and b/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.pdf differ diff --git a/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.tex b/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.tex index 6ff60dca11c8e503ff7976ff42b737f1fddbfe2b..6c1d0107bd73f88bfefbb4b0786e7b5fd8ce252a 100644 --- a/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.tex +++ b/Thesis_Docs/Nikkhah_Nasab-Aida-Mastersthesis.tex @@ -64,7 +64,7 @@ % Replace these values %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newcommand{\thesistitleDE}{Leistungsbewertung des BAYWATCH-Frameworks: Eine Evaluierung mit realen und künstlichen Daten} -\newcommand{\thesistitleEN}{DBAYWATCH Framework Performance: An Evaluation with Real and Artificial Data} +\newcommand{\thesistitleEN}{BAYWATCH Framework Performance: An Evaluation with Real and Artificial Data} \newcommand{\student}{Aida Nikkhah Nasab} \newcommand{\matrnr}{22208964} %\newcommand{\stammnr}{234567} % Unset this variable if you only have a (single) matriculation number diff --git a/Thesis_Docs/main.tex b/Thesis_Docs/main.tex index 4afe064800232d3a6e123726f5435ede8fc5a760..70d5e65a69b1e12550ccdffe94d7d9fe34d7b17d 100644 --- a/Thesis_Docs/main.tex +++ b/Thesis_Docs/main.tex @@ -1,32 +1,50 @@ \chapter{Topical Overview} \section{Problem Statement} -In today’s interconnected digital landscape, the protection of information and the security of network systems are of paramount 
importance. Modern enterprises generate vast quantities of user log data every day, which holds valuable insights but also poses significant challenges in distinguishing benign activities from potential security threats. As cyber attacks—especially Advanced Persistent Threats (APTs)—grow increasingly sophisticated and complex, there is an urgent need for robust, preventive cybersecurity measures. A critical challenge lies in efficiently sifting through enormous volumes of log data to identify and respond to malicious events before they inflict substantial harm on network infrastructures. Furthermore, it is essential to check the performance of the existing detection framework on both real data and synthetically generated data, ensuring that the method performs reliably under diverse conditions. This research is dedicated to enhancing cybersecurity in large-scale network environments by enabling early detection of APTs and other threats, thereby safeguarding the network infrastructure from potential damage. +In today’s interconnected digital landscape, safeguarding information and securing network systems are critical priorities. Modern enterprises generate vast amounts of user log data daily, containing valuable insights yet presenting significant challenges in differentiating benign activities from potential security threats. As cyberattacks—particularly Advanced Persistent Threats (APTs)—become increasingly sophisticated, there is a pressing need for robust and proactive cybersecurity measures. A key challenge lies in efficiently analyzing massive volumes of log data to detect and mitigate malicious activities before they cause substantial harm to network infrastructures. One common tactic employed by APTs is beaconing behavior, where compromised systems communicate with command-and-control (C2) servers at regular intervals. Identifying such covert communication patterns is important for early threat detection. This thesis focuses on the detection of beaconing behavior to enhance cybersecurity in large-scale network environments. \section{Research Objectives} The primary objective of this research is to advance network security by developing improved methods for early detection and rapid response to potential cyber threats, with a particular emphasis on APTs. Specific research objectives include the development of advanced detection techniques that leverage distinctive behavioral patterns, the implementation of proactive security measures to enable swift responses, and the enhancement of beaconing behavior analysis through an innovative signal analysis pipeline. In addition, the research aims to continuously evaluate and improve security policies based on both real-world and synthetic data. \section{Research Questions} -The research is guided by several key questions, including: How can beaconing behavior be effectively detected within large-scale network data to provide early warning of potential threats? What is the impact of periodicity in network communications on distinguishing between benign and malicious activities? Furthermore, is the beaconing behavior detectable in generated synthetic data, and how does its detectability compare to that in real-world data? +The research is guided by several key questions, including: +\begin{itemize} +\item How can beaconing behavior be effectively detected within large-scale network data to provide early warning of potential threats? 
+\item What is the impact of periodicity in network communications on distinguishing between benign and malicious activities?
+\item Is beaconing behavior detectable in generated synthetic data, and how does its detectability compare to that in real-world data?
+\end{itemize}
 
 \section{Structure of the Thesis}
-The thesis is organized into a cohesive narrative that begins by establishing the foundational background and core concepts essential to understanding network security and periodicity detection. Following this, a review of related work contextualizes the current research within the broader field. The methodology chapter then details the advanced techniques introduced in the framework. Chapter 5, Data Analysis is an exploration of real-world network log data to uncover patterns and insights related to beaconing behavior, setting the stage for subsequent evaluations. Chapter 6 represents a detailed description of the procedures and techniques employed to generate synthetic beaconing data, which is used to validate the performance of the detection framework under controlled conditions. Chapter 7 represents Evaluation and Results. An investigation and comparison of the framework’s performance on both real and synthetic data, summarizing key findings and contributions, and discussing potential improvements. Finally Chapter 8 represents Conclusions and Future Work. The final chapter presents the overall conclusions of the research, outlines the contributions made, and proposes directions for future research in the field of network security.
+The thesis is organized into a cohesive narrative that begins by establishing the foundational background and core concepts essential to understanding network security and periodicity detection. Following this, a review of related work contextualizes the current research within the broader field. The methodology chapter then details the advanced techniques introduced in the framework. Chapter 5, Data Analysis, explores real-world network log data to uncover patterns and insights related to beaconing behavior, setting the stage for subsequent evaluations. Chapter 6 describes the procedures and techniques employed to generate synthetic beaconing data, which is used to validate the performance of the detection framework under controlled conditions. Chapter 7, Evaluation and Results, investigates and compares the framework’s performance on both real and synthetic data, summarizes key findings and contributions, and discusses potential improvements. Finally, Chapter 8, Conclusions and Future Work, presents the overall conclusions of the research, outlines the contributions made, and proposes directions for future research in the field of network security.
+
+The source code and implementation details of this research are available in the following repository: {\footnotesize\texttt{https://mygit.th-deg.de/an28964/master-thesis}}
 
 \chapter{Background}
-This chapter provides the foundational knowledge necessary for understanding the context and significance of this research. It begins with an overview of the cybersecurity landscape, emphasizing the current state, emerging trends, and persistent challenges faced by organizations. It then explores Advanced Persistent Threats (APTs) and their sophisticated, covert tactics that pose significant risks to enterprise networks. The discussion also covers the concept of periodicity in network communication, which is for detecting anomalies in cybersecurity contexts. On top of that, the chapter represents the role of time series databases, with a specific focus on InfluxDB, in managing and analyzing the vast amounts of data generated in cybersecurity operations. Finally, the chapter introduces the BAYWATCH framework, which serves as the foundation for the research by providing a structured approach to detecting beaconing behavior in network traffic.
+This chapter provides the foundational knowledge for understanding the context and significance of this research. It begins with an overview of the cybersecurity landscape, emphasizing the current state, emerging trends, and persistent challenges organizations face. It then explores Advanced Persistent Threats (APTs) and their sophisticated, covert tactics that pose significant risks to enterprise networks. The discussion also covers the concept of periodicity in network communication, which is essential for detecting anomalies in cybersecurity contexts. In addition, the chapter examines the role of time series databases, with a specific focus on InfluxDB, in managing and analyzing the vast amounts of data generated in cybersecurity operations. Finally, the chapter introduces the BAYWATCH framework, which serves as the foundation for the research by providing a structured approach to detecting beaconing behavior in network traffic.
 
-The field of cybersecurity is continually evolving, with new threats emerging as technology advances. Understanding these threats and the strategies to counter them is for protecting sensitive information, ensuring the continuity of operations, and maintaining the integrity of enterprise networks. This chapter lays the foundation for the research by discussing key concepts and technologies relevant to cybersecurity, setting the stage for the detailed analysis and solutions proposed in subsequent chapters.
+The field of cybersecurity is continually evolving, with new threats emerging as technology advances. Understanding these threats and the strategies to counter them is important for protecting sensitive information, ensuring operational continuity, and maintaining the integrity of enterprise networks. This chapter lays the foundation for the research by discussing key concepts and technologies relevant to cybersecurity, setting the stage for the detailed analysis and solutions proposed in subsequent chapters.
 
 \section{Cybersecurity Landscape}
-The cybersecurity landscape is characterized by a dynamic and increasingly complex environment where various types of cyber threats continually evolve. Organizations across the globe face numerous challenges in protecting their networks, data, and systems from these threats, which range from malware and ransomware to sophisticated nation-state attacks.
+The cybersecurity landscape is characterized by an ever-evolving and increasingly complex environment, where cyber threats continuously adapt to exploit emerging technologies and vulnerabilities. Organizations across industries face persistent challenges in securing their networks, data, and critical infrastructure against a wide array of threats, ranging from malware and ransomware to highly coordinated Advanced Persistent Threats (APTs) and nation-state-sponsored attacks. As adversaries refine their techniques, security strategies must evolve to stay ahead of these rapidly shifting threats.
+ +Several key factors shape the modern cybersecurity landscape, making threat detection and mitigation increasingly challenging. The rapid digitization of industries, the widespread adoption of cloud computing, and the proliferation of Internet of Things (IoT) devices have significantly expanded the attack surface, providing adversaries with new entry points to exploit. While these technological advancements offer substantial benefits in terms of scalability and efficiency, they also introduce new vulnerabilities, often outpacing the ability of traditional security solutions to keep up. -Cybersecurity encompasses a wide range of practices, technologies, and strategies aimed at safeguarding information and systems from unauthorized access, damage, or disruption. It involves both proactive measures, such as implementing robust security architectures and practices, and reactive measures, such as incident response and recovery strategies. The cybersecurity landscape is shaped by various factors, including the rapid digitization of industries, the proliferation of cloud services, the Internet of Things (IoT), and the increasing sophistication of cyber attacks. +Additionally, the emergence of cybercrime-as-a-service models, such as Ransomware-as-a-Service (RaaS) and Phishing-as-a-Service (PhaaS), has lowered the barrier to entry for cybercriminals, allowing even unskilled attackers to launch sophisticated campaigns with minimal effort. Threat actors leverage automation, AI-driven attack techniques, and deepfake technologies to enhance the effectiveness of their operations, making it more difficult for conventional security measures to detect and neutralize threats in real time. -The rapid digitization of industries, the increasing reliance on cloud services, and the proliferation of Internet of Things (IoT) devices have significantly expanded the attack surface for cyber threats. These developments, while beneficial, have introduced new vulnerabilities that attackers are quick to exploit. Additionally, the rise of ransomware as a service (RaaS) and the growing sophistication of phishing attacks reflect the evolving threat landscape. +Another critical challenge in cybersecurity is the shortage of skilled security professionals. As attacks grow in complexity, organizations struggle to recruit and retain experts capable of defending against sophisticated adversaries. This talent gap is exacerbated by the increasing demand for expertise in specialized areas such as threat hunting, digital forensics, AI-driven security analytics, and adversarial attack simulations. As a result, organizations are turning to automation, machine learning-driven anomaly detection, and advanced behavioral analytics to augment human capabilities and enhance their defensive strategies. Given these challenges, the need for detection frameworks is more urgent than ever. + +This research focuses specifically on the detection of beaconing behavior, a common characteristic of command-and-control (C2) communications used in APTs, botnets, and malware infections. By analyzing periodic patterns in network traffic, this study aims to develop effective methodologies for identifying stealthy cyber threats and mitigating their impact. The next sections will further explore the theoretical foundations and practical applications of periodicity detection in cybersecurity, setting the stage for the proposed detection framework. 
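+
+To make the idea of periodicity analysis concrete, the following minimal sketch is included for illustration only; it is not the BAYWATCH implementation, and the 300-second interval, the small jitter, and the one-second binning are arbitrary example values. It shows how a regular beaconing interval appears as a dominant peak in the frequency spectrum of a binned event-count series, which is the intuition behind the FFT-based analysis used later in this thesis:
+
+\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
+import numpy as np
+
+# Illustrative only: one simulated beacon event roughly every 300 s over one day,
+# with a small random jitter around the nominal interval.
+rng = np.random.default_rng(0)
+interval, jitter, day = 300.0, 5.0, 86400.0
+timestamps = np.cumsum(rng.normal(interval, jitter, int(day // interval)))
+
+# Bin the timestamps into a per-second event-count series.
+counts, _ = np.histogram(timestamps, bins=np.arange(0.0, day + 1.0, 1.0))
+
+# FFT of the mean-removed series; the strongest non-DC component reveals the period.
+spectrum = np.abs(np.fft.rfft(counts - counts.mean()))
+freqs = np.fft.rfftfreq(counts.size, d=1.0)   # cycles per second
+dominant = freqs[1:][np.argmax(spectrum[1:])]
+print(f"Estimated beaconing period: {1.0 / dominant:.1f} s")
+\end{lstlisting}
+
+In real log data the same idea is applied per source/destination pair, after the whitelisting and filtering steps described in the following chapters.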
-Another significant challenge is the shortage of skilled cybersecurity professionals, which hampers the ability of organizations to effectively defend against these threats. This gap is exacerbated by the complexity of modern networks and the need for advanced tools and techniques to detect and mitigate sophisticated attacks. \section{Advanced Persistent Threats (APTs) and Covert Tactics} -Advanced Persistent Threats (APTs) represent one of the most sophisticated and dangerous forms of cyber attacks. APTs involve prolonged, targeted efforts by attackers, typically state-sponsored or highly organized criminal groups, aimed at stealing sensitive information, disrupting operations, or compromising infrastructure. Unlike traditional cyber attacks, which may be opportunistic and short-lived, APTs are characterized by their stealth, persistence, and the significant resources devoted to them. +Advanced Persistent Threats (APTs) represent one of the most sophisticated and dangerous forms of cyber attacks, posing significant risks to organizations, governments, and critical infrastructure. Unlike conventional cyber attacks that are often opportunistic and short-lived, APTs involve prolonged, highly targeted operations conducted by state-sponsored actors, cybercriminal syndicates, or advanced hacking groups. + +What sets APTs apart is their stealth and persistence. Attackers typically employ multi-stage intrusion strategies, beginning with initial reconnaissance, followed by infiltration through zero-day exploits, spear-phishing campaigns, or supply-chain attacks. Once inside the target network, they establish command-and-control (C2) infrastructure, allowing them to maintain access over extended periods while evading detection. To achieve this, APT actors leverage sophisticated evasion techniques, including polymorphic malware, encrypted C2 channels, and lateral movement tactics to expand their foothold within the network. + +A defining characteristic of APT campaigns is their low-and-slow approach, where attackers deliberately limit their activities to avoid triggering security alerts. Unlike other malware, which may cause immediate disruptions, APTs focus on long-term intelligence gathering, strategic espionage, and covert data exfiltration, often remaining undetected for months or even years. The consequences of a successful APT attack can be severe, leading to financial losses, reputational damage, intellectual property theft, and even national security threats. + +Given their stealthy nature, detecting APTs requires advanced behavioral analysis and anomaly detection techniques rather than relying solely on signature-based security measures. One common indicator of APT activity is beaconing behavior, where compromised systems periodically communicate with remote C2 servers. Identifying these covert communication patterns within large-scale network traffic is a critical challenge in cybersecurity, making periodicity detection a valuable approach for uncovering hidden threats. + +This research aims to enhance APT detection capabilities by focusing on beaconing behavior analysis in network traffic. By leveraging both real-world and synthetic data, the study explores effective methodologies for identifying periodic communication patterns, enabling organizations to detect and mitigate stealthy cyber threats before they cause significant harm. 
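+
+As a purely illustrative sketch of the beaconing pattern described above (it is not part of the detection framework; the interval, jitter, hostname, and field names are assumptions that merely mirror the style of the synthetic log data introduced in the Methodology chapter), the following snippet simulates the log entries a compromised host might produce when contacting its C2 server at a fixed interval with bounded random jitter:
+
+\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
+import json
+import random
+from datetime import datetime, timedelta, timezone
+
+def generate_beacon_log(start, interval_s=60, jitter_s=5, count=10):
+    """Simulate beacon check-ins: a fixed interval plus bounded random jitter."""
+    entries, current = [], start
+    for _ in range(count):
+        current += timedelta(seconds=interval_s + random.uniform(-jitter_s, jitter_s))
+        entries.append({
+            "IP_Address": "127.0.0.1",              # placeholder host address
+            "logdate": current.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
+            "url_hostname": "beacon0.example.com",  # synthetic C2 destination
+            "Is_A": "yes",                          # tag marking the entry as generated
+        })
+    return entries
+
+start = datetime(2023, 8, 1, tzinfo=timezone.utc)
+print(json.dumps(generate_beacon_log(start), indent=2))
+\end{lstlisting}
+
+Even in this toy output, the near-constant spacing of the \texttt{logdate} values is what periodicity-based detection methods such as BAYWATCH exploit, despite the jitter deliberately added to each interval.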
\begin{figure} \centering @@ -35,7 +53,7 @@ Advanced Persistent Threats (APTs) represent one of the most sophisticated and d \label{fig:apt_attack_lifecycle} \end{figure} -Figure \ref{fig:apt_attack_lifecycle} illustrates the lifecycle of an APT attack, highlighting the various stages involved, from initial reconnaissance to exfiltration of data. Understanding these stages is important for developing effective detection and mitigation strategies. +Figure \ref{fig:apt_attack_lifecycle} illustrates the sequential stages of an APT attack, beginning with Phase 1 (Reconnaissance), where attackers gather target intelligence through open-source research or network scanning. Phase 2 (Initial Compromise) involves breaching defenses via methods like spear-phishing or exploiting vulnerabilities. Phase 3 (Establishing Foothold) ensures persistence through backdoors, credential theft, or malware installation. Phase 4 (Internal Reconnaissance) focuses on mapping the compromised network to identify high-value assets. Phase 5 (Lateral Movement) enables attackers to pivot across systems, escalating privileges to reach critical data. Phase 6 (Data Exfiltration) involves stealthily transferring stolen information via encrypted channels. Finally, Phase 7 (Clearing Tracks) erases forensic evidence (e.g., log deletion, tool removal) to evade detection. This progression underscores the need for frameworks like Baywatch to detect subtle, multi-stage behaviors—such as lateral movement, encrypted exfiltration, or log tampering—while balancing accuracy and scalability in real-world environments\cite{charan2021dmapt}. APT actors employ various covert tactics to remain undetected and achieve their objectives. Some of these tactics include: @@ -109,7 +127,9 @@ InfluxDB can be employed in cybersecurity for: \label{fig:influxdb_architecture} \end{figure} -Figure \ref{fig:influxdb_architecture} illustrates the architecture of InfluxDB and how data flows through the system, from ingestion to querying and visualization. +Figure \ref{fig:influxdb_architecture} illustrates the workflow of a distributed system designed for high-throughput data ingestion and querying. Users initiate data ingestion via an Ingest Router, which routes incoming data to an Ingester node. The Ingester ensures durability by writing data to a Write-Ahead Log (WAL) and temporarily storing metadata and raw data files. A Compactor optimizes storage by merging and persisting data to Object Storage (e.g., cloud buckets) while updating the Catalog (metadata index) for efficient lookup. The Garbage Collector purges obsolete data files based on retention policies. For queries, Users submit requests through a Query Router, which directs them to a Querier node. The Querier retrieves results by combining real-time data from Ingesters (not yet persisted) and historical data from Object Storage, leveraging the Catalog for metadata navigation. This architecture emphasizes scalability (distributed Ingesters/Queriers), reliability (WAL), and cost-efficiency (Object Storage integration), while balancing low-latency access with persistent storage management. + + \section{Overview of the BAYWATCH Framework} The BAYWATCH framework consists of four main phases, each involving one or more filtering steps. These phases are: @@ -224,19 +244,19 @@ The URL path token filter analyzes the resource path within the URLs accessed by The novelty analysis step removes duplicate cases by filtering out source/destination pairs that have already been reported for beaconing behavior. 
 This step ensures that only new and unique cases are passed on to the ranking algorithm for further analysis.
 
 \section{Investigation and Verification}
-The final phase of the BAYWATCHframework involves the manual investigation and verification of the remaining suspicious cases. This phase is for ensuring that only truly malicious beaconing cases are reported.
+The final phase of the BAYWATCH framework involves the manual investigation and verification of the remaining suspicious cases. This phase is essential for ensuring that only truly malicious beaconing cases are reported.
 
 \subsection{Feature Set}
 Each candidate case is represented by a set of features, including the source, destination, and a series of time intervals. The BAYWATCH framework generates additional features, such as the entropy of the time intervals, the n-gram histogram, and the compressibility of the symbolized series. These features are used to train a classifier for automated classification of the candidate cases.
 
 \subsection{Classifier}
-The BAYWATCH framework employs a randomforest classifier to classify the candidate cases as either benign or malicious. The random forest classifier is trained using a small set of manually investigated cases and their corresponding labels. The trained classifier is then applied to
+The BAYWATCH framework employs a random forest classifier to classify the candidate cases as either benign or malicious. The random forest classifier is trained using a small set of manually investigated cases and their corresponding labels. The trained classifier is then applied to
 the remaining cases to automate the classification process.
 
 \subsection{Bootstrapping Process}
-To minimize the manual investigation workload, the BAYWATCH framework employs a boot strapping process. A small set of candidate cases is manually investigated and used as a training set for the classifier. The trained classifier is then applied to the remaining cases, significantly
+To minimize the manual investigation workload, the BAYWATCH framework employs a bootstrapping process. A small set of candidate cases is manually investigated and used as a training set for the classifier. The trained classifier is then applied to the remaining cases, significantly
 reducing the number of cases that require manual investigation.
 
 \section{Summary}
-This chapter has provided a comprehensive overview of the cybersecurity landscape, APTs and their covert tactics, enterprise networks, periodicity in network communication, and time series databases, with a detailed focus on InfluxDB. These foundational topics are important for understanding the subsequent chapters, which will represent related work, methodology, implementation, experiments, and results. The knowledge gained from this background will inform the development and evaluation of advanced techniques for detecting and mitigating cyber threats in enterprise networks.
-In this work, using machine learning techniques like bootstrapping and classifiers is not implemented, and the focus is on time series analysis and periodicity detection in both real and synthetic data.
+This chapter has provided a comprehensive overview of the cybersecurity landscape, APTs and their covert tactics, enterprise networks, periodicity in network communication, and time series databases, with a detailed focus on InfluxDB. These foundational topics are important for understanding the subsequent chapters, which cover related work, methodology, implementation, experiments, and results. Additionally, this chapter has introduced the BAYWATCH framework, which serves as the core detection system for analyzing beaconing behavior. The BAYWATCH framework follows a structured multi-step process. In this work, machine learning techniques such as bootstrapping and classification are not implemented; the focus is on time series analysis and periodicity detection in both real and synthetic data. The knowledge gained from this background will inform the development and evaluation of advanced techniques for detecting and mitigating cyber threats in enterprise networks.
+
 
 \chapter{Related Work}
 
@@ -277,27 +297,27 @@ Haffey et al. (2018) focused on modeling, analyzing, and characterizing periodic
 Recent research has focused on various aspects of enterprise security and malicious activity detection. Oprea et al. (2018) introduced MADE, a security analytics framework designed to enhance threat detection in enterprise environments \cite{oprea2018made} . The framework leverages advanced analytics to detect potential threats by analyzing large volumes of security data, enabling organizations to respond more effectively to cyber incidents. Ukrop et al. (2019) investigated the perception of IT professionals regarding the trustworthiness of TLS certificates, highlighting challenges in assessing certificate legitimacy and its implications for secure communications \cite{ukrop2019will} . In a related study, Vissers et al. (2017) explored the ecosystem of malicious domain registrations within the .eu top-level domain (TLD), providing insights into the strategies used by attackers to exploit domain registration systems for malicious purposes \cite{vissers2017exploring} . Together, these works contribute to the broader understanding of security challenges in modern networks and propose solutions to improve detection and mitigation strategies.
 
 \chapter{Methodology}
-This chapter covers a detailed overview of the event log dataset, along with the exploratory data analysis and the preprocessing steps required to ensure the suitability of the data for the study. Furthermore, the steps in the data generation process utilized in this thesis are defined. The chapter also provides an explanation of the BAYWATCH framework, including the various phases and steps involved in detecting beaconing behavior. The methodology section concludes with a description of the evaluation metrics used to assess the performance of the BAYWATCH framework.
+This chapter begins by presenting the data strategy rationale, outlining the significance of selecting appropriate data sources for analyzing periodic network behaviors. The discussion includes both real-world network logs and artificially generated datasets, emphasizing the importance of using a combination of these data sources to ensure comprehensive evaluation and validation of the detection framework. Following this, the chapter provides a detailed overview of the event log dataset, describing its structure, characteristics, and relevance to the study. Additionally, necessary preprocessing steps—such as data cleaning, normalization, and transformation—are discussed to ensure the dataset is well-prepared for analysis. Beyond real data, the chapter outlines the methodology used for generating synthetic datasets, which serve as a controlled testbed for validating the framework’s effectiveness.
The artificial data generation process is explained step by step, ensuring that it closely mimics real-world network traffic while allowing precise control over key parameters related to beaconing behavior. \section{Data Strategy Rationale} Real data represents actual network traffic, capturing the authentic, complex, and often noisy behavior of users in an enterprise environment. It reflects genuine usage patterns—including legitimate periodic traffic, gaps due to network delays or device outages, and inherent variability caused by diverse applications—that naturally emerge during normal operations. For example, noise can be introduced by network retransmissions and latency issues, while missing data often occurs when devices temporarily go offline or due to privacy-driven data filtering. These challenges arise organically and can obscure the detection of malicious beaconing behavior by blending with benign periodic activities. In contrast, artificial data is generated under controlled conditions. Beaconing behavior is simulated with predetermined parameters (e.g., specific beacon frequencies and controlled jitter ranges), establishing a “ground truth†scenario where variations can be deliberately introduced. This controlled simulation replicates how challenges like noise and irregular intervals might manifest under different network conditions. By systematically varying these parameters, the sensitivity and robustness of the BAYWATCH framework can be precisely assessed, and the impact of specific challenges on detection accuracy can be thoroughly evaluated. -The combination of both data sources enables validation of the framework in authentic operational conditions (ensuring external validity) while also leveraging controlled simulations to isolate and address specific challenges (ensuring internal validity). This dual approach ensures that the detection mechanism is both practical for real-world deployment and resilient against a broad spectrum of adversarial scenarios. +The combination of both data sources enables validation of the framework in authentic operational conditions (ensuring external validity) while also leveraging controlled simulations to isolate and address specific challenges. This dual approach ensures that the detection mechanism is both practical for real-world deployment and resilient against a broad spectrum of adversarial scenarios. \section{Real Data Source} -The real-world data used in this study was collected from a large-scale enterprise network, capturing user activities as they navigate various URLs throughout the workday. This dataset provides a detailed perspective on user interactions, enabling an in-depth analysis of browsing patterns and behaviors. The data is stored in JSON format, which offers flexibility and readability, making it easier to manage and manipulate large volumes of information. Each entry in the dataset records a specific user interaction, including precise timestamps and the URLs visited, allowing for a chronological reconstruction of user activities. This level of detail is important for identifying patterns and trends over time, such as peak usage periods or frequent transitions between specific URLs. +The real-world data used in this study was collected from a large-scale enterprise network, specifically capturing user activities as they navigate through various URLs during their workday. 
This dataset offers an invaluable and detailed perspective on user interactions within the network, providing a comprehensive view of how users engage with web resources in a real-time operational environment. Each data entry is recorded, tracking specific user interactions, and is linked to precise timestamps, ensuring the ability to recreate a chronological sequence of activities. This feature is important for reconstructing browsing sessions and understanding the context in which different behaviors unfold. The dataset, stored in JSON format, is highly flexible and human-readable, which facilitates easier parsing, management, and analysis of large volumes of data. JSON's lightweight nature also makes it well-suited for representing nested information, ensuring that detailed records of user activity can be organized efficiently without sacrificing accessibility. Each entry in the dataset contains key information, such as the URL visited, the time of access, and metadata related to the user’s session, allowing for fine-grained analysis of individual behaviors and network traffic patterns. This detailed level of logging is important for identifying trends, anomalies, and specific patterns within user behavior, which could provide valuable insights into network usage over time. It enables the identification of peak usage periods, frequent transitions between particular URLs, and patterns of repeated activity that may indicate consistent or habitual behavior. Additionally, such data can be instrumental in recognizing deviations from typical usage patterns, which may be indicative of malicious activities like beaconing or other types of cyber threats. Through this dataset, it becomes possible to conduct both broad trend analysis and detailed individual case studies, contributing significantly to the understanding of user behavior in an enterprise network context. \subsection{Data Structure and Schema} The dataset is structured as a collection of JSON files, with each file containing detailed logs of user interactions. Each entry in the JSON files includes the following fields: \begin{itemize} - \item \texttt{IP\_Address}: The IP\_Address of the user's device, providing a unique identifier for each host - \item \texttt{logdate}: The date and time of the user interaction, recorded in a standardized date-time format. - \item \texttt{url\_hostname}: The hostname of the URL visited by the user. - \item \texttt{user}: An optional field denoting the user identifier. For security reasons, usernames are deliberately omitted during the import process. + \item \texttt{IP\_Address}: The IP\_Address of the user's device is recorded to provide a unique identifier for each host within the network. This field is for tracing the source of network traffic and understanding the specific device interactions with the network. By capturing the IP address, it becomes possible to track user activity across different sessions, correlate network events, and identify patterns of behavior. It also allows for detecting unusual or unauthorized access attempts. + \item \texttt{logdate}: The date and time of the user interaction is recorded in a standardized date-time format, such as "2023-08-01T06:06:37.000Z." This format follows the ISO 8601 standard, which includes the year, month, and day (YYYY-MM-DD) followed by a "T" separator, the time in hours, minutes, and seconds (HH:mm:ss), and an optional milliseconds part (.000). The "Z" at the end indicates that the time is represented in Coordinated Universal Time (UTC). 
+ \item \texttt{url\_hostname}: The hostname of the URL visited by the user is recorded in the dataset to track the specific websites or servers accessed during user interactions. This field provides valuable insights into the types of resources being requested, which can be used to identify patterns in browsing behavior, such as frequently visited websites or network traffic trends. By capturing the hostname, the dataset enables analysis of user activities at a higher level, helping to identify potential security concerns, such as access to malicious or unauthorized sites. This information can be used to detect anomalies in network traffic and enhance the accuracy of network monitoring and beaconing detection efforts. + \item \texttt{user}: An optional field denoting the user identifier is included in the dataset to track individual user interactions. However, for privacy and security reasons, usernames are deliberately omitted during the import process to prevent the exposure of personally identifiable information (PII). Instead of using direct identifiers, anonymized or pseudonymous identifiers may be used to preserve user privacy while still enabling the analysis of user behaviors. This approach ensures that sensitive data is protected in compliance with data protection regulations, such as GDPR, while allowing for meaningful analysis of network activity and interactions. \end{itemize} The structure of the JSON files is defined by a Document Type Definition (DTD), which ensures consistency and reliability across all entries. Below is an example of the JSON schema used for the dataset: @@ -309,65 +329,83 @@ The structure of the JSON files is defined by a Document Type Definition (DTD), "logdate": { "type": "string", "format": "date-time"}, "url_hostname": { "type": "string"}, "user": { "type": "string"}}, - "required": ["logdate", "url_hostname"]} + "required": ["logdate", "url_hostname"] } \end{lstlisting} The structured format of the JSON files ensures that each entry is consistent and comprehensive, providing a reliable record of user activities for analysis. \subsection{Data Collection and Scale} -The dataset was collected over the course of a single day, specifically on a typical workday, August 1, 2023 (Tuesday), generating nearly 73 gigabytes of information. This large-scale data collection captures the following details: +The dataset was collected over a single day, specifically on a typical workday, August 1, 2023 (Tuesday), generating nearly 73 gigabytes of information. This large-scale data collection captures a comprehensive range of details, providing an in-depth look at user interactions across various online platforms and services. The dataset encompasses the following key components: + \begin{itemize} - \item \textbf{Host Information}: The IP addresses of the user devices, enabling the tracking of individual hosts and their activities. - \item \textbf{Timestamps}: Precise date and time of each user interaction, enabling temporal analysis of browsing patterns. - \item \textbf{URL Hostnames}: The hostnames of the URLs visited, providing insights into the destinations of user traffic. - \item \textbf{User Interactions}: A chronological record of user activities, facilitating the identification of trends and anomalies. + \item \textbf{Host Information}: The dataset records the IP addresses of user devices, offering a unique identifier for each host. This enables the tracking of individual devices, their browsing activities, and the overall flow of data across the network. 
+ \item \textbf{Timestamps}: Each user interaction is timestamped with high precision, capturing the exact date and time of the activity. This temporal data is for performing time-based analysis of browsing patterns, such as identifying peak usage periods, periods of inactivity, and frequent transitions between different URLs. + \item \textbf{URL Hostnames}: The dataset logs the hostnames of the URLs visited by users, providing valuable insights into their browsing destinations. This information helps to identify the types of websites visited, potential external threats, or any anomalous traffic patterns indicating malicious activity. + \item \textbf{User Interactions}: A chronological record of each user activity is maintained, allowing for the detailed reconstruction of browsing sessions. This data enables the identification of typical user behavior, which can then be compared against unusual patterns that might indicate beaconing or other cybersecurity threats. \end{itemize} -The dataset's scale and granularity make it an ideal resource for analyzing user behavior, identifying significant patterns, and supporting the development of effective beaconing detection strategies. +The dataset's scale and granularity make it an ideal resource for analyzing user behavior across different time intervals. The detailed information captured, such as user activity, URL destinations, and precise timestamps, plays a pivotal role in identifying significant patterns, detecting anomalies, and forming the basis for robust beaconing detection strategies. The rich data offers the necessary insights to evaluate the effectiveness of different detection techniques and improve the cybersecurity framework for large-scale networks. \subsection{Data Management and Preprocessing} -To manage and analyze the dataset effectively, a sophisticated data management system was implemented. The system leverages \textbf{InfluxDB}, a time-series database optimized for handling high volumes of temporal data. The data management process involves the following steps: +To effectively manage and analyze the collected dataset, a sophisticated data management system was implemented. This system leverages \textbf{InfluxDB}, a time-series database specifically optimized to handle high volumes of temporal data efficiently. Given the size and complexity of the dataset, careful planning was necessary to ensure seamless data processing, storage, and retrieval for subsequent analysis. The data management process was executed in several systematic steps to ensure that all data was appropriately handled, organized, and analyzed. These steps are outlined below: \begin{enumerate} - \item \textbf{Data Import}: The dataset is imported into InfluxDB using custom Python scripts. These scripts automate the creation of a dedicated "bucket" within InfluxDB, ensuring that the data is organized and stored efficiently. - \item \textbf{Schema Implementation}: A predefined schema is applied to enforce data integrity and consistency. This schema ensures that all entries adhere to the same format and standards, facilitating smoother data processing and analysis. Entries that do not conform to the predefined format are rejected, and a validation message is generated, indicating that the data has not been imported into the database. - \item \textbf{Initial Data Analysis}: The dataset is analyzed to understand its behavior, including: + \item \textbf{Data Import}: The first step in the data management process involves importing the dataset into InfluxDB. 
Custom Python scripts were developed to automate the entire import process, eliminating the need for manual intervention. These scripts are responsible for creating a dedicated "bucket" within InfluxDB, a specific storage container designed to organize and manage the incoming data. This automated approach not only streamlines the process but also minimizes the possibility of human error, ensuring that data is consistently organized in a structured and efficient manner. + \item \textbf{Schema Implementation}: To maintain the integrity and consistency of the data, a predefined schema was applied during the import process. The schema serves as a blueprint that defines the structure and format of the data, ensuring that all entries conform to the same standards. This step is in maintaining uniformity across the dataset and preventing errors during analysis. If any entry fails to meet the required format, it is automatically rejected by the system, and a validation message is generated. This message serves as a notification, allowing the system administrator to identify any issues and take corrective actions before proceeding with further data analysis. + \item \textbf{Initial Data Analysis}: After successfully importing the dataset, an initial analysis was conducted to gain a better understanding of its overall behavior and characteristics. This preliminary analysis focused on several key aspects of the data: \begin{itemize} - \item Observing overall data trends throughout the day. - \item Identifying the most frequently accessed URLs and calculating their averages. - \item Analyzing the time intervals between requests. - \item Examining the distribution of hosts and their activity patterns. + \item \textbf{Observing Data Trends}: A comprehensive examination of the dataset was conducted to identify overarching trends throughout the day. This step involves analyzing the time distribution of user interactions and identifying periods of peak activity as well as potential idle times. + \item \textbf{Identifying Frequently Accessed URLs}: The most frequently accessed URLs were identified, and their corresponding access frequencies were calculated. This analysis provided valuable insights into user browsing behavior, helping to pinpoint commonly visited sites or services. + \item \textbf{Analyzing Time Intervals}: The intervals between user requests were analyzed to uncover patterns such as frequent browsing sessions or longer periods of inactivity. This aspect of the analysis is key to identifying unusual activity that might indicate an anomalous pattern, such as beaconing. + \item \textbf{Host Activity Distribution}: The distribution of hosts and their associated activity patterns were examined. This step enables the identification of trends specific to individual devices or users, offering further context for understanding network traffic behavior. \end{itemize} \end{enumerate} +The process of inserting data into InfluxDB proved to be challenging due to the sheer volume of data being processed. It took nearly three days to complete the insertion on the hardware system, which was equipped with an Intel Core i5 processor, 16GB of RAM, and a 500GB SSD running Windows 11. The significant duration was primarily attributed to the high amount of data being imported, as well as the need for efficient data validation and schema enforcement. 
Despite the hardware's relatively moderate specifications, the system managed to handle the import process, though performance was impacted by the scale of the dataset and the complexity of the preprocessing steps. \subsection{Challenges with Real-World Data} -Analyzing real-world network traffic presents several challenges that must be addressed to ensure accurate detection and analysis of beaconing behavior. +Analyzing real-world network traffic presents several challenges that must be addressed to ensure accurate detection and analysis of beaconing behavior. Unlike synthetic datasets, real-world data is influenced by numerous external factors, such as network congestion, varying user behaviors, and environmental inconsistencies, making it necessary to apply robust preprocessing and analytical techniques. -\begin{itemize} - \item \textbf{Noise and Variability:} Real-world network traffic is inherently noisy, with random variations in connection timing due to network delays, retransmissions, and other factors. To mitigate this, preprocessing techniques such as smoothing filters and statistical normalization are applied to reduce variability while preserving essential patterns. Additionally, robust anomaly detection methods help differentiate between normal fluctuations and meaningful periodic signals. - - \item \textbf{Missing Data:} Devices may go offline or move out of the observation range, resulting in gaps in the data. To handle this, interpolation techniques are used to estimate missing values where appropriate, and robust analytical models are employed that can tolerate incomplete datasets. Additionally, missing data points are flagged to prevent misleading conclusions during the analysis process. - - \item \textbf{Legitimate Periodic Traffic:} Many legitimate applications (e.g., software updates, email polling) exhibit periodic behavior that resembles beaconing. To distinguish between benign and malicious periodic traffic, a combination of behavioral profiling, anomaly detection, and contextual analysis is used. This involves analyzing additional metadata such as destination IP addresses, communication frequency, and protocol usage to identify deviations from expected legitimate behavior. -\end{itemize} +\subsubsection{Noise and Variability} +Real-world network traffic is inherently noisy, with fluctuations in connection timing caused by factors such as network delays, packet loss, retransmissions, and load balancing mechanisms. These unpredictable variations can obscure meaningful periodic patterns, making it difficult to identify beaconing behavior accurately. To address this, preprocessing techniques such as smoothing filters, moving averages, and statistical normalization are applied to reduce the effects of noise while preserving periodic trends. +\subsubsection{Missing Data} +Incomplete data is a common challenge when dealing with real-world traffic. Devices may disconnect, move out of the observation range, or experience network failures, leading to gaps in the dataset. These missing entries can distort pattern recognition and reduce the reliability of beaconing detection. Additionally, robust analytical models that can tolerate incomplete datasets, such as probabilistic approaches, help ensure that missing data does not significantly impact detection accuracy. + +\subsubsection{Legitimate Periodic Traffic} +One of the most significant challenges in beaconing detection is distinguishing between malicious activity and legitimate periodic communication. 
Many benign applications, including email polling, cloud synchronization, system health checks, and software update mechanisms, exhibit regular intervals of network activity that may resemble beaconing behavior. To prevent false positives, a multi-layered approach is used, incorporating behavioral profiling, frequency analysis, and contextual evaluation. This involves analyzing supplementary metadata, such as destination IP addresses, communication frequency, protocol usage, and packet payload characteristics, to identify deviations from expected benign behavior. + +By addressing these challenges through rigorous preprocessing, statistical modeling, and contextual analysis, the impact of real-world data inconsistencies is minimized, leading to more accurate and reliable beaconing detection in large-scale enterprise networks. -By implementing these strategies, the impact of real-world data challenges is minimized, leading to more reliable and accurate results in beaconing detection. \section{Artificial Data Source} In addition to analyzing real-world network traffic, the BAYWATCH framework was evaluated using artificial data to test its robustness and accuracy under controlled conditions. The artificial data was designed to simulate various types of beaconing behavior, including different periodicities, noise levels, and evasion techniques commonly employed by malware authors. A key feature of the artificial data is the introduction of jitter, which simulates random variations in the timing of beaconing events. This section describes the process of generating the artificial data, the specific jitter ranges used, and the structure of the data. \subsection{Design of Artificial Data} -The artificial data was generated to mimic the structure of real-world network traffic, while allowing for precise control over the parameters of the beaconing behavior. Each artificial data set consists of the following fields: + +The artificial data was generated to mimic the structure of real-world network traffic while allowing for precise control over the parameters of beaconing behavior. Each artificial dataset consists of the following fields: \begin{itemize} - \item \textbf{Host Information}: The IP addresses of the user devices, enabling the tracking of individual hosts and their activities. - \item \textbf{Timestamps}: Precise date and time of each user interaction, enabling temporal analysis of browsing patterns. - \item \textbf{URL Hostnames}: The hostnames of the URLs visited, providing insights into the destinations of user traffic. - \item \textbf{User Interactions}: A chronological record of user activities, facilitating the identification of trends and anomalies. - \item \textbf{Is Artificial}: A tag (labeled as "yes") was added to distinguish the artificial data from real-world data. This tag ensures that the artificial data can be easily identified and separated during analysis. + \item \textbf{Host Information}: The IP addresses of the user devices, enabling the tracking of individual hosts and their activities. Since this is generated data, the IP address is set to 127.0.0.1 for all synthetic beacons. + \item \textbf{Timestamps}: The precise date and time of each user interaction, enabling temporal analysis of browsing patterns. The timestamps of the generated data are aligned with the real dataset, specifically on August 1, 2023 (Tuesday), to maintain consistency in temporal analysis. 
+    \item \textbf{URL Hostnames}: The hostnames of the visited URLs, providing insights into the destinations of user traffic. The format of these generated URLs follows the pattern "beacon\{i\}.example.com", where "i" represents a unique identifier for each synthetic beacon.
+    \item \textbf{User Interactions}: A chronological record of user activities, facilitating the identification of trends and anomalies. The sequence of interactions is designed to resemble real-world browsing behavior to ensure realism in pattern analysis.
+    \item \textbf{Is Artificial}: A tag (labeled as "yes") was added to distinguish the artificial data from real-world data. This tag ensures that the synthetic data can be easily identified and separated during analysis.
 \end{itemize}
 
+To ensure consistency and reliability across all generated entries, the structure of the JSON files for artificial data is governed by a JSON schema. Below is an example of the schema used for defining the dataset:
+
+\begin{lstlisting}[language=json, basicstyle=\small\ttfamily, backgroundcolor=\color{white}]
+ { "$schema": "http://json-schema.org/draft-07/schema#",
+   "type": "object",
+   "properties": {
+     "logdate": { "type": "string", "format": "date-time"},
+     "url_hostname": { "type": "string"},
+     "user": { "type": "string"},
+     "Is_A": { "type": "string" }},
+   "required": ["logdate", "url_hostname", "Is_A"] }
+\end{lstlisting}
+
+
 \subsection{Jitter and Beacon Frequency Variations}
 In simulating real-world beaconing behavior, two critical parameters are varied: beacon frequency and jitter. Beacon frequency refers to the regular interval at which a beacon signal is transmitted, while jitter introduces randomness into these intervals to mimic natural network variations or deliberate obfuscation tactics.
@@ -411,16 +449,19 @@ Each combination presents unique challenges, allowing for a thorough assessment
 Different malware or legitimate applications may exhibit varying beacon frequencies. Testing across a spectrum of intervals ensures that the framework can accurately detect both rapid and infrequent beaconing behaviors. Real-world network conditions introduce randomness in communication timings. By incorporating different jitter levels, the framework's resilience to timing variations and its ability to distinguish between regular and irregular patterns are assessed. Through this multifaceted testing approach, the BAYWATCH framework's robustness and adaptability to diverse network behaviors are thoroughly evaluated.
 \subsection{Integration with Real-World Data}
-The artificial data was used in conjunction with real-world network traffic to provide a comprehensive evaluation of the BAYWATCH framework. While the real-world data provides insights into the framework's performance in a production environment, the artificial data allows for controlled testing of specific scenarios and edge cases. The is\_Artificial tag ensures that the artificial data can be easily distinguished from real-world data during analysis. This combination ensures that the framework is both robust to real-world perturbations and accurate in detecting malicious beaconing behavior.
+
+To evaluate the effectiveness of the BAYWATCH framework, artificial data was integrated alongside real-world network traffic. This hybrid approach allows for a comprehensive assessment by leveraging the strengths of both data types.
The real-world dataset captures the complexity and unpredictability of live network environments, offering insights into how the framework performs under realistic conditions. In contrast, the artificial dataset provides controlled scenarios, enabling rigorous testing of specific beaconing patterns, edge cases, and known attack behaviors. A critical component of this integration is the "is Artificial" tag, which explicitly marks synthetic data entries. This tag ensures that artificial records can be easily distinguished from genuine network traffic during preprocessing and analysis. It prevents unintentional contamination of results while allowing direct comparisons between the detection rates of known synthetic beacons and real-world network anomalies. By combining real and artificial data, the evaluation process benefits from both the unpredictability of real-world conditions and the precision of controlled simulations. This dual approach ensures that the BAYWATCH framework remains robust to noise, variations, and legitimate periodic traffic while maintaining high accuracy in detecting malicious beaconing activity. Additionally, the artificial data serves as a validation benchmark, allowing the refinement of detection thresholds and feature extraction techniques before deployment in production environments. + \section{Summary} -The BAYWATCH framework is a robust and scalable methodology designed to detect stealthy beaconing behavior in large-scale enterprise networks. It operates in four main phases: \textbf{Whitelist Analysis}, which eliminates known legitimate traffic using universal and local whitelists; \textbf{Time Series Analysis}, which identifies periodic communication patterns using advanced signal processing techniques such as Fast Fourier Transform (FFT), autocorrelation, and bandpass filtering; \textbf{Suspicious Indicator Analysis}, which further filters out legitimate behavior by analyzing domain-specific indicators like URL tokens and novelty; and \textbf{Investigation and Verification}, where remaining suspicious cases are manually reviewed using a bootstrapping process to minimize workload. The framework was evaluated using both \textbf{real-world data}, collected from a large-scale enterprise network, and \textbf{artificial data}, which simulated various beaconing scenarios with controlled jitter ranges (2, 5, 10, 30, and 60 seconds) and noise levels. The integration of real-world and artificial data ensures a comprehensive evaluation, demonstrating the framework's ability to reliably detect malicious beaconing behavior while remaining robust to real-world perturbations and noise. This makes BAYWATCH a valuable tool for securing enterprise networks against advanced cyber threats. +This chapter outlines the methodology for evaluating the BAYWATCH framework, combining real-world enterprise network logs and synthetic datasets to ensure comprehensive validation under both authentic and controlled conditions. Real data, collected from a large-scale enterprise network, captures the inherent complexity of user interactions—including noise, missing entries, and legitimate periodic traffic—while artificial data simulates beaconing behavior with precise parameters (e.g., intervals of 10–300 seconds, jitter ranges of 2–150 seconds) to establish ground truth scenarios. 
The real dataset, structured in JSON format with fields for IP addresses, timestamps, and URL hostnames, underwent rigorous preprocessing (e.g., schema validation, noise reduction) and was managed via InfluxDB for efficient time-series analysis, despite hardware limitations during its three-day ingestion. Challenges such as distinguishing malicious beaconing from benign periodic activities were addressed through multi-layered filtering and contextual metadata analysis. Artificial data, tagged for easy identification, tested the framework’s robustness against diverse evasion tactics, including high-jitter and low-frequency beaconing. By integrating synthetic data into real traffic, the methodology enabled a dual evaluation: real-world data assessed external validity, while controlled simulations isolated performance metrics (e.g., detection sensitivity, false-positive rates). This hybrid approach ensured BAYWATCH’s adaptability to both unpredictable network environments and sophisticated adversarial strategies, setting a foundation for empirical validation in subsequent chapters.
+

 \chapter{Data Analysis}
 This chapter presents a detailed analysis of the dataset, focusing on understanding user behavior, temporal patterns, and network interactions. By employing advanced visualization techniques and statistical methods, this chapter aims to uncover meaningful insights into the dataset's structure, identify patterns, and detect potential anomalies. The analysis is divided into four main sections: \textbf{Visualization of URL Request Counts}, \textbf{24-Hour URL Visit Analysis}, \textbf{Time Interval Analysis of URL Requests}, and \textbf{Distribution of Hosts Based on Unique URLs Contacted}. Each section provides a comprehensive exploration of the data, supported by visualizations and detailed interpretations.

 \section{Visualization of URL Request Counts}
-Understanding the distribution and frequency of URL requests is for identifying patterns and anomalies in user behavior. This section presents visualizations of URL request counts using both logarithmic and linear scales, enabling a detailed comparison of visit frequencies across different URLs.
+Understanding the distribution and frequency of URL requests is essential for identifying patterns, detecting anomalies, and distinguishing between normal and potentially suspicious browsing behavior. Analyzing these request patterns provides insights into user interactions, network load, and possible beaconing activity. This section presents visualizations of URL request counts using both logarithmic and linear scales to facilitate a comprehensive comparison of visit frequencies across different URLs. The logarithmic scale is particularly useful for handling large variations in request counts, allowing infrequent and high-volume URLs to be analyzed side by side without extreme disparities overshadowing meaningful trends.

 \begin{figure}
     \centering
@@ -429,10 +470,11 @@ Understanding the distribution and frequency of URL requests is for identifying
     \label{fig:requestcount}
 \end{figure}

-Figure \ref{fig:requestcount} illustrates the request counts for different URL hostnames. The x-axis represents the index of URLs in descending order of request count, meaning URLs with the highest traffic appear on the left. The y-axis in log scale, shows the distribution of request counts, highlighting a steep drop-off where a few URLs receive significantly higher traffic, while most receive fewer requests.
This pattern indicates a power-law distribution, common in network and web traffic analysis. The visualization provides a clear overview of the most visited URLs and their relative popularity, enabling further analysis of user preferences and traffic patterns. +Figure \ref{fig:requestcount} illustrates the distribution of request counts across different URL hostnames, providing insights into traffic patterns within the dataset. The x-axis represents the index of URLs sorted in descending order of request count, ensuring that the most frequently accessed URLs appear on the left. The y-axis, displayed on a logarithmic scale, captures the wide range of request frequencies, making it easier to observe both high-traffic and low-traffic URLs within the same visualization. The figure highlights a steep drop-off in request counts, where a small subset of URLs receives significantly higher traffic while the majority of URLs are accessed infrequently. This pattern suggests a power-law distribution, a well-documented phenomenon in network and web traffic analysis, where a few highly popular destinations dominate overall traffic while long-tail URLs contribute to a smaller proportion of requests. By visualizing this distribution, the analysis enables a clearer understanding of user browsing behavior, network load distribution, and potential anomalies. Identifying the most visited URLs and their relative popularity supports further investigation into traffic patterns, content access trends, and potential indicators of automated or malicious activity. + \section{24-Hour URL Visit Analysis} -Understanding the temporal patterns of URL visits is essential for identifying peak usage times, detecting anomalies, and optimizing network resources. This section analyzes the distribution of URL visits over a 24-hour period, providing insights into user activity patterns throughout the day. +Understanding the temporal patterns of URL visits is for analyzing user behavior, identifying peak usage periods, detecting anomalies, and optimizing network resource allocation. By examining how URL requests fluctuate over 24 hours, valuable insights can be gained into traffic trends, workload distribution, and deviations from expected activity. This section presents an analysis of the temporal distribution of URL visits, highlighting variations in user activity throughout the day. Identifying peak access times helps in understanding common usage patterns, such as increased traffic during working hours or spikes related to scheduled automated processes. Additionally, periods of unusually high or low activity may indicate network congestion, scheduled maintenance, or potential security threats such as automated scanning or beaconing behavior. By visualizing these trends, the study enables a more comprehensive assessment of network dynamics, facilitating proactive resource management and anomaly detection. This analysis provides a foundation for optimizing infrastructure performance and enhancing network security by distinguishing between normal fluctuations and suspicious activity. \begin{figure} \centering @@ -441,12 +483,13 @@ Understanding the temporal patterns of URL visits is essential for identifying p \label{fig:24hvisit} \end{figure} -Figure \ref{fig:24hvisit} illustrates the number of visits to different URLs over a 24-hour period. The X-axis represents the hours of the day, while the Y-axis indicates the number of visits. 
The visualization reveals distinct patterns in user activity, with a sharp decline in visits around 04:00, followed by a gradual increase as the day progresses. This pattern suggests that user activity is concentrated during working hours, with minimal activity during the early morning hours. The sharp decline at 04:00 is particularly noteworthy, as it marks the transition from "day activity" (00:00–04:00) to "night activity" (04:00–24:00). This categorization allows for a more detailed analysis of user behavior during different time periods. +Figure \ref{fig:24hvisit} illustrates the number of visits to different URLs over 24 hours, providing insights into temporal browsing patterns. The X-axis represents the hours of the day (ranging from 00:00 to 24:00), while the Y-axis indicates the total number of URL visits recorded within each hour. The visualization reveals distinct trends in user activity, with a significant drop in visit counts around 04:00, followed by a steady increase as the day progresses. This pattern suggests that network activity is heavily concentrated during working hours, with minimal traffic during the early morning period. The sharp decline at 04:00 is particularly noteworthy, as it marks a transition between two distinct phases of network usage: "day activity" (00:00–04:00) and "night activity" (04:00–24:00). This segmentation allows for a more detailed examination of user behavior across different time intervals. For instance, the low-activity period may correspond to automated background processes, while the surge in traffic later in the day reflects active user interactions. Identifying these temporal trends is important for optimizing resource allocation, detecting anomalies, and distinguishing between normal fluctuations and potential security threats such as scheduled beaconing or automated network scans. -A comparision is done during this 24 hours, which is the average number of visits during the "day" (00:00–04:00) and "night" (04:00–24:00) periods. The "day" period shows an average of approximately 2,000 visits, while the "night" period has a higher average of around 2,500 visits. This indicates that user activity is more concentrated during the night period, which may reflect the working hours of the organization or the time zones of the users. This analysis provides insights into user behavior and can help organizations optimize their network resources and security measures based on peak usage times. +A comparison is conducted between the average number of visits during two distinct periods of the 24-hour cycle: the "day" period (00:00–04:00) and the "night" period (04:00–24:00). The "day" period exhibits an average of approximately 2,000 visits, while the "night" period shows a higher average of around 2,500 visits. This indicates that user activity is more concentrated during the night period, which could be influenced by factors such as the working hours of the organization or the time zones of users within the network. Such a pattern suggests that employees or users from different geographical locations may have varying peak usage times, with a noticeable increase in activity during the night hours. This insight into temporal trends in user behavior can be analytical for organizations aiming to optimize network resource allocation and improve the efficiency of security monitoring. For instance, understanding these peak periods allows network administrators to better allocate bandwidth or enhance security measures during times of higher activity. 
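As a brief illustration of how these figures can be derived from the raw logs, the sketch below computes hourly visit counts and the day (00:00–04:00) and night (04:00–24:00) averages discussed above; the input file name and column names are assumptions for illustration rather than the exact processing pipeline used in this work.

\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
import pandas as pd

# logs: one row per request with a 'logdate' timestamp column (illustrative)
logs = pd.read_json("url_requests.json")          # assumed input file
logs["logdate"] = pd.to_datetime(logs["logdate"])
logs["hour"] = logs["logdate"].dt.hour

# Visits per hour of the day, the basis of the 24-hour visit figure
hourly_counts = logs.groupby("hour").size()

# Average hourly visits in the "day" (00:00-04:00) and "night" (04:00-24:00) windows
day_avg = hourly_counts.loc[0:3].mean()
night_avg = hourly_counts.loc[4:23].mean()
print(f"day (00-04h) avg: {day_avg:.0f}, night (04-24h) avg: {night_avg:.0f}")
\end{lstlisting}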
Additionally, this analysis can help in identifying potential anomalies, such as spikes in traffic, which may warrant further investigation to ensure the integrity and security of the network. \section{Time Interval Analysis of URL Requests} -Analyzing the time intervals between URL requests is for identifying patterns in user interactions and detecting potential beaconing behavior. This section examines the distribution of time intervals between requests, using both seconds and minutes as the units of analysis. +Analyzing the time intervals between URL requests is important for identifying patterns in user interactions and detecting potential beaconing behavior. By studying the time gaps between consecutive requests, it is possible to uncover patterns of periodicity that may indicate either normal user activity or suspicious behavior, such as beaconing attempts. This section examines the distribution of time intervals between requests, using both seconds and minutes as the units of analysis. In particular, the analysis focuses on uncovering regular patterns that are often associated with beaconing, where requests are sent at fixed intervals. By comparing the distribution of time intervals across different time frames, such as shorter intervals (seconds) and longer intervals (minutes), we can identify potential correlations and trends in user activity. For example, frequent and predictable intervals may suggest automated systems or malicious actors attempting to maintain communication over a network. Understanding these time distributions is a key part of detecting and mitigating beaconing behavior, while also distinguishing it from normal user browsing activity. + \begin{figure} \centering @@ -455,11 +498,7 @@ Analyzing the time intervals between URL requests is for identifying patterns in \label{fig:timeintervallog} \end{figure} -Figure \ref{fig:timeintervallog} illustrates the distribution of time intervals between URL requests, with the Y-axis on a logarithmic scale. The X-axis represents time intervals in seconds, divided into 65 bins, where each bin corresponds to a one-second interval ranging from 0 to 65 seconds. The logarithmic scale on the Y-axis is used to better visualize the wide range of request counts, compressing the scale for higher values and expanding it for lower values. This allows for a clearer comparison of the frequency of requests across different time intervals. - -The visualization reveals a consistent pattern where the number of requests decreases as the time interval increases. However, every 10 seconds, there is a slight increase in the number of requests, indicating a periodic pattern in user behavior. This periodicity may reflect regular user activity, such as polling mechanisms or automated updates, which occur at fixed intervals. Such patterns are common in network traffic and can serve as a baseline for detecting anomalies or malicious beaconing behavior. For example, if a URL exhibits a similar periodic pattern but with irregular or unexpected intervals, it could indicate potential malicious activity, such as a botnet communicating with a command-and-control (C\&C) server. - -By analyzing these patterns, organizations can gain valuable insights into normal user behavior and identify deviations that may warrant further investigation. This analysis is particularly useful for detecting stealthy beaconing behavior, where malware attempts to blend in with legitimate traffic by mimicking periodic patterns. 
By understanding the temporal dynamics of user interactions, organizations can develop more effective detection mechanisms and enhance their network security posture. +Figure \ref{fig:timeintervallog} illustrates the distribution of time intervals between URL requests, with the Y-axis displayed on a logarithmic scale. The X-axis represents time intervals in seconds, divided into 65 bins, where each bin corresponds to a one-second interval ranging from 0 to 65 seconds. The use of a logarithmic scale on the Y-axis is particularly useful for visualizing the wide range of request counts. By compressing the scale for higher values and expanding it for lower values, the logarithmic scale enables a clearer and more detailed comparison of the frequency of requests across different time intervals. The visualization reveals a consistent pattern where the number of requests decreases as the time interval between them increases. However, there is a noticeable spike in the number of requests at every 10-second interval, suggesting periodicity in user behavior. This periodicity could be indicative of regular user activities, such as polling mechanisms, automated updates, or recurring checks for new information. These behaviors are common in legitimate network traffic and can help establish a baseline for normal activity. The identification of such periodic patterns is important in network traffic analysis, as it helps differentiate between regular activity and potential malicious behavior. For instance, if a URL exhibits similar periodic patterns but with irregular or unexpected intervals, it could be a sign of beaconing—a technique often used by malware to maintain communication with a command-and-control (C\&C) server. In this case, the analysis could reveal anomalies in the intervals that deviate from expected patterns, potentially indicating a botnet or other malicious activity. By comparing these patterns against known baselines of legitimate traffic, it becomes easier to identify and flag suspicious requests for further investigation. \begin{figure} \centering @@ -468,16 +507,10 @@ By analyzing these patterns, organizations can gain valuable insights into norma \label{fig:timeintervallogmin} \end{figure} -Figure \ref{fig:timeintervallogmin} extends the analysis of time intervals between URL requests to a larger time scale, with the X-axis divided into 31 bins, each representing a one-minute interval except the last one. The Y-axis remains on a logarithmic scale to accommodate the wide range of request counts, ensuring that both high-frequency and low-frequency intervals are clearly visible. Similar to Figure \ref{fig:timeintervallog}, the visualization shows a decreasing trend in the number of requests as the time interval increases. However, there is a noticeable spike in requests every 5 minutes, indicating periodic behavior at a larger time scale. - -This periodic pattern is consistent across all URLs, suggesting that it reflects a common behavior in the dataset, such as scheduled tasks, automated updates, or regular user interactions. For example, many applications and systems are configured to perform tasks at regular intervals (e.g., every minute or every few minutes), which could explain the observed spikes. This behavior is particularly relevant for detecting malicious beaconing, as malware often uses similar periodic patterns to communicate with command-and-control (C\&C) servers. 
By identifying these patterns, organizations can establish a baseline for normal behavior and detect deviations that may indicate malicious activity.
-
-The use of a logarithmic scale is essential for visualizing the wide range of time intervals and request counts. In a linear scale, the differences between bins would be less apparent due to the dominance of high-frequency intervals. The logarithmic scale compresses the scale for higher values and expands it for lower values, making it easier to identify periodic patterns and anomalies. This approach provides a clearer understanding of the temporal dynamics of user interactions and enhances the ability to detect stealthy beaconing behavior.
-
-By analyzing these patterns, organizations can gain valuable insights into user behavior and network activity, enabling them to optimize their monitoring and security strategies. This analysis is particularly useful for identifying potential threats, such as botnets or other malware, that attempt to blend in with legitimate traffic by mimicking periodic patterns. By leveraging advanced visualization techniques and statistical methods, organizations can develop more effective detection mechanisms and enhance their network security posture.
+Figure \ref{fig:timeintervallogmin} extends the analysis of time intervals between URL requests to a larger time scale, with the X-axis divided into 31 bins, each representing a one-minute interval, except for the last bin, which aggregates intervals longer than 31 minutes. To avoid losing beaconing data at the bin edges, each bin spans ±30 seconds; for example, the 1-minute bin covers gaps from 30 to 90 seconds. The Y-axis remains on a logarithmic scale, ensuring that both high-frequency and low-frequency intervals are visible and can be compared effectively. This use of a logarithmic scale enables the identification of trends across various time scales, making it a powerful tool for understanding patterns in network traffic. Similar to the analysis presented in Figure \ref{fig:timeintervallog}, the visualization reveals a decreasing trend in the number of requests as the time interval between them increases. This suggests that user interactions are typically clustered within shorter time intervals, with longer gaps between requests. However, a notable spike in request frequency appears every 5 minutes, indicating a periodic pattern at a larger time scale. This periodicity is consistent across all URLs in the dataset, suggesting that it represents a common behavior such as scheduled tasks, automated updates, or regular user interactions. These spikes could correspond to routine activities in many systems or applications that are configured to perform tasks at fixed intervals, such as background data synchronization, refresh cycles, or regular system health checks. The observed periodic behavior is particularly significant in the context of detecting malicious beaconing activity. Malicious software, including botnets and malware, often utilizes similar periodic behavior to maintain communication with command-and-control (C\&C) servers, operating at regular intervals. By identifying these regular spikes in request frequency, organizations can establish a baseline for normal network behavior and detect any deviations that might indicate unauthorized or suspicious activities. The consistent periodicity observed across the dataset could thus serve as a key indicator for detecting potential threats and taking proactive security measures. The logarithmic scale compresses the range of high counts and expands that of low counts, allowing a balanced view of both common and rare events; this makes the temporal dynamics of user interactions easier to read and supports the identification of the periodic patterns that matter for detecting stealthy beaconing behavior in network traffic. Ultimately, this approach aids in distinguishing between normal and abnormal patterns, enhancing the framework’s ability to identify potential security threats.
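To make the ±30-second binning convention described above explicit, the following sketch computes the gaps between consecutive requests and assigns them to minute-scale bins whose edges lie halfway between whole minutes; the treatment of the final catch-all bin, the use of pandas, and all names are illustrative assumptions rather than the exact implementation.

\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
import numpy as np
import pandas as pd

def minute_interval_histogram(timestamps, max_minutes=31):
    """Sketch of the minute-scale binning: the k-minute bin collects gaps
    from (60*k - 30) to (60*k + 30) seconds; longer gaps fall into a
    final catch-all bin."""
    ts = pd.Series(sorted(timestamps))
    gaps = ts.diff().dt.total_seconds().dropna()
    # bin edges at 30, 90, 150, ... seconds, i.e. each minute bin spans +/-30 s
    edges = [60 * k - 30 for k in range(1, max_minutes + 1)] + [np.inf]
    labels = [f"{k} min" for k in range(1, max_minutes)] + [f">{max_minutes - 1} min"]
    return pd.cut(gaps, bins=edges, labels=labels).value_counts().sort_index()
\end{lstlisting}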
 \section{Distribution of Hosts Based on Unique URLs Contacted}
-Understanding the interaction patterns of hosts within the network is for identifying key services, detecting anomalies, and optimizing network performance. This section analyzes the distribution of hosts based on the number of unique URLs they contacted, providing insights into the concentration of network activity.
+Understanding the interaction patterns of hosts within the network is essential for identifying key services, detecting anomalies, and optimizing network performance. By analyzing the distribution of hosts based on the number of unique URLs they contacted, insights can be gained into the concentration of network activity and the diversity of services being accessed. This analysis helps highlight the most active hosts and their browsing behaviors, providing valuable information for pinpointing critical network resources, determining high-traffic users, and identifying potential security concerns. For example, an unusually high number of unique URL requests from a single host may indicate an abnormal pattern, which could suggest automated processes or even malicious behavior. By focusing on the number of unique URLs accessed by each host, this section offers a clear understanding of how traffic is distributed across the network and how hosts interact with various services. Additionally, this analysis aids in understanding the level of engagement with different network segments, assisting network administrators in optimizing resource allocation and managing network load during peak times.

 \begin{figure}
     \centering
@@ -486,13 +519,7 @@ Understanding the interaction patterns of hosts within the network is for identi
     \label{fig:ip}
 \end{figure}

-Figure \ref{fig:ip} illustrates the distribution of hosts (IP addresses) based on the number of unique URLs they contacted. The X-axis represents the number of unique URLs, ranging from 1 to 15, while the Y-axis shows the count of hosts in each category. The visualization reveals that the majority of hosts interact with only a small number of unique URLs. Specifically, approximately 17,500 hosts contacted exactly two unique URLs, while around 15,000 hosts contacted only one unique URL. As the number of unique URLs increases, the number of hosts decreases significantly. However, there are still many URLs that are connected to other hosts.
-
-This pattern suggests that network activity is highly concentrated on a small set of destinations, with most hosts accessing only a few key resources. For example, hosts that contact only one or two unique URLs are likely accessing essential services, such as internal tools, authentication servers, or frequently used websites.
In contrast, hosts that contact a larger number of unique URLs may represent more diverse or specialized activity, such as administrators, developers, or automated systems performing a wide range of tasks. This rationale underscores the importance of using whitelists to filter out known legitimate traffic, allowing the focus to be on identifying truly suspicious activities. - -The concentration of network activity on a small set of URLs has important implications for network monitoring and security. By identifying the most frequently accessed URLs, organizations can prioritize their monitoring and security efforts on resources that are more likely to be targeted by malicious actors. For example, high-traffic URLs are often the focus of attacks, such as phishing, malware distribution, or command-and-control (C\&C) communication. By focusing on these key resources, organizations can more effectively detect and mitigate potential threats. - -Additionally, the distribution of hosts based on unique URLs contacted can help identify anomalies in network behavior. For example, a host that suddenly starts contacting a large number of unique URLs may indicate suspicious activity, such as a compromised device performing reconnaissance or data exfiltration. By establishing a baseline for normal behavior, organizations can more easily detect deviations that may warrant further investigation. +Figure \ref{fig:ip} illustrates the distribution of hosts (IP addresses) based on the number of unique URLs they contacted. The X-axis represents the number of unique URLs, ranging from 1 to 15, while the Y-axis shows the count of hosts within each category. The visualization highlights that the majority of hosts interact with only a small number of unique URLs. Specifically, approximately 17,500 hosts contacted exactly two unique URLs, while around 15,000 hosts interacted with only one unique URL. As the number of unique URLs increases, the number of hosts decreases significantly, although there are still many hosts contacting more than a few URLs. This pattern suggests that network activity is highly concentrated around a small set of destinations, with most hosts accessing only a limited range of resources. For example, hosts that contact only one or two unique URLs are likely interacting with essential services such as internal tools, authentication servers, or frequently accessed websites. In contrast, hosts contacting a larger number of unique URLs may represent more diverse or specialized activities, such as administrators, developers, or automated systems performing a variety of tasks across the network. This distribution of host behavior emphasizes the importance of leveraging whitelists to filter out known legitimate traffic, ensuring that analysis can focus on detecting potentially suspicious activities. The concentration of network traffic on a limited set of URLs also carries significant implications for network monitoring and security. By identifying the most frequently accessed URLs, organizations can prioritize security measures for resources that are most likely to be targeted by malicious actors. URLs that experience high traffic are often the focal points of cyberattacks, such as phishing schemes, malware distribution, or command-and-control (C\&C) communication. By directing attention to these critical resources, organizations can enhance their ability to detect and mitigate emerging threats. 
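The distribution underlying Figure \ref{fig:ip} can be reproduced with a simple aggregation over the request logs, sketched below under the assumption of a pandas dataframe; the file and column names are illustrative, not the exact ones used in the analysis.

\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
import pandas as pd

# logs: one row per request with 'user' (host IP) and 'url_hostname' columns
logs = pd.read_json("url_requests.json")   # assumed input file

# Number of distinct URL hostnames contacted by each host
urls_per_host = logs.groupby("user")["url_hostname"].nunique()

# Distribution of hosts over the number of unique URLs contacted
host_distribution = urls_per_host.value_counts().sort_index()
print(host_distribution.head(15))   # how many hosts contacted 1, 2, ... unique URLs
\end{lstlisting}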
Additionally, monitoring the distribution of hosts based on the number of unique URLs they access can help identify anomalous behavior. For instance, a host that unexpectedly begins contacting a large number of unique URLs could indicate suspicious activity, such as a compromised device engaged in reconnaissance or data exfiltration. Establishing a baseline for normal host behavior allows organizations to more effectively identify deviations that may require further investigation, enhancing overall network security.

 \textbf{Analysis of URL Connections}

@@ -500,30 +527,30 @@ After checking the URLs that were reached by these hosts, several conclusions ca

 \begin{itemize}
     \item \textbf{Certificate and Security Validations:} \\
-    Many URLs (e.g., \nolinkurl{ocsp.digicert.com}, \nolinkurl{ocsp.globalsign.com}, \nolinkurl{crl.globalsign.com}, \nolinkurl{ctldl.windowsupdate.com}) are related to certificate status checking and security validations. This indicates that the hosts are regularly verifying digital certificates to ensure secure communications.
+    Some URLs, such as \url{ocsp.digicert.com}, \url{ocsp.globalsign.com}, \url{crl.globalsign.com}, and \url{ctldl.windowsupdate.com}, are associated with certificate status checking and other security validations. These connections highlight that the hosts are actively performing routine checks to ensure the validity of digital certificates. This activity is indicative of a continuous effort to maintain secure communication channels, verify certificate integrity, and prevent man-in-the-middle (MITM) attacks. The inclusion of URLs related to certificate revocation and status checking suggests a heightened emphasis on maintaining secure connections in the network environment.

     \item \textbf{Operating System and Application Updates:} \\
-    Endpoints such as \nolinkurl{ctldl.windowsupdate.com}, \nolinkurl{update.googleapis.com}, and \nolinkurl{www.msftconnecttest.com} suggest that these hosts are contacting services to check for OS or application updates, as well as to perform connectivity tests.
+    Several URLs, including \url{update.googleapis.com}, \url{www.msftconnecttest.com}, and \url{ctldl.windowsupdate.com}, are indicative of hosts checking for operating system or application updates. These domains are typically associated with automated update mechanisms, where endpoints periodically reach out to ensure that their software and security patches are up to date. This also includes connectivity tests to verify network accessibility and ensure systems are functioning properly. These connections are crucial for maintaining the integrity and functionality of the hosts, keeping them secure and performing optimally through regular updates.

     \item \textbf{Enterprise and Cloud Services:} \\
-    Domains like \nolinkurl{saml.allianz.com}, \nolinkurl{www.allianz.de}, \nolinkurl{autodiscover.allianz.de}, \nolinkurl{service-now.com}, and \nolinkurl{workspace.citrix} imply usage of enterprise-level services. These may include single sign-on (SSO), IT service management, or remote workspace access, which are typical in corporate environments.
+    Domains such as \url{saml.allianz.com}, \url{www.allianz.de}, \url{autodiscover.allianz.de}, \url{service-now.com}, and \url{workspace.citrix} point to hosts interacting with enterprise-level services commonly found in corporate environments. These include services for Single Sign-On (SSO), IT service management, and remote workspace access.
The connection to platforms like Citrix suggests that users are accessing virtual desktop environments or cloud-based services, enabling flexible work arrangements. Additionally, integration with platforms like ServiceNow highlights that these hosts may be involved in internal IT service management and troubleshooting, which is a critical component of organizational operations, particularly in large enterprises with complex infrastructures. \item \textbf{Monitoring, Telemetry, and Feature Management:} \\ - Connections to \nolinkurl{infrastructure-command-api.eu.newrelic.com}, \nolinkurl{infra-api.eu.newrelic.com}, and \nolinkurl{launchdarkly.com} suggest that the hosts may be sending telemetry data, monitoring performance, or checking feature toggles as part of routine operations. + Connections to services like \url{infrastructure-command-api.eu.newrelic.com}, \url{infra-api.eu.newrelic.com}, and \url{launchdarkly.com} indicate that the hosts may be transmitting telemetry data for system monitoring, performance analytics, and feature management. These domains are linked to widely used platforms for infrastructure monitoring and feature flag management. Such activities suggest that hosts are contributing to the organization’s overall monitoring strategy, providing valuable data on the performance, uptime, and behavior of applications and services. Regular telemetry collection is a key aspect of proactive system management, allowing IT teams to detect potential issues early and optimize performance. \item \textbf{Security and Threat Management Vendors:} \\ - Several URLs from vendors such as Checkpoint (e.g., \nolinkurl{updates.checkpoint.com}, \nolinkurl{services.checkpoint.com}, \nolinkurl{diag-services.checkpoint.com}) and Trend Micro (\nolinkurl{support-connector-service.manage.trendmicro.com}) indicate that these systems are integrated with specialized security solutions for threat management. + Some URLs associated with security vendors, such as \url{updates.checkpoint.com}, \url{services.checkpoint.com}, \url{diag-services.checkpoint.com} and \url{support-connector-service.manage.trendmicro.com}, reflect the presence of security systems designed for threat management. These connections highlight that the hosts are integrated with advanced security solutions, which are important for identifying and mitigating cyber threats. Whether it’s through regular updates, diagnostics, or real-time threat intelligence feeds, these systems play a role in defending the network against evolving security risks. The integration of such services ensures that the network is constantly protected from new and emerging threats, with automated systems managing much of the workload to keep human intervention to a minimum. \item \textbf{Diversity in Connection Sets:} \\ - Hosts with higher unique URL counts are reaching out to a broader array of services. This diversity likely reflects a more comprehensive set of functions, including enhanced security protocols, regular updates, and additional enterprise services. + Hosts that interact with a higher number of unique URLs are typically engaging with a more diverse set of services, which may reflect a broader range of operations. These activities include access to enhanced security protocols, system updates, and various enterprise services. The diversity in connections may also point to more specialized use cases, where certain hosts are tasked with overseeing a larger portion of the network’s activities, such as system administrators or network engineers. 
The variety in services accessed by these hosts allows for better resource allocation and helps the organization ensure that different types of services receive appropriate attention. These hosts are likely performing more complex tasks, such as monitoring, data analysis, or security audits, and their broad range of connections is an important feature of their role in the network. \end{itemize} \section{Summary} -The data analysis presented in this chapter provides a comprehensive understanding of the dataset's structure, user behavior, and network interactions. By visualizing URL request counts, analyzing 24-hour visit patterns, examining time intervals between requests, and studying the distribution of hosts, this chapter uncovers key insights that can inform network optimization and security strategies. The findings highlight the importance of focusing on high-traffic URLs, understanding temporal patterns in user activity, and detecting periodic behavior that may indicate malicious beaconing. These insights lay the foundation for further analysis and the development of effective detection mechanisms in the BAYWATCH framework. By leveraging advanced visualization techniques and statistical methods, this chapter offers valuable insights into the dataset's characteristics and user behavior, providing a solid basis for enhancing network security and performance. +The data analysis presented in this chapter offers a detailed and comprehensive examination of the dataset's structure, user behavior, and network interactions. By utilizing a variety of visualization tools and statistical methods, the chapter identifies and uncovers key patterns that not only contribute to a better understanding of the data but also provide actionable insights for optimizing network performance and enhancing security measures. The analysis begins with a focus on URL request counts, offering a clear view of the frequency and distribution of web traffic. This helps highlight which URLs are most frequently accessed by hosts within the network, shedding light on the overall popularity of various resources. Understanding the distribution of these request counts is for determining which URLs should be prioritized in network monitoring and security management. The high-traffic URLs, in particular, are often more susceptible to attacks, such as phishing, malware distribution, or even DDoS attacks. By recognizing these hotspots, network administrators can more effectively allocate resources to ensure that these critical URLs are properly secured and monitored. Further investigation into the 24-hour visit patterns of hosts reveals how user activity is distributed across time. By analyzing these temporal patterns, the chapter sheds light on peak usage times, user behavior trends, and possible anomalies. A close examination of these patterns provides a deeper understanding of when the network is most active and helps detect deviations that might indicate unusual or malicious behavior. For instance, atypical spikes in activity at specific hours of the day could signal security incidents such as bot traffic or unauthorized access attempts. This aspect of the analysis is for optimizing network resources and managing traffic loads during high-usage periods, ensuring the network's stability and performance. Another aspect of the analysis involves the time intervals between requests. 
This segment of the study reveals how hosts interact with the network, providing insights into the frequency of user requests and the temporal gaps between them. This can help identify periodic or repetitive behavior, which may indicate underlying issues such as inefficient resource usage or even intentional attempts at evading detection. The analysis of time intervals is for identifying malicious activities, such as beaconing—a pattern in which an infected device sends regular, seemingly benign requests to a specific URL to maintain communication with a command-and-control server. Detecting such behaviors can play an important role in early-stage threat detection, as it allows for the identification of compromised devices or ongoing cyberattacks before they escalate. The distribution of hosts based on the number of unique URLs they contact provides a further layer of insight into user and network behavior. This analysis highlights the concentration of network activity and reveals how different hosts interact with various resources. For example, some hosts may only contact a limited number of URLs, often related to essential services, while others might interact with a broader set of resources. The latter group may represent specialized functions or more complex network activities. By understanding the distribution of hosts across different sets of URLs, organizations can better prioritize their security efforts and ensure that high-risk activities are closely monitored. This distribution can also help distinguish between normal and anomalous behaviors, offering clues about potential security threats or misconfigurations within the network. Collectively, these findings emphasize the importance of focusing on high-traffic URLs and understanding the temporal patterns in user activity. By identifying periodic behaviors or unusual request intervals, it becomes possible to detect anomalies that could indicate malicious intent or system vulnerabilities. The insights provided by this analysis are important for creating more effective detection mechanisms within the BAYWATCH framework, laying a strong foundation for the development of robust network security tools and strategies. The use of advanced visualization techniques and statistical analysis in this chapter is instrumental in uncovering these patterns. These tools provide a clear and intuitive way to visualize complex data sets, helping to identify trends and outliers that may otherwise go unnoticed. This approach not only contributes to a deeper understanding of the dataset but also facilitates the identification of areas that require further investigation or intervention. By offering a comprehensive view of the network's structure and behavior, this chapter provides a solid foundation for enhancing network security, improving performance, and developing more effective detection and mitigation mechanisms for potential threats. In conclusion, the data analysis conducted in this chapter offers a thorough understanding of network dynamics, highlighting key areas for improvement in both security and performance optimization. By examining the dataset's structure, user behavior, and network interactions through various lenses, this chapter delivers valuable insights that can guide future research and the implementation of more sophisticated network management strategies. These findings are for building a proactive security posture, ensuring the network remains resilient against evolving threats while maintaining optimal performance. 
\chapter{Implementation}
-This chapter details the novel extensions introduced in the BAYWATCH framework. Although the original BAYWATCH framework was reimplemented in Python for enhanced integration and maintainability, the focus here is strictly on the additional contributions that extend beyond the original design. The BAYWATCH framework incorporates an advanced signal analysis pipeline and a comprehensive evaluation methodology using both real network traces and synthetically generated beaconing data. These enhancements are aimed at improving temporal pattern detection under varied jitter, interval, and frequency conditions.
+Although the original BAYWATCH framework was reimplemented in Python for enhanced integration and maintainability, the focus of this chapter is strictly on the additional contributions that extend beyond the original design. The BAYWATCH framework incorporates an advanced signal analysis pipeline and a comprehensive evaluation methodology using both real network traces and synthetically generated beaconing data. These enhancements are aimed at improving temporal pattern detection under varied jitter, interval, and frequency conditions.

 \section{System Enhancements}
 The framework extends the original beaconing detection methodology with several key enhancements:
@@ -553,16 +580,22 @@ A multi-stage processing pipeline has been developed to isolate genuine periodic
 The final stage of the pipeline cross-validates the findings from the frequency and time domains. Here, the candidate frequency components obtained from the FFT are correlated with the candidate lags identified through the autocorrelation analysis. A tolerance is applied to account for minor discrepancies. Only those frequency components that consistently align with the time-domain peaks are retained as genuine periodic signals. This correlation ensures that the periodicity detected in the frequency spectrum is supported by corresponding temporal patterns, thereby enhancing the overall reliability of the detection process.
 \end{enumerate}

-This multi-stage pipeline represents a significant advancement in the detection of beaconing behavior, as it combines complementary analytical techniques to overcome the challenges posed by noisy and irregular data. The integration of bandpass filtering, dynamic FFT thresholding, and autocorrelation-based validation provides a robust framework for isolating true periodic signals, even in complex network traffic environments.
+This comprehensive signal analysis pipeline, combining bandpass filtering, permutation-based FFT thresholding, autocorrelation peak detection, and frequency-lag correlation, constitutes a novel contribution of the BAYWATCH framework. It is designed to be robust in the presence of noise and adaptable to variations in beaconing patterns, thereby significantly enhancing the detection of periodic signals in complex network traffic data.

-This comprehensive signal analysis pipeline, combining bandpass filtering, permutation-based FFT thresholding, autocorrelation peak detection, and frequency-lag correlation, constitutes a novel contribution of the DBAYWATCH framework. It is designed to be robust in the presence of noise and adaptable to variations in beaconing patterns, thereby significantly enhancing the detection of periodic signals in complex network traffic data.
+\subsection{Evaluation with Beaconing Data}
+
+A thorough evaluation was conducted to assess the effectiveness of the proposed approach in detecting beaconing behavior within network traffic.
This evaluation was carried out using two distinct data sources, each serving a specific purpose in testing and validating the methodology: + +\begin{itemize} + \item \textbf{Real Network Traces:} + Real-world data was collected from an operational enterprise environment to ensure that the evaluation reflected authentic network activity. These traces contained legitimate user interactions, background network traffic, and various enterprise-related communications. By utilizing real network data, the study ensured that the proposed detection mechanisms could operate effectively under practical conditions, accounting for the complexities of actual user behavior, business applications, and security controls. The dataset included a diverse range of network activities, such as web browsing, automated system updates, authentication requests, and enterprise cloud service interactions, providing a realistic testbed for identifying potential beaconing patterns. + + \item \textbf{Synthetic Beaconing Data:} + In addition to real-world data, synthetic beaconing traffic was generated to emulate a broad spectrum of beaconing behaviors under controlled conditions. This dataset was specifically designed to vary key parameters such as beacon interval and jitter, ensuring the evaluation covered a wide range of realistic and adversarial scenarios. The generated beaconing traffic included different types of periodic communications. This synthetic data allowed for systematic testing of detection thresholds, sensitivity to timing variations, and the robustness of anomaly detection mechanisms. +\end{itemize} + +By combining real-world network traces with controlled synthetic beaconing data, the evaluation process ensured a comprehensive and rigorous assessment of the detection approach. The real network traces provided insight into how the method performed in a complex, dynamic enterprise environment, while the synthetic data allowed for precise benchmarking of detection capabilities across different beaconing strategies. This dual approach helped validate the robustness of the system against both common and advanced beaconing techniques, ensuring its applicability in practical network security scenarios. -\subsection{Evaluation with Beaconing Data} -A thorough evaluation was conducted using two data sources: -\begin{itemize} - \item \textbf{Real Network Traces:} Collected from an operational enterprise environment. - \item \textbf{Synthetic Beaconing Data:} Generated to emulate a wide range of conditions by varying the beacon interval and jitter. -\end{itemize} \section{Experimental Design of Synthetic Beaconing Data} The synthetic experiments simulated various beacon configurations to study the impact of temporal noise on detection accuracy. Each beacon candidate is characterized by: @@ -581,11 +614,6 @@ The synthetic experiments simulated various beacon configurations to study the i \begin{equation} \text{Frequency} = \frac{1}{\text{Interval}}. \end{equation} - For example, if a beacon transmits every 10 seconds, its frequency is: - \begin{equation} - \frac{1}{10} = 0.1 \text{ Hz}. - \end{equation} - A higher frequency corresponds to more frequent transmissions, facilitating detection. \end{itemize} @@ -597,13 +625,13 @@ For instance: \end{itemize} \section{Results and Analysis} -Table~\ref{tab:beacon_comparison} presents an analysis of various beacon candidates. 
The table provides metrics that describe the periodicity of beacon transmissions, including interval duration, jitter, jitter-to-interval ratio, and frequency. These metrics are using for understanding the complexity of beacon detection and the impact of temporal noise on signal identification.
+Table~\ref{tab:beacon_comparison} presents an analysis of various beacon candidates. The table provides metrics that describe the periodicity of beacon transmissions, including interval duration, jitter, and jitter-to-interval ratio. These metrics are used to understand the complexity of beacon detection and the impact of temporal noise on signal identification. The table is sorted by interval in increasing order.

 This table provides an overview of beacon transmission characteristics, highlighting differences in periodicity and randomness. Beacons with high jitter introduce more unpredictability, making them harder to detect, while those with low jitter and frequent transmissions are easier to identify. The results underscore the importance of understanding temporal patterns in beacon signals and the impact of noise on detection accuracy.

 \begin{table}
     \centering
-    \caption{Beacon Candidates Ranked by Amplitude Score}
+    \caption{Beacon Candidates Ranked by Interval}
     \label{tab:beacon_comparison}
     \resizebox{\textwidth}{!}{%
     \begin{tabular}{lccc}
@@ -611,28 +639,27 @@ This table provides an overview of beacon transmission characteristics, highligh
         \textbf{Beacon URL} &
         \textbf{Interval (s)} &
         \textbf{Jitter (s)} &
-        \textbf{Jitter/Interval Ratio (\%)} &
-        \textbf{Frequency ($\sim$Hz)} \\
+        \textbf{Jitter/Interval Ratio (\%)} \\
         \midrule
-        beacon1.example.com  & 300 & 2   & 0.67  & 0.00331 - 0.00335 \\
-        beacon2.example.com  & 300 & 5   & 1.67  & 0.00327 - 0.00338 \\
-        beacon3.example.com  & 300 & 120 & 40.00 & 0.0023 - 0.0055 \\
-        beacon4.example.com  & 300 & 150 & 50.00 & 0.0022 - 0.0066 \\
-        beacon5.example.com  & 10  & 5   & 50.00 & 0.066 - 0.2 \\
-        beacon6.example.com  & 10  & 2   & 20.00 & 0.083 - 0.125 \\
-        beacon7.example.com  & 20  & 5   & 25.00 & 0.04 - 0.066 \\
-        beacon8.example.com  & 20  & 10  & 50.00 & 0.033 - 0.1 \\
-        beacon9.example.com  & 30  & 5   & 16.67 & 0.028 - 0.04 \\
-        beacon10.example.com & 40  & 5   & 12.50 & 0.022 - 0.028 \\
-        beacon11.example.com & 50  & 5   & 10.00 & 0.018 - 0.022 \\
-        beacon12.example.com & 60  & 10  & 16.70 & 0.014 - 0.02 \\
-        beacon13.example.com & 120 & 10  & 8.30  & 0.0076 - 0.009 \\
+        beacon5.example.com  & 10  & 5   & 50.00 \\
+        beacon6.example.com  & 10  & 2   & 20.00 \\
+        beacon7.example.com  & 20  & 5   & 25.00 \\
+        beacon8.example.com  & 20  & 10  & 50.00 \\
+        beacon9.example.com  & 30  & 5   & 16.67 \\
+        beacon10.example.com & 40  & 5   & 12.50 \\
+        beacon11.example.com & 50  & 5   & 10.00 \\
+        beacon12.example.com & 60  & 10  & 16.70 \\
+        beacon13.example.com & 120 & 10  & 8.30  \\
+        beacon1.example.com  & 300 & 2   & 0.67  \\
+        beacon2.example.com  & 300 & 5   & 1.67  \\
+        beacon3.example.com  & 300 & 120 & 40.00 \\
+        beacon4.example.com  & 300 & 150 & 50.00 \\
         \bottomrule
     \end{tabular}%
     }
\end{table}

-Figure \ref{fig:ABGraph} illustrates the synthetic beacon candidates with varying jitter levels. The x-axis represents the time intervals frequencies between beacon transmissions, while the y-axis shows the amplitude of the signals. The graph demonstrates how different jitter levels affect the periodicity and amplitude of beacon signals. Beacons with low jitter exhibit clear periodic patterns, making them easier to detect, while those with high jitter show more irregularity, complicating identification.
By analyzing these synthetic beacon candidates, the framework can evaluate its detection capabilities under different noise conditions and refine its algorithms to improve accuracy and robustness. The figure presents the results obtained after applying all stages of the detection algorithm to the selected beacon URLs. Each URL exhibits distinct candidate points where periodic behavior has been detected. The analysis reveals a significant variation in the number of candidate points across different beacon URLs. Specifically, some beacons, such as "beacon1.example.com" through "beacon4.example.com", exhibit only a single candidate point, indicating that their periodic signals are either weak or occur over long intervals, making them more challenging to detect. This suggests that beaconing behaviors with longer intervals are inherently more difficult to detect, as their signals appear less frequently in the analyzed data. On the other hand, the algorithm performs more effectively in detecting beacons with shorter intervals. Other beacons, such as "beacon5.example.com" and "beacon6.example.com", show a substantially higher number of candidates, with 429 and 153 detected points, respectively, suggesting stronger periodicity and making them easier to identify.
+Figure \ref{fig:ABGraph} illustrates the synthetic beacon candidates with varying jitter levels. The x-axis represents the frequencies corresponding to the time intervals between beacon transmissions, while the y-axis shows the amplitude of the signals. The graph demonstrates how different jitter levels affect the periodicity and amplitude of beacon signals. Beacons with low jitter exhibit clear periodic patterns, making them easier to detect, while those with high jitter show more irregularity, complicating identification. The figure presents the results obtained after applying all stages of the detection algorithm to the selected beacon URLs. Each URL exhibits distinct candidate points where periodic behavior has been detected. The analysis reveals a significant variation in the number of candidate points across different beacon URLs. Specifically, some beacons, such as "beacon1.example.com" through "beacon4.example.com", exhibit only a single candidate point, indicating that their periodic signals are either weak or occur over long intervals, making them more challenging to detect. This suggests that beaconing behaviors with longer intervals are inherently more difficult to detect, as their signals appear less frequently in the analyzed data. On the other hand, the algorithm performs more effectively in detecting beacons with shorter intervals. Other beacons, such as "beacon5.example.com" and "beacon6.example.com", show a substantially higher number of FFT candidates, with 429 and 153 detected points, respectively, suggesting stronger periodicity. A clear example of this can be seen in "beacon7.example.com", where a detected frequency of 0.05 Hz corresponds to a periodic beaconing behavior every 20 seconds. The detection of this short-interval beacon illustrates the algorithm’s strength in identifying high-frequency periodic transmissions, as their repetition leads to more pronounced spectral features in the output. Overall, the figure demonstrates that while long-interval beacons pose detection challenges, the algorithm excels in identifying shorter-interval beacons with strong periodicity.
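To illustrate the frequency-domain stage whose output is summarized above, the following sketch bins a jittered event sequence, computes its spectrum, and keeps only frequencies whose amplitude exceeds a permutation-based threshold. It is a simplified stand-in for the actual pipeline: the bin width, number of permutations, quantile, and all names are assumptions, and the 20-second example is chosen so that a candidate near 0.05 Hz should typically appear.

\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
import numpy as np

def fft_candidates(event_times, bin_s=1.0, n_perm=100, quantile=0.99, rng=None):
    """Sketch of the FFT stage: bin events, compute the spectrum, and keep
    frequencies whose amplitude exceeds a permutation-based threshold."""
    if rng is None:
        rng = np.random.default_rng(0)
    t = np.asarray(event_times, dtype=float)
    n_bins = int(np.ceil((t.max() - t.min()) / bin_s)) + 1
    series = np.bincount(((t - t.min()) / bin_s).astype(int), minlength=n_bins)
    centered = series - series.mean()

    freqs = np.fft.rfftfreq(n_bins, d=bin_s)
    amps = np.abs(np.fft.rfft(centered))

    # Shuffle the series to estimate the amplitude distribution expected
    # under "no periodicity", then keep the chosen quantile as threshold.
    null_max = [np.abs(np.fft.rfft(rng.permutation(centered)))[1:].max()
                for _ in range(n_perm)]
    threshold = np.quantile(null_max, quantile)
    return [(float(f), float(a)) for f, a in zip(freqs[1:], amps[1:]) if a > threshold]

# Example: a 20 s beacon with up to +/-5 s jitter, 500 events
rng = np.random.default_rng(1)
times = np.cumsum(20 + rng.uniform(-5, 5, size=500))
print(fft_candidates(times)[:5])
\end{lstlisting}

In this setting, longer-interval beacons contribute far fewer events over the same observation window, which is one reason they yield fewer candidate points in the evaluation described above.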
@@ -644,9 +671,9 @@ A clear example of this can be seen in "beacon7.example.com", where a detected f \end{figure} \section{Discussion and Conclusion} -The BAYWATCH extensions significantly enhance beacon detection accuracy by incorporating an advanced signal analysis pipeline. The evaluation confirms that maintaining a low jitter/interval ratio (ideally below 10\%) is useful for robust detection. However, the framework also demonstrates resilience under moderate noise conditions through adaptive thresholding and correlation techniques. The synthetic experiments provide valuable insights into the impact of temporal noise on beacon detection, highlighting the importance of understanding periodic patterns and the challenges posed by irregular transmissions. By combining real-world network traces with synthetic beaconing data, the framework achieves a comprehensive evaluation, demonstrating its efficacy in detecting malicious beaconing behavior under varied conditions. These results underscore the framework's potential to enhance network security by identifying stealthy threats and improving anomaly detection capabilities. The advanced signal analysis pipeline, coupled with a rigorous evaluation methodology, positions the BAYWATCH framework as a valuable tool for securing enterprise networks against advanced cyber threats. +The BAYWATCH extensions refine beacon detection by evaluating the impact of temporal noise and jitter on detection accuracy. The evaluation highlights that a low jitter/interval ratio (ideally below 10\%) can aid detection, but attackers are unlikely to operate under such ideal conditions. Instead, the framework demonstrates resilience under moderate noise through adaptive thresholding and correlation techniques. The synthetic experiments provide insights into the challenges posed by irregular transmissions, emphasizing the importance of understanding periodic patterns in adversarial settings. By combining real-world network traces with synthetic beaconing data, the framework ensures a comprehensive evaluation, assessing its capability to detect malicious beaconing behavior under varied conditions. These results highlight the framework’s role in enhancing network security by improving anomaly detection against stealthy threats. -The enhancements in the BAYWATCH framework, as detailed in this chapter, offer substantial improvements over the original BAYWATCH implementation. By reimplementing the base framework in Python and extending it with an advanced signal analysis pipeline, BAYWATCH achieves improved accuracy and scalability in beacon detection. The comprehensive evaluation with both real and synthetic data underscores the critical impact of jitter on detection performance and provides clear guidelines for optimal parameter settings in practical network security applications. +The enhancements in the BAYWATCH framework, as detailed in this chapter, primarily focus on improving its evaluation and applicability rather than incorporating a fundamentally new signal analysis pipeline. By reimplementing the base framework in Python and systematically analyzing the effects of jitter, BAYWATCH achieves improved accuracy and scalability in beacon detection. The comprehensive evaluation with both real and synthetic data underscores the critical impact of jitter on detection performance and provides clear guidelines for optimal parameter settings in practical network security applications. 
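As a rough illustration of the FFT-based candidate extraction and adaptive thresholding referred to above (and evaluated in the next chapter), the sketch below bins request timestamps into a fixed-width count series and keeps the frequencies whose spectral amplitude exceeds a percentile threshold. The one-second bin width, the 99th-percentile cut-off, and the function name are assumptions for illustration, not the thesis's exact parameters.

import numpy as np

def fft_period_candidates(timestamps, bin_width_s=1.0, percentile=99.0):
    # Frequencies (Hz) whose FFT amplitude exceeds an adaptive
    # percentile threshold computed over the spectrum itself.
    t = np.asarray(timestamps, dtype=float)
    n_bins = int(np.ceil((t.max() - t.min()) / bin_width_s)) + 1
    series, _ = np.histogram(t - t.min(), bins=n_bins,
                             range=(0.0, n_bins * bin_width_s))
    amplitude = np.abs(np.fft.rfft(series - series.mean()))
    freqs = np.fft.rfftfreq(n_bins, d=bin_width_s)
    threshold = np.percentile(amplitude[1:], percentile)  # ignore the DC term
    keep = amplitude > threshold
    keep[0] = False
    return freqs[keep], amplitude[keep]

# A 20 s beacon with moderate jitter should yield a dominant candidate
# near 1/20 = 0.05 Hz.
rng = np.random.default_rng(1)
ts = np.sort(20.0 * np.arange(1, 181) + rng.uniform(-5, 5, 180))
freqs, amps = fft_period_candidates(ts)
print(freqs[np.argmax(amps)])

For the jittered 20-second example this should print a value close to 0.05 Hz, mirroring how short-interval synthetic beacons appear as clear spectral candidates in the evaluation above.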
\chapter{Experiments and Discussions}
This chapter presents a comprehensive evaluation of the framework to validate its efficacy in detecting malicious beaconing behavior in large-scale networks. The experiments are designed to address two objectives: first assessing the framework's robustness and accuracy under controlled noise conditions using synthetic datasets, and second evaluating its practical performance in real-world enterprise network environments. Synthetic data, generated with programmable noise levels and periodic patterns, enables systematic testing of the framework's core algorithms, such as the Fast Fourier Transform (FFT) and autocorrelation-based verification. Subsequently, the framework is deployed on a real-world dataset. This dual approach not only validates the theoretical soundness of the methodology but also demonstrates its scalability and operational feasibility. By synthesizing findings from both artificial and real-world scenarios, this chapter provides insights into the framework's strengths, limitations, and applicability in modern cybersecurity defense systems.
@@ -688,7 +715,8 @@ The final step combines the results from the FFT and ACF steps to confirm malici
 Figure \ref{fig:combinedall} presents the analysis of three selected URLs "fpc.mesedge.net", "m4v4r4c5.stackpathcdn.com", and "beacon7.example.com", derived from both real and synthetic data. The first URL represents a non-beaconing behavior observed in real data, meaning that no periodic transmission pattern is present. The second URL, also extracted from real data, exhibits a clear beaconing behavior. The third URL corresponds to a synthetic beacon, artificially generated to simulate a periodic transmission pattern.
 
-The x-axis represents the frequency range, corresponding to different time intervals, while the y-axis indicates the amplitude of the detected signals. The results show that for the first URL, which does not exhibit beaconing behavior, very few significant points appear in the output, confirming the absence of strong periodic patterns. In contrast, the second beacon, which originates from real data, displays a periodic behavior with a transmission interval of 10 seconds. Similarly, the synthetic beacon demonstrates a periodicity of 20 seconds.
+The x-axis represents the frequency range, corresponding to different time intervals, while the y-axis indicates the amplitude of the detected signals. The beaconing URL derived from real data exhibits periodic behavior with a frequency of approximately 0.1 Hz and a transmission interval of 10 seconds. Similarly, the synthetic beacon shows candidate frequencies at 0.05 Hz and 0.15 Hz; the dominant peak at 0.05 Hz indicates that this URL exhibits beaconing behavior with a periodicity of 20 seconds. Additionally, no periodic beaconing behavior is detected for the non-beacon URL.
+
 
 \begin{table}
     \centering
@@ -716,17 +744,21 @@ By applying the detection algorithm to this dataset and analyzing the output, it
 
 \section{Discussion}
 
-The framework's combination of Fast Fourier Transform (FFT) and autocorrelation function (ACF) proved highly effective in detecting malicious beaconing behavior. By leveraging the complementary strengths of FFT (frequency-domain analysis) and ACF (time-domain consistency), the framework achieved a high detection accuracy while minimizing false positives.
The cross-validation step, where only frequencies confirmed by both FFT and ACF are flagged as beaconing candidates, ensured robust identification of malicious behavior even in noisy environments. +The framework’s combination of Fast Fourier Transform (FFT) and autocorrelation function (ACF) proved highly effective in detecting malicious beaconing behavior. By leveraging the complementary strengths of FFT (frequency-domain analysis) and ACF (time-domain consistency), the framework achieved a high detection accuracy while minimizing false positives. The FFT efficiently identified periodic signals within network traffic, allowing for the detection of recurring patterns that might indicate beaconing behavior. Meanwhile, the ACF ensured that these periodic signals were consistent over time, reinforcing the detection confidence. The cross-validation step, where only frequencies confirmed by both FFT and ACF are flagged as beaconing candidates, played an important role in enhancing detection robustness. This approach significantly reduced false positives by filtering out transient, non-malicious periodic signals, ensuring that the framework remained effective even in noisy environments with fluctuating network activity. + +Despite its strong performance, the framework has certain limitations that must be addressed for broader applicability. First, its reliance on historical data means that it cannot detect zero-day beaconing behavior. Since the detection process requires a sufficient time window to analyze periodicity, newly emerging threats that do not yet exhibit clear patterns may evade detection until enough data has been accumulated. This limitation underscores the need for complementary real-time anomaly detection techniques that can provide additional layers of defense against previously unseen threats. + +Second, while the framework effectively filters out most noise, it occasionally flags legitimate periodic traffic as suspicious. Some automated processes, such as scheduled software updates, real-time financial market feeds, and news syndication services, naturally generate periodic network traffic that may be mistakenly classified as beaconing behavior. This issue could be mitigated by integrating adaptive whitelisting mechanisms that dynamically update based on observed traffic patterns and external threat intelligence feeds. By continuously refining its whitelist based on empirical observations, the framework could reduce false positives without compromising its ability to detect actual threats. -However, the framework has certain limitations. First, its reliance on historical data means it cannot detect zero-day beaconing behavior, as it requires a sufficient time window to analyze periodicity. Second, while the framework effectively filters out most noise, it occasionally flags legitimate periodic traffic (e.g., news feeds) as suspicious. This issue could be mitigated by integrating adaptive whitelisting mechanisms that dynamically update based on observed traffic patterns and threat intelligence feeds. +A factor in evaluating the effectiveness of the proposed algorithm is its execution time. After applying the necessary preprocessing and filtering steps on real-world network data, the implemented algorithm—designed to cross-check FFT and ACF results—completed execution in under 10 seconds. This runtime was measured on a system equipped with an Intel Core i5 processor, 16GB RAM, and a 500GB SSD running Windows 11. 
The rapid execution time highlights the efficiency of the implemented detection pipeline, demonstrating that the approach is computationally feasible for deployment in real-world enterprise environments.
 
-One critical aspect of evaluating the effectiveness of the proposed algorithm is its execution time. After applying the necessary preprocessing and filtering steps on the real data, the algorithm was executed, and the results were obtained in less than 10 seconds. This rapid response time demonstrates the efficiency of the implemented pipeline.
+The fast execution is largely attributed to the effectiveness of the preprocessing steps. By applying a combination of noise reduction techniques, frequency filtering, and data normalization, the dataset was refined before undergoing FFT and ACF analysis. This preprocessing stage reduced computational complexity, allowing the core detection algorithm to focus on meaningful periodic signals rather than processing extraneous noise. As a result, the framework efficiently extracted periodic patterns from network traffic without significant delays, making it suitable for near-real-time analysis in security operations centers (SOCs).
 
-The fast execution is largely attributed to the effectiveness of the preprocessing steps. By applying various filtering techniques beforehand, the data was already refined and structured, reducing computational complexity in the subsequent analysis. As a result, the algorithm was able to process the data efficiently, extracting periodic patterns without significant delays.
+Furthermore, the framework's modular design allows it to be adapted and extended based on specific security needs. Different configurations can be applied depending on the threat landscape, network architecture, and available computational resources. For instance, organizations with high network traffic volumes can implement parallelized versions of the FFT and ACF computations to further accelerate processing times. Additionally, integrating machine learning models trained on historical network traffic could enhance the framework's ability to distinguish between benign and malicious periodic behavior.
 
-Compared to traditional methods that may require extensive computational resources or longer processing times due to noise and redundant data, the proposed approach provides a streamlined and optimized solution. The ability to generate results within such a short timeframe highlights the algorithm's suitability for real-time or near-real-time applications in network traffic analysis and beacon detection.
+Another potential enhancement involves incorporating contextual analysis of detected beaconing traffic. While the current approach focuses primarily on periodicity, future iterations could integrate additional metadata, such as domain reputation scores, geolocation data, and endpoint behavior analysis. By correlating detected beaconing patterns with known threat intelligence sources, the framework could improve its ability to distinguish between benign automated services and actual command-and-control (C2) communications used in cyberattacks.
 
-In conclusion, the framework represents a significant step forward in the detection of malicious beaconing behavior. Its modular design, scalability, and high accuracy make it a practical tool for enterprise threat detection. By addressing its current limitations and exploring future enhancements, the framework could become an even more powerful component of modern cybersecurity defense systems.
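As a concrete illustration of the FFT/ACF cross-validation described in this discussion, the following minimal sketch keeps an FFT candidate frequency only if the autocorrelation of the same binned series also shows a pronounced peak at the corresponding lag. The 0.5 correlation cut-off and the helper names are illustrative assumptions, not the framework's exact settings.

import numpy as np

def normalized_acf(series):
    # Autocorrelation of a 1-D count series, normalized so lag 0 equals 1.
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    full = np.correlate(x, x, mode="full")
    acf = full[full.size // 2:]
    return acf / acf[0]

def confirm_with_acf(series, candidate_freqs, bin_width_s=1.0, min_corr=0.5):
    # Keep only FFT candidates whose implied period also appears as a
    # strong autocorrelation peak at the matching lag.
    acf = normalized_acf(series)
    confirmed = []
    for f in candidate_freqs:
        lag = int(round(1.0 / (f * bin_width_s)))  # period expressed in bins
        if 0 < lag < len(acf) and acf[lag] >= min_corr:
            confirmed.append(f)
    return confirmed

# Example: a strict 10 s beacon observed for one hour in 1 s bins.
series = np.zeros(3600)
series[::10] = 1
print(confirm_with_acf(series, candidate_freqs=[0.1, 0.37]))

Only the 0.1 Hz candidate survives the check, mirroring how the framework discards FFT peaks that the ACF does not corroborate and thereby suppresses transient, non-malicious periodic signals.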
+In conclusion, the proposed framework represents a significant advancement in the detection of malicious beaconing behavior. Its modular architecture, scalability, and high accuracy make it a practical tool for enterprise threat detection and network security monitoring. By addressing its current limitations and exploring potential enhancements, such as adaptive whitelisting and machine learning-based classification, the framework could evolve into an even more powerful component of modern cybersecurity defense systems. Its ability to efficiently analyze network traffic and detect covert communication channels underscores its value as a proactive defense mechanism against advanced persistent threats (APTs) and other sophisticated cyber threats. \chapter{Conclusion and Future Work} @@ -737,9 +769,8 @@ The framework was evaluated using both synthetic and real-world datasets. In syn Key contributions of this work include: \begin{itemize} - \item A novel 8-step filtering methodology that reliably separates legitimate from malicious beaconing behavior. - \item A highly scalable implementation, enabling the analysis of large-scale network traffic over extended time windows. - \item A systematic evaluation of the framework’s core algorithms, demonstrating robustness to real-world perturbations and scalability to millions of daily connection pairs. + \item An implementation of the BAYWATCH framework's 8-step filtering methodology, designed to effectively distinguish between legitimate and malicious beaconing behavior. + \item An implementation optimized for scalability, leveraging efficient data structures and parallel processing techniques to analyze large-scale network traffic over extended time windows. \end{itemize} The research is guided by several key questions, which are addressed below: @@ -775,10 +806,16 @@ Machine learning techniques could be integrated into the framework to improve th While the current implementation focuses on web proxy logs, the framework could be extended to analyze other types of network traffic, such as DNS queries, NetFlow data, and firewall logs. This would provide a more comprehensive approach to detecting advanced threats and improve the framework’s ability to identify covert communication channels. \subsection{Handling Aperiodic Beaconing} -Although rare, some advanced threats may employ aperiodic beaconing strategies to evade detection. Future work could explore techniques for detecting such behavior, such as analyzing the entropy of communication intervals or using machine learning to identify irregular patterns. +Although rare, some advanced threats may employ aperiodic beaconing strategies to evade detection. Future work could explore techniques for detecting such behavior, such as analyzing the entropy of communication intervals. \subsection{Deployment in Diverse Environments} Finally, future work could focus on deploying and evaluating the framework in diverse network environments, such as cloud infrastructures, IoT networks, and industrial control systems. This would help identify environment-specific challenges and adapt the framework to different use cases. \section{Final Remarks} -The BAYWATCH framework represents a significant advancement in the detection of malicious beaconing behavior. Its modular design, scalability, and high accuracy make it a practical tool for enterprise threat detection. 
By addressing its current limitations and exploring future enhancements, the framework could become an even more powerful component of modern cybersecurity defense systems, helping organizations protect their networks from advanced and persistent threats.
+The BAYWATCH framework advances the field of cybersecurity by offering a robust, scalable solution for detecting malicious beaconing—a critical indicator of advanced threats like APTs and botnets. Its success lies in the hybrid methodology that combines spectral analysis (FFT) for frequency detection and autocorrelation (ACF) for temporal validation, enabling precise identification of periodic patterns even amidst noisy, real-world network traffic. By integrating multi-layered filtering—whitelisting, time series scrutiny, and behavioral heuristics—the framework minimizes false positives while maintaining sensitivity to stealthy threats, addressing a longstanding challenge in anomaly detection.
+
+The framework's practical efficacy is underscored by its validation across both synthetic and real-world datasets, demonstrating resilience to jitter, network noise, and evasion tactics. This dual-data approach not only confirms its adaptability to controlled scenarios but also demonstrates its readiness for enterprise deployment, where complexity and scale demand solutions that balance accuracy with computational efficiency. Moreover, BAYWATCH's modular architecture allows seamless integration with existing security infrastructures, such as threat intelligence platforms, enhancing organizational capabilities to preemptively identify and mitigate covert attacks.
+
+Looking ahead, the framework's evolution could redefine proactive cybersecurity strategies. Future enhancements like real-time analysis and adaptive whitelisting would enable dynamic threat response, while machine learning integration could uncover subtler, non-periodic attack patterns through unsupervised anomaly detection. Extending BAYWATCH to diverse data sources—such as DNS logs, IoT telemetry, or cloud-native traffic—would broaden its applicability across modern IT ecosystems, from edge computing to industrial control systems.
+
+Ultimately, this work bridges a vital gap in cyber defense, providing enterprises with a tool to combat increasingly sophisticated threats. By continuing to refine BAYWATCH's capabilities and deploying it across sectors like healthcare, finance, and critical infrastructure, the cybersecurity community can foster a more resilient digital landscape, safeguarding sensitive data and operational continuity in an era of relentless cyber escalation.
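As a small illustration of the entropy-based direction mentioned under the handling of aperiodic beaconing above, the sketch below computes the Shannon entropy of binned inter-arrival gaps: a strict or lightly jittered beacon concentrates its gaps in a few bins and scores low, while irregular spacing scores higher. The one-second bin width is an assumption, and any decision threshold would have to be tuned on real traffic.

import numpy as np

def interval_entropy(timestamps, bin_width_s=1.0):
    # Shannon entropy (bits) of the distribution of inter-arrival gaps.
    gaps = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    if gaps.size == 0:
        return 0.0
    edges = np.arange(0.0, gaps.max() + bin_width_s, bin_width_s)
    counts, _ = np.histogram(gaps, bins=edges)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# A strict 20 s beacon scores (near) zero; the same beacon with +/- 5 s
# of jitter, or genuinely aperiodic traffic, scores noticeably higher.
rng = np.random.default_rng(2)
strict = 20.0 * np.arange(180)
jittered = strict + rng.uniform(-5, 5, strict.size)
print(interval_entropy(strict), interval_entropy(jittered))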
diff --git a/Thesis_Docs/sources/references.bib b/Thesis_Docs/sources/references.bib index 5ed64713cff1a09b298d249aac484a9883a4b41b..9360d0a615f58dcc6de46e2a228ef57c0aa529d4 100644 --- a/Thesis_Docs/sources/references.bib +++ b/Thesis_Docs/sources/references.bib @@ -1,97 +1,192 @@ @inproceedings{hu2016baywatch, - title={BAYWATCH: robust beaconing detection to identify infected hosts in large-scale enterprise networks}, - author={Hu, Xin and Jang, Jiyong and Stoecklin, Marc Ph and Wang, Ting and Schales, Douglas L and Kirat, Dhilung and Rao, Josyula R}, - booktitle={2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)}, - pages={479--490}, - year={2016}, - organization={IEEE} + title = {{BAYWATCH}: {Robust} {Beaconing} {Detection} to {Identify} {Infected} {Hosts} in {Large}-{Scale} {Enterprise} {Networks}}, + shorttitle = {{BAYWATCH}}, + url = {https://ieeexplore.ieee.org/abstract/document/7579765}, + doi = {10.1109/DSN.2016.50}, + abstract = {Sophisticated cyber security threats, such as advanced persistent threats, rely on infecting end points within a targeted security domain and embedding malware. Typically, such malware periodically reaches out to the command and control infrastructures controlled by adversaries. Such callback behavior, called beaconing, is challenging to detect as (a) detection requires long-term temporal analysis of communication patterns at several levels of granularity, (b) malware authors employ various strategies to hide beaconing behavior, and (c) it is also employed by legitimate applications (such as updates checks). In this paper, we develop a comprehensive methodology to identify stealthy beaconing behavior from network traffic observations. We use an 8-step filtering approach to iteratively refine and eliminate legitimate beaconing traffic and pinpoint malicious beaconing cases for in-depth investigation and takedown. We provide a systematic evaluation of our core beaconing detection algorithm and conduct a large-scale evaluation of web proxy data (more than 30 billion events) collected over a 5-month period at a corporate network comprising over 130,000 end-user devices. Our findings indicate that our approach reliably exposes malicious beaconing behavior, which may be overlooked by traditional security mechanisms.}, + urldate = {2025-03-06}, + booktitle = {2016 46th {Annual} {IEEE}/{IFIP} {International} {Conference} on {Dependable} {Systems} and {Networks} ({DSN})}, + author = {Hu, Xin and Jang, Jiyong and Stoecklin, Marc Ph. and Wang, Ting and Schales, Douglas L. and Kirat, Dhilung and Rao, Josyula R.}, + month = jun, + year = {2016}, + note = {ISSN: 2158-3927}, + keywords = {Anomaly Detection, Beaconing Detection, Communication channels, Computer security, Discrete Fourier transforms, Intrusion Detection, Malware, Robustness, Signal Processing, Time-frequency analysis}, + pages = {479--490}, + file = {Full Text PDF:files/38/Hu et al. 
- 2016 - BAYWATCH Robust Beaconing Detection to Identify Infected Hosts in Large-Scale Enterprise Networks.pdf:application/pdf;IEEE Xplore Abstract Record:files/37/7579765.html:text/html}, } + + @article{o1992dynamic, - title={A dynamic network architecture}, - author={O'Malley, Sean W and Peterson, Larry L}, - journal={ACM Transactions on Computer Systems (TOCS)}, - volume={10}, - number={2}, - pages={110--143}, - year={1992}, - publisher={ACM New York, NY, USA} +title = {A dynamic network architecture}, + volume = {10}, + issn = {0734-2071}, + url = {https://dl.acm.org/doi/10.1145/128899.128901}, + doi = {10.1145/128899.128901}, + abstract = {Network software is a critical component of any distributed system. Because of its complexity, network software is commonly layered into a hierarchy of protocols, or more generally, into a protocol graph. Typical protocol graphs—including those standardized in the ISO and TCP/IP network architectures—share three important properties; the protocol graph is simple, the nodes of the graph (protocols) encapsulate complex functionality, and the topology of the graph is relatively static. This paper describes a new way to organize network software that differs from conventional architectures in all three of these properties. In our approach, the protocol graph is complex, individual protocols encapsulate a single function, and the topology of the graph is dynamic. The main contribution of this paper is to describe the ideas behind our new architecture, illustrate the advantages of using the architecture, and demonstrate that the architecture results in efficient network software.}, + number = {2}, + urldate = {2025-03-06}, + journal = {ACM Trans. Comput. Syst.}, + author = {O'Malley, Sean W. and Peterson, Larry L.}, + month = may, + year = {1992}, + pages = {110--143}, + file = {Full Text PDF:files/40/O'Malley and Peterson - 1992 - A dynamic network architecture.pdf:application/pdf}, } + + @inproceedings{krawczyk2013security, - title={On the security of the TLS protocol: A systematic analysis}, - author={Krawczyk, Hugo and Paterson, Kenneth G and Wee, Hoeteck}, - booktitle={Annual Cryptology Conference}, - pages={429--448}, - year={2013}, - organization={Springer} + address = {Berlin, Heidelberg}, + title = {On the {Security} of the {TLS} {Protocol}: {A} {Systematic} {Analysis}}, + isbn = {978-3-642-40041-4}, + shorttitle = {On the {Security} of the {TLS} {Protocol}}, + doi = {10.1007/978-3-642-40041-4_24}, + abstract = {TLS is the most widely-used cryptographic protocol on the Internet. It comprises the TLS Handshake Protocol, responsible for authentication and key establishment, and the TLS Record Protocol, which takes care of subsequent use of those keys to protect bulk data. In this paper, we present the most complete analysis to date of the TLS Handshake protocol and its application to data encryption (in the Record Protocol). We show how to extract a key-encapsulation mechanism (KEM) from the TLS Handshake Protocol, and how the security of the entire TLS protocol follows from security properties of this KEM when composed with a secure authenticated encryption scheme in the Record Protocol. The security notion we achieve is a variant of the ACCE notion recently introduced by Jager et al. (Crypto ’12). Our approach enables us to analyse multiple different key establishment methods in a modular fashion, including the first proof of the most common deployment mode that is based on RSA PKCS \#1v1.5 encryption, as well as Diffie-Hellman modes. 
Our results can be applied to settings where mutual authentication is provided and to the more common situation where only server authentication is applied.}, + language = {en}, + booktitle = {Advances in {Cryptology} – {CRYPTO} 2013}, + publisher = {Springer}, + author = {Krawczyk, Hugo and Paterson, Kenneth G. and Wee, Hoeteck}, + editor = {Canetti, Ran and Garay, Juan A.}, + year = {2013}, + keywords = {Cryptology ePrint Archive, Encryption Scheme, Mutual Authentication, Random Oracle Model, Transport Layer Security}, + pages = {429--448}, + file = {Full Text PDF:files/42/Krawczyk et al. - 2013 - On the Security of the TLS Protocol A Systematic Analysis.pdf:application/pdf}, } + + @inproceedings{thomas1997team, - title={Team-based access control (TMAC) a primitive for applying role-based access controls in collaborative environments}, - author={Thomas, Roshan K}, - booktitle={Proceedings of the second ACM workshop on Role-based access control}, - pages={13--19}, - year={1997} + address = {Fairfax, Virginia, United States}, + title = {Team-based access control ({TMAC}): a primitive for applying role-based access controls in collaborative environments}, + isbn = {978-0-89791-985-2}, + shorttitle = {Team-based access control ({TMAC})}, + url = {http://portal.acm.org/citation.cfm?doid=266741.266748}, + doi = {10.1145/266741.266748}, + language = {en}, + urldate = {2025-03-06}, + booktitle = {Proceedings of the second {ACM} workshop on {Role}-based access control - {RBAC} '97}, + publisher = {ACM Press}, + author = {Thomas, Roshan K.}, + year = {1997}, + pages = {13--19}, + file = {Full Text PDF:files/47/Thomas - 1997 - Team-based access control (TMAC) a primitive for applying role-based access controls in collaborati.pdf:application/pdf}, } + + @inproceedings{zhang2023global, - author = {Zhang, Yizhe and Dong, Hongying and Nottingham, Alastair and Buchanan, Molly and Brown, Donald E. and Sun, Yixin}, - title = {Global Analysis with Aggregation-based Beaconing Detection across Large Campus Networks}, - booktitle = {ACSAC '23: Proceedings of the 39th Annual Computer Security Applications Conference}, - pages = {565--579}, - year = {2023}, - doi = {10.1145/3627106.3627126} + address = {New York, NY, USA}, + series = {{ACSAC} '23}, + title = {Global {Analysis} with {Aggregation}-based {Beaconing} {Detection} across {Large} {Campus} {Networks}}, + isbn = {979-8-4007-0886-2}, + url = {https://dl.acm.org/doi/10.1145/3627106.3627126}, + doi = {10.1145/3627106.3627126}, + abstract = {We present a new approach to effectively detect and prioritize malicious beaconing activities in large campus networks by profiling the server activities through aggregated signals across multiple traffic protocols and networks. Key components of our system include a novel time-series analysis algorithm that uncovers hidden periodicity in aggregated signals, and a ranking-based detection pipeline that utilizes self-training and active-learning techniques. We evaluate our detection system on 10 months of real-world traffic collected at two large campus networks, comprising over 75 billion connections. On a daily average, we detect 43\% more periodic domains by aggregating signals across multiple networks compared to single-network analysis. 
Furthermore, our ranking pipeline successfully identifies 1,387 unique malicious domains, out of which 781 (56\%) were unknown to the major online threat intelligence platform, VirusTotal, at the time of our detection.}, + urldate = {2025-03-06}, + booktitle = {Proceedings of the 39th {Annual} {Computer} {Security} {Applications} {Conference}}, + publisher = {Association for Computing Machinery}, + author = {Zhang, Yizhe and Dong, Hongying and Nottingham, Alastair and Buchanan, Molly and Brown, Donald E. and Sun, Yixin}, + month = dec, + year = {2023}, + pages = {565--579}, + file = {Full Text PDF:files/49/Zhang et al. - 2023 - Global Analysis with Aggregation-based Beaconing Detection across Large Campus Networks.pdf:application/pdf}, } + + @inproceedings{apruzzese2017identifying, - title={Identifying malicious hosts involved in periodic communications}, - author={Apruzzese, Giovanni and Marchetti, Mirco and Colajanni, Michele and Zoccoli, Gabriele Gambigliani and Guido, Alessandro}, - booktitle={2017 IEEE 16th International Symposium on Network Computing and Applications (NCA)}, - pages={1--8}, - year={2017}, - organization={IEEE} + title = {Identifying malicious hosts involved in periodic communications}, + url = {https://ieeexplore.ieee.org/abstract/document/8171326}, + doi = {10.1109/NCA.2017.8171326}, + abstract = {After many research efforts, Network Intrusion Detection Systems still have much room for improvement. This paper proposes a novel method for automatic and timely analysis of traffic generated by large networks, which is able to identify malicious external hosts even if their activities do not raise any alert by existing defensive systems. Our proposal focuses on periodic communications, since our experimental evaluation shows that they are more related to malicious activities, and it can be easily integrated with other detection systems. We highlight that periodic network activities can occur at very different intervals ranging from seconds to hours, hence a timely analysis of long time-windows of the traffic generated by large organizations is a challenging task in itself. Existing work is primarily focused on identifying botnets, whereas the method proposed in this paper has a broader target and aims to detect external hosts that are likely involved in any malicious operation. Since malware-related network activities can be considered as rare events in the overall traffic, the output of the proposed method is a manageable graylist of external hosts that are characterized by a considerably higher likelihood of being malicious compared to the entire set of external hosts contacted by the monitored large network. A thorough evaluation on a real large network traffic demonstrates the effectiveness of our proposal, which is capable of automatically selecting only dozens of suspicious hosts from hundreds of thousands, thus allowing security operators to focus their analyses on few likely malicious targets.}, + urldate = {2025-03-06}, + booktitle = {2017 {IEEE} 16th {International} {Symposium} on {Network} {Computing} and {Applications} ({NCA})}, + author = {Apruzzese, Giovanni and Marchetti, Mirco and Colajanni, Michele and Zoccoli, Gabriele Gambigliani and Guido, Alessandro}, + month = oct, + year = {2017}, + keywords = {Malware, beaconing, clustering, graylist, Monitoring, Noise measurement, periodicity, Proposals, Security, Time series analysis}, + pages = {1--8}, + file = {Full Text PDF:files/51/Apruzzese et al. 
- 2017 - Identifying malicious hosts involved in periodic communications.pdf:application/pdf;IEEE Xplore Abstract Record:files/52/8171326.html:text/html}, } + + @article{seo2018abnormal, - title={Abnormal behavior detection to identify infected systems using the APChain algorithm and behavioral profiling}, - author={Seo, Jungwoo and Lee, Sangjin}, - journal={Security and Communication Networks}, - volume={2018}, - number={1}, - pages={9706706}, - year={2018}, - publisher={Wiley Online Library} + title = {Abnormal {Behavior} {Detection} to {Identify} {Infected} {Systems} {Using} the {APChain} {Algorithm} and {Behavioral} {Profiling}}, + volume = {2018}, + issn = {1939-0122}, + url = {https://onlinelibrary.wiley.com/doi/10.1155/2018/9706706}, + doi = {10.1155/2018/9706706}, + abstract = {Recent cyber-attacks have used unknown malicious code or advanced attack techniques, such as zero-day attacks, making them extremely difficult to detect using traditional intrusion detection systems....}, + language = {en}, + number = {1}, + urldate = {2025-03-06}, + journal = {Security and Communication Networks}, + author = {Seo, Jungwoo and Lee, Sangjin}, + month = jan, + year = {2018}, + note = {Publisher: John Wiley \& Sons, Ltd}, + pages = {9706706}, + file = {Full Text PDF:files/56/Seo and Lee - 2018 - Abnormal Behavior Detection to Identify Infected Systems Using the APChain Algorithm and Behavioral.pdf:application/pdf}, } + + @inproceedings{huynh2016uncovering, - title={Uncovering periodic network signals of cyber attacks}, - author={Huynh, Ngoc Anh and Ng, Wee Keong and Ulmer, Alex and Kohlhammer, J{\"o}rn}, - booktitle={2016 IEEE Symposium on Visualization for Cyber Security (VizSec)}, - pages={1--8}, - year={2016}, - organization={IEEE} + title = {Uncovering periodic network signals of cyber attacks}, + url = {https://ieeexplore.ieee.org/abstract/document/7739581}, + doi = {10.1109/VIZSEC.2016.7739581}, + abstract = {This paper addresses the problem of detecting the presence of malware that leaveperiodictraces innetworktraffic. This characteristic behavior of malware was found to be surprisingly prevalent in a parallel study. To this end, we propose a visual analytics solution that supports both automatic detection and manual inspection of periodic signals hidden in network traffic. The detected periodic signals are visually verified in an overview using a circular graph and two stacked histograms as well as in detail using deep packet inspection. Our approach offers the capability to detect complex periodic patterns, but avoids the unverifiability issue often encountered in related work. The periodicity assumption imposed on malware behavior is a relatively weak assumption, but initial evaluations with a simulated scenario as well as a publicly available network capture demonstrate its applicability.}, + urldate = {2025-03-06}, + booktitle = {2016 {IEEE} {Symposium} on {Visualization} for {Cyber} {Security} ({VizSec})}, + author = {Anh Huynh, Ngoc and Keong Ng, Wee and Ulmer, Alex and Kohlhammer, Jörn}, + month = oct, + year = {2016}, + keywords = {Malware, Time series analysis, Detectors, Fourier transform, Fourier transforms, Intrusion detection, IP networks, Periodic traces, Stacked histogram, Visual analytics, Visualization}, + pages = {1--8}, + file = {Full Text PDF:files/58/Anh Huynh et al. 
- 2016 - Uncovering periodic network signals of cyber attacks.pdf:application/pdf;IEEE Xplore Abstract Record:files/59/7739581.html:text/html}, } + + @misc{jang2021detecting, - title={Detecting malicious beaconing communities using lockstep detection and co-occurrence graph}, - author={Jang, Jiyong and Kirat, Dhilung Hang and Kwon, Bum Jun and Schales, Douglas Lee and Stoecklin, Marc Philippe}, - year={2021}, - month=jan # "~5", - publisher={Google Patents}, - note={US Patent 10,887,323} + title = {Detecting malicious beaconing communities using lockstep detection and co-occurrence graph}, + url = {https://patents.google.com/patent/US10887323/en}, + nationality = {US}, + language = {en}, + assignee = {International Business Machines Corp}, + number = {US10887323B2}, + urldate = {2025-03-06}, + author = {Jang, Jiyong and Kirat, Dhilung Hang and Kwon, Bum Jun and Schales, Douglas Lee and Stoecklin, Marc Philippe}, + month = jan, + year = {2021}, + keywords = {computer, data, implemented method, processing, providing}, + file = {Full Text PDF:files/61/Jang et al. - 2021 - Detecting malicious beaconing communities using lockstep detection and co-occurrence graph.pdf:application/pdf}, } + + @article{talib2022apt, - title={APT beaconing detection: A systematic review}, - author={Talib, Manar Abu and Nasir, Qassim and Nassif, Ali Bou and Mokhamed, Takua and Ahmed, Nafisa and Mahfood, Bayan}, - journal={Computers \& Security}, - volume={122}, - pages={102875}, - year={2022}, - publisher={Elsevier} + title = {{APT} beaconing detection: {A} systematic review}, + volume = {122}, + issn = {01674048}, + shorttitle = {{APT} beaconing detection}, + url = {https://linkinghub.elsevier.com/retrieve/pii/S0167404822002693}, + doi = {10.1016/j.cose.2022.102875}, + abstract = {Advanced Persistent Threat (APT) is a type of threat that has grabbed the attention of researchers, particularly in the industrial security ï¬eld. APTs are cyber intrusions carried out by skilled and well-resourced adversaries who target speciï¬c information in high-proï¬le organizations and governments, frequently as part of a multi-phase long-term operation. One of the phases of the APT process is the command-andcontrol (C\&C) phase, also known as beaconing. Beaconing is an important part of an APT lifecycle, where the adversaries establish channels with the compromised hosts in the targeted system, allowing them to launch additional attacks. Detecting and predicting this stage is therefore a practical way to guard against APTs. This paper discusses the techniques and methods used to detect APTs and also speciï¬cally to identify beaconing, either during the APT lifecycle or not. In it, we determine various artiï¬cial intelligence algorithms used for detecting, analyzing and comparing characteristics of datasets and data sources used to implement these detection techniques. Moreover, we present the strengths and challenges of various APT/beaconing detection methods. Finally, this study outlines many cybersecurity vendor projects that have been created to identify APT or beaconing operations, categorized according to the detection approach utilized.}, + language = {en}, + urldate = {2025-03-06}, + journal = {Computers \& Security}, + author = {Abu Talib, Manar and Nasir, Qassim and Bou Nassif, Ali and Mokhamed, Takua and Ahmed, Nafisa and Mahfood, Bayan}, + month = nov, + year = {2022}, + pages = {102875}, + file = {PDF:files/68/Abu Talib et al. 
- 2022 - APT beaconing detection A systematic review.pdf:application/pdf}, } + + @Misc{influxdb2023, title = {InfluxDB 3.0 System Architecture}, author = "{InfluxData}", @@ -99,151 +194,303 @@ note = {Accessed: 2024-08-13} } + + @article{guo2023review, - title={A review of Machine Learning-based zero-day attack detection: Challenges and future directions}, - author={Guo, Yang}, - journal={Computer communications}, - volume={198}, - pages={175--185}, - year={2023}, - publisher={Elsevier} + title = {A review of {Machine} {Learning}-based zero-day attack detection: {Challenges} and future directions}, + volume = {198}, + issn = {0140-3664}, + shorttitle = {A review of {Machine} {Learning}-based zero-day attack detection}, + url = {https://www.sciencedirect.com/science/article/pii/S0140366422004248}, + doi = {10.1016/j.comcom.2022.11.001}, + abstract = {Zero-day attacks exploit unknown vulnerabilities so as to avoid being detected by cybersecurity detection tools. The studies (Bilge and DumitraÅŸ, 2012, Google, 0000, Ponemon Sullivan Privacy Report, 2020) show that zero-day attacks are wide spread and are one of the major threats to computer security. The traditional signature-based detection method is not effective in detecting zero-day attacks as the signatures of zero-day attacks are typically not available beforehand. Machine Learning (ML)-based detection method is capable of capturing attacks’ statistical characteristics and is, hence, promising for zero-day attack detection. In this survey paper, a comprehensive review of ML-based zero-day attack detection approaches is conducted, and their ML models, training and testing data sets used, and evaluation results are compared. While significant efforts have been put forth to develop accurate and robust zero-attack detection tools, the existing methods fall short in accuracy, recall, and uniformity against different types of zero-day attacks. Major challenges toward the ML-based methods are identified and future research directions are recommended at last.}, + urldate = {2025-03-06}, + journal = {Computer Communications}, + author = {Guo, Yang}, + month = jan, + year = {2023}, + keywords = {Attack detection, Machine Learning, Zero-day attacks}, + pages = {175--185}, + file = {PubMed Central Full Text PDF:files/77/Guo - 2023 - A review of Machine Learning-based zero-day attack detection Challenges and future directions.pdf:application/pdf;ScienceDirect Snapshot:files/76/S0140366422004248.html:text/html}, } + + @article{caputo2013going, - title={Going spear phishing: Exploring embedded training and awareness}, - author={Caputo, Deanna D and Pfleeger, Shari Lawrence and Freeman, Jesse D and Johnson, M Eric}, - journal={IEEE security \& privacy}, - volume={12}, - number={1}, - pages={28--38}, - year={2013}, - publisher={IEEE} + title = {Going {Spear} {Phishing}: {Exploring} {Embedded} {Training} and {Awareness}}, + volume = {12}, + issn = {1558-4046}, + shorttitle = {Going {Spear} {Phishing}}, + url = {https://ieeexplore.ieee.org/abstract/document/6585241}, + doi = {10.1109/MSP.2013.106}, + abstract = {To explore the effectiveness of embedded training, researchers conducted a large-scale experiment that tracked workers' reactions to a series of carefully crafted spear phishing emails and a variety of immediate training and awareness activities. Based on behavioral science findings, the experiment included four different training conditions, each of which used a different type of message framing. 
The results from three trials showed that framing had no significant effect on the likelihood that a participant would click a subsequent spear phishing email and that many participants either clicked all links or none regardless of whether they received training. The study was unable to determine whether the embedded training materials created framing changes on susceptibility to spear phishing attacks because employees failed to read the training materials.}, + number = {1}, + urldate = {2025-03-06}, + journal = {IEEE Security \& Privacy}, + author = {Caputo, Deanna D. and Pfleeger, Shari Lawrence and Freeman, Jesse D. and Johnson, M. Eric}, + month = jan, + year = {2014}, + note = {Conference Name: IEEE Security \& Privacy}, + keywords = {Computer security, behavioral science, Behavioral science, Electronic mail, Embedded system, embedded training, Large-scale systems, Phishing, security awareness, spear phishing, User centered design}, + pages = {28--38}, + file = {Full Text PDF:files/79/Caputo et al. - 2014 - Going Spear Phishing Exploring Embedded Training and Awareness.pdf:application/pdf;IEEE Xplore Abstract Record:files/80/6585241.html:text/html}, } + + @article{eisenberg2018network, - title={Network foundation for command and control (C2) systems: literature review}, - author={Eisenberg, Daniel A and Alderson, David L and Kitsak, Maksim and Ganin, Alexander and Linkov, Igor}, - journal={IEEE Access}, - volume={6}, - pages={68782--68794}, - year={2018}, - publisher={IEEE} + title = {Network {Foundation} for {Command} and {Control} ({C2}) {Systems}: {Literature} {Review}}, + volume = {6}, + issn = {2169-3536}, + shorttitle = {Network {Foundation} for {Command} and {Control} ({C2}) {Systems}}, + url = {https://ieeexplore.ieee.org/abstract/document/8535018}, + doi = {10.1109/ACCESS.2018.2873328}, + abstract = {We review the command and control (C2) literature to develop a comprehensive understanding of C2 systems and identify network evaluation methods. C2 is the recursive process of sharing the intent of decision-makers across organizations, turning intent into action, monitoring success, and adjusting goals to meet changing needs. Although substantial C2 research exists, the literature remains isolated by application context, and advances are rarely integrated. Our review identifies research in military, emergency response, civilian infrastructure, and management literature that inform the analysis of C2 systems. We organize C2 research with theory from Network Centric Warfare and complex systems to integrate knowledge across broad disciplines and applications. The review organizes studies across four interrelated domains (i.e., physical, information, social, and cognitive), presents system design and evaluation constraints across subsystems, and offers practical considerations for advancing C2 theory. The review also catalogues network evaluation methods used to study C2 agility, i.e., the ability to successfully effect, cope with, and/or exploit changes in circumstances. Together, this paper supports the organizing, integration, and advancement of knowledge for the influential, yet broad research subject of C2.}, + urldate = {2025-03-06}, + journal = {IEEE Access}, + author = {Eisenberg, Daniel A. and Alderson, David L. 
and Kitsak, Maksim and Ganin, Alexander and Linkov, Igor},
+ year = {2018},
+ note = {Conference Name: IEEE Access},
+ keywords = {Agility, Analytical models, command and control, Command and control systems, Complex systems, Knowledge engineering, multilayer networks, network science, Nonhomogeneous media, Organizations, system architecture, Task analysis},
+ pages = {68782--68794},
+ file = {Full Text PDF:files/83/Eisenberg et al. - 2018 - Network Foundation for Command and Control (C2) Systems Literature Review.pdf:application/pdf;IEEE Xplore Abstract Record:files/82/8535018.html:text/html},
 }
+
+
 @incollection{charan2021dmapt,
-  title={Dmapt: Study of data mining and machine learning techniques in advanced persistent threat attribution and detection},
-  author={Charan, PV Sai and Anand, P Mohan and Shukla, Sandeep K},
-  booktitle={Data Mining-Concepts and Applications},
-  year={2021},
-  publisher={IntechOpen}
+ title = {{DMAPT}: {Study} of {Data} {Mining} and {Machine} {Learning} {Techniques} in {Advanced} {Persistent} {Threat} {Attribution} and {Detection}},
+ isbn = {978-1-83969-267-3},
+ shorttitle = {{DMAPT}},
+ url = {https://www.intechopen.com/chapters/77974},
+ abstract = {Modern-day malware is intelligent enough to hide its presence and perform stealthy operations in the background. Advance Persistent Threat (APT) is one such kind of malware attack on sensitive corporate and banking networks to stay there for a long time undetected. In real-time corporate networks, identifying the presence of intruders is a big challenging task for security experts. Recent APT attacks like Carbanak, The Big Bang, and Red Echo attack (targeting the Indian power sector) are ringing alarms globally. New data exfiltration methods and advancements in malware techniques are the two main reasons for rapid and robust APT evolution. Although many traditional and hybrid methods are available to detect this stealthy malware, the number of target-specific attacks are increasing rapidly at global level. Attackers have been crafting payloads resistant to malware sandbox environments so that traditional sandboxing techniques may not work with these APT malware detection. In this paper, we shed light on various Data Mining, Machine Learning techniques and frameworks used in both Attribution and Detection of APT malware. Added to this, our work highlight GAP analysis and need for paradigm shift in existing techniques to deal with evolving modern APT malware.},
+ language = {en},
+ urldate = {2025-03-06},
+ booktitle = {Data {Mining} - {Concepts} and {Applications}},
+ publisher = {IntechOpen},
+ author = {Charan, P. V. Sai and Anand, P. Mohan and Shukla, Sandeep K.},
+ month = aug,
+ year = {2021},
+ doi = {10.5772/intechopen.99291},
+ file = {Full Text PDF:files/85/Charan et al.
- 2021 - DMAPT Study of Data Mining and Machine Learning Techniques in Advanced Persistent Threat Attributio.pdf:application/pdf}, } + + @inproceedings{hagan2018peer, - title={Peer based tracking using multi-tuple indexing for network traffic analysis and malware detection}, - author={Hagan, Matthew and Kang, BooJoong and McLaughlin, Kieran and Sezer, Sakir}, - booktitle={2018 16th Annual Conference on Privacy, Security and Trust (PST)}, - pages={1--5}, - year={2018}, - organization={IEEE} + title = {Peer {Based} {Tracking} using {Multi}-{Tuple} {Indexing} for {Network} {Traffic} {Analysis} and {Malware} {Detection}}, + url = {https://ieeexplore.ieee.org/abstract/document/8514165}, + doi = {10.1109/PST.2018.8514165}, + abstract = {Traditional firewalls, Intrusion Detection Systems(IDS) and network analytics tools extensively use the `flow' connection concept, consisting of five `tuples' of source and destination IP, ports and protocol type, for classification and management of network activities. By analysing flows, information can be obtained from TCP/IP fields and packet content to give an understanding of what is being transferred within a single connection. As networks have evolved to incorporate more connections and greater bandwidth, particularly from “always on†IoT devices and video and data streaming, so too have malicious network threats, whose communication methods have increased in sophistication. As a result, the concept of the 5 tuple flow in isolation is unable to detect such threats and malicious behaviours. This is due to factors such as the length of time and data required to understand the network traffic behaviour, which cannot be accomplished by observing a single connection. To alleviate this issue, this paper proposes the use of additional, two tuple and single tuple flow types to associate multiple 5 tuple communications, with generated metadata used to profile individual connnection behaviour. This proposed approach enables advanced linking of different connections and behaviours, developing a clearer picture as to what network activities have been taking place over a prolonged period of time. To demonstrate the capability of this approach, an expert system rule set has been developed to detect the presence of a multi-peered ZeuS botnet, which communicates by making multiple connections with multiple hosts, thus undetectable to standard IDS systems observing 5 tuple flow types in isolation. Finally, as the solution is rule based, this implementation operates in realtime and does not require post-processing and analytics of other research solutions. This paper aims to demonstrate possible applications for next generation firewalls and methods to acquire additional information from network traffic.}, + urldate = {2025-03-06}, + booktitle = {2018 16th {Annual} {Conference} on {Privacy}, {Security} and {Trust} ({PST})}, + author = {Hagan, Matthew and Kang, BooJoong and McLaughlin, Kieran and Sezer, Sakir}, + month = aug, + year = {2018}, + keywords = {Malware, IP networks, Command and control systems, 5-tuple flow tables, Botnet, Indexes, Network Behavioural detection, Next generation firewall, Protocols, Standards, Zeus botnet}, + pages = {1--5}, + file = {Full Text PDF:files/87/Hagan et al. 
- 2018 - Peer Based Tracking using Multi-Tuple Indexing for Network Traffic Analysis and Malware Detection.pdf:application/pdf;IEEE Xplore Abstract Record:files/88/8514165.html:text/html},
 }
+
+
+
 @article{shalaginov2016malware,
-  title={Malware beaconing detection by mining large-scale dns logs for targeted attack identification},
-  author={Shalaginov, Andrii and Franke, Katrin and Huang, Xiongwei},
-  journal={International Journal of Computer and Systems Engineering},
-  volume={10},
-  number={4},
-  pages={743--755},
-  year={2016}
+ title = {Malware {Beaconing} {Detection} by {Mining} {Large}-scale {DNS} {Logs} for {Targeted} {Attack} {Identification}},
+ abstract = {One of the leading problems in Cyber Security today is the emergence of targeted attacks conducted by adversaries with access to sophisticated tools. These attacks usually steal senior level employee system privileges, in order to gain unauthorized access to confidential knowledge and valuable intellectual property. Malware used for initial compromise of the systems are sophisticated and may target zero-day vulnerabilities. In this work we utilize common behaviour of malware called "beacon", which implies that infected hosts communicate to Command and Control servers at regular intervals that have relatively small time variations. By analysing such beacon activity through passive network monitoring, it is possible to detect potential malware infections. So, we focus on time gaps as indicators of possible C2 activity in targeted enterprise networks. We represent DNS log files as a graph, whose vertices are destination domains and edges are timestamps. Then by using four periodicity detection algorithms for each pair of internal-external communications, we check timestamp sequences to identify the beacon activities. Finally, based on the graph structure, we infer the existence of other infected hosts and malicious domains enrolled in the attack activities},
+ journal = {International Journal of Computer and Systems Engineering},
+ volume = {10},
+ number = {4},
+ author = {Shalaginov, Andrii and Franke, Katrin and Huang, Xiongwei},
+ month = apr,
+ year = {2016},
+ pages = {743--755},
+ file = {PDF:files/95/Shalaginov et al. - 2016 - Malware Beaconing Detection by Mining Large-scale DNS Logs for Targeted Attack Identification.pdf:application/pdf},
 }
+
+
 @inproceedings{yeh2018malware,
-  title={A malware beacon of botnet by local periodic communication behavior},
-  author={Yeh, Yi-Ren and Tu, Tang Chen and Sun, Ming-Kung and Pi, Shih Ming and Huang, C-Y},
-  booktitle={2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC)},
-  volume={2},
-  pages={653--657},
-  year={2018},
-  organization={IEEE}
+ title = {A {Malware} {Beacon} of {Botnet} by {Local} {Periodic} {Communication} {Behavior}},
+ volume = {02},
+ url = {https://ieeexplore.ieee.org/abstract/document/8377941},
+ doi = {10.1109/COMPSAC.2018.10313},
+ abstract = {Botnets are one of most serious threats in cyber security. Many previous studies have been proposed for botnet detection. Among those approaches, one of main tracks focuses on extracting informative features from network traffic flows. Nevertheless, most features of interest are extracted from the information of a single connection, such as flow duration, flow packet size etc. In this paper, we proposed an novel feature, which is able to detect a long-term behavior of botnets. More specifically, we aim to extract a malware beacon from the periodic communication between bots and bot master.
+  abstract = {Botnets are one of most serious threats in cyber security. Many previous studies have been proposed for botnet detection. Among those approaches, one of main tracks focuses on extracting informative features from network traffic flows. Nevertheless, most features of interest are extracted from the information of a single connection, such as flow duration, flow packet size etc. In this paper, we proposed an novel feature, which is able to detect a long-term behavior of botnets. More specifically, we aim to extract a malware beacon from the periodic communication between bots and bot master. Besides the regular communication pattern, we also explore several types of botnet behavior to leverage the effectiveness of the proposed feature. Our experimental results show that our proposed periodic communication signature could be one of effective features for detecting compromised devices.},
+  urldate = {2025-03-06},
+  booktitle = {2018 {IEEE} 42nd {Annual} {Computer} {Software} and {Applications} {Conference} ({COMPSAC})},
+  author = {Yeh, Yi-Ren and Tu, Tang Chen and Sun, Ming-Kung and Pi, Shih Ming and Huang, C.-Y.},
+  month = jul,
+  year = {2018},
+  note = {ISSN: 0730-3157},
+  keywords = {Malware, IP networks, Electronic mail, Botnet, feature extraction, Feature extraction, Microsoft Windows, periodic communication, Servers},
+  pages = {653--657},
+  file = {Full Text PDF:files/97/Yeh et al. - 2018 - A Malware Beacon of Botnet by Local Periodic Communication Behavior.pdf:application/pdf;IEEE Xplore Abstract Record:files/98/8377941.html:text/html},
 }
+
+
 @inproceedings{enright2022learning,
-  title={A learning-based zero-trust architecture for 6g and future networks},
-  author={Enright, Michael A and Hammad, Eman and Dutta, Ashutosh},
-  booktitle={2022 IEEE Future Networks World Forum (FNWF)},
-  pages={64--71},
-  year={2022},
-  organization={IEEE}
+  title = {A {Learning}-{Based} {Zero}-{Trust} {Architecture} for {6G} and {Future} {Networks}},
+  url = {https://ieeexplore.ieee.org/abstract/document/10056611},
+  doi = {10.1109/FNWF55208.2022.00020},
+  abstract = {In the evolution of 6G and Future Networks, a dynamic, flexible, learning-based security architecture will be essential with the ability to handle both current and evolving cybersecurity threats. This is specially critical with future networks' increased reliance on distributed learning-based approaches for operation. To address this challenge, a distributed learning framework must provide security and trust in an integrated fashion. In contrast to existing approach such as federated learning (FL), that update parameters of a shared model, this work proposes an architecture that is capable of integrating advanced learning with real-time digital forensics, e.g. monitoring compute and storage resources. With real-time monitoring, it is possible to develop a learning-based, real-time Zero-Trust Architecture (ZTA) to achieve the high levels of security. The proposed architecture, serves as a framework to enable and spur innovation, where new machine learning based techniques can be developed for enhanced real-time, adaptive and proactive security, thus, embedding future networks' security with learning-based ZTA elements.},
+  urldate = {2025-03-06},
+  booktitle = {2022 {IEEE} {Future} {Networks} {World} {Forum} ({FNWF})},
+  author = {Enright, Michael A. and Hammad, Eman and Dutta, Ashutosh},
+  month = oct,
+  year = {2022},
+  note = {ISSN: 2770-7679},
+  keywords = {6G, 6G mobile communication, Computer architecture, cyber-security, Digital forensics, Distance learning, distributed computing, Federated learning, future networks, machine learning, Network security, privacy, Technological innovation, zero trust architecture},
+  pages = {64--71},
+  file = {Full Text PDF:files/100/Enright et al. - 2022 - A Learning-Based Zero-Trust Architecture for 6G and Future Networks.pdf:application/pdf;IEEE Xplore Abstract Record:files/101/10056611.html:text/html},
 }
+
+
 @inproceedings{van2022deepcase,
-  title={Deepcase: Semi-supervised contextual analysis of security events},
-  author={Van Ede, Thijs and Aghakhani, Hojjat and Spahn, Noah and Bortolameotti, Riccardo and Cova, Marco and Continella, Andrea and van Steen, Maarten and Peter, Andreas and Kruegel, Christopher and Vigna, Giovanni},
-  booktitle={2022 IEEE Symposium on Security and Privacy (SP)},
-  pages={522--539},
-  year={2022},
-  organization={IEEE}
+  title = {{DEEPCASE}: {Semi}-{Supervised} {Contextual} {Analysis} of {Security} {Events}},
+  shorttitle = {{DEEPCASE}},
+  url = {https://ieeexplore.ieee.org/document/9833671/?arnumber=9833671},
+  doi = {10.1109/SP46214.2022.9833671},
+  abstract = {Security monitoring systems detect potentially malicious activities in IT infrastructures, by either looking for known signatures or for anomalous behaviors. Security operators investigate these events to determine whether they pose a threat to their organization. In many cases, a single event may be insufficient to determine whether certain activity is indeed malicious. Therefore, a security operator frequently needs to correlate multiple events to identify if they pose a real threat. Unfortunately, the vast number of events that need to be correlated often overload security operators, forcing them to ignore some events and, thereby, potentially miss attacks. This work studies how to automatically correlate security events and, thus, automate parts of the security operator workload. We design and evaluate DEEPCASE, a system that leverages the context around events to determine which events require further inspection. This approach reduces the number of events that need to be inspected. In addition, the context provides valuable insights into why certain events are classified as malicious. We show that our approach automatically filters 86.72\% of the events and reduces the manual workload of security operators by 90.53\%, while underestimating the risk of potential threats in less than 0.001\% of cases.},
+  urldate = {2025-03-06},
+  booktitle = {2022 {IEEE} {Symposium} on {Security} and {Privacy} ({SP})},
+  author = {Ede, Thijs van and Aghakhani, Hojjat and Spahn, Noah and Bortolameotti, Riccardo and Cova, Marco and Continella, Andrea and Steen, Maarten van and Peter, Andreas and Kruegel, Christopher and Vigna, Giovanni},
+  month = may,
+  year = {2022},
+  note = {ISSN: 2375-1207},
+  keywords = {Monitoring, Security, Detectors, Organizations, alert-reduction, Behavioral sciences, Inspection, intrusion-detection, Privacy, security-operations-center},
+  pages = {522--539},
+  file = {Full Text PDF:files/106/Ede et al. - 2022 - DEEPCASE Semi-Supervised Contextual Analysis of Security Events.pdf:application/pdf;IEEE Xplore Abstract Record:files/105/9833671.html:text/html},
 }
+
+
 @inproceedings{ongun2021portfiler,
-  title={PORTFILER: port-level network profiling for self-propagating malware detection},
-  author={Ongun, Talha and Spohngellert, Oliver and Miller, Benjamin and Boboila, Simona and Oprea, Alina and Eliassi-Rad, Tina and Hiser, Jason and Nottingham, Alastair and Davidson, Jack and Veeraraghavan, Malathi},
-  booktitle={2021 IEEE Conference on Communications and Network Security (CNS)},
-  pages={182--190},
-  year={2021},
-  organization={IEEE}
+  title = {{PORTFILER}: {Port}-{Level} {Network} {Profiling} for {Self}-{Propagating} {Malware} {Detection}},
+  shorttitle = {{PORTFILER}},
+  url = {https://ieeexplore.ieee.org/document/9705045/?arnumber=9705045},
+  doi = {10.1109/CNS53000.2021.9705045},
+  abstract = {Recent self-propagating malware (SPM) campaigns compromised hundred of thousands of victim machines on the Internet. It is challenging to detect these attacks in their early stages, as adversaries utilize common network services, use novel techniques, and can evade existing detection mechanisms. We propose PorTFILER (PORT-Level Network Traffic ProFILER), a new machine learning system applied to network traffic for detecting SPM attacks. PORTFILER extracts port-level features from the Zeek connection logs collected at a border of a monitored network, applies anomaly detection techniques to identify suspicious events, and ranks the alerts across ports for investigation by the Security Operations Center (SOC). We propose a novel ensemble methodology for aggregating individual models in PORTFILER that increases resilience against several evasion strategies compared to standard ML baselines. We extensively evaluate PorTFILER on traffic collected from two university networks, and show that it can detect SPM attacks with different patterns, such as WannaCry and Mirai, and performs well under evasion. Ranking across ports achieves precision over 0.94 and false positive rates below 8 $\times$ 10$^{\textrm{-4}}$ in the top 100 highly ranked alerts. When deployed on the university networks, PorTFILER detected anomalous SPM-like activity on one of the campus networks, confirmed by the university SOC as malicious. PortFILER also detected a Mirai attack recreated on the two university networks with higher precision and recall than deep-learning based autoencoder methods.},
+  urldate = {2025-03-06},
+  booktitle = {2021 {IEEE} {Conference} on {Communications} and {Network} {Security} ({CNS})},
+  author = {Ongun, Talha and Spohngellert, Oliver and Miller, Benjamin and Boboila, Simona and Oprea, Alina and Eliassi-Rad, Tina and Hiser, Jason and Nottingham, Alastair and Davidson, Jack and Veeraraghavan, Malathi},
+  month = oct,
+  year = {2021},
+  keywords = {Malware, Standards, Feature extraction, Network security, Deep learning, Internet, malware detection, security analytics, self-propagating malware, Telecommunication traffic, traffic profiling},
+  pages = {182--190},
+  file = {Full Text PDF:files/109/Ongun et al. - 2021 - PORTFILER Port-Level Network Profiling for Self-Propagating Malware Detection.pdf:application/pdf;IEEE Xplore Abstract Record:files/108/9705045.html:text/html},
 }
+
+
 @article{niu2020malware,
-  title={Malware on internet of uavs detection combining string matching and fourier transformation},
-  author={Niu, Weina and Zhang, Xiyue and Zhang, Xiaosong and Du, Xiaojiang and Huang, Xiaoming and Guizani, Mohsen and others},
-  journal={IEEE Internet of Things Journal},
-  volume={8},
-  number={12},
-  pages={9905--9919},
-  year={2020},
-  publisher={IEEE}
+  title = {Malware on {Internet} of {UAVs} {Detection} {Combining} {String} {Matching} and {Fourier} {Transformation}},
+  volume = {8},
+  issn = {2327-4662},
+  url = {https://ieeexplore.ieee.org/document/9220767/?arnumber=9220767},
+  doi = {10.1109/JIOT.2020.3029970},
+  abstract = {Advanced persistent threat (APT), with intense penetration, long duration, and high customization, has become one of the most grievous threats to cybersecurity. Furthermore, the design and development of Internet-of-Things (IoT) devices often do not focus on security, leading APT to extend to IoT, such as the Internet of emerging unmanned aerial vehicles (UAVs). Whether malware with attack payload can be successfully implanted into UAVs or not is the key to APT on the Internet of UAVs. APT malware on UAVs establishes communication with the command and control (C\&C) server to achieve remote control for UAVs-aware information stealing. Existing effective methods detect malware by analyzing malicious behaviors generated during C\&C communication. However, APT malware usually adopts a low-traffic attack mode, a large amount of normal traffic is mixed in each attack step, to avoid virus checking and killing. Therefore, it is difficult for traditional malware detection methods to discover APT malware on UAVs that carry weak abnormal signals. Fortunately, we found that most APT attacks use domain name system (DNS) to locate C\&C server of malware for information transmission periodically. This behavior will leave some records in the network flow and DNS logs, which provides us with an opportunity to identify infected internal UAVs and external malicious domain names. This article proposes an APT malware on the Internet of UAVs detection method combining string matching and Fourier transformation based on DNS traffic, which is able to handle encrypted and obfuscated traffic due to packet payloads independence. We preprocessed the collected network traffic by converting DNS timestamps of DNS request to strings and used the trained random forest model to discover APT malware domain names based on features extracted through string-matching-based periodicity detection and Fourier transformation-based periodicity detection. The proposed method has been evaluated on the data set, including part of normal domains from the normal traffic and malicious domains marked by security experts from APT malware traffic. Experimental results have shown that our proposed detection method can achieve the accuracy of 94\%, which is better than the periodicity detection algorithm alone. Moreover, the proposed method does not need to set the confidence to filter the periodicity with high confidence.},
+  number = {12},
+  urldate = {2025-03-06},
+  journal = {IEEE Internet of Things Journal},
+  author = {Niu, Weina and Xiao, Jian’An and Zhang, Xiyue and Zhang, Xiaosong and Du, Xiaojiang and Huang, Xiaoming and Guizani, Mohsen},
+  month = jun,
+  year = {2021},
+  note = {Conference Name: IEEE Internet of Things Journal},
+  keywords = {Malware, Servers, Advanced persistent threat (APT), Autonomous aerial vehicles, Blocklists, Fourier transformation, Internet of Things, malware, periodicity detection, string matching, unmanned aerial vehicles (UAVs)},
+  pages = {9905--9919},
+  file = {Full Text PDF:files/112/Niu et al. - 2021 - Malware on Internet of UAVs Detection Combining String Matching and Fourier Transformation.pdf:application/pdf;IEEE Xplore Abstract Record:files/111/9220767.html:text/html},
 }
+
+
 @inproceedings{duan2018automated,
-  title={Automated generation and selection of interpretable features for enterprise security},
-  author={Duan, Jiayi and Zeng, Ziheng and Oprea, Alina and Vasudevan, Shobha},
-  booktitle={2018 IEEE International Conference on Big Data (Big Data)},
-  pages={1258--1265},
-  year={2018},
-  organization={IEEE}
+  title = {Automated {Generation} and {Selection} of {Interpretable} {Features} for {Enterprise} {Security}},
+  url = {https://ieeexplore.ieee.org/document/8621986/?arnumber=8621986},
+  doi = {10.1109/BigData.2018.8621986},
+  abstract = {We present an effective machine learning method for malicious activity detection in enterprise security logs. Our method involves feature engineering, or generating new features by applying operators on features of the raw data. We generate DNF formulas from raw features, extract Boolean functions from them, and leverage Fourier analysis to generate new parity features and rank them based on their highest Fourier coefficients. We demonstrate on real enterprise data sets that the engineered features enhance the performance of a wide range of classifiers and clustering algorithms. As compared to classification of raw data features, the engineered features achieve up to 50.6\% improvement in malicious recall, while sacrificing no more than 0.47\% in accuracy. We also observe better isolation of malicious clusters, when performing clustering on engineered features. In general, a small number of engineered features achieve higher performance than raw data features according to our metrics of interest. Our feature engineering method also retains interpretability, an important consideration in cyber security applications.},
+  urldate = {2025-03-06},
+  booktitle = {2018 {IEEE} {International} {Conference} on {Big} {Data} ({Big} {Data})},
+  author = {Duan, Jiayi and Zeng, Ziheng and Oprea, Alina and Vasudevan, Shobha},
+  month = dec,
+  year = {2018},
+  keywords = {Malware, Security, Fourier transforms, Feature extraction, Boolean functions, Clustering algorithms, Training data},
+  pages = {1258--1265},
+  file = {Full Text PDF:files/115/Duan et al. - 2018 - Automated Generation and Selection of Interpretable Features for Enterprise Security.pdf:application/pdf;IEEE Xplore Abstract Record:files/114/8621986.html:text/html},
 }
+
+
 @inproceedings{haffey2018modeling,
-  title={Modeling, analysis, and characterization of periodic traffic on a campus edge network},
-  author={Haffey, Mackenzie and Arlitt, Martin and Williamson, Carey},
-  booktitle={2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)},
-  pages={170--182},
-  year={2018},
-  organization={IEEE}
+  title = {Modeling, {Analysis}, and {Characterization} of {Periodic} {Traffic} on a {Campus} {Edge} {Network}},
+  url = {https://ieeexplore.ieee.org/document/8526883/?arnumber=8526883},
+  doi = {10.1109/MASCOTS.2018.00025},
+  abstract = {Traffic in today's edge networks is diverse, exhibiting many different patterns. This paper focuses on periodic network traffic, which is often used by known network services (e.g., Network Time Protocol, Akamai CDN) as well as by malicious applications (e.g., botnets, vulnerability scanning). We use a simple and flexible SQL-based approach as our computational model for detecting periodic traffic, and apply it to the analysis of seven weeks of Bro connection logs from a campus edge network. Our results show that periodic traffic analysis is effective for detecting P2P, gaming, cloud, scanning, and botnet traffic flows, which often exhibit periodic network communications. We present a classification taxonomy for periodic traffic, and provide an in-depth characterization of this traffic on our campus edge network.},
+  urldate = {2025-03-06},
+  booktitle = {2018 {IEEE} 26th {International} {Symposium} on {Modeling}, {Analysis}, and {Simulation} of {Computer} and {Telecommunication} {Systems} ({MASCOTS})},
+  author = {Haffey, Mackenzie and Arlitt, Martin and Williamson, Carey},
+  month = sep,
+  year = {2018},
+  note = {ISSN: 2375-0227},
+  keywords = {Time series analysis, Analytical models, Computational modeling, Correlation, Image edge detection, Network traffic measurement, Periodic traffic, Reactive power, Spectral analysis, workload characterization},
+  pages = {170--182},
+  file = {Full Text PDF:files/118/Haffey et al. - 2018 - Modeling, Analysis, and Characterization of Periodic Traffic on a Campus Edge Network.pdf:application/pdf;IEEE Xplore Abstract Record:files/117/8526883.html:text/html},
 }
+
+
 @inproceedings{oprea2018made,
-  title={Made: Security analytics for enterprise threat detection},
-  author={Oprea, Alina and Li, Zhou and Norris, Robin and Bowers, Kevin},
-  booktitle={Proceedings of the 34th Annual Computer Security Applications Conference},
-  pages={124--136},
-  year={2018}
+  address = {San Juan PR USA},
+  title = {{MADE}: {Security} {Analytics} for {Enterprise} {Threat} {Detection}},
+  isbn = {978-1-4503-6569-7},
+  shorttitle = {{MADE}},
+  url = {https://dl.acm.org/doi/10.1145/3274694.3274710},
+  doi = {10.1145/3274694.3274710},
+  abstract = {Enterprises are targeted by various malware activities at a staggering rate. To counteract the increased sophistication of cyber attacks, most enterprises deploy within their perimeter a number of security technologies, including firewalls, anti-virus software, and web proxies, as well as specialized teams of security analysts forming Security Operations Centers (SOCs).},
+  language = {en},
+  urldate = {2025-03-06},
+  booktitle = {Proceedings of the 34th {Annual} {Computer} {Security} {Applications} {Conference}},
+  publisher = {ACM},
+  author = {Oprea, Alina and Li, Zhou and Norris, Robin and Bowers, Kevin},
+  month = dec,
+  year = {2018},
+  pages = {124--136},
+  file = {PDF:files/119/Oprea et al. - 2018 - MADE Security Analytics for Enterprise Threat Detection.pdf:application/pdf},
 }
+
+
 @inproceedings{ukrop2019will,
-  title={Will you trust this tls certificate? perceptions of people working in it},
-  author={Ukrop, Martin and Kraus, Lydia and Matyas, Vashek and Wahsheh, Heider Ahmad Mutleq},
-  booktitle={Proceedings of the 35th annual computer security applications conference},
-  pages={718--731},
-  year={2019}
+  address = {San Juan Puerto Rico USA},
+  title = {Will you trust this {TLS} certificate?: perceptions of people working in {IT}},
+  isbn = {978-1-4503-7628-0},
+  shorttitle = {Will you trust this {TLS} certificate?},
+  url = {https://dl.acm.org/doi/10.1145/3359789.3359800},
+  doi = {10.1145/3359789.3359800},
+  abstract = {Flawed TLS certificates are not uncommon on the Internet. While they signal a potential issue, in most cases they have benign causes (e.g., misconfiguration or even deliberate deployment). This adds fuzziness to the decision on whether to trust a connection or not. Little is known about perceptions of flawed certificates by IT professionals, even though their decisions impact high numbers of end users. Moreover, it is unclear how much does the content of error messages and documentation influence these perceptions.},
+  language = {en},
+  urldate = {2025-03-06},
+  booktitle = {Proceedings of the 35th {Annual} {Computer} {Security} {Applications} {Conference}},
+  publisher = {ACM},
+  author = {Ukrop, Martin and Kraus, Lydia and Matyas, Vashek and Wahsheh, Heider Ahmad Mutleq},
+  month = dec,
+  year = {2019},
+  pages = {718--731},
+  file = {PDF:files/123/Ukrop et al. - 2019 - Will you trust this TLS certificate perceptions of people working in IT.pdf:application/pdf},
+ }
+
+
 @inproceedings{vissers2017exploring,
-  title={Exploring the ecosystem of malicious domain registrations in the. eu tld},
-  author={Vissers, Thomas and Spooren, Jan and Agten, Pieter and Jumpertz, Dirk and Janssen, Peter and Van Wesemael, Marc and Piessens, Frank and Joosen, Wouter and Desmet, Lieven},
-  booktitle={Research in Attacks, Intrusions, and Defenses: 20th International Symposium, RAID 2017, Atlanta, GA, USA, September 18--20, 2017, Proceedings},
-  pages={472--493},
-  year={2017},
-  organization={Springer}
+  address = {Cham},
+  title = {Exploring the {Ecosystem} of {Malicious} {Domain} {Registrations} in the .eu {TLD}},
+  isbn = {978-3-319-66332-6},
+  doi = {10.1007/978-3-319-66332-6_21},
+  abstract = {This study extensively scrutinizes 14 months of registration data to identify large-scale malicious campaigns present in the .eu TLD. We explore the ecosystem and modus operandi of elaborate cybercriminal entities that recurrently register large amounts of domains for one-shot, malicious use. Although these malicious domains are short-lived, by incorporating registrant information, we establish that at least 80.04\% of them can be framed in to 20 larger campaigns with varying duration and intensity. We further report on insights in the operational aspects of this business and observe, amongst other findings, that their processes are only partially automated. Finally, we apply a post-factum clustering process to validate the campaign identification process and to automate the ecosystem analysis of malicious registrations in a TLD zone.},
+  language = {en},
+  booktitle = {Research in {Attacks}, {Intrusions}, and {Defenses}},
+  publisher = {Springer International Publishing},
+  author = {Vissers, Thomas and Spooren, Jan and Agten, Pieter and Jumpertz, Dirk and Janssen, Peter and Van Wesemael, Marc and Piessens, Frank and Joosen, Wouter and Desmet, Lieven},
+  editor = {Dacier, Marc and Bailey, Michael and Polychronakis, Michalis and Antonakakis, Manos},
+  year = {2017},
+  pages = {472--493},
+  file = {PDF:files/126/Vissers et al. - 2017 - Exploring the Ecosystem of Malicious Domain Registrations in the .eu TLD.pdf:application/pdf},
 }