Selected Completed Doctoral Theses
Duc C. Le
A machine learning based framework for user-centered insider threat detection: Aug 2021 (co-supervised with Prof. Nur Zincir-Heywood)
Abstract: Insider threat represents a major cyber-security challenge to companies, organizations, and government agencies. Harmful actions in insider threats are performed by authorized users within an organization. Because an insider is authorized to access the organization's computer systems and knows its security procedures, insider threats are difficult to detect. Further challenges include unbalanced data, limited ground truth, and possible changes in user behaviour. This research proposes a comprehensive machine learning based framework for insider threat detection, spanning data pre-processing, a combination of supervised and unsupervised learning, and deep analysis with meaningful result reporting.
For the data pre-processing step, the framework introduces a data extraction approach that produces numerical feature vectors representing user activities from heterogeneous data, at different levels of data granularity and under different temporal data representations, thereby enabling the application of machine learning. In the initial detection step of the framework, where no ground truth is assumed to be available, unsupervised learning methods with different working principles, as well as unsupervised ensembles, are explored for anomaly detection to identify anomalous user behaviours that may indicate insider threats. Furthermore, the framework employs supervised and semi-supervised machine learning under limited ground truth availability and real-world conditions to maximize the effectiveness of limited training data and to detect insider threats with high precision. Throughout the thesis, realistic evaluation and comprehensive result reporting are performed to facilitate understanding of the framework's performance under real-world conditions. Evaluation results on publicly available datasets show the effectiveness of the proposed approach: high insider threat detection rates are achieved at very low false positive rates. The robustness of the detection models is also demonstrated, and comparisons with the state of the art confirm the advantages of the approach.
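To make the unsupervised detection step concrete, the following is a minimal sketch of an anomaly-scoring ensemble over user-activity feature vectors, assuming feature extraction has already been performed. The detectors (scikit-learn's IsolationForest and LocalOutlierFactor), the rank-based score normalization, and the synthetic data are illustrative assumptions rather than the thesis's actual components.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Placeholder user-day feature vectors; rows 0-4 are injected anomalies.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
X[:5] += 6.0
X = StandardScaler().fit_transform(X)

# Detector 1: Isolation Forest (negated so higher = more anomalous).
iso = IsolationForest(random_state=0).fit(X)
s_iso = -iso.score_samples(X)

# Detector 2: Local Outlier Factor (same sign convention after negation).
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
s_lof = -lof.negative_outlier_factor_

def rank_normalize(s):
    # Map scores to [0, 1] by rank so different detectors are comparable.
    return s.argsort().argsort() / (len(s) - 1)

# Unsupervised ensemble: average the rank-normalized scores and flag the
# highest-scoring user-days for analyst review.
ensemble = (rank_normalize(s_iso) + rank_normalize(s_lof)) / 2
flagged = np.argsort(ensemble)[-10:]
print("most anomalous user-days:", sorted(flagged.tolist()))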
Alexander Loginov
On increasing the scope of Genetic Programming trading agents: June 2020
Abstract: This research investigates the potential for widening the scope of Genetic Programming (GP) trading agents beyond constructing decision trees for buy-hold-sell decisions. First, both technical indicators (temporal feature construction) and decision trees (action selection) are co-evolved under the machine learning paradigm of GP, and the benefit of setting Stop-Loss and Take-Profit orders using retracement levels is demonstrated. GP trading agents are then used to design trading portfolios under a frequent intraday trading scenario. Such a scenario implies that transaction costs have a more significant impact on profitability and that investment decisions can be revised frequently. Furthermore, existing long-term portfolio selection algorithms cannot guarantee optimal asset selection for intraday trading, motivating a different approach to asset selection. The proposed algorithm identifies a subset of assets to trade on the next day and generates buy-hold-sell decisions for each selected asset in real time. A benchmarking comparison of ranking heuristics is conducted against the popular Kelly Criterion, and a strong preference for the proposed Moving Sharpe ratio is demonstrated. Moreover, the evolved portfolios perform significantly better than any of the comparator methods (a buy-and-hold strategy, investment in the full set of 86 stocks, portfolios built from random stock selection, and the Kelly Criterion). Transaction costs (explicit, and implicit or hidden) are important, yet often overlooked, attributes of any trading system. The impact of hidden costs (the bid-ask spread) is investigated. The nature of bid-ask spreads (fixed or floating) is demonstrated to be important for the effectiveness of the automated trading system, and a floating spread is shown to have a more significant impact than a fixed spread. Finally, the proposed GP framework was assessed on non-financial streaming data. This is significant because it provides the basis for comparing the proposed GP framework to alternative machine learning methods specifically designed to operate under a prequential model of evaluation. The GP framework is shown to provide classification performance competitive with currently established methods for streaming classification, thus demonstrating its general effectiveness.
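The Moving Sharpe ratio named above can be read as a Sharpe ratio computed over a sliding window of recent returns and used to rank assets for the next trading day. The sketch below illustrates that reading; the window length, the zero risk-free rate, and the synthetic price series are assumptions, not the thesis's exact formulation.

import numpy as np

def moving_sharpe(returns, window=20):
    # Sharpe ratio over the most recent `window` per-period returns
    # (risk-free rate assumed zero for simplicity).
    recent = returns[-window:]
    sd = recent.std(ddof=1)
    return recent.mean() / sd if sd > 0 else 0.0

def select_assets(price_history, k=5):
    # Rank assets by moving Sharpe and keep the top k for the next day.
    scores = {}
    for symbol, prices in price_history.items():
        rets = np.diff(prices) / prices[:-1]   # simple per-period returns
        scores[symbol] = moving_sharpe(rets)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# 86 synthetic price series standing in for the stock universe.
rng = np.random.default_rng(1)
history = {f"S{i}": 100 * np.cumprod(1 + rng.normal(0.0005, 0.01, 250))
           for i in range(86)}
print(select_assets(history))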
Sara Khanchi
Stream genetic programming for botnet detection: Nov 2019 (co-supervised with Prof. Nur Zincir-Heywood)
Abstract: Algorithms for constructing classification models in streaming data scenarios are attracting increasing attention in the era of artificial intelligence and machine learning for data analysis. The huge volumes of streaming data necessitate a learning framework capable of timely and accurate processing. For a streaming classifier to be deployed in the real world, multiple challenges exist, such as (1) concept drift, (2) imbalanced data, and (3) costly labelling processes. These challenges become more critical when they occur in sensitive fields of operation such as network security. The objective of this thesis is to provide a team-based genetic programming (GP) framework to explore and address these challenges with regard to network-based services. The GP classifier incrementally introduces changes to the model throughout the course of the stream in order to adapt to the stream's content. The framework is based on an active learning approach in which the learning process happens in interaction with a data subset used to build a model. Thus, the design of the system is founded on the introduction of sampling and archiving policies that decouple the stream distribution from the training data subset. These policies work with no prior information on the distribution of classes or the true labels. Benchmarking is conducted on real-world network security datasets with label budgets on the order of 5 to 0.5 percent and significant class imbalance. Evaluations of the detection of minority classes, which represents classifier behaviour in the case of attacks, are also performed. Comparisons against current streaming algorithms, and specifically against state-of-the-art network frameworks for stream processing under label budgets, demonstrate the effectiveness of the proposed GP framework in addressing the challenges of streaming data. Furthermore, the applicability of the proposed framework in network and security analytics is demonstrated.
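As an illustration of learning under a label budget, the sketch below shows a sampling policy that requests true labels for only a fixed fraction of arriving records and archives them as the training subset. The uniform-random sampling and replacement rules are simplifying assumptions standing in for the thesis's sampling and archiving policies.

import random

class BudgetedArchive:
    def __init__(self, budget=0.005, capacity=1000, seed=0):
        self.budget = budget      # e.g. 0.5% of the stream may be labelled
        self.capacity = capacity  # size of the training subset (archive)
        self.archive = []         # labelled points, decoupled from the stream
        self.rng = random.Random(seed)

    def offer(self, record, label_oracle):
        # Spend label budget on this record with probability `budget`.
        if self.rng.random() >= self.budget:
            return
        labelled = (record, label_oracle(record))
        if len(self.archive) < self.capacity:
            self.archive.append(labelled)
        else:
            # Archive full: replace a uniformly chosen entry.
            self.archive[self.rng.randrange(self.capacity)] = labelled

arch = BudgetedArchive(budget=0.05)
for x in range(100_000):                      # synthetic stream
    arch.offer(x, label_oracle=lambda r: r % 2)
print("archived", len(arch.archive), "labelled records")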
Stephen Kelly
Scaling Genetic Programming to Challenging Reinforcement Tasks through Emergent Modularity: June 2018
Abstract: Algorithms that learn through environmental interaction and delayed rewards,
or reinforcement learning, increasingly face the challenge of scaling to dynamic,
high-dimensional environments. Video games model these types of real-world decision making and
control scenarios while being simple enough to implement within experiments. This work
demonstrates how emergent modularity and open-ended evolution allow genetic programming (GP)
to discover strategies for difficult gaming scenarios while maintaining relatively low model
complexity. Two related learning algorithms are considered: Policy Trees and Tangled Program
Graphs (TPG).
In the case of Policy Trees, a methodology for transfer learning is proposed which
specifically leverages both structural and behavioural modularity in the learner
representation. The utility of the approach is empirically evaluated in two challenging task
domains: RoboCup Soccer and Ms. Pac-Man. In RoboCup, decision-making policies are first
evolved for simple subtasks and then reused within a policy hierarchy in order to learn the
more complex task of Half-Field Offense. The same methodology is applied to Ms. Pac-Man, in
which case the use of task-agnostic diversity maintenance enables the automatic discovery of
suitable sub-policies, removing the need for a prior human-specified task decomposition. In
both task domains, the final GP decision-making policies reach state-of-the-art levels of play
while being significantly less complex than solutions from temporal difference methods and
neuroevolution.
TPG takes a more open-ended approach to modularity, emphasizing the ability to adaptively
complexify policies through interaction with the task environment. The
challenging Atari video game environment is used to show that this approach builds
decision-making policies that broadly match the quality of several deep learning methods while
being several orders of magnitude less computationally demanding, both in terms of sample
efficiency and model complexity. Finally, the approach is capable of evolving solutions to
multiple game titles simultaneously with no additional computational cost. In this case, agent
behaviours for an individual game as well as single agents capable of playing up to 5 games
emerge from the same evolutionary run.
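To give a sense of how bidding and inter-team references produce emergent modularity, the following is a much-simplified sketch of evaluating a TPG-style policy: programs in a team bid on the current state, and the winner's action is either atomic or a pointer to another team. The linear bid functions and cycle handling are illustrative assumptions, not the published TPG algorithm.

import random

rng = random.Random(0)

class Program:
    def __init__(self, action):
        self.w = [rng.uniform(-1, 1) for _ in range(4)]  # toy linear bid
        self.action = action   # int (atomic action) or Team (graph edge)

    def bid(self, state):
        return sum(wi * si for wi, si in zip(self.w, state))

class Team:
    def __init__(self, programs):
        self.programs = programs

    def act(self, state, visited=None):
        visited = set() if visited is None else visited
        visited.add(id(self))                  # guard against graph cycles
        best = max(self.programs, key=lambda p: p.bid(state))
        if isinstance(best.action, Team):
            if id(best.action) in visited:
                return 0                       # cycle fallback action
            return best.action.act(state, visited)
        return best.action

leaf = Team([Program(a) for a in (0, 1, 2)])   # atomic actions only
root = Team([Program(0), Program(leaf)])       # may defer to `leaf`
print(root.act([0.3, -0.2, 0.9, 0.1]))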
Ali Vahdat
Symbiotic evolutionary subspace clustering: Nov 2012
Abstract: Application domains with large attribute spaces, such as genomics and text analysis, necessitate clustering algorithms with more sophistication than traditional clustering algorithms. More sophisticated approaches are required to cope with the large dimensionality and cardinality of these data sets. Subspace clustering, a generalization of traditional clustering, identifies the attribute support for each cluster as well as the location and number of clusters. In the most general case, the attributes associated with each cluster could be unique. The proposed algorithm, Symbiotic Evolutionary Subspace Clustering (S-ESC), borrows from 'symbiosis' in the sense that each clustering solution is defined in terms of a host (a single member of the host population) and a number of coevolved cluster centroids (or symbionts in an independent symbiont population). Symbionts define clusters and therefore attribute subspaces, whereas hosts define sets of clusters that constitute a non-degenerate solution. The symbiotic representation of S-ESC is the key to making it scalable to high-dimensional data sets, while an integrated subsampling process makes it scalable to tasks with a large number of data items. A bi-objective evolutionary method is proposed to identify the unique attribute support of each cluster while detecting its data instances. Benchmarking is performed against a well-known test suite of subspace clustering data sets with four well-known comparator algorithms from both the full-dimensional and subspace clustering literature (EM, MINECLUS, PROCLUS, and STATPC), as well as a generic genetic algorithm-based subspace clustering. Performance of the S-ESC algorithm was found to be robust across a wide cross-section of properties with a common parameterization utilized throughout. This was not the case for the comparator algorithms: specifically, performance could be sensitive to a particular data distribution, or parameter sweeps might be necessary to provide comparable performance. A comparison is also made relative to a non-symbiotic genetic algorithm in which each individual represents the set of clusters comprising a subspace clustering solution. Benchmarking indicates that the proposed symbiotic framework is again superior. The S-ESC code and data sets are publicly available.
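The host-symbiont representation can be illustrated directly: a symbiont pairs a centroid with its own attribute subset, and a host is a set of indices into the symbiont population, naming one candidate subspace clustering. The sketch below shows this representation with a plain nearest-centroid assignment; the evolutionary operators and bi-objective fitness of S-ESC are omitted, and all names and parameters are illustrative.

import numpy as np

class Symbiont:
    # One cluster: a centroid restricted to its own attribute subset.
    def __init__(self, attrs, centroid):
        self.attrs = attrs          # indices of supported attributes
        self.centroid = centroid    # centroid values on those attributes

    def distance(self, x):
        return float(np.linalg.norm(x[self.attrs] - self.centroid))

def assign(X, symbionts, host):
    # Label each row of X with the nearest of the host's clusters.
    chosen = [symbionts[i] for i in host]
    return np.array([min(range(len(chosen)),
                         key=lambda k: chosen[k].distance(x)) for x in X])

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))                  # high-dimensional data
pop = [Symbiont(rng.choice(50, size=5, replace=False), rng.normal(size=5))
       for _ in range(30)]                      # symbiont population
host = [0, 7, 19]                               # one candidate solution
print(np.bincount(assign(X, pop, host)))        # cluster sizes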
Peter Lichodzijewski
A Symbiotic Bid-Based Framework for Problem Decomposition using Genetic Programming: Feb 2011
Abstract: (para 1 of 3)
This thesis investigates the use of symbiosis as an evolutionary metaphor for problem decomposition using Genetic Programming. It begins by drawing a connection between lateral problem decomposition, in which peers with similar capabilities coordinate their actions, and vertical problem decomposition, whereby solution subcomponents are organized into increasingly complex units of organization. The two types of problem decomposition are then associated with context learning and layered learning, respectively. The thesis proposes the Symbiotic Bid-Based framework, modeled after a three-stage process of symbiosis abstracted from biological evolution. As such, it is argued, the approach has the capacity for both types of problem decomposition.
Gunes Kayacik
Can the best defense be a good offense? Evolving (mimicry) attacks for detector vulnerability testing under a 'black-box' assumption: 2009 (co-supervised with Prof. Nur Zincir-Heywood)
Abstract: (para 1 of 3)
This thesis proposes a 'black-box' approach for automating attack generation by way of Evolutionary Computation. The proposed 'black-box' approach employs only the anomaly rate or detection feedback from the detector. Assuming 'black-box' access in vulnerability testing presents a scenario different from a 'white-box' access assumption, since the attacker does not possess sufficient knowledge to constrain the scope of the attack. As such, this thesis contributes a 'black-box' vulnerability testing tool for identifying detector weaknesses, aiding detector research in designing detectors that are robust against evasion attacks.
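The core loop can be sketched as an evolutionary search that treats the detector as an opaque scoring function: candidate attacks are mutated, and selection minimizes the detector's anomaly rate. The byte-string representation, the toy detector, and truncation selection below are illustrative assumptions; preserving the attack's malicious behaviour during evolution is not modelled here.

import random

rng = random.Random(3)

def anomaly_rate(attack):
    # Toy detector feedback: fraction of 'suspicious' high bytes.
    return sum(b > 127 for b in attack) / len(attack)

def mutate(attack):
    i = rng.randrange(len(attack))
    return attack[:i] + bytes([rng.randrange(256)]) + attack[i + 1:]

def evolve(seed_attack, pop_size=50, gens=100):
    pop = [mutate(seed_attack) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=anomaly_rate)        # only signal: detector feedback
        survivors = pop[: pop_size // 2]  # truncation selection
        pop = survivors + [mutate(rng.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=anomaly_rate)

seed = bytes(rng.randrange(256) for _ in range(64))
best = evolve(seed)
print(f"anomaly rate: seed={anomaly_rate(seed):.2f} evolved={anomaly_rate(best):.2f}")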
Andrew R. McIntyre
Novelty Detection + Coevolution = Automatic Problem Decomposition: A framework for scalable Genetic Programming Classifiers: Nov 2007
Abstract:
A novel approach to the classification of large and unbalanced multi-class data sets is presented, where the widely acknowledged issues of scalability, solution transparency, and problem decomposition are addressed simultaneously within the context of the Genetic Programming (GP) paradigm. A cooperative coevolutionary training environment that employs multi-objective evaluation provides the basis for problem decomposition and reduced solution complexity. Scalability is achieved through a Pareto competitive coevolutionary framework, allowing the system to be readily applied to large data sets without recourse to hardware-specific speedups. A key departure from the canonical GP approach to classification involves expressing the output of GP in terms of a non-binary, local membership function (Gaussian), where it is no longer necessary for an expression to represent an entire class. Decomposition is then achieved through reformulating the classification problem as one of cluster consistency, where individuals learn to associate subsets of training exemplars with each cluster. Classification problems are now solved by several specialist classifiers as opposed to a single 'super' individual. Finally, although multi-objective methods have been reported previously for GP classification domains, we explicitly formulate the objectives for cooperative behavior. Without this, the user is left to choose a single individual as the overall solution from a front of solutions. This work is able to utilize the entire front of solutions without recourse to heuristics for selecting one individual over another or duplicating behaviors between different classifiers.
Extensive benchmarking is performed against related frameworks for classification, including Genetic Programming, Neural Networks, and deterministic methods. In contrast to classifiers evolved using competitive coevolution alone, we demonstrate the ability of the proposed coevolutionary model to provide a non-overlapping decomposition, or association between learners and exemplars, while returning statistically significant improvements in classifier performance. In the case of the Neural Network methods, benchmarking is conducted against the more challenging second-order neural learning algorithm of conjugate gradient optimization (previous comparisons limit Neural Network training to first-order methods). The ensuing comparison indicates that most data sets benefit from the proposed algorithm, which remains competitive even against well-established deterministic algorithms such as C4.5.
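The local membership idea can be sketched briefly: a GP individual's raw output is passed through a Gaussian membership function, so the individual only claims exemplars whose outputs fall near its learned region instead of representing an entire class. The particular stand-in expression and parameters below are illustrative assumptions, not evolved solutions.

import math

def gaussian_membership(y, mu, sigma):
    # Degree to which raw GP output y belongs to this specialist's region.
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2))

def gp(x):
    # Stand-in for an evolved GP expression.
    return 2.0 * x[0] - x[1]

# Two specialist classifiers, each tuned to a region of GP output space.
specialists = [("class A", 0.0, 0.5), ("class B", 3.0, 0.5)]

for x in ([0.1, 0.2], [1.6, 0.1]):
    y = gp(x)
    label, score = max(((name, gaussian_membership(y, mu, sd))
                        for name, mu, sd in specialists),
                       key=lambda t: t[1])
    print(x, "->", label, f"(membership {score:.2f})")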
Garnett C. Wilson
Probabilistic Adaptive Mapping Developmental Genetic Programming: March 2007
Abstract:
Developmental Genetic Programming (DGP) algorithms explicitly enable the search
space for a problem to be divided into genotypes and corresponding phenotypes. The
two search spaces are often connected with a genotype-phenotype mapping (GPM)
intended to model the biological genetic code, where current implementations of this
concept involve evolution of the mappings along with evolution of the genotype
solutions. This work presents the Probabilistic Adaptive Mapping DGP (PAM DGP)
algorithm, a new developmental implementation that provides research contributions in
the areas of GPMs and coevolution. The algorithm component of PAM DGP is
demonstrated to overcome coevolutionary performance problems as identified and
empirically benchmarked against the latest competing Adaptive Mapping algorithm with
both algorithms using the same (non-redundant) mapping encoding process. Having
established that PAM DGP provides a superior algorithmic framework given equivalent
mapping and genotype structures for the individuals, a new adaptive redundant mapping
is incorporated into PAM DGP for further performance enhancement and closer
adherence to developmental modeling of the biological code. PAM DGP with two
mapping types is then compared to the competing Adaptive Mapping algorithm and
Traditional GP with respect to three regression benchmarks. PAM DGP using redundant
mappings is then applied to two medical classification domains, where PAM DGP with
redundant encodings is found to provide better classifier performance than the alternative
algorithms. PAM DGP with redundant mappings is also given the task of learning three
sequences of increasing recursion order, given a function set consisting of general (not
implicitly recursive) machine-language instructions; it is found to learn second- and
third-order recursive Fibonacci functions more efficiently than the related developmental
systems and Traditional GP. PAM DGP using redundant encoding is also demonstrated
to produce the semantically highest quality solutions for all three recursive functions
considered (Factorial, second and third order Fibonacci). PAM DGP is shown for
regression, medical classification, and recursive problems to have produced its solutions
by evolving redundant mappings to emphasize appropriate members within relevant
subsets of the problem's original function set.
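A genotype-phenotype mapping with redundancy can be sketched as a separately maintained table that translates genotype symbols into function-set members, where several symbols may decode to the same function, allowing the mapping to emphasize some functions over others. The symbol width, toy function set, and static table below are illustrative assumptions; PAM DGP's probabilistic adaptation of the mapping is omitted.

import random

rng = random.Random(4)

FUNCTIONS = ["add", "sub", "mul", "div"]      # toy function set

def random_mapping(n_symbols=16):
    # Redundant mapping: 16 genotype symbols onto only 4 functions, so
    # several symbols can decode to the same function.
    return [rng.choice(FUNCTIONS) for _ in range(n_symbols)]

def decode(genotype, mapping):
    # Phenotype = genotype symbols pushed through the mapping table.
    return [mapping[g % len(mapping)] for g in genotype]

mapping = random_mapping()
genotype = [rng.randrange(16) for _ in range(8)]
print("genotype :", genotype)
print("phenotype:", decode(genotype, mapping))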
Samuel Howse
NummSquared 2006a0 Explained, Including a New Well-Founded Functional Foundation for Logic, Mathematics and Computer Science: October 2006
Abstract:
Set theory is the standard foundation for mathematics, but often does not include rules of reduction
for function calls. Therefore, for computer science, the untyped lambda calculus or type theory
is usually preferred. The untyped lambda calculus (and several improvements on it) make functions
fundamental, but suffer from non-terminating reductions and have partially non-classical logics.
Type theory is a good foundation for logic, mathematics and computer science, except that, by making
both types and functions fundamental, it is more complex than either set theory or the untyped
lambda calculus. This document proposes a new foundational formal language called NummSquared that
makes only functions fundamental, while simultaneously ensuring that reduction terminates, having a
classical logic, and attempting to follow set theory as much as possible. NummSquared builds on
earlier works by John von Neumann in 1925 and Roger Bishop Jones in 1998 that have perhaps not
received sufficient attention in computer science. A soundness theorem for NummSquared is proved.
Usual set theory, the work of Jones, and NummSquared are all well-founded. NummSquared improves upon
the works of von Neumann and Jones by having reduction and proof, by supporting computation and
reflection, and by having an interpreter called NsGo (work in progress) so the language can be
practically used. NummSquared is variable-free. For enhanced reliability, NsGo is an F#/C# .NET
assembly that is mostly automatically extracted from a program of the Coq proof assistant. As a
possible step toward making formal methods appealing to a wider audience, NummSquared minimizes
constraints on the logician, mathematician or programmer. Because of coercion, there are no types,
and functions are defined and called without proof, yet reduction terminates. NummSquared supports
proofs where desired, but does not require them.