Selected Completed Doctoral Theses
Duc C. Le
A machine learning based framework for user-centered insider threat detection: Aug 2021 (co-supervised with Prof. Nur Zincir-Heywood)
Abstract: Insider threat represents a major cyber-security challenge to companies, organizations, and government agencies. Harmful actions in insider threats are performed by authorized users within an organization. Because an insider is authorized to access the organization's computer systems and knows its security procedures, insider threats are difficult to detect. Further challenges include unbalanced data, limited ground truth, and possible changes in user behaviour. This research proposes a comprehensive machine learning based framework for insider threat detection, spanning data pre-processing, a combination of supervised and unsupervised learning, and deep analysis with meaningful result reporting.
For the data pre-processing step, the framework introduces a data extraction approach that produces numerical feature vectors representing user activities from heterogeneous data, at different levels of data granularity and under different temporal data representations, thereby enabling the application of machine learning. In the initial detection step of the framework, where no ground truth is assumed to be available, unsupervised learning methods with different working principles, as well as unsupervised ensembles, are explored for anomaly detection to identify anomalous user behaviours that may indicate insider threats. Furthermore, the framework employs supervised and semi-supervised machine learning under limited ground truth availability and real-world conditions to maximize the effectiveness of limited training data and to detect insider threats with high precision. Throughout the thesis, realistic evaluation and comprehensive result reporting are performed to facilitate understanding of the framework's performance under real-world conditions. Evaluation results on publicly available datasets show the effectiveness of the proposed approach: high insider threat detection rates are achieved at very low false positive rates. The robustness of the detection models is also demonstrated, and comparisons with the state of the art confirm the advantages of the approach.
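To make the unsupervised detection step concrete, the following is a minimal sketch of an anomaly-scoring ensemble over user-activity feature vectors, assuming feature extraction has already been performed. The detectors (scikit-learn's IsolationForest and LocalOutlierFactor), the rank-based score normalization, and the synthetic data are illustrative assumptions rather than the thesis's actual components.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Placeholder user-day feature vectors; rows 0-4 are injected anomalies.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
X[:5] += 6.0
X = StandardScaler().fit_transform(X)

# Detector 1: Isolation Forest (negated so higher = more anomalous).
iso = IsolationForest(random_state=0).fit(X)
s_iso = -iso.score_samples(X)

# Detector 2: Local Outlier Factor (same sign convention after negation).
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
s_lof = -lof.negative_outlier_factor_

def rank_normalize(s):
    # Map scores to [0, 1] by rank so different detectors are comparable.
    return s.argsort().argsort() / (len(s) - 1)

# Unsupervised ensemble: average the rank-normalized scores and flag the
# highest-scoring user-days for analyst review.
ensemble = (rank_normalize(s_iso) + rank_normalize(s_lof)) / 2
flagged = np.argsort(ensemble)[-10:]
print("most anomalous user-days:", sorted(flagged.tolist()))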
Alexander Loginov
On increasing the scope of Genetic Programming trading agents: June 2020
Abstract: This research investigates the potential for widening the scope of Genetic Programming (GP) trading agents beyond constructing decision trees for buy-hold-sell decisions. First, both technical indicators (temporal feature construction) and decision trees (action selection) are co-evolved under the machine learning paradigm of GP, and the benefit of setting Stop-Loss and Take-Profit orders using retracement levels is demonstrated. GP trading agents are then used to design trading portfolios under a frequent intraday trading scenario. Such a scenario implies that transaction costs have a more significant impact on profitability and that investment decisions can be revised frequently. Furthermore, existing long-term portfolio selection algorithms cannot guarantee optimal asset selection for intraday trading, motivating a different approach to asset selection. The proposed algorithm identifies a subset of assets to trade on the next day and generates buy-hold-sell decisions for each selected asset in real time. A benchmarking comparison of ranking heuristics is conducted against the popular Kelly Criterion, and a strong preference for the proposed Moving Sharpe ratio is demonstrated. Moreover, the evolved portfolios perform significantly better than any of the comparator methods (a buy-and-hold strategy, investment in the full set of 86 stocks, portfolios built from random stock selection, and the Kelly Criterion). Transaction costs (explicit, and implicit or hidden) are important, yet often overlooked, attributes of any trading system. The impact of hidden costs (the bid-ask spread) is investigated. The nature of bid-ask spreads (fixed or floating) is demonstrated to be important for the effectiveness of the automated trading system, and a floating spread is shown to have a more significant impact than a fixed spread. Finally, the proposed GP framework was assessed on non-financial streaming data. This is significant because it provides the basis for comparing the proposed GP framework to alternative machine learning methods specifically designed to operate under a prequential model of evaluation. The GP framework is shown to provide classification performance competitive with currently established methods for streaming classification, thus demonstrating its general effectiveness.
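The Moving Sharpe ratio named above can be read as a Sharpe ratio computed over a sliding window of recent returns and used to rank assets for the next trading day. The sketch below illustrates that reading; the window length, the zero risk-free rate, and the synthetic price series are assumptions, not the thesis's exact formulation.

import numpy as np

def moving_sharpe(returns, window=20):
    # Sharpe ratio over the most recent `window` per-period returns
    # (risk-free rate assumed zero for simplicity).
    recent = returns[-window:]
    sd = recent.std(ddof=1)
    return recent.mean() / sd if sd > 0 else 0.0

def select_assets(price_history, k=5):
    # Rank assets by moving Sharpe and keep the top k for the next day.
    scores = {}
    for symbol, prices in price_history.items():
        rets = np.diff(prices) / prices[:-1]   # simple per-period returns
        scores[symbol] = moving_sharpe(rets)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# 86 synthetic price series standing in for the stock universe.
rng = np.random.default_rng(1)
history = {f"S{i}": 100 * np.cumprod(1 + rng.normal(0.0005, 0.01, 250))
           for i in range(86)}
print(select_assets(history))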
Sara Khanchi
Stream genetic programming for botnet detection: Nov 2019 (co-supervised with Prof. Nur Zincir-Heywood)
Abstract: Algorithms for constructing classification models in streaming data scenarios are attracting increasing attention in the era of artificial intelligence and machine learning for data analysis. The huge volumes of streaming data necessitate a learning framework capable of timely and accurate processing. For a streaming classifier to be deployed in the real world, multiple challenges exist, such as (1) concept drift, (2) imbalanced data, and (3) costly labelling processes. These challenges become more critical when they occur in sensitive fields of operation such as network security. The objective of this thesis is to provide a team-based genetic programming (GP) framework to explore and address these challenges with regard to network-based services. The GP classifier incrementally introduces changes to the model throughout the course of the stream in order to adapt to the stream's content. The framework is based on an active learning approach in which the learning process happens in interaction with a data subset used to build a model. Thus, the design of the system is founded on the introduction of sampling and archiving policies that decouple the stream distribution from the training data subset. These policies work with no prior information on the distribution of classes or the true labels. Benchmarking is conducted on real-world network security datasets with label budgets on the order of 5 to 0.5 percent and significant class imbalance. Evaluations of the detection of minority classes, which represents classifier behaviour in the case of attacks, are also performed. Comparisons against current streaming algorithms, and specifically against state-of-the-art network frameworks for stream processing under label budgets, demonstrate the effectiveness of the proposed GP framework in addressing the challenges of streaming data. Furthermore, the applicability of the proposed framework in network and security analytics is demonstrated.
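As an illustration of learning under a label budget, the sketch below shows a sampling policy that requests true labels for only a fixed fraction of arriving records and archives them as the training subset. The uniform-random sampling and replacement rules are simplifying assumptions standing in for the thesis's sampling and archiving policies.

import random

class BudgetedArchive:
    def __init__(self, budget=0.005, capacity=1000, seed=0):
        self.budget = budget      # e.g. 0.5% of the stream may be labelled
        self.capacity = capacity  # size of the training subset (archive)
        self.archive = []         # labelled points, decoupled from the stream
        self.rng = random.Random(seed)

    def offer(self, record, label_oracle):
        # Spend label budget on this record with probability `budget`.
        if self.rng.random() >= self.budget:
            return
        labelled = (record, label_oracle(record))
        if len(self.archive) < self.capacity:
            self.archive.append(labelled)
        else:
            # Archive full: replace a uniformly chosen entry.
            self.archive[self.rng.randrange(self.capacity)] = labelled

arch = BudgetedArchive(budget=0.05)
for x in range(100_000):                      # synthetic stream
    arch.offer(x, label_oracle=lambda r: r % 2)
print("archived", len(arch.archive), "labelled records")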
Stephen Kelly
Scaling Genetic Programming to Challenging Reinforcement Tasks through Emergent Modularity: June 2018
Abstract: Algorithms that learn through environmental interaction and delayed rewards,
or reinforcement learning, increasingly face the challenge of scaling to dynamic,
high-dimensional environments. Video games model these types of real-world decision making and
control scenarios while being simple enough to implement within experiments. This work
demonstrates how emergent modularity and open-ended evolution allow genetic programming (GP)
to discover strategies for difficult gaming scenarios while maintaining relatively low model
complexity. Two related learning algorithms are considered: Policy Trees and Tangled Program
Graphs (TPG).
In the case of Policy Trees, a methodology for transfer learning is proposed which
specifically leverages both structural and behavioural modularity in the learner
representation. The utility of the approach is empirically evaluated in two challenging task
domains: RoboCup Soccer and Ms. Pac-Man. In RoboCup, decision-making policies are first
evolved for simple subtasks and then reused within a policy hierarchy in order to learn the
more complex task of Half-Field Offense. The same methodology is applied to Ms. Pac-Man, in
which case the use of task-agnostic diversity maintenance enables the automatic discovery of
suitable sub-policies, removing the need for a prior human-specified task decomposition. In
both task domains, the final GP decision-making policies reach state-of-the-art levels of play
while being significantly less complex than solutions from temporal difference methods and
neuroevolution.
TPG takes a more open-ended approach to modularity, emphasizing the ability to adaptively
complexify policies through interaction with the task environment. The
challenging Atari video game environment is used to show that this approach builds
decision-making policies that broadly match the quality of several deep learning methods while
being several orders of magnitude less computationally demanding, both in terms of sample
efficiency and model complexity. Finally, the approach is capable of evolving solutions to
multiple game titles simultaneously with no additional computational cost. In this case, agent
behaviours for an individual game as well as single agents capable of playing up to 5 games
emerge from the same evolutionary run.
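To give a sense of how bidding and inter-team references produce emergent modularity, the following is a much-simplified sketch of evaluating a TPG-style policy: programs in a team bid on the current state, and the winner's action is either atomic or a pointer to another team. The linear bid functions and cycle handling are illustrative assumptions, not the published TPG algorithm.

import random

rng = random.Random(0)

class Program:
    def __init__(self, action):
        self.w = [rng.uniform(-1, 1) for _ in range(4)]  # toy linear bid
        self.action = action   # int (atomic action) or Team (graph edge)

    def bid(self, state):
        return sum(wi * si for wi, si in zip(self.w, state))

class Team:
    def __init__(self, programs):
        self.programs = programs

    def act(self, state, visited=None):
        visited = set() if visited is None else visited
        visited.add(id(self))                  # guard against graph cycles
        best = max(self.programs, key=lambda p: p.bid(state))
        if isinstance(best.action, Team):
            if id(best.action) in visited:
                return 0                       # cycle fallback action
            return best.action.act(state, visited)
        return best.action

leaf = Team([Program(a) for a in (0, 1, 2)])   # atomic actions only
root = Team([Program(0), Program(leaf)])       # may defer to `leaf`
print(root.act([0.3, -0.2, 0.9, 0.1]))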
Ali Vahdat
Symbiotic evolutionary subspace clustering: Nov 2012
Abstract: Application domains with large attribute spaces, such as genomics and text analysis, necessitate clustering algorithms with more sophistication than traditional clustering algorithms. More sophisticated approaches are required to cope with the large dimensionality and cardinality of these data sets. Subspace clustering, a generalization of traditional clustering, identifies the attribute support for each cluster as well as the location and number of clusters. In the most general case, the attributes associated with each cluster could be unique. The proposed algorithm, Symbiotic Evolutionary Subspace Clustering (S-ESC), borrows from 'symbiosis' in the sense that each clustering solution is defined in terms of a host (a single member of the host population) and a number of coevolved cluster centroids (or symbionts in an independent symbiont population). Symbionts define clusters and therefore attribute subspaces, whereas hosts define sets of clusters that constitute a non-degenerate solution. The symbiotic representation of S-ESC is the key to making it scalable to high-dimensional data sets, while an integrated subsampling process makes it scalable to tasks with a large number of data items. A bi-objective evolutionary method is proposed to identify the unique attribute support of each cluster while detecting its data instances. Benchmarking is performed against a well-known test suite of subspace clustering data sets with four well-known comparator algorithms from both the full-dimensional and subspace clustering literature (EM, MINECLUS, PROCLUS, and STATPC), as well as a generic genetic algorithm-based subspace clustering. Performance of the S-ESC algorithm was found to be robust across a wide cross-section of properties with a common parameterization utilized throughout. This was not the case for the comparator algorithms: specifically, performance could be sensitive to a particular data distribution, or parameter sweeps might be necessary to provide comparable performance. A comparison is also made relative to a non-symbiotic genetic algorithm in which each individual represents the set of clusters comprising a subspace clustering solution. Benchmarking indicates that the proposed symbiotic framework is again superior. The S-ESC code and data sets are publicly available.
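The host-symbiont representation can be illustrated directly: a symbiont pairs a centroid with its own attribute subset, and a host is a set of indices into the symbiont population, naming one candidate subspace clustering. The sketch below shows this representation with a plain nearest-centroid assignment; the evolutionary operators and bi-objective fitness of S-ESC are omitted, and all names and parameters are illustrative.

import numpy as np

class Symbiont:
    # One cluster: a centroid restricted to its own attribute subset.
    def __init__(self, attrs, centroid):
        self.attrs = attrs          # indices of supported attributes
        self.centroid = centroid    # centroid values on those attributes

    def distance(self, x):
        return float(np.linalg.norm(x[self.attrs] - self.centroid))

def assign(X, symbionts, host):
    # Label each row of X with the nearest of the host's clusters.
    chosen = [symbionts[i] for i in host]
    return np.array([min(range(len(chosen)),
                         key=lambda k: chosen[k].distance(x)) for x in X])

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))                  # high-dimensional data
pop = [Symbiont(rng.choice(50, size=5, replace=False), rng.normal(size=5))
       for _ in range(30)]                      # symbiont population
host = [0, 7, 19]                               # one candidate solution
print(np.bincount(assign(X, pop, host)))        # cluster sizes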
Peter Lichodzijewski
A Symbiotic Bid-Based Framework for Problem Decomposition using Genetic Programming: Feb 2011
Abstract: (para 1 of 3)
This thesis investigates the use of symbiosis as an evolutionary metaphor for problem decomposition using Genetic Programming. It begins by drawing a connection between lateral problem decomposition, in which peers with similar capabilities coordinate their actions, and vertical problem decomposition, whereby solution subcomponents are organized into increasingly complex units of organization. The two types of problem decomposition are then associated with context learning and layered learning, respectively. The thesis proposes the Symbiotic Bid-Based framework, modeled after a three-stage process of symbiosis abstracted from biological evolution. As such, it is argued, the approach has the capacity for both types of problem decomposition.
Gunes Kayacik
Can the best defense be a good offense? Evolving (mimicry) attacks for detector vulnerability testing under a 'black-box' assumption: 2009 (co-supervised with Prof. Nur Zincir-Heywood)
Abstract: (para 1 of 3)
This thesis proposes a 'black-box' approach for automating attack generation by way of Evolutionary Computation. The proposed 'black-box' approach employs only the anomaly rate or detection feedback from the detector. Assuming 'black-box' access in vulnerability testing presents a scenario different from a 'white-box' access assumption, since the attacker does not possess sufficient knowledge to constrain the scope of the attack. As such, this thesis contributes a 'black-box' vulnerability testing tool for identifying detector weaknesses, aiding detector research in designing detectors that are robust against evasion attacks.
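The core loop can be sketched as an evolutionary search that treats the detector as an opaque scoring function: candidate attacks are mutated, and selection minimizes the detector's anomaly rate. The byte-string representation, the toy detector, and truncation selection below are illustrative assumptions; preserving the attack's malicious behaviour during evolution is not modelled here.

import random

rng = random.Random(3)

def anomaly_rate(attack):
    # Toy detector feedback: fraction of 'suspicious' high bytes.
    return sum(b > 127 for b in attack) / len(attack)

def mutate(attack):
    i = rng.randrange(len(attack))
    return attack[:i] + bytes([rng.randrange(256)]) + attack[i + 1:]

def evolve(seed_attack, pop_size=50, gens=100):
    pop = [mutate(seed_attack) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=anomaly_rate)        # only signal: detector feedback
        survivors = pop[: pop_size // 2]  # truncation selection
        pop = survivors + [mutate(rng.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=anomaly_rate)

seed = bytes(rng.randrange(256) for _ in range(64))
best = evolve(seed)
print(f"anomaly rate: seed={anomaly_rate(seed):.2f} evolved={anomaly_rate(best):.2f}")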
Andrew R. McIntyre
Novelty Detection + Coevolution = Automatic Problem Decomposition: A framework for scalable Genetic Programming Classifiers: Nov 2007
Abstract:
A novel approach to the classification of large and unbalanced multi-class data sets is presented, where the widely acknowledged issues of scalability, solution transparency, and problem decomposition are addressed simultaneously within the context of the Genetic Programming (GP) paradigm. A cooperative coevolutionary training environment that employs multi-objective evaluation provides the basis for problem decomposition and reduced solution complexity. Scalability is achieved through a Pareto competitive coevolutionary framework, allowing the system to be readily applied to large data sets without recourse to hardware-specific speedups. A key departure from the canonical GP approach to classification involves expressing the output of GP in terms of a non-binary, local membership function (Gaussian), where it is no longer necessary for an expression to represent an entire class. Decomposition is then achieved through reformulating the classification problem as one of cluster consistency, where individuals learn to associate subsets of training exemplars with each cluster. Classification problems are now solved by several specialist classifiers as opposed to a single 'super' individual. Finally, although multi-objective methods have been reported previously for GP classification domains, we explicitly formulate the objectives for cooperative behavior. Without this, the user is left to choose a single individual as the overall solution from a front of solutions. This work is able to utilize the entire front of solutions without recourse to heuristics for selecting one individual over another or duplicating behaviors between different classifiers.
Extensive benchmarking is performed against related frameworks for classification, including Genetic Programming, Neural Networks, and deterministic methods. In contrast to classifiers evolved using competitive coevolution alone, we demonstrate the ability of the proposed coevolutionary model to provide a non-overlapping decomposition, or association between learners and exemplars, while returning statistically significant improvements in classifier performance. In the case of the Neural Network methods, benchmarking is conducted against the more challenging second-order neural learning algorithm of conjugate gradient optimization (previous comparisons limit Neural Network training to first-order methods). The ensuing comparison indicates that most data sets benefit from the proposed algorithm, which remains competitive even against well-established deterministic algorithms such as C4.5.
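The local membership idea can be sketched briefly: a GP individual's raw output is passed through a Gaussian membership function, so the individual only claims exemplars whose outputs fall near its learned region instead of representing an entire class. The particular stand-in expression and parameters below are illustrative assumptions, not evolved solutions.

import math

def gaussian_membership(y, mu, sigma):
    # Degree to which raw GP output y belongs to this specialist's region.
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2))

def gp(x):
    # Stand-in for an evolved GP expression.
    return 2.0 * x[0] - x[1]

# Two specialist classifiers, each tuned to a region of GP output space.
specialists = [("class A", 0.0, 0.5), ("class B", 3.0, 0.5)]

for x in ([0.1, 0.2], [1.6, 0.1]):
    y = gp(x)
    label, score = max(((name, gaussian_membership(y, mu, sd))
                        for name, mu, sd in specialists),
                       key=lambda t: t[1])
    print(x, "->", label, f"(membership {score:.2f})")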
Garnett C. Wilson
Probabilistic Adaptive Mapping Developmental Genetic Programming: March 2007
Abstract:
Developmental Genetic Programming (DGP) algorithms explicitly enable the search
space for a problem to be divided into genotypes and corresponding phenotypes. The
two search spaces are often connected with a genotype-phenotype mapping (GPM)
intended to model the biological genetic code, where current implementations of this
concept involve evolution of the mappings along with evolution of the genotype
solutions. This work presents the Probabilistic Adaptive Mapping DGP (PAM DGP)
algorithm, a new developmental implementation that provides research contributions in
the areas of GPMs and coevolution. The algorithm component of PAM DGP is
demonstrated to overcome coevolutionary performance problems as identified and
empirically benchmarked against the latest competing Adaptive Mapping algorithm with
both algorithms using the same (non-redundant) mapping encoding process. Having
established that PAM DGP provides a superior algorithmic framework given equivalent
mapping and genotype structures for the individuals, a new adaptive redundant mapping
is incorporated into PAM DGP for further performance enhancement and closer
adherence to developmental modeling of the biological code. PAM DGP with two
mapping types is then compared to the competing Adaptive Mapping algorithm and
Traditional GP with respect to three regression benchmarks. PAM DGP using redundant
mappings is then applied to two medical classification domains, where PAM DGP with
redundant encodings is found to provide better classifier performance than the alternative
algorithms. PAM DGP with redundant mappings is also given the task of learning three
sequences of increasing recursion order, given a function set consisting of general (not
implicitly recursive) machine-language instructions; it is found to learn second- and
third-order recursive Fibonacci functions more efficiently than the related developmental
systems and Traditional GP. PAM DGP using redundant encoding is also demonstrated
to produce the semantically highest quality solutions for all three recursive functions
considered (Factorial, second and third order Fibonacci). PAM DGP is shown for
regression, medical classification, and recursive problems to have produced its solutions
by evolving redundant mappings to emphasize appropriate members within relevant
subsets of the problem's original function set.
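A genotype-phenotype mapping with redundancy can be sketched as a separately maintained table that translates genotype symbols into function-set members, where several symbols may decode to the same function, allowing the mapping to emphasize some functions over others. The symbol width, toy function set, and static table below are illustrative assumptions; PAM DGP's probabilistic adaptation of the mapping is omitted.

import random

rng = random.Random(4)

FUNCTIONS = ["add", "sub", "mul", "div"]      # toy function set

def random_mapping(n_symbols=16):
    # Redundant mapping: 16 genotype symbols onto only 4 functions, so
    # several symbols can decode to the same function.
    return [rng.choice(FUNCTIONS) for _ in range(n_symbols)]

def decode(genotype, mapping):
    # Phenotype = genotype symbols pushed through the mapping table.
    return [mapping[g % len(mapping)] for g in genotype]

mapping = random_mapping()
genotype = [rng.randrange(16) for _ in range(8)]
print("genotype :", genotype)
print("phenotype:", decode(genotype, mapping))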
Samuel Howse
NummSquared 2006a0 Explained, Including a New Well-Founded Functional Foundation for Logic, Mathematics and Computer Science: October 2006
Abstract:
Set theory is the standard foundation for mathematics, but often does not include rules of reduction
for function calls. Therefore, for computer science, the untyped lambda calculus or type theory
is usually preferred. The untyped lambda calculus (and several improvements on it) make functions
fundamental, but suffer from non-terminating reductions and have partially non-classical logics.
Type theory is a good foundation for logic, mathematics and computer science, except that, by making
both types and functions fundamental, it is more complex than either set theory or the untyped
lambda calculus. This document proposes a new foundational formal language called NummSquared that
makes only functions fundamental, while simultaneously ensuring that reduction terminates, having a
classical logic, and attempting to follow set theory as much as possible. NummSquared builds on
earlier works by John von Neumann in 1925 and Roger Bishop Jones in 1998 that have perhaps not
received sufficient attention in computer science. A soundness theorem for NummSquared is proved.
Usual set theory, the work of Jones, and NummSquared are all well-founded. NummSquared improves upon
the works of von Neumann and Jones by having reduction and proof, by supporting computation and
reflection, and by having an interpreter called NsGo (work in progress) so the language can be
practically used. NummSquared is variable-free. For enhanced reliability, NsGo is an F#/C# .NET
assembly that is mostly automatically extracted from a program of the Coq proof assistant. As a
possible step toward making formal methods appealing to a wider audience, NummSquared minimizes
constraints on the logician, mathematician or programmer. Because of coercion, there are no types,
and functions are defined and called without proof, yet reduction terminates. NummSquared supports
proofs where desired, but does not require them.