Computer algebra, also known as symbolic computation, faces many challenges on the road to high-performance and practical applications. While some computer algebra systems have achieved notable success, the widespread use of symbolic computing in scientific and engineering applications remains limited. This may be attributed in part to the inherent difficulties of exact computation, including expression swell, implementation challenges, and algorithmic complexity. Yet, with the increase in computational power provided by modern computer architectures, efficient implementations are still possible with the right tools and techniques.
To this end, this session is dedicated to exploring high-performance computing applied to computer algebra and its applications, practical and efficient implementations, and research on the challenges inherent to these topics. The scope of this session includes: code optimization techniques; parallel and distributed computing; hardware accelerators (e.g. GPUs); efficient memory management; cache complexity and cache-oblivious algorithms; practical and optimized implementation techniques; and techniques for adaptive or automated performance including auto-tuning and hardware-aware algorithms.
More TBA...
Computing a Gröbner basis can be the first step in solving a system of polynomial equations exactly. In this talk, we describe a high-performance parallel implementation of Faugère's F4 algorithm for 63-bit primes. Our implementation in C is available at axcas.net and is released into the public domain, with the goal of seeing it incorporated into computer algebra systems, both commercial and open source, as part of a modular method. Our code implements the matrix compression scheme used by Faugère, which does not duplicate coefficients or monomials and encodes the differences between successive column numbers efficiently, using single bytes where possible. It uses the parallel probabilistic Gaussian elimination developed by Monagan, Pearce, and Steel, which selects the sparsest pivots, divides the remaining N rows into blocks of size sqrt(N), and reduces random linear combinations of the rows in each block, stopping when zero reductions start to occur. We have developed a new approach to parallel symbolic preprocessing which proceeds in stages to reduce contention, using a parallel version of Davenport's power-of-two hash table to create monomials. We have also added a dynamic pairs limit which captures early degree drops and controls memory use on large problems. We present benchmarks of the program running on various machines, including a 16-core/32-thread 5.4GHz Ryzen 9 9955HX.
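To give a feel for this style of compression (the byte layout below is a hypothetical illustration, not the actual format used in the implementation), the following Python sketch delta-encodes a sorted list of column indices, spending one byte per small gap and escaping to a four-byte encoding for larger gaps:

```python
def encode_columns(cols):
    """Delta-encode a sorted list of column indices: gaps below 255
    take one byte; larger gaps use escape byte 0xFF plus 4 bytes.
    (Illustrative scheme only, not the actual on-disk format.)"""
    out, prev = bytearray(), 0
    for c in cols:
        d = c - prev
        if d < 255:
            out.append(d)
        else:
            out.append(255)
            out += d.to_bytes(4, "little")
        prev = c
    return bytes(out)

def decode_columns(data):
    """Inverse of encode_columns: recover the column indices."""
    cols, prev, i = [], 0, 0
    while i < len(data):
        d = data[i]; i += 1
        if d == 255:
            d = int.from_bytes(data[i:i + 4], "little"); i += 4
        prev += d
        cols.append(prev)
    return cols
```

Rows of an F4 matrix are very sparse but have slowly increasing column indices, so most gaps fit in a single byte.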
A standard way to deal with expression swell in computer algebra is to use multi-modular or evaluation-interpolation methods. When such computations involve Gröbner bases, these methods typically compute many Gröbner bases of specializations of the same ideal. This can be done efficiently with some precomputation, for instance, using Traverso’s tracing.
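The backbone of such multi-modular methods is Chinese remainder reconstruction: compute modulo several primes independently, then combine the residues. A minimal Python sketch (illustrative only):

```python
from math import prod

def crt(residues, moduli):
    """Reconstruct x mod prod(moduli) from the residues x mod m_i,
    assuming the moduli are pairwise coprime."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the modular inverse of Mi modulo m
        x += r * Mi * pow(Mi, -1, m)
    return x % M
```

For example, `crt([2, 3, 2], [3, 5, 7])` recovers 23, the unique solution modulo 105.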
The package Groebner.jl implements F4 with tracing and provides an accessible interface for it (https://github.com/sumiya11/Groebner.jl). As a result, tracing can be conveniently incorporated into algorithms where computing a Gröbner basis is not the final goal. Additionally, the package supports generic types, allowing SIMD-friendly representations of coefficients (for example, tuples of machine integers) to be compiled into efficient code almost automatically by Julia. In the talk, we will advocate for this design by demonstrating how existing software leverages the tracing from Groebner.jl to achieve order-of-magnitude speedups in several applications, including parameter identifiability of ODEs and polynomial system solving.
A multiple double is an unevaluated sum of doubles. An NVIDIA tensor core is a specialized high-performance compute core for matrix multiplication. The Ampere A100, released in 2020, introduced tensor cores capable of 64-bit floating-point arithmetic. Every multiple double arithmetic operation requires renormalization, which involves branching, for which tensor cores are unsuited. To solve this problem caused by renormalization, we apply a solution similar to the Ozaki scheme [Ozaki et al, Numerical Algorithms, 2012].
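To make the renormalization issue concrete, here is the error-free transformation at the heart of double-double arithmetic, written as a Python sketch (the actual code runs in C/CUDA on the GPU):

```python
def two_sum(a, b):
    """Error-free transformation: returns (s, e) with s = fl(a + b)
    and a + b = s + e exactly, using only FP64 additions."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def dd_add(x, y):
    """Add two double-doubles (hi, lo). The final quick two-sum is
    the renormalization step that plain tensor-core matrix
    multiplication cannot express."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    hi = s + e            # renormalize: push the error back
    lo = e - (hi - s)     # into a canonical (hi, lo) pair
    return hi, lo
```

For instance, `two_sum(1.0, 1e-20)` returns `(1.0, 1e-20)`: the low part captures exactly what FP64 rounding discarded.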
Our software is available on GitHub under the GNU GPL license. The code was developed on an NVIDIA RTX 4080 in an Alienware gaming laptop running Windows 11, compiled with the Community edition of Visual Studio, and then run on an A100 housed in a Microway GPU server running Linux, compiled with g++ and nvcc version 12.4 from the CUDA SDK. Using shared memory on the A100, 13.75 FP64 TFLOPS was reached; the FP64 tensor cores on the RTX 4080 give 0.50 TFLOPS.
I will revisit the algorithm for "Dynamic Decomposition of Tropical Prevarieties" by Anders Jensen, which we successfully applied to "Smale's 6th Problem for Generic Masses" in celestial mechanics for the five-body problem (n = 5). While sequential computation is practical in this case and the computer-assisted proof can be executed in about one hour, each attempt at the next case, n = 6, requires on the order of one CPU year. I will summarize our current progress for n = 6, including an efficient parallelization of the implementation on a single node (shared memory), as well as potential ways to attack the problem on a distributed cluster.
In this talk, we show how classical Gaussian elimination can be used to efficiently compute rational univariate representations of solutions of general zero-dimensional polynomial systems. We focus in particular on the FGLM-like component of the algorithm, including the computation of minimal polynomials. We provide a compact implementation of our method in Julia (a bit over 1000 lines of code). Despite relying on dense Gaussian elimination, our implementation is able to compute parametrizations of systems with thousands of solutions within seconds. We describe the main ingredients of the implementation, including practical linear algebra modulo a machine prime, techniques for avoiding redundant computations, strategies for vectorization and SIMD, and multi-threading and multi-process execution policies.
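For readers who want a feel for the core routine, a naive reduced row echelon form over GF(p) fits in a few lines. The Python sketch below is purely illustrative; the actual implementation is in Julia with machine-prime, SIMD, and multi-threading optimizations:

```python
def rref_mod_p(A, p):
    """Reduced row echelon form of matrix A over GF(p).
    Dense Gaussian elimination with full back-substitution."""
    A = [row[:] for row in A]          # work on a copy
    m, n = len(A), len(A[0])
    r = 0
    for c in range(n):
        # find a pivot with a nonzero entry in column c
        piv = next((i for i in range(r, m) if A[i][c] % p), None)
        if piv is None:
            continue
        A[r], A[piv] = A[piv], A[r]
        inv = pow(A[r][c], -1, p)      # modular inverse of the pivot
        A[r] = [(x * inv) % p for x in A[r]]
        for i in range(m):             # eliminate column c everywhere else
            if i != r and A[i][c]:
                f = A[i][c]
                A[i] = [(x - f * y) % p for x, y in zip(A[i], A[r])]
        r += 1
        if r == m:
            break
    return A
```

In practice, the inner update loop is where vectorization and delayed reduction modulo p pay off.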
Power series are used to approximate the solutions of certain algebraic or differential equations at a point. These techniques are commonly used in computer algebra for instance to compute the branches of a plane algebraic curve around a singularity or to compute the limit points of a constructible set.
In most parallel algebraic computations, the algorithmic patterns follow a multi-way divide-and-conquer scheme or a map-reduce scheme (generally for-loop nests). In contrast, the parallelization of algorithms computing the power series solutions of algebraic and differential equations yields less common and more challenging patterns.
Parallelizing the factorization of univariate polynomials over multivariate power series, based on the Weierstrass preparation theorem, leads to a composition of stencil schemes and map-reduce schemes. The same is true when computing the power series solutions to a linear ordinary differential equation at one of its ordinary points.
Parallelizing the factorization of univariate polynomials over multivariate power series, based on Hensel's lemma, can be done by a producer-consumer scheme composed with the schemes involved in the parallelization of Weierstrass preparation theorem. When moving to the extended Hensel construction, dynamic programming takes center stage.
In this talk, we will discuss these various parallelization schemes for the computation of the power series solutions to algebraic and differential equations. This will include algorithm analysis as well as experimentation reports.
This is joint work with Marc Moreno Maza (Western University, Canada) and Alexander Brandt (Dalhousie University, Canada).
Hypergraph theory is a generalization of graph theory in which an edge can be any non-empty subset of the vertex set, and thus may contain more than two vertices. Given a hypergraph H with vertex set V, a non-empty subset S of V is transversal to H if S has a non-empty intersection with every edge of H. The transversal hypergraph of H, denoted by Tr(H), is the set of all subsets of V which are transversal to H and inclusion-minimal with that property.
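The definition translates into code almost verbatim (edges and candidate sets as Python frozensets; a naive illustration, not the optimized implementation discussed in the talk):

```python
def is_transversal(S, H):
    """S is transversal to hypergraph H if it meets every edge of H."""
    return all(S & e for e in H)

def is_minimal_transversal(S, H):
    """S belongs to Tr(H) if it is transversal and no proper subset
    obtained by dropping one vertex is still transversal."""
    return is_transversal(S, H) and \
        all(not is_transversal(S - {v}, H) for v in S)
```

For H = {{1,2}, {2,3}}, the sets {2} and {1,3} are minimal transversals, while {1,2} is transversal but not minimal.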
Over the past two decades, hypergraph theory, and transversal hypergraph computation in particular, have gained significant attention and are used in areas like algebra, artificial intelligence, chemistry, computer vision, integrated circuits and VLSI, physics, scheduling, data mining, social sciences, and even philosophy, to name a few.
Traditionally, Tr(H) is computed by a divide-and-conquer algorithm, based on a formula due to Claude Berge. Various works have improved this approach in terms of arithmetic count, cache complexity, and parallelism. Moreover, recent studies suggest that the running time of Berge's algorithm is on average quasi-linear in the size of Tr(H).
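Berge's formula builds Tr(H) one edge at a time: Tr(H ∪ {E}) = Min{ T ∪ {v} : T ∈ Tr(H), v ∈ E }, where Min keeps only the inclusion-minimal sets. The following Python sketch renders the incremental version directly (the divide-and-conquer variant discussed in the talk recurses on halves of the edge set instead):

```python
def minimal_sets(sets):
    """Keep only the inclusion-minimal sets of a family."""
    out = []
    for s in sorted(set(sets), key=len):
        if not any(t <= s for t in out):
            out.append(s)
    return out

def transversal_hypergraph(edges):
    """Compute Tr(H) by Berge's incremental formula:
    Tr(H ∪ {E}) = Min{ T ∪ {v} : T ∈ Tr(H), v ∈ E }."""
    tr = [frozenset()]                 # Tr of the empty hypergraph
    for e in edges:
        tr = minimal_sets(frozenset(t | {v}) for t in tr for v in e)
    return tr
```

Note that the intermediate families can be exponentially large, which is exactly why the output-sensitive complexity question below is delicate.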
The size of Tr(H) can be exponential in the size of H, thus no algorithm computing Tr(H) can run in polynomial time. However, the question of whether Tr(H) can be computed in output-polynomial time (i.e., in time polynomial in the combined sizes of the input and the output) is an open problem. Studying this latter problem has led to another line of research, in particular by Thomas Eiter and Georg Gottlob, based on a completion algorithm for computing Tr(H).
In this talk, we will discuss these lines of research. Moreover, we will present parallel and optimized C implementations of Berge's divide-and-conquer algorithm and of the Eiter–Gottlob completion algorithm.
This is joint work with Marc Moreno Maza (Western University, Canada).
TBA
TBA
TBA