Zhanxing Zhu

ECS, University of Southampton

Southampton, SO17 1BJ, UK

Email: z.zhu@soton.ac.uk

I am an Associate Professor in the Vision, Learning and Control Group (VLC), School of Electronics and Computer Science (ECS), University of Southampton, UK. I am also closely affiliated with the UKRI AI Centre for Doctoral Training in AI for Sustainability. Previously, I obtained my Ph.D. in machine learning from the School of Informatics, University of Edinburgh, UK.

My research focuses on machine learning, particularly deep learning, broadly covering its theory, methodology and applications. Together with my students and collaborators, I aim to rigorously reveal the underlying mechanisms of why deep learning works or fails. Inspired by this theoretical understanding and by empirical observation, we develop robust, fast and generalizable models and algorithms to broaden its applicability in challenging scenarios and interdisciplinary tasks, e.g. AI4Science and AI4Engineering. More information is available on my Google Scholar profile.

Research Interests:

Understanding deep learning theoretically: SGD (ICML’19, ICML’20); Batch Normalization (NeurIPS’21); Implicit Bias (NeurIPS’23, AAAI’25); Knowledge Distillation (NeurIPS’20); Adversarial Training (ICLR’22, TPAMI’25 New!); Neural Scaling Law of LLMs (ICLR’25 New!)
Robust deep learning models in adversarial and continual environments: YOPO (NeurIPS’19), Inversion Attack (ICML’20), Adversarial Invariant Learning (CVPR’21), RCL (NeurIPS’18) and BOCL (TPAMI’21)
Learning from complex dynamics data: STGCN, STFGN, Neural Lad, Functional Relation Field, DyCAST New!
AI for Science (AI4Science), including climate, material and healthcare problems: Unisoma for Multi-Solid Simulation New! LaDEEP for Simulating Elastic-Plastic Solids New!

Ph.D. Studentships. I am interested in supervising motivated students in AI and machine learning, ranging from theory and algorithms to various applications. Please get in touch to discuss options and potential topics. You can also check out the UKRI AI Centre for Doctoral Training in AI for Sustainability, which has opportunities for 70 PhD students in the area of AI and environmental sustainability.

Selected Publications

TPAMI

Analyzing the Implicit Bias of Adversarial Training from a Generalized Margin Perspective

Bochen Lyu and Zhanxing Zhu

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025

Adversarial training has been empirically demonstrated to be an effective strategy for improving the robustness of deep neural networks (DNNs) against adversarial examples. However, the underlying reason for its effectiveness remains unclear. In this paper we conduct extensive theoretical and empirical analysis of the implicit bias induced by adversarial training from a generalized margin perspective. Our results focus on adversarial training for homogeneous DNNs. In particular: (i) For deep linear networks with ℓp-norm perturbations, we show that the weight matrices of adjacent layers become aligned and the converged parameters maximize the margin of adversarial examples. This can be further viewed as a generalized margin of the original dataset, achieved by an interpolation solution between ℓ2-SVM and ℓq-SVM, where 1/p+1/q=1. (ii) For general homogeneous DNNs, including both linear and nonlinear ones, we investigate adversarial training with a variety of adversarial perturbations in a unified manner. Specifically, we show that the direction of the limit point of the parameters converges to a KKT point of a constrained optimization problem that aims to maximize the margin for adversarial examples. Additionally, as an application of this general result to two special linear homogeneous DNNs, diagonal linear networks and linear convolutional networks, we show that adversarial training with ℓp-norm perturbations equivalently minimizes an interpolation norm that depends on the depth, the architecture, and the value of p in the predictor space. Extensive experiments are conducted to verify the theoretical claims. Our results provide a theoretical basis for the longstanding folklore [1] that adversarial training modifies the decision boundary by utilizing adversarial examples to improve robustness, and potentially provide insights for designing new robust training strategies.
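The role of the dual exponent q (with 1/p+1/q=1) can be seen already for a single linear classifier, which is far simpler than the deep homogeneous networks analyzed in the paper. The sketch below is illustrative only and is not taken from the paper: it uses the standard fact that the worst-case margin under an ℓp perturbation of size eps equals the clean margin minus eps times the ℓq norm of the weights. The function names and the toy values of `w`, `x`, `y` and `eps` are assumptions chosen for the example.

```python
import numpy as np


def dual_exponent(p):
    """Return q such that 1/p + 1/q = 1 (p = 1 pairs with q = inf)."""
    if p == 1:
        return np.inf
    if np.isinf(p):
        return 1.0
    return p / (p - 1.0)


def adversarial_margin(w, x, y, eps, p):
    """Worst-case margin of a linear classifier under an l_p attack.

    min over ||d||_p <= eps of  y * w.(x + d)  =  y * w.x - eps * ||w||_q
    """
    q = dual_exponent(p)
    return y * np.dot(w, x) - eps * np.linalg.norm(w, ord=q)


def worst_case_perturbation_linf(w, y, eps):
    """Minimizing perturbation for p = inf: shift each coordinate by eps against the label."""
    return -eps * y * np.sign(w)


# Toy (hypothetical) values, used only to check the identity numerically.
w = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 0.0, 2.0])
y = 1.0
eps = 0.1

closed_form = adversarial_margin(w, x, y, eps, p=np.inf)
delta = worst_case_perturbation_linf(w, y, eps)
attained = y * np.dot(w, x + delta)
print(closed_form, attained)
```

The two printed values agree, which is why a perturbation measured in the ℓp norm naturally brings the ℓq norm of the predictor into the margin.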
ICLR

A Solvable Attention for Neural Scaling Laws

Bochen Lyu, Di Wang, and Zhanxing Zhu

In International Conference on Learning Representations (ICLR), 2025

Transformers and many other deep learning models are empirically shown to improve their performance predictably as a power law in training time, model size, or the number of training data points, a phenomenon termed the neural scaling law. This paper studies this intriguing phenomenon for the transformer architecture in theoretical setups. Specifically, we propose a framework for self-attention, the underpinning block of the transformer, to learn in an in-context manner, where the corresponding learning dynamics are modeled as a non-linear ordinary differential equation (ODE) system. Furthermore, we establish a procedure to derive a tractable solution for this ODE system by reformulating it as a Riccati equation, which allows us to precisely characterize neural scaling laws for self-attention with respect to training time, model size, data size, and the optimal compute. In addition, we reveal that self-attention shares similar neural scaling laws with several other architectures when the context sequence length of in-context learning is fixed; otherwise, it exhibits a different scaling law with respect to training time.
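To illustrate why a Riccati form makes an ODE tractable, the sketch below takes a scalar Riccati equation with constant coefficients, dy/dt = a·y − b·y², and compares its closed-form solution with a numerical integration. This is not the ODE system from the paper: the equation, the coefficients `a` and `b`, the initial value and the function names are all assumptions chosen for illustration.

```python
import numpy as np


def riccati_rhs(y, a, b):
    """Right-hand side of the scalar Riccati equation dy/dt = a*y - b*y**2."""
    return a * y - b * y ** 2


def riccati_closed_form(t, y0, a, b):
    """Closed-form solution of dy/dt = a*y - b*y**2 with y(0) = y0."""
    growth = np.exp(a * t)
    return a * y0 * growth / (a + b * y0 * (growth - 1.0))


def integrate_rk4(y0, a, b, t_end, n_steps):
    """Integrate the same equation with the classical fourth-order Runge-Kutta scheme."""
    h = t_end / n_steps
    y = y0
    for _ in range(n_steps):
        k1 = riccati_rhs(y, a, b)
        k2 = riccati_rhs(y + 0.5 * h * k1, a, b)
        k3 = riccati_rhs(y + 0.5 * h * k2, a, b)
        k4 = riccati_rhs(y + h * k3, a, b)
        y = y + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return y


# Hypothetical coefficients and initial condition.
a, b, y0, t_end = 1.0, 2.0, 0.01, 5.0
exact = riccati_closed_form(t_end, y0, a, b)
numeric = integrate_rk4(y0, a, b, t_end, n_steps=5000)
print(exact, numeric)
```

With an explicit solution of this kind, the dependence of y(t) on time and on the coefficients can be read off directly instead of being estimated from simulations.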
NeurIPS

Implicit Bias of (Stochastic) Gradient Descent for Rank-1 Linear Neural Network

Bochen Lyu and Zhanxing Zhu

In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023

Studying the implicit bias of gradient descent (GD) and stochastic gradient descent (SGD) is critical to unveiling the underlying mechanism of deep learning. Unfortunately, even for standard linear networks in the regression setting, a comprehensive characterization of the implicit bias is still an open problem. This paper proposes to investigate a new proxy model of the standard linear network, the rank-1 linear network, where each weight matrix is parameterized in a rank-1 form. For the over-parameterized regression problem, we precisely analyze the implicit bias of GD and SGD by identifying a “potential” function such that GD converges to its minimizer constrained by zero training error (i.e., the interpolation solution), and by further characterizing the role of the noise introduced by SGD in perturbing the form of this potential. Our results explicitly connect the depth of the network and the initialization with the implicit bias of GD and SGD. Furthermore, we emphasize a new implicit bias of SGD jointly induced by stochasticity and over-parameterization, which can reduce the dependence of SGD’s solution on the initialization. Our findings regarding the implicit bias differ from those of a recently popular model, the diagonal linear network. We highlight that the induced bias of our rank-1 model is more consistent with the standard linear network, while that of the diagonal one is not. This suggests that the proposed rank-1 linear network might be a plausible proxy for the standard linear network.
NeurIPS

Spherical Motion Dynamics: Learning Dynamics of Normalized Neural Network using SGD and Weight Decay

Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun

Advances in Neural Information Processing Systems (NeurIPS), 2021

In this paper, we comprehensively reveal the learning dynamics of normalized neural networks trained with Stochastic Gradient Descent (with momentum) and Weight Decay (WD), which we name Spherical Motion Dynamics (SMD). Most related works focus on studying the behavior of the effective learning rate in the "equilibrium" state, i.e. assuming the weight norm remains unchanged. However, their discussion of why this equilibrium can be reached is either absent or less convincing. Our work directly explores the cause of equilibrium, as a special state of SMD. Specifically, 1) we introduce the assumptions that can lead to the equilibrium state in SMD, and prove that equilibrium can be reached at a linear rate under the given assumptions; 2) we propose the "angular update" as a substitute for the effective learning rate to depict the state of SMD, and derive the theoretical value of the angular update in the equilibrium state; 3) we verify our assumptions and theoretical results on various large-scale computer vision tasks, including ImageNet and MSCOCO, with standard settings. Experimental results show that our theoretical findings agree well with empirical observations. We also show that the behavior of the angular update in SMD can have interesting effects on the optimization of neural networks in practice.
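The geometric picture behind the term "angular update" can be sketched with a toy scale-invariant loss, i.e. a loss that depends on the weights only through their direction w/‖w‖, as is the case for weights followed by a normalization layer. The example below is a minimal sketch under that assumption and is not the setup from the paper: the quadratic loss, the function names, the learning rate and the weight decay value are hypothetical, and a single full-gradient step is used, so stochasticity and momentum are omitted.

```python
import numpy as np


def scale_invariant_loss_grad(w, target):
    """Gradient of f(w) = 0.5 * ||w/||w|| - target||^2 with respect to w."""
    norm = np.linalg.norm(w)
    u = w / norm
    grad_u = u - target  # gradient with respect to the unit direction u
    # Chain rule through u = w/||w||: drop the radial part, rescale by 1/||w||.
    return (grad_u - np.dot(grad_u, u) * u) / norm


def weight_decay_step(w, grad, lr, wd):
    """One gradient step with weight decay: w <- w - lr * (grad + wd * w)."""
    return w - lr * (grad + wd * w)


def angular_update(w_old, w_new):
    """Angle (in radians) between the weight vectors before and after a step."""
    cos = np.dot(w_old, w_new) / (np.linalg.norm(w_old) * np.linalg.norm(w_new))
    return np.arccos(np.clip(cos, -1.0, 1.0))


rng = np.random.default_rng(0)
w = rng.normal(size=5)
target = rng.normal(size=5)

grad = scale_invariant_loss_grad(w, target)
w_next = weight_decay_step(w, grad, lr=0.1, wd=1e-3)
step_angle = angular_update(w, w_next)

print(np.dot(w, grad))  # numerically zero: the gradient is orthogonal to w
print(step_angle)
```

Because the gradient is orthogonal to the weights and shrinks as the weight norm grows, the gradient step moves the weights along the sphere while weight decay pulls the norm back; the angle between consecutive weight vectors measures the resulting motion independently of the norm.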
NeurIPS

You only propagate once: Accelerating adversarial training via maximal principle

Dinghuai Zhang, Tianyuan Zhang, Lu, Zhanxing Zhu, and Bin Dong

In Advances in Neural Information Processing Systems (NeurIPS), 2019

Deep learning achieves state-of-the-art results in many tasks in computer vision and natural language processing. However, recent works have shown that deep networks can be vulnerable to adversarial perturbations, which raises a serious robustness issue for deep networks. Adversarial training, typically formulated as a robust optimization problem, is an effective way of improving the robustness of deep networks. A major drawback of existing adversarial training algorithms is the computational overhead of generating adversarial examples, which is typically far greater than that of the network training itself. This makes the overall computational cost of adversarial training prohibitive. In this paper, we show that adversarial training can be cast as a discrete-time differential game. By analyzing Pontryagin’s Maximum Principle (PMP) for the problem, we observe that the adversary update is only coupled with the parameters of the first layer of the network. This inspires us to restrict most of the forward and back propagation to the first layer of the network during adversary updates. This effectively reduces the total number of full forward and backward propagations to only one for each group of adversary updates. Therefore, we refer to this algorithm as YOPO (You Only Propagate Once). Numerical experiments demonstrate that YOPO can achieve comparable defense accuracy with approximately 1/5 to 1/4 of the GPU time of the projected gradient descent (PGD) algorithm.
ICML

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma

In International Conference on Machine Learning (ICML), 2019

Understanding the behavior of stochastic gradient descent (SGD) in the context of deep neural networks has attracted considerable attention recently. Along this line, we study a general form of gradient-based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics. By investigating these general optimization dynamics, we analyze the behavior of SGD in escaping from minima and its regularization effects. A novel indicator is derived to characterize the efficiency of escaping from minima by measuring the alignment of the noise covariance and the curvature of the loss function. Based on this indicator, two conditions are established to show which type of noise structure is superior to isotropic noise in terms of escaping efficiency. We further show that the anisotropic noise in SGD satisfies the two conditions, and thus helps to escape from sharp and poor minima effectively, towards more stable and flat minima that typically generalize well. We systematically design various experiments to verify the benefits of the anisotropic noise, compared with full gradient descent plus isotropic diffusion (i.e. Langevin dynamics).
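One simple way to make "alignment of the noise covariance and the curvature" concrete is the trace of the product of the Hessian H and the noise covariance Σ. Treating Tr(HΣ) as the alignment measure is an assumption of this sketch, which is not a reproduction of the paper's analysis; the diagonal Hessian and the two covariance matrices below are hypothetical. Both noise models are given the same total variance, so they differ only in how that variance is distributed across directions.

```python
import numpy as np


def alignment(hessian, noise_cov):
    """Trace of H @ Sigma, used here as a simple curvature/noise alignment score."""
    return float(np.trace(hessian @ noise_cov))


# Hypothetical ill-conditioned curvature at a minimum: one sharp, one flat direction.
hessian = np.diag([10.0, 1.0, 0.1])
dim = hessian.shape[0]
total_variance = 3.0  # both noise models share this trace

# Isotropic noise spreads the variance evenly over all directions.
isotropic_cov = (total_variance / dim) * np.eye(dim)

# Anisotropic noise concentrates the variance along high-curvature directions.
anisotropic_cov = total_variance * hessian / np.trace(hessian)

score_isotropic = alignment(hessian, isotropic_cov)
score_anisotropic = alignment(hessian, anisotropic_cov)
print(score_isotropic, score_anisotropic)
```

In this toy setting the curvature-aligned noise obtains the larger score for the same noise budget, which matches the intuition that noise concentrated along sharp directions is more effective at leaving a sharp minimum.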
IJCAI

Spatio-temporal graph convolutional neural network: A deep learning framework for traffic forecasting

Bing Yu, Haoteng Yin, and Zhanxing Zhu

In International Joint Conference on Artificial Intelligence (IJCAI), 2018

Timely and accurate traffic forecasting is crucial for urban traffic control and guidance. Due to the high nonlinearity and complexity of traffic flow, traditional methods cannot satisfy the requirements of mid-and-long term prediction tasks and often neglect spatial and temporal dependencies. In this paper, we propose a novel deep learning framework, Spatio-Temporal Graph Convolutional Networks (STGCN), to tackle the time series prediction problem in the traffic domain. Instead of applying regular convolutional and recurrent units, we formulate the problem on graphs and build the model with complete convolutional structures, which enables much faster training with fewer parameters. Experiments show that our model STGCN effectively captures comprehensive spatio-temporal correlations through modeling multi-scale traffic networks and consistently outperforms state-of-the-art baselines on various real-world traffic datasets.
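The idea of replacing recurrent units with convolutions over both time and the road graph can be sketched in a few lines of NumPy. The block below is a simplified illustration and not the STGCN architecture itself: the symmetric adjacency normalization, the ReLU nonlinearity, the layer ordering, the random weights and all function names are assumptions made for the example.

```python
import numpy as np


def normalized_adjacency(adj):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def temporal_conv(x, weight):
    """1-D convolution along time without padding.

    x: (T, N, C_in), weight: (K, C_in, C_out) -> output: (T - K + 1, N, C_out)
    """
    kernel = weight.shape[0]
    t_out = x.shape[0] - kernel + 1
    return sum(x[k:k + t_out] @ weight[k] for k in range(kernel))


def graph_conv(x, a_norm, theta):
    """Mix features across neighbouring nodes at every time step, then apply ReLU.

    x: (T, N, C), a_norm: (N, N), theta: (C, C_out) -> output: (T, N, C_out)
    """
    return np.maximum(a_norm @ x @ theta, 0.0)


# Hypothetical road network: 4 sensors on a line, 12 time steps, 1 feature (speed).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
x = rng.normal(size=(12, 4, 1))

a_norm = normalized_adjacency(adj)
w_time = rng.normal(size=(3, 1, 8))   # temporal kernel of width 3
theta = rng.normal(size=(8, 8))       # spatial feature mixing

hidden = temporal_conv(x, w_time)          # (10, 4, 8)
out = graph_conv(hidden, a_norm, theta)    # (10, 4, 8)
print(out.shape)
```

Since every operation is a convolution or a matrix product, all time steps are processed in parallel, which is the source of the training speed advantage over recurrent models mentioned in the abstract.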