PhD Student Binh Nguyen presents at IEEE NER 2025

Binh presented the accepted paper, “Accelerating Neuromorphic Deep Brain Stimulation Optimization through Knowledge Distillation and Enforced Sparsity” at IEEE NER 2025 in San Diego, CA.

Abstract:
Closed-loop Deep Brain Stimulation (DBS) systems hold immense promise for treating motor symptoms in Parkinson's disease (PD) with greater adaptability and efficiency than traditional open-loop approaches. Spiking Neural Networks (SNNs) are particularly well-suited for implementing the control logic in these systems due to their inherent energy efficiency. However, training SNNs, especially using computationally intensive methods like Reinforcement Learning (RL), presents a significant bottleneck, often requiring extensive time and resources. To address this, we introduce a Knowledge Distillation (KD) framework specifically designed to train SNN-based DBS controllers. We leverage a pre-trained, high-performance Deep Spiking Q-Network (DSQN) as a 'teacher' to rapidly guide the training of 'student' SNNs. Our KD approach incorporates a tunable sparsity-enforcing mechanism, allowing us to generate student networks that exhibit varying degrees of sparse, bioinspired activity. We demonstrate that this KD framework achieves a dramatic reduction in training time compared to the initial RL process. Furthermore, we conduct a comprehensive analysis of the trade-offs between network sparsity, controller performance, and the resulting DBS parameters. Our findings support KD as a powerful and practical methodology for developing efficient, sparse, and biologically plausible SNN controllers, significantly accelerating the design and in silico validation of advanced neuromodulation systems.
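
For readers who want the gist of the objective, here is a minimal PyTorch sketch of distillation with an activity-sparsity penalty. The names (`kd_step`, the student returning its spike tensor, `sparsity_weight`) are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch of the distillation objective described above, assuming a
# pre-trained teacher DSQN and a student SNN that also returns its spikes.
import torch
import torch.nn as nn

def kd_step(teacher, student, optimizer, states, sparsity_weight=1e-3):
    with torch.no_grad():
        target_q = teacher(states)           # teacher Q-values guide the student
    student_q, spikes = student(states)      # assumed: student exposes its spike tensor
    distill_loss = nn.functional.mse_loss(student_q, target_q)
    sparsity_loss = spikes.abs().mean()      # L1 penalty enforces sparse activity
    loss = distill_loss + sparsity_weight * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Tuning `sparsity_weight` would correspond to the abstract's "tunable sparsity-enforcing mechanism," trading activity sparsity against controller performance.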

Undergraduate student Skye Gunasekaran publishes paper titled, “A predictive approach to enhance time-series forecasting” in Nature Communications

A predictive approach to enhance time-series forecasting
Skye Gunasekaran, Assel Kembay, Hugo Ladret, Rui-Jie Zhu, Laurent Perrinet, Omid Kavehei & Jason Eshraghian
Nature Communications volume 16, Article number: 8645 (2025)

Abstract

Accurate time-series forecasting is crucial in various scientific and industrial domains, yet deep learning models often struggle to capture long-term dependencies and adapt to data distribution shifts over time. We introduce Future-Guided Learning, an approach that enhances time-series event forecasting through a dynamic feedback mechanism inspired by predictive coding. Our method involves two models: a detection model that analyzes future data to identify critical events and a forecasting model that predicts these events based on current data. When discrepancies occur between the forecasting and detection models, a more significant update is applied to the forecasting model, effectively minimizing surprise, allowing the forecasting model to dynamically adjust its parameters. We validate our approach on a variety of tasks, demonstrating a 44.8% increase in AUC-ROC for seizure prediction using EEG data, and a 23.4% reduction in MSE for forecasting in nonlinear dynamical systems (outlier excluded). By incorporating a predictive feedback mechanism, Future-Guided Learning advances how deep learning is applied to time-series forecasting.

Figure: the Future-Guided Learning (FGL) framework.
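
As a rough illustration of the feedback mechanism, the sketch below drives the forecaster's update by its disagreement (KL divergence) with a detection model that has access to the future window. This is a simplified reading of the abstract; the function and variable names are hypothetical.

```python
# Hedged sketch of the "minimize surprise" update: the forecaster is pulled
# toward a detector that has seen the future window, so larger discrepancy
# produces a larger parameter update.
import torch
import torch.nn.functional as F

def fgl_step(forecaster, detector, optimizer, x_now, x_future):
    with torch.no_grad():
        p_detect = detector(x_future).softmax(dim=-1)   # target from future data
    log_p_forecast = forecaster(x_now).log_softmax(dim=-1)
    surprise = F.kl_div(log_p_forecast, p_detect, reduction="batchmean")
    optimizer.zero_grad()
    surprise.backward()      # bigger disagreement -> bigger gradient step
    optimizer.step()
    return surprise.item()
```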

Undergraduate alumnus Dustin Wang and PhD Students Rui-Jie Zhu and Taylor Kergan submit a preprint titled, “A Systematic Analysis of Hybrid Linear Attention”

A Systematic Analysis of Hybrid Linear Attention

Abstract:

Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate various linear attention models across generations, from vector recurrences to advanced gating mechanisms, both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall significantly improves with increased full attention layers, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced at this https URL.

Figure: three 'generations' of linear-attention state updates.
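
To make the hybridization ratio concrete, here is a toy stacking rule in PyTorch: with a 3:1 linear-to-full ratio, every fourth block is full softmax attention. `LinearBlock` is a stand-in for any evaluated linear variant (e.g., HGRN-2 or GatedDeltaNet), not the paper's implementation.

```python
import torch.nn as nn

class LinearBlock(nn.Module):       # stand-in for a linear-attention layer
    def __init__(self, d):
        super().__init__()
        self.mix = nn.Linear(d, d)
    def forward(self, x):
        return x + self.mix(x)

class FullBlock(nn.Module):         # stand-in for a softmax-attention layer
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
    def forward(self, x):
        return x + self.attn(x, x, x)[0]

def build_hybrid(num_layers, d_model, ratio=3):
    # every (ratio+1)-th block is full attention; the rest are linear
    return nn.Sequential(*[
        FullBlock(d_model) if (i + 1) % (ratio + 1) == 0 else LinearBlock(d_model)
        for i in range(num_layers)
    ])

model = build_hybrid(num_layers=12, d_model=256)   # 3:1 linear-to-full
```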

PhD Student Rui-Jie Zhu and fellow NCG lab members submit a preprint titled, “A Survey on Latent Reasoning”

A Survey on Latent Reasoning

Abstract:
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model’s expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model’s continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: this https URL.
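
As a schematic of one family the survey covers (activation-based recurrence), the toy module below iterates a single transformer block in hidden space for a fixed number of latent steps instead of emitting intermediate tokens. This is purely illustrative.

```python
# Toy illustration of activation-based recurrence: multi-step inference
# happens entirely in the continuous hidden state, with no token-level
# chain-of-thought emitted between steps.
import torch.nn as nn

class LatentReasoner(nn.Module):
    def __init__(self, d_model, latent_steps=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.latent_steps = latent_steps

    def forward(self, h):
        for _ in range(self.latent_steps):   # iterate the same block in latent space
            h = self.block(h)
        return h
```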

“ON-OFF neuromorphic ISING machines using Fowler-Nordheim annealers” led by Zihao Chen, Zhili Xiao, and Shantanu Chakrabartty published in Nature Communications

Abstract: We introduce NeuroSA, a neuromorphic architecture specifically designed to ensure asymptotic convergence to the ground state of an Ising problem using a Fowler-Nordheim quantum mechanical tunneling based threshold-annealing process. The core component of NeuroSA consists of a pair of asynchronous ON-OFF neurons, which effectively map classical simulated annealing dynamics onto a network of integrate-and-fire neurons. The threshold of each ON-OFF neuron pair is adaptively adjusted by an FN annealer, and the resulting spiking dynamics replicate the optimal escape mechanism and convergence of SA, particularly at low temperatures. To validate the effectiveness of our neuromorphic Ising machine, we systematically solved benchmark combinatorial optimization problems such as MAX-CUT and Max Independent Set. Across multiple runs, NeuroSA consistently generates distributions of solutions that are concentrated around the state-of-the-art results (within 99%) or surpass the current state-of-the-art solutions for Max Independent Set benchmarks. Furthermore, NeuroSA achieves these superior distributions without any graph-specific hyperparameter tuning. For practical illustration, we present results from an implementation of NeuroSA on the SpiNNaker2 platform, highlighting the feasibility of mapping our proposed architecture onto a standard neuromorphic accelerator platform.
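
For intuition, here is the classical simulated-annealing dynamics that NeuroSA maps onto spiking neurons, applied to MAX-CUT in its Ising form; the logarithmic cooling schedule stands in for the FN annealer. This is a plain-software baseline, not the neuromorphic implementation.

```python
# Simulated annealing for MAX-CUT: minimize H = sum_{i<j} w_ij * s_i * s_j,
# which maximizes the cut. Slow logarithmic cooling gives the asymptotic
# convergence property the paper's FN annealer emulates.
import math
import random

def sa_maxcut(adj, steps=10000, t0=2.0):
    n = len(adj)
    spins = [random.choice([-1, 1]) for _ in range(n)]
    for step in range(steps):
        temp = t0 / math.log(step + 2)     # asymptotically convergent schedule
        i = random.randrange(n)
        # energy change from flipping spin i
        delta = -2 * spins[i] * sum(adj[i][j] * spins[j] for j in range(n))
        if delta < 0 or random.random() < math.exp(-delta / temp):
            spins[i] = -spins[i]           # accept flip (thermal escape mechanism)
    cut = sum(adj[i][j] for i in range(n) for j in range(i + 1, n)
              if spins[i] != spins[j])
    return spins, cut
```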

NeuroBench published in Nature Communications

The multi-institutional, large-scale project led by Jason Yik (Harvard), Vijay Janapa Reddi (Harvard), and Charlotte Frenkel (TU Delft) has been published in Nature Communications.

Abstract: Neuromorphic computing shows promise for advancing computing efficiency and capabilities of AI applications using brain-inspired principles. However, the neuromorphic research field currently lacks standardized benchmarks, making it difficult to accurately measure technological advancements, compare performance with conventional methods, and identify promising future research directions. This article presents NeuroBench, a benchmark framework for neuromorphic algorithms and systems, which is collaboratively designed from an open community of researchers across industry and academia. NeuroBench introduces a common set of tools and systematic methodology for inclusive benchmark measurement, delivering an objective reference framework for quantifying neuromorphic approaches in both hardware-independent and hardware-dependent settings. For latest project updates, visit the project website (neurobench.ai).
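
As an example of the kind of hardware-independent metric NeuroBench standardizes, the sketch below measures activation sparsity by hooking a PyTorch model during inference. Note this is a generic illustration, not the `neurobench` package's API.

```python
# Generic activation-sparsity measurement: register forward hooks, run one
# inference pass, and report the fraction of zero-valued activations.
import torch
import torch.nn as nn

def activation_sparsity(model, data):
    zeros, total = 0, 0
    def hook(_module, _inp, out):
        nonlocal zeros, total
        zeros += (out == 0).sum().item()
        total += out.numel()
    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, (nn.ReLU, nn.Linear))]
    with torch.no_grad():
        model(data)
    for h in handles:
        h.remove()
    return zeros / max(total, 1)   # fraction of zero activations
```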


“Autonomous Driving with Spiking Neural Networks” by Ph.D. Candidate Rui-Jie Zhu Accepted at NeurIPS 2024

Autonomous driving demands an integrated approach that encompasses perception, prediction, and planning, all while operating under strict energy constraints to enhance scalability and environmental sustainability. We present Spiking Autonomous Driving (SAD), the first unified Spiking Neural Network (SNN) to address the energy challenges faced by autonomous driving systems through its event-driven and energy-efficient nature. SAD is trained end-to-end and consists of three main modules: perception, which processes inputs from multi-view cameras to construct a spatiotemporal bird’s eye view; prediction, which utilizes a novel dual-pathway with spiking neurons to forecast future states; and planning, which generates safe trajectories considering predicted occupancy, traffic rules, and ride comfort. Evaluated on the nuScenes dataset, SAD achieves competitive performance in perception, prediction, and planning tasks, while drawing upon the energy efficiency of SNNs. This work highlights the potential of neuromorphic computing to be applied to energy-efficient autonomous driving, a critical step toward sustainable and safety-critical automotive technology. Our code is available at https://github.com/ridgerchu/SAD.
Link: https://arxiv.org/abs/2405.19687
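
The abstract's three-module structure can be summarized in a few lines; the sketch below reduces each SAD stage to a placeholder and only shows how data flows between them (module internals, including the spiking neurons, are omitted).

```python
# Schematic of the three-stage pipeline described in the abstract; the
# perception/prediction/planning modules are illustrative placeholders.
import torch.nn as nn

class SAD(nn.Module):
    def __init__(self, perception, prediction, planning):
        super().__init__()
        self.perception = perception   # multi-view cameras -> spatiotemporal BEV
        self.prediction = prediction   # dual-pathway spiking forecast of future states
        self.planning = planning       # trajectory from predicted occupancy + rules

    def forward(self, multi_view_frames):
        bev = self.perception(multi_view_frames)
        future = self.prediction(bev)
        return self.planning(future)
```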

“Reducing Data Bottlenecks in Distributed, Heterogeneous Neural Networks” by Undergraduate Researcher Ruhai Lin Accepted at IEEE MCSoC-2024

The rapid advancement of embedded multicore and many-core systems has revolutionized computing, enabling the development of high-performance, energy-efficient solutions for a wide range of applications. As models scale up in size, data movement is increasingly the bottleneck to performance. This movement of data can exist between processor and memory, or between cores and chips. This paper investigates the impact of bottleneck size, in terms of inter-chip data traffic, on the performance of deep learning models in embedded multicore and many-core systems. We conduct a systematic analysis of the relationship between bottleneck size, computational resource utilization, and model accuracy. We apply a hardware-software co-design methodology where data bottlenecks are replaced with extremely narrow layers to reduce the amount of data traffic. In effect, time-multiplexing of signals is replaced by learnable embeddings that reduce the demands on chip IOs. Our experiments on the CIFAR100 dataset demonstrate that the classification accuracy generally decreases as the bottleneck ratio increases, with shallower models experiencing a more significant drop compared to deeper models. Hardware-side evaluation reveals that higher bottleneck ratios lead to substantial reductions in data transfer volume across the layers of the neural network. Through this research, we can determine the trade-off between data transfer volume and model performance, enabling the identification of a balanced point that achieves good performance while minimizing data transfer volume. This characteristic allows for the development of efficient models …
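
A minimal sketch of the co-design idea, assuming hypothetical names: the inter-chip link is modeled as a narrow learnable projection pair, so only a small embedding crosses the chip IOs instead of time-multiplexed full-width activations.

```python
# Hedged sketch: replace a data bottleneck with an extremely narrow layer
# pair so that only `narrow` values cross the inter-chip link per sample.
import torch.nn as nn

class IOBottleneck(nn.Module):
    def __init__(self, width, bottleneck_ratio=8):
        super().__init__()
        narrow = max(width // bottleneck_ratio, 1)
        self.encode = nn.Linear(width, narrow)   # runs on the sending chip
        self.decode = nn.Linear(narrow, width)   # runs on the receiving chip

    def forward(self, x):
        return self.decode(self.encode(x))       # learnable embedding crosses the IOs
```

Raising `bottleneck_ratio` cuts data-transfer volume at the cost of accuracy, which is the trade-off the paper characterizes.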

“Evaluation and mitigation of cognitive biases in medical language models” published in npj Digital Medicine

Increasing interest in applying large language models (LLMs) to medicine is due in part to their impressive performance on medical exam questions. However, these exams do not capture the complexity of real patient–doctor interactions because of factors like patient compliance, experience, and cognitive bias. We hypothesized that LLMs would produce less accurate responses when faced with clinically biased questions as compared to unbiased ones. To test this, we developed the BiasMedQA dataset, which consists of 1273 USMLE questions modified to replicate common clinically relevant cognitive biases. We assessed six LLMs on BiasMedQA and found that GPT-4 stood out for its resilience to bias, in contrast to Llama 2 70B-chat and PMC Llama 13B, which showed large drops in performance. Additionally, we introduced three bias mitigation strategies, which improved but did not fully restore accuracy. Our findings highlight the need to improve LLMs’ robustness to cognitive biases, in order to achieve more reliable applications of LLMs in healthcare.

Link: https://www.nature.com/articles/s41746-024-01283-6

“Neuromorphic intermediate representation: a unified instruction set for interoperable brain-inspired computing” Published in Nature Communications

Spiking neural networks and neuromorphic hardware platforms that simulate neuronal dynamics are getting wide attention and are being applied to many relevant problems using machine learning. Despite a well-established mathematical foundation for neural dynamics, there exist numerous software and hardware solutions and stacks whose variability makes it difficult to reproduce findings. Here, we establish a common reference frame for computations in digital neuromorphic systems, titled Neuromorphic Intermediate Representation (NIR). NIR defines a set of computational and composable model primitives as hybrid systems combining continuous-time dynamics and discrete events. By abstracting away assumptions around discretization and hardware constraints, NIR faithfully captures the computational model, while bridging differences between the evaluated implementation and the underlying mathematical formalism. NIR supports an unprecedented number of neuromorphic systems, which we demonstrate by reproducing three spiking neural network models of different complexity across 7 neuromorphic simulators and 4 digital hardware platforms. NIR decouples the development of neuromorphic hardware and software, enabling interoperability between platforms and improving accessibility to multiple neuromorphic technologies. We believe that NIR is a key next step in brain-inspired hardware-software co-evolution, enabling research towards the implementation of energy-efficient computational principles of nervous systems. NIR is available at neuroir.org

Link: https://www.nature.com/articles/s41467-024-52259-9
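
A minimal example of defining and saving a two-node network with the `nir` Python package, following the pattern shown in the NIR repository (treat exact constructor signatures as approximate):

```python
# Affine -> LIF network expressed as NIR primitives, then written to disk so
# any NIR-aware simulator or hardware toolchain can load it.
import numpy as np
import nir

n = 8
graph = nir.NIRGraph.from_list(
    nir.Affine(weight=np.random.randn(n, 3), bias=np.zeros(n)),
    nir.LIF(tau=np.full(n, 0.01), r=np.ones(n),
            v_leak=np.zeros(n), v_threshold=np.ones(n)),
)
nir.write("lif_net.nir", graph)    # serialize the intermediate representation
graph2 = nir.read("lif_net.nir")   # round-trip back into NIR
```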

“SpikeGPT: Generative pre-trained language model with spiking neural networks” by Ph.D. Candidate Rui-Jie Zhu Published in Transactions on Machine Learning Research

As large language models continue to scale in size, so do the computational resources required to run them. Spiking Neural Networks (SNNs) have emerged as an energy-efficient approach to deep learning that leverages sparse and event-driven activations to reduce the computational overhead associated with model inference. While they have become competitive with non-spiking models on many computer vision tasks, SNNs have also proven to be more challenging to train. As a result, their performance lags behind modern deep learning, and we are yet to see the effectiveness of SNNs in language generation. In this paper, inspired by the Receptance Weighted Key Value (RWKV) language model, we successfully implement 'SpikeGPT', a generative language model with binary, event-driven spiking activation units. We train two variants of the proposed model, with 45M and 216M parameters. To the best of our knowledge, SpikeGPT is the largest backpropagation-trained SNN model to date, rendering it suitable for both the generation and comprehension of natural language. We achieve this by modifying the transformer block, replacing multi-head self-attention to reduce the quadratic computational complexity O(N^2) to linear complexity O(N) in sequence length. Input tokens are instead streamed in sequentially to our attention mechanism (as with typical SNNs). Our preliminary experiments show that SpikeGPT remains competitive with non-spiking models on tested benchmarks, while using 20x fewer operations when processed on neuromorphic hardware that can leverage sparse, event-driven activations.
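
The core idea can be caricatured in a few lines: an RWKV-style recurrent accumulator replaces O(N^2) self-attention with an O(N) per-token update, and a hard threshold makes the output binary and event-driven. Shapes and names below are illustrative, and training the hard threshold would additionally require surrogate gradients.

```python
# Toy RWKV-flavored token mixing with a binary spiking output; tokens are
# streamed sequentially, so cost grows linearly in sequence length.
import torch
import torch.nn as nn

class SpikingLinearAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.k, self.v, self.r = (nn.Linear(d, d) for _ in range(3))
        self.decay = nn.Parameter(torch.zeros(d))   # learned per-channel decay

    def forward(self, x):                            # x: (batch, time, d)
        state = torch.zeros(x.size(0), x.size(2), device=x.device)
        outs = []
        for t in range(x.size(1)):                   # one token at a time
            kt = self.k(x[:, t]).exp()
            state = state * self.decay.sigmoid() + kt * self.v(x[:, t])
            out = self.r(x[:, t]).sigmoid() * state
            outs.append((out > 0.5).float())         # binary, event-driven activation
        return torch.stack(outs, dim=1)
```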

New Preprint: “Scalable MatMul-free Language Modeling” by Ph.D. Candidate Rui-Jie Zhu

The cost of running language models is insane. The computational demands of ChatGPT are estimated at more than $100,000 per day to serve the billions of requests it receives.

Led by Rui-Jie Zhu, we have developed the first MatMul-free language model (VMM/MMM-free) to scale beyond a billion parameters. Our previous work with SpikeGPT tapped out at about 216M parameters, but our latest model has been able to go up to 2.7B parameters (only limited by compute). We’re pretty certain it can keep going.

We provide a GPU-optimized implementation that uses 61% less VRAM than an unoptimized implementation during training.

However, there are several operations in this model that GPUs aren’t yet fully optimized for, such as ternary operations. So Ethan Sifferman, Tyler Sheaves and Dustin R. built a custom FPGA implementation to squeeze out the full benefit, and we can reach human-reading throughput at 13 W: a little less than the power consumed by the human brain.
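
For the curious, here is roughly why ternary weights make the model "MatMul-free": with weights in {-1, 0, +1}, each output reduces to a signed accumulation of inputs. The sketch below uses a standard absmean-style ternarizer with a straight-through estimator; see the preprint for the actual formulation.

```python
# Ternary linear layer sketch: forward pass uses {-1, 0, +1} weights (times a
# per-tensor scale), while gradients flow through the full-precision weights.
import torch
import torch.nn as nn

class TernaryLinear(nn.Linear):
    def forward(self, x):
        scale = self.weight.abs().mean()
        w_q = torch.clamp(torch.round(self.weight / (scale + 1e-8)), -1, 1)
        # straight-through estimator: ternary forward, full-precision backward
        w = self.weight + (w_q * scale - self.weight).detach()
        return nn.functional.linear(x, w, self.bias)

layer = TernaryLinear(512, 512)   # drop-in replacement for nn.Linear
```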

Preprint: https://lnkd.in/gaWbg7ss

GitHub training code: https://lnkd.in/gKFzQs_z

Pre-trained models on HuggingFace: https://lnkd.in/gDXFjPdm


New Preprint: “Autonomous Driving with Spiking Neural Networks” by Ph.D. Candidate Rui-Jie Zhu

Spiking Autonomous Driving

From the guy who built the first spiking language generation model, Rui-Jie Zhu has found a way to make spiking neural networks (SNNs) perform end-to-end autonomous vehicle control. This model takes a 6-camera input and integrates perception, prediction, and planning into a single model with approximately 75x fewer operations than ST-P3 at comparable performance.

Pushing SNNs beyond toy datasets has been tough, but we’ve put a lot of effort into showing how to scale to challenging, real-world problems. The next step for this model is to push it into a closed-loop system. Deploying models like this on low-latency neuromorphic hardware can enable fast response times from sensor to control. This is necessary if we want to bridge the sim2real gap, i.e., by the time you take an action, you don’t want the world to have changed by too much.

Rather than forcing “spiking” into applications for the sake of it, it’s important to take it to domains where there is a computational benefit – and I think this is one of them.

Preprint: https://arxiv.org/abs/2405.19687

Code: https://github.com/ridgerchu/SAD