Paper 2025/956
LEAF: A Low-Latency Evaluation Architecture for Feedforward Block in Privacy-Preserving Transformer Inference
Abstract
Fully homomorphic encryption (FHE) is an appealing and promising approach to privacy-preserving transformer inference, as it protects users' data throughout the computation. However, its enormous computational overhead makes FHE impractical for real-world transformer-based large language models (LLMs): current FHE-based approaches to secure transformer inference face severe performance challenges, with total latency exceeding 5 hours for 32-input batches. The feedforward block, comprising a large-scale matrix multiplication followed by a GELU evaluation, is widely recognized as one of the most computationally intensive components of privacy-preserving transformer inference. In the state-of-the-art system NEXUS, evaluating the feedforward block alone incurs a total latency of 5,378 seconds when processing up to 32 inputs per batch. To reduce this latency, we propose LEAF, a low-latency evaluation architecture for the feedforward block. LEAF combines fast matrix multiplication with an asymptotically efficient algorithm for computing non-polynomial activations. When evaluated on the BERT-base model, LEAF reduces total latency to 53.4 seconds, a $100\times$ speedup over the state-of-the-art method in the same environment. Our implementations are available.
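For context, the feedforward block the abstract refers to computes a projection into a larger intermediate dimension, a GELU activation, and a projection back. Below is a minimal plaintext NumPy sketch of that block, assuming standard BERT-base dimensions (hidden size 768, intermediate size 3072); the weights are random placeholders, and nothing here reflects LEAF's homomorphic evaluation itself.

```python
import numpy as np

# BERT-base dimensions (assumed from the paper's evaluation target).
D_MODEL, D_FF = 768, 3072

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU, the non-polynomial activation whose
    # homomorphic evaluation LEAF accelerates.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feedforward(x, w1, b1, w2, b2):
    # x: (seq_len, D_MODEL). The two matmuls are the "large-scale matrix
    # multiplication" cost the abstract refers to; GELU sits between them.
    return gelu(x @ w1 + b1) @ w2 + b2

# Random placeholder weights, purely for illustration.
rng = np.random.default_rng(0)
x = rng.standard_normal((128, D_MODEL))
w1, b1 = 0.02 * rng.standard_normal((D_MODEL, D_FF)), np.zeros(D_FF)
w2, b2 = 0.02 * rng.standard_normal((D_FF, D_MODEL)), np.zeros(D_MODEL)
print(feedforward(x, w1, b1, w2, b2).shape)  # (128, 768)
```

Under FHE, every entry of these matrices is encrypted, which is why both the matmuls and the non-polynomial GELU dominate the cost that LEAF targets.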
Metadata
- Available format(s)
- PDF
- Category
- Implementation
- Publication info
- Preprint.
- Keywords
- Fully homomorphic encryption, Large-language model, Privacy-preserving AI, Non-polynomial function
- Contact author(s)
- Linru zhang @ ntu edu sg
- xiangning wang @ ntu edu sg
- luxianhui @ iie ac cn
- hxwang @ ntu edu sg
- kwokyan lam @ ntu edu sg
- History
- 2025-05-26: approved
- 2025-05-26: received
- Short URL
- https://ia.cr/2025/956
- License
- CC BY
BibTeX
@misc{cryptoeprint:2025/956,
  author = {Linru Zhang and Xiangning Wang and Xianhui Lu and Huaxiong Wang and Kwok Yan Lam},
  title = {{LEAF}: A Low-Latency Evaluation Architecture for Feedforward Block in Privacy-Preserving Transformer Inference},
  howpublished = {Cryptology {ePrint} Archive, Paper 2025/956},
  year = {2025},
  url = {https://eprint.iacr.org/2025/956}
}