
Paper 2025/956

LEAF: A Low-Latency Evaluation Architecture for Feedforward Block in Privacy-Preserving Transformer Inference

Linru Zhang, Nanyang Technological University
Xiangning Wang, Nanyang Technological University
Xianhui Lu, Chinese Academy of Sciences
Huaxiong Wang, Nanyang Technological University
Kwok Yan Lam, Nanyang Technological University
Abstract

Fully homomorphic encryption (FHE) is an appealing and promising solution for privacy-preserving transformer inference, protecting users' privacy. However, its enormous computational overhead makes it impractical to apply FHE to real-world transformers for large language models (LLMs). Current FHE-based approaches to secure transformer inference face significant performance challenges, with total latency exceeding 5 hours for 32-input batches. The feedforward block, comprising a large-scale matrix multiplication followed by a GELU evaluation, is widely recognized as one of the most computationally intensive components of privacy-preserving transformer inference. In the state-of-the-art system NEXUS, evaluating the feedforward block incurs a total latency of 5,378 seconds when processing up to 32 inputs per batch. We aim to reduce this latency and propose LEAF, a low-latency evaluation architecture for the feedforward block. LEAF introduces a novel combination of fast matrix multiplication and an asymptotically efficient algorithm for computing non-polynomial activations. When evaluated on the BERT-base model, LEAF reduces total latency to 53.4 seconds, a $100\times$ speedup over the state-of-the-art method in the same environment. Our implementations are available.
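
For readers unfamiliar with the component in question, the feedforward block in a standard transformer (and in BERT-base specifically) has the form

$$\mathrm{FFN}(x) = \mathrm{GELU}(x W_1 + b_1)\, W_2 + b_2, \qquad \mathrm{GELU}(z) = z \cdot \Phi(z),$$

where $\Phi$ is the standard normal CDF, $W_1 \in \mathbb{R}^{d \times d_{\mathrm{ff}}}$, $W_2 \in \mathbb{R}^{d_{\mathrm{ff}} \times d}$, and for BERT-base $d = 768$ and $d_{\mathrm{ff}} = 3072$. This sketch is standard background rather than material from the paper itself; it illustrates why the block is expensive under FHE: the matrix products are large, and $\Phi$ is non-polynomial, so an FHE evaluator, which natively supports only ciphertext additions and multiplications, must approximate GELU with polynomials.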

Metadata
Available format(s)
PDF
Category
Implementation
Publication info
Preprint.
Keywords
Fully homomorphic encryption, Large-language model, Privacy-preserving AI, Non-polynomial function
Contact author(s)
linru.zhang@ntu.edu.sg
xiangning.wang@ntu.edu.sg
luxianhui@iie.ac.cn
hxwang@ntu.edu.sg
kwokyan.lam@ntu.edu.sg
History
2025-05-26: received
2025-05-26: approved
Short URL
https://ia.cr/2025/956
License
Creative Commons Attribution
CC BY

BibTeX

@misc{cryptoeprint:2025/956,
      author = {Linru Zhang and Xiangning Wang and Xianhui Lu and Huaxiong Wang and Kwok Yan Lam},
      title = {{LEAF}: A Low-Latency Evaluation Architecture for Feedforward Block in Privacy-Preserving Transformer Inference},
      howpublished = {Cryptology {ePrint} Archive, Paper 2025/956},
      year = {2025},
      url = {https://eprint.iacr.org/2025/956}
}