Paper 2025/956
LEAF: A Low-Latency Evaluation Architecture for Feedforward Block in Privacy-Preserving Transformer Inference
Abstract
Fully homomorphic encryption (FHE) is an appealing and promising approach to privacy-preserving transformer inference, as it protects users' data throughout the computation. However, its enormous computational overhead makes FHE impractical for real-world transformer-based large language models (LLMs): current FHE-based approaches to secure transformer inference face severe performance challenges, with total latency exceeding 5 hours for 32-input batches. The feedforward block, comprising a large-scale matrix multiplication followed by a GELU evaluation, is widely recognized as one of the most computationally intensive components of privacy-preserving transformer inference. In the state-of-the-art system NEXUS, evaluating the feedforward block alone incurs a total latency of 5,378 seconds when processing up to 32 inputs per batch. To reduce this latency, we propose LEAF, a low-latency evaluation architecture for the feedforward block. LEAF combines fast matrix multiplication with an asymptotically efficient algorithm for computing non-polynomial activations. When evaluated on the BERT-base model, LEAF reduces total latency to 53.4 seconds, a $100\times$ speedup over the state-of-the-art method in the same environment. Our implementations are available.
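For context, the feedforward block the abstract refers to computes a projection into a larger intermediate dimension, a GELU activation, and a projection back. Below is a minimal plaintext NumPy sketch of that block, assuming standard BERT-base dimensions (hidden size 768, intermediate size 3072); the weights are random placeholders, and nothing here reflects LEAF's homomorphic evaluation itself.

```python
import numpy as np

# BERT-base dimensions (assumed from the paper's evaluation target).
D_MODEL, D_FF = 768, 3072

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU, the non-polynomial activation whose
    # homomorphic evaluation LEAF accelerates.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feedforward(x, w1, b1, w2, b2):
    # x: (seq_len, D_MODEL). The two matmuls are the "large-scale matrix
    # multiplication" cost the abstract refers to; GELU sits between them.
    return gelu(x @ w1 + b1) @ w2 + b2

# Random placeholder weights, purely for illustration.
rng = np.random.default_rng(0)
x = rng.standard_normal((128, D_MODEL))
w1, b1 = 0.02 * rng.standard_normal((D_MODEL, D_FF)), np.zeros(D_FF)
w2, b2 = 0.02 * rng.standard_normal((D_FF, D_MODEL)), np.zeros(D_MODEL)
print(feedforward(x, w1, b1, w2, b2).shape)  # (128, 768)
```

Under FHE, every entry of these matrices is encrypted, which is why both the matmuls and the non-polynomial GELU dominate the cost that LEAF targets.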
Metadata
- Available format(s)
- PDF
- Category
- Implementation
- Publication info
- Preprint.
- Keywords
- Fully homomorphic encryption, Large-language model, Privacy-preserving AI, Non-polynomial function
- Contact author(s)
- Linru zhang @ ntu edu sg
- xiangning wang @ ntu edu sg
- luxianhui @ iie ac cn
- hxwang @ ntu edu sg
- kwokyan lam @ ntu edu sg
- History
- 2025-05-26: approved
- 2025-05-26: received
- Short URL
- https://ia.cr/2025/956
- License
- CC BY
BibTeX
@misc{cryptoeprint:2025/956,
  author = {Linru Zhang and Xiangning Wang and Xianhui Lu and Huaxiong Wang and Kwok Yan Lam},
  title = {{LEAF}: A Low-Latency Evaluation Architecture for Feedforward Block in Privacy-Preserving Transformer Inference},
  howpublished = {Cryptology {ePrint} Archive, Paper 2025/956},
  year = {2025},
  url = {https://eprint.iacr.org/2025/956}
}