YOCO is a novel decoder-decoder architecture for LLMs that improves memory efficiency by caching key-value pairs only once. It reduces KV cache memory and prefilling time by orders of magnitude, making 1M-length LLMs practical. https://msft.it/6040YnEVM
Thanks for the work! I've written an overview of the paper: https://gonzoml.substack.com/p/you-only-cache-once-decoder-decoder
Jon-Paul Boyd
Senior Quant Analyst at J.P. Morgan
2d
Innovative, but I do wonder what happens when a word late in the text changes the context of an earlier word. For example: "I was at the farm, then I went to the store and bought an apple, but I was disappointed when I found the M4 chip was only available in the iPad, and they still don't offer touch screens on laptops."