used to think KV cache paging/eviction was simple. but the traces here are interesting:
- in chat, KV cache reuse spans minutes; in APIs, it spans seconds.
- for APIs, a small GPU cache is plenty.
- for chat, you need smarter eviction.
- be super workload-aware.
(arxiv.org/pdf/2506.02634)
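a toy simulation of that intuition (my own sketch, not the paper's method or its traces): plain LRU over hypothetical prefix IDs, where the "reuse gap" is counted in intervening prefixes as a rough stand-in for seconds vs minutes. the names `make_trace`, `hit_rate`, and the gap/capacity numbers are all made up for illustration. short gaps (API-style) get every reuse hit with a small cache; long gaps (chat-style) make LRU thrash until the cache is bigger than the gap.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache; each key stands in for one prefix's KV blocks (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def access(self, key):
        """Return True on a hit. On a miss, insert the key and evict the LRU entry if over capacity."""
        if key in self.store:
            self.store.move_to_end(key)  # mark as most recently used
            return True
        self.store[key] = True
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # drop least recently used
        return False


def make_trace(num_sessions, reuse_gap):
    """Synthetic trace (assumption): prefixes arrive in groups of `reuse_gap`,
    and each group is replayed once, so every reuse has reuse_gap - 1
    other prefixes between it and the first use."""
    trace = []
    for start in range(0, num_sessions, reuse_gap):
        group = list(range(start, min(start + reuse_gap, num_sessions)))
        trace += group + group  # first use, then reuse
    return trace


def hit_rate(trace, capacity):
    cache = LRUCache(capacity)
    hits = sum(cache.access(k) for k in trace)
    return hits / len(trace)


if __name__ == "__main__":
    print("api-like, small cache: ", hit_rate(make_trace(1000, 8), 64))
    print("chat-like, small cache:", hit_rate(make_trace(1000, 200), 64))
    print("chat-like, big cache:  ", hit_rate(make_trace(1000, 200), 256))
```

0.5 is the ceiling here since every prefix is used exactly twice. the chat case with a small cache scores 0.0 because cyclic reuse longer than the cache is LRU's worst case, which is the kind of pattern where a smarter eviction policy than recency alone would help.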