A paper presented at SOSP 2025 describes how token-level scheduling let a single GPU serve multiple LLMs, cutting the number of H20 GPUs required from 1,192 to 213 (roughly 82% fewer).

margarita 4 months ago

The actual paper presented at SOSP: https://ennanzhai.github.io/pub/sosp25-aega...
