A Survey on Integrated Training-Inference Architectures for Large Language Models on Multi-GPU Stream Processors
DOI: https://doi.org/10.62306/7ds08r28

Keywords: Large Language Models, Multi-GPU Parallelism, Tensor Parallelism, Hardware-Software Co-Design, Compilation Frameworks

Abstract
Large language models (LLMs) have revolutionized artificial intelligence, achieving remarkable performance in natural language understanding, generation, and multimodal tasks. However, their unprecedented scale—often comprising billions to trillions of parameters—imposes severe computational demands, particularly in training and inference phases, necessitating advanced parallel processing architectures on multi-GPU arrays. This survey provides a comprehensive overview of integrated training-inference (train-infer) architectures for LLMs on large-scale GPU stream processors, emphasizing multi-GPU stream processing, hypercube tensor parallelism, and hardware-software co-designed compilation frameworks. We trace the evolution of parallelism strategies, including data parallelism, pipeline parallelism, and tensor parallelism, highlighting innovations such as cross-cluster pipeline execution, adaptive NIC selection, and spatiotemporal tensor partitioning to mitigate communication overheads and memory bottlenecks in heterogeneous environments [1, 7, 8, 17]. Key challenges, including scalability in non-homogeneous networks and efficient compilation for diverse hardware, are analyzed alongside state-of-the-art solutions like MLIR-based frameworks and RISC-V accelerators [28, 33]. By synthesizing recent advancements, this survey identifies promising directions for scalable, energy-efficient LLM systems, paving the way for broader deployment in edge computing and high-performance clusters.
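To make the tensor-parallelism idea referenced above concrete, the following is a minimal illustrative sketch (not taken from any surveyed system) of Megatron-style column/row-parallel partitioning of a transformer MLP block, with NumPy arrays standing in for per-GPU shards and a summation standing in for the all-reduce; all sizes and names are hypothetical.

```python
# Minimal NumPy sketch of tensor parallelism for one transformer MLP block,
# Y = GeLU(X @ A) @ B, simulating GPU shards in a single process.
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
n_shards, d_model, d_ff = 4, 64, 256             # hypothetical sizes
X = rng.standard_normal((8, d_model))            # activations (batch x d_model)
A = rng.standard_normal((d_model, d_ff))         # first MLP weight
B = rng.standard_normal((d_ff, d_model))         # second MLP weight

# Column-parallel split of A and matching row-parallel split of B: each "GPU"
# holds one slice, so the GeLU output never needs to be gathered between matmuls.
A_shards = np.split(A, n_shards, axis=1)
B_shards = np.split(B, n_shards, axis=0)

# Each shard computes a partial output; the only communication required is a
# final all-reduce (sum) over the partial results.
partials = [gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]
Y_parallel = np.sum(partials, axis=0)            # stands in for the all-reduce

# Check against the unpartitioned computation.
Y_reference = gelu(X @ A) @ B
assert np.allclose(Y_parallel, Y_reference, atol=1e-6)
print("tensor-parallel output matches reference:", Y_parallel.shape)
```

In real deployments the summation step would be a collective operation (e.g., an NCCL all-reduce) across GPUs, and it is exactly this communication step that the partitioning and interconnect-aware techniques surveyed here aim to minimize.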