Scalable ML Inference for Real-Time Recommendations

Authors

  • Ramakrishnan Sathyavageeswaran, The University of Texas at Dallas

DOI:

https://doi.org/10.62647/IJITCE2025V13I4PP1-8

Keywords:

Scalable Machine Learning Inference, Real-Time Recommendation Systems, Low-Latency Model Serving, Distributed Systems for ML, Model Optimization and Deployment

Abstract

Modern web-scale applications, from e-commerce and streaming to social media, rely on real-time recommendation systems. Delivering high-quality recommendations at low latency and under large request volumes, however, remains a challenge for ML inference. This paper introduces a scalable ML inference system for low-latency, high-throughput recommendation pipelines. We examine the trade-offs among model complexity, latency, and system cost, and develop a modular architecture that combines efficient feature retrieval, candidate generation, and optimized model serving. In particular, we discuss model quantization, dynamic batching, caching, and hardware acceleration as methods for improving inference performance. Experiments on benchmark workloads and real-world datasets demonstrate that the system achieves sub-100 ms latency and state-of-the-art recommendation quality while remaining scalable and cost-efficient for production use.
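The abstract names dynamic batching as one of the serving optimizations. As a minimal sketch of the general technique only (not the paper's implementation; the `Request` class, `score_batch` stub, and the batch-size and wait-budget parameters are illustrative assumptions), a serving thread can accumulate incoming requests for a few milliseconds and issue a single model call per batch:

```python
import queue
import threading
import time

class Request:
    """Hypothetical per-request container used only for this illustration."""
    def __init__(self, user_id, features):
        self.user_id = user_id
        self.features = features
        self.result = None
        self.done = threading.Event()

def score_batch(batch):
    # Placeholder for the real model call (e.g., a quantized ranking model).
    return [sum(r.features) for r in batch]

def batching_loop(request_queue, max_batch_size=32, max_wait_ms=5):
    """Collect requests until the batch is full or the wait budget expires,
    then run one inference call for the whole batch (dynamic batching)."""
    while True:
        batch = [request_queue.get()]              # block until the first request arrives
        deadline = time.monotonic() + max_wait_ms / 1000.0
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        scores = score_batch(batch)                # single batched model call
        for req, score in zip(batch, scores):
            req.result = score
            req.done.set()

if __name__ == "__main__":
    q = queue.Queue()
    threading.Thread(target=batching_loop, args=(q,), daemon=True).start()
    reqs = [Request(user_id=i, features=[i, i + 1]) for i in range(8)]
    for r in reqs:
        q.put(r)
    for r in reqs:
        r.done.wait()
        print(f"user {r.user_id}: score {r.result}")
```

The design trade-off this illustrates is the one the abstract alludes to: a larger batch size or wait budget improves throughput and hardware utilization, at the cost of added per-request latency.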

Published

03-10-2025

How to Cite

Scalable ML Inference for Real-Time Recommendations. (2025). International Journal of Information Technology and Computer Engineering, 13(4), 1-8. https://doi.org/10.62647/IJITCE2025V13I4PP1-8