@TencentHunyuan:
We are excited to open-source HPC-Ops, Tencent HY's production-grade LLM inference operator library, designed to unlock peak performance on mainstream inference cards.

Highlights:
🔹 30% Throughput Boost: delivers a 30% QPM improvement for Tencent HY models and 17% for DeepSeek in production.
🔹 Hardware-Optimized: built from scratch with CUDA and CuTe to maximize GPU utilization, addressing a common pain point: mainstream libraries often fall short of hardware peak.
🔹 SOTA Kernel Performance:
  - Attention: up to 2.22x speedup over FlashInfer/FlashAttention
  - GroupGEMM: up to 1.88x speedup over DeepGEMM
  - FusedMoE: up to 1.49x speedup over TensorRT-LLM
🔹 Production-Ready: comprehensive operator coverage, including FusedMoE, GroupGEMM, and multi-node communication, with clean abstractions for easy customization. Already powering Tencent's large-scale inference services.

Unleash your hardware potential today!
GitHub: https://t.co/NBj0DbheL5
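The post above name-drops GroupGEMM without explaining it. As background only (this is not HPC-Ops' actual API, which lives in the repo), here is a minimal, deliberately naive CUDA sketch of what a grouped GEMM computes: a batch of independent matrix multiplies with per-group shapes, exactly the workload that MoE expert layers produce. All names here (`GemmProblem`, `naive_group_gemm`) are hypothetical.

```cuda
#include <cuda_runtime.h>

// Per-group problem descriptor (hypothetical): each group is an
// independent GEMM C_i = A_i * B_i with its own shape (m_i, n_i, k_i).
struct GemmProblem {
    const float* A;  // [m x k], row-major
    const float* B;  // [k x n], row-major
    float*       C;  // [m x n], row-major
    int m, n, k;
};

// Naive grouped GEMM: one thread block per group; threads stride over
// the output elements of that group's C matrix. Optimized libraries
// replace this inner loop with tiled, tensor-core kernels (e.g. via CuTe).
__global__ void naive_group_gemm(const GemmProblem* problems) {
    const GemmProblem p = problems[blockIdx.x];
    for (int idx = threadIdx.x; idx < p.m * p.n; idx += blockDim.x) {
        const int row = idx / p.n;
        const int col = idx % p.n;
        float acc = 0.0f;
        for (int kk = 0; kk < p.k; ++kk) {
            acc += p.A[row * p.k + kk] * p.B[kk * p.n + col];
        }
        p.C[row * p.n + col] = acc;
    }
}

// Launch with one block per group, e.g.:
//   naive_group_gemm<<<num_groups, 256>>>(d_problems);
// In MoE inference each "group" is typically the token slice routed to
// one expert, so m varies per group while the expert weights fix n and k.
```

The grouped formulation matters because per-expert token counts differ every step; fusing the variable-shape multiplies into one launch avoids the per-GEMM overhead that a loop of separate kernel calls would incur.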