@TencentHunyuan (Twitter)

🚀🚀🚀 We are excited to open-source HPC-Ops, Tencent HY's production-grade LLM inference operator library, designed to unlock peak performance on mainstream inference cards.

Highlights:
🔹 30% Throughput Boost: Delivers a 30% QPM improvement for Tencent HY models and 17% for DeepSeek in production.
🔹 Hardware-Optimized: Built from scratch with CUDA and CuTe to maximize GPU utilization, addressing the pain point where mainstream libraries fail to reach hardware peak performance.
🔹 SOTA Kernel Performance:
  - Attention: up to 2.22x speedup over FlashInfer/FlashAttention
  - GroupGEMM: up to 1.88x speedup over DeepGEMM
  - FusedMoE: up to 1.49x speedup over TensorRT-LLM
🔹 Production-Ready: Comprehensive operator coverage including FusedMoE, GroupGEMM, and multi-node communication, with clean abstractions for easy customization. Already powering Tencent's large-scale inference services.

Unleash your hardware potential today!
🔗 GitHub: https://t.co/NBj0DbheL5
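For context on the GroupGEMM operator named above: a grouped GEMM runs many independent matrix multiplies of differing shapes (for example, one per MoE expert) in a single kernel launch rather than one launch each. The sketch below is a deliberately naive plain-CUDA illustration of that operation only; it is not HPC-Ops code or its API, and the names GemmProblem and grouped_gemm_naive are invented for this example. HPC-Ops's actual kernels are built with CuTe and heavily optimized.

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>
#include <algorithm>

// Describes one GEMM in the group: C = A * B, all row-major.
struct GemmProblem {
    const float* A;  // M x K
    const float* B;  // K x N
    float*       C;  // M x N
    int M, N, K;
};

// One z-slice of the grid per problem; each thread computes one C element.
__global__ void grouped_gemm_naive(const GemmProblem* problems) {
    GemmProblem p = problems[blockIdx.z];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= p.M || col >= p.N) return;   // shapes vary per problem
    float acc = 0.0f;
    for (int k = 0; k < p.K; ++k)
        acc += p.A[row * p.K + k] * p.B[k * p.N + col];
    p.C[row * p.N + col] = acc;
}

int main() {
    const int G = 2;                         // e.g. two MoE experts
    int Ms[G] = {64, 32}, Ns[G] = {48, 48}, Ks[G] = {16, 16};
    GemmProblem h[G];
    int maxM = 0, maxN = 0;
    for (int g = 0; g < G; ++g) {
        size_t a = (size_t)Ms[g] * Ks[g];
        size_t b = (size_t)Ks[g] * Ns[g];
        size_t c = (size_t)Ms[g] * Ns[g];
        float *dA, *dB, *dC;
        cudaMalloc(&dA, a * sizeof(float));
        cudaMalloc(&dB, b * sizeof(float));
        cudaMalloc(&dC, c * sizeof(float));
        std::vector<float> hA(a, 1.0f), hB(b, 2.0f);  // every C entry = 2*K
        cudaMemcpy(dA, hA.data(), a * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), b * sizeof(float), cudaMemcpyHostToDevice);
        h[g] = {dA, dB, dC, Ms[g], Ns[g], Ks[g]};
        maxM = std::max(maxM, Ms[g]);
        maxN = std::max(maxN, Ns[g]);
    }
    GemmProblem* d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    // A single launch covers every GEMM in the group.
    dim3 block(16, 16);
    dim3 grid((maxN + 15) / 16, (maxM + 15) / 16, G);
    grouped_gemm_naive<<<grid, block>>>(d);
    cudaDeviceSynchronize();

    float out;
    cudaMemcpy(&out, h[0].C, sizeof(float), cudaMemcpyDeviceToHost);
    printf("C0[0][0] = %.1f (expected %d)\n", out, 2 * Ks[0]);
    return 0;
}

The batching is what matters: per-expert matrices in MoE layers are often too small to saturate a GPU individually, so fusing them into one launch amortizes launch overhead and keeps the SMs busy, which is the regime where the tweet's GroupGEMM and FusedMoE speedup claims apply.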
