China appears to have found a way to bypass AI chip restrictions through software optimization. DeepSeek’s FlashMLA technology significantly boosts the achievable TFLOPS of NVIDIA’s Hopper-based H800 processors.
During Open Source Week, which DeepSeek launched on February 24, the company released FlashMLA, a “decoding kernel” for NVIDIA’s Hopper processors that pushes their capabilities well beyond standard limits. According to DeepSeek, the H800 can reach 580 TFLOPS for BF16 matrix multiplication, approximately eight times its normal capacity. FlashMLA also optimizes memory usage, enabling memory bandwidth of up to 3,000 GB/s, nearly double the H800’s original maximum. Remarkably, all of these improvements are achieved purely through code, without any hardware modifications.
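For developers, FlashMLA is published as an open-source project and exposes a small Python interface on top of its CUDA kernel. The sketch below shows roughly what a decoding step looks like, paraphrased from the repository’s documented usage; the tensor shapes and variable names here are illustrative assumptions and may not match the current API exactly.

import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache  # requires a Hopper GPU

# Illustrative decode-time shapes (assumptions, not DeepSeek's published config):
b, s_q, h_q, h_kv = 4, 1, 128, 1   # batch, query tokens per step, query heads, KV heads
d, dv, block = 576, 512, 64        # key dim, value dim, KV-cache block size
seq_len = 1024
cache_seqlens = torch.full((b,), seq_len, dtype=torch.int32, device="cuda")

q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")

# Paged ("blocked") KV cache: physical blocks addressed through a block table.
n_blocks = b * (seq_len // block)
kvcache = torch.randn(n_blocks, block, h_kv, d, dtype=torch.bfloat16, device="cuda")
block_table = torch.arange(n_blocks, dtype=torch.int32, device="cuda").view(b, -1)

# Scheduling metadata is computed once per batch, then reused across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)
out, lse = flash_mla_with_kvcache(q, kvcache, block_table, cache_seqlens, dv,
                                  tile_scheduler_metadata, num_splits, causal=True)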
Software Optimization for AI Acceleration
FlashMLA works by implementing “low-rank key-value compression,” which squeezes the attention key-value cache into a compact latent representation, reportedly reducing memory consumption by 40%-60%. The technology also employs a block-based “swap” system that allocates cache memory in small blocks on demand, according to how much each task actually needs, rather than reserving fixed-size buffers up front. This approach helps AI models process variable-length sequences more efficiently and improves overall speed.
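To make the compression idea concrete, here is a minimal PyTorch sketch of low-rank key-value compression, the mechanism behind DeepSeek’s multi-head latent attention. All dimensions are hypothetical, and the real design also routes rotary position information through a separate path; this shows the general shape of the technique, not DeepSeek’s implementation.

import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    # Instead of caching full per-head keys and values for every token,
    # cache one small latent vector per token and expand it back to
    # K and V only when attention actually runs.
    def __init__(self, d_model=4096, n_heads=32, d_head=128, d_latent=512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to values
        self.n_heads, self.d_head = n_heads, d_head

    def compress(self, hidden):        # hidden: (batch, seq, d_model)
        return self.down(hidden)       # this latent is all that gets cached

    def expand(self, latent):          # latent: (batch, seq, d_latent)
        b, s, _ = latent.shape
        k = self.up_k(latent).view(b, s, self.n_heads, self.d_head)
        v = self.up_v(latent).view(b, s, self.n_heads, self.d_head)
        return k, v

m = LowRankKVCompression()
latent = m.compress(torch.randn(2, 16, 4096))  # 512 values cached per token
k, v = m.expand(latent)                        # vs. 2 * 32 * 128 = 8192 uncompressed

How much memory this saves depends on how small the latent dimension is relative to the full per-head key/value width; the 40%-60% figure above is DeepSeek’s reported result, not a property of this toy example.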
DeepSeek’s innovation highlights how software alone can enhance the performance of costly, power-hungry AI accelerators. For now, FlashMLA is designed specifically for the H800, but its potential application to the H100 remains an intriguing prospect.
China’s Progress in Computing Optimization
China has been focusing on AI computing efficiency and has demonstrated notable breakthroughs in squeezing more performance out of existing chips, notes NIX Solutions. Recently, scientists from Shenzhen University and the Beijing Institute of Technology boosted an NVIDIA RTX 4070’s performance by 800 times on peridynamics tasks. Because that work was carried out in collaboration with Russian researchers, it has raised concerns about potential military-industrial applications.
DeepSeek’s FlashMLA represents a significant step in software-driven AI acceleration, proving that hardware limitations can be mitigated through coding ingenuity. We’ll keep you updated on further developments.