Show HN: Axiom – C++ tensor library with NumPy's API, optimized for Apple Silicon (github.com/frikallo)
3 points by noahkay13 20 days ago | hide | past | favorite | 2 comments
I kept hitting the same wall: prototype something in NumPy or PyTorch, then rewrite it in C++ for edge deployment. The rewrite always took longer than the original work. Eigen's fixed-size matrix API doesn't map to tensor workloads, xtensor is CPU-only with compile-time templated types that produce unreadable errors, and none of them have GPU support on Mac. Worse, Eigen was often slower than the Python version because PyTorch bundles optimized BLAS while Eigen uses its own limited implementation.

So I built Axiom to make that rewrite mechanical. The API mirrors NumPy/PyTorch as closely as I could — same method names, broadcasting rules, operator overloading, dynamic shapes, runtime dtypes. Code that looks like this in PyTorch:

    scores = Q.matmul(K.transpose(-2, -1)) / math.sqrt(64)
    output = scores.softmax(-1).matmul(V)
looks like this in Axiom:

    auto scores = Q.matmul(K.transpose(-2, -1)) / std::sqrt(64.0f);
    auto output = scores.softmax(-1).matmul(V);
No mental translation. No debugging subtle API differences.

What's in the box (28k LOC):

- 100+ operations: arithmetic, reductions, activations (relu, gelu, silu, softmax), pooling, FFT, full LAPACK linear algebra (SVD, QR, Cholesky, eigendecomposition, solvers)
- Metal GPU via MPSGraph — all ops run on the GPU, not just matmul. Compiled graphs are cached by (shape, dtype) to avoid recompilation
- Seamless CPU ↔ GPU: `auto g = tensor.gpu();` — unified memory on Apple Silicon avoids copies entirely
- Built-in einops: `tensor.rearrange("b h w c -> b c h w")`
- Highway SIMD across architectures (NEON, AVX2, AVX-512, SSE, WASM, RISC-V)
- Runtime dtypes via variant (readable errors, not template explosions)
- Row-major default, column-major supported via as_f_contiguous()
- Works on macOS, Linux, Windows, and WebAssembly
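The variant-based runtime dtype idea can be illustrated with a minimal, self-contained sketch. To be clear, this is my own illustration of the general technique, not Axiom's actual internals — the `Tensor`/`Buffer` names and `add` method here are hypothetical:

    #include <cassert>
    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <variant>
    #include <vector>

    // Sketch: the buffer's element type is a runtime tag, not a template
    // parameter. Ops dispatch with std::visit, so a dtype mismatch becomes
    // a one-line runtime error instead of a template instantiation trace.
    using Buffer = std::variant<std::vector<float>,
                                std::vector<double>,
                                std::vector<int32_t>>;

    struct Tensor {
        Buffer data;

        std::string dtype() const {
            switch (data.index()) {
                case 0:  return "float32";
                case 1:  return "float64";
                default: return "int32";
            }
        }

        Tensor add(const Tensor& other) const {
            if (data.index() != other.data.index())
                throw std::runtime_error("dtype mismatch: " + dtype() +
                                         " vs " + other.dtype());
            return std::visit([&](const auto& a) -> Tensor {
                const auto& b = std::get<std::decay_t<decltype(a)>>(other.data);
                auto out = a;  // elementwise sum into a copy
                for (size_t i = 0; i < out.size(); ++i) out[i] += b[i];
                return Tensor{out};
            }, data);
        }
    };

    int main() {
        Tensor x{std::vector<float>{1.f, 2.f}};
        Tensor y{std::vector<float>{3.f, 4.f}};
        Tensor z = x.add(y);
        assert(std::get<std::vector<float>>(z.data)[1] == 6.f);
        return 0;
    }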

Performance on M4 Pro (vs Eigen with OpenBLAS, PyTorch, NumPy):

- Matmul 2048×2048: 3,196 GFLOPS (Eigen 2,911 / PyTorch 2,433)
- ReLU 4096×4096: 123 GB/s (Eigen 117 / PyTorch 70)
- FFT2 2048×2048: 14.9 ms (PyTorch 27.6 ms / NumPy 63.5 ms)
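As a sanity check on what the matmul number means in wall-clock terms: a dense n×n matmul costs about 2n³ FLOPs, so 3,196 GFLOPS works out to roughly 5.4 ms per 2048×2048 multiply:

    #include <cassert>
    #include <cmath>
    #include <cstdio>

    int main() {
        // ~2*n^3 FLOPs: n multiplies + (n-1) adds per output, n^2 outputs.
        const double n = 2048.0;
        const double flops = 2.0 * n * n * n;            // ~1.72e10 FLOPs
        const double gflops = 3196.0;                    // reported throughput
        const double ms = flops / (gflops * 1e9) * 1e3;  // wall-clock estimate

        std::printf("%.2f ms per matmul\n", ms);         // roughly 5.4 ms
        assert(std::fabs(ms - 5.37) < 0.1);
        return 0;
    }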

To try it:

    git clone https://github.com/frikallo/axiom.git
    cd axiom && make release
Or add to your CMake project via FetchContent. Example files in examples/.
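For the FetchContent route, a minimal sketch would look like the following — note the CMake target name (`axiom`) and using `main` rather than a pinned tag are my assumptions; check the repo's CMakeLists for the exported target:

    include(FetchContent)
    FetchContent_Declare(
      axiom
      GIT_REPOSITORY https://github.com/frikallo/axiom.git
      GIT_TAG        main  # prefer pinning a release tag
    )
    FetchContent_MakeAvailable(axiom)
    target_link_libraries(my_app PRIVATE axiom)  # target name is an assumption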

Happy to answer questions about the internals or take feedback on the API.



This is an impressive piece of engineering, no doubt. The API is clean, performance work is serious, and it’s clear a lot of effort went into making this fast. But let’s be honest: without autograd and a real training ecosystem, this is not a PyTorch replacement, it’s a very nice numerical toolbox. Also, tying GPU acceleration mostly to Metal makes this far less useful outside the Apple ecosystem. Right now, it looks like a technically excellent project searching for its real-world niche. If you add proper differentiation, broader GPU support, and prove that this scales with real users, then it could become something truly important. Until then, it’s great work — but not a revolution.


I appreciate the advice. Right now, numerical coverage, absolute performance, and DX are my biggest priorities. I'm looking to get traction from OSS before scope creep catches up to me, so some passionate devs can jump on board. Autograd and CUDA are the next really big milestones for Axiom.



