
Commit f9ffc69

Update README
1 parent 6dcf000 commit f9ffc69

2 files changed: 24 additions & 20 deletions


.github/workflows/cmake.yml

Lines changed: 4 additions & 0 deletions
@@ -2,6 +2,10 @@ name: CMake
 
 on:
   push:
+    paths-ignore:
+      - '**.md'
+      - 'docs/**'
+      - '.gitignore'
   pull_request:
     branches: [ "main" ]
 
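Note: the added `paths-ignore` filter means pushes that only touch Markdown files, anything under `docs/`, or `.gitignore` no longer trigger this workflow. The filter applies to `push` events only, so pull requests against `main` still build as before.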

README.md

Lines changed: 20 additions & 20 deletions
@@ -9,8 +9,7 @@
 ![GitHub Repo stars](https://img.shields.io/github/stars/KernelTuner/kernel_float?style=social)
 
 
-_Kernel Float_ is a header-only library for CUDA/HIP that simplifies working with vector types and reduced precision floating-point arithmetic in GPU code.
-
+_Kernel Float_ is a header-only library for CUDA/HIP that makes working with reduced-precision floating-point types and vector arithmetic simple and expressive, with zero performance overhead.
 
 ## Summary
 
@@ -21,12 +20,12 @@ mathematical operations require intrinsics (e.g., `__hadd2` performs addition fo
 type conversion is awkward (e.g., `__nv_cvt_halfraw2_to_fp8x2` converts float16 to float8),
 and some functionality is missing (e.g., one cannot convert a `__half` to `__nv_bfloat16`).
 
-_Kernel Float_ resolves this by offering a single data type `kernel_float::vec<T, N>` that stores `N` elements of type `T`.
-Internally, the data is stored as a fixed-sized array of elements.
+_Kernel Float_ resolves this by offering a single unified vector type `kernel_float::vec<T, N>` that stores `N` elements of type `T`.
+Internally, the data is stored using the optimal data layout for the given type.
 Operator overloading (like `+`, `*`, `&&`) has been implemented such that the most optimal intrinsic for the available types is selected automatically.
 Many mathematical functions (like `log`, `exp`, `sin`) and common operations (such as `sum`, `range`, `for_each`) are also available.
 
-Using Kernel Float, developers avoid the complexity of reduced precision floating-point types in CUDA and can focus on their applications.
+The generated assembly is identical to hand-written intrinsics code, meaning you get clean and maintainable source code without sacrificing performance.
 
 
 ## Features
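To make the `vec<T, N>` description above concrete, a kernel using the unified type could look like the sketch below. Only `vec`, `vec_ptr`, the overloaded operators, and `sum` come from the README text itself; the `kernel_float.h` include name and the `kf::cast` conversion helper are assumptions, not verified API.

```cpp
// Hedged sketch of the vec<T, N> API described above; not part of this
// commit. Assumes the single header is named "kernel_float.h" and that
// a kf::cast<T> conversion helper exists.
#include "kernel_float.h"
namespace kf = kernel_float;

__global__ void squared_sum(kf::vec_ptr<const half, 4> input, float* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // `*` on two half vectors selects the best half-precision intrinsic
    // for the target architecture automatically.
    kf::vec<half, 4> sq = input[i] * input[i];

    // Widen to float, then reduce the four lanes to one scalar.
    output[i] = kf::sum(kf::cast<float>(sq));
}
```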
@@ -39,18 +38,14 @@ In a nutshell, _Kernel Float_ offers the following features:
 * Support for quarter (8 bit) floating-point types.
 * Easy integration as a single header file.
 * Written for C++17.
-* Compatible with NVCC (NVIDIA Compiler) and NVRTC (NVIDIA Runtime Compilation).
-* Compatible with HIPCC (AMD HIP Compiler)
-
-
-## Example
+* Compatible with CUDA: `nvcc` (NVIDIA Compiler) and `nvrtc` (NVIDIA Runtime Compilation).
+* Compatible with HIP: `hipcc` (AMD HIP Compiler)
 
-Check out the [examples](https://github.com/KernelTuner/kernel_float/tree/master/examples) directory for some examples.
 
+## Quick Example
 
-Below shows a simple example of a CUDA kernel that adds a `constant` to the `input` array and writes the results to the `output` array.
+Below shows a simple example kernel that multiplies an `input` array by a `constant` and accumulates into an `output` array.
 Each thread processes two elements.
-Notice how easy it would be to change the precision (for example, `double` to `half`) or the vector size (for example, 4 instead of 2 items per thread).
 
 
 ```cpp
@@ -61,13 +56,17 @@ __global__ void kernel(kf::vec_ptr<const half, 2> input, int constant, kf::vec_p
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     output[i] += input[i] * constant;
 }
-
 ```
 
+Notice how easy it would be to change the precision (for example, `double` to `half`) or the vector size (for example, 4 instead of 2 items per thread).
+Check out the [examples](https://github.com/KernelTuner/kernel_float/tree/main/examples) directory for some examples.
+
 Here is how the same kernel would look for CUDA without Kernel Float.
 
 ```cpp
-__global__ void kernel(const half* input, double constant, float* output) {
+#include <cuda_fp16.h>
+
+__global__ void kernel(const half* input, int constant, float* output) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     __half in0 = input[2 * i + 0];
     __half in1 = input[2 * i + 1];
@@ -82,10 +81,9 @@ __global__ void kernel(const half* input, double constant, float* output) {
     output[2 * i + 0] += out0;
     output[2 * i + 1] += out1;
 }
-
 ```
 
-Even though the second kernel looks a lot more complex, the PTX code generated by these two kernels is nearly identical.
+Even though the second kernel looks a lot more complex, both generate nearly identical PTX code.
 
 
 ## Installation
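The "Notice how easy" sentence in the diff above can be made concrete: widening the Kernel Float kernel from two to four elements per thread changes only the vector widths in the signature, while the body stays identical. A hedged sketch follows; the four-wide signature mirrors the README's two-wide example, and the `float` output element type is taken from the plain-CUDA version of the kernel.

```cpp
// Hedged sketch: the README's two-element kernel widened to four
// elements per thread. Only the vector widths in the signature change;
// the body is untouched.
#include "kernel_float.h"
namespace kf = kernel_float;

__global__ void kernel(kf::vec_ptr<const half, 4> input, int constant,
                       kf::vec_ptr<float, 4> output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] += input[i] * constant;  // now 4 lanes per thread
}
```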
@@ -103,9 +101,11 @@ make
 ```
 
 
-## Documentation
+## Links
 
-See the [documentation](https://kerneltuner.github.io/kernel_float/) for the [API reference](https://kerneltuner.github.io/kernel_float/api.html) of all functionality.
+- [Documentation](https://kerneltuner.github.io/kernel_float/)
+- [API reference](https://kerneltuner.github.io/kernel_float/api.html)
+- [Examples](https://github.com/KernelTuner/kernel_float/tree/main/examples)
 
 
 ## Citation
@@ -124,7 +124,7 @@ If you use Kernel Float in scholarly work, please cite the following paper:
 
 ## License
 
-Licensed under Apache 2.0. See [LICENSE](https://github.com/KernelTuner/kernel_float/blob/master/LICENSE).
+Licensed under Apache 2.0. See [LICENSE](https://github.com/KernelTuner/kernel_float/blob/main/LICENSE).
 
 
 ## Related Work
