_Kernel Float_ is a header-only library for CUDA/HIP that makes working with reduced-precision floating-point types and vector arithmetic simple and expressive, with zero performance overhead.

However, working with these types in CUDA is cumbersome:
type conversion is awkward (e.g., `__nv_cvt_halfraw2_to_fp8x2` converts float16 to float8),
and some functionality is missing (e.g., one cannot convert a `__half` to `__nv_bfloat16`).

_Kernel Float_ resolves this by offering a single unified vector type `kernel_float::vec<T, N>` that stores `N` elements of type `T`.
Internally, the data is stored using the optimal data layout for the given type.
Operator overloading (like `+`, `*`, `&&`) is implemented such that the optimal intrinsic for the available types is selected automatically.
Many mathematical functions (like `log`, `exp`, `sin`) and common operations (such as `sum`, `range`, `for_each`) are also available.
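
As a quick illustration, here is a hypothetical device-side sketch combining these operations (the function name `demo` is invented for illustration and the exact signatures may differ; see the API reference):

```cpp
#include "kernel_float.h"
namespace kf = kernel_float;

// Hypothetical sketch: build a small vector, apply overloaded operators
// and math functions, then reduce it to a scalar.
__device__ float demo() {
    kf::vec<float, 4> x = kf::range<float, 4>();  // assumed to yield [0, 1, 2, 3]
    kf::vec<float, 4> y = kf::exp(x) + x * 2.0f;  // elementwise exp, scalar multiply, add
    return kf::sum(y);                            // horizontal reduction
}
```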
The generated assembly is identical to hand-written intrinsics code, meaning you get clean and maintainable source code without sacrificing performance.
## Features

In a nutshell, _Kernel Float_ offers the following features:

* Support for quarter (8-bit) floating-point types.
* Easy integration as a single header file.
* Written for C++17.
* Compatible with CUDA: `nvcc` (NVIDIA Compiler) and `nvrtc` (NVIDIA Runtime Compilation).
* Compatible with HIP: `hipcc` (AMD HIP Compiler).
## Quick Example
Below shows a simple example kernel that multiplies an `input` array by a `constant` and accumulates into an `output` array.
Each thread processes two elements.
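
A minimal sketch of what such a kernel can look like (hypothetical code: the kernel name and exact argument types are assumptions, not the repository's listing):

```cpp
#include "kernel_float.h"
namespace kf = kernel_float;

// Each thread handles one vec of two half values and accumulates the result
// into a vec of two floats; mixed-type arithmetic is promoted automatically.
__global__ void my_kernel(
    const kf::vec<half, 2>* input,
    float constant,
    kf::vec<float, 2>* output
) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] += input[i] * constant;
}
```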
Notice how easy it would be to change the precision (for example, `double` to `half`) or the vector size (for example, 4 instead of 2 items per thread).
Check out the [examples](https://github.com/KernelTuner/kernel_float/tree/main/examples) directory for some examples.
Here is how the same kernel would look for CUDA without Kernel Float.
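
A hypothetical version written directly against the `cuda_fp16.h` intrinsics might look as follows (again a sketch, not the repository's exact listing):

```cpp
#include <cuda_fp16.h>

// Plain-CUDA counterpart: the two half values are packed, widened to float,
// and accumulated manually, two elements per thread.
__global__ void my_kernel(const __half* input, float constant, float* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __half2 raw = __halves2half2(input[2 * i + 0], input[2 * i + 1]);
    float2 val  = __half22float2(raw);  // widen both lanes to float
    output[2 * i + 0] += val.x * constant;
    output[2 * i + 1] += val.y * constant;
}
```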
Even though the second kernel looks a lot more complex, both generate nearly identical PTX code.
## Installation

```
...
make
```
## Links
See the [documentation](https://kerneltuner.github.io/kernel_float/) for the [API reference](https://kerneltuner.github.io/kernel_float/api.html) of all functionality.