
Commit f9ffc69

Update README
1 parent 6dcf000 commit f9ffc69

2 files changed: 24 additions & 20 deletions


.github/workflows/cmake.yml

Lines changed: 4 additions & 0 deletions
@@ -2,6 +2,10 @@ name: CMake
 
 on:
   push:
+    paths-ignore:
+      - '**.md'
+      - 'docs/**'
+      - '.gitignore'
   pull_request:
     branches: [ "main" ]
 
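Note: the added `paths-ignore` filter means pushes that only touch Markdown files, anything under `docs/`, or `.gitignore` no longer trigger this workflow. The filter applies to `push` events only, so pull requests against `main` still build as before.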

README.md

Lines changed: 20 additions & 20 deletions
@@ -9,8 +9,7 @@
 ![GitHub Repo stars](https://img.shields.io/github/stars/KernelTuner/kernel_float?style=social)
 
 
-_Kernel Float_ is a header-only library for CUDA/HIP that simplifies working with vector types and reduced precision floating-point arithmetic in GPU code.
-
+_Kernel Float_ is a header-only library for CUDA/HIP that makes working with reduced-precision floating-point types and vector arithmetic simple and expressive, with zero performance overhead.
 
 ## Summary
 
@@ -21,12 +20,12 @@ mathematical operations require intrinsics (e.g., `__hadd2` performs addition fo
 type conversion is awkward (e.g., `__nv_cvt_halfraw2_to_fp8x2` converts float16 to float8),
 and some functionality is missing (e.g., one cannot convert a `__half` to `__nv_bfloat16`).
 
-_Kernel Float_ resolves this by offering a single data type `kernel_float::vec<T, N>` that stores `N` elements of type `T`.
-Internally, the data is stored as a fixed-sized array of elements.
+_Kernel Float_ resolves this by offering a single unified vector type `kernel_float::vec<T, N>` that stores `N` elements of type `T`.
+Internally, the data is stored using the optimal data layout for the given type.
 Operator overloading (like `+`, `*`, `&&`) has been implemented such that the most optimal intrinsic for the available types is selected automatically.
 Many mathematical functions (like `log`, `exp`, `sin`) and common operations (such as `sum`, `range`, `for_each`) are also available.
 
-Using Kernel Float, developers avoid the complexity of reduced precision floating-point types in CUDA and can focus on their applications.
+The generated assembly is identical to hand-written intrinsics code, meaning you get clean and maintainable source code without sacrificing performance.
 
 
 ## Features
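To make the `vec<T, N>` description above concrete, a kernel using the unified type could look like the sketch below. Only `vec`, `vec_ptr`, the overloaded operators, and `sum` come from the README text itself; the `kernel_float.h` include name and the `kf::cast` conversion helper are assumptions, not verified API.

```cpp
// Hedged sketch of the vec<T, N> API described above; not part of this
// commit. Assumes the single header is named "kernel_float.h" and that
// a kf::cast<T> conversion helper exists.
#include "kernel_float.h"
namespace kf = kernel_float;

__global__ void squared_sum(kf::vec_ptr<const half, 4> input, float* output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // `*` on two half vectors selects the best half-precision intrinsic
    // for the target architecture automatically.
    kf::vec<half, 4> sq = input[i] * input[i];

    // Widen to float, then reduce the four lanes to one scalar.
    output[i] = kf::sum(kf::cast<float>(sq));
}
```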
@@ -39,18 +38,14 @@ In a nutshell, _Kernel Float_ offers the following features:
 * Support for quarter (8 bit) floating-point types.
 * Easy integration as a single header file.
 * Written for C++17.
-* Compatible with NVCC (NVIDIA Compiler) and NVRTC (NVIDIA Runtime Compilation).
-* Compatible with HIPCC (AMD HIP Compiler)
-
-
-## Example
+* Compatible with CUDA: `nvcc` (NVIDIA Compiler) and `nvrtc` (NVIDIA Runtime Compilation).
+* Compatible with HIP: `hipcc` (AMD HIP Compiler)
 
-Check out the [examples](https://github.com/KernelTuner/kernel_float/tree/master/examples) directory for some examples.
 
+## Quick Example
 
-Below shows a simple example of a CUDA kernel that adds a `constant` to the `input` array and writes the results to the `output` array.
+Below shows a simple example kernel that multiplies an `input` array by a `constant` and accumulates into an `output` array.
 Each thread processes two elements.
-Notice how easy it would be to change the precision (for example, `double` to `half`) or the vector size (for example, 4 instead of 2 items per thread).
 
 
 ```cpp
@@ -61,13 +56,17 @@ __global__ void kernel(kf::vec_ptr<const half, 2> input, int constant, kf::vec_p
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     output[i] += input[i] * constant;
 }
-
 ```
 
+Notice how easy it would be to change the precision (for example, `double` to `half`) or the vector size (for example, 4 instead of 2 items per thread).
+Check out the [examples](https://github.com/KernelTuner/kernel_float/tree/main/examples) directory for some examples.
+
 Here is how the same kernel would look for CUDA without Kernel Float.
 
 ```cpp
-__global__ void kernel(const half* input, double constant, float* output) {
+#include <cuda_fp16.h>
+
+__global__ void kernel(const half* input, int constant, float* output) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     __half in0 = input[2 * i + 0];
     __half in1 = input[2 * i + 1];
@@ -82,10 +81,9 @@ __global__ void kernel(const half* input, double constant, float* output) {
     output[2 * i + 0] += out0;
     output[2 * i + 1] += out1;
 }
-
 ```
 
-Even though the second kernel looks a lot more complex, the PTX code generated by these two kernels is nearly identical.
+Even though the second kernel looks a lot more complex, both generate nearly identical PTX code.
 
 
 ## Installation
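The "Notice how easy" sentence in the diff above can be made concrete: widening the Kernel Float kernel from two to four elements per thread changes only the vector widths in the signature, while the body stays identical. A hedged sketch follows; the four-wide signature mirrors the README's two-wide example, and the `float` output element type is taken from the plain-CUDA version of the kernel.

```cpp
// Hedged sketch: the README's two-element kernel widened to four
// elements per thread. Only the vector widths in the signature change;
// the body is untouched.
#include "kernel_float.h"
namespace kf = kernel_float;

__global__ void kernel(kf::vec_ptr<const half, 4> input, int constant,
                       kf::vec_ptr<float, 4> output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] += input[i] * constant;  // now 4 lanes per thread
}
```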
@@ -103,9 +101,11 @@ make
 ```
 
 
-## Documentation
+## Links
 
-See the [documentation](https://kerneltuner.github.io/kernel_float/) for the [API reference](https://kerneltuner.github.io/kernel_float/api.html) of all functionality.
+- [Documentation](https://kerneltuner.github.io/kernel_float/)
+- [API reference](https://kerneltuner.github.io/kernel_float/api.html)
+- [Examples](https://github.com/KernelTuner/kernel_float/tree/main/examples)
 
 
 ## Citation
@@ -124,7 +124,7 @@ If you use Kernel Float in scholarly work, please cite the following paper:
 
 ## License
 
-Licensed under Apache 2.0. See [LICENSE](https://github.com/KernelTuner/kernel_float/blob/master/LICENSE).
+Licensed under Apache 2.0. See [LICENSE](https://github.com/KernelTuner/kernel_float/blob/main/LICENSE).
 
 
 ## Related Work
