feat(bitaxe): add PI-based fan speed controller#33
feat(bitaxe): add PI-based fan speed controller#33jayrmotta wants to merge 1 commit into256foundation:mainfrom
Conversation
3db71c8 to
5dd129b
Compare
johnnyasantoss
left a comment
There was a problem hiding this comment.
Tested it throughly and tweaked a lot the parameters before realizing that the problem wasn't that the PI controller didn't have the correct parameters, but the output was being added to something else and then clamped.
This made the PI "fight" back the base 25% fan speed.
With the changes I proposed here I'm Running it for hours now under 65c. When it gets really hot (by tweaking core freq + voltage), the fan actually goes to 100%.
I tested this on a BitAxe Gamma (601) but had to lower the frequency steps and lower the voltage (it would get too hot with the stock settings). The settings I'm using freq 412.5Mhz (74 ramp steps) and 1.01v.
This may be unrelated but I can run much higher frequencies on ESP-Miner with the same hardware.
5dd129b to
51457c0
Compare
I'd be quite interested in getting to the bottom of any differences like this between Mujina and esp-miner. Do you mean esp-miner runs a lot cooler than Mujina for the same frequency? If so, have you compared the chip voltage in both cases? (FWIW, I'm working up a larger review of this PR.) |
|
Thanks for working on this, Jayr! It sounds like you guys are getting the PI controller math and the temperature filter dialed in. I have some architectural feedback that I think will make this simpler and easier to build on. The core job here is: read temperature, compute fan speed, set fan speed. That's a sequential operation, but right now the PR spreads it across four async tasks—fan worker dispatch loop, temperature reader, PI controller, and fan speed listener—connected by five channels. That's a lot of moving parts. Let me walk through what I'd suggest instead. Remove fan worker actorThe regulator: Option<Arc<Mutex<Tps546<BitaxeRawI2c>>>>,The regulator is shared between the stats monitor and the board lifecycle with a simple But actually, we can do even better. Instead of sharing the EMC2101 behind a mutex, let's give it a single owner. One owner for the EMC2101The stats monitor needs temperature, fan setpoint, and RPM. The fan controller needs temperature. Both need the EMC2101. Rather than sharing it, let one task own it exclusively and publish a cached snapshot for everyone else: /// Cached readings, published via watch channel.
struct ThermalReadings {
temperature: Option<f32>,
fan_percent: Option<u8>,
fan_rpm: Option<u32>,
}That single task does everything in one loop: loop {
// Read hardware (only task that touches the EMC2101)
let temp = emc2101.get_external_temperature().await.ok();
let rpm = emc2101.get_rpm().await.ok();
let pct = emc2101.get_fan_speed().await.ok().map(u8::from);
// Run control math, set speed
if let Some(temp) = temp {
if let Some(speed) = controller.update(temp, dt) {
emc2101.set_fan_speed(speed).await;
}
}
// Publish snapshot for stats monitor
snapshot_tx.send(ThermalReadings { temp, pct, rpm });
sleep(interval).await;
}The stats monitor holds a This loop lives in a single spawned task—similar to how This also leaves room for a future API override of the fan speed. The thermal task is the single writer to the EMC2101, so an API "hold fan at X%" command could just tell that task to pause the PI controller and hold a fixed speed until the API releases the constraint. The fan controller should be a plain objectThe PI controller and temperature filter are pure math—they don't need to be async, and they don't need channels. A synchronous pub struct FanController {
filter: TemperatureFilter,
pi: PiController,
target_temp: f32,
}
impl FanController {
/// Feed a raw sensor reading; get back a fan speed if the
/// reading passes the noise filter.
/// `dt` is the time since the last call. The integral term
/// needs reasonably stable intervals, but that's a given
/// since the caller runs on a `time::interval`.
pub fn update(
&mut self,
reading: f32,
dt: Duration,
) -> Option<Percent> {
let temp = self.filter.consider(reading)?;
let error = temp - self.target_temp;
let output = self.pi.update(error, dt);
let speed = (MIN + output).clamp(MIN, MAX);
Some(Percent::new_clamped(speed as u8))
}
}The caller provides a temperature, gets back a fan speed. That's the whole interface. The board decides how and when to read the sensor and how to apply the result. The PI gains are tuned assuming a roughly stable tick interval, so pub struct FanControllerConfig {
pub target_temp: f32,
pub kp: f32,
pub ki: f32,
pub integral_bounds: (f32, f32),
}
impl FanController {
pub fn new(config: FanControllerConfig) -> Self { ... }
}Where things should liveFor now, keeping the fan controller in The temperature filter is an implementation detail of the controller—I'd fold it into Summary
This should cut the non-test code size significantly and be easier to follow and extend. The PI tuning, filter, logic, and test coverage you've built are good—it's really just the plumbing around them that I'd like to see simplified. I may be oversimplifying in places; if you encounter details I'm glossing over, let's talk through it. Reaching too eagerly for tasks and channels is a common anti-pattern in async Rust. The right idea is usually to start with synchronous code and only introduce concurrency when you have a concrete reason. |
|
Because this is still a work in progress, I've marked it as a draft PR. |
51457c0 to
c721d54
Compare
c721d54 to
d17da6b
Compare
|
Indeed that's a lot simpler @rkuester, please take a look when you got the time |
|
Tested the revised architecture on a Bitaxe Gamma 601 (BM1370) running bitaxe-raw over USB — combined with PR #39 locally for macOS IOKit discovery. ~30 minute run: discovery → ASIC init → pool connect → shares submitted → thermal control active throughout. Fan settled at 47% (~4,600 RPM) with ASIC stable at 70°C. The single-task EMC2101 owner + watch channel approach works well in practice. A few things I noticed that weren't covered in the earlier discussion: Panic if EMC2101 init fails init_fan_controller handles a failed fan.init() with a warning and Ok(()), leaving self.fan as None. Then spawn_thermal_task calls .expect("Fan controller must be initialized before spawning thermal monitor") on it. If the hardware isn't there, that's a panic rather than a graceful degradation. Might be worth either propagating the error or skipping the thermal task spawn when self.fan is None, depending on whether the fan is considered required. Dangling doc reference on FanPIController Line 104: "See module-level documentation for rationale." The module doc just describes what the controller does — the rationale for omitting the derivative term (noise amplification from thermal diodes) isn't actually written down anywhere. It's a good reason and worth capturing; right now the reference just points to nothing. No unit tests for FanPIController TemperatureFilter has solid coverage. If the PI update logic, windup clamping, and output clamping ever get touched during a future tuning pass, having a few fixed-input tests there would help catch regressions. |
d17da6b to
918dbcc
Compare
|
Hey @Nickamoto, I appreciate the review, tests, and feedback 💪 I re-added the I also added an early return to the thermal task, so if the fan fails to initialize the firmware won't panic. But on the other hand it might also not set the fan to 100% speed, which could potentially risk the hw? I've observed that bitaxe-raw is intermittent with booting with the fan on/off. Thanks for the shout with the docs as well, it slipped my mind, I think it's okay now. |
On the test question, I think the guideline is on your side for keeping them. TEST.behavior is aimed at tests that break when you refactor internals while preserving the contract, like asserting a hardcoded specificity score. A PI controller is a pure function: given an error and a timestep, you get a control output. Testing that relationship is testing the behavioral contract, not the implementation. If you swap out the math for a different approach someday and the output changes, the test should catch that. I'd keep them. On the fan init failure, the early return fixes the panic. This is good, but I'm not sure Ok(()) is the right outcome when thermal protection can't start. The miner continues hashing with no fan control, and the fan stays at whatever state the hardware defaulted to rather than a known safe speed. For a Bitaxe at 70°C that might be fine, but as this code path gets reused on higher-power boards it becomes riskier. Worth considering whether a failed fan init should be a hard error that stops the board from starting rather than a silent degradation. The intermittent boot behavior on bitaxe-raw is worth noting, but I wouldn't let hardware quirks drive the error handling strategy for all boards. |
Any thoughts on this @rkuester ? Would this be a case worth panicking? |
918dbcc to
85ad4df
Compare
|
I rebased this on top of the latest main. |
There was a problem hiding this comment.
Hey Jayr! Thanks for incorporating my previous feedback. The design and simplicity of this is much improved now. I've added some comments inline.
One higher level question: Does this adequately handle failure cases? If we have any doubts about whether we have things under control, we should always attempt to set the fan full-on. E.g., update() could track whether it's getting good temperature readings and start defaulting to 100% if it encounters too many errors.
We need a "first do no harm" approach. Fan control is a convenience and optimization feature. The value is reducing fan noise when possible and letting the chip reach its most efficient operating temperature rather than overcooling at 100%. But by taking control of the fan, we take on the responsibility of not making things worse.
| /// since the caller runs on a `time::interval`. | ||
| pub fn update(&mut self, reading: f32, dt: Duration) -> Option<Percent> { | ||
| if let Some(temp) = self.filter.consider(reading) { | ||
| info!(temp_c = %temp, "Temperature reading"); |
There was a problem hiding this comment.
Info every 4s is too noisy! This should be trace.
There was a problem hiding this comment.
The problem with trace is that it gets next level noisy with serial comms and everything else. Maybe debug instead? This is what I've been using most of the time during development anyway.
If you agree then end users would see temperature along other stats every 30s and debug would also report thermal readings every 4 seconds.
There was a problem hiding this comment.
Yes, global trace is noisy. The solution is per-module RUST_LOG filtering. You can run at info globally and turn up just the fan controller:
RUST_LOG=mujina_miner=info,mujina_miner::board::bitaxe::fan_controller=trace
The CODING_GUIDELINES.md section L.filter covers the syntax in detail, and I added a note to the user-facing README about this in ece3334.
Good question. Not a panic, no. We want to avoid panics wherever possible. A panic takes down the whole process (or at minimum the task), leaving the system in an unknown state. If there's a reasonable chance something can fail, we should handle it with a proper error path. But I also don't think we should silently continue. The concern isn't controlling the fan per se, it's that if we can't reach a peripheral on the board, that's a sign something is wrong at a lower level and we probably shouldn't push forward. I'd return an error from I notice there are several |
0dfc135 to
f559648
Compare
7173104 to
59ed40b
Compare
|
Spiced it up a little @rkuester, I wasn't happy with the I'm happy to undo and just have it all plain in the original function, but I think it looks better like this and I feel better knowing we have test coverage for that as well. |
ddc5d18 to
4203ea8
Compare
rkuester
left a comment
There was a problem hiding this comment.
I appreciate all the work that's gone into this, but I think we're trending in the direction of too much complexity again. And I was finding it hard to explain how to integrate more fail-safe behavior into the current structure.
I realized some of that is on me. The existing Bitaxe board code was a mess, and didn't give clean integration points. So I stepped back and pushed a series of refactoring commits to main to clean up the foundation:
-
e14f3b1..13d6fc4: Flatten initialization into the factory function. The diff in some of these commits is larger than I normally like, but it was hard to do surgically because it's mostly moving a lot of code around. Not much is new; it's just very reorganized. The old code spread initialization across
new(),initialize(), and the factory, deferring work intoOptionfields that had to be checked later. That made error handling awkward. When something goes wrong during init, I want to bail out cleanly, and that's hard when the state is scattered across many methods. Now the factory function handles it all. If any step fails, it can error out with nothing partial left behind. -
27dce6b: Replace the old stats monitor with a single
run_monitor()loop that reads all sensors each tick (EMC2101 + regulator), publishes telemetry, and logs a summary. I think having one active loop is going to be simpler than splitting things into separate tasks for thermal, telemetry, control, etc. Everything works from the same sensor readings taken in the same pass. Components can communicate and coordinate through local state instead of mutexes or channels. It's simpler to read, and it makes shutdown much more straightforward. There's only one place to handle errors and tear things down. -
94c8850:
BitaxeAsicEnablenow records when the ASIC was last taken out of reset, shared between the hash thread and monitor via a cloned struct. As I was implementing the fail-safe behavior below, I kept getting spurious 85-90 C readings on the first tick after startup. The cause seems to be something transient when the hash thread releases the ASIC from reset, which briefly corrupts the thermal diode measurement. The monitor now waits 500ms after the ASIC is enabled before trusting temperature readings. This solved it on my hardware; curious what you see on yours. -
3487b15: Basic thermal safety. Temperature readings are classified into four categories (I2C/diode faults, out-of-range, above-emergency, normal). Anything except normal increments a consecutive bad-reading counter; after a threshold the board shuts down with the fan at 100%. This is an initial implementation to communicate some of what fail-safe behavior means to me, but feel free to adjust it as you integrate active fan control. It's better than the nothing that was there before, and it puts some of my thinking into code.
I realize this means reworking the PR against the new structure, and I'm sorry about that. I know we've been through several rounds already. I think starting from these cleaner integration points will make the result simpler, and hopefully we can get this across the line without too many more iterations.
Regarding the PR code, I think the FanController, FanPIController, TemperatureFilter, ThermalTask, ThermalStartupDevice, etc. layering has more indirection than the problem calls for. The core of it is simple PI math: given a target temperature, a current temperature, and a time delta, compute a fan speed. I think most of that could be in one struct with an update_speed(current_temp, dt) -> Percent method rather than several types that spend a lot of code constructing each other passing state around. Some temperature filtering (range checks, noise rejection) may still be needed on top of what the monitor does now, but esp-miner doesn't do any filtering, if I recall correctly.
Since everything now runs in the single board monitor loop, the ThermalTask and the ThermalStartupDevice trait shouldn't be needed anymore. I really think we can limit knowledge of the EMC2101 device to the bitaxe.rs module.
I'm sympathetic to the need to add seams for testing, and I realize there's none of that in what I added to main for thermal safety. Perhaps that could be lightly factored out into a unit where the logic would be testable.
Did some digging on ESP-miner and found out they actually do filter values (check here https://github.com/bitaxeorg/ESP-Miner/blob/master/main/tasks/fan_controller_task.c#L92-L99). They use an EMA filter in contrast to the rolling window + threshold filter I introduced. So I decided to run an experiment with a plausible time series against the two implementations.
Time series[
39.374863, 40.091479, 37.147465, 42.988044, 38.013364, 39.616776, 41.809599, 39.363224, 39.820107, 40.219403,
42.585975, 45.372985, 42.172862, 40.609544, 43.270899, 41.743951, 42.785375, 44.752452, 42.044023, 43.355246,
42.200426, 43.458816, 44.254202, 42.562058, 43.712873, 44.042709, 43.616382, 42.967325, 42.792797, 44.870260,
44.872796, 48.799311, 41.975989, 45.973110, 46.535566, 48.195230, 47.084444, 45.241217, 46.683432, 47.672825,
46.564914, 48.367573, 44.584799, 50.159591, 49.940210, 47.411057, 49.007463, 48.338288, 47.198508, 48.749656,
50.294519, 48.398210, 48.582225, 49.209902, 50.147732, 51.590936, 48.897227, 50.787720, 50.525179, 47.588966,
49.923789, 50.530150, 50.845167, 51.569407, 48.195797, 51.501544, 50.591527, 49.624761, 52.746244, 53.238598,
51.331682, 53.751122, 52.091042, 52.938929, 51.366967, 55.567685, 49.377660, 53.405535, 54.761272, 50.844273,
122.989222, 54.130403, 55.659165, 56.470309, 54.168031, 52.872971, 57.176481, 57.128336, 54.784379, 56.179599,
121.399732, 56.854438, 56.487216, 58.466828, 53.925724, 58.269779, 57.455130, 56.734599, 58.996478, 53.896656,
126.136026, 58.342936, 56.239999, 58.765217, 57.835328, 62.345279, 61.396215, 59.480130, 57.916094, 58.295732,
123.706837, 58.682261, 58.345123, 58.987913, 59.635982, 59.450805, 59.354125, 59.184552, 64.579426, 58.719888,
122.036921, 61.930001, 60.971916, 62.268903, 63.007815, 63.878845, 63.288751, 60.846042, 64.176278, 60.041066,
62.692690, 62.292940, 61.624882, 64.279460, 63.292954, 65.273501, 61.695900, 65.364945, 65.018267, 64.636992,
62.494900, 64.421334, 62.654862, 62.026776, 70.246251, 66.958805, 68.367281, 65.210751, 67.074448, 67.252155,
67.284183, 66.563278, 66.391297, 63.805129, 68.000643, 67.896552, 69.114016, 67.201702, 70.401663, 67.293573,
66.257563, 69.723828, 67.789943, 66.786261, 69.031457, 70.154854, 69.558139, 70.117998, 75.710778, 70.955299,
72.197138, 67.096608, 70.798837, 71.588259, 70.056296, 69.464925, 71.549554, 70.658727, 70.415899, 73.206663,
73.661641, 72.288115, 71.328708, 72.957486, 73.605997, 73.483621, 70.533521, 72.382421, 73.604027, 74.174537,
74.858258, 74.730520, 72.067366, 72.883592, 72.288459, 77.003197, 74.742977, 75.477988, 76.373664, 74.506496
]In my eyes the performance is similar and the trade-offs are mostly conceptual versus implementation complexity. I believe EMA's implementation is simpler but the concept is more complex and vice-versa with the RW+T filter. Which one you think we should go for @rkuester ? EMA, RW+T, or none? |
4203ea8 to
122d70e
Compare
Interesting. That's quite a recent addition (bitaxeorg/ESP-Miner#1640)—about one week ago! Well, if it's good for them it's good for us. Be sure to think through the effect of any differences in filter constants and sampling rate. Looks like you've got a good handle on it with your modeling there. I'm a fan 🤦♂️ of EMA, it being simpler. |
10396b8 to
a9f836e
Compare
Implement a closed-loop fan controller that adjusts PWM duty cycle to maintain the target ASIC temperature. The PI controller uses a proportional term for quick reactions and an integral term to eliminate steady-state offset. Anti-windup clamping prevents integral windup during extended overheat conditions.
a9f836e to
ba0d506
Compare


Summary
Currently the fan controller available on Mujina sets the fan speed to 100% as the system initializes, it sits at that value through the entire firmware lifecycle. Issue #9 describes this behavior and points to esp-miner's implementation of a PID controller to control the temperature.
Changes
Goal
Improve thermal management and protect Bitaxe hardware under sustained load by actively controlling the fan based on filtered temperature readings.
Disclaimer
I'm currently unable to stabilize the Bitaxe with the fan controller alone, in a previous experiment I was also controlling the ASIC frequency which did help but for simplicity (under @rkuester's guidance) we decided to first introduce the fan control and then maybe later introduce frequency control.
I got a second Bitaxe thinking it would potentially not heat as much as the first one but I'm struggling to flash and run bitaxe-raw on it.
Testing
Fixes: #9