Questions on Automatic Data Collection and Labeling in Data Drift #681
Replies: 1 comment
Good catch @qyy2003. You've found a real gap in the current coverage. You're right that Ch6 assumes human annotators and Ch14 doesn't address automated labeling for drift. This is an active research area. A few directions worth knowing about:

- **Programmatic labeling** (Snorkel-style): write heuristics or use knowledge bases to generate noisy labels, then learn to denoise them.
- **Self-training**: use the model's confident predictions as pseudo-labels for fine-tuning.
- **Active learning**: strategically pick the most informative samples for human review instead of labeling everything.
- **Foundation model distillation** (increasingly common): a big cloud model labels your edge data so you can fine-tune the small model.

The fundamental tension: you need labels to detect and fix drift, but drift means your existing model, which is your best automatic labeler, is becoming unreliable. Classic chicken-and-egg, and that's what makes it a rich research question.

Thanks for the thoughtful feedback. This is the kind of thing that helps us improve the book.
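To make the self-training and active-learning ideas concrete, here is a minimal sketch of how the two can be combined on unlabeled edge data. The `route_samples` function, the confidence threshold, and the `(id, class, confidence)` tuple format are all illustrative assumptions, not an API from the book:

```python
# Hypothetical sketch: split model predictions on unlabeled data between
# self-training (confident predictions kept as pseudo-labels) and active
# learning (the most uncertain samples queued for human review).
# Threshold and budget values are arbitrary assumptions for illustration.

def route_samples(predictions, pseudo_label_threshold=0.95, review_budget=2):
    """predictions: list of (sample_id, predicted_class, confidence).

    Returns:
      pseudo_labels: confident predictions reused as training labels
      review_queue:  ids of the most uncertain samples, for annotators
    """
    pseudo_labels = [
        (sid, cls) for sid, cls, conf in predictions
        if conf >= pseudo_label_threshold
    ]
    # Least-confident sampling: lowest-confidence items are assumed
    # most informative for a human to label.
    uncertain = sorted(predictions, key=lambda p: p[2])
    review_queue = [sid for sid, _, _ in uncertain[:review_budget]]
    return pseudo_labels, review_queue


preds = [
    ("a", "cat", 0.99),
    ("b", "dog", 0.55),
    ("c", "cat", 0.97),
    ("d", "dog", 0.61),
    ("e", "cat", 0.88),
]
pseudo, queue = route_samples(preds)
print(pseudo)  # [('a', 'cat'), ('c', 'cat')]  -> fine-tuning set
print(queue)   # ['b', 'd']                    -> human review
```

Note the chicken-and-egg problem shows up directly here: once drift sets in, the confidences feeding this routing come from the very model that is degrading, so the pseudo-labels get noisier exactly when you need them most.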
Data drift can degrade model performance. Small models deployed on mobile and edge devices suffer more than large foundation models.
A common mitigation strategy is to fine-tune and redeploy the model. However, relying on experts to manually collect and label data and then fine-tune models at regular intervals is impractical. Both model monitoring and fine-tuning require access to ground truth, raising the critical question: how can we automate data collection and, more importantly, labeling?
Note: The approach discussed in Chapter 6: Data Engineering still appears to require human annotators, and Chapter 14: On-Device Learning does not address this issue.