Opinion
River: the Best Python Library for Online Machine Learning
The “sklearn” for machine learning on streaming data
Conventional machine learning algorithms, such as linear regression and XGBoost, operate in “batch” mode. That is, they fit a model using a full dataset in one go. Updating that model with new data requires fitting a brand new model from scratch using both the new data and the old data.
In many applications, this can be difficult or impossible! It requires all data to fit into memory, which isn’t always feasible. The model itself can be slow to re-train. Retrieving older data for the model can be a big challenge, particularly in applications where data is continuously generated. Storing historical data requires storage infrastructure that can return the full history of data quickly.
Alternatively, models can be trained “online” or in “streaming” mode. In this case, data is treated as a stream or sequence of items that are passed to a model one by one.
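To make the one-by-one idea concrete, here is a minimal sketch in plain Python. It is not River's implementation: the `OnlineLinearRegression` class, the `stream` generator, the learning rate, and the synthetic target `y = 2x + 1` are all assumptions made up for illustration. The model sees each item once, updates its weights with a single gradient step, and never stores the item.

```python
class OnlineLinearRegression:
    """Hypothetical online linear regressor updated by one SGD step per item."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}  # feature name -> weight, created lazily
        self.bias = 0.0

    def predict_one(self, x):
        # x is a dict of feature name -> value for a single item
        return self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        # Gradient of the squared error for this single item only
        error = self.predict_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.learning_rate * error * v
        self.bias -= self.learning_rate * error
        return self


def stream(n_items):
    """Synthetic, noiseless stream following y = 2x + 1 (illustrative only)."""
    for i in range(n_items):
        x_value = (i % 10) / 10
        yield {"x": x_value}, 2.0 * x_value + 1.0


model = OnlineLinearRegression(learning_rate=0.1)
for x, y in stream(10_000):
    model.learn_one(x, y)  # the item can be discarded right after this call

print(model.weights, model.bias)
```

Memory use stays constant no matter how long the stream runs, because only the weights are kept. The trade-off is that each item influences the model once, so the learning rate matters more than it would in a batch fit.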
Incremental learning, continual learning, and stream learning are often preferred over the term “online learning” because searches for “online learning” largely point to educational websites.