Background and API design

There have been long-standing efficiency issues with scikit-learn's nearest neighbor search. In particular, the ball tree and k-d tree do not scale well to high-dimensional spaces. The decision was taken that the best way to integrate other techniques was to allow the methods of all applicable unsupervised estimators to take a sparse matrix, typically a KNN-graph of the points, but potentially any estimate of it. These slides from PyParis 2018 explain some background, while issue #10463 and pull request #10482 give discussion, justification, benchmarks, and more detail regarding the approach.

The main advantage of this technique is that the sparse matrix/KNN-graph can be built from the data by a transformer, and the transformer and estimator can then be sequenced using the scikit-learn pipeline mechanism. This allows, for example, parameter search to be done on the KNN-graph construction technique together with the estimator. Typically the transformer should closely follow the interface of KNeighborsTransformer. The exact contract is outlined in the user guide. There is also an example notebook with early versions of the transformers in this library.
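The pipeline idea can be sketched with scikit-learn's own `KNeighborsTransformer` as the graph-building step. This is a minimal example under stated assumptions rather than this library's recommended setup: the toy data, the step names `graph` and `classifier`, and the parameter grid are all made up for illustration, and `KNeighborsClassifier` is used as the downstream estimator only because it accepts a precomputed sparse graph and gives the search a score to optimise.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer
from sklearn.pipeline import Pipeline

# Three well-separated blobs (illustrative data only).
X, y = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]], random_state=0
)

pipeline = Pipeline(
    [
        # Step 1: build a sparse KNN-graph of distances from the data.
        ("graph", KNeighborsTransformer(n_neighbors=10, mode="distance")),
        # Step 2: the estimator consumes the graph rather than raw features.
        ("classifier", KNeighborsClassifier(metric="precomputed")),
    ]
)

# Search over graph construction and estimator parameters together.
# The classifier must not ask for more neighbours than the graph stores.
param_grid = {
    "graph__metric": ["euclidean", "manhattan"],
    "classifier__n_neighbors": [3, 5],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

A transformer that follows the `KNeighborsTransformer` interface should be usable in place of the `graph` step without changing the rest of the pipeline.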