
Scikit-learn Integrations

Classes for sklearn integration.

Despite their similar names, pdpipe's PdPipeline and sklearn.pipeline.Pipeline differ in an important way: a PdPipeline can only chain transformers, while a scikit-learn Pipeline can also end with a final estimator, which gives it additional methods such as predict and predict_proba.

As a result, a PdPipeline on its own does not integrate as well as sklearn.pipeline.Pipeline with some of scikit-learn's utility classes, such as sklearn.model_selection.GridSearchCV.

This module resolves such integration issues. Refer to the notebooks folder of the pdpipe repository for complete examples.
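The scikit-learn side of this distinction can be illustrated with a short sketch. It uses only scikit-learn (no pdpipe), and the toy data and step names are assumptions made up for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: one feature, two classes.
X = [[-1.0], [-1.2], [1.0], [1.1]]
y = [0, 0, 1, 1]

# An sklearn Pipeline may end with an estimator...
sk_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
sk_pipe.fit(X, y)

# ...so it exposes the final estimator's prediction methods directly.
preds = sk_pipe.predict(X)
probas = sk_pipe.predict_proba(X)
```

A chain of transformers alone has no final estimator to delegate these calls to, which is the gap the classes in this module are meant to close.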

Classes

PdPipelineAndSklearnEstimator

Bases: BaseEstimator

A PdPipeline object chained before an sklearn estimator object.

This kind of object can also be used with sklearn's GridSearchCV.

See the pipeline_and_model.ipynb notebook in the notebooks folder of the pdpipe repository for a tutorial on how to use this class.
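Since the class derives from BaseEstimator and stores pipeline and estimator as constructor arguments, scikit-learn's standard parameter introspection should apply to it. The sketch below shows that mechanism on a hypothetical stand-in class (WrapperSketch), not on pdpipe itself; the pipeline is replaced by None purely to keep the example free of pdpipe.

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression


class WrapperSketch(BaseEstimator):
    """Hypothetical stand-in that stores its constructor arguments as-is."""

    def __init__(self, pipeline, estimator):
        self.pipeline = pipeline
        self.estimator = estimator


wrapper = WrapperSketch(pipeline=None, estimator=LogisticRegression())

# BaseEstimator.get_params(deep=True) also lists the parameters of nested
# objects that implement get_params, using a double-underscore prefix.
params = wrapper.get_params(deep=True)
has_nested_c = "estimator__C" in params

# The same names can be set directly; a parameter grid such as
# {"estimator__C": [0.1, 1.0]} relies on this mechanism.
wrapper.set_params(estimator__C=0.1)
```

Under this assumption, a GridSearchCV parameter grid for the real class would address the inner model's hyperparameters with the estimator__ prefix; the notebook mentioned above remains the authoritative example.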

Parameters:

Name Type Description Default
pipeline PdPipeline

The preprocessing pipeline to connect.

required
estimator sklearn.base.BaseEstimator

The model to connect to the pipeline.

required

Attributes:

Name Type Description
pipeline PdPipeline

The preprocessing pipeline composing this pipeline+model object.

estimator sklearn.base.BaseEstimator

The sklearn model composing this pipeline+model object.

Examples:

>>> import pandas as pd; import pdpipe as pdp;
>>> from pdpipe.skintegrate import PdPipelineAndSklearnEstimator;
>>> from sklearn.linear_model import LogisticRegression;
>>> DF2 = pd.DataFrame(
...    data=[['-1',0], ['-1',0], ['1',1], ['1',1]],
...    index=[1, 2, 3, 4],
...    columns=['feature1', 'target']
... )
>>> all_x = DF2[['feature1']]
>>> all_y = DF2['target']
>>> mp = PdPipelineAndSklearnEstimator(
...    pipeline=pdp.ColumnDtypeEnforcer({'feature1': int}),
...    estimator=LogisticRegression()
... )
>>> mp.fit(all_x, all_y)
<PdPipeline -> LogisticRegression>
>>> res = mp.predict(all_x)
Source code in pdpipe/skintegrate.py
class PdPipelineAndSklearnEstimator(BaseEstimator):
    """
    A PdPipeline object chained before an sklearn estimator object.

    This kind of object can also be used with sklearn's GridSearchCV.

    See the pipeline_and_model.ipynb notebook in the notebooks folder of the
    pdpipe repository for a tutorial on how to use this class.

    Parameters
    ----------
    pipeline : PdPipeline
        The preprocessing pipeline to connect.
    estimator : sklearn.base.BaseEstimator
        The model to connect to the pipeline.

    Attributes
    ----------
    pipeline : PdPipeline
        The preprocessing pipeline composing this pipeline+model object.
    estimator : sklearn.base.BaseEstimator
        The sklearn model composing this pipeline+model object.

    Examples
    --------
    >>> import pandas as pd; import pdpipe as pdp;
    >>> from pdpipe.skintegrate import PdPipelineAndSklearnEstimator;
    >>> from sklearn.linear_model import LogisticRegression;
    >>> DF2 = pd.DataFrame(
    ...    data=[['-1',0], ['-1',0], ['1',1], ['1',1]],
    ...    index=[1, 2, 3, 4],
    ...    columns=['feature1', 'target']
    ... )
    >>> all_x = DF2[['feature1']]
    >>> all_y = DF2['target']
    >>> mp = PdPipelineAndSklearnEstimator(
    ...    pipeline=pdp.ColumnDtypeEnforcer({'feature1': int}),
    ...    estimator=LogisticRegression()
    ... )
    >>> mp.fit(all_x, all_y)
    <PdPipeline -> LogisticRegression>
    >>> res = mp.predict(all_x)
    """

    def __init__(
        self,
        pipeline: PdPipeline,
        estimator: BaseEstimator,
    ):
        self.pipeline = pipeline
        self.estimator = estimator
        # if hasattr(estimator, "score"):
        #     def _passthrough_scorer(estimator, *args, **kwargs):
        #         """Function that wraps estimator.score"""
        #         return estimator.score(*args, **kwargs)
        #     self.score = _passthrough_scorer

    def __str__(self):
        try:
            return f"<PdPipeline -> {self._est_cls_name}>"
        except AttributeError:
            self._est_cls_name = type(self.estimator).__name__
            return self.__str__()

    def __repr__(self):
        return self.__str__()

    def score(self, X, y=None):
        if y is None:
            post_X = self.pipeline.transform(X)
            # Score on the transformed features, not on the raw input.
            return self.estimator.score(post_X.values)
        if not isinstance(y, pd.Series):
            y = pd.Series(y)
        y.index = X.index
        post_X, post_y = self.pipeline.transform(X, y)
        assert len(post_X) == len(post_y)
        return self.estimator.score(post_X.values, post_y.values)

    @property
    def _estimator_type(self):
        return self.estimator._estimator_type

    @property
    def classes_(self):
        """
        Class labels.

        Only available when the estimator is a classifier.
        """
        _estimator_has("classes_")(self)
        return self.estimator.classes_

    def fit(self, X, y):
        """
        Fit the pipeline on X (and y), then fit the estimator on the result.

        Parameters
        ----------
        X : pandas.DataFrame, shape (n_samples, n_features)
            The training input samples.
        y : array-like, shape (n_samples,) or (n_samples, n_outputs)
            The target values (class labels in classification, real numbers in
            regression).

        Returns
        -------
        self : object
            Returns self.
        """
        # X, y = check_X_y(X, y, accept_sparse=True)
        if y is not None:
            if not isinstance(y, pd.Series):
                y = pd.Series(y)
            assert len(X) == len(y)
            y.index = X.index
            post_X, post_y = self.pipeline.fit_transform(X=X, y=y)
        else:
            post_X = self.pipeline.fit_transform(X)
            post_y = None
        if post_y is None:
            self.estimator.fit(X=post_X.values, y=None)
        else:
            assert len(post_X) == len(post_y)
            self.estimator.fit(X=post_X.values, y=post_y.values)
        self.is_fitted_ = True
        return self

    @available_if(_estimator_has("predict"))
    def predict(self, X):
        """
        Predict using the estimator on the pipeline-transformed input.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        y : ndarray, shape (n_samples,)
            The predicted labels or values for `X` based on the estimator with
            the best found parameters.
        """
        # X = check_array(X, accept_sparse=True)
        check_is_fitted(self, "is_fitted_")
        post_X, post_y = self.pipeline.transform(X=X, y=LabelPlaceholderForPredict(X))
        y_pred = self.estimator.predict(X=post_X.values)
        return y_pred

    @available_if(_estimator_has("predict_proba"))
    def predict_proba(self, X):
        """
        Call predict_proba on the estimator with the best found parameters.

        Only available if the underlying estimator supports
        ``predict_proba``.

        Parameters
        ----------
        X : indexable, length n_samples
            Must fulfill the input assumptions of the
            underlying estimator.

        Returns
        -------
        y_pred : ndarray of shape (n_samples,) or (n_samples, n_classes)
            Predicted class probabilities for `X` based on the estimator with
            the best found parameters. The order of the classes corresponds
            to that in the fitted attribute :term:`classes_`.
        """
        check_is_fitted(self, "is_fitted_")
        post_X, post_y = self.pipeline.transform(X=X, y=LabelPlaceholderForPredict(X))
        y_pred = self.estimator.predict_proba(X=post_X.values)
        return y_pred

    @available_if(_estimator_has("predict_log_proba"))
    def predict_log_proba(self, X):
        """Call predict_log_proba on the estimator with best found parameters.
        Only available if the underlying estimator supports
        ``predict_log_proba``.

        Parameters
        ----------
        X : indexable, length n_samples
            Must fulfill the input assumptions of the
            underlying estimator.

        Returns
        -------
        y_pred : ndarray of shape (n_samples,) or (n_samples, n_classes)
            Predicted class log-probabilities for `X` based on the estimator
            with the best found parameters. The order of the classes
            corresponds to that in the fitted attribute :term:`classes_`.
        """
        check_is_fitted(self, "is_fitted_")
        post_X, post_y = self.pipeline.transform(X=X, y=LabelPlaceholderForPredict(X))
        y_pred = self.estimator.predict_log_proba(X=post_X.values)
        return y_pred

    @available_if(_estimator_has("decision_function"))
    def decision_function(self, X):
        """
        Call decision_function on the estimator with best found parameters.

        Only available if the underlying estimator supports
        ``decision_function``.

        Parameters
        ----------
        X : indexable, length n_samples
            Must fulfill the input assumptions of the
            underlying estimator.

        Returns
        -------
        y_score : ndarray of shape (n_samples,) or (n_samples, n_classes) \
                or (n_samples, n_classes * (n_classes-1) / 2)
            Result of the decision function for `X` based on the estimator with
            the best found parameters.
        """
        check_is_fitted(self, "is_fitted_")
        post_X, post_y = self.pipeline.transform(X=X, y=LabelPlaceholderForPredict(X))
        y_score = self.estimator.decision_function(X=post_X.values)
        return y_score

Attributes

pipeline = pipeline instance-attribute
estimator = estimator instance-attribute
classes_ property

Class labels.

Only available when the estimator is a classifier.

Functions

score(X, y=None)
Source code in pdpipe/skintegrate.py
def score(self, X, y=None):
    if y is None:
        post_X = self.pipeline.transform(X)
        # Score on the transformed features, not on the raw input.
        return self.estimator.score(post_X.values)
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
    y.index = X.index
    post_X, post_y = self.pipeline.transform(X, y)
    assert len(post_X) == len(post_y)
    return self.estimator.score(post_X.values, post_y.values)
fit(X, y)

Fit the pipeline on X (and y), then fit the estimator on the result.

Parameters:

Name Type Description Default
X pandas.DataFrame, shape (n_samples, n_features)

The training input samples.

required
y array-like, shape (n_samples,) or (n_samples, n_outputs)

The target values (class labels in classification, real numbers in regression).

required

Returns:

Name Type Description
self object

Returns self.

Source code in pdpipe/skintegrate.py
def fit(self, X, y):
    """
    Fit the pipeline on X (and y), then fit the estimator on the result.

    Parameters
    ----------
    X : pandas.DataFrame, shape (n_samples, n_features)
        The training input samples.
    y : array-like, shape (n_samples,) or (n_samples, n_outputs)
        The target values (class labels in classification, real numbers in
        regression).

    Returns
    -------
    self : object
        Returns self.
    """
    # X, y = check_X_y(X, y, accept_sparse=True)
    if y is not None:
        if not isinstance(y, pd.Series):
            y = pd.Series(y)
        assert len(X) == len(y)
        y.index = X.index
        post_X, post_y = self.pipeline.fit_transform(X=X, y=y)
    else:
        post_X = self.pipeline.fit_transform(X)
        post_y = None
    if post_y is None:
        self.estimator.fit(X=post_X.values, y=None)
    else:
        assert len(post_X) == len(post_y)
        self.estimator.fit(X=post_X.values, y=post_y.values)
    self.is_fitted_ = True
    return self
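The y.index = X.index assignment in fit can be illustrated with a small pandas sketch. The motivation assumed here, that pipeline stages may drop rows and the labels must then be selectable by the same index labels, is an interpretation and is not stated in the source; the data is made up.

```python
import pandas as pd

X = pd.DataFrame({"feature1": [5, -3, 7]}, index=[10, 20, 30])
y = pd.Series([1, 0, 1])  # default index: 0, 1, 2

# Before alignment, y shares no index label with X.
shared_before = X.index.intersection(y.index)

# Mirrors the assignment in fit(): give y the same index as X.
y.index = X.index

# A row filter applied to X can now be applied to y by label.
kept_X = X[X["feature1"] > 0]
kept_y = y.loc[kept_X.index]
```

After the assignment, the rows kept in X and the labels kept in y refer to the same samples (index labels 10 and 30 here).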
predict(X)

Predict using the estimator on the pipeline-transformed input.

Parameters:

Name Type Description Default
X {array-like, sparse matrix}, shape (n_samples, n_features)

The input samples.

required

Returns:

Name Type Description
y ndarray, shape (n_samples,)

The predicted labels or values for X based on the estimator with the best found parameters.

Source code in pdpipe/skintegrate.py
@available_if(_estimator_has("predict"))
def predict(self, X):
    """
    Predict using the estimator on the pipeline-transformed input.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape (n_samples, n_features)
        The input samples.

    Returns
    -------
    y : ndarray, shape (n_samples,)
        The predicted labels or values for `X` based on the estimator with
        the best found parameters.
    """
    # X = check_array(X, accept_sparse=True)
    check_is_fitted(self, "is_fitted_")
    post_X, post_y = self.pipeline.transform(X=X, y=LabelPlaceholderForPredict(X))
    y_pred = self.estimator.predict(X=post_X.values)
    return y_pred
predict_proba(X)

Call predict_proba on the estimator with the best found parameters.

Only available if the underlying estimator supports predict_proba.

Parameters:

Name Type Description Default
X indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

required

Returns:

Name Type Description
y_pred ndarray of shape (n_samples,) or (n_samples, n_classes)

Predicted class probabilities for X based on the estimator with the best found parameters. The order of the classes corresponds to that in the fitted attribute classes_.

Source code in pdpipe/skintegrate.py
@available_if(_estimator_has("predict_proba"))
def predict_proba(self, X):
    """
    Call predict_proba on the estimator with the best found parameters.

    Only available if the underlying estimator supports
    ``predict_proba``.

    Parameters
    ----------
    X : indexable, length n_samples
        Must fulfill the input assumptions of the
        underlying estimator.

    Returns
    -------
    y_pred : ndarray of shape (n_samples,) or (n_samples, n_classes)
        Predicted class probabilities for `X` based on the estimator with
        the best found parameters. The order of the classes corresponds
        to that in the fitted attribute :term:`classes_`.
    """
    check_is_fitted(self, "is_fitted_")
    post_X, post_y = self.pipeline.transform(X=X, y=LabelPlaceholderForPredict(X))
    y_pred = self.estimator.predict_proba(X=post_X.values)
    return y_pred
predict_log_proba(X)

Call predict_log_proba on the estimator with best found parameters.

Only available if the underlying estimator supports predict_log_proba.

Parameters:

Name Type Description Default
X indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

required

Returns:

Name Type Description
y_pred ndarray of shape (n_samples,) or (n_samples, n_classes)

Predicted class log-probabilities for X based on the estimator with the best found parameters. The order of the classes corresponds to that in the fitted attribute classes_.

Source code in pdpipe/skintegrate.py
@available_if(_estimator_has("predict_log_proba"))
def predict_log_proba(self, X):
    """Call predict_log_proba on the estimator with best found parameters.
    Only available if the underlying estimator supports
    ``predict_log_proba``.

    Parameters
    ----------
    X : indexable, length n_samples
        Must fulfill the input assumptions of the
        underlying estimator.

    Returns
    -------
    y_pred : ndarray of shape (n_samples,) or (n_samples, n_classes)
        Predicted class log-probabilities for `X` based on the estimator
        with the best found parameters. The order of the classes
        corresponds to that in the fitted attribute :term:`classes_`.
    """
    check_is_fitted(self, "is_fitted_")
    post_X, post_y = self.pipeline.transform(X=X, y=LabelPlaceholderForPredict(X))
    y_pred = self.estimator.predict_log_proba(X=post_X.values)
    return y_pred
decision_function(X)

Call decision_function on the estimator with best found parameters.

Only available if the underlying estimator supports decision_function.

Parameters:

Name Type Description Default
X indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

required

Returns:

Name Type Description
y_score ndarray of shape (n_samples,) or (n_samples, n_classes) or (n_samples, n_classes * (n_classes-1) / 2)

Result of the decision function for X based on the estimator with the best found parameters.

Source code in pdpipe/skintegrate.py
@available_if(_estimator_has("decision_function"))
def decision_function(self, X):
    """
    Call decision_function on the estimator with best found parameters.

    Only available if the underlying estimator supports
    ``decision_function``.

    Parameters
    ----------
    X : indexable, length n_samples
        Must fulfill the input assumptions of the
        underlying estimator.

    Returns
    -------
    y_score : ndarray of shape (n_samples,) or (n_samples, n_classes) \
            or (n_samples, n_classes * (n_classes-1) / 2)
        Result of the decision function for `X` based on the estimator with
        the best found parameters.
    """
    check_is_fitted(self, "is_fitted_")
    post_X, post_y = self.pipeline.transform(X=X, y=LabelPlaceholderForPredict(X))
    y_score = self.estimator.decision_function(X=post_X.values)
    return y_score

Functions

available_if(check)

An attribute that is available only if check returns a truthy value.

Parameters:

Name Type Description Default
check callable

When passed the object with the decorated method, this should return a truthy value if the attribute is available, and either return False or raise an AttributeError if not available.

required

Returns:

Type Description
callable

A lambda based attribute.

Source code in pdpipe/skintegrate.py
def available_if(check):
    """
    An attribute that is available only if check returns a truthy value.

    Parameters
    ----------
    check : callable
        When passed the object with the decorated method, this should return
        a truthy value if the attribute is available, and either return False
        or raise an AttributeError if not available.

    Returns
    -------
    callable
        A lambda based attribute.
    """
    return lambda fn: _AvailableIfDescriptor(fn, check, attribute_name=fn.__name__)
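The source of _AvailableIfDescriptor is not shown on this page. The sketch below is one way such a descriptor could work, written with the standard library only; AvailableIfSketch, available_if_sketch, and the toy classes are hypothetical names and should not be read as pdpipe's implementation.

```python
from functools import partial


class AvailableIfSketch:
    """Hypothetical descriptor: expose `fn` only when `check(obj)` is truthy."""

    def __init__(self, fn, check):
        self.fn = fn
        self.check = check

    def __get__(self, obj, owner=None):
        if obj is None:
            return self.fn  # accessed on the class itself
        if not self.check(obj):
            raise AttributeError(
                f"{type(obj).__name__!r} object has no attribute "
                f"{self.fn.__name__!r}"
            )
        return partial(self.fn, obj)  # bind the function to the instance


def available_if_sketch(check):
    return lambda fn: AvailableIfSketch(fn, check)


class Wrapper:
    def __init__(self, inner):
        self.inner = inner

    @available_if_sketch(lambda self: hasattr(self.inner, "predict"))
    def predict(self, X):
        return self.inner.predict(X)


class WithPredict:
    def predict(self, X):
        return [0 for _ in X]


class WithoutPredict:
    pass


can_predict = hasattr(Wrapper(WithPredict()), "predict")
cannot_predict = hasattr(Wrapper(WithoutPredict()), "predict")
```

Because hasattr treats an AttributeError as "attribute missing", the wrapper only appears to have predict when the wrapped object does, which is the behavior duck-typing checks in scikit-learn utilities rely on.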

pdpipe_scorer_from_sklearn_scorer(scorer)

Converts an sklearn scorer to one that will work with pdpipe.

The returned scorer function can then be used with sklearn's model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV), when searching over the hyperparameter space of a PdPipelineAndSklearnEstimator object.

See the pipeline_and_model_with_test_test.ipynb notebook in the notebooks folder of the pdpipe repository for a complete example.

Parameters:

Name Type Description Default
scorer callable

A function with the signature scorer(estimator, X, y). To build one from an sklearn score function (with a signature of the form score(y_true, y_pred, ...)) use the sklearn.metrics.make_scorer function.

required

Returns:

Name Type Description
pdpipe_scorer callable

A scorer that is aware of the fact that PdPipelineAndSklearnEstimator has an inner pipeline object that should be used to transform input X (which is a dataframe when using pdpipe, and not a numpy.ndarray).

Source code in pdpipe/skintegrate.py
def pdpipe_scorer_from_sklearn_scorer(scorer: Callable) -> Callable:
    """
    Converts an sklearn scorer to one that will work with pdpipe.

    The returned scorer function can then be used with sklearn's
    model-evaluation tools using cross-validation (such as
    model_selection.cross_val_score and model_selection.GridSearchCV), when
    searching over the hyperparameter space of a PdPipelineAndSklearnEstimator
    object.

    See the pipeline_and_model_with_test_test.ipynb notebook in the notebooks
    folder of the pdpipe repository for a complete example.

    Parameters
    ----------
    scorer : callable
        A function with the signature `scorer(estimator, X, y)`. To build one
        from an sklearn `score` function (with a signature of the form
        `score(y_true, y_pred, ...)`) use the `sklearn.metrics.make_scorer`
        function.

    Returns
    -------
    pdpipe_scorer : callable
        A scorer that is aware of the fact that PdPipelineAndSklearnEstimator
        has an inner pipeline object that should be used to transform input
        X (which is a dataframe when using pdpipe, and not a numpy.ndarray).
    """
    return _PdPipeScorer(scorer)
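The _PdPipeScorer class is not shown on this page. Based only on the description of the return value, a wrapper of this kind might transform X and y through the estimator's inner pipeline and then hand the transformed data to the original scorer. The sketch below illustrates that idea with the standard library; every class and function in it is a hypothetical stand-in.

```python
class ScorerWrapperSketch:
    """Hypothetical wrapper: transform X, y with the inner pipeline, then score."""

    def __init__(self, scorer):
        self.scorer = scorer

    def __call__(self, estimator, X, y):
        post_X, post_y = estimator.pipeline.transform(X, y)
        return self.scorer(estimator.estimator, post_X, post_y)


# Toy stand-ins (all hypothetical) for a pipeline, a model and a scorer.
class DropNegatives:
    def transform(self, X, y):
        kept = [(x, t) for x, t in zip(X, y) if x >= 0]
        return [x for x, _ in kept], [t for _, t in kept]


class ThresholdModel:
    def predict(self, X):
        return [int(x > 5) for x in X]


class PipelineAndModel:
    def __init__(self, pipeline, estimator):
        self.pipeline = pipeline
        self.estimator = estimator


def accuracy_scorer(model, X, y):
    preds = model.predict(X)
    return sum(p == t for p, t in zip(preds, y)) / len(y)


wrapped = ScorerWrapperSketch(accuracy_scorer)
combined = PipelineAndModel(DropNegatives(), ThresholdModel())
score = wrapped(combined, [-1, 2, 8, 9], [1, 0, 1, 0])
```

In this toy run the pipeline drops the negative sample, so the scorer sees three samples and two of the three predictions match, giving a score of 2/3.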

Last update: 2022-01-19