# Building Pipelines: Generator Functions vs. Constructors
This is a short one, showcasing two ways to dynamically build complex, parameterized pipelines.
## Generator functions for pipelines
One way to generate a complex pipeline is to define a generator function whose signature includes both the parameters that determine the structure of the pipeline (such as whether to include a specific pipeline stage, or a whole section of the pipeline) and the parameters the inner pipeline stages need when they are initialized.
```python
from typing import List

import pandas as pd
import pdpipe as pdp


class Halfer:

    def __init__(self, columns_to_halve: List[object]) -> None:
        self.columns_to_halve = columns_to_halve

    def __call__(self, row: pd.Series) -> pd.Series:
        new = {
            f'{lbl}/2': row[lbl] / 2
            for lbl in self.columns_to_halve
        }
        return pd.Series(new)


def pipeline_constructor(
    columns_to_drop: List[object],
    columns_to_half: List[object],
    scale: bool,
) -> pdp.PdPipeline:
    """Constructs my pandas dataframe-processing pipeline, according to some
    input arguments.

    Parameters
    ----------
    columns_to_drop : list of objects
        A list of the labels of the columns to drop. Any Python object that
        can be used as a pandas label can be included in the list.
    columns_to_half : list of objects
        A list of the labels of the columns to halve. For each such column,
        an additional new column, containing its halved values, is generated.
        Each new column has the label "x/2", where "x" is the label of the
        corresponding original column. Any Python object that can be used as
        a pandas label can be included in the list.
    scale : bool
        If True, the last pipeline stage min-max scales all numerical
        columns. Otherwise, no such pipeline stage is appended to the
        pipeline.

    Returns
    -------
    pipeline : pdpipe.PdPipeline
        The resulting pipeline constructed by this constructor.
    """
    stages = [
        pdp.ColDrop(columns_to_drop),
        pdp.ApplyToRows(
            func=Halfer(columns_to_half),
            follow_column='years',
        ),
    ]
    if scale:
        stages.append(pdp.Scale('MinMaxScaler'))
    return pdp.PdPipeline(stages)
```
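The generator-function pattern itself is not pdpipe-specific. Here is a minimal, dependency-free sketch of the same idea, using hypothetical stages that transform plain dicts: structural parameters decide which stages go into the list, and the result is a single composed pipeline callable.

```python
from typing import Callable, List

# A "stage" here is just a function transforming one record (dict) to another.
Stage = Callable[[dict], dict]


def make_pipeline(keys_to_drop: List[str], scale: bool) -> Stage:
    """Build a pipeline of record-transforming stages from input arguments."""

    def drop_keys(record: dict) -> dict:
        # Parameterized stage: drops the configured keys.
        return {k: v for k, v in record.items() if k not in keys_to_drop}

    def max_scale(record: dict) -> dict:
        # Toy "scaling" stage: divide every numeric value by the maximum.
        top = max(v for v in record.values() if isinstance(v, (int, float)))
        return {
            k: (v / top if isinstance(v, (int, float)) else v)
            for k, v in record.items()
        }

    stages: List[Stage] = [drop_keys]
    if scale:  # structural parameter: this stage is included only on demand
        stages.append(max_scale)

    def pipeline(record: dict) -> dict:
        for stage in stages:
            record = stage(record)
        return record

    return pipeline
```

For example, `make_pipeline(['id'], scale=True)` returns a callable that maps `{'id': 7, 'a': 2, 'b': 4}` to `{'a': 0.5, 'b': 1.0}`.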
## Constructors for pipelines
Another option to achieve the same result is to directly extend the `pdpipe.PdPipeline` class. The created stages can be sent to the constructor of the super class, as can any extra keyword arguments, allowing you to preserve all the functionality of the `PdPipeline` class.
```python
from typing import List

import pandas as pd
import pdpipe as pdp


class Halfer:

    def __init__(self, columns_to_halve: List[object]) -> None:
        self.columns_to_halve = columns_to_halve

    def __call__(self, row: pd.Series) -> pd.Series:
        new = {
            f'{lbl}/2': row[lbl] / 2
            for lbl in self.columns_to_halve
        }
        return pd.Series(new)


class MyPipeline(pdp.PdPipeline):

    def __init__(
        self,
        columns_to_drop: List[object],
        columns_to_half: List[object],
        scale: bool,
        **kwargs: object,  # (1)
    ) -> None:
        """My pandas dataframe-processing pipeline, constructed according to
        some input arguments.

        Parameters
        ----------
        columns_to_drop : list of objects
            A list of the labels of the columns to drop. Any Python object
            that can be used as a pandas label can be included in the list.
        columns_to_half : list of objects
            A list of the labels of the columns to halve. For each such
            column, an additional new column, containing its halved values,
            is generated. Each new column has the label "x/2", where "x" is
            the label of the corresponding original column. Any Python object
            that can be used as a pandas label can be included in the list.
        scale : bool
            If True, the last pipeline stage min-max scales all numerical
            columns. Otherwise, no such pipeline stage is appended to the
            pipeline.
        """
        stages = [
            pdp.ColDrop(columns_to_drop),
            pdp.ApplyToRows(
                func=Halfer(columns_to_half),
                follow_column='years',
            ),
        ]
        if scale:
            stages.append(pdp.Scale('MinMaxScaler'))
        super().__init__(stages=stages, **kwargs)
```
- This is the correct way to type-hint the `**kwargs` variadic keyword argument. We only need to hint the type of the values in the `kwargs` dict, and thus, if we don't want to constrain them at all, we type-hint with `object`.
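The subclassing pattern also generalizes beyond pdpipe. A minimal sketch, using a hypothetical `Pipeline` base class over plain dicts, shows the key move: the subclass builds its own stages from its structural parameters, then forwards them, along with any extra `**kwargs`, to `super().__init__()`, so base-class options remain available to callers.

```python
from typing import Callable, List

Stage = Callable[[dict], dict]


class Pipeline:
    """Hypothetical base class: applies stages in order; has its own options."""

    def __init__(self, stages: List[Stage], name: str = 'pipeline') -> None:
        self.stages = stages
        self.name = name

    def __call__(self, record: dict) -> dict:
        for stage in self.stages:
            record = stage(record)
        return record


class DropperPipeline(Pipeline):
    """Subclass that builds its stages itself, forwarding extras to the base."""

    def __init__(self, keys_to_drop: List[str], **kwargs: object) -> None:
        stages: List[Stage] = [
            lambda record: {
                k: v for k, v in record.items() if k not in keys_to_drop
            },
        ]
        # **kwargs lets callers still set base-class options, e.g. name=...
        super().__init__(stages=stages, **kwargs)
```

A caller can now write `DropperPipeline(['id'], name='dropper')` and get both the subclass's behavior and the base class's `name` option.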
That's it!
## Getting help
Remember you can get help on our Gitter chat or on our GitHub Discussions forum.