Starting out with pdpipe
So what does using pdpipe look like? Let's first import pandas and pdpipe, and initialize a nice little dataframe:
import pandas as pd
raw_df = pd.DataFrame(
data=[
[42, 23, 'Jo', 'M', True, False, 0.07, 'USA', 'Living life to its fullest'],
[81, 23, 'Dana', 'F', True, True, 0.3, 'USA', 'the pen is mightier then the sword'],
[11, 25, 'Bo', 'M', False, True, 2.3, 'Greece', 'all for one and one for all'],
[14, 44, 'Derek', 'M', True, True, 1.1, 'Denmark', 'every life is precious'],
[22, 72, 'Regina', 'F', True, False, 7.1, 'Greece', 'all of you get off my porch'],
[48, 50, 'Jim', 'M', False, False, 0.2, 'Germany', 'boy do I love dogs and cats'],
[50, 80, 'Richy', 'M', False, True, 100.2, 'Finland', 'I love Euro bills'],
[80, 80, 'Wealthus', 'F', False, True, 123.2, 'Finland', 'In Finance We Trust'],
],
columns=['Id', 'Age', 'Name', 'Gender', 'Smoking', 'Runs', 'Savings', 'Country', 'Quote'],
)
This results in the following dataframe:
Constructing pipelines
We can create different pipeline stage objects by calling their constructors,
which can of course be identified by their camel-cased names, such as
pdp.ColDrop for dropping columns and pdp.Encode for encoding them.
To build a pipeline, we will usually call the PdPipeline class constructor
and provide it with a list of pipeline stage objects:
import pdpipe as pdp
from pdpipe import df
pipeline = pdp.PdPipeline([
df.set_index('Id'),
pdp.ColDrop('Name'),
df.drop_rows_where['Savings'] > 100,
df['Healthy'] << df['Runs'] & ~df['Smoking'],
pdp.Bin({'Savings': [1]}, drop=False),
pdp.Scale('StandardScaler'),
pdp.TokenizeText('Quote'),
pdp.SnowballStem('EnglishStemmer', columns=['Quote']),
pdp.RemoveStopwords('English', 'Quote'),
pdp.Encode('Gender'),
pdp.OneHotEncode('Country'),
])
pdpipe's df handle
pdpipe has a powerful handle named df, which can be used in several ways:
- Creating column assignment pipeline stages that use series-level operators and functions, such as with df['c'] << df['a'] + df['b'].map({1: 3, 2: 4}).
- Creating pipeline stages from pandas.DataFrame methods that represent dataframe-to-dataframe transforms, such as set_index, fillna, rename, etc.
- Using custom fly handles such as drop_rows_where and keep_rows_where, such as with (df.drop_rows_where['a'] < 4) & (df.drop_rows_where['b'] > 12).
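To make the column-assignment use concrete, here is a minimal, library-free sketch of the idea behind such a handle: expressions written against it are recorded rather than evaluated, and only run once an actual table is supplied. The names Expr, Assign and Handle, and the dict-of-lists stand-in for a dataframe, are hypothetical illustrations and not pdpipe's actual classes or implementation.

```python
class Expr:
    """A deferred column expression: a function from a table to a list of values."""

    def __init__(self, func, desc, name=None):
        self.func, self.desc, self.name = func, desc, name

    def __add__(self, other):
        # Element-wise addition, evaluated only when a table is supplied.
        return Expr(
            lambda t: [a + b for a, b in zip(self.func(t), other.func(t))],
            f"{self.desc} + {other.desc}",
        )

    def map(self, mapping):
        # Deferred, simplified analogue of mapping values through a dict.
        return Expr(
            lambda t: [mapping.get(v) for v in self.func(t)],
            f"{self.desc}.map({mapping})",
        )

    def __lshift__(self, other):
        # `target << expression` builds a stage; nothing is computed yet.
        return Assign(self.name, other)


class Assign:
    """A stage that writes the value of a deferred expression into a column."""

    def __init__(self, target, expr):
        self.target, self.expr = target, expr

    def __repr__(self):
        return f"Assign column {self.target} with {self.expr.desc}"

    def apply(self, table):
        out = dict(table)  # shallow copy; the input table is left untouched
        out[self.target] = self.expr.func(table)
        return out


class Handle:
    """Hypothetical stand-in for a df-like handle."""

    def __getitem__(self, name):
        return Expr(lambda t: list(t[name]), f"df[{name}]", name=name)


df = Handle()
# `+` binds tighter than `<<`, so the whole sum is assigned to column 'c'.
stage = df['c'] << df['a'] + df['b'].map({1: 3, 2: 4})

table = {'a': [10, 20], 'b': [1, 2]}
result = stage.apply(table)
```

In this toy model, the stage object exists before any data does, which is what allows such expressions to be placed in a list of pipeline stages and applied later.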
Chaining constructor syntax
pdpipe also has a chaining syntax that you can use to construct pipelines
with characteristic one-liners (although admittedly, it is mainly
convenient for the creation of simple, short pipelines, in dynamic Python
shells):
pipeline = pdp.ColDrop('Name').RowDrop({'Savings': lambda x: x > 100}) \
.Bin({'Savings': [1]}, drop=False).Scale('StandardScaler') \
.TokenizeText('Quote').SnowballStem('EnglishStemmer', columns=['Quote']) \
.RemoveStopwords('English', 'Quote').Encode('Gender').OneHotEncode('Country')
Note: All pipeline stage constructors are available in this way, but
some advanced handles such as df, and fly handles such as
drop_rows_where, are not available through this syntax.
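Printing the pipeline object lists its stages in order, each with its index. For the pipeline built with the PdPipeline constructor above, this gives: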
A pdpipe pipeline:
[ 0] Apply dataframe method set_index with kwargs {}
[ 1] Drop columns Name
[ 2] Drop rows by qualifier <RowQualifier: Qualify rows with df[Savings] > 100>
[ 3] Assign column Healthy with df[Runs] & ~df[Smoking]
[ 4] Bin Savings by [1].
[ 5] Scale columns Columns of dtypes <class 'numpy.number'>
[ 6] Tokenize Quote
[ 7] Stemming tokens in Quote...
[ 8] Remove stopwords from Quote
[ 9] Encode Gender
[10] One-hot encode Country
Pipeline slicing
The numbers presented in square brackets are the indices of the
corresponding pipeline stages, and they can be used to retrieve either the
specific pipeline stage objects composing the pipeline, e.g. with
pipeline[5], or sub-pipelines composed of sub-sequences of
the pipeline, e.g. with pipeline[2:6].
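The indexing behaviour described here mirrors Python's usual sequence protocol. As a rough, library-free model (the ToyPipeline class and the string stages are hypothetical stand-ins, not pdpipe's implementation), an integer index returns a single stage, while a slice returns a new, shorter pipeline:

```python
class ToyPipeline:
    """A minimal stand-in for a pipeline: an ordered list of stages."""

    def __init__(self, stages):
        self.stages = list(stages)

    def __len__(self):
        return len(self.stages)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice yields a new pipeline over the selected stages.
            return ToyPipeline(self.stages[index])
        # An integer yields the stage object itself.
        return self.stages[index]

    def __repr__(self):
        lines = [f"[{i:2d}] {stage}" for i, stage in enumerate(self.stages)]
        return "\n".join(lines)


toy = ToyPipeline([
    "set_index", "ColDrop", "drop_rows_where", "Healthy", "Bin", "Scale",
    "TokenizeText", "SnowballStem", "RemoveStopwords", "Encode", "OneHotEncode",
])

single_stage = toy[5]     # the stage at index 5
sub_pipeline = toy[2:6]   # a pipeline of the stages at indices 2, 3, 4 and 5
```

As with any Python slice, the end index is exclusive, so a [2:6] slice stops at the stage with index 5.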
Applying pipelines
The pipeline can now be applied to an input dataframe using the apply
method. We will also set the verbose keyword to True to show
informative prints and the progress of dataframe processing, stage by stage:
res = pipeline.apply(raw_df, verbose=True)
- set_index: Apply dataframe method set_index with kwargs {}
- Drop columns Name
- Drop rows by qualifier <RowQualifier: Qualify rows with df[Savings] > 100>
2 rows dropped.
- Assign column Healthy with df[Runs] & ~df[Smoking]
- Bin Savings by [1].
Savings: 100%|██████████████████████████████████| 1/1 [00:00<00:00, 158.35it/s]
- Scale columns Columns of dtypes <class 'numpy.number'>
- Tokenize Quote
- Stemming tokens in Quote...
- Remove stopwords from Quote
- Encode Gender
100%|█████████████████████████████████████| 1/1 [00:00<00:00, 297.36it/s]
- One-hot encode Country
Country: 100%|█████████████████████████████████████| 1/1 [00:00<00:00, 240.78it/s]
We will thus get the dataframe below. We can see that all numerical columns were scaled, the Country
column was one-hot-encoded, Savings also got a binned version, and the textual Quote
column underwent some word-level manipulations:
Fit and transform
Pipelines are also callable objects themselves, so calling pipeline(df) is
equivalent to calling pipeline.apply(df).
Additionally, pipelines inherently have a fit state. If none of the stages
composing them is fittable in nature, this doesn't make much of a difference,
but many stages, like encoders, scalers and so forth, have a fit_transform
vs transform logic.
apply() vs fit_transform()
The apply pipeline method uses either fit_transform or transform
in an intelligent and sensible way: if the pipeline is not fitted, calling
it is equivalent to calling fit_transform, while if it is fitted, the
call is practically a transform call.
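This dispatch rule can be modelled in a few lines. The following is a simplified sketch under stated assumptions, not pdpipe's code: ToyScaler, its is_fitted flag and the list-of-numbers input are all hypothetical.

```python
class ToyScaler:
    """A fittable stage that divides values by the maximum seen when fitting."""

    def __init__(self):
        self.is_fitted = False
        self.max_ = None

    def fit_transform(self, values):
        # Learn the parameter from this input, then transform it.
        self.max_ = max(values)
        self.is_fitted = True
        return self.transform(values)

    def transform(self, values):
        # Reuse the parameter learned earlier; nothing is re-fitted.
        return [v / self.max_ for v in values]

    def apply(self, values):
        # Unfitted: behave like fit_transform. Fitted: behave like transform.
        if self.is_fitted:
            return self.transform(values)
        return self.fit_transform(values)

    # Calling the object is the same as calling apply().
    __call__ = apply


scaler = ToyScaler()
first = scaler.apply([1, 2, 4])        # fits on this input: max_ becomes 4
second = scaler.apply([2, 8])          # already fitted: still divides by 4
refit = scaler.fit_transform([2, 8])   # forces a new fit: max_ becomes 8
```

In this model, the second apply call reuses the parameter learned from the first input, whereas an explicit fit_transform call always re-learns it from the data it is given.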
Let's say we want to utilize pdpipe's powerful slicing syntax to apply only
some of the pipeline stages to the raw dataframe. We will use the
fit_transform method of the pipeline itself to force all encompassed pipeline
stages to fit-transform themselves. Here, we will use pipeline[4:7]
to apply the binning, scaling and tokenization stages only: