Standardizing Dataframes
One of the most important roles of processing pipelines is to standardize their output, and to make sure that the assumptions made about that output by the models consuming it are kept.
pdpipe provides several pipeline stages that relate to this role: the
Schematize stage, the ColumnDtypeEnforcer stage and the ConditionValidator
stage. Let's take a look.
Schematization
The Schematize pipeline stage provides a concise way to align any dataframe
passing through it to a specific column schema.
>>> import pandas as pd; import pdpipe as pdp;
>>> df = pd.DataFrame([[2, 4, 8],[3, 6, 9]], [1, 2], ['a', 'b', 'c'])
>>> pdp.Schematize(['a', 'c']).apply(df)
a c
1 2 8
2 3 9
>>> pdp.Schematize(['c', 'b']).apply(df)
c b
1 8 4
2 9 6
Its perhaps more useful mode is the adaptive mode, activated by passing None as
the first parameter, columns. In the adaptive mode, the stage learns
the schema of the dataframe passed through it on a fit_transform call, and
applies it to any future dataframe passed through it in transform calls.
>>> import pandas as pd; import pdpipe as pdp;
>>> df1 = pd.DataFrame([[9, 4],[5, 5]], [1, 2], ['a', 'c'])
>>> df1
a c
1 9 4
2 5 5
>>> schematize = pdp.Schematize(None)
>>> schematize(df1) # (1)
a c
1 9 4
2 5 5
>>> df2 = pd.DataFrame([[2, 4, 8],[3, 6, 9]], [1, 2], ['a', 'b', 'c'])
>>> df2
a b c
1 2 4 8
2 3 6 9
>>> schematize(df2) # (2)
a c
1 2 8
2 3 9
(1) Using the stage as a callable is akin to calling apply, and since the stage is still unfitted, fit_transform is called and the schema is learned. The input dataframe is thus returned unchanged.
(2) On the second call the stage is already fitted, so transform is called internally, and the input dataframe is coerced into the schema the stage has learned.
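To make the fit-then-apply behavior of the adaptive mode concrete, here is a minimal sketch in plain pandas. The AdaptiveSchema class is a hypothetical helper written for illustration only; it is not pdpipe's implementation and ignores everything but column selection and order.

```python
import pandas as pd


class AdaptiveSchema:
    """Hypothetical helper (not part of pdpipe): learns a column schema once."""

    def __init__(self):
        self.columns = None  # learned on the first call

    def __call__(self, df):
        if self.columns is None:
            # First call: remember the schema and return the input unchanged.
            self.columns = list(df.columns)
            return df
        # Later calls: keep only the learned columns, in the learned order.
        return df[self.columns]


df1 = pd.DataFrame([[9, 4], [5, 5]], [1, 2], ['a', 'c'])
df2 = pd.DataFrame([[2, 4, 8], [3, 6, 9]], [1, 2], ['a', 'b', 'c'])

schema = AdaptiveSchema()
first = schema(df1)   # schema ['a', 'c'] is learned here
second = schema(df2)  # column 'b' is dropped here
print(list(second.columns))
```

Unlike this sketch, the real stage plugs into pdpipe pipelines and exposes the usual fit_transform and transform calls described above.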
Enforcing data types
The ColumnDtypeEnforcer stage allows us to coerce dataframe columns into a
desired datatype, with some optional powerful capabilities.
In its most basic use, we just provide the stage with a dictionary mapping column labels to the dtype they should be coerced into; columns not included in this mapping remain untouched:
>>> import pandas as pd; import pdpipe as pdp;
>>> df = pd.DataFrame([[8,'a'],[5,'b']], [1,2], ['num', 'initial'])
>>> df
num initial
1 8 a
2 5 b
>>> pdp.ColumnDtypeEnforcer({'num': float}).apply(df)
num initial
1 8.0 a
2 5.0 b
However, the more interesting way in which ColumnDtypeEnforcer can be used is
by using column qualifier objects to describe criteria for groups of columns to
cast to a certain dtype. This synergizes with pdpipe's capability for building
pipelines that are resilient to shifts in data schema, and can be written in a
generalizable way.
For example, here we build a dtype enforcer that will coerce into float any
column with a label starting with 'n':
>>> import pandas as pd; import pdpipe as pdp;
>>> df = pd.DataFrame([[8,'a'],[5,'b']], [1,2], ['num', 'initial'])
>>> pdp.ColumnDtypeEnforcer({pdp.cq.StartsWith('n'): float}).apply(df)
num initial
1 8.0 a
2 5.0 b
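The idea behind qualifier-based enforcement can be sketched in plain pandas: a rule pairs a predicate over column labels with a dtype, and the predicate is evaluated against whatever columns the incoming dataframe happens to have. The names enforce_dtypes and starts_with_n are hypothetical, introduced only for this illustration, and the sketch does not reflect how pdpipe implements column qualifiers.

```python
import pandas as pd


def enforce_dtypes(df, rules):
    """Hypothetical helper: rules maps a label predicate to a target dtype."""
    dtype_map = {}
    for predicate, dtype in rules.items():
        for label in df.columns:
            # Resolve the predicate against the columns actually present.
            if predicate(label):
                dtype_map[label] = dtype
    # Columns matched by no predicate are left untouched by astype.
    return df.astype(dtype_map)


def starts_with_n(label):
    """Stand-in for a 'starts with n' column qualifier."""
    return str(label).startswith('n')


df = pd.DataFrame([[8, 'a'], [5, 'b']], [1, 2], ['num', 'initial'])
out = enforce_dtypes(df, {starts_with_n: float})
print(out.dtypes)
```

Because the predicate is resolved per dataframe, the same rule keeps working if a new column such as 'n_items' appears later, which is the resilience to schema shifts mentioned above.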
Tip
Remember, column qualifiers are powerful objects, and pdpipe boasts
built-in qualifiers that can help us choose columns by their data types or
by the rate of missing values they have. See more in
the section introducing column qualifiers.
Validating conditions
The ConditionValidator stage allows us to make sure various conditions
hold for input dataframes.
The most straightforward way to use it is to provide it with a function - or a
list of them - that return True or False for input dataframes:
>>> import pandas as pd; import pdpipe as pdp;
>>> df = pd.DataFrame([[1,4],[4,None],[1,11]], [1,2,3], ['a','b'])
>>> df
a b
1 1 4.0
2 4 NaN
3 1 11.0
>>> validator = pdp.ConditionValidator(lambda df: len(df.columns) == 5)
>>> validator(df)
Traceback (most recent call last):
...
pdpipe.exceptions.FailedConditionError: ConditionValidator stage failed; some conditions did not hold for the input dataframe!
But again, pdpipe includes a special built-in type that makes this stage more
powerful: in this case, Condition objects, defined in the pdpipe.cond module.
For example:
>>> df
a b
1 1 4.0
2 4 NaN
3 1 11.0
>>> validator = pdp.ConditionValidator(pdp.cond.HasNoMissingValues())
>>> validator(df)
Traceback (most recent call last):
...
pdpipe.exceptions.FailedConditionError: ConditionValidator stage failed; some conditions did not hold for the input dataframe!
The cond module includes other useful conditions, such as
HasAtMostMissingValues, HasAllColumns and per-column conditions.
Additionally, condition objects support all boolean operators, so both
~cond.HasAllColumns(['a', 'b']) and
cond.HasAtMostMissingValues(0.1) & cond.HasNoColumn('forbidden_column') are valid
complex conditions that can be fed to ConditionValidator.
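As a rough illustration of how boolean operators can combine conditions, here is a small sketch in plain pandas. The Cond class below is a hypothetical stand-in, not pdpipe's Condition class; it only shows how overloading the &, | and ~ operators lets simple dataframe checks compose into larger ones.

```python
import pandas as pd


class Cond:
    """Hypothetical stand-in for a condition object; not pdpipe's class."""

    def __init__(self, func):
        self.func = func  # a callable mapping a dataframe to a truth value

    def __call__(self, df):
        return bool(self.func(df))

    def __and__(self, other):
        return Cond(lambda df: self(df) and other(df))

    def __or__(self, other):
        return Cond(lambda df: self(df) or other(df))

    def __invert__(self):
        return Cond(lambda df: not self(df))


# Two simple checks, loosely mirroring HasAllColumns and HasNoMissingValues.
has_a_and_b = Cond(lambda df: {'a', 'b'} <= set(df.columns))
no_missing = Cond(lambda df: not df.isna().any().any())

df = pd.DataFrame([[1, 4], [4, None], [1, 11]], [1, 2, 3], ['a', 'b'])

# Holds when both columns exist AND the dataframe has missing values.
combined = has_a_and_b & ~no_missing
print(combined(df))
```

Each operator returns a new condition rather than a boolean, so composition stays lazy and the combined object can be evaluated against any dataframe later.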
You can read more about condition objects in our Getting Started section.
That's it!
Getting help
Remember you can get help on our Gitter chat or on our GitHub Discussions forum.