Column Qualifiers
You can find an introduction to Column Qualifiers in our Getting Started section.
An Introduction to Column Qualifiers
Column qualifiers for pdpipe.
Classes
UnfittedColumnQualifierError
ColumnQualifier
Bases: object
A fittable qualifier that returns column labels from an input dataframe.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func | callable | A callable that, given an input pandas.DataFrame object, returns a list of labels of a subset of the columns of the input dataframe. | required |
fittable | bool | If set to false, this qualifier becomes unfittable, and | True |
subset | bool | If set to true, fitted qualifiers return the subset of fitted columns found in input dataframes during transform, in the order they appeared when fitted (NOT in the order they appear in the input dataframe of the transform). False by default, which means fitted qualifiers return the FULL list of fitted columns, ignoring input dataframes completely on transforms. When combined with most pipeline stages, this means the stage will fail on its precondition when asked to transform a dataframe that is missing some of the fitted columns. | False |
Examples:
>>> import numpy as np; import pdpipe as pdp;
>>> cq = pdp.cq.ColumnQualifier(lambda df: [
... l for l, s in df.items()
... if s.dtype == np.int64 and l in ['a', 'b', 5]
... ])
>>> cq
<ColumnQualifier: Qualify columns by function>
>>> col_drop = pdp.ColDrop(columns=cq)
Source code in pdpipe/cq.py
Functions
fit_transform(X)
Fit this qualifier and return the labels of the qualifying columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | pandas.DataFrame | The input dataframe, from which columns are selected. | required |
Returns:
Type | Description |
---|---|
list of objects | A list of labels of the qualified columns for the input dataframe. |
fit(X)
Fit this qualifier on the input dataframe.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | pandas.DataFrame | The input dataframe, from which columns are selected. | required |
transform(X)
Apply this qualifier and return the labels of the qualifying columns.
If this ColumnQualifier is fittable and has been fitted, it returns the list of column labels determined when it was fitted (or the subset of that list that can be found in the input dataframe). If it is fittable but has not been fitted, it throws an exception.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | pandas.DataFrame | The input dataframe, from which columns are selected. | required |
Returns:
Type | Description |
---|---|
list of objects | A list of labels of the qualified columns for the input dataframe. |
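The fit and transform semantics described above can be illustrated with a small stand-alone sketch. It is not pdpipe's implementation: `SketchQualifier` and `all_columns` are hypothetical names, only pandas is used, and a plain `RuntimeError` stands in for whatever exception the library raises on an unfitted qualifier.

```python
import pandas as pd


class SketchQualifier:
    """Toy stand-in for a fittable qualifier; NOT pdpipe's implementation."""

    def __init__(self, func, fittable=True, subset=False):
        self.func = func
        self.fittable = fittable
        self.subset = subset
        self.fitted_labels = None  # filled in by fit()

    def fit(self, X):
        # Remember the labels that qualified on the fitting dataframe.
        self.fitted_labels = list(self.func(X))

    def transform(self, X):
        if not self.fittable:
            # Unfittable: re-evaluate the function on every dataframe.
            return list(self.func(X))
        if self.fitted_labels is None:
            raise RuntimeError("qualifier was never fitted")
        if self.subset:
            # Keep the fitted order, dropping labels missing from X.
            return [c for c in self.fitted_labels if c in X.columns]
        # Default: return the full fitted list, ignoring X entirely.
        return list(self.fitted_labels)

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)


def all_columns(frame):
    """Qualify every column of the given dataframe."""
    return list(frame.columns)


df = pd.DataFrame([[8, 1], [5, 2]], columns=["a", "b"])
df2 = pd.DataFrame([[8, 1], [5, 2]], columns=["b", "c"])

full = SketchQualifier(all_columns)
fitted_on_df = full.fit_transform(df)      # ['a', 'b']
full_on_df2 = full.transform(df2)          # ['a', 'b'], df2 is ignored

sub = SketchQualifier(all_columns, subset=True)
sub.fit(df)
subset_on_df2 = sub.transform(df2)         # ['b']

unfittable = SketchQualifier(all_columns, fittable=False)
unfittable_on_df2 = unfittable.transform(df2)  # ['b', 'c']
```

The three results on `df2` mirror the three modes: the default returns the fitted list unchanged, `subset=True` filters it to the columns still present, and `fittable=False` recomputes from scratch.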
AllColumns
Bases: ColumnQualifier
Select all columns in input dataframes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
**kwargs | | Accepts all keyword arguments of the constructor of ColumnQualifier. See the documentation of | {} |
Examples:
>>> import pandas as pd; import pdpipe as pdp;
>>> df = pd.DataFrame([[8,1],[5,2]], [1,2], ['a', 'b'])
>>> cq = pdp.cq.AllColumns()
>>> cq
<ColumnQualifier: Qualify all columns>
>>> cq(df)
['a', 'b']
>>> df2 = pd.DataFrame([[8,1],[5,2]], [1,2], ['b', 'c'])
>>> cq(df2)
['a', 'b']
>>> cq = pdp.cq.AllColumns(fittable=False)
>>> cq(df)
['a', 'b']
>>> cq(df2)
['b', 'c']
>>> cq = pdp.cq.AllColumns(subset=True)
>>> cq(df)
['a', 'b']
>>> cq(df2)
['b']
ByColumnCondition
Bases: ColumnQualifier
A fittable column qualifier based on a per-column condition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cond | callable | A callable that, given an input pandas.Series object, returns a boolean value. | required |
safe | bool | If set to True, every call to given condition | False |
**kwargs | | Additionally accepts all keyword arguments of the constructor of ColumnQualifier. See the documentation of | {} |
Examples:
>>> import pandas as pd; import pdpipe as pdp;
>>> df = pd.DataFrame(
... [[1, 2, 'A'],[4, 1, 'C']], [1,2], ['age', 'count', 'grade'])
>>> cq = pdp.cq.ByColumnCondition(lambda s: s.sum() > 3, safe=True)
>>> cq(df)
['age']
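A per-column condition can be approximated in plain pandas as below. This is only a sketch: `columns_matching` is a hypothetical helper, and the handling of `safe` (a condition that raises counts as a non-match) is an assumption that happens to be consistent with the example above, where the string column `grade` does not qualify.

```python
import pandas as pd


def columns_matching(df, cond, safe=False):
    """Return the labels of columns whose Series satisfies cond."""
    qualified = []
    for label, series in df.items():
        try:
            ok = bool(cond(series))
        except Exception:
            if not safe:
                raise
            ok = False  # assumption: a failing column simply does not qualify
        if ok:
            qualified.append(label)
    return qualified


df = pd.DataFrame(
    [[1, 2, "A"], [4, 1, "C"]], index=[1, 2], columns=["age", "count", "grade"]
)
# Comparing the summed 'grade' strings with 3 raises, so safe=True skips it.
result = columns_matching(df, lambda s: s.sum() > 3, safe=True)
```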
ByLabels
Bases: ColumnQualifier
Select all columns with the given label or labels.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
labels | single label or list-like | Column labels which qualify. | required |
**kwargs | | Additionally accepts all keyword arguments of the constructor of ColumnQualifier. See the documentation of | {} |
Examples:
>>> import pandas as pd; import pdpipe as pdp;
>>> df = pd.DataFrame(
... [[8,'a',5],[5,'b',7]], [1,2], ['num', 'chr', 'nur'])
>>> cq = pdp.cq.ByLabels('num')
>>> cq(df)
['num']
>>> cq = pdp.cq.ByLabels(['chr', 'nur'])
>>> cq(df)
['chr', 'nur']
>>> cq = pdp.cq.ByLabels(['num', 'foo'])
>>> cq(df)
['num']
StartsWith
Bases: ColumnQualifier
Select all columns that start with the given string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prefix | str | The prefix which qualifies columns. | required |
**kwargs | | Additionally accepts all keyword arguments of the constructor of ColumnQualifier. See the documentation of | {} |
Examples:
>>> import pandas as pd; import pdpipe as pdp;
>>> df = pd.DataFrame(
... [[8,'a',5],[5,'b',7]], [1,2], ['num', 'chr', 'nur'])
>>> cq = pdp.cq.StartsWith('nu')
>>> cq
<ColumnQualifier: Columns starting with nu>
>>> cq(df)
['num', 'nur']
OfDtypes
Bases: ColumnQualifier
Select all columns that are of the given dtype or dtypes.
Use dtypes=np.number to qualify all numeric columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dtypes | object or list of objects | The dtype or dtypes which qualify columns. Supports all valid arguments to the | required |
**kwargs | | Additionally accepts all keyword arguments of the constructor of ColumnQualifier. See the documentation of | {} |
Examples:
>>> import pandas as pd; import pdpipe as pdp; import numpy as np;
>>> df = pd.DataFrame(
... [[8.2,'a',5],[5.1,'b',7]], [1,2], ['ph', 'grade', 'age'])
>>> cq = pdp.cq.OfDtypes(np.number)
>>> cq(df)
['ph', 'age']
>>> cq = pdp.cq.OfDtypes([np.number, object])
>>> cq(df)
['ph', 'grade', 'age']
>>> cq = pdp.cq.OfDtypes(np.int64)
>>> cq
<ColumnQualifier: With dtypes in <class 'numpy.int64'>>
>>> cq(df)
['age']
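For comparison, dtype-based selection can be done in plain pandas with `DataFrame.select_dtypes`. The sketch below assumes the qualifier selects columns in a similar way; it shows the pandas call only, not pdpipe's internals.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[8.2, "a", 5], [5.1, "b", 7]], index=[1, 2], columns=["ph", "grade", "age"]
)

# np.number matches every numeric dtype (integers and floats alike).
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()

# A concrete dtype name narrows the selection to that dtype only.
float_cols = df.select_dtypes(include="float64").columns.tolist()
```

Here `numeric_cols` is `['ph', 'age']` and `float_cols` is `['ph']`, in the column order of the dataframe.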
OfNumericDtypes
Bases: OfDtypes
Select all columns that are of a numeric dtype.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
**kwargs | | Additionally accepts all keyword arguments of the constructor of ColumnQualifier. See the documentation of | {} |
Examples:
>>> import pandas as pd; import pdpipe as pdp; import numpy as np;
>>> df = pd.DataFrame(
... [[8.2,'a',5],[5.1,'b',7]], [1,2], ['ph', 'grade', 'age'])
>>> cq = pdp.cq.OfNumericDtypes()
>>> cq
<ColumnQualifier: With dtypes in <class 'numpy.number'>>
>>> cq(df)
['ph', 'age']
WithAtMostMissingValues
Bases: ColumnQualifier
Select all columns with no more than X missing values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_missing | int | The maximum number of missing values with which columns can still qualify. | required |
**kwargs | | Additionally accepts all keyword arguments of the constructor of ColumnQualifier. See the documentation of | {} |
Examples:
>>> import pandas as pd; import pdpipe as pdp; import numpy as np;
>>> df = pd.DataFrame(
... [[None, 1, 2],[None, None, 5]], [1,2], ['ph', 'grade', 'age'])
>>> cq = pdp.cq.WithAtMostMissingValues(1)
>>> cq
<ColumnQualifier: With at most 1 missing values>
>>> cq(df)
['grade', 'age']
WithoutMissingValues
Bases: WithAtMostMissingValues
Select all columns with no missing values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
**kwargs | | Accepts all keyword arguments of the constructor of ColumnQualifier. See the documentation of | {} |
Examples:
>>> import pandas as pd; import pdpipe as pdp; import numpy as np;
>>> df = pd.DataFrame(
... [[None, 1, 2],[None, None, 5]], [1,2], ['ph', 'grade', 'age'])
>>> cq = pdp.cq.WithoutMissingValues()
>>> cq
<ColumnQualifier: Without missing values>
>>> cq(df)
['age']
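The count-based qualifiers above can be cross-checked with plain pandas. This is a sketch of the underlying arithmetic, not of pdpipe's code; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[None, 1, 2], [None, None, 5]], index=[1, 2], columns=["ph", "grade", "age"]
)

# Number of missing values per column: ph=2, grade=1, age=0.
missing_counts = df.isna().sum()

# Columns with at most one missing value, in dataframe order.
at_most_one = [c for c in df.columns if missing_counts[c] <= 1]

# Columns with no missing values at all.
no_missing = [c for c in df.columns if missing_counts[c] == 0]
```

`at_most_one` comes out as `['grade', 'age']` and `no_missing` as `['age']`, matching the two examples.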
WithAtMostMissingValueRate
Bases: ColumnQualifier
Select all columns with no more than P% missing values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
rate | float, between 0 and 1 | The maximum rate of missing values with which columns can still qualify. | required |
**kwargs | | Additionally accepts all keyword arguments of the constructor of ColumnQualifier. See the documentation of | {} |
Examples:
>>> import pandas as pd; import pdpipe as pdp; import numpy as np;
>>> df = pd.DataFrame(
... [[None, 1, 2],[None, None, 5]], [1,2], ['ph', 'grade', 'age'])
>>> cq = pdp.cq.WithAtMostMissingValueRate(0.6)
>>> cq
<ColumnQualifier: With at most 0.6 missing value rate>
>>> cq(df)
['grade', 'age']
WithAtLeastMissingValueRate
Bases: ColumnQualifier
Select all columns with no less than P% missing values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
rate | float, between 0 and 1 | The minimum rate of missing values with which columns can still qualify. | required |
**kwargs | | Additionally accepts all keyword arguments of the constructor of ColumnQualifier. See the documentation of | {} |
Examples:
>>> import pandas as pd; import pdpipe as pdp; import numpy as np;
>>> df = pd.DataFrame(
... [[None, 1, 2],[None, None, 5]], [1,2], ['ph', 'grade', 'age'])
>>> cq = pdp.cq.WithAtLeastMissingValueRate(0.6)
>>> cq
<ColumnQualifier: With at least 0.6 missing value rate>
>>> cq(df)
['ph']
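The rate-based qualifiers compare the fraction of missing values in each column against `rate`. A plain-pandas sketch of that computation, under the assumption that both bounds are inclusive, looks like this:

```python
import pandas as pd

df = pd.DataFrame(
    [[None, 1, 2], [None, None, 5]], index=[1, 2], columns=["ph", "grade", "age"]
)

# Fraction of missing values per column: ph=1.0, grade=0.5, age=0.0.
missing_rate = df.isna().mean()

# Columns whose missing rate is no more than 0.6.
at_most = missing_rate[missing_rate <= 0.6].index.tolist()

# Columns whose missing rate is no less than 0.6.
at_least = missing_rate[missing_rate >= 0.6].index.tolist()
```

With the threshold 0.6 used in the examples, `at_most` is `['grade', 'age']` and `at_least` is `['ph']`.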
Functions
is_fittable_column_qualifier(obj)
Return True for objects that are fittable ColumnQualifier objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
obj | object | The object to examine. | required |
Returns:
Type | Description |
---|---|
bool | True if the given object is an instance of ColumnQualifier and fittable, False otherwise. |
columns_to_qualifier(columns)
Convert the given columns parameter to an equivalent column qualifier.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | single label, list-like or callable | The label, or an iterable of labels, of columns. Alternatively, this parameter can be assigned a callable returning an iterable of labels from an input pandas.DataFrame. See | required |
Returns:
Type | Description |
---|---|
ColumnQualifier | The equivalent ColumnQualifier object. |
Examples:
>>> import pdpipe as pdp;
>>> pdp.cq.columns_to_qualifier('nu')
<ColumnQualifier: By labels in nu>
>>> pdp.cq.columns_to_qualifier(['nu', 'bu'])
<ColumnQualifier: By labels in nu, bu>
>>> pdp.cq.columns_to_qualifier(lambda df: [l for l in df.columns])
<ColumnQualifier: Qualify columns by function>
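The three accepted input shapes (single label, list-like, callable) suggest a simple dispatch. The sketch below uses a hypothetical `to_selector` helper that returns a plain function rather than a ColumnQualifier; dropping labels that are absent from the dataframe is an assumption, modeled on the ByLabels example earlier on this page.

```python
import pandas as pd


def to_selector(columns):
    """Turn a label, list of labels, or callable into a df -> labels function."""
    if callable(columns):
        # Callables are used as-is; their result is normalized to a list.
        return lambda df: list(columns(df))
    if isinstance(columns, (str, bytes)) or not hasattr(columns, "__iter__"):
        labels = [columns]  # a single label, e.g. 'num' or 5
    else:
        labels = list(columns)  # any other iterable of labels
    # Assumption: labels not present in the dataframe are silently dropped.
    return lambda df: [c for c in labels if c in df.columns]


df = pd.DataFrame(
    [[8, "a", 5], [5, "b", 7]], index=[1, 2], columns=["num", "chr", "nur"]
)

single = to_selector("num")(df)
several = to_selector(["chr", "nur", "foo"])(df)
by_func = to_selector(lambda frame: [c for c in frame.columns])(df)
```

Note that this sketch treats a tuple as a list of labels, so tuple-valued labels (as in a MultiIndex) would need separate handling.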