Text Processing Stages
Also look at NLTK-based stages.
Text processing pdpipe pipeline stages.
Attributes
Classes
RegexReplace
Bases: ApplyByCols
A pipeline stage replacing regex occurrences in a text column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | single label, list-like or callable | Column labels in the DataFrame to which regex replacement is applied. Alternatively, this parameter can be assigned a callable returning an iterable of labels from an input `pandas.DataFrame`. | *required* |
| `pattern` | str | The regex whose occurrences will be replaced. | *required* |
| `replace` | str | The replacement string to use. This is equivalent to `repl` in `re.sub`. | *required* |
| `flags` | int, default 0 | Regex flags compatible with Python's `re` module. | `0` |
| `result_columns` | label or list-like of labels, default None | The labels of the new columns resulting from the mapping operation. Must be of the same length as `columns`. If None, behavior depends on the `drop` parameter: if `drop` is True, the label of the source column is used; otherwise, the label of the source column is cast to a string and concatenated with the suffix `'_reg'`. | `None` |
| `drop` | bool, default True | If set to True, source columns are dropped after being transformed. | `True` |
| `**kwargs` | object | All `PdPipelineStage` constructor parameters are supported. | `{}` |
Examples:
>>> import pandas as pd; import pdpipe as pdp; import re;
>>> data = [[4, "more than 12"], [5, "with 5 more"]]
>>> df = pd.DataFrame(data, [1,2], ["age","text"])
>>> clean_num = pdp.RegexReplace('text', r'\b[0-9]+\b', "NUM")
>>> clean_num(df)
age text
1 4 more than NUM
2 5 with NUM more
>>> data = [["Mr. John", 18], ["MR. Bob", 25]]
>>> df = pd.DataFrame(data, [1,2], ["name","age"])
>>> match_men = r'^mr.*'
>>> censor_men = pdp.RegexReplace(
... 'name', match_men, "x", flags=re.IGNORECASE
... )
>>> censor_men(df)
name age
1 x 18
2 x 25
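The per-cell substitution this stage performs is ordinary `re.sub`; a minimal stand-alone sketch (plain `re`, no pdpipe) mirroring the first example above:

```python
import re

# The same substitution RegexReplace('text', r'\b[0-9]+\b', "NUM")
# applies to each cell of the 'text' column: replace every standalone
# number token with "NUM".
pattern = re.compile(r"\b[0-9]+\b")

cells = ["more than 12", "with 5 more"]
cleaned = [pattern.sub("NUM", cell) for cell in cells]
print(cleaned)  # ['more than NUM', 'with NUM more']
```

The stage's value over this loop is that it plugs into a pipeline, handles column selection, and manages the `drop`/`result_columns` bookkeeping for you.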
Source code in pdpipe/text_stages.py
DropTokensByLength
Bases: ApplyByCols
A pipeline stage removing tokens by length in string-token list columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | single label, list-like or callable | Names of token list columns on which to apply token filtering. Alternatively, this parameter can be assigned a callable returning an iterable of labels from an input `pandas.DataFrame`. | *required* |
| `min_len` | int | The minimum length of tokens to keep. Tokens of shorter length are removed from all token lists. | *required* |
| `max_len` | int, default None | The maximum length of tokens to keep. If provided, tokens of longer length are removed from all token lists. | `None` |
| `result_columns` | str or list-like, default None | The names of the new columns resulting from the mapping operation. Must be of the same length as `columns`. If None, behavior depends on the `drop` parameter: if `drop` is True, the name of the source column is used; otherwise, the name of the source column is used with the suffix `'_filtered'`. | `None` |
| `drop` | bool, default True | If set to True, source columns are dropped after being transformed. | `True` |
| `**kwargs` | object | All `PdPipelineStage` constructor parameters are supported. | `{}` |
Examples:
>>> import pandas as pd; import pdpipe as pdp;
>>> data = [[4, ["a", "bad", "nice"]], [5, ["good", "university"]]]
>>> df = pd.DataFrame(data, [1,2], ["age","text"])
>>> filter_tokens = pdp.DropTokensByLength('text', 3, 5)
>>> filter_tokens(df)
age text
1 4 [bad, nice]
2 5 [good]
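Per cell, the filtering reduces to a list comprehension over token lengths; a stand-alone sketch of `DropTokensByLength('text', 3, 5)` (assuming inclusive bounds, which is consistent with the example output above):

```python
# Per-cell equivalent of DropTokensByLength('text', 3, 5): keep only
# tokens whose length falls within [min_len, max_len].
min_len, max_len = 3, 5

def filter_tokens(tokens):
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(filter_tokens(["a", "bad", "nice"]))    # ['bad', 'nice']
print(filter_tokens(["good", "university"]))  # ['good']
```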
Source code in pdpipe/text_stages.py
DropTokensByList
Bases: ApplyByCols
A pipeline stage removing specific tokens in string-token list columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | single label, list-like or callable | Names of token list columns on which to apply token filtering. Alternatively, this parameter can be assigned a callable returning an iterable of labels from an input `pandas.DataFrame`. | *required* |
| `bad_tokens` | list of str | The list of string tokens to remove from all token lists. | *required* |
| `result_columns` | str or list-like, default None | The names of the new columns resulting from the mapping operation. Must be of the same length as `columns`. If None, behavior depends on the `drop` parameter: if `drop` is True, the name of the source column is used; otherwise, the name of the source column is used with the suffix `'_filtered'`. | `None` |
| `drop` | bool, default True | If set to True, source columns are dropped after being transformed. | `True` |
| `**kwargs` | object | All `PdPipelineStage` constructor parameters are supported. | `{}` |
Examples:
>>> import pandas as pd; import pdpipe as pdp;
>>> data = [[4, ["a", "bad", "cat"]], [5, ["bad", "not", "good"]]]
>>> df = pd.DataFrame(data, [1,2], ["age","text"])
>>> filter_tokens = pdp.DropTokensByList('text', ['bad'])
>>> filter_tokens(df)
age text
1 4 [a, cat]
2 5 [not, good]
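Per cell, this is simple membership filtering; a stand-alone sketch of `DropTokensByList('text', ['bad'])` (using a `set` for the lookup, an implementation detail of this sketch, not necessarily of pdpipe):

```python
# Per-cell equivalent of DropTokensByList('text', ['bad']): drop every
# token that appears in the bad_tokens list.
bad_tokens = {"bad"}

def drop_tokens(tokens):
    return [t for t in tokens if t not in bad_tokens]

print(drop_tokens(["a", "bad", "cat"]))     # ['a', 'cat']
print(drop_tokens(["bad", "not", "good"]))  # ['not', 'good']
```

A typical use is stop-word removal after tokenization, with `bad_tokens` set to a stop-word list.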