sktools package

Submodules

sktools.encoders module

Nested target encoder

class sktools.encoders.NestedTargetEncoder(verbose=0, cols=None, drop_invariant=False, feature_mapping={}, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, m_prior=1.0, m_parent=1.0)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Estimate of likelihood for nested data.

This is a generalization of the m-probability estimate. Instead of a global prior, it can use a more fine-grained prior taken from the parent category, so it only works for nested data. For instance, suppose individuals live in counties, which are in turn inside states. To estimate the likelihood encoding for a county, it is better to use the estimate for its state as the prior than the global estimate.

verbose: int
integer indicating verbosity of the output. 0 for none.
cols: list
a list of columns to encode, if None, all string columns will be encoded.
drop_invariant: bool
boolean for whether or not to drop encoded columns with 0 variance.
feature_mapping: dict
dictionary representing the child-parent relationship. Keys are children and values are their parents.
return_df: bool
boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing: str
options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
handle_unknown: str
options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
randomized: bool
adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).
sigma: float
standard deviation (spread or “width”) of the normal distribution.
m_prior: float
this is the “m” in the m-probability estimate for the global mean. A higher value of m results in stronger shrinking. It is used whenever a likelihood is estimated using the global mean as a prior. m must be non-negative.
m_parent: float
this is the “m” in the m-probability estimate for the parent mean. A higher value of m results in stronger shrinking. It is used whenever a likelihood is estimated using the parent mean as a prior. m must be non-negative.
>>> from sktools import NestedTargetEncoder
>>> import pandas as pd
>>> X = pd.DataFrame(
...     {
...         "child": ["a", "a", "b", "b", "b", "c", "c", "d", "d", "d"],
...         "parent": ["e", "e", "e", "e", "e", "f", "f", "f", "f", "f",]
...     }
... )
>>> y = pd.Series([1, 2, 3, 1, 2, 4, 4, 5, 4, 4.5])
>>> ne = NestedTargetEncoder(feature_mapping={"child": "parent"}, m_prior=0)
>>> ne.fit_transform(X, y)
      child  parent
0  2.016667     1.8
1  2.016667     1.8
2  2.262500     1.8
3  2.262500     1.8
4  2.262500     1.8
5  3.683333     4.3
6  3.683333     4.3
7  4.137500     4.3
8  4.137500     4.3
9  4.137500     4.3
[1] Additive smoothing, from https://en.wikipedia.org/wiki/Additive_smoothing#Generalized_to_the_case_of_known_incidence_rates
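The nested prior described above can be sketched in plain pandas. This is only an illustration of an m-probability estimate with a parent prior, not the encoder's implementation: the helper `m_estimate`, the column names and the toy data are assumptions made for the example, and the numbers are not meant to reproduce the encoder's output.

```python
import pandas as pd


def m_estimate(group_sum, group_count, prior, m):
    """m-probability estimate: shrink the group mean towards `prior`."""
    return (group_sum + m * prior) / (group_count + m)


# Hypothetical toy data: counties nested inside states.
df = pd.DataFrame(
    {
        "county": ["a", "a", "b", "c", "c", "c"],
        "state": ["x", "x", "x", "y", "y", "y"],
        "target": [1.0, 3.0, 5.0, 10.0, 10.0, 13.0],
    }
)
m_prior, m_parent = 1.0, 1.0

# Parent (state) estimate, shrunk towards the global mean with m_prior.
global_mean = df["target"].mean()
by_state = df.groupby("state")["target"]
state_estimate = m_estimate(by_state.sum(), by_state.count(), global_mean, m_prior)

# Child (county) estimate, shrunk towards its own state's estimate with m_parent.
df["state_estimate"] = df["state"].map(state_estimate)
by_county = df.groupby("county")
county_estimate = m_estimate(
    by_county["target"].sum(),
    by_county["target"].count(),
    by_county["state_estimate"].first(),
    m_parent,
)
print(county_estimate)
```

In this sketch, setting `m_parent` to 0 leaves each county with its raw mean, while larger values pull counties with few rows towards the estimate of their state.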
fit(X, y, **kwargs)[source]

Fit encoder according to X and y.

X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target values.
self : encoder
Returns self.
get_feature_names()[source]

Returns the names of all transformed / added columns.

feature_names: list
A list with all feature names transformed or added. Note: potentially dropped features are not included!
transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

When the data are used for model training, it is important to also pass the target in order to apply leave one out.

X : array-like, shape = [n_samples, n_features]
y : array-like, shape = [n_samples], or None
Target values when transforming with leave one out; None when transforming without target information (such as when transforming a test set).
p : array, shape = [n_samples, n_numeric + N]
Transformed values with encoding applied.
class sktools.encoders.QuantileEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', quantile=0.5, m=1.0)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Quantile Encoding for categorical features.

This is a statistically modified version of the target MEstimate encoder in which selected features are replaced by a statistical quantile of the target instead of the mean. Replacing with the median is the particular case where self.quantile = 0.5. In comparison to MEstimateEncoder, it has two tunable parameters, m and quantile.

verbose: int
integer indicating verbosity of the output. 0 for none.
quantile: float
float indicating the statistical quantile. 0.5 for the median.
m: float
float indicating the smoothing parameter. 0 for no smoothing.
cols: list
a list of columns to encode, if None, all string columns will be encoded.
drop_invariant: bool
boolean for whether or not to drop columns with 0 variance.
return_df: bool
boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing: str
options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target quantile.
handle_unknown: str
options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target quantile.
>>> from sktools import QuantileEncoder
>>> import pandas as pd
>>> from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = QuantileEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
[1] Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems, https://arxiv.org/abs/2105.13783
[2] A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, equation 7, from https://dl.acm.org/citation.cfm?id=507538
[3] On estimating probabilities in tree pruning, equation 1, from https://link.springer.com/chapter/10.1007/BFb0017010
[4] Additive smoothing, from https://en.wikipedia.org/wiki/Additive_smoothing#Generalized_to_the_case_of_known_incidence_rates
[5] Target encoding done the right way, from https://maxhalford.github.io/blog/target-encoding/
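The interplay of quantile and m can be illustrated with a small pandas sketch. It assumes the smoothed value is a weighted average of the category quantile and the global quantile, with the category count and m as weights; the function name `quantile_encoding_sketch` and the toy data are hypothetical, and the encoder's actual implementation may differ in its details.

```python
import pandas as pd


def quantile_encoding_sketch(categories, target, quantile=0.5, m=1.0):
    """Return one smoothed target quantile per category (illustrative only)."""
    cats = pd.Series(categories)
    y = pd.Series(target, dtype=float)
    prior = y.quantile(quantile)  # global target quantile
    grouped = y.groupby(cats)
    group_quantile = grouped.quantile(quantile)
    count = grouped.count()
    # Weighted average of the category quantile and the global quantile.
    return (count * group_quantile + m * prior) / (count + m)


encoding = quantile_encoding_sketch(
    ["a", "a", "a", "b", "b"], [1, 2, 9, 4, 6], quantile=0.5, m=1.0
)
print(encoding)
```

With m=0 the sketch returns the plain per-category quantile; as m grows, every category moves towards the global quantile.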
fit(X, y, **kwargs)[source]

Fit encoder according to X and y.

X : array-like, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target values.
self : encoder
Returns self.
fit_quantile_encoding(X, y)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

feature_names: list
A list with all feature names transformed or added. Note: potentially dropped features are not included!
quantile_encode(X_in)[source]
transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

X : array-like, shape = [n_samples, n_features]
y : array-like, shape = [n_samples], or None
Target values when transforming with leave one out; None when transforming without target info (such as when transforming a test set).
p : array, shape = [n_samples, n_numeric + N]
Transformed values with encoding applied.
class sktools.encoders.SummaryEncoder(cols, quantiles, m=1.0)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

fit(X, y)[source]
transform(X, y=None)[source]

sktools.ensemble module

class sktools.ensemble.MedianForestRegressor(*args, **kwargs)[source]

Bases: object

Random forest with median aggregation

Very similar to a random forest regressor, but the predictions of the individual trees are aggregated using the median instead of the mean. This can slightly improve the mean absolute error.

>>> from sktools import MedianForestRegressor
>>> from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
>>> boston = load_boston()['data']
>>> y = load_boston()['target']
>>> mf = MedianForestRegressor()
>>> mf.fit(boston, y)
>>> mf.predict(boston)[0:10]
array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9])
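Median aggregation can be sketched on top of a regular scikit-learn forest. This is an illustration of the idea rather than the class's implementation; the synthetic data and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# One row of predictions per fitted tree: shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

mean_prediction = per_tree.mean(axis=0)  # what the regular forest returns
median_prediction = np.median(per_tree, axis=0)  # median aggregation
print(median_prediction[:5])
```

The median is less sensitive than the mean to a few trees with extreme predictions, which is the usual motivation for targeting the absolute error this way.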
fit(X, y)[source]
predict(X)[source]

sktools.imputer module

class sktools.imputer.IsEmptyExtractor(keep_trivial=False, cols=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer that adds columns indicating whether each column has a NaN value in a given row

keep_trivial: bool
If False, no indicator column is added for columns that don’t contain NaN.
cols: list
List of columns to transform. If None, all columns are transformed. It only works if the input is a DataFrame.
>>> from sktools import IsEmptyExtractor
>>> import pandas as pd
>>> import numpy as np
>>> X = pd.DataFrame(
...     {
...         "x": ["a", "b", np.nan],
...         "y": ["c", np.nan, "d"]
...     }
... )
>>> IsEmptyExtractor().fit_transform(X)
     x    y   x_na   y_na
0    a    c  False  False
1    b  NaN  False   True
2  NaN    d   True  False
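The same kind of indicator columns can be built by hand in pandas. This is a minimal sketch assuming the `_na` suffix seen in the output above; the extra column `z` is hypothetical and only there to show what dropping a trivial (never missing) indicator looks like.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "x": ["a", "b", np.nan],
        "y": ["c", np.nan, "d"],
        "z": [1, 2, 3],  # never missing
    }
)

# One boolean indicator per column, named <column>_na.
flags = X.isna().add_suffix("_na")

# Keep only indicators that are True at least once (no trivial columns).
flags = flags.loc[:, flags.any()]

result = pd.concat([X, flags], axis=1)
print(result)
```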
fit(X, y=None)[source]
transform(X)[source]

For each column, creates a new column indicating whether its values are NA.

transform_data_frame(X)[source]

Transform method used when the input is a pandas data frame.

sktools.linear_model module

class sktools.linear_model.QuantileRegression(quantile=0.5, add_intercept=True)[source]

Bases: object

Quantile regression wrapper

It can work on sklearn pipelines

>>> from sktools import QuantileRegression
>>> from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
>>> boston = load_boston()['data']
>>> y = load_boston()['target']
>>> qr = QuantileRegression(quantile=0.9)
>>> qr.fit(boston, y)
>>> qr.predict(boston)[0:5].round(2)
array([34.87, 28.98, 34.86, 32.67, 32.52])
fit(X, y)[source]
predict(X, y=None)[source]
preprocess(X)[source]

sktools.matrix_denser module

Main module.

class sktools.matrix_denser.MatrixDenser[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Converts matrix to dense.

Useful when doing a union between dense and sparse matrices.

>>> from sktools import MatrixDenser
>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>> sparse_matrix = csr_matrix((3, 4), dtype=np.int8)
>>> dense_matrix = MatrixDenser().fit_transform(sparse_matrix)
>>> print(dense_matrix)
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]
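The use case mentioned above, combining sparse and dense blocks, can be sketched as follows. The helper `to_dense` is hypothetical and only mirrors the idea of the transformer; it is not taken from the package.

```python
import numpy as np
from scipy import sparse


def to_dense(X):
    """Return a dense ndarray whether X is sparse or already dense."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)


sparse_part = sparse.csr_matrix(np.eye(3, dtype=np.int8))
dense_part = np.ones((3, 2), dtype=np.int8)

# Once both blocks are dense they can be stacked column-wise with numpy.
combined = np.hstack([to_dense(sparse_part), to_dense(dense_part)])
print(combined.shape)
```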
fit(X, y=None)[source]
transform(X)[source]

sktools.model_selection module

class sktools.model_selection.BootstrapFold(n_bootstraps=10, size_fraction=1)[source]

Bases: object

Create folds based on bootstrapping

For each fold, a bootstrap sample is drawn and used as the training data. The test data is the rest of the data, that is, the rows that are not in the bootstrap sample.

When size_fraction is 1, the average size of the test data is 1/e (about 37%) of the total data.

n_bootstraps: int
number of folds of our cross-validation setting
size_fraction: float
size of each bootstrap sample as a fraction of the data. The lower the fraction, the larger the test set.
>>> import numpy as np
>>> from sktools.model_selection import BootstrapFold
>>> X = np.array([
...     np.random.randint(1, 3, 1000),
...     np.random.randint(0, 2, 1000)]
... ).T
>>> loo = BootstrapFold(10, size_fraction=1)
>>> for train_index, test_index in loo.split(X):
...     print(f"Train length: {len(train_index)} Test length: {len(test_index)}")
Train length: 1000 Test length: 393
Train length: 1000 Test length: 367
Train length: 1000 Test length: 372
Train length: 1000 Test length: 377
Train length: 1000 Test length: 361
Train length: 1000 Test length: 356
Train length: 1000 Test length: 366
Train length: 1000 Test length: 369
Train length: 1000 Test length: 390
Train length: 1000 Test length: 365
[1]Out of sample data for bootstrap sample, from https://stats.stackexchange.com/questions/88980/
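The splitting scheme and the 1/e figure can be checked with a short numpy simulation. This is a sketch under the assumption that the training indices are drawn with replacement and the test indices are those never drawn; the generator `bootstrap_split` and its `seed` argument are hypothetical and not part of the class.

```python
import numpy as np


def bootstrap_split(n_samples, n_bootstraps=10, size_fraction=1.0, seed=0):
    """Yield (train, test) index arrays for each bootstrap fold."""
    rng = np.random.default_rng(seed)
    all_indices = np.arange(n_samples)
    n_train = int(n_samples * size_fraction)
    for _ in range(n_bootstraps):
        # Training indices are sampled with replacement.
        train = rng.choice(all_indices, size=n_train, replace=True)
        # Test indices are the rows that were never drawn.
        test = np.setdiff1d(all_indices, train)
        yield train, test


n = 10_000
test_fractions = [len(test) / n for _, test in bootstrap_split(n)]
print(np.mean(test_fractions), np.exp(-1))
```

Each row is missed by a single draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e as n grows.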
get_n_splits(X=None, y=None, groups=None)[source]
split(X, y=None, groups=None)[source]

Generator to iterate over the indices.

X:
Array to split on.
y:
Always ignored, exists for compatibility.
groups:
Always ignored, exists for compatibility.

sktools.preprocessing module

class sktools.preprocessing.CyclicFeaturizer(cols, period_mapping=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Cyclic featurizer

Given some numeric columns, applies sine and cosine transformations to obtain cyclic features. This is especially suited to month of the year, day of the week, day of the month, hour of the day, etc., where the plain numeric representation doesn’t work well because the last value of a cycle is adjacent to the first.

cols : list
columns to be encoded using sine and cosine transformations. Should be numeric columns
period_mapping : dict
keys should be names of cols and values should be tuples indicating minimum and maximum values
>>> from sktools import CyclicFeaturizer
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "posted_at": pd.date_range(
...             start="1/1/2018", periods=365 * 3, freq="d"
...         ),
...         "created_at": pd.date_range(
...             start="1/1/2018", periods=365 * 3, freq="h"
...         )
...     }
... )
>>> df["month_posted"] = df.posted_at.dt.month
>>> df["hour_created"] = df.created_at.dt.hour
>>> transformed_df = CyclicFeaturizer(
...     cols=["month_posted", "hour_created"]
... ).fit_transform(df)
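The sine and cosine transformation itself can be sketched with numpy. The helper `cyclic_encode` is hypothetical, and the way the period is derived from the minimum and maximum values is an assumption of this sketch that may not match the transformer exactly.

```python
import numpy as np


def cyclic_encode(values, min_value, max_value):
    """Map values in [min_value, max_value] onto the unit circle."""
    values = np.asarray(values, dtype=float)
    period = max_value - min_value + 1  # assumed: both bounds are valid values
    angle = 2 * np.pi * (values - min_value) / period
    return np.sin(angle), np.cos(angle)


months = np.array([1, 2, 12])
sin_month, cos_month = cyclic_encode(months, min_value=1, max_value=12)
points = np.column_stack([sin_month, cos_month])

# In the encoded space December is as close to January as February is.
jan_to_feb = np.linalg.norm(points[0] - points[1])
dec_to_jan = np.linalg.norm(points[2] - points[0])
print(jan_to_feb, dec_to_jan)
```

Both columns are needed: the sine or the cosine alone maps two different months to the same value.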
fit(X)[source]
transform(X)[source]
class sktools.preprocessing.GradientBoostingFeatureGenerator(stack_to_X=True, add_probs=False, regression=False, **kwargs)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Feature generator from a gradient boosting.

Gradient boosting decision trees are a powerful and very convenient way to implement non-linear and tuple transformations. We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in and then perform one hot encoding for these features.

Parameters
stack_to_X: bool, default = True
Generates leaves features using the fitted self.gbm and saves them in R. If stack_to_X is True then .transform returns the original features with ‘R’ appended as columns. If stack_to_X is False then .transform returns only the leaves features from ‘R’
add_probs: bool, default = False
If add_probs is True, a probability in [0, 1] is appended to the created features. If add_probs is False, the features are binary.
>>> from sktools import GradientBoostingFeatureGenerator
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification()
>>> mf = GradientBoostingFeatureGenerator()
>>> mf.fit(X, y)
>>> mf.transform(X)
[1] Practical Lessons from Predicting Clicks on Ads at Facebook, from https://research.fb.com/wp-content/uploads/2016/11/practical-lessons-from-predicting-clicks-on-ads-at-facebook.pdf
[2] Feature Generation with Gradient Boosted Decision Trees, Towards Data Science, Carlos Mougan
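The leaf-index encoding described above can be sketched directly with scikit-learn. This is an illustration for binary classification, not the class's own code; the variable names and the choice of 10 trees are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, random_state=0)
gbm = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Index of the leaf each sample falls in, one column per tree.
# For binary classification `apply` has a trailing axis of size 1.
leaves = gbm.apply(X)[:, :, 0]

# Treat each tree as a categorical feature and one-hot encode its leaves.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves).toarray()

# Equivalent in spirit to stack_to_X=True: original features plus leaf features.
stacked = np.hstack([X, leaf_features])
print(leaf_features.shape, stacked.shape)
```

Each row of `leaf_features` has exactly one active entry per tree, so the result is a sparse binary representation of the path a sample takes through the ensemble.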
fit(X, y)[source]
transform(X)[source]

R contains the matrix with the encoded leaves. Its shape depends on the parameters. P contains a two-column array with the probabilities.

sktools.quantilegroups module

Grouped Quantile Featurizer

class sktools.quantilegroups.GroupedQuantileTransformer(feature_mapping, handle_missing='value', n_quantiles=1000, subsample=100000, random_state=None, copy=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Computes the group quantile of a numeric feature with respect to a categorical feature.

For instance, if each datum is an apartment, and we have both the price and the city, this feature tries to model how expensive an apartment is within its city. The most expensive apartment in the city will score 1, and the cheapest will score 0.

It is equivalent to what is done in: https://stackoverflow.com/questions/33899369/ranking-order-per-group-in-pandas

feature_mapping: dict
mapping from numeric variables to the categorical variables used as groups.
n_quantiles: int
number of quantiles per category.
subsample: int
Maximum number of samples used to estimate the quantiles for computational efficiency.
random_state: Any
Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See the scikit-learn Glossary entry for random_state.
copy: bool
Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).
>>> from sktools import GroupedQuantileTransformer
>>> import pandas as pd
>>> X = pd.DataFrame(
...         {
...             "price": [1, 2, 3, 3, 2, 10, 0],
...             "city": ["a", "a", "a", "b", "b", None, None],
...         }
...     )
>>> featurizer = GroupedQuantileTransformer(feature_mapping={"price": "city"})
>>> print(featurizer.fit_transform(X).columns)
Index(['price', 'city', 'price_quantile_city'], dtype='object')
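The idea of scoring each value within its group can be sketched with a per-group rank in pandas, in the spirit of the linked question. This is an approximation for illustration only; the transformer itself may estimate quantiles differently, and the column name `price_position_in_city` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [1, 2, 3, 3, 2], "city": ["a", "a", "a", "b", "b"]}
)

by_city = df.groupby("city")["price"]
rank = by_city.rank(method="average")  # 1 = cheapest within the city
size = by_city.transform("count")

# Rescale ranks so the cheapest scores 0 and the most expensive scores 1.
# Cities with a single row would give 0 / 0, i.e. NaN, in this sketch.
df["price_position_in_city"] = (rank - 1) / (size - 1)
print(df)
```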
fit(X, y=None)[source]
transform(X)[source]
class sktools.quantilegroups.MeanGroupFeaturizer(feature_mapping, create_features=True, handle_missing='value', handle_unknown='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Creates features establishing a relationship between a numeric and a categorical feature, by using the mean of the numeric feature in each category.

For instance, if each datum is an apartment, and we have both the price and the city, the features model how expensive an apartment is with respect to the mean in its city.

feature_mapping: dict
mapping from numeric variables to the categorical variables used as groups.
create_features: bool
If False, it just computes the means by category.
handle_missing: str
options are ‘return_nan’ and ‘value’, defaults to ‘value’, which uses the global mean.
handle_unknown: str
options are ‘return_nan’ and ‘value’, defaults to ‘value’, which uses the global mean.
>>> from sktools import MeanGroupFeaturizer
>>> import pandas as pd
>>> X = pd.DataFrame(
...         {
...             "price": [1, 2, 3, 3, 2, 10, 0],
...             "city": ["a", "a", "a", "b", "b", None, None],
...         }
...     )
>>> featurizer = MeanGroupFeaturizer(
...     feature_mapping={"price": "city"}
... )
>>> print(featurizer.fit_transform(X).columns)
Index(['price', 'city', 'mean_price_city', 'diff_mean_price_city',
       'relu_diff_mean_price_city', 'ratio_mean_price_city'],
      dtype='object')
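The four output columns can be approximated by hand in pandas. The definitions below (difference, difference clipped at zero, and ratio with respect to the group mean) are inferred from the column names only, so they are assumptions of this sketch rather than a description of the class's code.

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0, 3.0, 2.0], "city": ["a", "a", "a", "b", "b"]}
)

# Mean price of the city each row belongs to, aligned with the rows.
city_mean = df.groupby("city")["price"].transform("mean")
diff = df["price"] - city_mean

features = pd.DataFrame(
    {
        "mean_price_city": city_mean,
        "diff_mean_price_city": diff,
        "relu_diff_mean_price_city": diff.clip(lower=0),  # keeps only positive gaps
        "ratio_mean_price_city": df["price"] / city_mean,
    }
)
print(features)
```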
fit(X, y=None)[source]
transform(X)[source]
class sktools.quantilegroups.PercentileGroupFeaturizer(feature_mapping, percentile=50, create_features=True, handle_missing='value', handle_unknown='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Creates features establishing a relationship between a numeric and a categorical feature, by using a given percentile of the numeric feature in each category.

For instance, if each datum is an apartment, and we have both the price and the city, using percentile 50 makes the features model how expensive an apartment is with respect to the median in its city.

feature_mapping: dict
mapping from numeric variables to the categorical variables used as groups.
percentile: int
percentile used to compute features
create_features: bool
If false, it just computes percentiles by category
handle_missing: str
options are ‘return_nan’ and ‘value’, defaults to ‘value’, which uses the global quantile.
handle_unknown: str
options are ‘return_nan’ and ‘value’, defaults to ‘value’, which uses the global quantile.
>>> from sktools import PercentileGroupFeaturizer
>>> import pandas as pd
>>> X = pd.DataFrame(
...         {
...             "price": [1, 2, 3, 3, 2, 10, 0],
...             "city": ["a", "a", "a", "b", "b", None, None],
...         }
...     )
>>> featurizer = PercentileGroupFeaturizer(
...     feature_mapping={"price": "city"}
... )
>>> print(featurizer.fit_transform(X).columns)
Index(['price', 'city', 'p50_price_city', 'diff_p50_price_city',
       'relu_diff_p50_price_city', 'ratio_p50_price_city'],
      dtype='object')
fit(X, y=None)[source]
transform(X)[source]

sktools.selectors module

class sktools.selectors.ItemSelector(key)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

For data grouped by feature, select subset of data at a provided key.

The data is expected to be stored in a 2D data structure, where the first index is over features and the second is over samples. i.e.

>>> len(data[key]) == n_samples

Please note that this is the opposite convention to scikit-learn feature matrices (where the first index corresponds to sample).

ItemSelector only requires that the collection implement getitem (data[key]). Examples include: a dict of lists, 2D numpy array, Pandas DataFrame, numpy record array, etc.

>>> from sktools.selectors import ItemSelector
>>> data = {'a': [1, 5, 2, 5, 2, 8],
...         'b': [9, 4, 1, 4, 1, 3]}
>>> ds = ItemSelector(key='a')
>>> data['a'] == ds.transform(data)

ItemSelector is not designed to handle data grouped by sample. (e.g. a list of dicts). If your data is structured this way, consider a transformer along the lines of sklearn.feature_extraction.DictVectorizer.

key : hashable, required
The key corresponding to the desired value in a mappable.
fit(X, y=None)[source]
transform(X)[source]
class sktools.selectors.TypeSelector(dtype)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer that selects the columns of a given dtype from a data frame. This can be useful when numeric and object columns need to be treated differently.

dtype : required
The dtype of the columns to keep.
>>> from sktools import TypeSelector
>>> import pandas as pd
>>> X = pd.DataFrame(
...         {
...             "price": [1., 2., 3.],
...             "city": ["a", "a", "b"]
...         }
...     )
>>> selector = TypeSelector(
...     dtype='float'
... )
>>> print(selector.fit_transform(X))
    price
0    1.0
1    2.0
2    3.0
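Splitting a frame by column type, as motivated above, can also be sketched with pandas alone. This uses `DataFrame.select_dtypes` for illustration and makes no claim about how the transformer is implemented.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "price": [1.0, 2.0, 3.0],
        "city": ["a", "a", "b"],
    }
)

# Numeric columns go one way, everything else goes the other.
numeric_part = X.select_dtypes(include=["number"])
other_part = X.select_dtypes(exclude=["number"])
print(list(numeric_part.columns), list(other_part.columns))
```

Each part can then be sent through its own preprocessing steps, for example scaling for the numeric columns and encoding for the remaining ones.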
fit(X, y=None)[source]
transform(X)[source]

Module contents

Top-level package for sktools.