sktools package¶

Submodules¶

sktools.encoders module¶

Nested target encoder

class sktools.encoders.NestedTargetEncoder(verbose=0, cols=None, drop_invariant=False, feature_mapping={}, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, m_prior=1.0, m_parent=1.0)[source]¶

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Estimate of likelihood for nested data.

This is a generalization of the m-probability estimate. The main difference is that instead of using a global prior, it can use a more fine-tuned prior. This only works for nested data. For instance, I have individuals who live in counties, that are inside states. If I want to estimate the likelihood encoding for a county, it is better to use as prior the estimate for the state instead of the global estimate.

verbose: int: integer indicating verbosity of the output. 0 for none.
cols: list: a list of columns to encode, if None, all string columns will be encoded.
drop_invariant: bool: boolean for whether or not to drop encoded columns with 0 variance.
feature_mapping: dict: dictionary representing the child - parent relationship. keys are children.
return_df: bool: boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing: str: options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
handle_unknown: str: options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
randomized: bool,: adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).
sigma: float: standard deviation (spread or “width”) of the normal distribution.
m_prior: float: this is the “m” in the m-probability estimate for the global mean. Higher value of m results into stronger shrinking. It is used whenever we estimate a likelihood using the global mean as a prior. M is non-negative.
m_parent: float: this is the “m” in the m-probability estimate. Higher value of m results into stronger shrinking. It is used whenever we estimate a likelihood using the parent mean as a prior. M is non-negative.

>>> from sktools import NestedTargetEncoder
>>> import pandas as pd
>>> X = pd.DataFrame(
>>>     {
>>>         "child": ["a", "a", "b", "b", "b", "c", "c", "d", "d", "d"],
>>>         "parent": ["e", "e", "e", "e", "e", "f", "f", "f", "f", "f",]
>>>     }
>>> )
>>> y = pd.Series([1, 2, 3, 1, 2, 4, 4, 5, 4, 4.5])
>>> ne = NestedTargetEncoder(feature_mapping={"child": "parent"}, m_prior=0)
>>> ne.fit_transform(X, y)
      child  parent
0  2.016667     1.8
1  2.016667     1.8
2  2.262500     1.8
3  2.262500     1.8
4  2.262500     1.8
5  3.683333     4.3
6  3.683333     4.3
7  4.137500     4.3
8  4.137500     4.3
9  4.137500     4.3

[1]	Additive smoothing, from https://en.wikipedia.org/wiki/Additive_smoothing#Generalized_to_the_case_of_known_incidence_rates

fit(X, y, **kwargs)[source]¶

Fit encoder according to X and binary y.

X : array-like, shape = [n_samples, n_features]: Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]: Binary target values.

self : encoder: Returns self.

get_feature_names()[source]¶

Returns the names of all transformed / added columns.

feature_names: list: A list with all feature names transformed or added. Note: potentially dropped features are not included!

transform(X, y=None, override_return_df=False)[source]¶

Perform the transformation to new categorical data.

When the data are used for model training, it is important to also pass the target in order to apply leave one out.

X : array-like, shape = [n_samples, n_features] y : array-like, shape = [n_samples] when transform by leave one out

None, when transform without target information (such as transform test set)

p : array, shape = [n_samples, n_numeric + N]: Transformed values with encoding applied.

class sktools.encoders.QuantileEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', quantile=0.5, m=1.0)[source]¶

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Quantile Encoding for categorical features.

This a statistically modified version of target MEstimate encoder where selected features are replaced the statistical quantile instead than the mean. Replacing with the median is a particular case where self.quantile = 0.5. In comparison to MEstimateEncoder it has two tunable parameter m and quantile

verbose: int: integer indicating verbosity of the output. 0 for none.
quantile: int: integer indicating statistical quantile. ´0.5´ for median.
m: int: integer indicating the smoothing parameter. 0 for no smoothing.
cols: list: a list of columns to encode, if None, all string columns will be encoded.
drop_invariant: bool: boolean for whether or not to drop columns with 0 variance.
return_df: bool: boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing: str: options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target quantile.
handle_unknown: str: options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target quantile.

>>> from sktools import QuantileEncoder
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = QuantileEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

[1]	Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems, https://arxiv.org/abs/2105.13783

[2]	A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, equation 7, from https://dl.acm.org/citation.cfm?id=507538

[3]	On estimating probabilities in tree pruning, equation 1, from https://link.springer.com/chapter/10.1007/BFb0017010

[4]	Additive smoothing, from https://en.wikipedia.org/wiki/Additive_smoothing#Generalized_to_the_case_of_known_incidence_rates

[5]	Target encoding done the right way https://maxhalford.github.io/blog/target-encoding/

fit(X, y, **kwargs)[source]¶

Fit encoder according to X and y.

X : array-like, shape = [n_samples, n_features]: Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]: Target values.

self : encoder: Returns self.

fit_quantile_encoding(X, y)[source]¶

get_feature_names()[source]¶

Returns the names of all transformed / added columns.

feature_names: list: A list with all feature names transformed or added. Note: potentially dropped features are not included!

quantile_encode(X_in)[source]¶

transform(X, y=None, override_return_df=False)[source]¶

Perform the transformation to new categorical data.

X : array-like, shape = [n_samples, n_features] y : array-like, shape = [n_samples] when transform by leave one out

None, when transform without target info (such as transform test set)

p : array, shape = [n_samples, n_numeric + N]: Transformed values with encoding applied.

class sktools.encoders.SummaryEncoder(cols, quantiles, m=1.0)[source]¶

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

fit(X, y)[source]¶

transform(X, y=None)[source]¶

sktools.ensemble module¶

class sktools.ensemble.MedianForestRegressor(*args, **kwargs)[source]¶

Bases: object

Random forest with median aggregation

Very similar to random forest regressor, but aggregating using the median instead of the mean. Can improve the mean absolute error a little.

>>> from sktools import MedianForestRegressor
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()['data']
>>> y = load_boston()['target']
>>> mf = MedianForestRegressor()
>>> mf.fit(boston, y)
>>> mf.predict(boston)[0:10]
array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9])

fit(X, y)[source]¶

predict(X)[source]¶

sktools.imputer module¶

class sktools.imputer.IsEmptyExtractor(keep_trivial=False, cols=None)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer that adds columns indicating wether columns have NaN values in a row

keep_trivial:: If a column doesn’t have NaN, don’t add the column
cols:: List of columns to transform. If None, all columns are transformed. It only works if input is a DataFrame

>>> from sktools import IsEmptyExtractor
>>> import pandas as pd
>>> import numpy as np
>>> X = pd.DataFrame(
>>>     {
>>>         "x": ["a", "b", np.nan],
>>>         "y": ["c", np.nan, "d"]
>>>     }
>>> )
>>> IsEmptyExtractor().fit_transform(X)
     x    y   x_na   y_na
0    a    c  False  False
1    b  NaN  False   True
2  NaN    d   True  False

fit(X, y=None)[source]¶

transform(X)[source]¶: For each column, it creates a new one indicating if that column is na

transform_data_frame(X)[source]¶: Transform method in case of receiving a pandas data frame

sktools.linear_model module¶

class sktools.linear_model.QuantileRegression(quantile=0.5, add_intercept=True)[source]¶

Bases: object

Quantile regression wrapper

It can work on sklearn pipelines

>>> from sktools import QuantileRegression
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()['data']
>>> y = load_boston()['target']
>>> qr = QuantileRegression(quantile=0.9)
>>> qr.fit(boston, y)
>>> qr.predict(boston)[0:5].round(2)
array([34.87, 28.98, 34.86, 32.67, 32.52])

fit(X, y)[source]¶

predict(X, y=None)[source]¶

preprocess(X)[source]¶

sktools.matrix_denser module¶

Main module.

class sktools.matrix_denser.MatrixDenser[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Converts matrix to dense.

Useful when doing an union between dense and sparse matrices.

>>> from sktools import MatrixDenser
>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>> sparse_matrix = csr_matrix((3, 4), dtype=np.int8)
>>> dense_matrix = MatrixDenser().fit_transform(sparse_matrix)
>>> print(dense_matrix)
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]

fit(X, y=None)[source]¶

transform(X)[source]¶

sktools.model_selection module¶

class sktools.model_selection.BootstrapFold(n_bootstraps=10, size_fraction=1)[source]¶

Bases: object

Create folds based on bootsrapping

For each fold, create a bootstrap sample, training data is the bootstrapped data. The test data is the rest of the data, the data that is not in the bootstrap sample

The average size of the test data is 1/e of the total data.

n_bootstraps: int: number of folds of our cross-validation setting
size_fraction: float: fraction of the training data being sampled. The lower, the bigger the test set

>>> import numpy as np
>>> from sktools.model_selection import BootstrapFold
>>> X = np.array([
>>>     np.random.randint(1, 3, 1000),
>>>     np.random.randint(0, 2, 1000)]
>>> ).T
>>> loo = BootstrapFold(10, size_fraction=1)
>>> for train_index, test_index in loo.split(X):
>>>     print(f"Train length: {len(train_index)} Test length: {len(test_index)}")
Train length: 1000 Test length: 393
Train length: 1000 Test length: 367
Train length: 1000 Test length: 372
Train length: 1000 Test length: 377
Train length: 1000 Test length: 361
Train length: 1000 Test length: 356
Train length: 1000 Test length: 366
Train length: 1000 Test length: 369
Train length: 1000 Test length: 390
Train length: 1000 Test length: 365

[1]	Out of sample data for bootstrap sample, from https://stats.stackexchange.com/questions/88980/

get_n_splits(X=None, y=None, groups=None)[source]¶

split(X, y=None, groups=None)[source]¶: Generator to iterate over the indices :param X: Array to split on :param y: Always ignored, exists for compatibility :param groups: Always ignored, exists for compatibility

sktools.preprocessing module¶

class sktools.preprocessing.CyclicFeaturizer(cols, period_mapping=None)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Cyclic featurizer

Given some numeric columns, applies sine and cosine transformations to obtain cyclic features. This is specially suited to month of the year, day of the week, day of the month, hour of the day, etc, where the plain numeric representation doesn’t work very well.

cols : list: columns to be encoded using sine and cosine transformations. Should be numeric columns
period_mapping : dict: keys should be names of cols and values should be tuples indicating minimum and maximum values

>>> from sktools import CyclicFeaturizer
>>> import pandas as pd
>>> df = pd.DataFrame(
>>>     {
>>>         "posted_at": pd.date_range(
>>>             start="1/1/2018", periods=365 * 3, freq="d"
>>>         ),
>>>         "created_at": pd.date_range(
>>>             start="1/1/2018", periods=365 * 3, freq="h"
>>>         )
>>>     }
>>> )
>>> df["month_posted"] = df.posted_at.dt.month
>>> df["hour_created"] = df.created_at.dt.hour
>>> transformed_df = CyclicFeaturizer(
>>>     cols=["month_posted", "hour_created"]
>>> ).fit_transform(df)

fit(X)[source]¶

transform(X)[source]¶

class sktools.preprocessing.GradientBoostingFeatureGenerator(stack_to_X=True, add_probs=False, regression=False, **kwargs)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Feature generator from a gradient boosting.

Gradient boosting decision trees are a powerful and very convenient way to implement non-linear and tuple transformations. We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in and then perform one hot encoding for these features.

Parameters

stack_to_X: bool, default = True: Generates leaves features using the fitted self.gbm and saves them in R. If stack_to_X is True then .transform returns the original features with ‘R’ appended as columns. If stack_to_X is False then .transform returns only the leaves features from ‘R’
add_probs: bool, default = False: If add_probs is True then the created features are appended a probability [0,1]. If add_probs is False features are binary

>>> from sktools import GradientBoostingFeatureGenerator
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification()
>>> mf = GradientBoostingFeatureGenerator()
>>> mf.fit(X, y)
>>> mf.transform(X)

[1]	Practical Lessons from Predicting Clicks on Ads at Facebook, from

https://research.fb.com/wp-content/uploads/2016/11/practical-lessons-from-predicting-clicks-on-ads-at-facebook.pdf

[2]	Feature Generation with Gradient Boosted Decision Trees, Towards Data Science, Carlos Mougan

fit(X, y)[source]¶

transform(X)[source]¶: R contains the matrix with the encoded leaves. The shape depends upon the parameters. P contains a two columns array with the probability.

sktools.quantilegroups module¶

Grouped Quantile Featurizer

class sktools.quantilegroups.GroupedQuantileTransformer(feature_mapping, handle_missing='value', n_quantiles=1000, subsample=100000, random_state=None, copy=True)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Computes the group quantile of a numeric feature with respect to a categorical feature.

For instance, if each datum is an apartment, and we have both the price and the city, this feature tries to model how expensive is an apartment in its city. The most expensive apartment in the city will score 1, and the cheapest will score 0.

It is equivalent at what it is done in: https://stackoverflow.com/questions/33899369/ranking-order-per-group-in-pandas

feature_mapping: dict: mapping from numeric variables to categories that want to be used as groups.
n_quantiles: int: number of quantiles per category.
subsample: int: Maximum number of samples used to estimate the quantiles for computational efficiency.
random_state: Any: Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See :term:`Glossary `
copy: bool: Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).

>>> from sktools import GroupedQuantileTransformer
>>> import pandas as pd
>>> X = pd.DataFrame(
>>>         {
>>>             "price": [1, 2, 3, 3, 2, 10, 0],
>>>             "city": ["a", "a", "a", "b", "b", None, None],
>>>         }
>>>     )
>>> featurizer = GroupedQuantileTransformer(feature_mapping={"price": "city"})
>>> print(featurizer.fit_transform(X).columns)
Index(['price', 'city', 'price_quantile_city'], dtype='object')

fit(X, y=None)[source]¶

transform(X)[source]¶

class sktools.quantilegroups.MeanGroupFeaturizer(feature_mapping, create_features=True, handle_missing='value', handle_unknown='value')[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Creates features establishing a relationship between a numeric and a categorical feature, by using the mean of the numeric feature in each cateogry.

For instance, if each datum is an apartment, and we have both the price and the city, the features model how expensive is an apartment with respect to the mean in the city.

feature_mapping: dict: mapping from numeric variables to categories that want to be used as groups.
percentile: int: percentile used to compute features
create_features: bool: If false, it just computes percentiles by category
handle_missing: str: options are ‘return_nan’ and ‘value’, defaults to ‘value’, which uses the global quantile.
handle_unknown: str: options are ‘return_nan’ and ‘value’, defaults to ‘value’, which uses the global quantile.

>>> from sktools import MeanGroupFeaturizer
>>> import pandas as pd
>>> X = pd.DataFrame(
>>>         {
>>>             "price": [1, 2, 3, 3, 2, 10, 0],
>>>             "city": ["a", "a", "a", "b", "b", None, None],
>>>         }
>>>     )
>>> featurizer = MeanGroupFeaturizer(
>>>     feature_mapping={"price": "city"}
>>> )
>>> print(featurizer.fit_transform(X).columns)
Index(['price', 'city', 'mean_price_city', 'diff_mean_price_city',
       'relu_diff_mean_price_city', 'ratio_mean_price_city'],
      dtype='object')

fit(X, y=None)[source]¶

transform(X)[source]¶

class sktools.quantilegroups.PercentileGroupFeaturizer(feature_mapping, percentile=50, create_features=True, handle_missing='value', handle_unknown='value')[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Creates features establishing a relationship between a numeric and a categorical feature, by using a given percentile of the numeric feature in each cateogry.

For instance, if each datum is an apartment, and we have both the price and the city, if we use the percentile 50 the features model how expensive is an apartment with respect to the median in the city.

feature_mapping: dict: mapping from numeric variables to categories that want to be used as groups.
percentile: int: percentile used to compute features
create_features: bool: If false, it just computes percentiles by category
handle_missing: str: options are ‘return_nan’ and ‘value’, defaults to ‘value’, which uses the global quantile.
handle_unknown: str: options are ‘return_nan’ and ‘value’, defaults to ‘value’, which uses the global quantile.

>>> from sktools import PercentileGroupFeaturizer
>>> import pandas as pd
>>> X = pd.DataFrame(
>>>         {
>>>             "price": [1, 2, 3, 3, 2, 10, 0],
>>>             "city": ["a", "a", "a", "b", "b", None, None],
>>>         }
>>>     )
>>> featurizer = PercentileGroupFeaturizer(
>>>     feature_mapping={"price": "city"}
>>> )
>>> print(featurizer.fit_transform(X).columns)
Index(['price', 'city', 'p50_price_city', 'diff_p50_price_city',
       'relu_diff_p50_price_city', 'ratio_p50_price_city'],
      dtype='object')

fit(X, y=None)[source]¶

transform(X)[source]¶

sktools.selectors module¶

class sktools.selectors.ItemSelector(key)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

For data grouped by feature, select subset of data at a provided key.

The data is expected to be stored in a 2D data structure, where the first index is over features and the second is over samples. i.e.

>>> len(data[key]) == n_samples

Please note that this is the opposite convention to scikit-learn feature matrices (where the first index corresponds to sample).

ItemSelector only requires that the collection implement getitem (data[key]). Examples include: a dict of lists, 2D numpy array, Pandas DataFrame, numpy record array, etc.

>>> data = {'a': [1, 5, 2, 5, 2, 8],
            'b': [9, 4, 1, 4, 1, 3]}
>>> ds = ItemSelector(key='a')
>>> data['a'] == ds.transform(data)

ItemSelector is not designed to handle data grouped by sample. (e.g. a list of dicts). If your data is structured this way, consider a transformer along the lines of sklearn.feature_extraction.DictVectorizer.

key : hashable, required: The key corresponding to the desired value in a mappable.

fit(X, y=None)[source]¶

transform(X)[source]¶

class sktools.selectors.TypeSelector(dtype)[source]¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer that filters a type of columns of a given data frame. This can be useful if we want to treat numeric and object columns differently

dtype : required: The type we want to filter

>>> from sktools import TypeSelector
>>> import pandas as pd
>>> X = pd.DataFrame(
>>>         {
>>>             "price": [1., 2., 3.],
>>>             "city": ["a", "a", "b"]
>>>         }
>>>     )
>>> selector = TypeSelector(
>>>     dtype='float'
>>> )
>>> print(selector.fit_transform(X))
    price
0    1.0
1    2.0
2    3.0

fit(X, y=None)[source]¶

transform(X)[source]¶

Module contents¶

Top-level package for sktools.