Diffstat (limited to 'README.rst')
-rw-r--r--  README.rst  47
1 file changed, 41 insertions, 6 deletions
diff --git a/README.rst b/README.rst
index a401384..636b574 100644
--- a/README.rst
+++ b/README.rst
@@ -6,8 +6,8 @@ This module provides a bridge between `Scikit-Learn <http://scikit-learn.org/sta
In particular, it provides:
-1. a way to map DataFrame columns to transformations, which are later recombined into features
-2. a way to cross-validate a pipeline that takes a pandas DataFrame as input.
+1. A way to map ``DataFrame`` columns to transformations, which are later recombined into features.
+2. A way to cross-validate a pipeline that takes a pandas ``DataFrame`` as input.
Installation
------------
@@ -32,7 +32,7 @@ Import
Import what you need from the ``sklearn_pandas`` package. The choices are:
* ``DataFrameMapper``, a class for mapping pandas data frame columns to different sklearn transformations
-* ``cross_val_score``, similar to `sklearn.cross_validation.cross_val_score` but working on pandas DataFrames
+* ``cross_val_score``, similar to ``sklearn.cross_validation.cross_val_score`` but working on pandas DataFrames
For this demonstration, we will import both::
@@ -44,6 +44,7 @@ For these examples, we'll also use pandas, numpy, and sklearn::
>>> import numpy as np
>>> import sklearn.preprocessing, sklearn.decomposition, \
... sklearn.linear_model, sklearn.pipeline, sklearn.metrics
+ >>> from sklearn.feature_extraction.text import CountVectorizer
Load some Data
**************
@@ -67,16 +68,16 @@ The mapper takes a list of pairs. The first is a column name from the pandas Dat
... (['children'], sklearn.preprocessing.StandardScaler())
... ])
-The difference between specifying the column selector as `'column'` (as a simple string) and `['column']` (as a list with one element) is the shape of the array that is passed to the transformer. In the first case, a one dimensional array with be passed, while in the second case it will be a 2-dimensional array with one column, i.e. a column vector.
+The difference between specifying the column selector as ``'column'`` (as a simple string) and ``['column']`` (as a list with one element) is the shape of the array that is passed to the transformer. In the first case, a 1-dimensional array will be passed, while in the second case it will be a 2-dimensional array with one column, i.e. a column vector.
-This behaviour mimics the same pattern as pandas' dataframes `__getitem__` indexing:
+This behaviour mimics the ``__getitem__`` indexing of pandas dataframes:
>>> data['children'].shape
(8,)
>>> data[['children']].shape
(8, 1)
-Be aware that some transformers expect a 1-dimensional input (the label-oriented ones) while some others, like `OneHotEncoder` or `Imputer`, expect 2-dimensional input, with the shape `[n_samples, n_features]`.
+Be aware that some transformers expect a 1-dimensional input (the label-oriented ones) while some others, like ``OneHotEncoder`` or ``Imputer``, expect 2-dimensional input, with the shape ``[n_samples, n_features]``.
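+
+For instance (a minimal sketch; ``mapper_shapes`` is just an illustrative name), a label-oriented transformer such as ``LabelBinarizer`` is paired with a plain string selector, while ``StandardScaler`` is given a list selector so that it receives a column vector::
+
+ >>> mapper_shapes = DataFrameMapper([
+ ...     ('pet', sklearn.preprocessing.LabelBinarizer()),        # receives a 1-d array of labels
+ ...     (['children'], sklearn.preprocessing.StandardScaler())  # receives a 2-d column vector
+ ... ])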
Test the Transformation
***********************
@@ -156,6 +157,20 @@ Only columns that are listed in the DataFrameMapper are kept. To keep a column b
[ 1., 0., 0., 5.],
[ 0., 0., 1., 4.]])
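+
+As a minimal sketch of this pattern (``mapper_keep`` is just an illustrative name), passing ``None`` as the transformer keeps a column's values without applying any transformation::
+
+ >>> mapper_keep = DataFrameMapper([
+ ...     ('pet', sklearn.preprocessing.LabelBinarizer()),
+ ...     ('children', None)   # kept as-is, no transformation applied
+ ... ])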
+
+Working with sparse features
+****************************
+
+A ``DataFrameMapper`` will return a dense feature array by default. Setting ``sparse=True`` in the mapper will return a sparse array whenever any of the extracted features is sparse. Example:
+
+ >>> mapper4 = DataFrameMapper([
+ ... ('pet', CountVectorizer()),
+ ... ], sparse=True)
+ >>> type(mapper4.fit_transform(data))
+ <class 'scipy.sparse.csr.csr_matrix'>
+
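+For comparison (a small sketch reusing the same ``data``; ``mapper5`` is just an illustrative name), leaving ``sparse`` at its default of ``False`` returns a dense numpy array instead::
+
+ >>> mapper5 = DataFrameMapper([
+ ...     ('pet', CountVectorizer()),
+ ... ])
+ >>> type(mapper5.fit_transform(data))
+ <class 'numpy.ndarray'>
+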
+The stacking of the sparse features is done without ever densifying them.
+
Cross-Validation
----------------
@@ -175,6 +190,25 @@ Sklearn-pandas' ``cross_val_score`` function provides exactly the same interface
Changelog
---------
+1.1.0 (2015-12-06)
+*******************
+
+* Delete obsolete ``PassThroughTransformer``. If no transformation is desired for a given column, use ``None`` as transformer.
+* Factor out code in several modules, to avoid having everything in ``__init__.py``.
+* Use custom ``TransformerPipeline`` class to allow transformation steps accepting only an X argument. Fixes #46.
+* Add compatibility shim for unpickling mappers with list of transformers created before 1.0.0. Fixes #45.
+
+
+1.0.0 (2015-11-28)
+*******************
+
+* Change version numbering scheme to SemVer.
+* Use ``sklearn.pipeline.Pipeline`` instead of copying its code. Resolves #43.
+* Raise ``KeyError`` when selecting nonexistent columns in the dataframe. Fixes #30.
+* Return a sparse feature array if any of the features is sparse and the ``sparse`` argument is ``True``. Defaults to ``False`` to avoid potential breaking of existing code. Resolves #34.
+* Return model and prediction in custom CV classes. Fixes #27.
+
+
0.0.12 (2015-11-07)
********************
@@ -191,4 +225,5 @@ Other contributors:
* Paul Butler
* Cal Paterson
* Israel Saeta Pérez
+* Zac Stewart
* Olivier Grisel