
Automate Exploratory Data Analysis

Exploratory Data Analysis (EDA) is one of the first workflows when starting a machine learning project. Throwing a bunch of plots at a dataset is not difficult. What is much more useful is deriving insights, metrics, and observations from the state of the datasets to guide the next stages of our ML workflow.

The 0.9.2 release of Speedml focuses on productivity improvements during the exploratory data analysis workflow.

Can the datasets tell us which plots to use to detect outliers, which features to engineer next, which model algorithms to use, or how to cleanse our datasets? This release of Speedml starts answering some of these questions for you.

We also address Python 2.x / 3.x compatibility issues and include several bug fixes.

Speed EDA

Once you have imported Speedml and initialized the datasets, you can run the eda method to run speed EDA on your datasets.

from speedml import Speedml
sml = Speedml('../input/titanic/train.csv',
              '../input/titanic/test.csv',
              target='Survived', uid='PassengerId')

sml.eda()

The resulting dataframe automatically provides insights and observations based on the current state of your datasets. Observations include suitable plots to use, pre-processing methods to apply, and ideal results to achieve for model readiness. We even include model recommendations (regression or classification) based on analysis of the target variable.

Exploratory Data Analysis

As you progress through the pre-processing workflow you can run the method again and it will tailor the results to the latest state of your datasets. Once you convert all text columns to numerical values, the rows related to text features disappear. Think of this method as a dashboard of progress throughout your workflow.
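
As an example, you can re-run eda right after a pre-processing step to watch the dashboard update. Below is a minimal sketch, assuming Speedml's feature.labels method for label-encoding text features (an assumption, since that method is not shown elsewhere in this post); substitute whichever pre-processing calls your workflow uses.

# Label-encode some text features (feature.labels is assumed here),
# then re-run eda to refresh the dashboard.
sml.feature.labels(['Sex', 'Embarked'])
sml.eda()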

You can query any of the results like so.

sml.eda().get_value('Numerical Categorical', 'Results')

This outputs the complete text or list value of the Results column for the queried metric.
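
Note that get_value is deprecated in newer pandas releases. Assuming the EDA table is indexed by metric name, the equivalent lookup with the .at accessor would be:

# Same lookup using the newer pandas .at accessor
sml.eda().at['Numerical Categorical', 'Results']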

New plots

We are adding a couple of new plots in this release.

Using strip plots we can plot a categorical feature against a continuous feature across the target variable.

sml.plot.strip('Pclass', 'Fare')

Strip Plot

We can plot categorical variables against the target variable using a simple bar graph.

sml.plot.bar('Deck', 'Survived')

Bar Graph

Plot this right next to a crosstab table using the plot.crosstab method to gain a text-based and visual view of the same data, as in the example below.
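
For example, pairing the bar graph above with its text-based counterpart (the column names simply mirror the earlier bar graph example):

# Crosstab of the same Deck vs. Survived relationship shown in the bar graph
sml.plot.crosstab('Deck', 'Survived')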

Automatic notebook configuration

__init__.py: Reverted to using absolute/relative paths as passed into the public API. Use Base.outpath for internal file output (when storing the Xgb feature map file).

You no longer need to specify the %matplotlib inline magic command in a notebook using Speedml. We configure the notebook automatically.
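
For reference, the manual equivalent of what Speedml now does for you looks roughly like this. This is a sketch of the standard IPython mechanism, not necessarily Speedml's exact implementation.

# Enable inline plotting programmatically instead of the %matplotlib magic
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')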

We also add the capability to output more than one result from a single input cell. This is nice to have when using single-line method calls that return single-line results, especially when these methods group into one seamless workflow like so.

sml.feature.extract(new='Title', a='Name', regex=r' ([A-Za-z]+)\.')
sml.plot.crosstab('Title', 'Sex')
sml.feature.replace(a='Title', match=['Lady', 'Countess', 'Capt', 'Col',
                    'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],
                    new='Rare')
sml.feature.replace('Title', 'Mlle', 'Miss')
sml.feature.replace('Title', 'Ms', 'Miss')
sml.feature.replace('Title', 'Mme', 'Mrs')
sml.train[['Name', 'Title']].head()

This code in one input cell will now output a series of dataframes and string messages based on the output from each of the methods.
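
Under the hood, this multi-output behavior corresponds to IPython's ast_node_interactivity setting. A minimal sketch of the equivalent manual configuration follows; this is an assumption about the mechanism, not necessarily how Speedml implements it.

# Ask IPython to display every expression result in a cell, not just the last
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'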

Python 3 and 2 compatibility

base.py: Added the @staticmethod decorator to the data_n() method to address #8.

setup.py: Added a dependency on the future package to support Python 3/2 compatible code.

setup.py: Updated the open call to drop the encoding param for Python 2.7 compatibility.

Starting with this release we are adding release-testing and feature-exploration notebooks for Python 3.x and Python 2.x kernels. You can view and download the 0.9.2 notebooks for either kernel from GitHub.

Saving performance parameters

We can now save metrics from the model performance methods as a slug in the submission file name. This enables us to record and order submissions by changes to these metrics.

sml.save_results(
    columns={'PassengerId': sml.uid,
             'Survived': sml.xgb.predictions},
    file_path='output/titanic-speedml-{}.csv'.format(sml.slug()))

This results in file names like the following.

titanic-speedml-e16.63-m83.33-s86.53-f78.73.csv

Here the cross-validation (CV) error is e=0.1663 (16.63%). Model accuracy relative to other models is m=0.8333 (83.33%). The accuracy of the samples predicted is s=86.53%. The feature selection accuracy score is f=78.73%.
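
For reference, the slug portion of such a file name could be reproduced with plain Python as shown below. This is an illustration of the format only, reusing the metric values from the example above; it is not Speedml's implementation of slug().

# Illustrative only: rebuild the metrics slug from the example file name.
# The values are taken from the sample above, not from a live model.
metrics = [('e', 16.63), ('m', 83.33), ('s', 86.53), ('f', 78.73)]
slug = '-'.join('{}{:.2f}'.format(k, v) for k, v in metrics)
print(slug)  # e16.63-m83.33-s86.53-f78.73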