Linear Pipeline For Data Science Workflow

13 May 2017

Data science workflows can be iterative and take circuitous paths. Adding to this complexity, the data scientist is usually left to remember how the workflow has progressed over the course of a project.

Speedml memorizes the data science workflow for the data scientist.

It does so using the simple Speedml.eda method. In release 0.9.2 we further optimize the method, making it user configurable and having it update progressively based on the workflow status.

Linear Pipeline

Progressively updating workflow status

Now when you call the Speedml.eda method at the start of your workflow, during pre-processing, and before the model run, it returns a table that progressively hides the metrics that are complete.

Within the same notebook you can scroll to the previous or next EDA result to see what changed as a result of your workflow steps.

This makes a call to Speedml.eda akin to an automatically updating to-do list.

See how this feature works in the notebook Titanic Solution Using Speedml from our GitHub repository.
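To make the to-do list idea concrete, here is a minimal sketch in plain Python. It is not how Speedml implements the feature. The function name remaining_checks and the sample feature names are hypothetical, and the only assumption is that a metric counts as complete once no features are left under it.

```python
def remaining_checks(checks):
    """Keep only the EDA metrics that still list features to work on."""
    return {metric: features for metric, features in checks.items() if features}


# Hypothetical EDA state at the start of the workflow.
at_start = {
    "Text Unique": ["Name"],
    "Text Categorical": ["Sex", "Embarked"],
    "Text High-cardinality": ["Ticket"],
}

# Hypothetical state after some pre-processing: two metrics are now empty.
after_preprocessing = {
    "Text Unique": [],
    "Text Categorical": [],
    "Text High-cardinality": ["Ticket"],
}

print(remaining_checks(at_start))             # all three metrics still open
print(remaining_checks(after_preprocessing))  # only the unfinished metric remains
```

Calling the same function at each stage gives a shrinking list, which is the behaviour the progressively updating EDA table is meant to provide.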

Pipelining from EDA to pre-processing

The Speedml.eda method now returns a list of features instead of tuples with cardinality. This lets you take the cell output straight into a pandas dataframe filter or into feature engineering methods like feature.density or feature.labels for the next stage of the workflow.

Cardinality is still available for three bands - high, normal (within threshold), and continuous or unique. For most workflows this information is enough.
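One way to picture the three bands is the sketch below. It is an illustration under stated assumptions, not Speedml's code: the function name cardinality_band is hypothetical, the thresholds mirror the high_cardinality and unique_ratio settings shown later in this post, and "non-repeat values" is interpreted as the percentage of distinct values in the feature.

```python
def cardinality_band(values, high_cardinality=10, unique_ratio=80):
    """Classify a feature as 'unique', 'high', or 'normal' cardinality.

    Assumption: the unique ratio is the percentage of distinct values.
    """
    if not values:
        raise ValueError("values must not be empty")
    distinct = len(set(values))
    percent_distinct = 100.0 * distinct / len(values)
    if percent_distinct >= unique_ratio:
        return "unique"   # continuous or unique feature
    if distinct > high_cardinality:
        return "high"     # more categories than the threshold
    return "normal"       # within threshold


print(cardinality_band(["S", "C", "Q", "S"] * 25))        # few categories
print(cardinality_band([i % 20 for i in range(100)]))     # many categories
print(cardinality_band(list(range(100))))                 # every value distinct
```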

The following code demonstrates how we can pipeline results from the Speedml.eda method into the next stage of our workflow for pre-processing the features.

# Display top 5 samples with text unique features
sml.train[sml.eda().get_value('Text Unique', 'Results')].head()

# Convert categorical text features to numeric labels
text_categoricals = sml.eda().get_value('Text Categorical', 'Results')
sml.feature.labels(text_categoricals)
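If you are wondering what converting text features to numeric labels looks like, here is a small stand-alone sketch. It does not call Speedml. The helper label_encode and the sample rows are hypothetical, and it assumes labels are assigned in sorted order of the categories, which is the convention scikit-learn's LabelEncoder follows.

```python
def label_encode(rows, features):
    """Return copies of rows with each listed text feature replaced by an integer label."""
    encoded = [dict(row) for row in rows]
    for feature in features:
        # Assumption: labels follow the sorted order of the distinct categories.
        categories = sorted({row[feature] for row in rows})
        mapping = {category: index for index, category in enumerate(categories)}
        for row in encoded:
            row[feature] = mapping[row[feature]]
    return encoded


rows = [
    {"Sex": "male", "Embarked": "S"},
    {"Sex": "female", "Embarked": "C"},
    {"Sex": "female", "Embarked": "S"},
]

# The list of feature names plays the role of the EDA 'Text Categorical' result.
text_categoricals = ["Sex", "Embarked"]
encoded = label_encode(rows, text_categoricals)
print(encoded)
```

The point of the pipeline is that the list of feature names comes straight from the EDA result, so you never type the column names by hand.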

The Speedml.feature.density method now takes either a feature name string or a list of feature name strings as its parameter, creating density features for one or more high-cardinality features. This way you can pipe the eda method's high-cardinality features list to the density method like so.

# Create density features for High-cardinality text features
text_high_cardinality = sml.eda().get_value('Text High-cardinality', 'Results')
sml.feature.density(text_high_cardinality)

That is easy.
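For readers new to density features, the sketch below shows the general idea with the standard library only. It rests on an assumption: that a density feature records how often each value occurs in its column. The helper add_density, the "_density" suffix, and the sample rows are hypothetical and are not taken from the Speedml source.

```python
from collections import Counter


def add_density(rows, features):
    """Add a '<feature>_density' column holding the count of each row's value."""
    if isinstance(features, str):
        features = [features]  # accept a single name or a list of names
    result = [dict(row) for row in rows]
    for feature in features:
        counts = Counter(row[feature] for row in rows)
        for row in result:
            row[feature + "_density"] = counts[row[feature]]
    return result


rows = [{"Ticket": "A1"}, {"Ticket": "A1"}, {"Ticket": "B2"}]

with_density = add_density(rows, "Ticket")
print([row["Ticket_density"] for row in with_density])
```

Accepting both a string and a list is what makes it possible to hand over the EDA result unchanged, whether it names one feature or several.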

User configurable EDA rules

Speedml EDA rules are now configurable using the API. You can configure how Speedml analyzes outliers, over-fitting, high-cardinality features, and unique or continuous features.

# Display the configuration dictionary
sml.config
# Used by data out path 'internally' within Speedml methods
sml.configure('outpath', 'output/')
# Positive and negative skew within +- this value
sml.configure('outliers_threshold', 3)
# #Features/#Samples Train < this value
sml.configure('overfit_threshold', 0.01)
# Feature is high-cardinality if categories > this value
sml.configure('high_cardinality', 10)
# Unique (continuous) if at least this percentage of values are non-repeating
sml.configure('unique_ratio', 80)

Of course Speedml sets up the natural defaults so you do not have to.

# Display the configuration dictionary
sml.configuration()
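As a rough mental model of a configuration store with defaults, consider the sketch below. It is a hypothetical stand-in, not the Speedml class. The EdaConfig name and its behaviour are assumptions, and the default values are simply the ones used in the examples in this post, which may differ from the library's actual defaults.

```python
class EdaConfig:
    """Hypothetical configuration store with defaults and a configure() call."""

    # Assumption: defaults copied from the example values in this post.
    DEFAULTS = {
        "outpath": "output/",
        "outliers_threshold": 3,
        "overfit_threshold": 0.01,
        "high_cardinality": 10,
        "unique_ratio": 80,
    }

    def __init__(self):
        self._values = dict(self.DEFAULTS)

    def configure(self, option=None, value=None):
        """With no arguments return all settings; otherwise read or set one option."""
        if option is None:
            return dict(self._values)
        if option not in self._values:
            raise KeyError("unknown option: " + option)
        if value is not None:
            self._values[option] = value
        return self._values[option]


def within_overfit_threshold(n_features, n_samples, config):
    """True when #Features / #Samples is below the overfit threshold."""
    return n_features / n_samples < config.configure("overfit_threshold")


config = EdaConfig()
config.configure("high_cardinality", 15)
print(config.configure())
print(within_overfit_threshold(8, 1000, config))
```

Keeping every rule in one dictionary means the EDA checks can read their thresholds from a single place, and you only override the ones you care about.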

Outlier detection during EDA

The Speedml.eda method now performs automatic outlier detection based on how far the distribution of a feature's values is skewed away from a normal distribution. The outlier detection threshold is user configurable like so.

# Positive and negative skew within +- this value
sml.configure('outliers_threshold', 3)

Depending on the existence of outliers, the Speedml.eda method results suggest using the upper or lower percentile when calling the Speedml.feature.outlier method.
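The skew-based check can be sketched in a few lines of plain Python. This is an illustration only: the helper names skewness and suggest_percentile are hypothetical, the skew is computed as the population moment ratio (which may differ slightly from the estimator Speedml or pandas uses), and the mapping of positive skew to the upper percentile is an assumption.

```python
def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant feature, no skew
    return m3 / m2 ** 1.5


def suggest_percentile(values, outliers_threshold=3):
    """Suggest which tail to trim, or None when skew is within the threshold."""
    skew = skewness(values)
    if skew > outliers_threshold:
        return "upper"  # long right tail
    if skew < -outliers_threshold:
        return "lower"  # long left tail
    return None


fares = [1] * 99 + [1000]  # hypothetical feature with one extreme high value
print(suggest_percentile(fares))
print(suggest_percentile([1, 2, 3, 4, 5]))
```

Raising outliers_threshold makes the check more tolerant of skewed features, while lowering it flags more of them for outlier handling.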
