Linear Pipeline For Data Science Workflow
13 May 2017
A data science workflow can be iterative and take circuitous paths. Leaving it to the data scientist to remember how the workflow has progressed over the course of a project adds further to this complexity.
Speedml memorizes the data science workflow for the data scientist.
It does so using the simple Speedml.eda method. In release 0.9.2 we further optimize the method, making it user configurable and having it update progressively based on the workflow status.
Progressively updating workflow status
Now when you call the Speedml.eda method at the start of your workflow, during pre-processing, and before the model run, it returns a table which progressively hides the metrics that are complete.
Within the same notebook you can scroll to the prior or next EDA result to note the changes based on your workflow steps.
This makes the call to Speedml.eda akin to an automatically updating to-do list.
See how this feature works in the notebook Titanic Solution Using Speedml from our GitHub repository.
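To make the to-do list idea concrete, here is a minimal sketch in plain pandas. It is not the Speedml implementation: the eda_todo function, the two checks it runs, and the Metric and Results labels are all assumptions chosen for illustration.

```python
import pandas as pd

def eda_todo(df):
    """Hypothetical EDA summary: one row per check that still needs attention."""
    pending = []
    # Features that still contain missing values
    nulls = [c for c in df.columns if df[c].isnull().any()]
    if nulls:
        pending.append(('Nulls', nulls))
    # Features that are not yet numeric
    text = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if text:
        pending.append(('Text', text))
    return pd.DataFrame(pending, columns=['Metric', 'Results']).set_index('Metric')

train = pd.DataFrame({'Age': [22.0, None, 30.0], 'Sex': ['male', 'female', 'male']})
before = eda_todo(train)   # lists both the Nulls and Text rows

# Complete the two workflow steps, then run the summary again
train['Age'] = train['Age'].fillna(train['Age'].median())
train['Sex'] = train['Sex'].map({'male': 0, 'female': 1})
after = eda_todo(train)    # both checks are complete, so no rows remain
```

Calling the same function before and after each step gives a table that shrinks as work gets done, which is the behaviour described above.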
Pipelining from EDA to pre-processing
The Speedml.eda method now returns a list of features instead of tuples with cardinality. This helps in taking the cell output straight into a pandas dataframe filter or feature engineering methods like feature.density or feature.labels for the next stage of the workflow.
Cardinality is still available for three bands - high, normal (within threshold), and continuous or unique. For most workflows this information is enough.
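As a rough illustration of the three bands, the sketch below classifies features with plain pandas. The cardinality_band function is hypothetical, and its two thresholds simply borrow the names of the high_cardinality and unique_ratio configuration options. The exact rules inside Speedml may differ.

```python
import pandas as pd

def cardinality_band(series, high_cardinality=10, unique_ratio=80):
    """Hypothetical banding rule, not the Speedml implementation."""
    n_unique = series.nunique()
    pct_unique = 100.0 * n_unique / len(series)
    if pct_unique >= unique_ratio:
        return 'unique'        # continuous or unique
    if n_unique > high_cardinality:
        return 'high'          # more categories than the threshold
    return 'normal'            # within threshold

df = pd.DataFrame({
    'Ticket': ['T%d' % i for i in range(100)],        # 100 distinct values
    'Cabin': ['C%d' % (i % 20) for i in range(100)],  # 20 distinct values
    'Embarked': ['S', 'C', 'Q', 'S'] * 25,            # 3 distinct values
})
bands = {name: cardinality_band(df[name]) for name in df.columns}
```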
The following code demonstrates how we can pipeline results from the Speedml.eda method into the next stage of our workflow for pre-processing the features.
# Display top 5 samples with text unique features
sml.train[sml.eda().get_value('Text Unique', 'Results')].head()
# Convert categorical text features to numeric labels
text_categoricals = sml.eda().get_value('Text Categorical', 'Results')
sml.feature.labels(text_categoricals)
The Speedml.feature.density method now takes a feature name string or a list of feature name strings as its parameter to create density features for one or more high-cardinality features. This way you can pipe the eda method’s high-cardinality features list to the density method like so.
# Create density features for High-cardinality text features
text_high_cardinality = sml.eda().get_value('Text High-cardinality', 'Results')
sml.feature.density(text_high_cardinality)
That is easy.
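For readers who want to see what a density feature might look like, here is a sketch in plain pandas. It assumes that "density" means the number of samples sharing each value of a feature. The add_density helper and the _density column suffix are illustrative and are not taken from the Speedml source.

```python
import pandas as pd

def add_density(df, features):
    """Hypothetical helper: add a '<name>_density' column for each feature."""
    # Accept either a single feature name or a list of names
    if isinstance(features, str):
        features = [features]
    for name in features:
        # Map every value to the number of samples that share it
        df[name + '_density'] = df[name].map(df[name].value_counts())
    return df

df = pd.DataFrame({'Ticket': ['A', 'A', 'B', 'C', 'A']})
add_density(df, 'Ticket')   # a list such as ['Ticket'] works the same way
```

Accepting both a string and a list is what lets a list of feature names flow directly from one step into the next.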
User configurable EDA rules
Speedml EDA rules are now configurable using the API. You can configure how Speedml analyzes outliers, over-fitting, high-cardinality, unique or continuous features.
# Display the configuration dictionary
sml.config
# Output data path used internally within Speedml methods
sml.configure('outpath', 'output/')
# Positive and negative skew within +- this value
sml.configure('outliers_threshold', 3)
# #Features/#Samples Train < this value
sml.configure('overfit_threshold', 0.01)
# Feature is high-cardinality if categories > this value
sml.configure('high_cardinality', 10)
# Unique (continuous) if this percentage of values do not repeat
sml.configure('unique_ratio', 80)
Of course Speedml sets up the natural defaults so you do not have to.
# Display the configuration dictionary
sml.configuration()
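To show how a rule like overfit_threshold could be read, here is a small sketch. It takes the comment "#Features/#Samples Train < this value" literally. The overfit_check function and the feature and sample counts are illustrative assumptions, not Speedml code.

```python
def overfit_check(n_features, n_samples, overfit_threshold=0.01):
    """Hypothetical check: passes when features per sample stay below the threshold."""
    ratio = n_features / n_samples
    return ratio, ratio < overfit_threshold

# Illustrative counts: 11 features over 891 training samples
ratio, passes = overfit_check(n_features=11, n_samples=891)

# A looser threshold changes the outcome for the same dataset
_, passes_loose = overfit_check(n_features=11, n_samples=891, overfit_threshold=0.02)
```

With these counts the ratio is a little above 0.01, so the check only passes under the looser threshold. This is the kind of trade-off a configurable rule lets you tune.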
Outlier detection during EDA
The Speedml.eda method now performs automatic outlier detection based on the amount of skew of feature values from the normal distribution. The outlier detection threshold is user configurable like so.
# Positive and negative skew within +- this value
sml.configure('outliers_threshold', 3)
Depending on the existence of outliers, the Speedml.eda method results suggest using the upper or lower percentile in the Speedml.feature.outlier method call.
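A skew-based check of this kind can be sketched with the pandas Series.skew method. The suggest_percentile function below is hypothetical and only illustrates the idea that positive skew beyond the threshold points at the upper percentile and negative skew at the lower one. It is not the Speedml implementation.

```python
import pandas as pd

def suggest_percentile(series, outliers_threshold=3):
    """Hypothetical rule: suggest which tail to treat, based on sample skew."""
    skew = series.skew()
    if skew > outliers_threshold:
        return 'upper'   # long right tail, a few very large values
    if skew < -outliers_threshold:
        return 'lower'   # long left tail, a few very small values
    return None          # skew within +- threshold, nothing to suggest

fares = pd.Series([10.0] * 50 + [500.0])   # one extreme high value
ages = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # symmetric values

fare_hint = suggest_percentile(fares)
age_hint = suggest_percentile(ages)
```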