Linear Pipeline For Data Science Workflow

13 May 2017
A data science workflow can be iterative and take circuitous paths. What further adds to this complexity is leaving it to the data scientist's memory to track how the workflow progresses over the course of a project. Speedml memorizes the data science workflow for the data scientist.
It does so using the simple Speedml.eda method. In this release 0.9.2 we further optimize the method, making it user configurable and having it progressively update based on workflow status.
Progressively updating workflow status
Now when you call the Speedml.eda method at the start of your workflow, during pre-processing, and before a model run, it returns a table which progressively hides the metrics that are complete. Within the same notebook you can scroll to a prior or next EDA result to note the changes based on your workflow steps.
This ends up making a call to Speedml.eda akin to an automatically updating to-do list.
See how this feature works in the notebook Titanic Solution Using Speedml from our GitHub repository.
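The idea of a progressively updating to-do table can be sketched in a few lines. This is a toy stand-in, not Speedml's actual implementation; the metric names and results below are illustrative.

```python
# Toy sketch of a progressively updating EDA "to do" table: metrics that
# are already completed are hidden on each call (not Speedml's internals).
def eda_todo(metrics, completed):
    """Return only the metrics that still need attention."""
    return {name: result for name, result in metrics.items()
            if name not in completed}

metrics = {
    'Text Categorical': ['Sex', 'Embarked'],
    'Text High-cardinality': ['Name', 'Ticket'],
    'Numerical Outliers': ['Fare'],
}

# At the start of the workflow nothing is done, so every metric shows.
print(eda_todo(metrics, completed=set()))

# After labeling the categoricals, that row disappears from the table.
print(eda_todo(metrics, completed={'Text Categorical'}))
```

Each call reflects the current workflow state, which is what makes repeated EDA calls behave like a shrinking checklist.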
Pipelining from EDA to pre-processing
The Speedml.eda method now returns a list of features instead of tuples with cardinality. This helps in taking the cell output straight into a pandas dataframe filter or feature engineering methods like feature.labels for the next stage of the workflow.
Cardinality is still available for three bands - high, normal (within threshold), and continuous or unique. For most workflows this information is enough.
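The three cardinality bands can be illustrated with a small self-contained sketch. The thresholds below mirror the configurable values described later in this post, but the banding logic itself is an assumption, not Speedml's exact rules.

```python
# Hedged sketch of banding a feature by cardinality (illustrative
# thresholds, not Speedml's exact implementation).
def cardinality_band(values, high_cardinality=10, unique_ratio=80):
    """Classify values into one of three cardinality bands."""
    n_unique = len(set(values))
    # Continuous or unique: most values do not repeat.
    if n_unique * 100 / len(values) >= unique_ratio:
        return 'continuous/unique'
    # High cardinality: more categories than the threshold.
    if n_unique > high_cardinality:
        return 'high'
    return 'normal'

print(cardinality_band(['a', 'b', 'a', 'b'] * 10))  # two repeating categories
print(cardinality_band(list(range(40))))            # every value distinct
```

For most workflows, knowing which band a feature falls in is enough to decide the next pre-processing step.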
The following code demonstrates how we can pipeline results from the Speedml.eda method into the next stage of our workflow for pre-processing the features.
# Display top 5 samples with text unique features
sml.train[sml.eda().get_value('Text Unique', 'Results')].head()

# Convert categorical text features to numeric labels
text_categoricals = sml.eda().get_value('Text Categorical', 'Results')
sml.feature.labels(text_categoricals)
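What a labels-style conversion does can be shown with a self-contained sketch: each text category is mapped to an integer code. This mimics the behavior, not Speedml's internals.

```python
# Sketch of converting text categories to numeric labels (first-seen
# order); illustrates the idea behind feature.labels, not its internals.
def labels(values):
    """Encode text categories as integers."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

print(labels(['male', 'female', 'female', 'male']))  # [0, 1, 1, 0]
```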
The Speedml.feature.density method now takes a string feature name or a list of feature names as its parameter to create density features for one or more high-cardinality features. This way you can now pipe the eda method's high-cardinality features list to the density method like so.
# Create density features for high-cardinality text features
text_high_cardinality = sml.eda().get_value('Text High-cardinality', 'Results')
sml.feature.density(text_high_cardinality)
That is easy.
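A density feature replaces each category with how often it occurs, turning a high-cardinality text column into a useful numeric one. Here is a self-contained sketch of that idea (not Speedml's internals):

```python
# Sketch of a density feature: map each value to its frequency count,
# so a high-cardinality text column becomes numeric.
from collections import Counter

def density(values):
    """Return the occurrence count of each value, in place."""
    counts = Counter(values)
    return [counts[v] for v in values]

tickets = ['A10', 'B20', 'A10', 'C30', 'A10', 'B20']
print(density(tickets))  # [3, 2, 3, 1, 3, 2]
```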
User configurable EDA rules
Speedml EDA rules are now configurable using the API. You can configure how Speedml analyzes outliers, over-fitting, high-cardinality, and unique or continuous features.
# Display the configuration dictionary
sml.config

# Data output path used internally by Speedml methods
sml.configure('outpath', 'output/')

# Positive and negative skew within +/- this value
sml.configure('outliers_threshold', 3)

# Over-fit likely if #Features / #Samples (train) < this value
sml.configure('overfit_threshold', 0.01)

# Feature is high-cardinality if categories > this value
sml.configure('high_cardinality', 10)

# Unique (continuous) if more than this percentage of values are non-repeating
sml.configure('unique_ratio', 80)
Of course, Speedml sets up sensible defaults so you do not have to configure any of these.
# Display the configuration dictionary
sml.config
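The configure(name, value) pattern is easy to sketch: a setter backed by a defaults dictionary. The class and defaults below are illustrative assumptions based on the values shown in this post, not Speedml's source.

```python
# Minimal sketch of a configure(name, value) API backed by a defaults
# dictionary (illustrative, not Speedml's actual implementation).
class Config:
    def __init__(self):
        # Defaults as described in this post; actual defaults may differ.
        self.config = {
            'outpath': 'output/',
            'outliers_threshold': 3,
            'overfit_threshold': 0.01,
            'high_cardinality': 10,
            'unique_ratio': 80,
        }

    def configure(self, name, value=None):
        """Set a rule when a value is given, otherwise read it back."""
        if value is not None:
            self.config[name] = value
        return self.config[name]

sml = Config()
sml.configure('high_cardinality', 15)
print(sml.config['high_cardinality'])  # 15
```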
Outlier detection during EDA
The Speedml.eda method now performs automatic outlier detection based on the skew of feature values away from a normal distribution. The outlier detection threshold is user configurable like so.
# Positive and negative skew within +/- this value
sml.configure('outliers_threshold', 3)
Depending on the existence of outliers, the Speedml.eda method results suggest using the upper or lower percentile during the Speedml.feature.outlier method call.
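The skew-based suggestion can be sketched end to end: compute sample skewness by hand and recommend which tail to fix when it exceeds the threshold. The advice strings and formula choice are illustrative assumptions, not Speedml's exact output.

```python
# Hedged sketch of skew-based outlier advice (illustrative, not
# Speedml's exact logic or output strings).
def skewness(values):
    """Sample skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / (m2 ** 1.5)

def outlier_advice(values, threshold=3):
    """Suggest fixing the upper or lower percentile on extreme skew."""
    s = skewness(values)
    if s > threshold:
        return 'fix upper percentile outliers'
    if s < -threshold:
        return 'fix lower percentile outliers'
    return 'no outlier fix suggested'

fares = [7, 8, 9] * 6 + [8, 512]  # one extreme high value
print(outlier_advice(fares))
```

A strongly positive skew points at the upper tail, a strongly negative skew at the lower tail, which is exactly the percentile hint the EDA results carry into the feature.outlier call.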