Speedml Flow 0.8.0 First Public Release

27 Apr 2017

Release 0.8.0 is our first public release. We nicknamed it “Flow” because it represents the state we want to reach when coding Machine Learning projects, where we focus on the data, the problem at hand, and the solution, without the API getting in our way.

Here are the features in this release, grouped in the order of a typical ML workflow.

Exploratory Data Analysis

distribute Plot numerical features within the training dataset as a set of histograms to understand the distributions and skews of the features at a glance.

correlate Plot numerical features within the training dataset as a correlation matrix heatmap to determine which features are best related to the target and which features may be candidates for engineering or removal. The plot adapts automatically to the number of features, displaying more detail for fewer features and a more compact graph for a larger number of features.

ordinal Plot ordinal features (categorical numeric) against the target feature using a violin plot. Use this to spot outliers within ordinal features spread across the associated target feature values.

continuous Plot continuous features (numeric) using a scatter plot. Use this to spot outliers within continuous features.

model_ranks Plot a ranking of models by the accuracy they achieve on our datasets.

importance Plot importance of features based on ExtraTreesClassifier.

xgb_importance Plot importance of features based on XGBoost.

Pre-processing and data wrangling

drop Drop one or more features, named by a list of strings, from the train and test datasets.

impute Replace empty values in the entire dataframe with the median value for numerical features and the most common value for text features.
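To make the imputation rule concrete, here is a minimal standard-library sketch of the same idea applied to a single column. The function name impute_column and the use of None for an empty value are assumptions for illustration; this is not Speedml's implementation, which works on whole dataframes.

```python
from collections import Counter
from statistics import median


def impute_column(values):
    """Fill None entries with the median (numeric) or the most common value (text).

    Hypothetical helper for illustration only, not part of the Speedml API.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn a fill value from
    if all(isinstance(v, (int, float)) for v in present):
        fill = median(present)  # numerical feature: use the median
    else:
        fill = Counter(present).most_common(1)[0][0]  # text feature: use the mode
    return [fill if v is None else v for v in values]


ages = impute_column([22, None, 38, 26])
ports = impute_column(["S", "C", None, "S"])
```

The median is less sensitive to extreme values than the mean, which is one reason it is a common default for filling numerical gaps.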

fill_na Fill empty or null values in a named feature with a new string value.

replace In feature a, match values against a string or list of strings and replace them with a new string.

outliers_fix Fix outliers below a lower percentile, above an upper percentile, or both, within a feature.
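One common way to fix outliers by percentile is to clip values to the percentile bounds. The sketch below assumes that interpretation; the names percentile and clip_outliers, and the linear-interpolation percentile, are illustrative choices rather than Speedml's own code.

```python
def percentile(sorted_vals, pct):
    """Linear-interpolation percentile (pct in 0..100) of an already sorted list."""
    pos = (len(sorted_vals) - 1) * pct / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def clip_outliers(values, lower=None, upper=None):
    """Clip values to the given lower and/or upper percentile bounds.

    Hypothetical helper: a bound left as None is not applied.
    """
    ordered = sorted(values)
    low = percentile(ordered, lower) if lower is not None else ordered[0]
    high = percentile(ordered, upper) if upper is not None else ordered[-1]
    return [min(max(v, low), high) for v in values]


fares = [7, 8, 9, 10, 500]
clipped = clip_outliers(fares, upper=75)  # the 500 fare is pulled down to the bound
```

Clipping keeps every row in the dataset, which matters when the test set must keep the same number of samples.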

density Create a new feature, named after feature a with the suffix ‘_density’, based on the density or value_counts of each unique value in the feature.
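The density idea can be sketched with a plain counter: each row receives the number of times its value occurs in the feature. The helper name add_density and the list-of-dicts row format are assumptions, and the sketch uses raw counts rather than any normalized density.

```python
from collections import Counter


def add_density(rows, feature):
    """Return new rows with '<feature>_density' set to the count of each row's value.

    Hypothetical helper: rows is a list of dicts standing in for a dataframe.
    """
    counts = Counter(row[feature] for row in rows)
    return [{**row, feature + "_density": counts[row[feature]]} for row in rows]


rows = [{"Ticket": "A1"}, {"Ticket": "B2"}, {"Ticket": "A1"}]
result = add_density(rows, "Ticket")
```

A count feature like this turns a high-cardinality text feature into a number that models can use directly.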

add Update a numeric feature by adding the number num to each value.

sum Create a new numeric feature by adding the values of features a and b.

Similar methods for working on numerical features include diff, divide, product, and round.

Other methods include concat, list_len for the count of items in a list, and word_count for the word count in free-text feature values.

regex_extract Match a regular expression against the values of a text feature. If new = None, update the feature with the matching text. Otherwise, create a new feature based on the matching text.
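The two behaviours, update in place versus create a new feature, can be sketched with the standard re module. The helper name extract_text, the list-of-dicts rows, the empty string used when nothing matches, and the requirement that the pattern contain one capture group are all assumptions for illustration.

```python
import re


def extract_text(rows, feature, pattern, new=None):
    """Write the first captured group into `feature`, or into `new` when it is given.

    Hypothetical helper: the pattern must contain one capture group, and rows
    without a match receive an empty string.
    """
    target = feature if new is None else new
    out = []
    for row in rows:
        match = re.search(pattern, row[feature])
        text = match.group(1) if match else ""
        out.append({**row, target: text})
    return out


rows = [{"Name": "Braund, Mr. Owen"}, {"Name": "Heikkinen, Miss. Laina"}]
title_pattern = r" ([A-Za-z]+)\."  # the word just before the first period
titled = extract_text(rows, "Name", title_pattern, new="Title")
in_place = extract_text(rows, "Name", title_pattern)
```

Passing new keeps the original text available for further extraction, while omitting it overwrites the feature.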

labels Generate numerical labels to replace the text values in a list of categorical features.
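Label generation for one feature can be sketched as a mapping from each distinct text value to an integer. The helper name encode_labels and the choice to number values in sorted order are assumptions; Speedml may assign codes differently.

```python
def encode_labels(values):
    """Map each distinct text value to an integer code, numbered in sorted order.

    Hypothetical helper: returns the encoded values and the mapping used.
    """
    mapping = {text: code for code, text in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


codes, mapping = encode_labels(["S", "C", "Q", "S"])
```

Returning the mapping alongside the codes makes it possible to apply the same encoding to the test dataset.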

XGBoost model capabilities

hyper Tune XGBoost hyper-parameters by selecting from permutations of values in the select_params dictionary. The remaining parameters take single values from the fixed_params dictionary. Returns a dataframe ranking the select_params combinations.
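The way select_params and fixed_params combine can be sketched with itertools.product. The generator name param_grid is hypothetical and the parameter values are illustrative only. Scoring each candidate and building the ranking dataframe are left out, so this shows only the enumeration step.

```python
from itertools import product


def param_grid(select_params, fixed_params):
    """Yield one full parameter dict per combination of the select_params values.

    Hypothetical helper: every candidate also carries the fixed_params as-is.
    """
    names = list(select_params)
    for combo in product(*(select_params[name] for name in names)):
        yield {**fixed_params, **dict(zip(names, combo))}


select_params = {"max_depth": [3, 5], "learning_rate": [0.1, 0.3]}
fixed_params = {"n_estimators": 100}
candidates = list(param_grid(select_params, fixed_params))
```

The number of candidates is the product of the list lengths in select_params, so keeping those lists short keeps tuning time manageable.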

cv Calculate the Cross-Validation (CV) score for the XGBoost model based on the grid_params parameters. Sets the xgb.cv_results variable to the resulting dataframe.

Methods for defining params and the classifier, training the model with fit, and generating model results with predict.

Speedml workflow

Methods spanning model evaluation, feature selection, and dataset file input/output.
