ml.rf
Functions for the random forest classifier.
Module Contents
Functions
load_data – Load data from CSV files for each mimo in the given list.
preprocess_data – Preprocess data for training and testing by splitting it based on the given test fraction.
train_random_forest – Train random forest classifiers for the distance and charge features.
evaluate – Evaluate the random forest classifiers and return confusion matrices for both features.
plot_data – Plot the average charge and distance data for the given MIMO types.
plot_roc_curve – Plot the ROC curve for the test data of the charge and distance features.
plot_confusion_matrices – Plot confusion matrices for distance and charge features.
plot_gini_importance – Plot Gini importance bar plots for the top 20 features for each feature type.
General plotting parameters for the Kulik Lab.
Attributes
parser
- ml.rf.load_data(mimos, include_esp, data_loc)
Load data from CSV files for each mimo in the given list.
- Parameters:
mimos (list of str) – List of mimo names
data_loc (str) – Location of the data directory (e.g., /home/kastner/packages/molecuLearn/ml/data).
- Returns:
df_charge (dict) – Dictionary with mimo names as keys and charge data as values in pandas DataFrames.
df_dist (dict) – Dictionary with mimo names as keys and distance data as values in pandas DataFrames.
- ml.rf.preprocess_data(df_charge, df_dist, mimos, data_split_type, test_frac=0.8)
Preprocess data for training and testing by splitting it based on the given test fraction.
- Parameters:
df_charge (dict) – Dictionary with mimo names as keys and charge data as values in pandas DataFrames.
df_dist (dict) – Dictionary with mimo names as keys and distance data as values in pandas DataFrames.
mimos (list of str) – List of mimo names.
data_split_type (int) – Type of data split to use, either 1 or 2 (see Notes).
test_frac (float, optional, default: 0.8) – Fraction of the data used for training; the remainder is used for testing.
- Returns:
data_split (dict) – Dictionary containing the training and testing data for distance and charge features.
df_dist (dict) – Revised dictionary with mimo names as keys and distance data as values in pandas DataFrames.
df_charge (dict) – Revised dictionary with mimo names as keys and charge data as values in pandas DataFrames.
Notes
data_split_type selects one of two ways to divide the data; both use the fraction given by test_frac.
1 – Each trajectory is split into train and test portions, and the portions from all trajectories are stitched together into overall train and test sets.
2 – The entire dataset is split by trajectory: the first set of trajectories belongs to the train set and the second set of trajectories belongs to the test set.
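The two split types described in the Notes can be sketched with plain Python lists. This is a minimal illustration rather than the module's implementation: the helper names `split_within_trajectories` and `split_across_trajectories` are hypothetical, the toy trajectories stand in for the pandas DataFrames the real function receives, and the fraction is treated as the training fraction, as the parameter description states.

```python
def split_within_trajectories(trajectories, train_frac):
    """Split type 1 (sketch): split every trajectory, then stitch the pieces."""
    train, test = [], []
    for frames in trajectories:
        cut = int(len(frames) * train_frac)  # frames before the cut go to train
        train.extend(frames[:cut])
        test.extend(frames[cut:])
    return train, test


def split_across_trajectories(trajectories, train_frac):
    """Split type 2 (sketch): assign whole trajectories to train or test."""
    cut = int(len(trajectories) * train_frac)  # trajectories before the cut go to train
    train = [frame for frames in trajectories[:cut] for frame in frames]
    test = [frame for frames in trajectories[cut:] for frame in frames]
    return train, test


# Toy data (made up): five trajectories of ten frames, each frame tagged
# as (trajectory index, frame index).
trajectories = [[(t, f) for f in range(10)] for t in range(5)]

train_1, test_1 = split_within_trajectories(trajectories, 0.8)
train_2, test_2 = split_across_trajectories(trajectories, 0.8)
```

With type 1, every trajectory contributes frames to both sets; with type 2, the test set contains only trajectories the classifier never saw during training.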
- ml.rf.train_random_forest(feature, data_split, n_estimators, max_depth, min_samples_split, min_samples_leaf)
Train random forest classifiers for the distance and charge features.
- Parameters:
data_split (dict) – Dictionary containing the training and testing data for distance and charge features.
n_estimators (int) – Number of trees in the random forest.
max_depth (int) – Maximum depth of the trees in the random forest.
- Returns:
rf_cls (dict) – Dictionary containing trained random forest classifiers for distance and charge features.
- ml.rf.evaluate(rf_cls, data_split, mimos)
Evaluate the random forest classifiers and return confusion matrices for both features.
- Parameters:
rf_cls (dict) – Dictionary containing trained random forest classifiers for distance and charge features.
data_split (dict) – Dictionary containing the training and testing data for distance and charge features.
mimos (list) – List of MIMO types, e.g. ['mc6', 'mc6s', 'mc6sa']
- Returns:
cms (dict) – Dictionary containing confusion matrices for distance and charge features.
y_true (dict) – Dictionary containing the ground-truth labels of the test data (1D arrays) for distance and charge features.
y_pred_proba (dict) – Dictionary containing the predicted class probabilities (2D arrays with one column per class) for distance and charge features.
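To make the relationship between these return values concrete, the sketch below builds a confusion matrix from ground-truth labels and per-class probabilities for a single feature type. It is an illustration under assumptions, not the module's code: `confusion_matrix_from_proba` is a hypothetical helper, the labels are assumed to be integer indices into `mimos`, rows are taken as true classes and columns as predicted classes, and the numbers are made up.

```python
def confusion_matrix_from_proba(y_true, y_pred_proba, n_classes):
    """Count (true class, predicted class) pairs; rows are true classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for true_label, probs in zip(y_true, y_pred_proba):
        # The predicted class is the column with the highest probability.
        predicted = max(range(n_classes), key=lambda k: probs[k])
        cm[true_label][predicted] += 1
    return cm


mimos = ["mc6", "mc6s", "mc6sa"]

# Made-up test labels (assumed to index into mimos) and class probabilities.
y_true = [0, 0, 1, 1, 2, 2]
y_pred_proba = [
    [0.70, 0.20, 0.10],
    [0.40, 0.45, 0.15],
    [0.10, 0.85, 0.05],
    [0.25, 0.55, 0.20],
    [0.15, 0.25, 0.60],
    [0.50, 0.15, 0.35],
]

cm = confusion_matrix_from_proba(y_true, y_pred_proba, len(mimos))
for name, row in zip(mimos, cm):
    print(name, row)
```

Each row sums to the number of test samples of that class, so off-diagonal entries show which MIMO types are confused with one another.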
- ml.rf.plot_data(df_charge, df_dist, mimos)
Plot the average charge and distance data for the given MIMO types.
- Parameters:
df_charge (dict) – Dictionary of DataFrames containing charge data for each MIMO type.
df_dist (dict) – Dictionary of DataFrames containing distance data for each MIMO type.
mimos (list) – List of MIMO types, e.g. ['mc6', 'mc6s', 'mc6sa']
- ml.rf.plot_roc_curve(y_true, y_pred_proba, mimos, data_set_type)
Plot the ROC curve for the test data of the charge and distance features.
- Parameters:
y_true (dict) – Dictionary containing the ground-truth labels of the test data (1D arrays) for distance and charge features.
y_pred_proba (dict) – Dictionary containing the predicted class probabilities (2D arrays with one column per class) for distance and charge features.
mimos (list) – List of MIMO types, e.g. ['mc6', 'mc6s', 'mc6sa']
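With more than two classes, a ROC curve is commonly drawn per class by treating that class as positive and all others as negative (one-vs-rest). Whether the module uses this scheme is an assumption. The sketch below computes the curve points and the area under the curve for one feature type in plain Python; `roc_points` and `auc` are hypothetical helpers, the data are made up, and the simple threshold sweep assumes no tied scores within a column.

```python
def roc_points(y_true, scores, positive_class):
    """Return (fpr, tpr) lists with one class treated as positive."""
    is_pos = [label == positive_class for label in y_true]
    n_pos = sum(is_pos)
    n_neg = len(is_pos) - n_pos
    # Lower the threshold one sample at a time, highest score first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i in order:
        if is_pos[i]:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


mimos = ["mc6", "mc6s", "mc6sa"]
y_true = [0, 0, 1, 1, 2, 2]  # made-up labels, assumed to index into mimos
y_pred_proba = [
    [0.70, 0.20, 0.10],
    [0.40, 0.45, 0.15],
    [0.10, 0.85, 0.05],
    [0.25, 0.55, 0.20],
    [0.15, 0.25, 0.60],
    [0.50, 0.15, 0.35],
]

aucs = {}
for k, name in enumerate(mimos):
    column = [row[k] for row in y_pred_proba]  # scores for class k
    fpr, tpr = roc_points(y_true, column, positive_class=k)
    aucs[name] = auc(fpr, tpr)
```

The `fpr` and `tpr` lists are what a plotting routine would draw, one curve per MIMO type.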
- ml.rf.plot_confusion_matrices(cms, mimos)
Plot confusion matrices for distance and charge features.
- Parameters:
cms (dict) – Dictionary containing confusion matrices for distance and charge features.
mimos (list) – List of MIMO types, e.g. ['mc6', 'mc6s', 'mc6sa']
- ml.rf.plot_gini_importance(rf_cls, df_dist, df_charge)
Plot Gini importance bar plots for the top 20 features for each feature type.
- Parameters:
rf_cls (dict) – Dictionary containing trained random forest classifiers for distance and charge features.
df_dist (dict) – Dictionary of DataFrames containing distance data for each MIMO type.
df_charge (dict) – Dictionary of DataFrames containing charge data for each MIMO type.
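Selecting the top 20 features amounts to ranking feature names by their importance scores and keeping the first 20. The sketch below shows that ranking step only. `top_features` is a hypothetical helper, the feature names and scores are made up, and in practice the names would presumably come from the DataFrame columns and the scores from the trained classifiers.

```python
def top_features(feature_names, importances, k=20):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(
        zip(feature_names, importances),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Made-up example: 30 features with distinct scores normalized to sum to 1.
feature_names = [f"feature_{i}" for i in range(30)]
raw_scores = [(i * 7) % 30 + 1 for i in range(30)]
total = sum(raw_scores)
importances = [score / total for score in raw_scores]

top20 = top_features(feature_names, importances)
for name, importance in top20[:3]:
    print(f"{name}: {importance:.4f}")
```

The resulting pairs are already in descending order, which is the order a bar plot of importances would typically use.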
- ml.rf.parser