`ml.mlp`

Functions for the multi-layer perceptron classifier.

Module Contents

Classes

`MimoMLP`	Base class for all neural network modules.
`MDDataset`	An abstract class representing a `Dataset`.

Functions

`load_data`(mimos, include_esp, data_loc)	Load data from CSV files for each mimo in the given list.
`gradient_step`(model, dataloader, optimizer, device)	A function to train on the entire dataset for one epoch.
`validate`(model, dataloader, device)	A function to validate on the validation dataset for one epoch.
`train`(feature, layers, lr, n_epochs, l2, ...)	A function to train and validate the model over all epochs.
`evaluate_model`(feature, mlp_cls, test_dataloader, ...)	A function to evaluate the model on test data.
`preprocess_data`(df_charge, df_dist, mimos, data_split_type)	Split train and test based on the given test and validation fractions.
`build_dataloaders`(data_split)	A function to build the DataLoaders from the data split.
`plot_data`(df_charge, df_dist, mimos)	Plot the average charge and distance data for the given MIMO types.
`plot_train_val_losses`(train_loss_per_epoch, ...)	Plot the train and validation losses as a function of epoch number.
`plot_roc_curve`(y_true, y_pred_proba, mimos, data_set_type)	Plot the ROC curve for the test data of the charge and distance features.
`plot_confusion_matrices`(cms, mimos)	Plot confusion matrices for distance and charge features.
`shap_analysis`(mlp_cls, train_loader, test_loader, ...)	Plot SHAP dot plots for each mimochrome to identify feature importance.
`create_layers`(input_size, n_neurons)
`run_mlp`(data_split_type, include_esp, n_epochs, ...)
`train_with_hyperparameters`(trial, feature, ...)
`optuna_mlp`(data_split_type, include_esp, n_trials, ...)
`format_plots`() → None	General plotting parameters for the Kulik Lab.

Attributes

parser

ml.mlp.load_data(mimos, include_esp, data_loc)

Load data from CSV files for each mimo in the given list.

Parameters:

mimos (list[str]) – List of mimo names.
data_loc (str) – The location of the data directory (e.g., /home/kastner/packages/molecuLearn/ml/data)

Returns:

df_charge (dict) – Dict with mimo names as keys and charge data as values
df_dist (dict) – Dict with mimo names as keys and distance data as values

ml.mlp.gradient_step(model, dataloader, optimizer, device)

A function to train on the entire dataset for one epoch.

Parameters:

model (torch.nn.Module) – The model
dataloader (torch.utils.data.DataLoader) – DataLoader object for the train data
optimizer (torch.optim.Optimizer) – Optimizer object that applies the computed gradients to update the model parameters
device (str) – The device (usually ‘cuda:0’ for GPU or ‘cpu’ for CPU)

Returns:

loss – Loss averaged over all the batches

Return type:

float
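
As a rough illustration of a one-epoch training pass with this signature, the sketch below iterates over the batches, takes one optimizer step per batch, and returns the loss averaged over all batches. It is not the module's actual implementation: the name `one_epoch_sketch`, the choice of `torch.nn.CrossEntropyLoss`, and the assumption that the DataLoader yields `(inputs, labels)` pairs are all illustrative.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def one_epoch_sketch(model, dataloader, optimizer, device):
    """Hypothetical one-epoch training pass; returns the mean batch loss."""
    loss_fn = nn.CrossEntropyLoss()  # assumed loss for a multi-class classifier
    model.train()
    total = 0.0
    for X, y in dataloader:  # assumes each batch is an (inputs, labels) pair
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()        # clear gradients from the previous batch
        loss = loss_fn(model(X), y)
        loss.backward()              # compute gradients
        optimizer.step()             # update parameters
        total += loss.item()
    return total / len(dataloader)   # average over the number of batches


# Toy usage with random data: 32 samples, 4 features, 3 classes.
torch.manual_seed(0)
toy = TensorDataset(torch.randn(32, 4), torch.randint(0, 3, (32,)))
loader = DataLoader(toy, batch_size=8)
net = nn.Linear(4, 3)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
epoch_loss = one_epoch_sketch(net, loader, opt, "cpu")
```

Dividing by `len(dataloader)` gives a per-batch average, which matches the "averaged over all the batches" wording of the return value.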

ml.mlp.validate(model, dataloader, device)

A function to validate on the validation dataset for one epoch.

Parameters:

model (torch.nn.Module) – The model
dataloader (torch.utils.data.DataLoader) – DataLoader object for the validation data
device (str) – Your device (usually ‘cuda:0’ for GPU or ‘cpu’ for CPU)

Returns:

loss – Loss averaged over all the batches

Return type:

float
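
A validation pass differs from a training pass mainly in that no gradients are computed and no parameters are updated. The following sketch shows one way this could look. The name `validate_sketch`, the loss function, and the `(inputs, labels)` batch layout are assumptions, not details taken from the module's source.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def validate_sketch(model, dataloader, device):
    """Hypothetical validation pass; returns the mean batch loss."""
    loss_fn = nn.CrossEntropyLoss()  # assumed loss for a multi-class classifier
    model.eval()                     # switch layers such as dropout to eval mode
    total = 0.0
    with torch.no_grad():            # no gradient tracking during validation
        for X, y in dataloader:      # assumes (inputs, labels) batches
            X, y = X.to(device), y.to(device)
            total += loss_fn(model(X), y).item()
    return total / len(dataloader)


# Toy usage with random data: 16 samples, 4 features, 3 classes.
torch.manual_seed(0)
toy = TensorDataset(torch.randn(16, 4), torch.randint(0, 3, (16,)))
loader = DataLoader(toy, batch_size=8)
net = nn.Linear(4, 3)
val_loss = validate_sketch(net, loader, "cpu")
```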

ml.mlp.train(feature, layers, lr, n_epochs, l2, train_dataloader, val_dataloader, device)

A function to train and validate the model over all epochs.

Parameters:

layers (dict) – Dict containing model architecture for distance and charge features
lr (float) – Step size for adjusting parameters given computed error gradient
n_epochs (int) – Number of epochs over training and validation sets
train_dataloader (dict of torch.utils.data.DataLoader) – Dict of DataLoader objects for distance and charge training data
val_dataloader (dict of torch.utils.data.DataLoader) – Dict of DataLoader objects for distance and charge validation data
device (str) – Your device (usually ‘cuda:0’ for GPU or ‘cpu’ for CPU)

Returns:

mlp_cls (dict) – Dict containing trained MLP models for distance and charge features
train_loss_per_epoch (dict) – Dict containing training loss as a function of epoch number
val_loss_per_epoch (dict) – Dict containing validation loss as a function of epoch number

ml.mlp.evaluate_model(feature, mlp_cls, test_dataloader, device, mimos)

A function to evaluate the model on test data.

Parameters:

mlp_cls (dict of torch.nn.Module) – Dict containing trained MLP classifiers
test_dataloader (dict of torch.utils.data.DataLoader) – Dict containing DataLoader object for the test data
device (str) – Your device (usually ‘cuda:0’ for GPU or ‘cpu’ for CPU)
mimos (list) – List of MIMO types, e.g. [‘mc6’, ‘mc6s’, ‘mc6sa’]

Returns:

test_loss (dict) – Dict containing average test loss
y_true (dict) – Dict containing test data ground truth labels
y_pred_proba (dict) – Dict containing softmax probabilities of the predicted labels
y_pred (dict) – Dict containing prediction labels
cms (dict) – Dict containing confusion matrices
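
To make the relationship between the returned quantities concrete, the sketch below derives softmax probabilities, predicted labels, and a confusion matrix from raw model scores for a single feature. It uses only the standard library and toy numbers. The helper names and the convention that rows index the true class and columns the predicted class are assumptions, since the source does not state how the module builds its matrices.

```python
import math


def softmax(logits):
    """Convert one row of raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def confusion_matrix(y_true, y_pred, n_classes):
    """Assumed layout: rows index the true class, columns the predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


mimos = ["mc6", "mc6s", "mc6sa"]
# Toy raw scores for four samples (one row per sample, one column per class).
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0], [1.0, 2.0, 0.5]]
y_true = [0, 1, 2, 0]

y_pred_proba = [softmax(row) for row in logits]          # class probabilities
y_pred = [row.index(max(row)) for row in y_pred_proba]   # most probable class
cm = confusion_matrix(y_true, y_pred, len(mimos))
```

In this toy case the last sample is a true `mc6` that is predicted as `mc6s`, so it lands off the diagonal in the first row.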

class ml.mlp.MimoMLP(layers)

Bases: torch.nn.Module

Base class for all neural network modules.

Your models should also subclass this class.

Modules can also contain other Modules, allowing them to be nested in a tree structure. You can assign the submodules as regular attributes:

import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

Submodules assigned in this way will be registered, and will have their parameters converted too when you call to(), etc.

Note

As per the example above, an __init__() call to the parent class must be made before assignment on the child.

Variables: training (bool) – Whether this module is in training or evaluation mode.

forward(x)

ml.mlp.preprocess_data(df_charge, df_dist, mimos, data_split_type, val_frac=0.6, test_frac=0.8)

Split train and test based on the given test and validation fractions.

Parameters:

df_charge (dict) – Dict with mimo names as keys and charge data as values
df_dist (dict) – Dict with mimo names as keys and distance data as values
mimos (list of str) – List of mimo names.
data_split_type (int) – Either 1 (each trajectory as train/val/test) or 2 (split the entire dataset)
val_frac (float, optional, default: 0.6) – Fraction of data to use for training (rest for val and test)
test_frac (float, optional, default: 0.8) – Fraction of data to use for train and val (rest for testing)

Returns:

data_split (dict) – Dict containing the train and test data for distance and charge features
df_charge (dict) – Revised dict with mimo names as keys and charge data as values
df_dist (dict) – Revised dict with mimo names as keys and distance data as values
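
The two fractions act as cumulative cut points: with the defaults, the first 60% of the data is used for training, the next 20% for validation, and the final 20% for testing. The sketch below illustrates that arithmetic on a plain list. The helper `split_by_fractions` is hypothetical, and whether the real function shuffles rows before splitting is not stated in the source, so no shuffling is assumed here.

```python
def split_by_fractions(rows, val_frac=0.6, test_frac=0.8):
    """Hypothetical helper: slice rows at the two cumulative cut points."""
    n = len(rows)
    train_end = round(val_frac * n)   # end of the training portion
    val_end = round(test_frac * n)    # end of the validation portion
    return {
        "train": rows[:train_end],
        "val": rows[train_end:val_end],
        "test": rows[val_end:],
    }


frames = list(range(10))  # stand-in for ten rows of trajectory data
split = split_by_fractions(frames)
sizes = {name: len(part) for name, part in split.items()}
```

Using `round` rather than `int` avoids losing a row when a product such as `0.7 * 10` lands slightly off a whole number in floating point.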

ml.mlp.build_dataloaders(data_split)

A function to build the DataLoaders from the data split.

Parameters:

data_split (dict) – Dict containing the train and test data for all features.

Returns:

train_loader (dict) – Dict containing DataLoader object for the train data for all features
val_loader (dict) – Dict containing DataLoader object for the val data for all features
test_loader (dict) – Dict containing DataLoader object for the test data for all features
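
One plausible way to produce three per-feature dicts of DataLoaders is sketched below. The nested layout of `data_split` (feature name, then split name, then an `(X, y)` pair), the batch size, and the choice to shuffle only the training data are assumptions made for this example. The actual structure used by the module is not documented here.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loaders_sketch(data_split, batch_size=4):
    """Hypothetical builder: returns (train, val, test) dicts keyed by feature."""
    loaders = {"train": {}, "val": {}, "test": {}}
    for feature, parts in data_split.items():
        for part in loaders:
            X, y = parts[part]  # assumed layout: (features, labels) per split
            dataset = TensorDataset(
                torch.as_tensor(X, dtype=torch.float32),
                torch.as_tensor(y, dtype=torch.long),
            )
            # Shuffle only the training data (an illustrative choice).
            loaders[part][feature] = DataLoader(
                dataset, batch_size=batch_size, shuffle=(part == "train")
            )
    return loaders["train"], loaders["val"], loaders["test"]


def toy_part(n):
    """Toy split: n samples with 3 features each and labels cycling 0, 1, 2."""
    return [[float(i), 0.0, 1.0] for i in range(n)], [i % 3 for i in range(n)]


data_split = {
    feature: {"train": toy_part(8), "val": toy_part(4), "test": toy_part(4)}
    for feature in ("dist", "charge")
}
train_loader, val_loader, test_loader = build_loaders_sketch(data_split)
X_batch, y_batch = next(iter(val_loader["dist"]))
```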

ml.mlp.plot_data(df_charge, df_dist, mimos)

Plot the average charge and distance data for the given MIMO types.

Parameters:

df_charge (dict) – Dictionary of DataFrames containing charge data for each MIMO type.
df_dist (dict) – Dictionary of DataFrames containing distance data for each MIMO type.
mimos (list) – List of MIMO types, e.g. [‘mc6’, ‘mc6s’, ‘mc6sa’]

ml.mlp.plot_train_val_losses(train_loss_per_epoch, val_loss_per_epoch)

Plot the train and validation losses as a function of epoch number.

Parameters:

train_loss_per_epoch (dict) – Dict of np.arrays containing train losses per epoch
val_loss_per_epoch (dict) – Dict of np.arrays containing val losses per epoch

ml.mlp.plot_roc_curve(y_true, y_pred_proba, mimos, data_set_type)

Plot the ROC curve for the test data of the charge and distance features.

Parameters:

y_true (dict) – Dict of np.arrays containing ground truth labels of the test data
y_pred_proba (dict) – Dict of np.arrays containing softmaxed probability predictions for the test data
mimos (list) – List of MIMO types, e.g. [‘mc6’, ‘mc6s’, ‘mc6sa’]
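
With more than two classes, a ROC curve is usually drawn per class by treating that class as positive and all others as negative. The source does not say which scheme this function uses, so the one-vs-rest approach below is an assumption. The sketch computes only the area under each curve, using the fact that the AUC equals the fraction of (positive, negative) pairs that the classifier ranks correctly. The function name and toy values are illustrative.

```python
def one_vs_rest_auc(y_true, y_pred_proba, class_idx):
    """AUC for one class against all others, via pairwise score comparison."""
    pos = [p[class_idx] for t, p in zip(y_true, y_pred_proba) if t == class_idx]
    neg = [p[class_idx] for t, p in zip(y_true, y_pred_proba) if t != class_idx]
    # Count correctly ranked (positive, negative) pairs; ties count as half.
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


mimos = ["mc6", "mc6s", "mc6sa"]
y_true = [0, 0, 1, 1, 2, 2]
# Toy softmax outputs: one row per sample, one column per class.
y_pred_proba = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]
aucs = {
    name: one_vs_rest_auc(y_true, y_pred_proba, i)
    for i, name in enumerate(mimos)
}
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations. For real test sets, a library routine that sorts the scores once is the more practical choice.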

ml.mlp.plot_confusion_matrices(cms, mimos)

Plot confusion matrices for distance and charge features.

Parameters:

cms (dict) – Dict containing confusion matrices for distance and charge features.
mimos (list) – List of MIMO types, e.g. [‘mc6’, ‘mc6s’, ‘mc6sa’]

ml.mlp.shap_analysis(mlp_cls, train_loader, test_loader, val_loader, df_dist, df_charge, mimos)

Plot SHAP dot plots for each mimochrome to identify feature importance.

Parameters:

mlp_cls (dict) – Dict containing trained MLP classifiers for distance and charge features
train_loader (dict) – Dict containing DataLoader object for the train data for distance and charge features
test_loader (dict) – Dict containing DataLoader object for the test data for distance and charge features
val_loader (dict) – Dict containing DataLoader object for the val data for distance and charge features
df_dist (dict) – Dict of DataFrames containing distance data for each MIMO type.
df_charge (dict) – Dict of DataFrames containing charge data for each MIMO type.
mimos (list) – List of MIMO types, e.g. [‘mc6’, ‘mc6s’, ‘mc6sa’]

class ml.mlp.MDDataset(X, y)

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up the loading of batched samples. This method accepts a list of sample indices for a batch and returns a list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
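
A minimal map-style dataset that follows the contract described above might look like the sketch below. The class name `ArrayDataset` and the tensor dtypes are assumptions chosen for illustration. The sketch only mirrors the `(X, y)` constructor signature of `MDDataset` and does not reproduce its actual code.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ArrayDataset(Dataset):
    """Hypothetical map-style dataset over a feature matrix X and labels y."""

    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.long)

    def __len__(self):
        # Size of the dataset, used by samplers and by DataLoader.
        return len(self.y)

    def __getitem__(self, idx):
        # Fetch one (features, label) sample for an integer index.
        return self.X[idx], self.y[idx]


ds = ArrayDataset([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [0, 1, 2])
features, label = ds[1]
batches = list(DataLoader(ds, batch_size=2))  # default sampler, integer indices
```

Because the keys here are plain integers, the default index sampler works and no custom sampler is needed.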

ml.mlp.create_layers(input_size, n_neurons)

ml.mlp.run_mlp(data_split_type, include_esp, n_epochs, hyperparams)

ml.mlp.train_with_hyperparameters(trial, feature, train_loader, val_loader, n_dist, n_charge)

ml.mlp.optuna_mlp(data_split_type, include_esp, n_trials, out_name)

ml.mlp.format_plots() → None

General plotting parameters for the Kulik Lab.

ml.mlp.parser
