Model Modules

StackingModel

class pymaftools.model.StackingModel.OmicsStackingModel(omics_dict, class_order, base_model=RandomForestClassifier, final_model=LogisticRegression, random_state=42)

Bases: object

Multi-omics stacking classifier.

Builds a StackingClassifier where each base estimator operates on a single omics layer, and a final meta-learner combines their predictions.

Parameters:

omics_dict (dict[str, PivotTable]) – Mapping of omics names to PivotTable objects (features as index).
class_order (list[str]) – Ordered class labels used for encoding/decoding.
base_model (type, default RandomForestClassifier) – Class of the base estimator (instantiated per omics layer).
final_model (type, default LogisticRegression) – Class of the final meta-learner.
random_state (int, default 42) – Random seed for reproducibility.
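
The per-layer stacking idea described above can be sketched with plain scikit-learn. This is not the pymaftools implementation: the column names, the synthetic data, and the `layer_estimator` helper are assumptions made for illustration. Each base estimator sees only the columns of one omics layer, and a logistic regression combines the layer-wise predictions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic samples-by-features matrix with two hypothetical omics layers.
rng = np.random.default_rng(0)
n_samples = 60
y = np.array([0, 1] * (n_samples // 2))
mut_cols = [f"mut_{i}" for i in range(5)]
cnv_cols = [f"cnv_{i}" for i in range(5)]
X = pd.DataFrame(rng.normal(size=(n_samples, 10)), columns=mut_cols + cnv_cols)
X["mut_0"] += 2 * y  # inject some signal into each layer
X["cnv_0"] += 2 * y


def layer_estimator(columns, seed):
    """Hypothetical helper: a base model restricted to one layer's columns."""
    select = ColumnTransformer([("select", "passthrough", columns)])
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    return Pipeline([("select", select), ("clf", clf)])


stack = StackingClassifier(
    estimators=[
        ("mutation", layer_estimator(mut_cols, 42)),
        ("cnv", layer_estimator(cnv_cols, 42)),
    ],
    final_estimator=LogisticRegression(random_state=42),
    cv=3,
)
stack.fit(X, y)
proba = stack.predict_proba(X)  # shape (n_samples, n_classes)
```

Restricting each base estimator with a column selector keeps a single combined feature matrix as the model input, which matches the `fit(X, y)` signature documented below.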

build_model()

Build the stacking classifier from omics_dict.

Return type: None

encode_y(y)

Encode labels to integer indices using class_order.

Return type: ndarray

decode_y(y_encoded)

Decode integer indices back to original labels.

Return type: ndarray
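
A minimal sketch of label encoding and decoding driven by an ordered class list. The class labels and the helper names `encode_labels` and `decode_labels` are hypothetical; the library's own methods may differ in detail.

```python
import numpy as np

# Hypothetical class labels; class_order fixes the integer assigned to each one.
class_order = ["LUAD", "ASC", "LUSC"]
label_to_index = {label: index for index, label in enumerate(class_order)}


def encode_labels(y):
    """Map each label to its position in class_order."""
    return np.array([label_to_index[label] for label in y])


def decode_labels(y_encoded):
    """Map integer indices back to the labels in class_order."""
    return np.asarray(class_order)[np.asarray(y_encoded)]


labels = ["LUSC", "LUAD", "ASC", "LUAD"]
encoded = encode_labels(labels)
decoded = decode_labels(encoded)
```

Fixing the order explicitly, rather than sorting the labels, keeps the column order of the probability matrix stable across runs.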

fit(X, y)

Fit the stacking model.

Parameters:

X (pd.DataFrame) – Training data (samples as rows, all omics features as columns).
y (array-like) – Target labels.

Return type:

None

predict(X)

Predict class labels.

Parameters: X (pd.DataFrame) – Input data.
Return type: ndarray
Returns: np.ndarray – Decoded class labels.

predict_proba(X)

Predict class probabilities.

Parameters: X (pd.DataFrame) – Input data.
Return type: ndarray
Returns: np.ndarray – Probability matrix of shape (n_samples, n_classes).

get_omics_feature_importance(omics_key)

Get feature importances for a specific omics layer.

Parameters: omics_key (str) – Key in omics_dict identifying the omics layer.
Return type: Series
Returns: pd.Series – Feature importances indexed by feature names.

get_omics_weights()

Return the weights of each omics layer in the final meta-learner.

Return type: DataFrame
Returns: pd.DataFrame – Weights with omics as rows. Includes abs_mean and abs_ratio columns for interpretability.
Raises: ValueError – If the model has not been fitted or the final estimator does not expose coef_.
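
The exact definitions of abs_mean and abs_ratio are not stated here. One plausible reading, shown as a sketch under that assumption, takes abs_mean as the mean absolute coefficient of a layer across classes and abs_ratio as that value normalized to sum to 1 over layers. The coefficient values and layer names below are illustrative only.

```python
import pandas as pd

# Illustrative meta-learner coefficients: one row per omics layer,
# one column per class (an assumed layout).
coef = pd.DataFrame(
    {"class_A": [0.8, -0.4, 0.1], "class_B": [-0.2, 0.6, -0.1]},
    index=["mutation", "cnv", "expression"],
)

weights = coef.copy()
# Assumed definition: average magnitude of each layer's coefficients.
weights["abs_mean"] = coef.abs().mean(axis=1)
# Assumed definition: each layer's share of the total magnitude.
weights["abs_ratio"] = weights["abs_mean"] / weights["abs_mean"].sum()
```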

plot_final_coefficients()

Plot the final meta-learner coefficients as a heatmap.

Return type: None

confusion_matrix(y_true, y_pred, title=None)

Plot a confusion matrix heatmap.

Parameters:

y_true (array-like) – True labels.
y_pred (array-like) – Predicted labels.
title (str, optional) – Plot title.

Return type:

None

evaluate(X, y_true, average='macro', show=True)

Evaluate classification performance.

Parameters:

X (pd.DataFrame) – Input data.
y_true (array-like) – True labels.
average (str, default "macro") – Averaging strategy for multi-class metrics.
show (bool, default True) – Whether to print the metrics.

Return type:

dict[str, float | None]

Returns:

dict[str, float | None] – Dictionary with keys accuracy, f1, precision, recall, and roc_auc.
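
The five metrics can be computed with scikit-learn as sketched below. The `compute_metrics` helper, the toy predictions, and the rule of returning None for roc_auc when it cannot be computed are assumptions, not the library's confirmed behavior.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy three-class predictions (illustrative values only).
y_true = np.array([0, 1, 2, 0, 1, 2])
y_pred = np.array([0, 1, 2, 0, 2, 2])
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.1, 0.8],
])


def compute_metrics(y_true, y_pred, proba, average="macro"):
    """Hypothetical helper returning the metric keys listed above."""
    try:
        # One-vs-rest AUC over the full probability matrix.
        roc_auc = roc_auc_score(y_true, proba, multi_class="ovr", average=average)
    except ValueError:
        # Assumed fallback, e.g. when a class is missing from y_true.
        roc_auc = None
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average=average, zero_division=0),
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "roc_auc": roc_auc,
    }


metrics = compute_metrics(y_true, y_pred, proba)
```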

class pymaftools.model.StackingModel.ASCStackingModel(omics_dict, class_order, random_state=42)

Bases: OmicsStackingModel

Stacking model pre-configured for ASC (adenosquamous carcinoma) analysis.

Parameters:

omics_dict (dict[str, PivotTable]) – Mapping of omics names to PivotTable objects.
class_order (list[str]) – Ordered class labels.
random_state (int, default 42) – Random seed.

soft_score(X)

Compute the LUSC probability score for each sample.

Parameters: X (pd.DataFrame) – Input data.
Return type: ndarray
Returns: np.ndarray – LUSC class probability for each sample.
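
Conceptually, such a score is one column of the probability matrix. The sketch below assumes the column is located by the position of "LUSC" in class_order; the class list and the probabilities are illustrative.

```python
import numpy as np

# Hypothetical class order and probability matrix (rows are samples).
class_order = ["LUAD", "ASC", "LUSC"]
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])

# Pick the column that corresponds to the LUSC class.
lusc_index = class_order.index("LUSC")
scores = proba[:, lusc_index]
```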

modelUtils

pymaftools.model.modelUtils.get_importance(model)

Extract feature importance from a fitted model.

Supports sklearn estimators with feature_importances_ and OmicsStackingModel instances.

Parameters: model (object) – A fitted model.
Return type: Series
Returns: pd.Series – Feature importances indexed by feature names.
Raises: ValueError – If the model type is not supported.

pymaftools.model.modelUtils.evaluate_model(model, X_test, y_test)

Evaluate a single model and return metric dictionary.

Parameters:

model (object) – A fitted model with predict and predict_proba methods.
X_test (pd.DataFrame) – Test features.
y_test (array-like) – True labels.

Return type:

dict[str, float]

Returns:

dict[str, float] – Dictionary with keys acc, f1, and auc.

pymaftools.model.modelUtils.cross_validate_importance(X, y, model_func, model_name, n_seeds=5, n_splits=5, random_state_base=0, verbose=True, evaluate_func=None)

Run repeated stratified cross-validation, collecting feature importances and metrics.

Parameters:

X (pd.DataFrame) – Feature matrix (samples as rows).
y (pd.Series) – Target labels.
model_func (callable) – Factory model_func(seed) -> model returning a fresh model instance.
model_name (str) – Name identifier for this model.
n_seeds (int, default 5) – Number of random seeds (repetitions).
n_splits (int, default 5) – Number of CV folds per seed.
random_state_base (int, default 0) – Base value added to each seed for reproducibility.
verbose (bool, default True) – Whether to display a progress bar.
evaluate_func (callable, optional) – Function (model, X_test, y_test) -> dict returning per-fold metrics.

Return type:

tuple[pd.DataFrame, pd.DataFrame | None]

Returns:

importance_df (pd.DataFrame) – Long-format feature importance table.
metric_df (pd.DataFrame or None) – Long-format metrics table (None if evaluate_func is not provided).
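
The repeated stratified cross-validation loop can be sketched as follows. This is an illustration with scikit-learn, not the function's source: the synthetic data, the model name "rf", and the way the seed is used for both the fold split and the model are assumptions. The output columns follow the long format described for to_importance_table.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 40 samples, 4 hypothetical features.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(40, 4)), columns=["g1", "g2", "g3", "g4"])
y = pd.Series([0, 1] * 20)


def model_func(seed):
    """Factory returning a fresh, unfitted model for a given seed."""
    return RandomForestClassifier(n_estimators=20, random_state=seed)


rows = []
for seed in range(2):  # n_seeds repetitions
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    for fold, (train_idx, _test_idx) in enumerate(cv.split(X, y)):
        model = model_func(seed).fit(X.iloc[train_idx], y.iloc[train_idx])
        # One long-format row per (seed, fold, feature).
        for feature, importance in zip(X.columns, model.feature_importances_):
            rows.append({
                "model": "rf",
                "seed": seed,
                "fold": fold,
                "feature": feature,
                "importance": importance,
            })

importance_df = pd.DataFrame(rows)
```

Using a factory instead of a model instance guarantees that every fold starts from an unfitted estimator.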

pymaftools.model.modelUtils.plot_metric_comparison_with_annotation(data, metrics=None, group_col='model', order=None, palette='Set2', test='Mann-Whitney', alpha=0.8, fontsize=14, figsize=None, title_prefix=None, save_path=None, **save_kwargs)

Plot metric comparison boxplots with statistical annotations.

Parameters:

data (pd.DataFrame) – DataFrame containing model metrics.
metrics (list[str], optional) – Metric column names to plot. Default ["acc", "f1", "auc"].
group_col (str, default "model") – Column used for grouping.
order (list[str], optional) – Display order of groups.
palette (str, default "Set2") – Seaborn color palette.
test (str, default "Mann-Whitney") – Statistical test for annotations.
alpha (float, default 0.8) – Box transparency.
fontsize (int, default 14) – Font size.
figsize (tuple, optional) – Figure size.
title_prefix (str, optional) – Title prefix (None disables titles).
save_path (str, optional) – Path to save the figure.
**save_kwargs – Additional arguments passed to save method.

Returns:

ModelPlot – The plotter instance.

pymaftools.model.modelUtils.to_importance_table(all_importance_df, omic)

Convert long-format importance data to a sorted PivotTable.

Parameters:

all_importance_df (pd.DataFrame) – Long-format importance DataFrame with columns model, seed, fold, feature, importance.
omic (str) – Omics name to filter by.

Return type:

PivotTable

Returns:

PivotTable – Feature x seed matrix sorted by mean importance (descending).
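
A pandas sketch of the long-to-wide conversion. It assumes that the omics name is matched against the model column and that folds are averaged within each seed; neither detail is confirmed here, and a plain DataFrame stands in for PivotTable.

```python
import pandas as pd

# Illustrative long-format importances for one model named "rf".
long_df = pd.DataFrame({
    "model": ["rf"] * 8,
    "seed": [0, 0, 0, 0, 1, 1, 1, 1],
    "fold": [0, 1, 0, 1, 0, 1, 0, 1],
    "feature": ["g1", "g1", "g2", "g2", "g1", "g1", "g2", "g2"],
    "importance": [0.2, 0.4, 0.8, 0.6, 0.3, 0.3, 0.7, 0.7],
})

# Filter to one model, then average folds into a feature x seed matrix.
subset = long_df[long_df["model"] == "rf"]
table = subset.pivot_table(
    index="feature", columns="seed", values="importance", aggfunc="mean"
)

# Sort rows by mean importance across seeds, highest first.
order = table.mean(axis=1).sort_values(ascending=False).index
table = table.loc[order]
```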

pymaftools.model.modelUtils.plot_top_feature_importance_heatmap(mean_importance_df, omic, top_n=20, cmap='viridis', figsize=(10, 6), title=None, save_path=None, **save_kwargs)

Plot heatmap of top-N most important features.

Parameters:

mean_importance_df (pd.DataFrame) – Feature importance data.
omic (str) – Omics name identifier.
top_n (int, default 20) – Number of top features to display.
cmap (str, default "viridis") – Colormap for the heatmap.
figsize (tuple, default (10, 6)) – Figure size.
title (str, optional) – Plot title (None disables title).
save_path (str, optional) – Path to save the figure.
**save_kwargs – Additional arguments passed to save method.

Returns:

ModelPlot – The plotter instance.

pymaftools.model.modelUtils.run_rfecv_feature_selection(pivot, label_col='subtype', estimator=None, step=10, scoring='accuracy', min_features_to_select=10, plot=True, random_state=42, title=None, save_path=None, **save_kwargs)

Run RFECV feature selection on a PivotTable.

Parameters:

pivot (PivotTable) – Feature x sample table.
label_col (str, default "subtype") – Column in sample_metadata containing target labels.
estimator (sklearn estimator, optional) – Model to use (default: RandomForestClassifier).
step (int, default 10) – Number of features removed per iteration.
scoring (str, default "accuracy") – Scoring metric (e.g. "accuracy", "f1_macro").
min_features_to_select (int, default 10) – Minimum number of features to keep.
plot (bool, default True) – Whether to plot the performance curve.
random_state (int, default 42) – Random seed.
title (str, optional) – Plot title (None disables title).
save_path (str, optional) – Path to save the figure.
**save_kwargs – Additional arguments passed to save method.

Return type:

tuple[list[str], RFECV]

Returns:

selected_features (list[str]) – Selected feature names.
selector (RFECV) – Fitted RFECV object.
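
The selection step can be sketched with scikit-learn's RFECV. A plain DataFrame stands in for the PivotTable, and the data, feature names, fold count, and small parameter values are assumptions chosen to keep the example fast. The feature x sample table is transposed because scikit-learn expects samples as rows.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic feature x sample table with one informative feature.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 12
labels = pd.Series(["A", "B"] * (n_samples // 2), name="subtype")
feature_by_sample = pd.DataFrame(
    rng.normal(size=(n_features, n_samples)),
    index=[f"gene_{i}" for i in range(n_features)],
    columns=[f"s{i}" for i in range(n_samples)],
)
feature_by_sample.loc["gene_0"] += np.where(labels.to_numpy() == "A", 2.0, 0.0)

X = feature_by_sample.T  # samples as rows for scikit-learn

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=42),
    step=2,                    # features removed per iteration
    cv=StratifiedKFold(n_splits=3),
    scoring="accuracy",
    min_features_to_select=2,
)
selector.fit(X, labels.to_numpy())

# support_ is a boolean mask over the columns of X.
selected_features = X.columns[selector.support_].tolist()
```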

pymaftools.model.modelUtils.run_model_evaluation(model_configs, y, n_seeds=100, n_splits=5, evaluate_func=None, verbose=True)

Run cross-validation and importance analysis for multiple models.

Parameters:

model_configs (list[dict]) – Each dict must have keys "name" (str), "model_func" (callable), and "X" (pd.DataFrame).
y (pd.Series) – Target labels.
n_seeds (int, default 100) – Number of random seeds.
n_splits (int, default 5) – Number of CV folds.
evaluate_func (callable, optional) – Evaluation function (model, X_test, y_test) -> dict.
verbose (bool, default True) – Whether to print progress.

Return type:

tuple[dict, pd.DataFrame, pd.DataFrame]

Returns:

result_dict (dict) – Per-model results with "importance" and "metrics" keys.
all_importance_df (pd.DataFrame) – Combined long-format feature importance data.
all_metrics_df (pd.DataFrame) – Combined long-format classification metrics.