Model Modules

StackingModel

class pymaftools.model.StackingModel.OmicsStackingModel(omics_dict, class_order, base_model=RandomForestClassifier, final_model=LogisticRegression, random_state=42)[source]

Bases: object

Multi-omics stacking classifier.

Builds a StackingClassifier where each base estimator operates on a single omics layer, and a final meta-learner combines their predictions.

Parameters:
  • omics_dict (dict[str, PivotTable]) – Mapping of omics names to PivotTable objects (features as index).

  • class_order (list[str]) – Ordered class labels used for encoding/decoding.

  • base_model (type, default RandomForestClassifier) – Class of the base estimator (instantiated per omics layer).

  • final_model (type, default LogisticRegression) – Class of the final meta-learner.

  • random_state (int, default 42) – Random seed for reproducibility.
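The per-layer stacking idea described above can be sketched with plain scikit-learn. This is a minimal illustration and not the library's actual implementation: the layer names, column names, synthetic data, and the use of ColumnTransformer to route columns to each base estimator are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples = 60
y = np.array([0, 1] * (n_samples // 2))

# Hypothetical omics layers and their feature columns (illustrative names).
layer_columns = {
    "mutation": [f"mut_{i}" for i in range(5)],
    "expression": [f"expr_{i}" for i in range(5)],
}
all_columns = [c for cols in layer_columns.values() for c in cols]

# Samples as rows, all omics features as columns.
X = pd.DataFrame(rng.normal(size=(n_samples, len(all_columns))), columns=all_columns)
X["mut_0"] += y   # inject a little signal into each layer
X["expr_0"] += y

# One base estimator per layer; each pipeline sees only its own columns.
estimators = []
for name, cols in layer_columns.items():
    select_layer = ColumnTransformer([("select", "passthrough", cols)])
    base = RandomForestClassifier(n_estimators=25, random_state=42)
    estimators.append((name, make_pipeline(select_layer, base)))

# The meta-learner combines the per-layer predicted probabilities.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X, y)
proba = stack.predict_proba(X)
```

Passing estimator classes rather than instances, as the base_model and final_model parameters do, lets a fresh estimator be created for every layer.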

build_model()[source]

Build the stacking classifier from omics_dict.

Return type:

None

encode_y(y)[source]

Encode labels to integer indices using class_order.

Return type:

ndarray

decode_y(y_encoded)[source]

Decode integer indices back to original labels.

Return type:

ndarray
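As a rough illustration of what encoding and decoding against class_order amounts to, the sketch below maps labels to their positions in the list and back. The helper names and the three example labels are hypothetical; the library's own methods may handle edge cases differently.

```python
import numpy as np

# Hypothetical class order; a label's position defines its integer code.
class_order = ["LUAD", "ASC", "LUSC"]
label_to_index = {label: i for i, label in enumerate(class_order)}


def encode_labels(y):
    """Map each label to its index in class_order."""
    return np.array([label_to_index[label] for label in y])


def decode_labels(y_encoded):
    """Map integer indices back to the original labels."""
    return np.array(class_order)[np.asarray(y_encoded)]


encoded = encode_labels(["LUSC", "LUAD", "ASC", "LUSC"])
decoded = decode_labels(encoded)
```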

fit(X, y)[source]

Fit the stacking model.

Parameters:
  • X (pd.DataFrame) – Training data (samples as rows, all omics features as columns).

  • y (array-like) – Target labels.

Return type:

None

predict(X)[source]

Predict class labels.

Parameters:

X (pd.DataFrame) – Input data.

Return type:

ndarray

Returns:

np.ndarray – Decoded class labels.

predict_proba(X)[source]

Predict class probabilities.

Parameters:

X (pd.DataFrame) – Input data.

Return type:

ndarray

Returns:

np.ndarray – Probability matrix of shape (n_samples, n_classes).

get_omics_feature_importance(omics_key)[source]

Get feature importances for a specific omics layer.

Parameters:

omics_key (str) – Key in omics_dict identifying the omics layer.

Return type:

Series

Returns:

pd.Series – Feature importances indexed by feature names.

get_omics_weights()[source]

Return the weights of each omics layer in the final meta-learner.

Return type:

DataFrame

Returns:

pd.DataFrame – Weights with omics as rows. Includes abs_mean and abs_ratio columns for interpretability.

Raises:

ValueError – If the model has not been fitted or the final estimator does not expose coef_.
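The exact definitions of abs_mean and abs_ratio are not spelled out here. One plausible reading, shown below purely as an assumption, is that abs_mean is the mean absolute meta-learner coefficient over the columns contributed by one omics layer, and abs_ratio is that value divided by the sum over all layers. The coefficient matrix and layer names are made up for the example.

```python
import numpy as np
import pandas as pd

omics_names = ["mutation", "expression"]  # hypothetical layer names
n_classes = 3

# Made-up coef_ of a multi-class meta-learner:
# shape (n_classes, n_omics * n_classes), one block of columns per layer.
coef = np.array([
    [ 1.0, -0.5, -0.5,  2.0, -1.0, -1.0],
    [-0.5,  1.0, -0.5, -1.0,  2.0, -1.0],
    [-0.5, -0.5,  1.0, -1.0, -1.0,  2.0],
])

abs_mean = {}
for i, name in enumerate(omics_names):
    block = coef[:, i * n_classes:(i + 1) * n_classes]  # this layer's columns
    abs_mean[name] = np.abs(block).mean()

weights = pd.DataFrame({"abs_mean": pd.Series(abs_mean)})
# Assumed definition: each layer's share of the total absolute weight.
weights["abs_ratio"] = weights["abs_mean"] / weights["abs_mean"].sum()
```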

plot_final_coefficients()[source]

Plot the final meta-learner coefficients as a heatmap.

Return type:

None

confusion_matrix(y_true, y_pred, title=None)[source]

Plot a confusion matrix heatmap.

Parameters:
  • y_true (array-like) – True labels.

  • y_pred (array-like) – Predicted labels.

  • title (str, optional) – Plot title.

Return type:

None

evaluate(X, y_true, average='macro', show=True)[source]

Evaluate classification performance.

Parameters:
  • X (pd.DataFrame) – Input data.

  • y_true (array-like) – True labels.

  • average (str, default "macro") – Averaging strategy for multi-class metrics.

  • show (bool, default True) – Whether to print the metrics.

Return type:

dict[str, float | None]

Returns:

dict[str, float | None] – Dictionary with keys accuracy, f1, precision, recall, and roc_auc.
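A dictionary with these keys can be assembled from scikit-learn's metric functions. The sketch below is an approximation and not the method's actual code: the helper name is hypothetical, and setting roc_auc to None when probabilities are missing or AUC cannot be computed is an assumption about why the value type allows None.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def classification_metrics(y_true, y_pred, proba=None, average="macro"):
    """Return accuracy, f1, precision, recall, and roc_auc (or None)."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average=average),
        "precision": precision_score(y_true, y_pred, average=average),
        "recall": recall_score(y_true, y_pred, average=average),
        "roc_auc": None,
    }
    if proba is not None:
        try:
            # One-vs-rest AUC over the class probability matrix.
            metrics["roc_auc"] = roc_auc_score(
                y_true, proba, multi_class="ovr", average=average
            )
        except ValueError:
            pass  # e.g. a class is missing from y_true
    return metrics


y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
])
metrics = classification_metrics(y_true, y_pred, proba)
metrics_without_proba = classification_metrics(y_true, y_pred)
```

With average="macro", each class contributes equally to f1, precision, and recall regardless of how many samples it has.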

class pymaftools.model.StackingModel.ASCStackingModel(omics_dict, class_order, random_state=42)[source]

Bases: OmicsStackingModel

Stacking model pre-configured for ASC (adenosquamous carcinoma) analysis.

Parameters:
  • omics_dict (dict[str, PivotTable]) – Mapping of omics names to PivotTable objects.

  • class_order (list[str]) – Ordered class labels.

  • random_state (int, default 42) – Random seed.

soft_score(X)[source]

Compute the LUSC probability score for each sample.

Parameters:

X (pd.DataFrame) – Input data.

Return type:

ndarray

Returns:

np.ndarray – LUSC class probability for each sample.
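Conceptually, a per-sample LUSC score is one column of the probability matrix returned by predict_proba. The sketch below assumes the columns follow class_order and uses a made-up probability matrix and class list; it is not taken from the library's source.

```python
import numpy as np

class_order = ["LUAD", "ASC", "LUSC"]  # hypothetical ordering

# Made-up predict_proba output: one row per sample, one column per class.
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])

# Assumed: probability columns are ordered like class_order.
lusc_index = class_order.index("LUSC")
soft_scores = proba[:, lusc_index]
```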

modelUtils

pymaftools.model.modelUtils.get_importance(model)[source]

Extract feature importance from a fitted model.

Supports sklearn estimators with feature_importances_ and OmicsStackingModel instances.

Parameters:

model (object) – A fitted model.

Return type:

Series

Returns:

pd.Series – Feature importances indexed by feature names.

Raises:

ValueError – If the model type is not supported.

pymaftools.model.modelUtils.evaluate_model(model, X_test, y_test)[source]

Evaluate a single model and return a metric dictionary.

Parameters:
  • model (object) – A fitted model with predict and predict_proba methods.

  • X_test (pd.DataFrame) – Test features.

  • y_test (array-like) – True labels.

Return type:

dict[str, float]

Returns:

dict[str, float] – Dictionary with keys acc, f1, and auc.

pymaftools.model.modelUtils.cross_validate_importance(X, y, model_func, model_name, n_seeds=5, n_splits=5, random_state_base=0, verbose=True, evaluate_func=None)[source]

Run repeated stratified cross-validation, collecting feature importances and metrics.

Parameters:
  • X (pd.DataFrame) – Feature matrix (samples as rows).

  • y (pd.Series) – Target labels.

  • model_func (callable) – Factory model_func(seed) -> model returning a fresh model instance.

  • model_name (str) – Name identifier for this model.

  • n_seeds (int, default 5) – Number of random seeds (repetitions).

  • n_splits (int, default 5) – Number of CV folds per seed.

  • random_state_base (int, default 0) – Base value added to each seed for reproducibility.

  • verbose (bool, default True) – Whether to display a progress bar.

  • evaluate_func (callable, optional) – Function (model, X_test, y_test) -> dict returning per-fold metrics.

Return type:

tuple[pd.DataFrame, pd.DataFrame | None]

Returns:

  • importance_df (pd.DataFrame) – Long-format feature importance table.

  • metric_df (pd.DataFrame or None) – Long-format metrics table (None if evaluate_func is not provided).
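The sketch below illustrates the two callable contracts (model_func and evaluate_func) and the general shape of a repeated stratified cross-validation loop that collects long-format rows. It is a simplified stand-in written for this page, not the function's source; the synthetic data, the "rf" model name, and the metric table's column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold


def model_func(seed):
    """Factory: return a fresh, unfitted model for the given seed."""
    return RandomForestClassifier(n_estimators=20, random_state=seed)


def evaluate_func(model, X_test, y_test):
    """Return per-fold metrics as a dict."""
    y_pred = model.predict(X_test)
    return {
        "acc": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, average="macro"),
    }


rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 4)), columns=[f"feat_{i}" for i in range(4)])
y = pd.Series(np.tile([0, 1], 30), name="label")
X["feat_0"] += y  # make one feature informative

n_seeds, n_splits, random_state_base = 2, 3, 0
importance_rows, metric_rows = [], []
for seed in range(n_seeds):
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                         random_state=random_state_base + seed)
    for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        model = model_func(random_state_base + seed)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        for feature, importance in zip(X.columns, model.feature_importances_):
            importance_rows.append({"model": "rf", "seed": seed, "fold": fold,
                                    "feature": feature, "importance": importance})
        fold_metrics = evaluate_func(model, X.iloc[test_idx], y.iloc[test_idx])
        for metric, value in fold_metrics.items():
            metric_rows.append({"model": "rf", "seed": seed, "fold": fold,
                                "metric": metric, "value": value})

importance_df = pd.DataFrame(importance_rows)
metric_df = pd.DataFrame(metric_rows)
```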

pymaftools.model.modelUtils.plot_metric_comparison_with_annotation(data, metrics=None, group_col='model', order=None, palette='Set2', test='Mann-Whitney', alpha=0.8, fontsize=14, figsize=None, title_prefix=None, save_path=None, **save_kwargs)[source]

Plot metric comparison boxplots with statistical annotations.

Parameters:
  • data (pd.DataFrame) – DataFrame containing model metrics.

  • metrics (list[str], optional) – Metric column names to plot. Default ["acc", "f1", "auc"].

  • group_col (str, default "model") – Column used for grouping.

  • order (list[str], optional) – Display order of groups.

  • palette (str, default "Set2") – Seaborn color palette.

  • test (str, default "Mann-Whitney") – Statistical test for annotations.

  • alpha (float, default 0.8) – Box transparency.

  • fontsize (int, default 14) – Font size.

  • figsize (tuple, optional) – Figure size.

  • title_prefix (str, optional) – Title prefix (None disables titles).

  • save_path (str, optional) – Path to save the figure.

  • **save_kwargs – Additional arguments passed to the save method.

Returns:

ModelPlot – The plotter instance.
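To show what the default "Mann-Whitney" annotation compares, the sketch below runs SciPy's two-sided Mann-Whitney U test on the accuracy values of two groups. The metric values and model names are made up, and whether the plotting function calls SciPy directly or goes through an annotation library is not stated here.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Made-up per-fold accuracies for two hypothetical models.
metrics_df = pd.DataFrame({
    "model": ["rf"] * 5 + ["stacking"] * 5,
    "acc": [0.70, 0.72, 0.68, 0.71, 0.69,
            0.80, 0.82, 0.79, 0.81, 0.83],
})

group_a = metrics_df.loc[metrics_df["model"] == "rf", "acc"]
group_b = metrics_df.loc[metrics_df["model"] == "stacking", "acc"]

# Two-sided test of whether the two distributions differ.
statistic, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
```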

pymaftools.model.modelUtils.to_importance_table(all_importance_df, omic)[source]

Convert long-format importance data to a sorted PivotTable.

Parameters:
  • all_importance_df (pd.DataFrame) – Long-format importance DataFrame with columns model, seed, fold, feature, importance.

  • omic (str) – Omics name to filter by.

Return type:

PivotTable

Returns:

PivotTable – Feature × seed matrix sorted by mean importance (descending).
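The reshaping step can be approximated with a pandas pivot, as sketched below. The result here is a plain DataFrame, not a PivotTable, and two details are assumptions: that the model column holds the omics name used for filtering, and that folds are averaged within each seed. The importance values are made up.

```python
import pandas as pd

# Made-up long-format importances: 2 omics x 2 features x 2 seeds x 2 folds.
long_df = pd.DataFrame({
    "model": ["mutation"] * 8 + ["expression"] * 2,
    "seed": [0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    "fold": [0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
    "feature": ["TP53"] * 4 + ["KRAS"] * 4 + ["EGFR"] * 2,
    "importance": [0.6, 0.4, 0.7, 0.5, 0.4, 0.6, 0.3, 0.5, 0.9, 0.8],
})

omic = "mutation"
subset = long_df[long_df["model"] == omic]

# Feature x seed matrix; folds within a seed are averaged (assumption).
table = subset.pivot_table(index="feature", columns="seed",
                           values="importance", aggfunc="mean")

# Sort rows by mean importance across seeds, highest first.
row_order = table.mean(axis=1).sort_values(ascending=False).index
table = table.loc[row_order]
```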

pymaftools.model.modelUtils.plot_top_feature_importance_heatmap(mean_importance_df, omic, top_n=20, cmap='viridis', figsize=(10, 6), title=None, save_path=None, **save_kwargs)[source]

Plot a heatmap of the top-N most important features.

Parameters:
  • mean_importance_df (pd.DataFrame) – Feature importance data.

  • omic (str) – Omics name identifier.

  • top_n (int, default 20) – Number of top features to display.

  • cmap (str, default "viridis") – Colormap for the heatmap.

  • figsize (tuple, default (10, 6)) – Figure size.

  • title (str, optional) – Plot title (None disables title).

  • save_path (str, optional) – Path to save the figure.

  • **save_kwargs – Additional arguments passed to the save method.

Returns:

ModelPlot – The plotter instance.

pymaftools.model.modelUtils.run_rfecv_feature_selection(pivot, label_col='subtype', estimator=None, step=10, scoring='accuracy', min_features_to_select=10, plot=True, random_state=42, title=None, save_path=None, **save_kwargs)[source]

Run RFECV feature selection on a PivotTable.

Parameters:
  • pivot (PivotTable) – Feature × sample table.

  • label_col (str, default "subtype") – Column in sample_metadata containing target labels.

  • estimator (sklearn estimator, optional) – Model to use (default: RandomForestClassifier).

  • step (int, default 10) – Number of features removed per iteration.

  • scoring (str, default "accuracy") – Scoring metric (e.g. "accuracy", "f1_macro").

  • min_features_to_select (int, default 10) – Minimum number of features to keep.

  • plot (bool, default True) – Whether to plot the performance curve.

  • random_state (int, default 42) – Random seed.

  • title (str, optional) – Plot title (None disables title).

  • save_path (str, optional) – Path to save the figure.

  • **save_kwargs – Additional arguments passed to the save method.

Return type:

tuple[list[str], RFECV]

Returns:

  • selected_features (list[str]) – Selected feature names.

  • selector (RFECV) – Fitted RFECV object.
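The parameters above map closely onto scikit-learn's RFECV, which the sketch below calls directly on synthetic data. It assumes the feature × sample table has already been transposed to samples as rows; the cross-validation splitter, the data, and the feature names are choices made for the example, not details of this function.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a transposed table: samples as rows.
X_arr, y = make_classification(n_samples=80, n_features=30,
                               n_informative=5, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(30)])

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=42),
    step=10,                    # features removed per iteration
    scoring="accuracy",
    min_features_to_select=10,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
selector.fit(X, y)

# support_ is a boolean mask over the input columns.
selected_features = X.columns[selector.support_].tolist()
```

A larger step speeds up elimination on wide omics tables at the cost of a coarser search over the number of features.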

pymaftools.model.modelUtils.run_model_evaluation(model_configs, y, n_seeds=100, n_splits=5, evaluate_func=None, verbose=True)[source]

Run cross-validation and importance analysis for multiple models.

Parameters:
  • model_configs (list[dict]) – Each dict must have keys "name" (str), "model_func" (callable), and "X" (pd.DataFrame).

  • y (pd.Series) – Target labels.

  • n_seeds (int, default 100) – Number of random seeds.

  • n_splits (int, default 5) – Number of CV folds.

  • evaluate_func (callable, optional) – Evaluation function (model, X_test, y_test) -> dict.

  • verbose (bool, default True) – Whether to print progress.

Return type:

tuple[dict, pd.DataFrame, pd.DataFrame]

Returns:

  • result_dict (dict) – Per-model results with "importance" and "metrics" keys.

  • all_importance_df (pd.DataFrame) – Combined long-format feature importance data.

  • all_metrics_df (pd.DataFrame) – Combined long-format classification metrics.
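The model_configs argument might be assembled as in the sketch below, which only builds and sanity-checks the list; the call itself is left commented out. The config names, feature tables, and labels are hypothetical, and tree-based models are used because importance extraction is described as relying on feature_importances_.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
samples = [f"sample_{i}" for i in range(20)]

# Hypothetical per-model feature matrices (samples as rows).
X_mutation = pd.DataFrame(rng.integers(0, 2, size=(20, 3)), index=samples,
                          columns=["TP53", "KRAS", "EGFR"])
X_expression = pd.DataFrame(rng.normal(size=(20, 3)), index=samples,
                            columns=["gene_a", "gene_b", "gene_c"])
y = pd.Series(["LUAD", "LUSC"] * 10, index=samples)

# Each config needs "name", "model_func", and "X".
model_configs = [
    {
        "name": "mutation_rf",
        "model_func": lambda seed: RandomForestClassifier(random_state=seed),
        "X": X_mutation,
    },
    {
        "name": "expression_rf",
        "model_func": lambda seed: RandomForestClassifier(random_state=seed),
        "X": X_expression,
    },
]

# Sanity checks before a long run: required keys, aligned samples.
for config in model_configs:
    assert {"name", "model_func", "X"} <= set(config)
    assert config["X"].index.equals(y.index)

# result_dict, all_importance_df, all_metrics_df = run_model_evaluation(
#     model_configs, y, n_seeds=5, n_splits=5
# )
```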