Utility Modules

geneset

pymaftools.utils.geneset.read_GMT(filepath)[source]

Read a GMT (Gene Matrix Transposed) file into a DataFrame.

Parameters:

filepath (str) – Path to the GMT file.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame indexed by pathway name with columns Link and Genes.
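For orientation, each line of a GMT file holds a gene set name, a description or link, and then the member genes, all separated by tabs. The sketch below parses that layout with the standard library only. It is not the pymaftools implementation: parse_gmt, the sample set names, and the example.org links are hypothetical, and storing Genes as a list is an assumption about how the column is represented.

```python
import io


def parse_gmt(handle):
    """Parse GMT text into {pathway_name: {"Link": str, "Genes": list[str]}}."""
    pathways = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"Malformed GMT line: {line!r}")
        # Field 1 is the set name, field 2 the description/link, the rest are genes.
        name, link, *genes = fields
        # Drop empty trailing fields that some exporters leave behind.
        pathways[name] = {"Link": link, "Genes": [g for g in genes if g]}
    return pathways


# Hypothetical two-line GMT content; real files come from MSigDB or similar sources.
sample = io.StringIO(
    "HYPOTHETICAL_SET_A\thttps://example.org/a\tTP53\tBRCA1\tEGFR\n"
    "HYPOTHETICAL_SET_B\thttps://example.org/b\tKRAS\tMYC\n"
)
gene_sets = parse_gmt(sample)
print(gene_sets["HYPOTHETICAL_SET_A"]["Genes"])
```

The resulting mapping mirrors the documented shape: one entry per pathway name, each carrying a Link and its Genes.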

pymaftools.utils.geneset.fetch_msigdb_geneset(geneset_name, species='human')[source]

Fetch a gene set from MSigDB by scraping its HTML page.

Parameters:
  • geneset_name (str) – Name of the gene set on MSigDB (e.g. "HALLMARK_APOPTOSIS").

  • species (str, default "human") – Species identifier used in the MSigDB URL.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with columns source_id, entrez_id, gene_symbol, and description.

reduction

class pymaftools.utils.reduction.PCA_CCA(n_pca_components=20, n_cca_components=1, random_state=42)[source]

Bases: object

Perform PCA on SNV and CNV tables separately, followed by Canonical Correlation Analysis (CCA).

This class provides utilities to project two omics data tables (SNV and CNV) into a shared latent space using PCA for dimensionality reduction and CCA for capturing cross-omics correlations. It also allows mapping the canonical weights back to the original feature space for interpretation.

Parameters:
  • n_pca_components (int, default=20) – Number of PCA components to retain for each omics table.

  • n_cca_components (int, default=1) – Number of canonical components for CCA.

  • random_state (int, default=42) – Random state for reproducibility.

fit_transform(snv_table, cnv_table)[source]

Fit PCA on SNV and CNV tables, then fit CCA on the reduced embeddings.

Parameters:
  • snv_table (pd.DataFrame) – SNV data table with features as rows and samples as columns.

  • cnv_table (pd.DataFrame) – CNV data table with features as rows and samples as columns.

Returns:

  • cca_snv (ndarray of shape (n_samples, n_cca_components)) – Canonical variates for SNV data.

  • cca_cnv (ndarray of shape (n_samples, n_cca_components)) – Canonical variates for CNV data.

transform(snv_table, cnv_table)[source]

Project new SNV and CNV data into the canonical space using fitted PCA and CCA models.

Parameters:
  • snv_table (pd.DataFrame) – SNV data table with features as rows and samples as columns.

  • cnv_table (pd.DataFrame) – CNV data table with features as rows and samples as columns.

Returns:

  • cca_snv (ndarray of shape (n_samples, n_cca_components)) – Canonical variates for SNV data.

  • cca_cnv (ndarray of shape (n_samples, n_cca_components)) – Canonical variates for CNV data.

get_weights()[source]

Retrieve the weight of each original feature in each canonical component.

This method back-projects the CCA weights from the PCA-reduced space into the original feature space, enabling interpretation of which features contribute most to the canonical correlation.

Returns:

  • df_snv (pd.DataFrame) – DataFrame of SNV feature weights in canonical components.

  • df_cnv (pd.DataFrame) – DataFrame of CNV feature weights in canonical components.

Raises:

ValueError – If the model has not been fitted yet.
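The back-projection described above can be written as a short derivation. This is a sketch under stated assumptions, not a description of the library's internals: the symbols are introduced here for illustration, the PCA step is taken to be a linear projection of mean-centred data, and any centring or scaling that CCA applies to the PCA scores is ignored. Let T be one input table with p features as rows and n samples as columns, μ the vector of per-feature means, V the PCA loadings for k components, and A the CCA weights for c canonical components.

```latex
\begin{aligned}
X &= T^{\top} - \mathbf{1}\mu^{\top}, & X &\in \mathbb{R}^{n \times p} && \text{(samples as rows, centred)} \\
Z &= X V, & V &\in \mathbb{R}^{p \times k} && \text{(PCA scores)} \\
U &= Z A = X\,(V A), & A &\in \mathbb{R}^{k \times c} && \text{(canonical variates)} \\
W &= V A, & W &\in \mathbb{R}^{p \times c} && \text{(weights in the original feature space)}
\end{aligned}
```

Under these assumptions, each row of W gives one original feature's weight in each canonical component, which is the shape documented for df_snv and df_cnv.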

geneinfo

pymaftools.utils.geneinfo.get_ncbi_gene_ID(gene_symbol)[source]

Query the NCBI Entrez API for a gene symbol and return its Gene ID.

Parameters:

gene_symbol (str) – The gene symbol to query (e.g. "TP53").

Return type:

str | None

Returns:

str or None – The first matching Gene ID, or None if no result is found.
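To illustrate the kind of lookup involved, the sketch below builds an NCBI E-utilities ESearch URL for the gene database and extracts the first ID from a JSON response. It assumes a particular query term (gene name plus an organism filter) and is not taken from pymaftools: build_esearch_url and first_gene_id are hypothetical helpers, and the payloads are hand-written samples shaped like ESearch output so that the example runs without network access.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(gene_symbol, organism="Homo sapiens"):
    """Build an ESearch URL; the search term format is an assumption."""
    term = f"{gene_symbol}[Gene Name] AND {organism}[Organism]"
    query = urlencode({"db": "gene", "term": term, "retmode": "json"})
    return f"{ESEARCH_URL}?{query}"


def first_gene_id(payload):
    """Return the first ID in an ESearch JSON payload, or None if there is none."""
    id_list = payload.get("esearchresult", {}).get("idlist", [])
    return id_list[0] if id_list else None


# Hand-written sample payloads (not live responses).
hit = json.loads('{"esearchresult": {"count": "1", "idlist": ["7157"]}}')
miss = json.loads('{"esearchresult": {"count": "0", "idlist": []}}')

url = build_esearch_url("TP53")
print(url)
print(first_gene_id(hit), first_gene_id(miss))
```

Returning None for an empty idlist matches the documented behaviour of returning None when no result is found.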

pymaftools.utils.geneinfo.get_ncbi_gene_IDs(gene_symbols)[source]

Batch-query gene symbols and return their corresponding Gene IDs.

Parameters:

gene_symbols (list[str]) – List of gene symbols to query.

Return type:

dict[str, str | None]

Returns:

dict[str, str or None] – Mapping from gene symbol to Gene ID (or None).

pymaftools.utils.geneinfo.get_gene_info_json(gene_ids)[source]

Retrieve detailed gene information from NCBI for multiple Gene IDs.

Parameters:

gene_ids (dict[str, str or None]) – Mapping from gene symbol to Gene ID.

Return type:

dict[str, dict | None]

Returns:

dict[str, dict or None] – Mapping from gene symbol to its detailed information dictionary, or None for symbols whose ID was missing or not found.

pymaftools.utils.geneinfo.parse_gene_info(gene_info)[source]

Extract summary descriptions from detailed gene information.

Parameters:

gene_info (dict[str, dict or None]) – Mapping from gene symbol to its detailed information dictionary (or None).

Return type:

dict[str, str | None]

Returns:

dict[str, str or None] – Mapping from gene symbol to its summary string (or None).

pymaftools.utils.geneinfo.get_gene_description_df(gene_symbols)[source]

Look up gene descriptions from NCBI for a list of gene symbols.

Combines get_ncbi_gene_IDs, get_gene_info_json, and parse_gene_info into a single convenience function.

Parameters:

gene_symbols (list[str]) – List of gene symbols to query.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with columns Gene and Description.
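The last two steps of that chain, extracting summaries and arranging them as Gene/Description rows, can be sketched without network access. The example is illustrative only: extract_summaries and to_rows are hypothetical helpers, the "summary" key is an assumption about the structure of the information dictionary, and the gene names and text are made up.

```python
def extract_summaries(gene_info, key="summary"):
    """Map each gene symbol to its summary string, or None if unavailable."""
    summaries = {}
    for symbol, info in gene_info.items():
        # A missing record (None) and a record without the key both yield None.
        summaries[symbol] = info.get(key) if info else None
    return summaries


def to_rows(summaries):
    """Arrange summaries as records with Gene and Description fields."""
    return [{"Gene": symbol, "Description": text} for symbol, text in summaries.items()]


# Hypothetical input: one complete record, one failed lookup, one record lacking a summary.
gene_info = {
    "GENE_A": {"summary": "Hypothetical summary text for GENE_A."},
    "GENE_B": None,
    "GENE_C": {"name": "GENE_C"},
}

summaries = extract_summaries(gene_info)
rows = to_rows(summaries)
for row in rows:
    print(row)
```

A list of records like rows can be passed directly to pd.DataFrame to obtain a table with Gene and Description columns.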