Core Modules

MAF

class pymaftools.core.MAF.MAF(*args, **kwargs)[source]

Bases: DataFrame

A pandas DataFrame subclass for Mutation Annotation Format (MAF) files.

Provides methods to read, filter, merge, and convert MAF data commonly used in cancer genomics pipelines.

index_col

Default columns used to build the row index.

Type:: list[str]

vaild_variant_classfication

All recognised variant classification labels.

Type:: list[str]

nonsynonymous_types

Variant classifications considered nonsynonymous.

Type:: list[str]

index_col = ['Hugo_Symbol', 'Start_Position', 'End_Position', 'Reference_Allele', 'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2']

vaild_variant_classfication = ['Frame_Shift_Del', 'Frame_Shift_Ins', 'In_Frame_Del', 'In_Frame_Ins', 'Missense_Mutation', 'Nonsense_Mutation', 'Silent', 'Splice_Site', 'Translation_Start_Site', 'Nonstop_Mutation', "3'UTR", "3'Flank", "5'UTR", "5'Flank", 'IGR', 'Intron', 'RNA', 'Targeted_Region']

nonsynonymous_types = ['Frame_Shift_Del', 'Frame_Shift_Ins', 'In_Frame_Del', 'In_Frame_Ins', 'Missense_Mutation', 'Nonsense_Mutation', 'Splice_Site', 'Translation_Start_Site', 'Nonstop_Mutation']

classmethod read_maf(maf_path, sample_ID, preffix='', suffix='')[source]

Read a MAF file and return a MAF object.

Parameters:

maf_path (str or os.PathLike) – Path to the MAF file (first row is skipped as a comment line).
sample_ID (str) – Sample identifier to assign to all rows.
preffix (str, default "") – Prefix prepended to the sample ID.
suffix (str, default "") – Suffix appended to the sample ID.

Return type:

MAF

Returns:

MAF – A MAF DataFrame with a composite index built from index_col.

filter_maf(mutation_types)[source]

Filter rows by variant classification.

Parameters:: mutation_types (list[str]) – Variant classification values to keep.
Return type:: MAF
Returns:: MAF – Filtered MAF containing only the specified mutation types.

static merge_mutations(column)[source]

Merge multiple mutations for a single gene–sample pair.

If all values are False the result is False. When more than one non-false mutation exists the result is "Multi_Hit", following the maftools convention (see maftools issue #347).

Parameters:: column (pd.Series) – Series of variant classifications (or False) for one gene–sample combination.
Return type:: str | bool
Returns:: str or bool – False if no mutation, a single classification string, or "Multi_Hit" when multiple mutations are present.

to_pivot_table()[source]

Create a gene-by-sample pivot table of variant classifications.

Return type:: PivotTable
Returns:: PivotTable – Pivot table with genes as rows, samples as columns, and variant classifications (or "Multi_Hit" / False) as values.

to_mutation_table()[source]

Create a mutation-level pivot table.

Each row corresponds to a unique mutation (composite index) rather than a gene, providing finer resolution than to_pivot_table().

Return type:: PivotTable
Returns:: PivotTable – Pivot table indexed by individual mutations.

change_index_level(index_col=None)[source]

Rebuild the row index from the specified columns.

Parameters:: index_col (list[str] or None, default None) – Columns to concatenate into the index. When None the class default index_col is used.
Return type:: MAF
Returns:: MAF – A copy of this MAF with the new composite index.

property mutations_count: Series

Count the number of mutations per sample.

Returns:: pd.Series – Series indexed by sample ID with mutation counts as values.

sort_by_chrom()[source]

Sort rows by genomic coordinates.

Return type:: MAF
Returns:: MAF – MAF sorted by Chromosome, Start_Position, and End_Position.

static merge_mafs(mafs)[source]

Concatenate multiple MAF objects into one.

Parameters:: mafs (list[MAF]) – MAF objects to concatenate.
Return type:: MAF
Returns:: MAF – A single MAF containing all rows from the input MAFs.

classmethod read_csv(csv_path, sep='\\t', reindex=False)[source]

Read a CSV/TSV file into a MAF object.

Parameters:

csv_path (str or os.PathLike) – Path to the CSV or TSV file.
sep (str, default "t") – Column delimiter.
reindex (bool, default False) – If True, rebuild the composite index from index_col after reading. Otherwise the first column is used as the index.

Return type:

MAF

Returns:

MAF – MAF constructed from the file contents.

to_csv(csv_path, **kwargs)[source]

Write the MAF to a CSV/TSV file.

Default behaviour writes a tab-separated file with the index included.

Parameters:

csv_path (str or os.PathLike) – Destination file path.
**kwargs (Any) – Additional keyword arguments forwarded to pd.DataFrame.to_csv.

Return type:

None

to_MAF(maf_path, **kwargs)[source]

Write the data as a standard MAF file (no index column).

Parameters:

maf_path (str or os.PathLike) – Destination file path.
**kwargs (Any) – Additional keyword arguments forwarded to pd.DataFrame.to_csv.

Return type:

None

to_base_change_pivot_table()[source]

Build a base-change pivot table with transition/transversion stats.

Only SNPs are considered. The returned PivotTable has base-change categories as rows and samples as columns, with ti, tv, and ti/tv ratio stored in sample_metadata.

Return type:: PivotTable
Returns:: PivotTable – Pivot table of base-change counts with ti/tv metadata.

get_protein_info(gene)[source]

Extract protein mutation information for a given gene.

Parameters:

gene (str) – Hugo gene symbol to query.

Return type:

tuple[int | None, list[dict]]

Returns:

AA_length (int or None) – Total amino-acid length of the protein, or None if unavailable.
mutations_data (list[dict]) – List of dicts with keys "position", "type", and "count" describing nonsynonymous mutations.

static get_domain_info(gene_name, AA_length, protein_domains_path=None)[source]

Look up protein domain annotations for a gene.

Parameters:

gene_name (str) – HGNC gene symbol.
AA_length (int) – Amino-acid length used to match the correct transcript.
protein_domains_path (str, os.PathLike, or None, default None) – Path to a protein domains CSV. When None the bundled dataset (derived from maftools) is used.

Return type:

tuple[list[dict], str]

Returns:

domains (list[dict]) – List of dicts with "Start", "End", and "Label" keys.
refseq_id (str) – The RefSeq transcript ID used.

Raises:

ValueError – If no domain information is found for the given gene and length.

write_maf(file_path)[source]

Write the MAF to a tab-separated file without the index.

Parameters:: file_path (str or os.PathLike) – Destination file path.
Return type:: None

write_SigProfilerMatrixGenerator_format(output_path)[source]

Convert and write the MAF in SigProfilerMatrixGenerator format.

Renames columns to match the SigProfilerMatrixGenerator standard and filters to rows whose Variant_Type is SNP, INS, or DEL.

Parameters:: output_path (str or os.PathLike) – Destination TSV file path.
Return type:: None

select_samples(sample_IDs)[source]

Select rows belonging to specific samples.

Parameters:: sample_IDs (list[str]) – Sample identifiers to keep.
Return type:: MAF
Returns:: MAF – A copy containing only rows for the requested samples.

PivotTable

PivotTable Module

Extended pandas DataFrame for bioinformatics analysis with integrated metadata support. Specifically designed for mutation analysis and genomic data visualization.

class pymaftools.core.PivotTable.PivotTable(data=None, *args, **kwargs)[source]

Bases: DataFrame

Enhanced pandas DataFrame for bioinformatics analysis.

A specialized DataFrame that maintains synchronized metadata for both features (rows, typically genes/mutations) and samples (columns). Designed for genomic data analysis with built-in support for mutation frequency calculations, statistical testing, and visualization.

feature_metadata

Metadata for features (genes/mutations/signatures), indexed by feature names.

Type:: pd.DataFrame

sample_metadata

Metadata for samples, indexed by sample names.

Type:: pd.DataFrame

Examples

>>> # Create a PivotTable from mutation data
>>> data = pd.DataFrame({'sample1': [True, False], 'sample2': [False, True]},
...                     index=['TP53', 'KRAS'])
>>> table = PivotTable(data)
>>> table.feature_metadata['freq'] = table.add_freq().feature_metadata['freq']

info()[source]

Return a summary string of the PivotTable structure.

Return type:: str

property plot: PivotTablePlot

Access plotting functionality for the PivotTable.

Returns:: PivotTablePlot – Plotting interface providing various visualization methods.

Examples

>>> # PCA plot colored by subtype
>>> pivot_table.plot.plot_pca_samples(group_col="subtype")

>>> # Boxplot with statistical annotations
>>> pivot_table.plot.plot_boxplot_with_annot(
...     test_col="TMB",
...     group_col="subtype"
... )

rename_index_and_columns(index_name='feature', columns_name='sample')[source]

Rename the index and columns of the PivotTable.

Parameters:

index_name (str, default "feature") – New name for the index (features).
columns_name (str, default "sample") – New name for the columns (samples).

Return type:

PivotTable

Returns:

PivotTable – PivotTable with renamed index and columns.

to_sqlite(db_path)[source]

Save PivotTable to SQLite database format.

Parameters:: db_path (str) – Path to the SQLite database file.
Return type:: None

classmethod read_sqlite(db_path)[source]

Load PivotTable from SQLite database format.

Parameters:: db_path (str) – Path to the SQLite database file.
Return type:: PivotTable
Returns:: PivotTable – Loaded PivotTable with metadata.

to_hierarchical_clustering(method='ward', metric='euclidean')[source]

Perform hierarchical clustering on both features and samples.

Computes hierarchical clustering linkage matrices for both the feature dimension (genes/mutations) and sample dimension using scipy’s linkage function. This enables creation of dendrograms and clustermaps for data visualization and pattern discovery.

Parameters:

method (str, default 'ward') –
Linkage algorithm to use. Options include:
- ’ward’: Minimizes within-cluster variance (requires euclidean metric)
- ’single’: Nearest point algorithm
- ’complete’: Farthest point algorithm
- ’average’: UPGMA algorithm
- ’weighted’: WPGMA algorithm
- ’centroid’: UPGMC algorithm
- ’median’: WPGMC algorithm
metric (str, default 'euclidean') –
Distance metric for clustering. Common options:
- ’euclidean’: Standard Euclidean distance
- ’manhattan’: L1 distance
- ’cosine’: Cosine distance
- ’correlation’: Correlation distance
- ’hamming’: Hamming distance (for binary data)
- ’jaccard’: Jaccard distance (for binary data)

Return type:

Dict[str, ndarray]

Returns:

Dict[str, np.ndarray] – Dictionary containing linkage matrices:

’gene_linkage’np.ndarray of shape (n_features-1, 4)
Linkage matrix for features (genes), where each row represents a merge operation in the clustering tree
’sample_linkage’np.ndarray of shape (n_samples-1, 4)
Linkage matrix for samples, where each row represents a merge operation in the clustering tree

Notes

The linkage matrices returned follow scipy’s format where each row contains [cluster1_id, cluster2_id, distance, cluster_size].

For binary mutation data, consider using ‘hamming’ or ‘jaccard’ metrics. The ‘ward’ method works only with ‘euclidean’ metric.

Examples

>>> # Basic hierarchical clustering
>>> clustering = pivot_table.to_hierarchical_clustering()
>>> gene_linkage = clustering['gene_linkage']
>>> sample_linkage = clustering['sample_linkage']

>>> # Using Jaccard distance for binary mutation data
>>> clustering = pivot_table.to_hierarchical_clustering(
...     method='average',
...     metric='jaccard'
... )

>>> # Create dendrogram from results
>>> from scipy.cluster.hierarchy import dendrogram
>>> import matplotlib.pyplot as plt
>>>
>>> plt.figure(figsize=(10, 6))
>>> dendrogram(clustering['gene_linkage'])
>>> plt.title('Gene Clustering Dendrogram')
>>> plt.show()

See also

scipy.cluster.hierarchy.linkage: The underlying clustering function
scipy.cluster.hierarchy.dendrogram: For visualizing clustering results
seaborn.clustermap: For creating clustered heatmaps

copy(deep=True)[source]

Make a copy of this object’s indices and data.

Creates a deep or shallow copy of the PivotTable and its associated feature_metadata and sample_metadata.

Parameters:: deep (bool, default True) – Whether to make a deep copy or shallow copy.
Return type:: PivotTable
Returns:: PivotTable – Copy of the PivotTable with preserved metadata.

Examples

>>> pivot_copy = pivot_table.copy()
>>> pivot_shallow = pivot_table.copy(deep=False)

See also

pandas.DataFrame.copy: The underlying pandas copy method.

subset(*, features=None, samples=None)[source]

Subset PivotTable by features and/or samples.

This method provides a convenient interface for selecting specific features (rows) and samples (columns) from the PivotTable while preserving metadata alignment.

Parameters:

features (list, pd.Series, or slice, optional) –
Features (rows) to select. Can be:
- list of strFeature names to select
  Example: ["TP53", "KRAS", "EGFR"]
- pd.Series (bool)Boolean mask for feature selection
  Example: pivot_table.feature_metadata["freq"] > 0.1
- sliceSlice object for feature selection
  Example: slice(None, 10) for first 10 features
- None : Select all features (default)
samples (list, pd.Series, or slice, optional) –
Samples (columns) to select. Can be:
- list of strSample names to select
  Example: ["sample1", "sample2", "sample3"]
- pd.Series (bool)Boolean mask for sample selection
  Example: pivot_table.sample_metadata["subtype"] == "LUAD"
- sliceSlice object for sample selection
  Example: slice(None, 20) for first 20 samples
- None : Select all samples (default)

Return type:

PivotTable

Returns:

PivotTable – Subset PivotTable with synchronized metadata. Only keeps existing labels (inner join behavior).

Notes

This method uses inner join behavior, meaning only existing labels are kept. Missing labels are silently ignored. For outer join behavior that includes missing labels with NaN values, use the reindex method.

Examples

>>> # Select specific genes and samples
>>> subset = pivot_table.subset(
...     features=["TP53", "KRAS"],
...     samples=["sample1", "sample2"]
... )

>>> # Select high-frequency mutations
>>> high_freq = pivot_table.feature_metadata["freq"] > 0.1
>>> frequent_mutations = pivot_table.subset(features=high_freq)

>>> # Select samples by subtype
>>> luad_samples = pivot_table.sample_metadata["subtype"] == "LUAD"
>>> luad_data = pivot_table.subset(samples=luad_samples)

The metadata is subset based on the DataFrame’s index (features) and columns (samples). Missing indices in metadata will result in NaN values in the new PivotTable’s metadata.

See also

PivotTable.reindex: For outer join behavior with missing labels.
PivotTable.__getitem__: For direct indexing operations.

reindex(index=None, columns=None, *args, fill_value=nan, feature_fill_value=nan, sample_fill_value=nan, **kwargs)[source]

Conform PivotTable to new index and/or columns with synchronized metadata.

This method extends pandas DataFrame.reindex to also reindex the associated feature_metadata and sample_metadata, maintaining consistency across all components of the PivotTable.

Parameters:

index (array-like, optional) – New labels for the rows. If None, use existing index.
columns (array-like, optional) – New labels for the columns. If None, use existing columns.
*args (Any) – Additional positional arguments passed to pandas.DataFrame.reindex.
fill_value (scalar, default np.nan) – Value to use for missing values in the main DataFrame.
feature_fill_value (scalar, default np.nan) – Value to use for missing values when reindexing feature_metadata.
sample_fill_value (scalar, default np.nan) – Value to use for missing values when reindexing sample_metadata.
**kwargs (Any) – Additional keyword arguments passed to pandas.DataFrame.reindex.

Return type:

PivotTable

Returns:

PivotTable – Reindexed PivotTable with synchronized metadata.

Notes

Unlike the subset method which uses inner join behavior, this method uses outer join behavior and will include missing labels filled with the specified fill values.

The metadata DataFrames are automatically reindexed to match the new structure of the main DataFrame.

Examples

>>> # Reindex with new features, filling missing with 0
>>> new_features = ["TP53", "KRAS", "NEW_GENE"]
>>> reindexed = pivot_table.reindex(
...     index=new_features,
...     fill_value=0,
...     feature_fill_value="Unknown"
... )

>>> # Reindex with new samples
>>> new_samples = ["sample1", "sample2", "new_sample"]
>>> reindexed = pivot_table.reindex(
...     columns=new_samples,
...     sample_fill_value="Missing"
... )

>>> # Reindex both dimensions
>>> reindexed = pivot_table.reindex(
...     index=new_features,
...     columns=new_samples,
...     fill_value=np.nan
... )

See also

pandas.DataFrame.reindex: The underlying pandas reindex method.
PivotTable.subset: For inner join behavior.

static merge(tables, fill_value=nan, feature_fill_value=nan, sample_fill_value=nan, join='outer')[source]

Merge multiple PivotTables into a single PivotTable.

Concatenates multiple PivotTable objects along the sample axis (columns) and aligns features (rows) according to the selected join strategy. Metadata (feature_metadata and sample_metadata) is automatically synchronized with the resulting data matrix.

Parameters:

tables (List[PivotTable]) – List of PivotTable objects to merge. All should have compatible structure.
fill_value (scalar, default np.nan) – Value to use for missing values in the main data matrix after merging.
feature_fill_value (scalar, default np.nan) – Value to use for missing values when reindexing feature_metadata.
sample_fill_value (scalar, default np.nan) – Value to use for missing values when reindexing sample_metadata.
join ({'inner', 'outer'}, default 'outer') –
Strategy for aligning features (rows) across tables: - ‘inner’: Keep only features shared by all tables. - ‘outer’: Keep all features from all tables (union).

Note: samples (columns) are always unioned.

Return type:

PivotTable

Returns:

PivotTable – A new PivotTable with: - Merged data matrix - Reindexed feature and sample metadata - Missing values filled with specified defaults

Raises:

ValueError – If join is not ‘inner’ or ‘outer’.
ValueError – If sample (column) names overlap across tables.

Examples

>>> # Outer merge (default): keeps all features
>>> merged = PivotTable.merge([table_A, table_B])

>>> # Inner merge: keeps only shared features
>>> merged = PivotTable.merge([table_A, table_B], join='inner')

>>> # Fill missing values with 0
>>> merged = PivotTable.merge([table_A, table_B], fill_value=0)

calculate_TMB(default_capture_size=40, group_col='subtype', capture_size_dict=None)[source]

Return type:: PivotTable

calculate_feature_frequency()[source]

Calculate mutation frequency for each feature.

Computes the frequency of each feature (gene/mutation/signature) across all samples by converting the data to binary format and calculating the proportion of samples with mutations for each feature.

Treats any non-False value as indicating the presence of a mutation, effectively converting the data to binary (mutated/not mutated) before frequency calculation.

Returns:: pd.Series – Mutation frequency for each feature, indexed by feature names. Values range from 0.0 (no mutations in any sample) to 1.0 (mutations in all samples).

Notes

This method is equivalent to calling:

Convert PivotTable to binary format using to_binary_table()
Sum mutations across samples (axis=1)
Divide by total number of samples

The frequency represents the proportion of samples that have a mutation for each feature, regardless of the specific mutation type or value.

Examples

>>> # Create example mutation data
>>> data = pd.DataFrame({
...     'sample1': [True, False, True],
...     'sample2': [False, True, True],
...     'sample3': [True, True, False]
... }, index=['TP53', 'KRAS', 'EGFR'])
>>> table = PivotTable(data)
>>> frequencies = table.calculate_feature_frequency()
>>> print(frequencies)
TP53     0.666667
KRAS     0.666667
EGFR     0.666667
dtype: float64

>>> # Frequency shows proportion of samples with each mutation
>>> print(f"TP53 is mutated in {frequencies['TP53']:.1%} of samples")
TP53 is mutated in 66.7% of samples

See also

to_binary_table: Convert PivotTable to binary mutation format
add_freq: Add frequency columns to feature_metadata
filter_by_freq: Filter features by frequency threshold

add_freq(groups={})[source]

Add mutation frequency columns to feature_metadata.

Calculates overall mutation frequency and optionally group-specific frequencies for all features, adding these as new columns to the feature_metadata DataFrame. This enables frequency-based filtering and analysis operations.

Parameters:

groups (Dict[str, PivotTable], default {}) –

Dictionary mapping group names to PivotTable objects for calculating group-specific mutation frequencies. Each PivotTable should represent a subset of samples belonging to a specific group (e.g., cancer subtypes, treatment groups, etc.).

Example: {“LUAD”: luad_table, “LUSC”: lusc_table, “Control”: control_table}

Return type:

PivotTable

Returns:

PivotTable – A new PivotTable (copy) with frequency columns added to feature_metadata:

”{group_name}_freq”float
Frequency for each group specified in the groups dictionary
”freq”float
Overall frequency across all samples in the current PivotTable

Raises:

TypeError – If any value in the groups dictionary is not a PivotTable instance.

Notes

The frequency calculation treats any non-False value as indicating mutation presence. Frequencies are calculated as:

frequency = (number of mutated samples) / (total number of samples)

Group-specific frequencies are calculated independently for each group’s PivotTable, while the overall frequency uses all samples in the current PivotTable.

Examples

>>> # Add overall frequency only
>>> table_with_freq = pivot_table.add_freq()
>>> print(table_with_freq.feature_metadata.columns)
Index(['freq'], dtype='object')

>>> # Add group-specific frequencies
>>> luad_subset = pivot_table.subset(samples=luad_sample_mask)
>>> lusc_subset = pivot_table.subset(samples=lusc_sample_mask)
>>> groups = {"LUAD": luad_subset, "LUSC": lusc_subset}
>>> table_with_freq = pivot_table.add_freq(groups=groups)
>>> print(table_with_freq.feature_metadata.columns)
Index(['LUAD_freq', 'LUSC_freq', 'freq'], dtype='object')

>>> # Use frequencies for filtering
>>> high_freq_features = table_with_freq.filter_by_freq(threshold=0.1)
>>> luad_specific = table_with_freq[
...     (table_with_freq.feature_metadata['LUAD_freq'] > 0.2) &
...     (table_with_freq.feature_metadata['LUSC_freq'] < 0.05)
... ]

See also

calculate_feature_frequency: Calculate frequency for current PivotTable
filter_by_freq: Filter features by frequency threshold
sort_features: Sort features by metadata columns including frequency

sort_features(by='freq', ascending=False)[source]

Sort features (rows) by a column in feature_metadata.

Parameters:

by (str, default "freq") – Column name in feature_metadata to sort by.
ascending (bool, default False) – Sort order. False for descending (highest values first).

Return type:

PivotTable

Returns:

PivotTable – New PivotTable with features sorted by the specified column.

Raises:

ValueError – If the specified column is not found in feature_metadata.

sort_samples_by_mutations(top=10)[source]

Sort samples by their mutation patterns.

Uses a binary encoding approach where mutation patterns of the top mutated features are converted to integers for sorting.

Parameters:: top (int, default 10) – Number of top features to consider for sorting.
Return type:: PivotTable
Returns:: PivotTable – New PivotTable with samples sorted by mutation patterns. The mutation weight is added to sample_metadata.

sort_samples_by_group(group_col, group_order, top=10)[source]

Sort samples by group membership and then by mutation patterns.

First sorts samples according to the specified group order, then within each group, applies mutation-based sorting using sort_samples_by_mutations. This creates a hierarchical sorting where group membership is the primary sort key and mutation patterns are the secondary key.

Parameters:

group_col (str) – The column name in sample_metadata containing group information (e.g., “subtype”, “treatment”, “stage”).
group_order (List[str]) – The desired order of groups for sample arrangement. Groups will be ordered as specified in this list.
top (int, default 10) – The number of top features (highest frequency) to consider when sorting samples by mutation patterns within each group.

Return type:

PivotTable

Returns:

PivotTable – A new PivotTable with samples sorted first by group membership, then by mutation patterns within each group.

Raises:

ValueError – If the specified group_col is not found in sample_metadata.

Notes

This method is useful for creating organized visualizations where you want to group samples by a specific criterion (e.g., cancer subtype) while maintaining mutation-based ordering within each group.

The mutation-based sorting within groups uses the sort_samples_by_mutations method, which converts mutation patterns to binary encodings for sorting.

Examples

>>> # Sort samples by cancer subtype, then by mutation patterns
>>> sorted_table = pivot_table.sort_samples_by_group(
...     group_col="subtype",
...     group_order=["LUAD", "LUSC", "SCLC"],
...     top=15
... )

>>> # Sort by treatment response, considering top 20 mutations
>>> sorted_table = pivot_table.sort_samples_by_group(
...     group_col="response",
...     group_order=["Complete", "Partial", "Stable", "Progressive"],
...     top=20
... )

See also

sort_samples_by_mutations: Sort samples by mutation patterns only
sort_features: Sort features by metadata columns
subset: Select specific samples or features

PCA(to_binary)[source]

Perform Principal Component Analysis on the PivotTable.

Parameters:

to_binary (bool) – Whether to convert the data to binary format before PCA.

Return type:

Tuple[DataFrame, ndarray, PCA]

Returns:

tuple –

pca_result_df : pd.DataFrame with PC1 and PC2 for each sample
explained_variance : np.ndarray of variance ratios for PC1 and PC2
pca : sklearn.decomposition.PCA fitted object

head(n=50)[source]

Return the first n features (rows) subset of the PivotTable.

Parameters:: n (int, default 50) – Number of features to return.
Return type:: PivotTable
Returns:: PivotTable – PivotTable subset containing only the first n features.

tail(n=50)[source]

Return the last n features (rows) subset of the PivotTable.

Parameters:: n (int, default 50) – Number of features to return.
Return type:: PivotTable
Returns:: PivotTable – PivotTable subset containing only the last n features.

filter_by_freq(threshold=0.05)[source]

Filter features by their mutation frequency.

Parameters:: threshold (float, default 0.05) – Minimum frequency threshold (0 to 1).
Return type:: PivotTable
Returns:: PivotTable – PivotTable containing only features with freq >= threshold.
Raises:: ValueError – If ‘freq’ column is not found in feature_metadata.

filter_by_variance(threshold=None, method='var', quantile=None)[source]

Filter features by variance or median absolute deviation.

Parameters:

threshold (float, optional) – Minimum variance/MAD threshold. Features with scores >= threshold are kept. If None, quantile must be specified.
method ({"var", "mad"}, default "var") – Dispersion metric: - “var”: variance - “mad”: median absolute deviation
quantile (float, optional) – Quantile cutoff (0–1). E.g. quantile=0.75 keeps the top 25% most variable features. Overrides threshold when both given.

Return type:

PivotTable

Returns:

PivotTable – Filtered PivotTable with dispersion scores in feature_metadata.

Raises:

ValueError – If neither threshold nor quantile is specified, or if method is not supported.

filter_by_statistical_test(group_col, method='kruskal', alpha=0.05)[source]

Filter features by a statistical test across sample groups.

For each feature, performs the chosen test on groups defined by group_col in sample_metadata, applies FDR correction, and returns only features with adjusted p-value < alpha.

Parameters:

group_col (str) – Column in sample_metadata defining sample groups.
method ({"ttest", "mann_whitney", "kruskal", "anova"}, default "kruskal") – Statistical test. ttest and mann_whitney require exactly two groups; kruskal and anova support two or more.
alpha (float, default 0.05) – Significance threshold applied after FDR correction.

Return type:

PivotTable

Returns:

PivotTable – Filtered PivotTable with p_value and adjusted_p_value columns added to feature_metadata.

Raises:

ValueError – If method is unsupported or group count doesn’t match the test.

to_cooccur_matrix(freq=True)[source]

Convert to co-occurrence matrix format.

Parameters:: freq (bool, default True) – If True, normalize by sample count to get frequencies. If False, return raw co-occurrence counts.
Return type:: CooccurrenceMatrix
Returns:: CooccurrenceMatrix – Matrix showing feature co-occurrence patterns.

to_binary_table()[source]

Convert PivotTable to binary format.

Converts all non-False values to True, creating a binary representation of the mutation data.

Return type:: PivotTable
Returns:: PivotTable – Bool PivotTable where True indicates mutation presence.

mutation_enrichment_test(group_col, group1, group2, alpha=0.05, minimum_mutations=2, method='chi2')[source]

Perform statistical enrichment test for mutations between two groups.

Tests whether specific mutations are significantly enriched in one group compared to another using either Chi-squared test or Fisher’s exact test. Multiple testing correction is applied using the Benjamini-Hochberg method.

Parameters:

group_col (str) – Column name in sample_metadata that contains group assignments.
group1 (str) – Name of the first group to compare.
group2 (str) – Name of the second group to compare.
alpha (float, default 0.05) – Significance level for multiple testing correction.
minimum_mutations (int, default 2) – Minimum number of mutations required in either group to include a feature in the analysis.
method ({"chi2", "fisher"}, default "chi2") – Statistical test method to use: - “chi2”: Chi-squared test of independence - “fisher”: Fisher’s exact test

Return type:

DataFrame

Returns:

pd.DataFrame – Results DataFrame with the following columns: - “{group1}_True”: Count of mutated samples in group1 - “{group1}_False”: Count of non-mutated samples in group1 - “{group2}_True”: Count of mutated samples in group2 - “{group2}_False”: Count of non-mutated samples in group2 - “p_value”: Raw p-values from statistical test - “adjusted_p_value”: FDR-corrected p-values - “is_significant”: Boolean indicating significance after correction - “test_method”: Method used for testing

Raises:

ValueError – If unsupported statistical method is specified.

Notes

The method creates 2x2 contingency tables for each feature:

Group1 Group2

Mutated a b Not mutated c d

Features with fewer than minimum_mutations in both groups are excluded to avoid testing rare mutations that may not be statistically meaningful.

Examples

>>> # Test for mutations enriched in LUAD vs LUSC
>>> results = pivot_table.mutation_enrichment_test(
...     group_col="subtype",
...     group1="LUAD",
...     group2="LUSC",
...     method="fisher"
... )
>>> significant = results[results["is_significant"]]
>>> print(f"Found {len(significant)} significantly enriched mutations")

See also

scipy.stats.chi2_contingency: Chi-squared test implementation
scipy.stats.fisher_exact: Fisher’s exact test implementation
statsmodels.stats.multitest.multipletests: Multiple testing correction

compute_similarity(method='cosine')[source]

Compute sample similarity matrix using specified metric.

Parameters:: method ({"cosine", "hamming", "jaccard", "pearson", "spearman", "kendall"}, default "cosine") – Similarity metric to use.
Return type:: SimilarityMatrix
Returns:: SimilarityMatrix – Pairwise similarity matrix between samples.
Raises:: ValueError – If unsupported similarity method is specified.

order(group_col, group_order)[source]

Reorder samples by group membership.

Parameters:

group_col (str) – Column name in sample_metadata containing group information.
group_order (List[str]) – Order of groups for sample arrangement.

Return type:

PivotTable

Returns:

PivotTable – PivotTable with samples ordered by group membership.

static prepare_data(maf)[source]

Prepare and process MAF data into a sorted PivotTable.

Filters MAF for nonsynonymous mutations, converts to PivotTable, adds frequency calculations, and sorts by feature frequency and sample mutation patterns.

Parameters:: maf (MAF) – Input MAF object containing mutation data.
Return type:: PivotTable
Returns:: PivotTable – Processed and sorted PivotTable ready for analysis.

add_sample_metadata(sample_metadata, fill_value=None, force=False)[source]

Safely add sample metadata to the PivotTable.

This method ensures that: 1. Only samples existing in the PivotTable are added 2. Existing columns are not overwritten unless forced 3. Type consistency is maintained

Parameters:

sample_metadata (pd.DataFrame) – New metadata to add, indexed by sample names.
fill_value (Optional[Union[str, float]], default None) – Value to use for missing data.
force (bool, default False) – If True, allow overwriting existing columns.

Return type:

PivotTable

Returns:

PivotTable – PivotTable with updated sample metadata.

Raises:

ValueError – If sample names don’t match or columns conflict without force=True.

Examples

>>> # Add new metadata columns
>>> new_meta = pd.DataFrame({
...     'age': [65, 72, 58],
...     'stage': ['I', 'II', 'III']
... }, index=['sample1', 'sample2', 'sample3'])
>>> table_with_meta = table.add_sample_metadata(new_meta)

add_feature_metadata(feature_metadata, fill_value=None, force=False)[source]

Safely add feature metadata to the PivotTable.

This method ensures that: 1. Only features existing in the PivotTable are added 2. Existing columns are not overwritten unless forced 3. Type consistency is maintained

Parameters:

feature_metadata (pd.DataFrame) – New metadata to add, indexed by feature names.
fill_value (Optional[Union[str, float]], default None) – Value to use for missing data.
force (bool, default False) – If True, allow overwriting existing columns.

Return type:

PivotTable

Returns:

PivotTable – PivotTable with updated feature metadata.

Raises:

ValueError – If feature names don’t match or columns conflict without force=True.

Examples

>>> # Add gene annotation metadata
>>> gene_anno = pd.DataFrame({
...     'chromosome': ['17', '12', '3'],
...     'gene_type': ['tumor_suppressor', 'oncogene', 'oncogene']
... }, index=['TP53', 'KRAS', 'PIK3CA'])
>>> table_with_anno = table.add_feature_metadata(gene_anno)

pymaftools.core.PivotTable.capture_size(bed_path)[source]

Calculate the total capture size (in megabases) from a BED file.

The BED file must have at least three columns: chrom, start, end.

Return type:: float

Parameters:: bed_path (str): Path to the BED file.
Returns:: float: Total capture region size in megabases (Mb).

Cohort

class pymaftools.core.Cohort.Cohort(name, description='')[source]

Bases: object

add_sample_metadata(new_metadata, source='')[source]

Add or merge sample metadata into the cohort.

Parameters:

new_metadata (pd.DataFrame) – DataFrame containing sample metadata, indexed by sample ID.
source (str, optional) – Name of the source providing the metadata, used in error messages, by default “”.

Raises:

TypeError – If new_metadata is not a pandas DataFrame.
ValueError – If the index of new_metadata does not match the existing cohort index, or if shared columns have conflicting values.

Return type:

None

add_table(table, table_name)[source]

Add a PivotTable to the cohort.

Parameters:

table (PivotTable) – The PivotTable to add.
table_name (str) – Name to assign to the table within the cohort.

Raises:

TypeError – If table is not an instance of PivotTable.

Return type:

None

remove_table(table_name)[source]

Remove a table from the cohort by name.

Parameters:: table_name (str) – Name of the table to remove.
Return type:: None

subset(samples=[])[source]

Create a new Cohort containing only the specified samples.

Parameters:: samples (list of str, optional) – Sample IDs to keep, by default [].
Return type:: Cohort
Returns:: Cohort – A new Cohort containing only the specified samples.

copy(deep=True)[source]

Create a copy of the Cohort.

Parameters:: deep (bool, optional) – If True, perform a deep copy of all tables and metadata. If False, perform a shallow copy, by default True.
Return type:: Cohort
Returns:: Cohort – A new Cohort instance.

info()[source]

Return a summary string of the Cohort structure.

Return type:: str
Returns:: str – A tree-formatted summary showing each table’s dimensions and metadata counts.

to_sql_registry()[source]

Generate a registry DataFrame for SQL table mapping.

Creates a mapping between logical table names and their corresponding SQL table names for data, sample metadata, and feature metadata.

Return type:: DataFrame
Returns:: pd.DataFrame – Registry with columns: sql_table_name, cohort_name, table_name, type

to_sqlite(db_path)[source]

Save Cohort to SQLite database format.

Deprecated since version 0.4.0: to_sqlite will be removed in a future version. Use to_hdf5() instead, which supports larger datasets without column limits.

Parameters:: db_path (str) – Path to the output SQLite database file.
Return type:: None

classmethod read_sqlite(db_path)[source]

Load Cohort from SQLite database format.

Deprecated since version 0.4.0: read_sqlite will be removed in a future version. Use read_hdf5() instead.

Parameters:: db_path (str) – Path to the SQLite database file.
Return type:: Cohort
Returns:: Cohort – Loaded Cohort object.

to_hdf5(h5_path)[source]

Save Cohort to HDF5 format.

HDF5 format is recommended for large datasets as it doesn’t have the column limit that SQLite has (~2000 columns).

Parameters:: h5_path (str) – Path to the output HDF5 file.
Return type:: None

classmethod read_hdf5(h5_path)[source]

Load Cohort from HDF5 format.

Parameters:: h5_path (str) – Path to the HDF5 file.
Return type:: Cohort
Returns:: Cohort – Loaded Cohort object.

CopyNumberVariationTable

class pymaftools.core.CopyNumberVariationTable.CopyNumberVariationTable(data=None, *args, **kwargs)[source]

Bases: PivotTable

Table for storing and manipulating copy number variation (CNV) data.

Inherits from PivotTable and provides specialized methods for reading GISTIC output files, thresholding continuous copy number values, sorting by chromosomal position, clustering, and plotting CNV frequencies.

The data matrix is oriented with genomic features (genes or chromosome arms) as rows and samples as columns. Associated feature_metadata and sample_metadata DataFrames carry annotation such as cytoband, chromosome, arm, thresholds, and sample type.

See also

PivotTable: Base class providing generic pivot-table operations.

classmethod from_pivot_table(table)[source]

Create a CopyNumberVariationTable object from a PivotTable object, preserving all metadata.

Parameters:: table (PivotTable) – A PivotTable object containing sample_metadata and feature_metadata attributes.
Return type:: CopyNumberVariationTable
Returns:: CopyNumberVariationTable – A CopyNumberVariationTable object with original sample_metadata and feature_metadata preserved.

classmethod read_gistic_arm_level(file_path)[source]

Read GISTIC broad data by arm level file.

Parameters:: file_path (str) – Path to the GISTIC arm-level results file.
Returns:: CopyNumberVariationTable – A CopyNumberVariationTable object with arm-level copy number data.

classmethod read_gistic_gene_level(file_path, feature_columns=['Gene Symbol', 'Gene ID', 'Cytoband'], samples=None)[source]

Read GISTIC results file and create a CopyNumberVariationTable object.

This method reads GISTIC output files (typically all_data_by_genes.txt or all_thresholded.by_genes.txt) and converts them into a CopyNumberVariationTable object with properly formatted feature and sample metadata.

Parameters:

file_path (str) – Path to the GISTIC results file (tab-separated format).
feature_columns (list of str, default ["Gene Symbol", "Gene ID", "Cytoband"]) – List of column names to be treated as feature metadata. These columns will be separated from the main data matrix.
samples (None or list of str, optional) – List of sample names to subset. If None, all samples are kept. Only samples present in both the data and this list will be retained.

Returns:

CopyNumberVariationTable – A CopyNumberVariationTable object containing: - Main data matrix with gene symbols as index - feature_metadata with gene information and parsed chromosome data - sample_metadata with case_ID and sample_type extracted from column names

Raises:

ValueError – If ‘Gene Symbol’ column is not found in the input file.

Notes

The method performs several data processing steps: 1. Removes ‘.call’ suffix from column names 2. Separates feature metadata from data columns 3. Parses sample names to extract case_ID and sample_type (split by last ‘_’) 4. Parses Cytoband information into Chromosome, Arm, and Band columns 5. Subsets data to specified samples if provided

The Cytoband parsing supports both numeric chromosomes (1-22) and sex chromosomes (X, Y) using the pattern: chromosome + arm (p/q) + band.

Examples

>>> cnv = CopyNumberVariationTable.read_gistic_gene_level('data/all_data_by_genes.txt')
>>> cnv = CopyNumberVariationTable.read_gistic_gene_level('data/all_thresholded.by_genes.txt',
...                       feature_columns=['Gene Symbol', 'Gene ID', 'Cytoband', 'Locus ID'])
>>> cnv = CopyNumberVariationTable.read_gistic_gene_level('data/all_data_by_genes.txt',
...                       samples=['LUAD_001_T', 'LUAD_002_T'])

sort_by_chromosome(ascending=True)[source]

Sort CopyNumberVariationTable data by chromosomal position.

Sorts the CopyNumberVariationTable data by chromosome number, arm (p/q), and band position. Handles both numeric chromosomes (1-22) and sex chromosomes (X, Y).

Parameters:: ascending (bool, default True) – Whether to sort in ascending order. If False, sorts in descending order.
Return type:: CopyNumberVariationTable
Returns:: CopyNumberVariationTable – A new CopyNumberVariationTable object with features sorted by chromosomal position.

Notes

The sorting order is: 1. Chromosome: 1, 2, …, 22, X, Y 2. Arm: p (short arm) before q (long arm) 3. Band: numerical order (e.g., 11.1, 11.2, 12.1)

Requires the feature_metadata to have ‘Chromosome’, ‘Arm’, and ‘Band’ columns, which are typically created by the read_gistic method when parsing Cytoband information.

Examples

>>> cnv_sorted = cnv_table.sort_by_chromosome()
>>> cnv_desc = cnv_table.sort_by_chromosome(ascending=False)

to_thresholded_cnv()[source]

Convert continuous CNV values to discrete thresholded categories.

Each value is mapped to one of five integer levels based on per-sample thresholds stored in sample_metadata: -2 (deep deletion), -1 (shallow deletion), 0 (neutral), +1 (low-level gain), +2 (high-level amplification).

Return type:: CopyNumberVariationTable
Returns:: CopyNumberVariationTable – A new table with the same shape where every cell contains an integer in {-2, -1, 0, 1, 2}.
Raises:: KeyError – If sample_metadata does not contain the required threshold columns: del_high_threshold, del_low_threshold, amp_low_threshold, amp_high_threshold.

static read_all_gistic(all_data_by_genes_file, sample_cutoffs_file, all_thresholded_by_genes_file, broad_values_by_arm_file)[source]

Read all GISTIC output files and create CopyNumberVariationTable objects.

Parameters:

all_data_by_genes_file (str) – Path to the GISTIC all_data_by_genes.txt file.
sample_cutoffs_file (str) – Path to the GISTIC sample_cutoffs.txt file.
all_thresholded_by_genes_file (str) – Path to the GISTIC all_thresholded.by_genes.txt file.
broad_values_by_arm_file (str) – Path to the GISTIC broad_values_by_arm.txt file.

Returns:

tuple – A tuple containing: - all_data_by_genes_table : CopyNumberVariationTable - sample_cutoff_df : pd.DataFrame - thresholded_cnv_table : CopyNumberVariationTable - broad_values_by_arm_table : CopyNumberVariationTable

to_cluster_table(cluster_col='cluster')[source]

Aggregate features by cluster label and return a cluster-level table.

Groups features (rows) according to cluster_col in feature_metadata and computes the mean CNV value per cluster per sample.

Parameters:: cluster_col (str, default "cluster") – Name of the column in feature_metadata that contains cluster assignments.
Return type:: CopyNumberVariationTable
Returns:: CopyNumberVariationTable – A new table whose rows are clusters and whose columns are samples. The feature_metadata of the returned table contains: unique_chr_arm, features (list of original feature names), and features_count.
Raises:: ValueError – If cluster_col is not found in feature_metadata.

plot_cnv_band_ratio(cluster_id, mode='gain', threshold=0.1, sample_type='T', subtype_order=None, ax=None, cmap=None, show=True, title=None)[source]

Plot gain or loss frequency across cytobands for a specific CNV cluster and sample type.

Parameters:

cluster_id (str) – Cluster ID to extract features (e.g., “C47” or “C6”).
mode ({"gain", "loss"}) – Type of alteration to compute.
threshold (float) – Threshold for gain or loss (default: 0.1).
sample_type (str) – Sample type to subset (default: “T”).
subtype_order (list of str, optional) – Order of subtypes to show in columns. If None, uses [“LUAD”, “ASC”, “LUSC”].
ax (matplotlib Axes, optional) – If provided, plot on this Axes object.
cmap (str, optional) – Colormap (default: “Reds” for gain, “Blues” for loss).
show (bool) – Whether to show the plot.
title (str, optional) – Title to display on plot.

Return type:

DataFrame

Returns:

pd.DataFrame – Cytoband × Subtype frequency table.

static to_cnv_table(all_sample_df)[source]

Build a CopyNumberVariationTable from a long-format DataFrame.

Pivots all_sample_df so that genes become rows and samples become columns, then attaches gene-level metadata (name, chromosome, start, end). Duplicate gene names are disambiguated by appending the Ensembl gene ID.

Parameters:: all_sample_df (pd.DataFrame) – Long-format DataFrame with at least the columns gene_id, gene_name, chromosome, start, end, sample_ID, and copy_number.
Return type:: CopyNumberVariationTable
Returns:: CopyNumberVariationTable – A table indexed by (unique) gene name with samples as columns.

pymaftools.core.CopyNumberVariationTable.read_sample_cutoff_file(sample_cutoff_file)[source]

Read the sample cutoff file and extract the amp_thresh and del_thresh values.

Parameters:: sample_cutoff_file (str) – Path to the sample cutoff file.
Return type:: DataFrame
Returns:: pd.DataFrame – DataFrame containing sample cutoff data with amp_threshold and del_threshold columns.

pymaftools.core.CopyNumberVariationTable.TCGA_sample_type(TCGA_barcode)[source]

Determine the sample type from a TCGA barcode suffix.

Parses the two-digit sample-type code near the end of the barcode and returns a single-character label.

Parameters:: TCGA_barcode (str) – A TCGA-style barcode ending with a sample-type portion (e.g., "TCGA-XX-XXXX-01A").
Return type:: str
Returns:: str – "T" for tumor (codes 00-09), "N" for normal (codes 10-19), or "C" for control (codes 20-29).
Raises:: ValueError – If the barcode does not match the expected format or the sample-type code is 30 or above.

pymaftools.core.CopyNumberVariationTable.get_target_sample_ID(paired_sample_IDs, target_sample_type)[source]

Extract the sample ID matching a target type from a comma-separated list.

Parameters:

paired_sample_IDs (str) – Comma-separated TCGA barcode strings (e.g., "TCGA-XX-XXXX-01A, TCGA-XX-XXXX-11A").
target_sample_type (str) – Desired sample type ("T", "N", or "C").

Return type:

str

Returns:

str – The first barcode whose type matches target_sample_type.

Raises:

ValueError – If no barcode in the list matches the requested type.

pymaftools.core.CopyNumberVariationTable.read_TCGA_ASCAT3_CNV_file_sheet(file_path, file_suffix='ascat3.gene_level_copy_number.v36.tsv')[source]

Read a TCGA ASCAT3 file sheet and extract tumor/normal sample IDs.

Filters rows whose File Name column contains file_suffix, then derives tumor_sample_ID and normal_sample_ID from the paired Sample ID field.

Parameters:

file_path (str) – Path to a tab-separated file sheet downloaded from the GDC portal.
file_suffix (str, default "ascat3.gene_level_copy_number.v36.tsv") – Substring used to filter relevant file rows.

Return type:

DataFrame

Returns:

pd.DataFrame – The filtered file sheet with added tumor_sample_ID and normal_sample_ID columns.

pymaftools.core.CopyNumberVariationTable.read_cnv_files(base_dir, file_sheet)[source]

Read and concatenate per-sample CNV files listed in a file sheet.

Iterates over file_sheet, reads each tab-separated CNV file from base_dir, tags rows with the tumor sample ID, drops rows with missing values, and concatenates everything into a single long-format DataFrame.

Parameters:

base_dir (str) – Directory containing the individual CNV files.
file_sheet (pd.DataFrame) – DataFrame with at least File Name and tumor_sample_ID columns (as produced by read_TCGA_ASCAT3_CNV_file_sheet()).

Return type:

DataFrame

Returns:

pd.DataFrame – Concatenated long-format DataFrame of all samples with an added sample_ID column.

ExpressionTable

class pymaftools.core.ExpressionTable.ExpressionTable(data=None, *args, **kwargs)[source]

Bases: PivotTable

Table for handling RNA expression data.

Inherits from PivotTable and provides specific functionality for gene expression analysis, including cluster-level aggregation.

to_cluster_table(cluster_col='cluster')[source]

Aggregate expression values by cluster assignment.

Groups features (genes) by the specified cluster column in feature_metadata and computes the mean expression per cluster.

Parameters:: cluster_col (str, default "cluster") – Column name in feature_metadata containing cluster labels.
Return type:: ExpressionTable
Returns:: ExpressionTable – Cluster-level expression table with aggregated metadata.
Raises:: ValueError – If cluster_col is not found in feature_metadata.

SignatureTable

class pymaftools.core.SignatureTable.SignatureTable(data=None, *args, **kwargs)[source]

Bases: PivotTable

Table for handling COSMIC Single Base Substitution (SBS) signature data.

Inherits from PivotTable and provides a convenience class method for reading signature weight files.

classmethod read_signature(file_path)[source]

Read a signature weight file and return a SignatureTable.

Parameters:: file_path (str) – Path to a tab-separated signature file where rows are signatures and columns are mutation contexts.
Return type:: SignatureTable
Returns:: SignatureTable – Transposed table with signatures as columns.

CancerCellFractionTable

class pymaftools.core.CancerCellFractionTable.CancerCellFractionTable[source]

Bases: object

Handler for cancer cell fraction (CCF) data from clonal analysis tools.

Provides methods for reading PyClone output and producing sorted PivotTable objects with cluster annotations.

static pyclone_to_sorted_table(filepath)[source]

Read PyClone output and create a sorted PivotTable.

Reads a tab-separated PyClone results file, pivots the data into a mutation-by-sample matrix of cellular prevalence values, and sorts mutations by cluster mean CCF (descending).

Parameters:: filepath (str) – Path to a PyClone results file (tab-separated) containing at minimum the columns mutation_id, sample_id, cellular_prevalence, and cluster_id.
Return type:: PivotTable
Returns:: PivotTable – Sorted table with mutations as rows and samples as columns. Feature metadata includes mean_ccf, cluster, and cluster_text (e.g. “major”, “minor1”, “minor2”, …).

PairwiseMatrix

class pymaftools.core.PairwiseMatrix.PairwiseMatrix(data=None, index=None, columns=None, dtype=None, copy=None)[source]

Bases: DataFrame

Base class for pairwise matrices.

A pd.DataFrame subclass that represents a symmetric pairwise matrix (e.g., co-occurrence or similarity) between samples or features.

class pymaftools.core.PairwiseMatrix.CooccurrenceMatrix(data=None, index=None, columns=None, dtype=None, copy=None)[source]

Bases: PairwiseMatrix

Matrix of pairwise co-occurrence counts between features.

A PairwiseMatrix subclass where each cell (i, j) stores the co-occurrence frequency or count between feature i and feature j.

class pymaftools.core.PairwiseMatrix.SimilarityMatrix(data=None, index=None, columns=None, dtype=None, copy=None)[source]

Bases: PairwiseMatrix

Matrix of pairwise similarity scores between samples.

A PairwiseMatrix subclass where each cell (i, j) stores the similarity score (e.g., cosine similarity) between sample i and sample j. Provides methods for group-level similarity analysis, permutation testing, statistical comparison of group pairs, and visualization including heatmaps and network conversion.

get_mean_group_similarity(groups, group_order=None)[source]

Compute mean similarity between every pair of groups.

Parameters:

groups (pd.Series or np.ndarray) – Group label for each sample, aligned with the matrix indices.
group_order (array-like of str, optional) – Ordered list of unique group labels. If None, derived from groups.unique().

Return type:

DataFrame

Returns:

pd.DataFrame – Square DataFrame of shape (n_groups, n_groups) containing the mean pairwise similarity between each pair of groups.

generate_permutation_list(groups, group_order, n_permutations=1000)[source]

Generate group similarity matrices under random label permutations.

Parameters:

groups (pd.Series) – Group label for each sample.
group_order (array-like of str) – Ordered list of unique group labels.
n_permutations (int, default=1000) – Number of permutations to perform.

Return type:

list[DataFrame]

Returns:

list of pd.DataFrame – Each element is a group-mean similarity matrix computed from a random permutation of the group labels.

static calculate_group_similarity_pvalues(true_group_similarity, permutation_list, group_order, tail='right')[source]

Calculate permutation p-values for each pairwise group similarity.

Parameters:

true_group_similarity (pd.DataFrame) – Observed group-mean similarity matrix.
permutation_list (list of pd.DataFrame) – Permuted group-mean similarity matrices from generate_permutation_list().
group_order (array-like of str) – Ordered list of unique group labels.
tail ({'right', 'left', 'two'}, default='right') – Direction of the test. 'right' tests whether the observed value is greater than expected by chance.

Return type:

DataFrame

Returns:

pd.DataFrame – Matrix of p-values with the same shape as true_group_similarity.

Raises:

ValueError – If tail is not one of 'right', 'left', or 'two'.

static plot_group_heatmap(result_df, title, cmap='Blues', tick_size=14, fontsize=14, annot_size=14, mask_lower_triangle=True, ax=None, save_path=None, dpi=300)[source]

Plot a heatmap of group affinity matrix.

Parameters:

result_df (pd.DataFrame) – Group affinity matrix to plot.
title (str) – Title for the heatmap.
cmap (str, default='Blues') – Colormap for the heatmap.
tick_size (int, default=14) – Size of tick labels.
fontsize (int, default=14) – Font size for title.
annot_size (int, default=14) – Font size for annotations.
mask_lower_triangle (bool, default=True) – Whether to mask the lower triangle.
ax (matplotlib.axes.Axes, optional) – Existing axes to plot on.
save_path (str, optional) – Path to save the figure.
dpi (int, default=300) – DPI for saved figure.

Return type:

None

Examples

>>> AffinityMatrix.plot_group_heatmap(group_matrix, "Group Similarities")

plot_similarity_matrix(groups, figsize=(20, 20), group_cmap={'ASC': 'green', 'LUAD': 'orange', 'LUSC': 'blue'}, title='Cosine Similarity', cmap='coolwarm', ax=None, save_path=None, dpi=300)[source]

Plot the similarity matrix with group annotations.

Parameters:

groups (pd.Series) – Group labels for each sample.
figsize (tuple of int, default=(20, 20)) – Figure size as (width, height).
group_cmap (dict, default={'LUAD': 'orange', 'ASC': 'green', 'LUSC': 'blue'}) – Color mapping for groups.
title (str, default='Cosine Similarity') – Title for the plot.
cmap (str, default='coolwarm') – Colormap for the similarity matrix.
ax (matplotlib.axes.Axes, optional) – Existing axes to plot on.
save_path (str, optional) – Path to save the figure.
dpi (int, default=300) – DPI for saved figure.

Return type:

None

Examples

>>> groups = pd.Series(['A', 'A', 'B', 'B'])
>>> affinity_matrix.plot_similarity_matrix(groups, title="Sample Similarities")

compare_group_pairs(groups, pair1, pair2)[source]

Perform statistical test comparing affinity between two group pairs.

Parameters:

groups (pd.Series) – Group labels for each sample.
pair1 (tuple of str) – First group pair to compare (group1, group2).
pair2 (tuple of str) – Second group pair to compare (group1, group2).

Return type:

tuple[float, float]

Returns:

stat (float) – Mann-Whitney U test statistic.
p_value (float) – P-value of the test.

Examples

>>> stat, p = affinity_matrix.compare_group_pairs(
...     groups, ('A', 'B'), ('A', 'C')
... )

to_edges_dataframe(label, freq_threshold=0.1)[source]

Convert affinity matrix to edge list format for network analysis.

Parameters:

label (str) – Label to assign to all edges.
freq_threshold (float, default=0.1) – Minimum frequency threshold for edge inclusion.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with columns: source, target, frequency, label. Self-loops are removed.

Examples

>>> edges_df = affinity_matrix.to_edges_dataframe('similarity', 0.2)

to_networkx_graph(label, freq_threshold=0.1)[source]

Convert affinity matrix to NetworkX graph for network analysis.

Parameters:

label (str) – Label to assign to all edges.
freq_threshold (float, default=0.1) – Minimum frequency threshold for edge inclusion.

Return type:

MultiGraph

Returns:

nx.MultiGraph – NetworkX graph with frequency and label as edge attributes.

Examples

>>> graph = affinity_matrix.to_networkx_graph('similarity', 0.2)
>>> print(f"Graph has {graph.number_of_nodes()} nodes")

static plot_permutation_distribution(permutation_list, true_result_df, group1, group2, figsize=(6, 4), save_path=None, dpi=300)[source]

Plot the distribution of permuted values vs. the true observed value.

Parameters:

permutation_list (list of pd.DataFrame) – List of permuted affinity matrices.
true_result_df (pd.DataFrame) – True observed affinity matrix.
group1 (str) – First group name.
group2 (str) – Second group name.
figsize (tuple of int, default=(6, 4)) – Figure size as (width, height).
save_path (str, optional) – Path to save the figure.
dpi (int, default=300) – DPI for saved figure.

Return type:

None

Examples

>>> AffinityMatrix.plot_permutation_distribution(
...     perm_list, true_matrix, 'A', 'B'
... )

plot_similarity(groups, figsize=(20, 20), group_cmap={'ASC': 'green', 'LUAD': 'orange', 'LUSC': 'blue'}, title=None, cmap='coolwarm', ax=None, save_path=None, dpi=300, title_fontsize=20)[source]

Plot the similarity matrix with a group-color annotation bar.

Parameters:

groups (pd.Series) – Group label for each sample.
figsize (tuple of int, default=(20, 20)) – Figure size as (width, height).
group_cmap (dict of str to str) – Mapping from group name to color.
title (str, optional) – Title displayed above the heatmap.
cmap (str, default='coolwarm') – Colormap for the similarity heatmap.
ax (tuple of matplotlib.axes.Axes, optional) – Pre-existing axes as (ax_heatmap, ax_colorbar, ax_groupbar).
save_path (str, optional) – Path to save the figure.
dpi (int, default=300) – Resolution for the saved figure.
title_fontsize (int, default=20) – Font size for the title.

Return type:

None

static plot_heatmap(result_df, title, cmap='Blues', tick_size=14, fontsize=14, annot_size=14, mask_lower_triangle=True, ax=None, save_path=None, dpi=300, show_only_x_ticks=False, annot=True)[source]

Plot a heatmap of a group similarity or p-value matrix.

Parameters:

result_df (pd.DataFrame) – Square matrix to visualize.
title (str) – Title for the heatmap.
cmap (str, default='Blues') – Colormap for the heatmap.
tick_size (int, default=14) – Font size for tick labels.
fontsize (int, default=14) – Font size for the title.
annot_size (int, default=14) – Font size for cell annotations.
mask_lower_triangle (bool, default=True) – Whether to mask the lower triangle.
ax (matplotlib.axes.Axes, optional) – Existing axes to plot on.
save_path (str, optional) – Path to save the figure.
dpi (int, default=300) – Resolution for the saved figure.
show_only_x_ticks (bool, default=False) – If True, hide y-axis tick labels.
annot (bool, default=True) – Whether to annotate cells with numeric values.

Return type:

None

static analyze_similarity(table, groups, group_order, method, title=None, layout='grid', similarity_cmap='coolwarm', group_cmap={'ASC': 'green', 'LUAD': 'orange', 'LUSC': 'blue'}, group_avg_cmap='Blues', group_pvalues_cmap='Reds_r', save_dir='./figures/Similarity', dpi=300, file_format='tiff', heatmap_show_only_x_ticks=False, heatmap_annot=True, utest_group_pairs=[('LUAD', 'ASC'), ('ASC', 'LUSC')], annot_size=14)[source]

Run a full similarity analysis pipeline and produce a composite figure.

Computes the similarity matrix from table, calculates group-level means and permutation p-values, performs optional Mann-Whitney U tests between specified group pairs, and saves a multi-panel figure.

Parameters:

table (object) – Data table with a compute_similarity(method=...) method and a sample_metadata attribute.
groups (pd.Series or np.ndarray) – Group label for each sample.
group_order (array-like of str) – Ordered list of unique group labels.
method (str) – Similarity method passed to table.compute_similarity.
title (str, optional) – Title for the figure; also used to derive the output filename.
layout ({'grid', 'horizontal'}, default='grid') – Panel arrangement of the composite figure.
similarity_cmap (str, default='coolwarm') – Colormap for the full similarity matrix.
group_cmap (dict of str to str) – Mapping from group name to color for the annotation bar.
group_avg_cmap (str, default='Blues') – Colormap for the group-mean similarity heatmap.
group_pvalues_cmap (str, default='Reds_r') – Colormap for the permutation p-value heatmap.
save_dir (str, default='./figures/Similarity') – Directory to save the figure.
dpi (int, default=300) – Resolution for the saved figure.
file_format (str, default='tiff') – Output image format (e.g., 'tiff', 'png').
heatmap_show_only_x_ticks (bool, default=False) – If True, hide y-axis tick labels on group heatmaps.
heatmap_annot (bool, default=True) – Whether to annotate group heatmap cells.
utest_group_pairs (list of tuple of str, optional) – Two group pairs for a Mann-Whitney U test comparison.
annot_size (int, default=14) – Font size for heatmap annotations.

Return type:

dict[str, Any]

Returns:

dict – Dictionary with keys 'similarity_matrix', 'group_similarity', 'pval_matrix', 'pairwise_utest_p', 'pair1', and 'pair2'.

get_pairs_subset(groups, pair1, pair2)[source]

Extract similarity sub-matrices for two group pairs.

Parameters:

groups (pd.Series or np.ndarray) – Group label for each sample.
pair1 (tuple of str) – First group pair (group_a, group_b).
pair2 (tuple of str) – Second group pair (group_a, group_b).

Return type:

tuple[DataFrame, DataFrame]

Returns:

pair1_subset (pd.DataFrame) – Sub-matrix of similarities between the groups in pair1.
pair2_subset (pd.DataFrame) – Sub-matrix of similarities between the groups in pair2.

paired_similarity_utest(groups, pair1, pair2)[source]

Compare similarity distributions of two group pairs with a U test.

Performs a two-sample Mann-Whitney U test on the flattened similarity values of pair1 versus pair2.

Parameters:

groups (pd.Series or np.ndarray) – Group label for each sample.
pair1 (tuple of str) – First group pair (group_a, group_b).
pair2 (tuple of str) – Second group pair (group_a, group_b).

Return type:

tuple[float, float]

Returns:

stat (float) – Mann-Whitney U test statistic.
p (float) – Two-sided p-value.

Clustering

pymaftools.core.Clustering.table_to_distance(table)[source]

Convert a PivotTable to a distance matrix.

Parameters:: table (PivotTable) – Input data table with samples and features.
Return type:: ndarray
Returns:: numpy.ndarray – Distance matrix computed as 1 minus cosine similarity.

pymaftools.core.Clustering.k_fold_clustering_evaluation(table, min_clusters=2, max_clusters=50, metric='cosine', random_state=42, group_col='subtype')[source]

Evaluate the optimal number of clusters using K-fold cross-validation.

Parameters:

table (PivotTable) – Gene expression or CNV data table.
min_clusters (int, optional) – Minimum number of clusters to evaluate, by default 2.
max_clusters (int, optional) – Maximum number of clusters to evaluate, by default 50.
metric ({'cosine', 'hamming', 'jaccard'}, optional) – Similarity metric to use, by default ‘cosine’.
random_state (int, optional) – Random seed for reproducibility, by default 42.
group_col (str, optional) – Column name in sample metadata used for grouping, by default ‘subtype’.

Return type:

tuple[DataFrame, dict[int, dict[int, ndarray]]]

Returns:

pd.DataFrame – DataFrame containing silhouette scores for each fold and cluster count.
dict[int, dict[int, numpy.ndarray]] – Mapping of cluster count k to fold-wise cluster labels.

pymaftools.core.Clustering.align_clusters(ref_labels, target_labels, n_clusters)[source]

Align target cluster labels to reference labels using the Hungarian algorithm.

Parameters:

ref_labels (numpy.ndarray) – Reference cluster labels to align against.
target_labels (numpy.ndarray) – Target cluster labels to be remapped.
n_clusters (int) – Number of clusters.

Return type:

ndarray

Returns:

numpy.ndarray – Remapped target labels aligned to the reference labeling.

pymaftools.core.Clustering.align_cluster_label_dict(cluster_label_dict)[source]

Align cluster labels across folds using fold 1 as the reference.

Parameters:: cluster_label_dict (dict[int, dict[int, numpy.ndarray]]) – Mapping of cluster count k to a dict of fold number to label array. Structure: {k: {fold: labels}}.
Return type:: dict[int, DataFrame]
Returns:: dict[int, pd.DataFrame] – Mapping of each k to an aligned DataFrame (samples x folds).

pymaftools.core.Clustering.convert_ndarray_to_list(obj)[source]

Recursively convert all numpy.ndarray values in a nested structure to lists.

Parameters:: obj (object) – Input object, typically a dict, list, or numpy.ndarray.
Return type:: object
Returns:: object – The same structure with all numpy.ndarray instances replaced by lists.

pymaftools.core.Clustering.calculate_ari_matrix(aligned_cluster_label_dict, k)[source]

Compute the pairwise Adjusted Rand Index (ARI) matrix across folds.

Parameters:

aligned_cluster_label_dict (dict[int, pd.DataFrame]) – Aligned cluster labels, as returned by align_cluster_label_dict.
k (int) – Number of clusters to evaluate.

Return type:

DataFrame

Returns:

pd.DataFrame – Square DataFrame of pairwise ARI scores between folds.

pymaftools.core.Clustering.plot_ari_matrix(aligned_cluster_label_dict, k)[source]

Plot the upper-triangle ARI heatmap for a given cluster count k.

Parameters:

aligned_cluster_label_dict (dict[int, pd.DataFrame]) – Aligned cluster labels, as returned by align_cluster_label_dict.
k (int) – Number of clusters to visualize.

Return type:

None

pymaftools.core.Clustering.run_random_forest_cv(X, y, feature_names, n_splits=5, random_state=42, n_estimators=100)[source]

Run stratified K-fold cross-validated Random Forest classification.

Parameters:

X (numpy.ndarray) – Feature matrix of shape (n_samples, n_features).
y (numpy.ndarray) – Target labels of shape (n_samples,).
feature_names (list[str]) – Names corresponding to each feature column in X.
n_splits (int, optional) – Number of CV folds, by default 5.
random_state (int, optional) – Random seed, by default 42.
n_estimators (int, optional) – Number of trees in the forest, by default 100.

Return type:

tuple[RandomForestClassifier, list[float], DataFrame]

Returns:

RandomForestClassifier – The last trained model.
list[float] – Accuracy scores for each fold.
pd.DataFrame – Feature importances per fold and their mean.

pymaftools.core.Clustering.run_random_forest_multiple_seeds(X, y, feature_names, seeds=range(0, 5), n_estimators=100)[source]

Train Random Forest classifiers with multiple random seeds on the full dataset.

Parameters:

X (numpy.ndarray) – Feature matrix of shape (n_samples, n_features).
y (array-like) – Target labels.
feature_names (list[str]) – Names corresponding to each feature column in X.
seeds (range or list[int], optional) – Random seeds to iterate over, by default range(5).
n_estimators (int, optional) – Number of trees in each forest, by default 100.

Return type:

tuple[list[RandomForestClassifier], DataFrame]

Returns:

list[RandomForestClassifier] – Trained models, one per seed.
pd.DataFrame – Feature importances per seed and their mean.

pymaftools.core.Clustering.plot_cluster_feature_importance_boxplot(table, importance_cols, top_n=20)[source]

Draw a bar and box plot of the top N cluster feature importances.

Parameters:

table (pd.DataFrame) – DataFrame containing cluster information with importance scores. Must include a mean_importance column, plus unique_chr_arm and gene_count for axis labels.
importance_cols (list[str]) – Column names for per-fold importance scores.
top_n (int, optional) – Number of top clusters to display, by default 20.

Return type:

None

pymaftools.core.Clustering.plot_cluster_feature_importance(table, importance_cols, top_n=20)[source]

Plot top N cluster feature importances as bar (mean) and scatter (per-fold).

Parameters:

table (pd.DataFrame) – DataFrame containing cluster information with importance scores. Must include mean_importance, unique_chr_arm, and gene_count columns.
importance_cols (list[str]) – Column names for per-fold importance scores.
top_n (int, optional) – Number of top clusters to display, by default 20.

Return type:

None

pymaftools.core.Clustering.run_feature_clustering(table, result_path, max_clusters=200)[source]

Run agglomerative clustering on features for a range of cluster counts.

Parameters:

table (PivotTable) – Input data table with features as rows and samples as columns.
result_path (str) – File path to save the resulting CSV of silhouette scores.
max_clusters (int, optional) – Maximum number of clusters to evaluate, by default 200.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with columns n_clusters and silhouette for each k.

pymaftools.core.Clustering.plot_clustering_metrics_and_find_best_k(metric_df, filename, title=None, target_col='mean_silhouette', dpi=300, bbox_inches='tight', transparent=True, format=None, **kwargs)[source]

Plot silhouette and ARI metrics across cluster counts and find the best k.

Parameters:

metric_df (pd.DataFrame) – DataFrame indexed by cluster count with per-fold silhouette columns (fold1_silhouette … fold5_silhouette) and mean_ari_5_fold.
filename (str) – Output file path for the saved figure.
title (str or None, optional) – Plot title. If None, no title is displayed.
target_col (str, optional) – Column name to maximize for selecting the best k, by default 'mean_silhouette'.
dpi (int, optional) – Resolution in dots per inch, by default 300.
bbox_inches (str, optional) – Bounding box setting for saving, by default 'tight'.
transparent (bool, optional) – Whether the background is transparent, by default True.
format (str or None, optional) – Output format. Inferred from filename extension if None.
**kwargs – Additional keyword arguments passed to fig.savefig.

Return type:

int

Returns:

int – The cluster count k that maximizes target_col.

pymaftools.core.Clustering.gpt_known_genes_summary(client, genes, arm, cancer_type='lung cancer')[source]

Query GPT-4 for well-known genes in a given chromosomal arm and cancer type.

Parameters:

client (object) – OpenAI client instance with a chat.completions.create method.
genes (list[str]) – List of gene names to evaluate.
arm (str) – Chromosomal arm where the genes are located (e.g., '3p').
cancer_type (str, optional) – Cancer type context for the query, by default 'lung cancer'.

Return type:

tuple[str, str]

Returns:

str – GPT-4 response text listing notable genes and reasons.
str – The prompt that was sent to the model.