Core Modules
MAF
- class pymaftools.core.MAF.MAF(*args, **kwargs)[source]
Bases: DataFrame
A pandas DataFrame subclass for Mutation Annotation Format (MAF) files.
Provides methods to read, filter, merge, and convert MAF data commonly used in cancer genomics pipelines.
- index_col = ['Hugo_Symbol', 'Start_Position', 'End_Position', 'Reference_Allele', 'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2']
- vaild_variant_classfication = ['Frame_Shift_Del', 'Frame_Shift_Ins', 'In_Frame_Del', 'In_Frame_Ins', 'Missense_Mutation', 'Nonsense_Mutation', 'Silent', 'Splice_Site', 'Translation_Start_Site', 'Nonstop_Mutation', "3'UTR", "3'Flank", "5'UTR", "5'Flank", 'IGR', 'Intron', 'RNA', 'Targeted_Region']
- nonsynonymous_types = ['Frame_Shift_Del', 'Frame_Shift_Ins', 'In_Frame_Del', 'In_Frame_Ins', 'Missense_Mutation', 'Nonsense_Mutation', 'Splice_Site', 'Translation_Start_Site', 'Nonstop_Mutation']
- classmethod read_maf(maf_path, sample_ID, preffix='', suffix='')[source]
Read a MAF file and return a MAF object.
- Parameters:
maf_path (str or os.PathLike) – Path to the MAF file (first row is skipped as a comment line).
sample_ID (str) – Sample identifier to assign to all rows.
preffix (str, default "") – Prefix prepended to the sample ID.
suffix (str, default "") – Suffix appended to the sample ID.
- Return type:
MAF
- Returns:
MAF – A MAF DataFrame with a composite index built from index_col.
- static merge_mutations(column)[source]
Merge multiple mutations for a single gene–sample pair.
If all values are False the result is False. When more than one non-false mutation exists the result is "Multi_Hit", following the maftools convention (see maftools issue #347).
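The merging rule can be sketched in plain Python. This is only an illustration of the behaviour described above, not the library's implementation: the helper name `merge_mutations_sketch` is hypothetical, and passing a single mutation through unchanged is an assumption inferred from the description.

```python
def merge_mutations_sketch(values):
    """Collapse the mutation calls for one gene-sample pair (illustrative only)."""
    # Keep only real calls; False marks "no mutation".
    hits = [v for v in values if v is not False]
    if not hits:
        return False        # every value was False
    if len(hits) == 1:
        return hits[0]      # assumption: a single call is passed through as-is
    return "Multi_Hit"      # more than one non-false mutation


print(merge_mutations_sketch([False, False]))
print(merge_mutations_sketch([False, "Missense_Mutation"]))
print(merge_mutations_sketch(["Missense_Mutation", "Nonsense_Mutation"]))
```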
- to_pivot_table()[source]
Create a gene-by-sample pivot table of variant classifications.
- Return type:
PivotTable
- Returns:
PivotTable – Pivot table with genes as rows, samples as columns, and variant classifications (or "Multi_Hit"/False) as values.
- to_mutation_table()[source]
Create a mutation-level pivot table.
Each row corresponds to a unique mutation (composite index) rather than a gene, providing finer resolution than to_pivot_table().
- Return type:
PivotTable
- Returns:
PivotTable – Pivot table indexed by individual mutations.
- property mutations_count: Series
Count the number of mutations per sample.
- Returns:
pd.Series – Series indexed by sample ID with mutation counts as values.
- sort_by_chrom()[source]
Sort rows by genomic coordinates.
- Return type:
- Returns:
MAF – MAF sorted by Chromosome, Start_Position, and End_Position.
- classmethod read_csv(csv_path, sep='\\t', reindex=False)[source]
Read a CSV/TSV file into a MAF object.
- Parameters:
csv_path (str or os.PathLike) – Path to the CSV or TSV file.
sep (str, default "\t") – Column delimiter.
reindex (bool, default False) – If True, rebuild the composite index from index_col after reading. Otherwise the first column is used as the index.
- Return type:
- Returns:
MAF – MAF constructed from the file contents.
- to_csv(csv_path, **kwargs)[source]
Write the MAF to a CSV/TSV file.
Default behaviour writes a tab-separated file with the index included.
- Parameters:
csv_path (str or os.PathLike) – Destination file path.
**kwargs (Any) – Additional keyword arguments forwarded to pd.DataFrame.to_csv.
- Return type:
- to_MAF(maf_path, **kwargs)[source]
Write the data as a standard MAF file (no index column).
- Parameters:
maf_path (str or os.PathLike) – Destination file path.
**kwargs (Any) – Additional keyword arguments forwarded to pd.DataFrame.to_csv.
- Return type:
- to_base_change_pivot_table()[source]
Build a base-change pivot table with transition/transversion stats.
Only SNPs are considered. The returned PivotTable has base-change categories as rows and samples as columns, with ti, tv, and ti/tv ratio stored in sample_metadata.
- Return type:
PivotTable
- Returns:
PivotTable – Pivot table of base-change counts with ti/tv metadata.
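For readers unfamiliar with the ti/tv statistic, the sketch below classifies SNPs as transitions (purine to purine, or pyrimidine to pyrimidine) or transversions and computes their ratio for one sample. The function names are hypothetical and the handling of a zero transversion count is an assumption; the library's own categories and metadata layout may differ.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_base_change(ref, alt):
    """Return 'ti' for a transition and 'tv' for a transversion.

    Assumes a true SNP, i.e. ref != alt and both are one of A, C, G, T.
    """
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "ti" if same_class else "tv"


def ti_tv_summary(snps):
    """Count transitions and transversions for (ref, alt) pairs of one sample."""
    counts = Counter(classify_base_change(ref, alt) for ref, alt in snps)
    ti, tv = counts["ti"], counts["tv"]
    # Assumption: the ratio is reported as NaN when there are no transversions.
    ratio = ti / tv if tv else float("nan")
    return {"ti": ti, "tv": tv, "ti/tv": ratio}


example_snps = [("C", "T"), ("G", "A"), ("C", "A"), ("T", "G")]
print(ti_tv_summary(example_snps))
```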
- get_protein_info(gene)[source]
Extract protein mutation information for a given gene.
- Parameters:
gene (str) – Hugo gene symbol to query.
- Return type:
- Returns:
AA_length (int or None) – Total amino-acid length of the protein, or None if unavailable.
mutations_data (list[dict]) – List of dicts with keys "position", "type", and "count" describing nonsynonymous mutations.
- static get_domain_info(gene_name, AA_length, protein_domains_path=None)[source]
Look up protein domain annotations for a gene.
- Parameters:
gene_name (str) – HGNC gene symbol.
AA_length (int) – Amino-acid length used to match the correct transcript.
protein_domains_path (str, os.PathLike, or None, default None) – Path to a protein domains CSV. When None, the bundled dataset (derived from maftools) is used.
- Return type:
- Returns:
domains (list[dict]) – List of dicts with "Start", "End", and "Label" keys.
refseq_id (str) – The RefSeq transcript ID used.
- Raises:
ValueError – If no domain information is found for the given gene and length.
- write_maf(file_path)[source]
Write the MAF to a tab-separated file without the index.
- Parameters:
file_path (str or os.PathLike) – Destination file path.
- Return type:
- write_SigProfilerMatrixGenerator_format(output_path)[source]
Convert and write the MAF in SigProfilerMatrixGenerator format.
Renames columns to match the SigProfilerMatrixGenerator standard and filters to rows whose Variant_Type is SNP, INS, or DEL.
- Parameters:
output_path (str or os.PathLike) – Destination TSV file path.
- Return type:
PivotTable
PivotTable Module
Extended pandas DataFrame for bioinformatics analysis with integrated metadata support. Specifically designed for mutation analysis and genomic data visualization.
- class pymaftools.core.PivotTable.PivotTable(data=None, *args, **kwargs)[source]
Bases: DataFrame
Enhanced pandas DataFrame for bioinformatics analysis.
A specialized DataFrame that maintains synchronized metadata for both features (rows, typically genes/mutations) and samples (columns). Designed for genomic data analysis with built-in support for mutation frequency calculations, statistical testing, and visualization.
- feature_metadata
Metadata for features (genes/mutations/signatures), indexed by feature names.
- Type:
pd.DataFrame
- sample_metadata
Metadata for samples, indexed by sample names.
- Type:
pd.DataFrame
Examples
>>> # Create a PivotTable from mutation data
>>> data = pd.DataFrame({'sample1': [True, False], 'sample2': [False, True]},
...                     index=['TP53', 'KRAS'])
>>> table = PivotTable(data)
>>> table.feature_metadata['freq'] = table.add_freq().feature_metadata['freq']
- property plot: PivotTablePlot
Access plotting functionality for the PivotTable.
- Returns:
PivotTablePlot – Plotting interface providing various visualization methods.
Examples
>>> # PCA plot colored by subtype
>>> pivot_table.plot.plot_pca_samples(group_col="subtype")

>>> # Boxplot with statistical annotations
>>> pivot_table.plot.plot_boxplot_with_annot(
...     test_col="TMB",
...     group_col="subtype"
... )
- rename_index_and_columns(index_name='feature', columns_name='sample')[source]
Rename the index and columns of the PivotTable.
- Parameters:
- Return type:
- Returns:
PivotTable – PivotTable with renamed index and columns.
- classmethod read_sqlite(db_path)[source]
Load PivotTable from SQLite database format.
- Parameters:
db_path (str) – Path to the SQLite database file.
- Return type:
- Returns:
PivotTable – Loaded PivotTable with metadata.
- to_hierarchical_clustering(method='ward', metric='euclidean')[source]
Perform hierarchical clustering on both features and samples.
Computes hierarchical clustering linkage matrices for both the feature dimension (genes/mutations) and sample dimension using scipy’s linkage function. This enables creation of dendrograms and clustermaps for data visualization and pattern discovery.
- Parameters:
method (str, default 'ward') –
Linkage algorithm to use. Options include:
’ward’: Minimizes within-cluster variance (requires euclidean metric)
’single’: Nearest point algorithm
’complete’: Farthest point algorithm
’average’: UPGMA algorithm
’weighted’: WPGMA algorithm
’centroid’: UPGMC algorithm
’median’: WPGMC algorithm
metric (str, default 'euclidean') –
Distance metric for clustering. Common options:
’euclidean’: Standard Euclidean distance
’manhattan’: L1 distance
’cosine’: Cosine distance
’correlation’: Correlation distance
’hamming’: Hamming distance (for binary data)
’jaccard’: Jaccard distance (for binary data)
- Return type:
- Returns:
Dict[str, np.ndarray] – Dictionary containing linkage matrices:
- 'gene_linkage' : np.ndarray of shape (n_features-1, 4)
Linkage matrix for features (genes), where each row represents a merge operation in the clustering tree
- 'sample_linkage' : np.ndarray of shape (n_samples-1, 4)
Linkage matrix for samples, where each row represents a merge operation in the clustering tree
Notes
The linkage matrices returned follow scipy’s format where each row contains [cluster1_id, cluster2_id, distance, cluster_size].
For binary mutation data, consider using ‘hamming’ or ‘jaccard’ metrics. The ‘ward’ method works only with ‘euclidean’ metric.
Examples
>>> # Basic hierarchical clustering
>>> clustering = pivot_table.to_hierarchical_clustering()
>>> gene_linkage = clustering['gene_linkage']
>>> sample_linkage = clustering['sample_linkage']

>>> # Using Jaccard distance for binary mutation data
>>> clustering = pivot_table.to_hierarchical_clustering(
...     method='average',
...     metric='jaccard'
... )

>>> # Create dendrogram from results
>>> from scipy.cluster.hierarchy import dendrogram
>>> import matplotlib.pyplot as plt
>>>
>>> plt.figure(figsize=(10, 6))
>>> dendrogram(clustering['gene_linkage'])
>>> plt.title('Gene Clustering Dendrogram')
>>> plt.show()
See also
scipy.cluster.hierarchy.linkage : The underlying clustering function
scipy.cluster.hierarchy.dendrogram : For visualizing clustering results
seaborn.clustermap : For creating clustered heatmaps
- copy(deep=True)[source]
Make a copy of this object’s indices and data.
Creates a deep or shallow copy of the PivotTable and its associated feature_metadata and sample_metadata.
- Parameters:
deep (bool, default True) – Whether to make a deep copy or shallow copy.
- Return type:
- Returns:
PivotTable – Copy of the PivotTable with preserved metadata.
Examples
>>> pivot_copy = pivot_table.copy()
>>> pivot_shallow = pivot_table.copy(deep=False)
See also
pandas.DataFrame.copy : The underlying pandas copy method.
- subset(*, features=None, samples=None)[source]
Subset PivotTable by features and/or samples.
This method provides a convenient interface for selecting specific features (rows) and samples (columns) from the PivotTable while preserving metadata alignment.
- Parameters:
features (list, pd.Series, or slice, optional) –
Features (rows) to select. Can be:
- list of str : Feature names to select. Example: ["TP53", "KRAS", "EGFR"]
- pd.Series (bool) : Boolean mask for feature selection. Example: pivot_table.feature_metadata["freq"] > 0.1
- slice : Slice object for feature selection. Example: slice(None, 10) for the first 10 features
- None : Select all features (default)
samples (list, pd.Series, or slice, optional) –
Samples (columns) to select. Can be:
- list of str : Sample names to select. Example: ["sample1", "sample2", "sample3"]
- pd.Series (bool) : Boolean mask for sample selection. Example: pivot_table.sample_metadata["subtype"] == "LUAD"
- slice : Slice object for sample selection. Example: slice(None, 20) for the first 20 samples
- None : Select all samples (default)
- Return type:
- Returns:
PivotTable – Subset PivotTable with synchronized metadata. Only keeps existing labels (inner join behavior).
Notes
This method uses inner join behavior, meaning only existing labels are kept. Missing labels are silently ignored. For outer join behavior that includes missing labels with NaN values, use the reindex method.
Examples
>>> # Select specific genes and samples
>>> subset = pivot_table.subset(
...     features=["TP53", "KRAS"],
...     samples=["sample1", "sample2"]
... )

>>> # Select high-frequency mutations
>>> high_freq = pivot_table.feature_metadata["freq"] > 0.1
>>> frequent_mutations = pivot_table.subset(features=high_freq)

>>> # Select samples by subtype
>>> luad_samples = pivot_table.sample_metadata["subtype"] == "LUAD"
>>> luad_data = pivot_table.subset(samples=luad_samples)
The metadata is subset based on the DataFrame’s index (features) and columns (samples). Missing indices in metadata will result in NaN values in the new PivotTable’s metadata.
See also
PivotTable.reindex : For outer join behavior with missing labels.
PivotTable.__getitem__ : For direct indexing operations.
- reindex(index=None, columns=None, *args, fill_value=nan, feature_fill_value=nan, sample_fill_value=nan, **kwargs)[source]
Conform PivotTable to new index and/or columns with synchronized metadata.
This method extends pandas DataFrame.reindex to also reindex the associated feature_metadata and sample_metadata, maintaining consistency across all components of the PivotTable.
- Parameters:
index (array-like, optional) – New labels for the rows. If None, use existing index.
columns (array-like, optional) – New labels for the columns. If None, use existing columns.
*args (Any) – Additional positional arguments passed to pandas.DataFrame.reindex.
fill_value (scalar, default np.nan) – Value to use for missing values in the main DataFrame.
feature_fill_value (scalar, default np.nan) – Value to use for missing values when reindexing feature_metadata.
sample_fill_value (scalar, default np.nan) – Value to use for missing values when reindexing sample_metadata.
**kwargs (Any) – Additional keyword arguments passed to pandas.DataFrame.reindex.
- Return type:
- Returns:
PivotTable – Reindexed PivotTable with synchronized metadata.
Notes
Unlike the subset method, which uses inner join behavior, this method uses outer join behavior and will include missing labels filled with the specified fill values.
The metadata DataFrames are automatically reindexed to match the new structure of the main DataFrame.
Examples
>>> # Reindex with new features, filling missing with 0
>>> new_features = ["TP53", "KRAS", "NEW_GENE"]
>>> reindexed = pivot_table.reindex(
...     index=new_features,
...     fill_value=0,
...     feature_fill_value="Unknown"
... )

>>> # Reindex with new samples
>>> new_samples = ["sample1", "sample2", "new_sample"]
>>> reindexed = pivot_table.reindex(
...     columns=new_samples,
...     sample_fill_value="Missing"
... )

>>> # Reindex both dimensions
>>> reindexed = pivot_table.reindex(
...     index=new_features,
...     columns=new_samples,
...     fill_value=np.nan
... )
See also
pandas.DataFrame.reindex : The underlying pandas reindex method.
PivotTable.subset : For inner join behavior.
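The difference between the inner join behavior of subset and the outer join behavior of reindex can be illustrated on a single axis with plain dictionaries. This is a conceptual sketch only; `subset_labels` and `reindex_labels` are hypothetical helpers, not part of pymaftools.

```python
import math


def subset_labels(row_values, wanted):
    """Inner-join style: keep only the requested labels that already exist."""
    return {label: row_values[label] for label in wanted if label in row_values}


def reindex_labels(row_values, wanted, fill_value=math.nan):
    """Outer-join style: keep every requested label, filling the missing ones."""
    return {label: row_values.get(label, fill_value) for label in wanted}


mutated = {"TP53": True, "KRAS": False}
wanted = ["TP53", "KRAS", "NEW_GENE"]

print(subset_labels(mutated, wanted))                  # NEW_GENE is silently dropped
print(reindex_labels(mutated, wanted, fill_value=0))   # NEW_GENE is kept and filled
```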
- static merge(tables, fill_value=nan, feature_fill_value=nan, sample_fill_value=nan, join='outer')[source]
Merge multiple PivotTables into a single PivotTable.
Concatenates multiple PivotTable objects along the sample axis (columns) and aligns features (rows) according to the selected join strategy. Metadata (feature_metadata and sample_metadata) is automatically synchronized with the resulting data matrix.
- Parameters:
tables (List[PivotTable]) – List of PivotTable objects to merge. All should have compatible structure.
fill_value (scalar, default np.nan) – Value to use for missing values in the main data matrix after merging.
feature_fill_value (scalar, default np.nan) – Value to use for missing values when reindexing feature_metadata.
sample_fill_value (scalar, default np.nan) – Value to use for missing values when reindexing sample_metadata.
join ({'inner', 'outer'}, default 'outer') –
Strategy for aligning features (rows) across tables:
- 'inner': Keep only features shared by all tables.
- 'outer': Keep all features from all tables (union).
Note: samples (columns) are always unioned.
- Return type:
- Returns:
PivotTable – A new PivotTable with:
- Merged data matrix
- Reindexed feature and sample metadata
- Missing values filled with specified defaults
- Raises:
ValueError – If join is not ‘inner’ or ‘outer’.
ValueError – If sample (column) names overlap across tables.
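The alignment rules described above (samples always unioned, features joined by the chosen strategy, overlapping sample names rejected) can be sketched with nested dictionaries. This is an illustration under stated assumptions rather than the library's code: `merge_tables_sketch` is a hypothetical name, tables are modelled as `{sample: {feature: value}}`, metadata is ignored, and at least one table is assumed.

```python
def merge_tables_sketch(tables, fill_value=None, join="outer"):
    """Merge {sample: {feature: value}} mappings along the sample axis."""
    if join not in ("inner", "outer"):
        raise ValueError("join must be 'inner' or 'outer'")

    # Samples are always unioned, but a sample may only come from one table.
    merged_samples = {}
    for table in tables:
        overlap = merged_samples.keys() & table.keys()
        if overlap:
            raise ValueError(f"overlapping sample names: {sorted(overlap)}")
        merged_samples.update(table)

    # Features are aligned by intersection (inner) or union (outer).
    feature_sets = [
        {feature for column in table.values() for feature in column}
        for table in tables
    ]
    if join == "inner":
        features = set.intersection(*feature_sets)
    else:
        features = set.union(*feature_sets)

    return {
        sample: {f: column.get(f, fill_value) for f in sorted(features)}
        for sample, column in merged_samples.items()
    }


table_a = {"s1": {"TP53": True, "KRAS": False}}
table_b = {"s2": {"TP53": False, "EGFR": True}}
print(merge_tables_sketch([table_a, table_b], fill_value=False))
print(merge_tables_sketch([table_a, table_b], join="inner"))
```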
Examples
>>> # Outer merge (default): keeps all features
>>> merged = PivotTable.merge([table_A, table_B])

>>> # Inner merge: keeps only shared features
>>> merged = PivotTable.merge([table_A, table_B], join='inner')

>>> # Fill missing values with 0
>>> merged = PivotTable.merge([table_A, table_B], fill_value=0)
- calculate_TMB(default_capture_size=40, group_col='subtype', capture_size_dict=None)[source]
- Return type:
- calculate_feature_frequency()[source]
Calculate mutation frequency for each feature.
Computes the frequency of each feature (gene/mutation/signature) across all samples by converting the data to binary format and calculating the proportion of samples with mutations for each feature.
Treats any non-False value as indicating the presence of a mutation, effectively converting the data to binary (mutated/not mutated) before frequency calculation.
- Returns:
pd.Series – Mutation frequency for each feature, indexed by feature names. Values range from 0.0 (no mutations in any sample) to 1.0 (mutations in all samples).
Notes
This method is equivalent to the following steps:
1. Convert PivotTable to binary format using to_binary_table()
2. Sum mutations across samples (axis=1)
3. Divide by total number of samples
The frequency represents the proportion of samples that have a mutation for each feature, regardless of the specific mutation type or value.
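The three steps in these notes can be written out without pandas to make the arithmetic explicit. This is a sketch, not the library's implementation: `feature_frequency_sketch` is a hypothetical name and the input is modelled as a mapping from feature name to a list of per-sample values.

```python
def feature_frequency_sketch(rows):
    """Map each feature to the fraction of samples carrying any mutation.

    `rows` maps a feature name to its per-sample values; False means
    "not mutated" and any other value counts as mutated.
    """
    frequencies = {}
    for feature, values in rows.items():
        binary = [value is not False for value in values]  # step 1: binarize
        mutated = sum(binary)                               # step 2: sum across samples
        frequencies[feature] = mutated / len(binary)        # step 3: divide by sample count
    return frequencies


rows = {
    "TP53": ["Missense_Mutation", False, "Multi_Hit", False],
    "KRAS": [False, False, False, "Nonsense_Mutation"],
}
print(feature_frequency_sketch(rows))
```

Because the data is binarized first, a "Multi_Hit" entry contributes exactly as much as a single mutation.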
Examples
>>> # Create example mutation data
>>> data = pd.DataFrame({
...     'sample1': [True, False, True],
...     'sample2': [False, True, True],
...     'sample3': [True, True, False]
... }, index=['TP53', 'KRAS', 'EGFR'])
>>> table = PivotTable(data)
>>> frequencies = table.calculate_feature_frequency()
>>> print(frequencies)
TP53    0.666667
KRAS    0.666667
EGFR    0.666667
dtype: float64

>>> # Frequency shows proportion of samples with each mutation
>>> print(f"TP53 is mutated in {frequencies['TP53']:.1%} of samples")
TP53 is mutated in 66.7% of samples
See also
to_binary_table : Convert PivotTable to binary mutation format
add_freq : Add frequency columns to feature_metadata
filter_by_freq : Filter features by frequency threshold
- add_freq(groups={})[source]
Add mutation frequency columns to feature_metadata.
Calculates overall mutation frequency and optionally group-specific frequencies for all features, adding these as new columns to the feature_metadata DataFrame. This enables frequency-based filtering and analysis operations.
- Parameters:
groups (Dict[str, PivotTable], default {}) –
Dictionary mapping group names to PivotTable objects for calculating group-specific mutation frequencies. Each PivotTable should represent a subset of samples belonging to a specific group (e.g., cancer subtypes, treatment groups, etc.).
Example: {“LUAD”: luad_table, “LUSC”: lusc_table, “Control”: control_table}
- Return type:
- Returns:
PivotTable – A new PivotTable (copy) with frequency columns added to feature_metadata:
- "{group_name}_freq" : float
Frequency for each group specified in the groups dictionary
- "freq" : float
Overall frequency across all samples in the current PivotTable
- Raises:
TypeError – If any value in the groups dictionary is not a PivotTable instance.
Notes
The frequency calculation treats any non-False value as indicating mutation presence. Frequencies are calculated as:
frequency = (number of mutated samples) / (total number of samples)
Group-specific frequencies are calculated independently for each group’s PivotTable, while the overall frequency uses all samples in the current PivotTable.
Examples
>>> # Add overall frequency only
>>> table_with_freq = pivot_table.add_freq()
>>> print(table_with_freq.feature_metadata.columns)
Index(['freq'], dtype='object')

>>> # Add group-specific frequencies
>>> luad_subset = pivot_table.subset(samples=luad_sample_mask)
>>> lusc_subset = pivot_table.subset(samples=lusc_sample_mask)
>>> groups = {"LUAD": luad_subset, "LUSC": lusc_subset}
>>> table_with_freq = pivot_table.add_freq(groups=groups)
>>> print(table_with_freq.feature_metadata.columns)
Index(['LUAD_freq', 'LUSC_freq', 'freq'], dtype='object')

>>> # Use frequencies for filtering
>>> high_freq_features = table_with_freq.filter_by_freq(threshold=0.1)
>>> luad_specific = table_with_freq[
...     (table_with_freq.feature_metadata['LUAD_freq'] > 0.2) &
...     (table_with_freq.feature_metadata['LUSC_freq'] < 0.05)
... ]
See also
calculate_feature_frequency : Calculate frequency for current PivotTable
filter_by_freq : Filter features by frequency threshold
sort_features : Sort features by metadata columns including frequency
- sort_features(by='freq', ascending=False)[source]
Sort features (rows) by a column in feature_metadata.
- Parameters:
- Return type:
- Returns:
PivotTable – New PivotTable with features sorted by the specified column.
- Raises:
ValueError – If the specified column is not found in feature_metadata.
- sort_samples_by_mutations(top=10)[source]
Sort samples by their mutation patterns.
Uses a binary encoding approach where mutation patterns of the top mutated features are converted to integers for sorting.
- Parameters:
top (int, default 10) – Number of top features to consider for sorting.
- Return type:
- Returns:
PivotTable – New PivotTable with samples sorted by mutation patterns. The mutation weight is added to sample_metadata.
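One way to picture the binary encoding is shown below. The sketch assumes that the most frequent feature maps to the most significant bit and that samples are sorted by descending weight; neither detail is confirmed by this reference, and the helper names are hypothetical.

```python
def mutation_weight(sample_calls, top_features):
    """Encode a sample's calls on the top features as an integer.

    The first feature in `top_features` becomes the most significant bit
    (an assumption made for this illustration).
    """
    weight = 0
    for feature in top_features:
        bit = 1 if sample_calls.get(feature, False) is not False else 0
        weight = (weight << 1) | bit
    return weight


def sort_samples_sketch(table, top_features):
    """Return sample names ordered by descending mutation weight, plus the weights."""
    weights = {s: mutation_weight(calls, top_features) for s, calls in table.items()}
    order = sorted(table, key=lambda s: weights[s], reverse=True)
    return order, weights


top_features = ["TP53", "KRAS", "EGFR"]  # assumed to be sorted by frequency already
table = {
    "s1": {"TP53": False, "KRAS": True, "EGFR": True},   # 0b011
    "s2": {"TP53": True, "KRAS": False, "EGFR": False},  # 0b100
    "s3": {"TP53": True, "KRAS": True, "EGFR": False},   # 0b110
}
order, weights = sort_samples_sketch(table, top_features)
print(order, weights)
```

With this encoding, a sample mutated in the top feature always sorts ahead of one that is not, which is what produces the staircase pattern typical of oncoplots.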
- sort_samples_by_group(group_col, group_order, top=10)[source]
Sort samples by group membership and then by mutation patterns.
First sorts samples according to the specified group order, then within each group, applies mutation-based sorting using sort_samples_by_mutations. This creates a hierarchical sorting where group membership is the primary sort key and mutation patterns are the secondary key.
- Parameters:
group_col (str) – The column name in sample_metadata containing group information (e.g., “subtype”, “treatment”, “stage”).
group_order (List[str]) – The desired order of groups for sample arrangement. Groups will be ordered as specified in this list.
top (int, default 10) – The number of top features (highest frequency) to consider when sorting samples by mutation patterns within each group.
- Return type:
- Returns:
PivotTable – A new PivotTable with samples sorted first by group membership, then by mutation patterns within each group.
- Raises:
ValueError – If the specified group_col is not found in sample_metadata.
Notes
This method is useful for creating organized visualizations where you want to group samples by a specific criterion (e.g., cancer subtype) while maintaining mutation-based ordering within each group.
The mutation-based sorting within groups uses the sort_samples_by_mutations method, which converts mutation patterns to binary encodings for sorting.
Examples
>>> # Sort samples by cancer subtype, then by mutation patterns
>>> sorted_table = pivot_table.sort_samples_by_group(
...     group_col="subtype",
...     group_order=["LUAD", "LUSC", "SCLC"],
...     top=15
... )

>>> # Sort by treatment response, considering top 20 mutations
>>> sorted_table = pivot_table.sort_samples_by_group(
...     group_col="response",
...     group_order=["Complete", "Partial", "Stable", "Progressive"],
...     top=20
... )
See also
sort_samples_by_mutations : Sort samples by mutation patterns only
sort_features : Sort features by metadata columns
subset : Select specific samples or features
- PCA(to_binary)[source]
Perform Principal Component Analysis on the PivotTable.
- Parameters:
to_binary (bool) – Whether to convert the data to binary format before PCA.
- Return type:
- Returns:
tuple –
pca_result_df : pd.DataFrame with PC1 and PC2 for each sample
explained_variance : np.ndarray of variance ratios for PC1 and PC2
pca : sklearn.decomposition.PCA fitted object
- head(n=50)[source]
Return the first n features (rows) subset of the PivotTable.
- Parameters:
n (int, default 50) – Number of features to return.
- Return type:
- Returns:
PivotTable – PivotTable subset containing only the first n features.
- tail(n=50)[source]
Return the last n features (rows) subset of the PivotTable.
- Parameters:
n (int, default 50) – Number of features to return.
- Return type:
- Returns:
PivotTable – PivotTable subset containing only the last n features.
- filter_by_freq(threshold=0.05)[source]
Filter features by their mutation frequency.
- Parameters:
threshold (float, default 0.05) – Minimum frequency threshold (0 to 1).
- Return type:
- Returns:
PivotTable – PivotTable containing only features with freq >= threshold.
- Raises:
ValueError – If ‘freq’ column is not found in feature_metadata.
- filter_by_variance(threshold=None, method='var', quantile=None)[source]
Filter features by variance or median absolute deviation.
- Parameters:
threshold (float, optional) – Minimum variance/MAD threshold. Features with scores >= threshold are kept. If None, quantile must be specified.
method ({"var", "mad"}, default "var") – Dispersion metric:
- "var": variance
- "mad": median absolute deviation
quantile (float, optional) – Quantile cutoff (0–1). E.g. quantile=0.75 keeps the top 25% most variable features. Overrides threshold when both are given.
- Return type:
- Returns:
PivotTable – Filtered PivotTable with dispersion scores in feature_metadata.
- Raises:
ValueError – If neither threshold nor quantile is specified, or if method is not supported.
- filter_by_statistical_test(group_col, method='kruskal', alpha=0.05)[source]
Filter features by a statistical test across sample groups.
For each feature, performs the chosen test on groups defined by group_col in sample_metadata, applies FDR correction, and returns only features with adjusted p-value < alpha.
- Parameters:
group_col (str) – Column in sample_metadata defining sample groups.
method ({"ttest", "mann_whitney", "kruskal", "anova"}, default "kruskal") – Statistical test. ttest and mann_whitney require exactly two groups; kruskal and anova support two or more.
alpha (float, default 0.05) – Significance threshold applied after FDR correction.
- Return type:
- Returns:
PivotTable – Filtered PivotTable with p_value and adjusted_p_value columns added to feature_metadata.
- Raises:
ValueError – If method is unsupported or group count doesn’t match the test.
- to_cooccur_matrix(freq=True)[source]
Convert to co-occurrence matrix format.
- Parameters:
freq (bool, default True) – If True, normalize by sample count to get frequencies. If False, return raw co-occurrence counts.
- Return type:
- Returns:
CooccurrenceMatrix – Matrix showing feature co-occurrence patterns.
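The effect of the freq flag can be seen in a small pure-Python sketch that counts, for every pair of features, the samples in which both are mutated. The function name and the dictionary-based layout are hypothetical; the library returns a CooccurrenceMatrix object instead.

```python
from itertools import product


def cooccurrence_sketch(rows, freq=True):
    """Count samples in which both features of each pair are mutated.

    `rows` maps a feature name to a list of booleans, one per sample.
    With freq=True the counts are divided by the number of samples.
    """
    n_samples = len(next(iter(rows.values())))
    matrix = {}
    for a, b in product(rows, repeat=2):
        both = sum(1 for x, y in zip(rows[a], rows[b]) if x and y)
        matrix[(a, b)] = both / n_samples if freq else both
    return matrix


rows = {
    "TP53": [True, True, False, True],
    "KRAS": [True, False, False, True],
}
print(cooccurrence_sketch(rows, freq=False))
print(cooccurrence_sketch(rows, freq=True))
```

In this sketch the diagonal entries reduce to each feature's own mutation count (or frequency).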
- to_binary_table()[source]
Convert PivotTable to binary format.
Converts all non-False values to True, creating a binary representation of the mutation data.
- Return type:
- Returns:
PivotTable – Bool PivotTable where True indicates mutation presence.
- mutation_enrichment_test(group_col, group1, group2, alpha=0.05, minimum_mutations=2, method='chi2')[source]
Perform statistical enrichment test for mutations between two groups.
Tests whether specific mutations are significantly enriched in one group compared to another using either Chi-squared test or Fisher’s exact test. Multiple testing correction is applied using the Benjamini-Hochberg method.
- Parameters:
group_col (str) – Column name in sample_metadata that contains group assignments.
group1 (str) – Name of the first group to compare.
group2 (str) – Name of the second group to compare.
alpha (float, default 0.05) – Significance level for multiple testing correction.
minimum_mutations (int, default 2) – Minimum number of mutations required in either group to include a feature in the analysis.
method ({"chi2", "fisher"}, default "chi2") – Statistical test method to use:
- "chi2": Chi-squared test of independence
- "fisher": Fisher’s exact test
- Return type:
DataFrame
- Returns:
pd.DataFrame – Results DataFrame with the following columns:
- "{group1}_True": Count of mutated samples in group1
- "{group1}_False": Count of non-mutated samples in group1
- "{group2}_True": Count of mutated samples in group2
- "{group2}_False": Count of non-mutated samples in group2
- "p_value": Raw p-values from statistical test
- "adjusted_p_value": FDR-corrected p-values
- "is_significant": Boolean indicating significance after correction
- "test_method": Method used for testing
- Raises:
ValueError – If unsupported statistical method is specified.
Notes
The method creates 2x2 contingency tables for each feature:
              Group1   Group2
Mutated          a        b
Not mutated      c        d
Features with fewer than minimum_mutations in both groups are excluded to avoid testing rare mutations that may not be statistically meaningful.
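The two mechanical parts of this procedure, building the 2x2 counts for one feature and applying the Benjamini-Hochberg adjustment to a list of raw p-values, can be sketched in plain Python. The helper names are hypothetical and the statistical test itself is omitted; the library relies on scipy and statsmodels for the real computation.

```python
def contingency_counts(mutated_flags, groups, group1, group2):
    """Return (a, b, c, d) for one feature, laid out as in the table above."""
    a = b = c = d = 0
    for is_mutated, group in zip(mutated_flags, groups):
        if group == group1:
            if is_mutated:
                a += 1
            else:
                c += 1
        elif group == group2:
            if is_mutated:
                b += 1
            else:
                d += 1
    return a, b, c, d


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


flags = [True, False, True, True, False]
groups = ["LUAD", "LUAD", "LUSC", "LUAD", "LUSC"]
print(contingency_counts(flags, groups, "LUAD", "LUSC"))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A feature would then be flagged as significant when its adjusted p-value falls below alpha.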
Examples
>>> # Test for mutations enriched in LUAD vs LUSC
>>> results = pivot_table.mutation_enrichment_test(
...     group_col="subtype",
...     group1="LUAD",
...     group2="LUSC",
...     method="fisher"
... )
>>> significant = results[results["is_significant"]]
>>> print(f"Found {len(significant)} significantly enriched mutations")
See also
scipy.stats.chi2_contingency : Chi-squared test implementation
scipy.stats.fisher_exact : Fisher’s exact test implementation
statsmodels.stats.multitest.multipletests : Multiple testing correction
- compute_similarity(method='cosine')[source]
Compute sample similarity matrix using specified metric.
- Parameters:
method ({"cosine", "hamming", "jaccard", "pearson", "spearman", "kendall"}, default "cosine") – Similarity metric to use.
- Return type:
- Returns:
SimilarityMatrix – Pairwise similarity matrix between samples.
- Raises:
ValueError – If unsupported similarity method is specified.
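As a rough guide to what two of these metrics measure, the sketch below computes cosine and Jaccard similarity between two binary sample vectors. The function names are hypothetical, and returning 0.0 for an all-zero vector is an assumption; the library's handling of that edge case is not documented here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: similarity with an all-zero vector is 0
    return dot / (norm_u * norm_v)


def jaccard_similarity(u, v):
    """Shared mutated features divided by features mutated in either sample."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return both / either if either else 0.0


sample1 = [1, 1, 0, 1]  # mutation calls across four features
sample2 = [1, 0, 0, 1]
print(round(cosine_similarity(sample1, sample2), 4))
print(round(jaccard_similarity(sample1, sample2), 4))
```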
- order(group_col, group_order)[source]
Reorder samples by group membership.
- Parameters:
- Return type:
- Returns:
PivotTable – PivotTable with samples ordered by group membership.
- static prepare_data(maf)[source]
Prepare and process MAF data into a sorted PivotTable.
Filters MAF for nonsynonymous mutations, converts to PivotTable, adds frequency calculations, and sorts by feature frequency and sample mutation patterns.
- Parameters:
maf (MAF) – Input MAF object containing mutation data.
- Return type:
- Returns:
PivotTable – Processed and sorted PivotTable ready for analysis.
- add_sample_metadata(sample_metadata, fill_value=None, force=False)[source]
Safely add sample metadata to the PivotTable.
This method ensures that:
1. Only samples existing in the PivotTable are added
2. Existing columns are not overwritten unless forced
3. Type consistency is maintained
- Parameters:
- Return type:
- Returns:
PivotTable – PivotTable with updated sample metadata.
- Raises:
ValueError – If sample names don’t match or columns conflict without force=True.
Examples
>>> # Add new metadata columns
>>> new_meta = pd.DataFrame({
...     'age': [65, 72, 58],
...     'stage': ['I', 'II', 'III']
... }, index=['sample1', 'sample2', 'sample3'])
>>> table_with_meta = table.add_sample_metadata(new_meta)
- add_feature_metadata(feature_metadata, fill_value=None, force=False)[source]
Safely add feature metadata to the PivotTable.
This method ensures that:
1. Only features existing in the PivotTable are added
2. Existing columns are not overwritten unless forced
3. Type consistency is maintained
- Parameters:
- Return type:
- Returns:
PivotTable – PivotTable with updated feature metadata.
- Raises:
ValueError – If feature names don’t match or columns conflict without force=True.
Examples
>>> # Add gene annotation metadata
>>> gene_anno = pd.DataFrame({
...     'chromosome': ['17', '12', '3'],
...     'gene_type': ['tumor_suppressor', 'oncogene', 'oncogene']
... }, index=['TP53', 'KRAS', 'PIK3CA'])
>>> table_with_anno = table.add_feature_metadata(gene_anno)
- pymaftools.core.PivotTable.capture_size(bed_path)[source]
Calculate the total capture size (in megabases) from a BED file.
The BED file must have at least three columns: chrom, start, end.
- Return type:
- Parameters:
bed_path (str): Path to the BED file.
- Returns:
float: Total capture region size in megabases (Mb).
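A minimal sketch of the calculation, assuming the total is simply the sum of interval lengths divided by one million. BED intervals are 0-based and half-open, so each length is end minus start. Whether the library merges overlapping intervals first is not stated here; this sketch does not, and `capture_size_sketch` is a hypothetical name that takes lines rather than a file path.

```python
def capture_size_sketch(bed_lines):
    """Sum BED interval lengths and return the total in megabases."""
    total_bp = 0
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        total_bp += int(end) - int(start)  # half-open interval length
    return total_bp / 1e6


bed_lines = [
    "chr1\t100\t1100",
    "chr1\t5000\t6000",
    "chr2\t0\t500000",
]
print(capture_size_sketch(bed_lines))
```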
Cohort
- class pymaftools.core.Cohort.Cohort(name, description='')[source]
Bases: object
- add_sample_metadata(new_metadata, source='')[source]
Add or merge sample metadata into the cohort.
- Parameters:
new_metadata (pd.DataFrame) – DataFrame containing sample metadata, indexed by sample ID.
source (str, optional) – Name of the source providing the metadata, used in error messages, by default “”.
- Raises:
TypeError – If new_metadata is not a pandas DataFrame.
ValueError – If the index of new_metadata does not match the existing cohort index, or if shared columns have conflicting values.
- Return type:
- add_table(table, table_name)[source]
Add a PivotTable to the cohort.
- Parameters:
table (PivotTable) – The PivotTable to add.
table_name (str) – Name to assign to the table within the cohort.
- Raises:
TypeError – If table is not an instance of PivotTable.
- Return type:
- info()[source]
Return a summary string of the Cohort structure.
- Return type:
- Returns:
str – A tree-formatted summary showing each table’s dimensions and metadata counts.
- to_sql_registry()[source]
Generate a registry DataFrame for SQL table mapping.
Creates a mapping between logical table names and their corresponding SQL table names for data, sample metadata, and feature metadata.
- Return type:
DataFrame
- Returns:
pd.DataFrame – Registry with columns: sql_table_name, cohort_name, table_name, type
- to_sqlite(db_path)[source]
Save Cohort to SQLite database format.
Deprecated since version 0.4.0: to_sqlite will be removed in a future version. Use to_hdf5() instead, which supports larger datasets without column limits.
- classmethod read_sqlite(db_path)[source]
Load Cohort from SQLite database format.
Deprecated since version 0.4.0: read_sqlite will be removed in a future version. Use read_hdf5() instead.
CopyNumberVariationTable
- class pymaftools.core.CopyNumberVariationTable.CopyNumberVariationTable(data=None, *args, **kwargs)[source]
Bases: PivotTable
Table for storing and manipulating copy number variation (CNV) data.
Inherits from PivotTable and provides specialized methods for reading GISTIC output files, thresholding continuous copy number values, sorting by chromosomal position, clustering, and plotting CNV frequencies.
The data matrix is oriented with genomic features (genes or chromosome arms) as rows and samples as columns. Associated feature_metadata and sample_metadata DataFrames carry annotation such as cytoband, chromosome, arm, thresholds, and sample type.
See also
PivotTable – Base class providing generic pivot-table operations.
- classmethod from_pivot_table(table)[source]
Create a CopyNumberVariationTable object from a PivotTable object, preserving all metadata.
- Parameters:
table (PivotTable) – A PivotTable object containing sample_metadata and feature_metadata attributes.
- Return type:
- Returns:
CopyNumberVariationTable – A CopyNumberVariationTable object with original sample_metadata and feature_metadata preserved.
- classmethod read_gistic_arm_level(file_path)[source]
Read a GISTIC arm-level (broad) results file.
- Parameters:
file_path (str) – Path to the GISTIC arm-level results file.
- Returns:
CopyNumberVariationTable – A CopyNumberVariationTable object with arm-level copy number data.
- classmethod read_gistic_gene_level(file_path, feature_columns=['Gene Symbol', 'Gene ID', 'Cytoband'], samples=None)[source]
Read GISTIC results file and create a CopyNumberVariationTable object.
This method reads GISTIC output files (typically all_data_by_genes.txt or all_thresholded.by_genes.txt) and converts them into a CopyNumberVariationTable object with properly formatted feature and sample metadata.
- Parameters:
file_path (str) – Path to the GISTIC results file (tab-separated format).
feature_columns (list of str, default ["Gene Symbol", "Gene ID", "Cytoband"]) – List of column names to be treated as feature metadata. These columns will be separated from the main data matrix.
samples (None or list of str, optional) – List of sample names to subset. If None, all samples are kept. Only samples present in both the data and this list will be retained.
- Returns:
CopyNumberVariationTable – A CopyNumberVariationTable object containing:
- Main data matrix with gene symbols as index
- feature_metadata with gene information and parsed chromosome data
- sample_metadata with case_ID and sample_type extracted from column names
- Raises:
ValueError – If ‘Gene Symbol’ column is not found in the input file.
Notes
The method performs several data processing steps:
1. Removes ‘.call’ suffix from column names
2. Separates feature metadata from data columns
3. Parses sample names to extract case_ID and sample_type (split by last ‘_’)
4. Parses Cytoband information into Chromosome, Arm, and Band columns
5. Subsets data to specified samples if provided
The Cytoband parsing supports both numeric chromosomes (1-22) and sex chromosomes (X, Y) using the pattern: chromosome + arm (p/q) + band.
Examples
>>> cnv = CopyNumberVariationTable.read_gistic_gene_level('data/all_data_by_genes.txt')
>>> cnv = CopyNumberVariationTable.read_gistic_gene_level('data/all_thresholded.by_genes.txt',
...     feature_columns=['Gene Symbol', 'Gene ID', 'Cytoband', 'Locus ID'])
>>> cnv = CopyNumberVariationTable.read_gistic_gene_level('data/all_data_by_genes.txt',
...     samples=['LUAD_001_T', 'LUAD_002_T'])
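The name-parsing steps listed in the Notes can be illustrated with a small standalone sketch. The regular expression and both helper names (`parse_cytoband`, `split_sample_name`) are assumptions made for illustration; the library's actual pattern and edge-case handling may differ.

```python
import re

# Hypothetical pattern: chromosome (1-22, X, Y) + arm (p/q) + optional band.
CYTOBAND_RE = re.compile(
    r"^(?P<chrom>[1-9]|1[0-9]|2[0-2]|X|Y)(?P<arm>[pq])(?P<band>\d+(?:\.\d+)?)?$"
)


def parse_cytoband(cytoband):
    """Split a cytoband such as '17p13.1' into (chromosome, arm, band)."""
    match = CYTOBAND_RE.match(cytoband)
    if match is None:
        raise ValueError(f"Unrecognized cytoband: {cytoband!r}")
    return match.group("chrom"), match.group("arm"), match.group("band") or ""


def split_sample_name(column):
    """Drop a trailing '.call' and split the name at the last underscore."""
    name = column[: -len(".call")] if column.endswith(".call") else column
    case_id, _, sample_type = name.rpartition("_")
    return case_id, sample_type


print(parse_cytoband("17p13.1"))             # ('17', 'p', '13.1')
print(parse_cytoband("Xq28"))                # ('X', 'q', '28')
print(split_sample_name("LUAD_001_T.call"))  # ('LUAD_001', 'T')
```

In this sketch a name without an underscore yields an empty case ID; how the library treats such names is not stated in the documentation above.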
- sort_by_chromosome(ascending=True)[source]
Sort CopyNumberVariationTable data by chromosomal position.
Sorts the CopyNumberVariationTable data by chromosome number, arm (p/q), and band position. Handles both numeric chromosomes (1-22) and sex chromosomes (X, Y).
- Parameters:
ascending (bool, default True) – Whether to sort in ascending order. If False, sorts in descending order.
- Return type:
- Returns:
CopyNumberVariationTable – A new CopyNumberVariationTable object with features sorted by chromosomal position.
Notes
The sorting order is:
1. Chromosome: 1, 2, …, 22, X, Y
2. Arm: p (short arm) before q (long arm)
3. Band: numerical order (e.g., 11.1, 11.2, 12.1)
Requires the feature_metadata to have ‘Chromosome’, ‘Arm’, and ‘Band’ columns, which are typically created by read_gistic_gene_level() when parsing Cytoband information.
Examples
>>> cnv_sorted = cnv_table.sort_by_chromosome()
>>> cnv_desc = cnv_table.sort_by_chromosome(ascending=False)
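The ordering described in the Notes amounts to a three-part sort key. The sketch below shows one way to express it in plain Python; `chromosome_sort_key` is a hypothetical helper, and treating a missing band as 0 is an assumption of this example rather than documented behavior.

```python
def chromosome_sort_key(chromosome, arm, band):
    """Return a tuple that sorts by chromosome, then arm, then band."""
    chrom_order = {str(i): i for i in range(1, 23)}
    chrom_order.update({"X": 23, "Y": 24})  # sex chromosomes come after 22
    arm_order = {"p": 0, "q": 1}            # short arm before long arm
    band_value = float(band) if band else 0.0
    return chrom_order[chromosome], arm_order[arm], band_value


features = [
    ("X", "q", "28"),
    ("2", "p", "11.2"),
    ("1", "q", "21.1"),
    ("1", "p", "36.33"),
    ("2", "p", "11.1"),
]
ordered = sorted(features, key=lambda f: chromosome_sort_key(*f))
print(ordered)
# [('1', 'p', '36.33'), ('1', 'q', '21.1'), ('2', 'p', '11.1'), ('2', 'p', '11.2'), ('X', 'q', '28')]
```

Passing `reverse=True` to `sorted` gives the descending order that corresponds to `ascending=False`.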
- to_thresholded_cnv()[source]
Convert continuous CNV values to discrete thresholded categories.
Each value is mapped to one of five integer levels based on per-sample thresholds stored in sample_metadata: -2 (deep deletion), -1 (shallow deletion), 0 (neutral), +1 (low-level gain), +2 (high-level amplification).
- Return type:
- Returns:
CopyNumberVariationTable – A new table with the same shape where every cell contains an integer in {-2, -1, 0, 1, 2}.
- Raises:
KeyError – If sample_metadata does not contain the required threshold columns: del_high_threshold, del_low_threshold, amp_low_threshold, amp_high_threshold.
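A minimal sketch of the five-level mapping for a single sample follows. It rests on two assumptions that the documentation above does not confirm: that `del_high_threshold` is the more negative (deep deletion) cutoff, and that the comparisons are strict. `threshold_value` is a hypothetical helper, not a library function.

```python
def threshold_value(value, del_high, del_low, amp_low, amp_high):
    """Map one continuous CNV value to an integer level in {-2, -1, 0, 1, 2}.

    Assumes del_high <= del_low < 0 < amp_low <= amp_high (illustrative only).
    """
    if value < del_high:
        return -2  # deep deletion
    if value < del_low:
        return -1  # shallow deletion
    if value > amp_high:
        return 2   # high-level amplification
    if value > amp_low:
        return 1   # low-level gain
    return 0       # neutral


# Example thresholds for one sample (made-up numbers).
cutoffs = {"del_high": -1.0, "del_low": -0.1, "amp_low": 0.1, "amp_high": 1.0}
values = [-1.5, -0.3, 0.0, 0.4, 1.7]
print([threshold_value(v, **cutoffs) for v in values])  # [-2, -1, 0, 1, 2]
```

Because the thresholds are per-sample, the same continuous value can map to different levels in different columns of the table.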
- static read_all_gistic(all_data_by_genes_file, sample_cutoffs_file, all_thresholded_by_genes_file, broad_values_by_arm_file)[source]
Read all GISTIC output files and create CopyNumberVariationTable objects.
- Parameters:
all_data_by_genes_file (str) – Path to the GISTIC all_data_by_genes.txt file.
sample_cutoffs_file (str) – Path to the GISTIC sample_cutoffs.txt file.
all_thresholded_by_genes_file (str) – Path to the GISTIC all_thresholded.by_genes.txt file.
broad_values_by_arm_file (str) – Path to the GISTIC broad_values_by_arm.txt file.
- Returns:
tuple – A tuple containing:
- all_data_by_genes_table : CopyNumberVariationTable
- sample_cutoff_df : pd.DataFrame
- thresholded_cnv_table : CopyNumberVariationTable
- broad_values_by_arm_table : CopyNumberVariationTable
- to_cluster_table(cluster_col='cluster')[source]
Aggregate features by cluster label and return a cluster-level table.
Groups features (rows) according to cluster_col in feature_metadata and computes the mean CNV value per cluster per sample.
- Parameters:
cluster_col (str, default "cluster") – Name of the column in feature_metadata that contains cluster assignments.
- Return type:
- Returns:
CopyNumberVariationTable – A new table whose rows are clusters and whose columns are samples. The feature_metadata of the returned table contains: unique_chr_arm, features (list of original feature names), and features_count.
- Raises:
ValueError – If cluster_col is not found in feature_metadata.
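The aggregation can be pictured with a plain-Python sketch that uses nested dictionaries in place of the table and its feature_metadata. `cluster_means` and the example gene names are illustrative only; the sketch does not reproduce the `unique_chr_arm` column.

```python
from collections import defaultdict
from statistics import mean


def cluster_means(data, clusters):
    """Average feature rows within each cluster.

    data:     {feature: {sample: value}}
    clusters: {feature: cluster_label}
    Returns (cluster_table, cluster_metadata).
    """
    members = defaultdict(list)
    for feature, label in clusters.items():
        members[label].append(feature)

    table, metadata = {}, {}
    for label, features in members.items():
        samples = list(data[features[0]])
        table[label] = {s: mean(data[f][s] for f in features) for s in samples}
        metadata[label] = {"features": features, "features_count": len(features)}
    return table, metadata


data = {
    "TP53": {"S1": -1.0, "S2": 0.0},
    "NF1": {"S1": -0.5, "S2": 0.5},
    "MYC": {"S1": 1.0, "S2": 2.0},
}
clusters = {"TP53": "C1", "NF1": "C1", "MYC": "C2"}
table, metadata = cluster_means(data, clusters)
print(table["C1"])     # {'S1': -0.75, 'S2': 0.25}
print(metadata["C1"])  # {'features': ['TP53', 'NF1'], 'features_count': 2}
```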
- plot_cnv_band_ratio(cluster_id, mode='gain', threshold=0.1, sample_type='T', subtype_order=None, ax=None, cmap=None, show=True, title=None)[source]
Plot gain or loss frequency across cytobands for a specific CNV cluster and sample type.
- Parameters:
cluster_id (str) – Cluster ID to extract features (e.g., “C47” or “C6”).
mode ({"gain", "loss"}) – Type of alteration to compute.
threshold (float) – Threshold for gain or loss (default: 0.1).
sample_type (str) – Sample type to subset (default: “T”).
subtype_order (list of str, optional) – Order of subtypes to show in columns. If None, uses [“LUAD”, “ASC”, “LUSC”].
ax (matplotlib Axes, optional) – If provided, plot on this Axes object.
cmap (str, optional) – Colormap (default: “Reds” for gain, “Blues” for loss).
show (bool) – Whether to show the plot.
title (str, optional) – Title to display on plot.
- Return type:
DataFrame
- Returns:
pd.DataFrame – Cytoband × Subtype frequency table.
- static to_cnv_table(all_sample_df)[source]
Build a CopyNumberVariationTable from a long-format DataFrame.
Pivots all_sample_df so that genes become rows and samples become columns, then attaches gene-level metadata (name, chromosome, start, end). Duplicate gene names are disambiguated by appending the Ensembl gene ID.
- Parameters:
all_sample_df (pd.DataFrame) – Long-format DataFrame with at least the columns gene_id, gene_name, chromosome, start, end, sample_ID, and copy_number.
- Return type:
- Returns:
CopyNumberVariationTable – A table indexed by (unique) gene name with samples as columns.
- pymaftools.core.CopyNumberVariationTable.read_sample_cutoff_file(sample_cutoff_file)[source]
Read the sample cutoff file and extract the amp_thresh and del_thresh values.
- Parameters:
sample_cutoff_file (str) – Path to the sample cutoff file.
- Return type:
DataFrame
- Returns:
pd.DataFrame – DataFrame containing sample cutoff data with amp_threshold and del_threshold columns.
- pymaftools.core.CopyNumberVariationTable.TCGA_sample_type(TCGA_barcode)[source]
Determine the sample type from a TCGA barcode suffix.
Parses the two-digit sample-type code near the end of the barcode and returns a single-character label.
- Parameters:
TCGA_barcode (str) – A TCGA-style barcode ending with a sample-type portion (e.g., "TCGA-XX-XXXX-01A").
- Return type:
- Returns:
str – "T" for tumor (codes 00-09), "N" for normal (codes 10-19), or "C" for control (codes 20-29).
- Raises:
ValueError – If the barcode does not match the expected format or the sample-type code is 30 or above.
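The code ranges above can be sketched as a small standalone function. The regular expression is an assumption about what "expected format" means (two digits plus one vial letter at the end of the barcode), and `tcga_sample_type` is a hypothetical stand-in rather than the library function itself.

```python
import re


def tcga_sample_type(barcode):
    """Return 'T', 'N', or 'C' from the two-digit sample-type code."""
    match = re.search(r"-(\d{2})[A-Z]$", barcode)
    if match is None:
        raise ValueError(f"Unexpected barcode format: {barcode!r}")
    code = int(match.group(1))
    if code <= 9:
        return "T"  # tumor: 00-09
    if code <= 19:
        return "N"  # normal: 10-19
    if code <= 29:
        return "C"  # control: 20-29
    raise ValueError(f"Unsupported sample-type code {code:02d} in {barcode!r}")


print(tcga_sample_type("TCGA-AB-1234-01A"))  # T
print(tcga_sample_type("TCGA-AB-1234-11A"))  # N
```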
- pymaftools.core.CopyNumberVariationTable.get_target_sample_ID(paired_sample_IDs, target_sample_type)[source]
Extract the sample ID matching a target type from a comma-separated list.
- Parameters:
- Return type:
- Returns:
str – The first barcode whose type matches target_sample_type.
- Raises:
ValueError – If no barcode in the list matches the requested type.
- pymaftools.core.CopyNumberVariationTable.read_TCGA_ASCAT3_CNV_file_sheet(file_path, file_suffix='ascat3.gene_level_copy_number.v36.tsv')[source]
Read a TCGA ASCAT3 file sheet and extract tumor/normal sample IDs.
Filters rows whose File Name column contains file_suffix, then derives tumor_sample_ID and normal_sample_ID from the paired Sample ID field.
- Parameters:
- Return type:
DataFrame
- Returns:
pd.DataFrame – The filtered file sheet with added tumor_sample_ID and normal_sample_ID columns.
- pymaftools.core.CopyNumberVariationTable.read_cnv_files(base_dir, file_sheet)[source]
Read and concatenate per-sample CNV files listed in a file sheet.
Iterates over file_sheet, reads each tab-separated CNV file from base_dir, tags rows with the tumor sample ID, drops rows with missing values, and concatenates everything into a single long-format DataFrame.
- Parameters:
base_dir (str) – Directory containing the individual CNV files.
file_sheet (pd.DataFrame) – DataFrame with at least File Name and tumor_sample_ID columns (as produced by read_TCGA_ASCAT3_CNV_file_sheet()).
- Return type:
DataFrame
- Returns:
pd.DataFrame – Concatenated long-format DataFrame of all samples with an added sample_ID column.
ExpressionTable
- class pymaftools.core.ExpressionTable.ExpressionTable(data=None, *args, **kwargs)[source]
Bases: PivotTable
Table for handling RNA expression data.
Inherits from PivotTable and provides specific functionality for gene expression analysis, including cluster-level aggregation.
- to_cluster_table(cluster_col='cluster')[source]
Aggregate expression values by cluster assignment.
Groups features (genes) by the specified cluster column in feature_metadata and computes the mean expression per cluster.
- Parameters:
cluster_col (str, default "cluster") – Column name in feature_metadata containing cluster labels.
- Return type:
- Returns:
ExpressionTable – Cluster-level expression table with aggregated metadata.
- Raises:
ValueError – If cluster_col is not found in feature_metadata.
SignatureTable
- class pymaftools.core.SignatureTable.SignatureTable(data=None, *args, **kwargs)[source]
Bases: PivotTable
Table for handling COSMIC Single Base Substitution (SBS) signature data.
Inherits from PivotTable and provides a convenience class method for reading signature weight files.
- classmethod read_signature(file_path)[source]
Read a signature weight file and return a SignatureTable.
- Parameters:
file_path (str) – Path to a tab-separated signature file where rows are signatures and columns are mutation contexts.
- Return type:
- Returns:
SignatureTable – Transposed table with signatures as columns.
CancerCellFractionTable
- class pymaftools.core.CancerCellFractionTable.CancerCellFractionTable[source]
Bases: object
Handler for cancer cell fraction (CCF) data from clonal analysis tools.
Provides methods for reading PyClone output and producing sorted PivotTable objects with cluster annotations.
- static pyclone_to_sorted_table(filepath)[source]
Read PyClone output and create a sorted PivotTable.
Reads a tab-separated PyClone results file, pivots the data into a mutation-by-sample matrix of cellular prevalence values, and sorts mutations by cluster mean CCF (descending).
- Parameters:
filepath (str) – Path to a PyClone results file (tab-separated) containing at minimum the columns mutation_id, sample_id, cellular_prevalence, and cluster_id.
- Return type:
- Returns:
PivotTable – Sorted table with mutations as rows and samples as columns. Feature metadata includes mean_ccf, cluster, and cluster_text (e.g. “major”, “minor1”, “minor2”, …).
PairwiseMatrix
- class pymaftools.core.PairwiseMatrix.PairwiseMatrix(data=None, index=None, columns=None, dtype=None, copy=None)[source]
Bases: DataFrame
Base class for pairwise matrices.
A pd.DataFrame subclass that represents a symmetric pairwise matrix (e.g., co-occurrence or similarity) between samples or features.
- class pymaftools.core.PairwiseMatrix.CooccurrenceMatrix(data=None, index=None, columns=None, dtype=None, copy=None)[source]
Bases: PairwiseMatrix
Matrix of pairwise co-occurrence counts between features.
A PairwiseMatrix subclass where each cell (i, j) stores the co-occurrence frequency or count between feature i and feature j.
- class pymaftools.core.PairwiseMatrix.SimilarityMatrix(data=None, index=None, columns=None, dtype=None, copy=None)[source]
Bases: PairwiseMatrix
Matrix of pairwise similarity scores between samples.
A PairwiseMatrix subclass where each cell (i, j) stores the similarity score (e.g., cosine similarity) between sample i and sample j. Provides methods for group-level similarity analysis, permutation testing, statistical comparison of group pairs, and visualization including heatmaps and network conversion.
- get_mean_group_similarity(groups, group_order=None)[source]
Compute mean similarity between every pair of groups.
- Parameters:
groups (pd.Series or np.ndarray) – Group label for each sample, aligned with the matrix indices.
group_order (array-like of str, optional) – Ordered list of unique group labels. If None, derived from groups.unique().
- Return type:
DataFrame
- Returns:
pd.DataFrame – Square DataFrame of shape (n_groups, n_groups) containing the mean pairwise similarity between each pair of groups.
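As an illustration of what a group-mean similarity matrix contains, the sketch below averages the cells of a small similarity matrix by group pair using plain lists. `mean_group_similarity` is a hypothetical helper, and excluding self-pairs (the diagonal) from within-group means is an assumption of this sketch; the library may handle the diagonal differently.

```python
from statistics import mean


def mean_group_similarity(sim, groups, group_order=None):
    """Average sim[i][j] over all sample pairs drawn from each pair of groups."""
    if group_order is None:
        group_order = list(dict.fromkeys(groups))  # unique labels, first-seen order
    result = {}
    for g1 in group_order:
        rows = [i for i, g in enumerate(groups) if g == g1]
        result[g1] = {}
        for g2 in group_order:
            cols = [j for j, g in enumerate(groups) if g == g2]
            # Skip i == j so a sample's similarity to itself is not counted.
            values = [sim[i][j] for i in rows for j in cols if i != j]
            result[g1][g2] = mean(values) if values else float("nan")
    return result


sim = [
    [1.0, 0.8, 0.2, 0.4],
    [0.8, 1.0, 0.6, 0.0],
    [0.2, 0.6, 1.0, 0.5],
    [0.4, 0.0, 0.5, 1.0],
]
groups = ["A", "A", "B", "B"]
group_means = mean_group_similarity(sim, groups)
print(group_means["A"]["A"], group_means["B"]["B"])  # 0.8 0.5
```

In this example the A-B entry averages the four cross-group cells (0.2, 0.4, 0.6, 0.0), and the result is symmetric because the input matrix is.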
- generate_permutation_list(groups, group_order, n_permutations=1000)[source]
Generate group similarity matrices under random label permutations.
- Parameters:
- Return type:
list[DataFrame]
- Returns:
list of pd.DataFrame – Each element is a group-mean similarity matrix computed from a random permutation of the group labels.
- static calculate_group_similarity_pvalues(true_group_similarity, permutation_list, group_order, tail='right')[source]
Calculate permutation p-values for each pairwise group similarity.
- Parameters:
true_group_similarity (pd.DataFrame) – Observed group-mean similarity matrix.
permutation_list (list of pd.DataFrame) – Permuted group-mean similarity matrices from generate_permutation_list().
group_order (array-like of str) – Ordered list of unique group labels.
tail ({'right', 'left', 'two'}, default='right') – Direction of the test. 'right' tests whether the observed value is greater than expected by chance.
- Return type:
DataFrame
- Returns:
pd.DataFrame – Matrix of p-values with the same shape as true_group_similarity.
- Raises:
ValueError – If tail is not one of 'right', 'left', or 'two'.
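For a single cell of the group matrix, an empirical permutation p-value can be sketched as below. The exact estimator used by the library is not documented here, so this is an assumption: the sketch uses the simple proportion of permuted values at least as extreme as the observed one, and measures two-sided extremeness as distance from the permutation mean. `permutation_pvalue` is a hypothetical helper.

```python
def permutation_pvalue(observed, permuted, tail="right"):
    """Fraction of permuted values at least as extreme as the observed value."""
    n = len(permuted)
    if tail == "right":
        extreme = sum(v >= observed for v in permuted)
    elif tail == "left":
        extreme = sum(v <= observed for v in permuted)
    elif tail == "two":
        center = sum(permuted) / n
        extreme = sum(abs(v - center) >= abs(observed - center) for v in permuted)
    else:
        raise ValueError("tail must be 'right', 'left', or 'two'")
    return extreme / n


permuted = [1, 2, 3, 4, 5]
print(permutation_pvalue(4.5, permuted, tail="right"))  # 0.2
print(permutation_pvalue(4.5, permuted, tail="left"))   # 0.8
print(permutation_pvalue(4.5, permuted, tail="two"))    # 0.4
```

Some implementations add one to both the numerator and the denominator so that the p-value is never exactly zero; that variant is equally plausible.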
- static plot_group_heatmap(result_df, title, cmap='Blues', tick_size=14, fontsize=14, annot_size=14, mask_lower_triangle=True, ax=None, save_path=None, dpi=300)[source]
Plot a heatmap of group affinity matrix.
- Parameters:
result_df (pd.DataFrame) – Group affinity matrix to plot.
title (str) – Title for the heatmap.
cmap (str, default='Blues') – Colormap for the heatmap.
tick_size (int, default=14) – Size of tick labels.
fontsize (int, default=14) – Font size for title.
annot_size (int, default=14) – Font size for annotations.
mask_lower_triangle (bool, default=True) – Whether to mask the lower triangle.
ax (matplotlib.axes.Axes, optional) – Existing axes to plot on.
save_path (str, optional) – Path to save the figure.
dpi (int, default=300) – DPI for saved figure.
- Return type:
Examples
>>> SimilarityMatrix.plot_group_heatmap(group_matrix, "Group Similarities")
- plot_similarity_matrix(groups, figsize=(20, 20), group_cmap={'ASC': 'green', 'LUAD': 'orange', 'LUSC': 'blue'}, title='Cosine Similarity', cmap='coolwarm', ax=None, save_path=None, dpi=300)[source]
Plot the similarity matrix with group annotations.
- Parameters:
groups (pd.Series) – Group labels for each sample.
figsize (tuple of int, default=(20, 20)) – Figure size as (width, height).
group_cmap (dict, default={'LUAD': 'orange', 'ASC': 'green', 'LUSC': 'blue'}) – Color mapping for groups.
title (str, default='Cosine Similarity') – Title for the plot.
cmap (str, default='coolwarm') – Colormap for the similarity matrix.
ax (matplotlib.axes.Axes, optional) – Existing axes to plot on.
save_path (str, optional) – Path to save the figure.
dpi (int, default=300) – DPI for saved figure.
- Return type:
Examples
>>> groups = pd.Series(['A', 'A', 'B', 'B'])
>>> affinity_matrix.plot_similarity_matrix(groups, title="Sample Similarities")
- compare_group_pairs(groups, pair1, pair2)[source]
Perform statistical test comparing affinity between two group pairs.
- Parameters:
- Return type:
- Returns:
stat (float) – Mann-Whitney U test statistic.
p_value (float) – P-value of the test.
Examples
>>> stat, p = affinity_matrix.compare_group_pairs(
...     groups, ('A', 'B'), ('A', 'C')
... )
- to_edges_dataframe(label, freq_threshold=0.1)[source]
Convert affinity matrix to edge list format for network analysis.
- Parameters:
- Return type:
DataFrame
- Returns:
pd.DataFrame – DataFrame with columns: source, target, frequency, label. Self-loops are removed.
Examples
>>> edges_df = affinity_matrix.to_edges_dataframe('similarity', 0.2)
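The shape of the resulting edge list can be illustrated with a plain-Python sketch. Two details are assumptions because the documentation above does not state them: the sketch keeps cells with values greater than or equal to the threshold, and it emits each unordered pair once (upper triangle only). `to_edges` is a hypothetical helper.

```python
def to_edges(matrix, names, label, freq_threshold=0.1):
    """Turn a square matrix into a list of edge records without self-loops."""
    edges = []
    for i, source in enumerate(names):
        # Start at i + 1: skips the diagonal and visits each pair only once.
        for j in range(i + 1, len(names)):
            value = matrix[i][j]
            if value >= freq_threshold:
                edges.append(
                    {"source": source, "target": names[j],
                     "frequency": value, "label": label}
                )
    return edges


matrix = [
    [1.0, 0.5, 0.05],
    [0.5, 1.0, 0.3],
    [0.05, 0.3, 1.0],
]
edges = to_edges(matrix, ["TP53", "KRAS", "EGFR"], "similarity", freq_threshold=0.2)
for edge in edges:
    print(edge)
# {'source': 'TP53', 'target': 'KRAS', 'frequency': 0.5, 'label': 'similarity'}
# {'source': 'KRAS', 'target': 'EGFR', 'frequency': 0.3, 'label': 'similarity'}
```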
- to_networkx_graph(label, freq_threshold=0.1)[source]
Convert affinity matrix to NetworkX graph for network analysis.
- Parameters:
- Return type:
MultiGraph
- Returns:
nx.MultiGraph – NetworkX graph with frequency and label as edge attributes.
Examples
>>> graph = affinity_matrix.to_networkx_graph('similarity', 0.2)
>>> print(f"Graph has {graph.number_of_nodes()} nodes")
- static plot_permutation_distribution(permutation_list, true_result_df, group1, group2, figsize=(6, 4), save_path=None, dpi=300)[source]
Plot the distribution of permuted values vs. the true observed value.
- Parameters:
permutation_list (list of pd.DataFrame) – List of permuted affinity matrices.
true_result_df (pd.DataFrame) – True observed affinity matrix.
group1 (str) – First group name.
group2 (str) – Second group name.
figsize (tuple of int, default=(6, 4)) – Figure size as (width, height).
save_path (str, optional) – Path to save the figure.
dpi (int, default=300) – DPI for saved figure.
- Return type:
Examples
>>> SimilarityMatrix.plot_permutation_distribution(
...     perm_list, true_matrix, 'A', 'B'
... )
- plot_similarity(groups, figsize=(20, 20), group_cmap={'ASC': 'green', 'LUAD': 'orange', 'LUSC': 'blue'}, title=None, cmap='coolwarm', ax=None, save_path=None, dpi=300, title_fontsize=20)[source]
Plot the similarity matrix with a group-color annotation bar.
- Parameters:
groups (pd.Series) – Group label for each sample.
figsize (tuple of int, default=(20, 20)) – Figure size as (width, height).
group_cmap (dict of str to str) – Mapping from group name to color.
title (str, optional) – Title displayed above the heatmap.
cmap (str, default='coolwarm') – Colormap for the similarity heatmap.
ax (tuple of matplotlib.axes.Axes, optional) – Pre-existing axes as (ax_heatmap, ax_colorbar, ax_groupbar).
save_path (str, optional) – Path to save the figure.
dpi (int, default=300) – Resolution for the saved figure.
title_fontsize (int, default=20) – Font size for the title.
- Return type:
- static plot_heatmap(result_df, title, cmap='Blues', tick_size=14, fontsize=14, annot_size=14, mask_lower_triangle=True, ax=None, save_path=None, dpi=300, show_only_x_ticks=False, annot=True)[source]
Plot a heatmap of a group similarity or p-value matrix.
- Parameters:
result_df (pd.DataFrame) – Square matrix to visualize.
title (str) – Title for the heatmap.
cmap (str, default='Blues') – Colormap for the heatmap.
tick_size (int, default=14) – Font size for tick labels.
fontsize (int, default=14) – Font size for the title.
annot_size (int, default=14) – Font size for cell annotations.
mask_lower_triangle (bool, default=True) – Whether to mask the lower triangle.
ax (matplotlib.axes.Axes, optional) – Existing axes to plot on.
save_path (str, optional) – Path to save the figure.
dpi (int, default=300) – Resolution for the saved figure.
show_only_x_ticks (bool, default=False) – If True, hide y-axis tick labels.
annot (bool, default=True) – Whether to annotate cells with numeric values.
- Return type:
- static analyze_similarity(table, groups, group_order, method, title=None, layout='grid', similarity_cmap='coolwarm', group_cmap={'ASC': 'green', 'LUAD': 'orange', 'LUSC': 'blue'}, group_avg_cmap='Blues', group_pvalues_cmap='Reds_r', save_dir='./figures/Similarity', dpi=300, file_format='tiff', heatmap_show_only_x_ticks=False, heatmap_annot=True, utest_group_pairs=[('LUAD', 'ASC'), ('ASC', 'LUSC')], annot_size=14)[source]
Run a full similarity analysis pipeline and produce a composite figure.
Computes the similarity matrix from table, calculates group-level means and permutation p-values, performs optional Mann-Whitney U tests between specified group pairs, and saves a multi-panel figure.
- Parameters:
table (object) – Data table with a compute_similarity(method=...) method and a sample_metadata attribute.
groups (pd.Series or np.ndarray) – Group label for each sample.
group_order (array-like of str) – Ordered list of unique group labels.
method (str) – Similarity method passed to table.compute_similarity.
title (str, optional) – Title for the figure; also used to derive the output filename.
layout ({'grid', 'horizontal'}, default='grid') – Panel arrangement of the composite figure.
similarity_cmap (str, default='coolwarm') – Colormap for the full similarity matrix.
group_cmap (dict of str to str) – Mapping from group name to color for the annotation bar.
group_avg_cmap (str, default='Blues') – Colormap for the group-mean similarity heatmap.
group_pvalues_cmap (str, default='Reds_r') – Colormap for the permutation p-value heatmap.
save_dir (str, default='./figures/Similarity') – Directory to save the figure.
dpi (int, default=300) – Resolution for the saved figure.
file_format (str, default='tiff') – Output image format (e.g., 'tiff', 'png').
heatmap_show_only_x_ticks (bool, default=False) – If True, hide y-axis tick labels on group heatmaps.
heatmap_annot (bool, default=True) – Whether to annotate group heatmap cells.
utest_group_pairs (list of tuple of str, optional) – Two group pairs for a Mann-Whitney U test comparison.
annot_size (int, default=14) – Font size for heatmap annotations.
- Return type:
- Returns:
dict – Dictionary with keys 'similarity_matrix', 'group_similarity', 'pval_matrix', 'pairwise_utest_p', 'pair1', and 'pair2'.
- get_pairs_subset(groups, pair1, pair2)[source]
Extract similarity sub-matrices for two group pairs.
- Parameters:
- Return type:
tuple[DataFrame, DataFrame]
- Returns:
pair1_subset (pd.DataFrame) – Sub-matrix of similarities between the groups in pair1.
pair2_subset (pd.DataFrame) – Sub-matrix of similarities between the groups in pair2.
- paired_similarity_utest(groups, pair1, pair2)[source]
Compare similarity distributions of two group pairs with a U test.
Performs a two-sample Mann-Whitney U test on the flattened similarity values of pair1 versus pair2.
- Parameters:
- Return type:
- Returns:
stat (float) – Mann-Whitney U test statistic.
p (float) – Two-sided p-value.
Clustering
- pymaftools.core.Clustering.table_to_distance(table)[source]
Convert a PivotTable to a distance matrix.
- Parameters:
table (PivotTable) – Input data table with samples and features.
- Return type:
- Returns:
numpy.ndarray – Distance matrix computed as 1 minus cosine similarity.
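The "1 minus cosine similarity" definition can be sketched without any array library. `cosine_distance_matrix` is a hypothetical helper that works on a list of equal-length vectors; whether the library compares samples or features is not stated above, and the sketch assumes no all-zero vectors (which would divide by zero).

```python
import math


def cosine_distance_matrix(vectors):
    """Return the matrix of 1 - cosine similarity for every pair of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            dist[i][j] = 1.0 - dot / (norms[i] * norms[j])
    return dist


vectors = [[1, 0], [0, 1], [1, 1]]
dist = cosine_distance_matrix(vectors)
print(dist[0][1])            # 1.0 (orthogonal vectors)
print(round(dist[0][2], 4))  # 0.2929
```

Identical directions give a distance of (approximately) 0, orthogonal vectors give 1, and for non-negative data such as mutation indicators the distance stays within [0, 1].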
- pymaftools.core.Clustering.k_fold_clustering_evaluation(table, min_clusters=2, max_clusters=50, metric='cosine', random_state=42, group_col='subtype')[source]
Evaluate the optimal number of clusters using K-fold cross-validation.
- Parameters:
table (PivotTable) – Gene expression or CNV data table.
min_clusters (int, optional) – Minimum number of clusters to evaluate, by default 2.
max_clusters (int, optional) – Maximum number of clusters to evaluate, by default 50.
metric ({'cosine', 'hamming', 'jaccard'}, optional) – Similarity metric to use, by default ‘cosine’.
random_state (int, optional) – Random seed for reproducibility, by default 42.
group_col (str, optional) – Column name in sample metadata used for grouping, by default ‘subtype’.
- Return type:
- Returns:
pd.DataFrame – DataFrame containing silhouette scores for each fold and cluster count.
dict[int, dict[int, numpy.ndarray]] – Mapping of cluster count k to fold-wise cluster labels.
- pymaftools.core.Clustering.align_clusters(ref_labels, target_labels, n_clusters)[source]
Align target cluster labels to reference labels using the Hungarian algorithm.
- Parameters:
ref_labels (numpy.ndarray) – Reference cluster labels to align against.
target_labels (numpy.ndarray) – Target cluster labels to be remapped.
n_clusters (int) – Number of clusters.
- Return type:
- Returns:
numpy.ndarray – Remapped target labels aligned to the reference labeling.
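To show what label alignment achieves, the sketch below finds the relabeling of the target clusters that maximizes agreement with the reference. Note the substitution: instead of the Hungarian algorithm it searches all permutations exhaustively, which gives the same optimal assignment but is only practical for small n_clusters. `align_labels` is a hypothetical helper and assumes labels are integers in 0..n_clusters-1.

```python
from itertools import permutations


def align_labels(ref_labels, target_labels, n_clusters):
    """Relabel target clusters so they agree with the reference as much as possible."""
    # overlap[t][r] = number of samples with target label t and reference label r
    overlap = [[0] * n_clusters for _ in range(n_clusters)]
    for r, t in zip(ref_labels, target_labels):
        overlap[t][r] += 1

    # Exhaustive search over one-to-one mappings target label -> reference label.
    best = max(
        permutations(range(n_clusters)),
        key=lambda p: sum(overlap[t][p[t]] for t in range(n_clusters)),
    )
    return [best[t] for t in target_labels]


ref = [0, 0, 1, 1, 2, 2]
target = [1, 1, 2, 2, 0, 0]  # same partition, different label names
print(align_labels(ref, target, 3))  # [0, 0, 1, 1, 2, 2]
```

For larger cluster counts, an assignment solver applied to the negated overlap matrix avoids the factorial cost of this search.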
- pymaftools.core.Clustering.align_cluster_label_dict(cluster_label_dict)[source]
Align cluster labels across folds using fold 1 as the reference.
- Parameters:
cluster_label_dict (dict[int, dict[int, numpy.ndarray]]) – Mapping of cluster count k to a dict of fold number to label array. Structure: {k: {fold: labels}}.
- Return type:
- Returns:
dict[int, pd.DataFrame] – Mapping of each k to an aligned DataFrame (samples x folds).
- pymaftools.core.Clustering.convert_ndarray_to_list(obj)[source]
Recursively convert all numpy.ndarray values in a nested structure to lists.
- pymaftools.core.Clustering.calculate_ari_matrix(aligned_cluster_label_dict, k)[source]
Compute the pairwise Adjusted Rand Index (ARI) matrix across folds.
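For reference, the Adjusted Rand Index between two labelings, and a fold-by-fold matrix built from it, can be written with the standard library alone. This is a sketch of the standard pair-counting formula, not the library's code; `adjusted_rand_index` and `ari_matrix` are hypothetical names, and the sketch assumes each labeling covers the same samples (at least two) in the same order.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)


def ari_matrix(fold_labels):
    """Pairwise ARI between folds; fold_labels maps fold name -> label list."""
    folds = list(fold_labels)
    return {
        a: {b: adjusted_rand_index(fold_labels[a], fold_labels[b]) for b in folds}
        for a in folds
    }


fold_labels = {
    "fold1": [0, 0, 1, 1],
    "fold2": [1, 1, 0, 0],  # same partition with swapped names
    "fold3": [0, 1, 0, 1],  # a different partition
}
matrix = ari_matrix(fold_labels)
print(matrix["fold1"]["fold2"])  # 1.0
```

The index depends only on the partition, not on the label names, so fold1 and fold2 score 1.0, while fold3 disagrees with both and scores below zero.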
- pymaftools.core.Clustering.plot_ari_matrix(aligned_cluster_label_dict, k)[source]
Plot the upper-triangle ARI heatmap for a given cluster count k.
- pymaftools.core.Clustering.run_random_forest_cv(X, y, feature_names, n_splits=5, random_state=42, n_estimators=100)[source]
Run stratified K-fold cross-validated Random Forest classification.
- Parameters:
X (numpy.ndarray) – Feature matrix of shape (n_samples, n_features).
y (numpy.ndarray) – Target labels of shape (n_samples,).
feature_names (list[str]) – Names corresponding to each feature column in X.
n_splits (int, optional) – Number of CV folds, by default 5.
random_state (int, optional) – Random seed, by default 42.
n_estimators (int, optional) – Number of trees in the forest, by default 100.
- Return type:
tuple[RandomForestClassifier, list[float], DataFrame]
- Returns:
RandomForestClassifier – The last trained model.
list[float] – Accuracy scores for each fold.
pd.DataFrame – Feature importances per fold and their mean.
- pymaftools.core.Clustering.run_random_forest_multiple_seeds(X, y, feature_names, seeds=range(0, 5), n_estimators=100)[source]
Train Random Forest classifiers with multiple random seeds on the full dataset.
- Parameters:
X (numpy.ndarray) – Feature matrix of shape (n_samples, n_features).
y (array-like) – Target labels.
feature_names (list[str]) – Names corresponding to each feature column in X.
seeds (range or list[int], optional) – Random seeds to iterate over, by default range(5).
n_estimators (int, optional) – Number of trees in each forest, by default 100.
- Return type:
tuple[list[RandomForestClassifier], DataFrame]
- Returns:
list[RandomForestClassifier] – Trained models, one per seed.
pd.DataFrame – Feature importances per seed and their mean.
- pymaftools.core.Clustering.plot_cluster_feature_importance_boxplot(table, importance_cols, top_n=20)[source]
Draw a bar and box plot of the top N cluster feature importances.
- Parameters:
table (pd.DataFrame) – DataFrame containing cluster information with importance scores. Must include a mean_importance column, plus unique_chr_arm and gene_count for axis labels.
importance_cols (list[str]) – Column names for per-fold importance scores.
top_n (int, optional) – Number of top clusters to display, by default 20.
- Return type:
- pymaftools.core.Clustering.plot_cluster_feature_importance(table, importance_cols, top_n=20)[source]
Plot top N cluster feature importances as bar (mean) and scatter (per-fold).
- Parameters:
table (pd.DataFrame) – DataFrame containing cluster information with importance scores. Must include mean_importance, unique_chr_arm, and gene_count columns.
importance_cols (list[str]) – Column names for per-fold importance scores.
top_n (int, optional) – Number of top clusters to display, by default 20.
- Return type:
- pymaftools.core.Clustering.run_feature_clustering(table, result_path, max_clusters=200)[source]
Run agglomerative clustering on features for a range of cluster counts.
- Parameters:
table (PivotTable) – Input data table with features as rows and samples as columns.
result_path (str) – File path to save the resulting CSV of silhouette scores.
max_clusters (int, optional) – Maximum number of clusters to evaluate, by default 200.
- Return type:
DataFrame
- Returns:
pd.DataFrame – DataFrame with columns n_clusters and silhouette for each k.
- pymaftools.core.Clustering.plot_clustering_metrics_and_find_best_k(metric_df, filename, title=None, target_col='mean_silhouette', dpi=300, bbox_inches='tight', transparent=True, format=None, **kwargs)[source]
Plot silhouette and ARI metrics across cluster counts and find the best k.
- Parameters:
metric_df (pd.DataFrame) – DataFrame indexed by cluster count with per-fold silhouette columns (fold1_silhouette … fold5_silhouette) and mean_ari_5_fold.
filename (str) – Output file path for the saved figure.
title (str or None, optional) – Plot title. If None, no title is displayed.
target_col (str, optional) – Column name to maximize for selecting the best k, by default 'mean_silhouette'.
dpi (int, optional) – Resolution in dots per inch, by default 300.
bbox_inches (str, optional) – Bounding box setting for saving, by default 'tight'.
transparent (bool, optional) – Whether the background is transparent, by default True.
format (str or None, optional) – Output format. Inferred from filename extension if None.
**kwargs – Additional keyword arguments passed to fig.savefig.
- Return type:
- Returns:
int – The cluster count k that maximizes target_col.
- pymaftools.core.Clustering.gpt_known_genes_summary(client, genes, arm, cancer_type='lung cancer')[source]
Query GPT-4 for well-known genes in a given chromosomal arm and cancer type.
- Parameters:
- Return type:
- Returns:
str – GPT-4 response text listing notable genes and reasons.
str – The prompt that was sent to the model.