Run CalicoST on prostate cancer dataset

Obtain the data

We obtained the spatially resolved transcriptomics of a prostate cross section studied by Erickon et al. from EGA using accession EGAD00001008644. We ran spaceranger to get the BAM files and applied CalicoST to study the CNAs and spatial evolution of cancer across multiple spatial regions.

Compute allele counts by preprocessing: genotyping and reference-based phasing

Step 1: Download SNP and phasing panels

Download the following files to your machine.

Step 2: Make a table for BAM files and slice IDs to jointly genotype and phase

In order to jointly genotype and phase across multiple slices, CalicoST requires a bamlist.tsv file that specifies the BAM file paths, slice IDs, and spaceranger output directories. It must be tab-deliminated, without header, and contain the following columns in order, otherwise the pipeline will report an error.

BAM file location

slide ID

spaceranger out directory location

Below is the bamlist.tsv file we used for the prostate cancer data.

/u/congma/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_spaceranger/P1_H1_2_visium/outs/possorted_genome_bam.bam        H12     /u/congma/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_spaceranger/P1_H1_2_visium/outs/
/u/congma/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_spaceranger/P1_H1_4_visium/outs/possorted_genome_bam.bam        H14     /u/congma/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_spaceranger/P1_H1_4_visium/outs/
/u/congma/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_spaceranger/P1_H1_5_visium/outs/possorted_genome_bam.bam        H15     /u/congma/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_spaceranger/P1_H1_5_visium/outs/
/u/congma/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_spaceranger/P1_H2_1_visium/outs/possorted_genome_bam.bam        H21     /u/congma/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_spaceranger/P1_H2_1_visium/outs/
/u/congma/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_spaceranger/P1_H2_5_visium/outs/possorted_genome_bam.bam        H25     /u/congma/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_spaceranger/P1_H2_5_visium/outs/

Step 3: Make a config.yaml for snakemake preprocessing pipeline

Besides the BAM files, we also need to specify the reference SNPs and phasing panels and other running configurations. These will be specified in the config.yaml file. A template config.yaml file is available in our github. You can modify the template with paths to the references on our machine. Below is the content of config.yaml file we used.

# path to executables or their parent directories
calicost_dir: /n/fs/ragr-data/users/congma/Codes/CalicoST
eagledir: /n/fs/ragr-data/users/congma/environments/Eagle_v2.4.1

# running parameters
# samtools sort (only used when joingly calling from multiple slices)
samtools_sorting_mem: "4G"
# cellsnp-lite
UMItag: "Auto"
cellTAG: "CB"
nthreads_cellsnplite: 20
region_vcf: /n/fs/ragr-data/users/congma/references/snplist/nocpg.genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz
# Eagle phasing
phasing_panel: /n/fs/ragr-data/users/congma/references/phasing_ref/1000G_hg38
chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

# input
bamlist: /n/fs/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_snps/joint_H1_245_H2_15/bamlist.tsv

# output
output_snpinfo: /n/fs/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_snps/joint_H1_245_H2_15

Step 4: Run the proprocessing snakemake pipeline

With the input of modified config.yaml file, we run the snakemake pipeline calicost.smk as follows.

snakemake --cores <number threads> --configfile config.yaml --snakefile calicost.smk all

Key outputs of the snakemake preprocessing pipeline

In the directory specified in output_snpinfo entry of the config.yaml file, we will see the following allele count files

  • cell_snp_Aallele.npz: A scipy sparse matrix of number spots * number SNPs, each entry indicates the UMI counts of the first haplotype phased by Eagle2.

  • cell_snp_Ballele.npz: A scipy sparse matrix of number spots * number SNPs, each entry indicates the UMI counts of the second haplotype phased by Eagle2.

  • unique_snp_ids.npy: The ID each the SNPs, corresponding to the columns of scipy sparse matrices.

  • barcodes.txt: The spot barcodes (with slide IDs appended as suffix), corresponding to the rows of scipy sparse matrices.

Estimate tumor proportion per spot

Step 1: Make the configuration_purity file To use CalicoST for inferring tumor proportion, first set up a configuration file for CalicoST to specify the input paths, output paths, and running configurations. A template configuration_purity file is available at the github repo. Modify the input/output directories from the template. We used the following configuration_purity file for estimating tumor purity on the prostate cancer data.

input_filelist : /n/fs/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_snps/joint_H1_245_H2_15/bamlist.tsv
snp_dir : /n/fs/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_snps/joint_H1_245_H2_15
output_dir : /n/fs/ragr-data/users/congma/Datasets/CalicoST_prostate_example/estimate_tumor_prop

# supporting files and preprocessing arguments
geneticmap_file : /n/fs/ragr-data/users/congma/Codes/CalicoST/GRCh38_resources/genetic_map_GRCh38_merged.tab.gz
hgtable_file : /n/fs/ragr-data/users/congma/Codes/CalicoST/GRCh38_resources/hgTables_hg38_gencode.txt
normalidx_file : None
tumorprop_file : None
alignment_files :
supervision_clone_file : None
filtergenelist_file : /n/fs/ragr-data/users/congma/Codes/CalicoST/GRCh38_resources/ig_gene_list.txt
filterregion_file : /n/fs/ragr-data/users/congma/Codes/CalicoST/GRCh38_resources/HLA_regions.bed
secondary_min_umi : 400
bafonly : False

# phase switch probability
nu : 1.0
logphase_shift : -2.0
npart_phasing : 3

# HMRF configurations
n_clones : 5
n_clones_rdr : 2
min_spots_per_clone : 100
min_avgumi_per_clone : 10
maxspots_pooling : 19
tumorprop_threshold : 0.7
max_iter_outer : 20
nodepotential : weighted_sum
initialization_method : rectangle
num_hmrf_initialization_start : 0
num_hmrf_initialization_end : 1
spatial_weight : 1.0
construct_adjacency_method : hexagon
construct_adjacency_w : 1.0

# HMM configurations
n_states : 7
params : smp
t : 1-1e-4
t_phaseing : 0.9999
fix_NB_dispersion : False
shared_NB_dispersion : True
fix_BB_dispersion : False
shared_BB_dispersion : True
max_iter : 30
tol : 0.0001
gmm_random_state : 0
np_threshold : 1.0
np_eventminlen : 10

# integer copy number
nonbalance_bafdist : 1.0
nondiploid_rdrdist : 10.0

For more information of the running configurations, refer to this page in the documentations.

Step 2: Run the tumor purity estimation module in CalicoST

We run the following command in terminal

# make the output directory
mkdir -p /n/fs/ragr-data/users/congma/Datasets/CalicoST_prostate_example/estimate_tumor_prop/
# run CalicoST to infer tumor purity
OMP_NUM_THREADS=1 python <CalicoST git-cloned directory>/src/calicost/estimate_tumor_proportion.py -c configuration_purity

This command takes about 1.5 hours to run and will generate an output file of loh_estimator_tumor_prop.tsv in the output directory.

Load and visualize inferred tumor proportions

We load and visualize the estimated tumor proportions as follows.

import scanpy as sc
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import copy
# Modify the example_directory to be the path of the downloaded and untarred data.
example_directory = "./"

tumor_proportions = pd.read_csv(f'{example_directory}/estimate_tumor_prop/loh_estimator_tumor_prop.tsv', header=0, index_col=0, sep='\t')
tumor_proportions
Tumor
BARCODES
AAACAAGTATCTCCCA-1_H12 0.050000
AAACAGGGTCTATATT-1_H12 NaN
AAACATTTCCCGGATT-1_H12 0.050000
AAACCGGGTAGGTACC-1_H12 0.825997
AAACCGTTCGTCCAGG-1_H12 0.940555
... ...
TTGTTCAGTGTGCTAC-1_H25 0.050000
TTGTTGTGTGTCAAGA-1_H25 0.188937
TTGTTTCACATCCAGG-1_H25 0.956014
TTGTTTCATTAGTCTA-1_H25 0.838851
TTGTTTCCATACAACT-1_H25 0.364009

13344 rows × 1 columns

slice_ids = ['H12', 'H14', 'H15', 'H21', 'H25']
directory_name = ['P1_H1_2_visium', 'P1_H1_4_visium', 'P1_H1_5_visium', 'P1_H2_1_visium', 'P1_H2_5_visium']

for i,s in enumerate(slice_ids):
    # load spatial locations
    # note that scanpy is incompatible with the latest tissue_positions.csv file, we directly load the positions as pandas data frame
    df = pd.read_csv(f'{example_directory}/data/{directory_name[i]}/spatial/tissue_positions.csv', header=0, index_col=0, sep=',')
    print(df.head())
    # combine the position data frame with the tumor proportion dataframe
    slice_tumor_proportions = tumor_proportions[tumor_proportions.index.str.endswith(s)]
    df = df.join( slice_tumor_proportions.rename(index=lambda x:x.split("_")[0]) )

    # plot
    fig, axes = plt.subplots(1, 1, facecolor='white')
    sns.scatterplot(x=df.array_row, y=df.array_col, hue=df.Tumor, palette='magma_r', linewidth=0, s=10, legend=False, ax=axes)
    norm = plt.Normalize(np.nanmin(df.Tumor.values), np.nanmax(df.Tumor.values))
    axes.figure.colorbar( plt.cm.ScalarMappable(cmap='magma_r', norm=norm), ax=axes )
    axes.set_title(s)
    plt.show()
                    in_tissue  array_row  array_col  pxl_row_in_fullres  \
barcode                                                                   
ACGCCTGACACGCGCT-1          0          0          0                1466   
TACCGATCCAACACTT-1          0          1          1                1592   
ATTAAAGCGGACGAGC-1          0          0          2                1466   
GATAAGGGACGATTAG-1          0          1          3                1592   
GTGCAAATCACCAATA-1          0          0          4                1466   

                    pxl_col_in_fullres  
barcode                                 
ACGCCTGACACGCGCT-1                1298  
TACCGATCCAACACTT-1                1370  
ATTAAAGCGGACGAGC-1                1443  
GATAAGGGACGATTAG-1                1515  
GTGCAAATCACCAATA-1                1588  
../../_images/0bd44a02cfc38fc7624c8cd8f7ee53d12b4772411848057cc1597dfed2655c8b.png
                    in_tissue  array_row  array_col  pxl_row_in_fullres  \
barcode                                                                   
ACGCCTGACACGCGCT-1          0          0          0                1831   
TACCGATCCAACACTT-1          0          1          1                1983   
ATTAAAGCGGACGAGC-1          0          0          2                1831   
GATAAGGGACGATTAG-1          0          1          3                1983   
GTGCAAATCACCAATA-1          0          0          4                1831   

                    pxl_col_in_fullres  
barcode                                 
ACGCCTGACACGCGCT-1                1544  
TACCGATCCAACACTT-1                1631  
ATTAAAGCGGACGAGC-1                1718  
GATAAGGGACGATTAG-1                1805  
GTGCAAATCACCAATA-1                1892  
../../_images/82cfd15c9b52cac247723460a9cdbfed9b10afc7dfe1965e4d99594f5956af96.png
                    in_tissue  array_row  array_col  pxl_row_in_fullres  \
barcode                                                                   
ACGCCTGACACGCGCT-1          0          0          0                1593   
TACCGATCCAACACTT-1          0          1          1                1720   
ATTAAAGCGGACGAGC-1          0          0          2                1593   
GATAAGGGACGATTAG-1          0          1          3                1719   
GTGCAAATCACCAATA-1          0          0          4                1593   

                    pxl_col_in_fullres  
barcode                                 
ACGCCTGACACGCGCT-1                1172  
TACCGATCCAACACTT-1                1245  
ATTAAAGCGGACGAGC-1                1317  
GATAAGGGACGATTAG-1                1390  
GTGCAAATCACCAATA-1                1462  
../../_images/0625892712a0b5609691beace831441d37c7ae8303380451b5948db9ca4f792e.png
                    in_tissue  array_row  array_col  pxl_row_in_fullres  \
barcode                                                                   
ACGCCTGACACGCGCT-1          0          0          0                1830   
TACCGATCCAACACTT-1          0          1          1                1982   
ATTAAAGCGGACGAGC-1          0          0          2                1831   
GATAAGGGACGATTAG-1          0          1          3                1982   
GTGCAAATCACCAATA-1          0          0          4                1831   

                    pxl_col_in_fullres  
barcode                                 
ACGCCTGACACGCGCT-1                1523  
TACCGATCCAACACTT-1                1610  
ATTAAAGCGGACGAGC-1                1697  
GATAAGGGACGATTAG-1                1784  
GTGCAAATCACCAATA-1                1871  
../../_images/ae058a782d54eedc0affcc2f6aaaa43bb55bf3db006f58727abb04ae9a958b1a.png
                    in_tissue  array_row  array_col  pxl_row_in_fullres  \
barcode                                                                   
ACGCCTGACACGCGCT-1          0          0          0                1948   
TACCGATCCAACACTT-1          0          1          1                2100   
ATTAAAGCGGACGAGC-1          0          0          2                1948   
GATAAGGGACGATTAG-1          0          1          3                2100   
GTGCAAATCACCAATA-1          0          0          4                1947   

                    pxl_col_in_fullres  
barcode                                 
ACGCCTGACACGCGCT-1                1441  
TACCGATCCAACACTT-1                1529  
ATTAAAGCGGACGAGC-1                1616  
GATAAGGGACGATTAG-1                1705  
GTGCAAATCACCAATA-1                1791  
../../_images/aaa6c0342c37d83cf860fa701a266d5502c094ae2bd4be51c349dcc97226ed73.png

Run CalicoST to infer CNAs and cancer clones based on estimated tumor proportions

Step 1: Create a configuration file for CalicoST

The configuration file for CalicoST, configuration_cna_multi, specifies the input/output paths and the running parameters. It is very similar to the configutation_purity file except that (1) it uses a different output directory, and (2)the “tumorprop_file” entry is now replaced with the output of the inferred tumor purity. Below is the configuration_cna_multi file we used.

input_filelist : /n/fs/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_snps/joint_H1_245_H2_15/bamlist.tsv
snp_dir : /n/fs/ragr-data/datasets/spatial_cna/Lundeberg_organwide/P1_snps/joint_H1_245_H2_15
output_dir : /n/fs/ragr-data/users/congma/Datasets/CalicoST_prostate_example/calicost

# supporting files and preprocessing arguments
geneticmap_file : /n/fs/ragr-data/users/congma/Codes/CalicoST/GRCh38_resources/genetic_map_GRCh38_merged.tab.gz
hgtable_file : /n/fs/ragr-data/users/congma/Codes/CalicoST/GRCh38_resources/hgTables_hg38_gencode.txt
normalidx_file : None
tumorprop_file : /n/fs/ragr-data/users/congma/Datasets/CalicoST_prostate_example/estimate_tumor_prop/loh_estimator_tumor_prop.tsv
alignment_files :
supervision_clone_file : None
filtergenelist_file : /n/fs/ragr-data/users/congma/Codes/CalicoST/GRCh38_resources/ig_gene_list.txt
filterregion_file : /n/fs/ragr-data/users/congma/Codes/CalicoST/GRCh38_resources/HLA_regions.bed
secondary_min_umi : 400
bafonly : False

# phase switch probability
nu : 1.0
logphase_shift : -2.0
npart_phasing : 3

# HMRF configurations
n_clones : 5
n_clones_rdr : 2
min_spots_per_clone : 100
min_avgumi_per_clone : 10
maxspots_pooling : 19
tumorprop_threshold : 0.7
max_iter_outer : 20
nodepotential : weighted_sum
initialization_method : rectangle
num_hmrf_initialization_start : 0
num_hmrf_initialization_end : 1
spatial_weight : 1.0
construct_adjacency_method : hexagon
construct_adjacency_w : 1.0

# HMM configurations
n_states : 7
params : smp
t : 1-1e-4
t_phaseing : 0.9999
fix_NB_dispersion : False
shared_NB_dispersion : True
fix_BB_dispersion : False
shared_BB_dispersion : True
max_iter : 30
tol : 0.0001
gmm_random_state : 0
np_threshold : 1.0
np_eventminlen : 10

# integer copy number
nonbalance_bafdist : 1.0
nondiploid_rdrdist : 10.0

Step 2: Run the CalicoST to infer CNAs and clones

We ran the following command.

# create output directory
mkdir -p /n/fs/ragr-data/users/congma/Datasets/CalicoST_prostate_example/calicost
# run CalicoST
OMP_NUM_THREADS=1 python <CalicoST directory>/src/calicost/calicost_main.py -c configuration_cna_multi

This command takes about 8h to finish.

Load CalicoST-generated result tables

<output_dir>/clone5_rectangle0_w1.0/clone_labels.tsv stores the inferred cancer clones for each spot. Note that spots with a low tumor purity will also be assigned to a cancer clone, despite that the inferred cancer clone label is not very meaningful for those nearly normal spots.

df = pd.read_csv(f"{example_directory}/calicost/clone5_rectangle0_w1.0/clone_labels.tsv", header=0, index_col=0, sep='\t')
df
clone_label tumor_proportion
BARCODES
AAACAAGTATCTCCCA-1_H12 3 0.050000
AAACAGGGTCTATATT-1_H12 3 NaN
AAACATTTCCCGGATT-1_H12 3 0.050000
AAACCGGGTAGGTACC-1_H12 3 0.825997
AAACCGTTCGTCCAGG-1_H12 3 0.940555
... ... ...
TTGTTCAGTGTGCTAC-1_H25 2 0.050000
TTGTTGTGTGTCAAGA-1_H25 2 0.188937
TTGTTTCACATCCAGG-1_H25 1 0.956014
TTGTTTCATTAGTCTA-1_H25 1 0.838851
TTGTTTCCATACAACT-1_H25 2 0.364009

13344 rows × 2 columns

<output_dir>/clone5_rectangle0_w1.0/cnv_seglevel.tsv stores the allele-specific copy numbers of each genome bin within each clone. Each row is a genome bin, the columns containing the chromosome, start, and end position of the genome bin, and the inferred the A and B allele copy number within each cancer clone.

df = pd.read_csv(f"{example_directory}/calicost/clone5_rectangle0_w1.0/cnv_seglevel.tsv", header=0, index_col=None, sep='\t')
df
CHR START END clone1 A clone1 B clone2 A clone2 B clone3 A clone3 B clone4 A clone4 B clone5 A clone5 B
0 1 89295 1419136 1 1 1 1 1 1 1 1 1 1
1 1 1434861 1440568 1 1 1 1 1 1 1 1 1 1
2 1 1449689 1496123 1 1 1 1 1 1 1 1 1 1
3 1 1512162 1721078 1 1 1 1 1 1 1 1 1 1
4 1 1724838 2308568 1 1 1 1 1 1 1 1 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2255 22 46360834 48898361 1 1 1 1 1 1 1 1 1 1
2256 22 49773283 49963978 1 1 1 1 1 1 1 1 1 1
2257 22 50089879 50180213 1 1 1 1 1 1 1 1 1 1
2258 22 50185915 50292030 1 1 1 1 1 1 1 1 1 1
2259 22 50309030 50783625 1 1 1 1 1 1 1 1 1 1

2260 rows × 13 columns

<output_dir>/clone5_rectangle0_w1.0/cnv_genelevel.tsv stores the allele-specific copy number for each gene. The copy number per gene is derived from projecting the allele-specific copy numbers of each genome bin to the spanned genes that have enough expression to be retained after CalicoST gene filtering step.

df = pd.read_csv(f"{example_directory}/calicost/clone5_rectangle0_w1.0/cnv_genelevel.tsv", header=0, index_col=None, sep='\t')
df
gene clone1 A clone1 B clone2 A clone2 B clone3 A clone3 B clone4 A clone4 B clone5 A clone5 B
0 AL627309.1 1 1 1 1 1 1 1 1 1 1
1 AL627309.5 1 1 1 1 1 1 1 1 1 1
2 LINC01409 1 1 1 1 1 1 1 1 1 1
3 LINC01128 1 1 1 1 1 1 1 1 1 1
4 LINC00115 1 1 1 1 1 1 1 1 1 1
... ... ... ... ... ... ... ... ... ... ... ...
15035 CHKB-DT 1 1 1 1 1 1 1 1 1 1
15036 MAPK8IP2 1 1 1 1 1 1 1 1 1 1
15037 ARSA 1 1 1 1 1 1 1 1 1 1
15038 SHANK3 1 1 1 1 1 1 1 1 1 1
15039 RABL2B 1 1 1 1 1 1 1 1 1 1

15040 rows × 11 columns

Visualize CalicoST-generated plots

Once CalicoST is finished running, it generates the plots of spatial organization of clones and allele-specific copy numbers. The plots are in PDF format and can be directly viewed. Below, we load the PDF plots in this notebook for easy visualization.

from wand.image import Image as WImage
img = WImage(filename=f"{example_directory}/calicost/clone5_rectangle0_w1.0/plots/clone_spatial.pdf", resolution=100)
img
../../_images/300a43b307d52f773d27147c7ec06e7898e1004f4b727a371401bf964e88a6b7.png
# allele-specific copy numbers of each clone (the color scheme is the same as Fig2c
img = WImage(filename=f"{example_directory}/calicost/clone5_rectangle0_w1.0/plots/acn_genome.pdf", resolution=120)
img
../../_images/8b4794f8f69a9cfbd49d40145aa65ea58c015ae95b62497a7d331796e53dd6df.png
# RDR-BAF plot along the genome for each clone
img = WImage(filename=f"{example_directory}/calicost/clone5_rectangle0_w1.0/plots/rdr_baf_defaultcolor.pdf", resolution=120)
img
../../_images/0b2e506d3c4598b29d614c877f935eac521b4f376f1a85ad5ffe8dae1df8127e.png

Reconstruct tumor phylogeny and phylogeography

We use an existing phylogeny reconstruction method, Startle by Sashittal et al, to infer a phylogeny of CalicoST-inferred cancer clones. To reconstruct a phylogeny based CalicoST results of the first random initialization, run the following command in shell:

mkdir calicost/phylogeny_clone5_rectangle0_w1.0
python <CalicoST code directory>/src/calicost/phylogeny_startle.py -c calicost/clone5_rectangle0_w1.0 -s <startle executable path> -o calicost/phylogeny_clone5_rectangle0_w1.0/

The above run of Startle will produce a plain-text file calicost/phylogeny_clone5_rectangle0_w1.0/loh_tree.newick that encodes phylogeny tree with leaf nodes as CalicoST-inferred clones. We load the tree file as follows.

with open(f"{example_directory}/calicost/phylogeny_clone5_rectangle0_w1.0/loh_tree.newick", 'r') as fp:
    print( fp.readlines() )
['((clone1:0,clone2:3):4,(clone3:1,(clone4:3,clone5:3):9):2);']

Now we project the phylogenetic tree in space to get a phylogeography. Before getting the phylogeography, we note that we currently don’t have the relative positioning among the five slices yet. We manually place the five slices according to Fig 1b in the original publication by Erickon et al., and transform the x/y coordinate in the tissue_positions.csv file according to the new positioning.

# load coordinates and inferred cancer clones
coords = []
for i,s in enumerate(slice_ids):
    # load spatial locations
    # note that scanpy is incompatible with the latest tissue_positions.csv file, we directly load the positions as pandas data frame
    df = pd.read_csv(f'{example_directory}/data/{directory_name[i]}/spatial/tissue_positions.csv', header=0, index_col=0, sep=',')
    df.index = df.index + "_" + s
    df['slice_id'] = s
    coords.append( df )

coords = pd.concat(coords)

# combine with the cancer clone table
df = pd.read_csv(f"{example_directory}/calicost/clone5_rectangle0_w1.0/clone_labels.tsv", header=0, index_col=0, sep='\t')
df.clone_label = 'clone' + df.clone_label.astype(str)
coords = coords.join(df)

# remove spots that are not assigned to clones by CalicoST (filtered out due to low UMI count or SNP-covering UMI count)
coords = coords[coords.clone_label.notnull()]
coords
in_tissue array_row array_col pxl_row_in_fullres pxl_col_in_fullres slice_id clone_label tumor_proportion
barcode
TCCTTCAGTGGTCGAA-1_H12 1 15 69 3366 6308 H12 clone3 0.050000
GCGTCGAAATGTCGGT-1_H12 1 17 65 3618 6018 H12 clone3 0.132391
AACTGATATTAGGCCT-1_H12 1 16 66 3492 6090 H12 clone3 0.050000
CGAGCTGGGCTTTAGG-1_H12 1 17 67 3618 6163 H12 clone3 0.050000
GGGTGTTTCAGCTATG-1_H12 1 16 68 3492 6236 H12 clone3 0.050000
... ... ... ... ... ... ... ... ...
ATGGCCCGAAAGGTTA-1_H25 1 76 120 13480 12001 H25 clone2 1.000000
CGTAATATGGCCCTTG-1_H25 1 77 121 13632 12089 H25 clone2 1.000000
AGAGTCTTAATGAAAG-1_H25 1 76 122 13480 12176 H25 clone2 1.000000
ATTGAATTCCCTGTAG-1_H25 1 76 124 13479 12351 H25 clone2 1.000000
TTGAAGTGCATCTACA-1_H25 1 77 127 13630 12615 H25 clone2 NaN

13344 rows × 8 columns

def flip_axis(coords, axis):
    max_x = np.max(coords[:,axis])
    min_x = np.min(coords[:,axis])
    tmp_coords = copy.copy(coords)
    tmp_coords[:,axis] = min_x + max_x - coords[:,axis]
    return tmp_coords


def rotate_by_angle(coords, angle):
    theta = angle / 180 * np.pi
    R = np.array([ [np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]] )
    mean_coords = np.mean(coords, axis=0)
    return (coords - mean_coords.reshape(1,-1)) @ R + mean_coords.reshape(1,-1)
adjusted_coords = copy.copy(coords[['array_row', 'array_col']].values)
# scale y coordinate so that the hexagon is not squeezed on one direction
adjusted_coords[:,0] = adjusted_coords[:,0] * np.sqrt(3)
# shift x and y coordinate to start from 0 for each slice
for s,sname in enumerate(slice_ids):
    index = np.where(coords.slice_id.values == sname)[0]
    adjusted_coords[index,0] -= np.min(adjusted_coords[index,0])
    adjusted_coords[index,1] -= np.min(adjusted_coords[index,1])
    

# position in number of cubes
cube_length = min( np.max(adjusted_coords[:,0]), np.max(adjusted_coords[:,1]) )
sample_cube_pos = np.array([ [2,0], #H12
                             [4, 0.2], #H14
                             [5,0.5], #H15
                             [0,1], #H21
                             [5,1.5] ]) #H25

swap_x_y = [False, True, True, False, True]
rotation_angle = [15,-5,-5,0,-5] # H12, H14, H15, H21, H25

full_adj_coords = np.zeros(adjusted_coords.shape)
for s,sname in enumerate(slice_ids):
    index = np.where(coords.slice_id.values == sname)[0]
    if swap_x_y[s]:
        tmp_coords = np.vstack([adjusted_coords[index,1],adjusted_coords[index,0]]).T
        if sname != "H25":
            tmp_coords = flip_axis(tmp_coords, axis=0 )
        tmp_coords = flip_axis(tmp_coords, axis=1)
        full_adj_coords[index,:] = tmp_coords + cube_length * sample_cube_pos[s]
    else:
        full_adj_coords[index,:] = adjusted_coords[index,:] + cube_length * sample_cube_pos[s]
        
    full_adj_coords[index,:] = rotate_by_angle(full_adj_coords[index,:], rotation_angle[s])

coords['final_x'] = full_adj_coords[:,0]
coords['final_y'] = full_adj_coords[:,1]
coords
in_tissue array_row array_col pxl_row_in_fullres pxl_col_in_fullres slice_id clone_label tumor_proportion final_x final_y
barcode
TCCTTCAGTGGTCGAA-1_H12 1 15 69 3366 6308 H12 clone3 0.050000 238.417652 74.069483
GCGTCGAAATGTCGGT-1_H12 1 17 65 3618 6018 H12 clone3 0.132391 241.246079 69.170504
AACTGATATTAGGCCT-1_H12 1 16 66 3492 6090 H12 clone3 0.050000 239.573047 70.654068
CGAGCTGGGCTTTAGG-1_H12 1 17 67 3618 6163 H12 clone3 0.050000 241.763717 71.102355
GGGTGTTTCAGCTATG-1_H12 1 16 68 3492 6236 H12 clone3 0.050000 240.090685 72.585919
... ... ... ... ... ... ... ... ... ... ...
ATGGCCCGAAAGGTTA-1_H25 1 76 120 13480 12001 H25 clone2 1.000000 700.743520 183.090182
CGTAATATGGCCCTTG-1_H25 1 77 121 13632 12089 H25 clone2 1.000000 701.914026 181.184948
AGAGTCTTAATGAAAG-1_H25 1 76 122 13480 12176 H25 clone2 1.000000 702.735909 183.264493
ATTGAATTCCCTGTAG-1_H25 1 76 124 13479 12351 H25 clone2 1.000000 704.728299 183.438805
TTGAAGTGCATCTACA-1_H25 1 77 127 13630 12615 H25 clone2 NaN 707.891194 181.707882

13344 rows × 10 columns

import calicost.utils_plotting

fig = calicost.utils_plotting.plot_individual_spots_in_space(full_adj_coords, coords.clone_label, coords.tumor_proportion, base_width=10, base_height=5)
plt.gca().invert_yaxis()
fig.show()
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1404: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  quantile_colors = this_full_cmap(np.array([0, np.min(copy_single_tumor_prop[idx]), np.max(copy_single_tumor_prop[idx]), 1]))
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1407: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  seaborn.scatterplot(x=shifted_coords[idx,0], y=-shifted_coords[idx,1], s=10, hue=copy_single_tumor_prop[idx], palette=this_cmap, linewidth=0, legend=None, ax=axes)
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1404: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  quantile_colors = this_full_cmap(np.array([0, np.min(copy_single_tumor_prop[idx]), np.max(copy_single_tumor_prop[idx]), 1]))
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1407: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  seaborn.scatterplot(x=shifted_coords[idx,0], y=-shifted_coords[idx,1], s=10, hue=copy_single_tumor_prop[idx], palette=this_cmap, linewidth=0, legend=None, ax=axes)
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1404: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  quantile_colors = this_full_cmap(np.array([0, np.min(copy_single_tumor_prop[idx]), np.max(copy_single_tumor_prop[idx]), 1]))
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1407: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  seaborn.scatterplot(x=shifted_coords[idx,0], y=-shifted_coords[idx,1], s=10, hue=copy_single_tumor_prop[idx], palette=this_cmap, linewidth=0, legend=None, ax=axes)
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1404: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  quantile_colors = this_full_cmap(np.array([0, np.min(copy_single_tumor_prop[idx]), np.max(copy_single_tumor_prop[idx]), 1]))
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1407: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  seaborn.scatterplot(x=shifted_coords[idx,0], y=-shifted_coords[idx,1], s=10, hue=copy_single_tumor_prop[idx], palette=this_cmap, linewidth=0, legend=None, ax=axes)
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1404: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  quantile_colors = this_full_cmap(np.array([0, np.min(copy_single_tumor_prop[idx]), np.max(copy_single_tumor_prop[idx]), 1]))
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1407: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  seaborn.scatterplot(x=shifted_coords[idx,0], y=-shifted_coords[idx,1], s=10, hue=copy_single_tumor_prop[idx], palette=this_cmap, linewidth=0, legend=None, ax=axes)
../../_images/75c2e34f2a0311be4074370c55cf8be6491663d4dd440bd8af4aa7788b3d1d9b.png

Now we project the phylogeny to the space of coords[[‘final_x’, ‘final_y’]] and infer ancestor locations using a Gaussian diffusion model.

import calicost.phylogeography

newick_file = f"{example_directory}/calicost/phylogeny_clone5_rectangle0_w1.0/loh_tree.newick"
t = calicost.phylogeography.project_phylogeneny_space(newick_file, coords[['final_x', 'final_y']].values, coords.clone_label.values, 
                              single_tumor_prop=coords.tumor_proportion.values, sample_list=slice_ids, sample_ids=coords.slice_id.values)

print( t )
root node is ancestor1_2_3_4_5
a list of leaf nodes: ['clone1', 'clone2', 'clone3', 'clone4', 'clone5']
a list of internal nodes: ['ancestor1_2_3_4_5', 'ancestor1_2', 'ancestor3_4_5', 'ancestor4_5']

      /-clone1
   /-|
  |   \-clone2
--|
  |   /-clone3
   \-|
     |   /-clone4
      \-|
         \-clone5
# plot clones in space with the phylogeography

fig = calicost.utils_plotting.plot_individual_spots_in_space(full_adj_coords, coords.clone_label, coords.tumor_proportion, base_width=10, base_height=5)
axes = plt.gca()

# clone centers + ancestors
for node in t.traverse():
    axes.scatter( node.x, -node.y, marker="D", linewidth=2, edgecolor='black', facecolor="None", s=50)

# edges
for node in t.iter_leaves():
    while not node.is_root():
        p = node.up
        if np.abs(node.x - p.x) + np.abs(node.y - p.y) > 1:
            axes.annotate("", xy=(node.x, -node.y), xytext=(p.x, -p.y), arrowprops=dict(mutation_scale=15, lw=1, arrowstyle="->", color="black"))
        node = p
        
axes.invert_yaxis()
fig.show()
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1404: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  quantile_colors = this_full_cmap(np.array([0, np.min(copy_single_tumor_prop[idx]), np.max(copy_single_tumor_prop[idx]), 1]))
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1407: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  seaborn.scatterplot(x=shifted_coords[idx,0], y=-shifted_coords[idx,1], s=10, hue=copy_single_tumor_prop[idx], palette=this_cmap, linewidth=0, legend=None, ax=axes)
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1404: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  quantile_colors = this_full_cmap(np.array([0, np.min(copy_single_tumor_prop[idx]), np.max(copy_single_tumor_prop[idx]), 1]))
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1407: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  seaborn.scatterplot(x=shifted_coords[idx,0], y=-shifted_coords[idx,1], s=10, hue=copy_single_tumor_prop[idx], palette=this_cmap, linewidth=0, legend=None, ax=axes)
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1404: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  quantile_colors = this_full_cmap(np.array([0, np.min(copy_single_tumor_prop[idx]), np.max(copy_single_tumor_prop[idx]), 1]))
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1407: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  seaborn.scatterplot(x=shifted_coords[idx,0], y=-shifted_coords[idx,1], s=10, hue=copy_single_tumor_prop[idx], palette=this_cmap, linewidth=0, legend=None, ax=axes)
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1404: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  quantile_colors = this_full_cmap(np.array([0, np.min(copy_single_tumor_prop[idx]), np.max(copy_single_tumor_prop[idx]), 1]))
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1407: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  seaborn.scatterplot(x=shifted_coords[idx,0], y=-shifted_coords[idx,1], s=10, hue=copy_single_tumor_prop[idx], palette=this_cmap, linewidth=0, legend=None, ax=axes)
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1404: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  quantile_colors = this_full_cmap(np.array([0, np.min(copy_single_tumor_prop[idx]), np.max(copy_single_tumor_prop[idx]), 1]))
/n/fs/ragr-data/users/congma/temp/CalicoST/src/calicost/utils_plotting.py:1407: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  seaborn.scatterplot(x=shifted_coords[idx,0], y=-shifted_coords[idx,1], s=10, hue=copy_single_tumor_prop[idx], palette=this_cmap, linewidth=0, legend=None, ax=axes)
../../_images/c9df9acd75aeb9bbe6bcd090f96dbd719d4995331db150fc1b2f1c6114c90153.png