Modules

admm_fit

Created on Thu May 5 23:18:26 2022

@author: hill103

this script stores functions related to ADMM framework for fitting model

ADMM framework based on https://github.com/cvxgrp/strat_models

admm_fit.one_admm_fit(data, L, theta, e_alpha, gamma_g, sigma2, lambda_r=1.0, lasso_weight=None, hv_x=None, hv_log_p=None, theta_mask=None, abs_tol=0.001, rel_tol=0.001, rho=1, mu=10, tau_incr=2, tau_decr=2, max_rho=10.0, min_rho=0.1, maxiter=100, max_cg_iterations=10, dynamic_rho=True, queue_len=3, diff_threshold=0.05, rho_incr=2, rho_decr=2, diff_scale=5, diff_stop=5e-05, opt_method='L-BFGS-B', global_optimize=False, hybrid_version=True, verbose=False)[source]

perform a whole ADMM iterations once as one fitting procedure in GLRM

Note: the output of the fit result is 3-Dimensional. The dimension transform will be performed outside this function if needed

Parameters:
  • data (Dict) –

    a Dict contains all info need for modeling:

    X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

    Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

    A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

    N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

    non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

    spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

    gene_names: a list of string of gene symbols. Only keep actually used marker genes.

    celltype_names: a list of string of celltype names.

    initial_guess: initial guess of cell-type proportions of spatial spots.

  • L (scipy sparse matrix (spots * spots)) – Laplacian matrix

  • theta (3-D numpy array (spots * celltypes * 1)) – initial guess of theta (celltype proportion).

  • e_alpha (1-D numpy array) – initial guess of e_alpha (spot-specific effect).

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – initial guess of variance paramter of the lognormal distribution of ln(lambda). All genes and spots share the same variance. it may not be updated during the ADMM iterations, i.e. sigma2 is treated as an already optimized value.

  • lambda_r (float, optional) – strength for Adaptive Lasso penalty. The default is 1.0.

  • lasso_weight (3-D numpy array (spots * celltypes * 1), optional) – calculated weight for adaptive lasso. The default is None.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • theta_mask (3-D numpy array (spots * celltypes * 1), optional) – mask for cell-type proportions (1: present, 0: not present). Only used for stage 2 theta optmization.

  • abs_tol (float, optional) – Absolute tolerance. The default is 1e-3.

  • rel_tol (float, optional) – Relative tolerance. The default is 1e-3.

  • rho (float, optional) – Initial penalty parameter. The default is 1. Actually used in code is 1/rho, and turned out to fasten the converge than using rho

  • mu (float, optional) – Adaptive penalty parameters. The default is 10.

  • tau_incr (float, optional) – Adaptive penalty parameters. The default is 2.

  • tau_decr (float, optional) – Adaptive penalty parameters. The default is 2.

  • max_rho (float, optional) – Adaptive penalty parameters. The default is 1e1.

  • min_rho (float, optional) – Adaptive penalty parameters. The default is 1e-1.

  • maxiter (int, optional) – Maximum number of ADMM iterations. The default is 100.

  • max_cg_iterations (int, optional) – Max number of CG iterations for graph Laplacian contrain per ADMM iteration. The default is 10.

  • dynamic_rho (bool, optional) – if True, dynamically increasing min_rho and max_rho. The default is True.

  • queue_len (int, optional) – the length of queue to record the mean of theta-theta_tilde ‘RMSE’ and theta-theta_hat ‘RMSE’. The default is 3.

  • diff_threshold (float, optional) – the threshold of ‘RMSE change’ for rho adjustment. If all ‘RMSE change’ values in queue are less than or equal the threshold, then increasing min_rho and max_rho. The default is 0.05.

  • rho_incr (float, optional) – the multiplier of min_rho and max_rho increasing. The default is 2.

  • rho_decr (float, optional) – the divider of min_rho and max_rho decreasing. The default is 2.

  • diff_scale (float, optional) – if current theta tilde and hat ‘RMSE’ > diff_scale * previous ‘RMSE’, which means current rho value is too large and cause unexpected oscillation, then decreasing min_rho and max_rho. The default is 5.

  • diff_stop (float, optional) – if average of theta tilde and hat ‘RMSE’ <= diff_stop, stop ADMM iteration. The default is 5e-5.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • verbose (bool, optional) – if True, print information in each ADMM loop

Returns:

estimated model coefficients, including:

theta : celltype proportions (#spots * #celltypes * 1)

e_alpha : spot-specific effect (1-D array with length #spot)

sigma2 : variance paramter of the lognormal distribution (float)

gamma_g : gene-specific platform effect for all genes (1-D array with length #gene)

theta_tilde : celltype proportions for Adaptive Lasso (#spots * #celltypes * 1)

theta_hat : celltype proportions for Graph Laplacian constrain (#spots * #celltypes * 1)

Return type:

Dict

config

Created on Thu May 5 21:58:08 2022

@author: hill103

this script configure the environment variables for GLRM

config.diagnosis_path = '/data/diagnosis'

define a small value to avoid divided by 0 or log(0)

config.gamma = 0.004

define eps in optmization for theta and sigma2 (default 1e-08)

config.is_docker()[source]
config.min_sigma2 = 1e-06

define the integration range and increment to calculate the heavy-tail probabilities

config.min_theta = 1e-09

define a small value to make sure sigma^2 > 0

config.min_val = 1e-12

define a small value to make sure theta (w) > 0

config.print(*objects, sep=' ', end='\n', file=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, flush=True)[source]
config.sigma2_eps = 1e-08

configuration for reproducible results in keras + TensorFlow skipped here, we configure the environment in main function

cvae

Created on Thu May 19 21:43:08 2022

@author: hill103

this script stores functions to build a CVAE for platform effect adjustment

cvae.CVAE_keras_model(p, p_cond, latent_dim, p_encoder_lst, p_decoder_lst, hidden_act='elu', output_act='relu', use_batch_norm=True, cvae_init_lr=0.01)[source]

define a standard CVAE model based on Keras need to build a decoder separately as can not extract it from the whole model

Parameters:
  • p (int) – number of nodes in input layer

  • p_cond (int) – number of conditional nodes in input layer

  • latent_dim (int) – number of nodes in latent space

  • p_encoder_lst (list of integers) – including number of nodes in each hidden layer of encoder, the length of list is the number of hidden layers

  • p_decoder_lst (list of integers) – including number of nodes in each hidden layer of decoder, the length of list is the number of hidden layers

  • hidden_act (string, optional) – activation function of hidden layers. Default is elu function

  • output_act (string, optional) – activation function of output layer. Default is relu function

  • use_batch_norm (bool, optional) – whether to use batch normalization. Default if True, i.e. use batch normalization

  • cvae_init_lr (float) – initial learning rate for training CVAE

Returns:

  • cvae (Keras model) – CVAE model. Encoder can be extracted from it

  • decoder (Keras model) – corresponding decoder, reset its weights after CVAE training

cvae.augment_sc(exp_df, celltype_anno, target_count, pseudo_spot_min_cell, pseudo_spot_max_cell)[source]

augment single cells and balance #cells of cell types

generate pseudo-spots by randomly combining scRNA-seq cells within the same cell type

original single cells are put at the end

Parameters:
  • exp_df (dataframe) – normalized gene expression (cell * genes)

  • celltype_anno (dataframe or None) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>

  • target_count (int) – target number of cells per cell type

  • pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot

  • pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot

Returns:

  • pseudo_spots_df (dataframe) – pseudo-spot gene expression (pseudo-spots * genes; original cells included first)

  • pseudo_spots_celltype_prop (dataframe) – pseudo-spot cell-type proportions (pseudo-spots * cell-types; original cells included first)

  • n_cell_in_spot (list) – number of cells/spots in pseudo-spots (original cells included first)

cvae.build_CVAE(spatial_df, scRNA_df, scRNA_celltype, n_marker_per_cmp, n_pseudo_spot, pseudo_spot_min_cell, pseudo_spot_max_cell, seq_depth_scaler, cvae_input_scaler, cvae_init_lr, num_hidden_layer, use_batch_norm, cvae_train_epoch, use_spatial_pseudo, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, diagnosis, rerun_DE=True, filter_gene=True)[source]

build CVAE to adjust platform effect, return transformed spatial gene expression and scRNA-seq cell-type gene signature

input gene expression in datasets only included genes needed for downstream analysis and already been normalized by sequencing depth

Parameters:
  • spatial_df (dataframe) – normalized gene expression in spatial transcriptomic data (spots * genes).

  • scRNA_df (dataframe) – normalized gene expression in scRNA-seq data (cells * genes).

  • scRNA_celltype (dataframe) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>.

  • n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE.

  • n_pseudo_spot (int) – number of pseudo-spots.

  • pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot.

  • pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot.

  • seq_depth_scaler (int) – a scaler of scRNA-seq sequencing depth.

  • cvae_input_scaler (int) – maximum value of the scaled input for CVAE.

  • cvae_init_lr (float) – initial learning rate for training CVAE.

  • num_hidden_layer (int) – number of hidden layers in encoder and decoder.

  • use_batch_norm (bool) – whether to use Batch Normalization.

  • cvae_train_epoch (int) – max number of training epochs for the CVAE.

  • use_spatial_pseudo (int) – whether to generate “pseudo-spots” in spatial condition.

  • use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting.

  • p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering.

  • fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering.

  • pct1_cutoff (float) – threshold of pct.1 in marker genes filtering.

  • pct2_cutoff (float) – threshold of pct.2 in marker genes filtering.

  • sortby_fc (bool) – whether to sort marker genes by fold change.

  • diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection.

  • rerun_DE (bool, optional) – whether to rerun DE on the CVAE transformed scRNA-seq data, since the DE genes might be different with before CVAE transforming.

  • filter_gene (bool) – whether to filter genes before DE.

Returns:

  • spatial_transformed_numi (dataframe) – CVAE transformed (platform effect adjusted) spatial spot gene raw nUMI counts (spots * genes).

  • scRNA_decode_avg_df (dataframe) – CVAE decodered average gene expression (normalized) of cell-types in scRNA-seq data (cell-types * genes).

  • new_markers (list or None) – marker genes from re-run DE on CVAE transformed scRNA-seq data. It will be None if not re-run DE (rerun_DE=False).

  • cvae_pred (dataframe or None) – cell-type proportions of spatial spots predicted or transferred by CVAE. It will be None if no way to got initial guess of cell-type proportions (spots * cell-types).

cvae.build_CVAE_whole(spatial_file, ref_file, ref_anno_file, marker_file, n_hv_gene, n_marker_per_cmp, n_pseudo_spot, pseudo_spot_min_cell, pseudo_spot_max_cell, seq_depth_scaler, cvae_input_scaler, cvae_init_lr, num_hidden_layer, use_batch_norm, cvae_train_epoch, use_spatial_pseudo, redo_de, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, diagnosis, filter_cell, filter_gene)[source]

read related CSV files, build CVAE to adjust platform effect, return transformed spatial gene expression and scRNA-seq cell-type gene signature

Parameters:
  • spatial_file (string) – full path of input csv file of raw nUMI counts in spatial transcriptomic data (spots * genes).

  • ref_file (string) – full path of input csv file of raw nUMI counts in scRNA-seq data (cells * genes).

  • ref_anno_file (string) – full path of input csv file of cell-type annotations for all cells in scRNA-seq data.

  • marker_file (string) – full path of input csv file of cell-typee marker gene expression (cell-types * genes).

  • n_hv_gene (int) – number of highly variable genes for CVAE.

  • n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE.

  • n_pseudo_spot (int) – number of pseudo-spots.

  • pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot.

  • pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot.

  • seq_depth_scaler (int) – a scaler of scRNA-seq sequencing depth.

  • cvae_input_scaler (int) – maximum value of the scaled input for CVAE.

  • cvae_init_lr (float) – initial learning rate for training CVAE.

  • num_hidden_layer (int) – number of hidden layers in encoder and decoder.

  • use_batch_norm (bool) – whether to use Batch Normalization.

  • cvae_train_epoch (int) – max number of training epochs for the CVAE.

  • use_spatial_pseudo (int) – whether to generate “pseudo-spots” in spatial condition.

  • redo_de (bool) – whether to redo DE after CVAE transformation.

  • use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting.

  • p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering.

  • fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering.

  • pct1_cutoff (float) – threshold of pct.1 in marker genes filtering.

  • pct2_cutoff (float) – threshold of pct.2 in marker genes filtering.

  • sortby_fc (bool) – whether to sort marker genes by fold change.

  • diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection.

  • filter_cell (bool) – whether to filter cells before DE.

  • filter_gene (bool) – whether to filter genes before DE.

Returns:

  • spatial_transformed_numi (dataframe) – CVAE transformed (platform effect adjusted) spatial spot gene raw nUMI counts (spots * genes).

  • scRNA_decode_avg_df (dataframe) – CVAE decodered average gene expression (normalized) of cell-types in scRNA-seq data (cell-types * genes).

  • new_markers (list or None) – marker genes from re-run DE on CVAE transformed scRNA-seq data. It will be None if not re-run DE.

  • cvae_pred (dataframe or None) – cell-type proportions of spatial spots predicted or transferred by CVAE. It will be None if no way to got initial guess of cell-type proportions (spots * cell-types).

cvae.celltype2props(celltype_anno, celltype_order)[source]

calculate cell-type proportions matrix given one cell-type annotation

Parameters:
  • celltype_anno (dataframe) – cell-type annotations. Only 1 column named <celltype>

  • celltype_order (list) – already sorted unique cell-types. Its order matters, and will be the order in cell-type proportions (columns) and cell-type gene expression profile (rows)

Returns:

celltype_prop – cell-type proportions, columns are already sorted cell-types

Return type:

dataframe

cvae.combine_spatial_spots(exp_df, n_spot, pseudo_spot_min_cell, pseudo_spot_max_cell)[source]

we also generate “pseudo-spots” by combining spatial spots

here we do not know the true cell-type proportions in spatial spots, so we also do not know the proportions in generated “pseudo-spots”. We just ignore it, and only use the expressions for CVAE training

Parameters:
  • exp_df (dataframe) – normalized gene expression (spots * genes)

  • n_spot (int) – number of pseudo spots need to be generated, including training and validation set

  • pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot

  • pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot

Returns:

pseudo_spots_df – pseudo-spot gene expression (pseudo-spots * genes; NO original cells included)

Return type:

dataframe

cvae.generate_pseudo_spots(exp_df, celltype_anno, n_spot, celltype_order, pseudo_spot_min_cell, pseudo_spot_max_cell)[source]

generate pseudo-spots for CVAE training by randomly combining scRNA-seq cells

UPDATE:

now we separate the pseudo-spots and all scRNA-seq cells, i.e. we DO NOT add all cells to the end of the dataframe as special pseudo-spots with only one cell

if n_spot=0, i.e. no pseudo-spots, then return an empty dataframe

Parameters:
  • exp_df (dataframe) – normalized gene expression (cell * genes)

  • celltype_anno (dataframe or None) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>

  • n_spot (int) – number of pseudo spots need to be generated, including training and validation set

  • celltype_order (list) – already sorted unique cell-types. Its order matters, and will be the order in cell-type proportions (columns) and cell-type gene expression profile (rows)

  • pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot

  • pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot

Returns:

  • pseudo_spots_df (dataframe) – pseudo-spot gene expression (pseudo-spots * genes; NO original cells included)

  • pseudo_spots_celltype_prop (dataframe) – pseudo-spot cell-type proportions (pseudo-spots * cell-types; NO original cells included)

  • n_cell_in_spot (list) – number of cells/spots in pseudo-spots (NO original cells included)

cvae.transferProps(query, ref, ref_props, n_neighbors=10, sigma=1, use_embedding='PCA', pca_dimension=None)[source]

transfer cell-type proportions by select K Nearest Neighbors in ref and take Gaussian weighted average of ref proportions

Parameters:
  • query (2-D numpy matrix) – encoder embeddings of spatial spots (spots * latent layer neurons).

  • ref (2-D numpy matrix) – encoder embeddings of scRNA-seq cells and pseudo-spots in scRNA-seq condition (cells+pseudo-spots * latent layer neurons).

  • ref_props (2-D numpy matrix) – cell-type proportion matrix of scRNA-seq cells and pseudo-spots (cells+pseudo-spots * cell-types).

  • n_neighbors (int, optional) – Number of neighbors to use. The default is 10.

  • sigma (float, optional) – Standard deviation for the Gaussian weighting function. The default is 1.

  • use_embedding (str, optional) – which embedding to use, either PCA, UMAP or none. The default is PCA.

  • pca_dimension (int, optinal) – specify the number of dimensions for PCA reduction. If set to None, the reduced dimension will be one-third of the input dimension.

Returns:

query_props – cell-type proportion matrix for spatial spots.

Return type:

2-D numpy matrix

cvaeglrm

Created on Tue May 10 03:15:36 2022

@author: hill103

this script is the main function of the whole CVAE-GLRM pipeline

pipeline steps:

  1. receive and parse the command line parameters and then pass the parameters to CVAE-GLRM model

  2. pre-process

  3. building CVAE (if available)

  4. GLRM model fitting

  5. post-process and save the results

cvaeglrm.main()[source]

main function

Parameters:

None.

Return type:

None.

diagnosis_plots

Created on Thu Apr 25 06:02:34 2024

@author: hill103

We move all functions for generating diagnosis plots to here.

diagnosis_plots.defineColor(n_spatial_spot, scRNA_celltype)[source]

generate n visually distinct colours for cell-types. Use these colours across whole pipeline for consistency

Parameters:
  • n_spatial_spot (int) – number of spatial spots.

  • scRNA_celltype (dataframe) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>.

Returns:

plot_colors – generated color palette, keys are cell-types together with the number of cells with this cell-type, and values are RGB colors.

Return type:

dict

diagnosis_plots.diagnosisCVAE(cvae, encoder, decoder, spatial_embed, spatial_transformed_df, spatial_transformed_numi, pseudo_spatial_embed, scRNA_celltype, celltype_order, celltype_count_dict, scrna_cell_celltype_prop, scRNA_embed, scrna_n_cell, pseudo_spots_celltype_prop, n_cell_in_spot, pseudo_spot_embed, scRNA_decode_df, scRNA_decode_avg_df, new_markers, plot_colors)[source]

save CVAE related Keras models to h5 file, generate figures to diagnosis the training of CVAE

Parameters:
  • cvae (Keras model) – already trained CVAE model

  • encoder (Keras model) – encoder part extract from CVAE model

  • decoder (Keras model) – a separated decoder whose weights are already updated, i.e. it should give the same decoded output with CVAE

  • spatial_embed (2-D numpy array) – mu in latent space of spatial spots (spots * latent neurons)

  • spatial_transformed_df (dataframe) – CVAE transformed (platform effect adjusted) spatial spot gene normalized expressions (spots * genes)

  • spatial_transformed_numi (dataframe) – CVAE transformed (platform effect adjusted) spatial spot gene raw nUMI counts (spots * genes)

  • pseudo_spatial_embed (dataframe) – mu in latent space of spatial pseudo-spots (spatial pseudo-spots * latent neurons). It will have 0 rows if no spatial pseudo-spots generated

  • scRNA_celltype (dataframe) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>

  • celltype_order (list) – already sorted unique cell-types. Its order matters, and will be the order in pseudo_spots_celltype (columns) and cell-type gene expression profile (rows)

  • celltype_count_dict (dict) – number of cells in reference scRNA-seq data for each cell-type

  • scrna_cell_celltype_prop (dataframe) – scRNA-seq cells cell-type proportions (cells * cell-types)

  • scRNA_embed (2-D numpy array) – mu in latent space of scRNA-seq cells (cells * latent neurons)

  • scrna_n_cell (list) – number of cells in scRNA-seq data. Note augmented single cells also included

  • pseudo_spots_celltype_prop (dataframe) – pseudo-spot cell-type proportions (pseudo-spots * cell-types; NO scRNA-seq cells included)

  • n_cell_in_spot (list) – number of cells in pseudo-spots (no scRNA-seq cells included)

  • pseudo_spot_embed (dataframe) – mu in latent space of pseudo spots (pseudo-spots * latent neurons); NO scRNA-seq cells included. It will have 0 rows if no scRNA-seq pseudo-spots generated

  • scRNA_decode_df (dataframe) – CVAE decodered gene expression (normalized) of scRNA-seq cells (cells * genes)

  • scRNA_decode_avg_df (dataframe) – CVAE decodered average gene expression (normalized) of cell-types in scRNA-seq data (cell-types * genes)

  • new_markers (list or None) – marker genes from re-run DE on CVAE transformed scRNA-seq data. It will be None if not re-run DE (rerun_DE=False)

  • plot_colors (dict) – color palette for plot

Return type:

None.

diagnosis_plots.diagnosisParamsSpotwiseLambdarTuning(x, strategy)[source]

generate histogram plot show optimal lambda_r in hyper parameter tuning by BIC

Parameters:
  • x (list) – a list of optimal lambda_r

  • strategy (string) – the used strategy for tuning lambda_r

Return type:

None.

diagnosis_plots.diagnosisParamsTuning(x, y, optimal_idx, x_label, y_label)[source]

generate scatter plot show performance in hyper parameter tuning by cross-validation

Parameters:
  • x (list) – a list of candidates of hyper parameter

  • y (list) – corresponding performance for each hyper parameter candidate. NOTE not all candidates have corresponding performance due to early stop

  • optimal_idx (int) – index of list to indicate the optimal hyper parameter value which achieve the best performance

  • x_label (string) – label of x axis in the plot, representing the number of hyper parameter

  • y_label (string) – label of y axis in the plot, representing the performance metric used

Return type:

None.

diagnosis_plots.plotCVAELoss(history)[source]

plot training and validation loss in CVAE training

NOTE validation loss may be unavailable if no validation data in training

Parameters:

history (History object) – History object returned by Keras Model.fit in CVAE training.

Return type:

None.

diagnosis_plots.plot_imputation(df, grid_df, contours, hierarchy, figname, figsize=(6.4, 4.8))[source]

draw scatter plot of spatial spots and imputed spots for diagnosis.

Parameters:
  • df (dataframe) – dataframe of spatial spots at original resolution, with columns ‘x’, ‘y’, ‘border’.

  • grid_df (dataframe) – dataframe of generated grid points at higher resolution, with columns ‘x’, ‘y’.

  • contours (tuple) – contours variable returned by cv2.findContours, used for creating grid.

  • hierarchy (3-D numpy array (1 * #contours * 4)) – hierarchy variable returned by cv2.findContours, used for creating grid.

  • figname (string) – name of figure.

  • figsize (tuple, optional) – figure size. The default is (6.4, 4.8).

Return type:

None.

diagnosis_plots.rawInputUMAP(spatial_df, scRNA_df, scRNA_celltype, plot_colors)[source]

generate UMAP of spatial spots together with scRNA-seq cells

Parameters:
  • spatial_df (dataframe) – normalized gene expression in spatial transcriptomic data (spots * genes).

  • scRNA_df (dataframe) – normalized gene expression in scRNA-seq data (cells * genes).

  • scRNA_celltype (dataframe) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>.

  • plot_colors (dict) – color palette for plot.

Return type:

None.

hyper_parameter_optimization

Created on Sat May 7 23:32:08 2022

@author: hill103

this script stores functions of cross-validation to find optimal values for hyper-parameters:
  1. lambda_r: for Adaptive Lasso

  2. lambda_g: for graph edge weight, will affect the Laplacian Matrix

in k-fold cross-validation, we random subset the GENEs, then predict the gene expression in validation fold the performance metric is RMSE or negative log-likelihood of predicted gene expressions in validation fold

UPDATE: we add BIC for lambda_r selection

hyper_parameter_optimization.BIC_find_lambda_r_one_spot(mu, y_vec, N, spot_name, mle_theta, mle_e_alpha, gamma_g, sigma2, lasso_weight, candidate_list, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, use_AIC=False, reinitialize_theta=False, verbose=False)[source]

find optimal value for hyper-parameter lambda_r by BIC

Parameters:
  • mu (2-D numpy matrix) – matrix of celltype specific marker gene expression (celltypes * genes).

  • y_vec (1-D numpy array) – spatial gene expression (length #genes).

  • N (int or None) – sequencing depth of this spot. If it’s None, use sum of observed marker gene expressions as sequencing depth.

  • spot_name (string) – name of this spot.

  • mle_theta (1-D numpy array (length #celltypes)) – estimated theta (celltype proportion) by MLE.

  • mle_e_alpha (float) – estimated e_alpha by MLE.

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.

  • lasso_weight (1-D numpy array (length #celltypes)) – weight of Adaptive Lasso, 1 ./ MLE theta

  • candidate_list (list) – candidates for the hyper-parameter

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • use_AIC (bool, optional) – if True, calculate and return AIC.

  • reinitialize_theta (bool, optional) – if True, reinitialize theta as 1/#celltype and e_alpha as 1 for optimization.

  • verbose (bool, optional) – if True, print more information.

Returns:

optimal lambda_r value and corresponding optimized theta (for refit)

Return type:

tuple of (float, 1-D numpy array)

hyper_parameter_optimization.BIC_find_theta_subset_one_spot(mu, y_vec, N, spot_name, mle_theta, mle_e_alpha, gamma_g, sigma2, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, use_AIC=False, reinitialize_theta=False, verbose=False)[source]

Select sparse theta for one spot by nested subset selection and BIC.

Strategy: - Start from the MLE theta. - Cell types with mle_theta == 0 are fixed to zero and never considered. - Among cell types with mle_theta > 0, sort by mle_theta (ascending). - Construct nested active sets by sequentially setting the smallest MLE-positive thetas to zero. - For each active set, refit the (unpenalized) local model with the remaining theta entries free (sum-to-one constraint), compute BIC, and select the active set with minimal BIC.

Parameters:
  • mu (2-D numpy matrix) – matrix of celltype specific marker gene expression (celltypes * genes).

  • y_vec (1-D numpy array) – spatial gene expression (length #genes).

  • N (int or None) – sequencing depth of this spot. If it’s None, use sum of observed marker gene expressions as sequencing depth.

  • spot_name (string) – name of this spot.

  • mle_theta (1-D numpy array (length #celltypes)) – estimated theta (celltype proportion) by MLE.

  • mle_e_alpha (float) – estimated e_alpha by MLE.

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • use_AIC (bool, optional) – if True, calculate and return AIC.

  • reinitialize_theta (bool, optional) – if True, reinitialize theta as 1/#celltype and e_alpha as 1 for optimization.

  • verbose (bool, optional) – if True, print more information.

Returns:

  • optimal theta (1-D numpy array (length #celltypes)) – updated theta (celltype proportion); already after re-fit.

  • optimal e_alpha (float) – updated e_alpha (spot-specific effect); already after re-fit.

hyper_parameter_optimization.calcHVBaseModelLoss(theta, e_alpha, y, mu, gamma_g, sigma2, hv_x, hv_log_p, N, non_zero_mtx)[source]

calculate the negative log-likelihood of Poisson log-normal distribution + heavy_tail for all spots

Parameters:
  • theta (2-D numpy matrix) – estimated cell-type proportions (spots * celltypes).

  • e_alpha (1-D numpy array) – spot-specific effect for all spots.

  • y (2-D numpy matrix) – observed gene nUMI (spots * genes).

  • mu (2-D numpy matrix) – celltype specific marker genes (celltypes * genes).

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – estimated variance paramter of the lognormal distribution of ln(lambda). All genes share the same variance.

  • hv_x (1-D numpy array) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • N (1-D numpy array) – sequencing depth for all spots.

  • non_zero_mtx (None or 2-D numpy matrix (spots * genes)) – If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

Returns:

sum of negative log-likelihood across all spots.

Return type:

float

hyper_parameter_optimization.calc_AIC_BIC(w, y_vec, mu, gamma_g, sigma2, hv_x, hv_log_p, N, use_AIC=False)[source]

calculate BIC or AIC

Parameters:
  • w (1-D numpy array (length #celltypes)) – estimated theta (celltype proportion) multiply e_alpha.

  • y_vec (1-D numpy array) – spatial gene expression (length #genes).

  • mu (2-D numpy matrix) – matrix of celltype specific marker gene expression (celltypes * genes).

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • N (int or None) – sequencing depth of this spot. If it’s None, use sum of observed marker gene expressions as sequencing depth.

  • use_AIC (bool, optional) – if True, calculate and return AIC.

Returns:

  • value (float) – calculated BIC or AIC.

  • likelihood_part (float) – negative log-likelihood.

  • dof (float) – degree of freedom, number of non-zeros minus 1.

hyper_parameter_optimization.checkEarlyStop(x, num=3)[source]

check whether we can early stop hyper-parameter tuning without test on rest candidates

if we observe num successive increasings, then do early stop

Parameters:

x (list) – the performance metric of already tested candidates.

Return type:

either None (means continuing test next candidate) or index of the optimal candidate with smallest performance metric value.

hyper_parameter_optimization.cv_find_lambda_g(data, L, stage1_theta, stage1_e_alpha, theta_mask, gamma_g, sigma2, candidate_list, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, use_admm=False, use_likelihood=True, k=5, diagnosis=False)[source]

find optimal value for hyper-parameter lambda_g by k fold cross-validation

Parameters:
  • data (Dict) –

    a Dict contains all info need for modeling:

    X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

    Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

    A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

    N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

    non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

    spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

    gene_names: a list of string of gene symbols. Only keep actually used marker genes.

    celltype_names: a list of string of celltype names.

    initial_guess: initial guess of cell-type proportions of spatial spots.

  • L (scipy sparse matrix) – already calculated Laplacian Matrix (spots * spots).

  • stage1_theta (3-D numpy array (spots * celltypes * 1)) – estimated theta (celltype proportion) in stage 1.

  • stage1_e_alpha (1-D numpy array) – estimated e_alpha (spot-specific effect) in stage 1.

  • theta_mask (3-D numpy array (spots * celltypes * 1)) – mask for cell-type proportions (1: present, 0: not present). Only used for stage 2 theta optmization.

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.

  • candidate_list (list) – candidates for the hyper-parameter

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • use_admm (bool, optional) – whether use ADMM iteration or directly use Laplacian loss function. The default is False, i.e. NOT use ADMM in cross-validation.

  • use_likelihood (bool, optional) – whether use negative log-likelihood as performance metric in cross-validation. The default is True, if False use RMSE of predicted gene expression.

  • k (int, optional) – the number of folds in cross-validation, The default value is 5.

  • diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection

Returns:

optimal lambda_g value

Return type:

float

hyper_parameter_optimization.cv_find_lambda_r(data, mle_theta, mle_e_alpha, gamma_g, sigma2, lasso_weight, candidate_list, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, use_admm=False, use_likelihood=True, k=5, diagnosis=False, verbose=False)[source]

find optimal value for hyper-parameter lambda_r by k fold cross-validation

Parameters:
  • data (Dict) –

    a Dict contains all info need for modeling:

    X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

    Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

    A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

    N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

    non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

    spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

    gene_names: a list of string of gene symbols. Only keep actually used marker genes.

    celltype_names: a list of string of celltype names.

    initial_guess: initial guess of cell-type proportions of spatial spots.

  • mle_theta (3-D numpy array (spots * celltypes * 1)) – estimated theta (celltype proportion) by MLE.

  • mle_e_alpha (1-D numpy array) – estimated e_alpha (spot-specific effect) by MLE.

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.

  • lasso_weight (3-D numpy array (spots * celltypes * 1)) – weight of Adaptive Lasso, 1 ./ MLE theta

  • candidate_list (list) – candidates for the hyper-parameter

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • use_admm (bool, optional) – whether use ADMM iteration or directly use Adative Lasso loss function. The default is False, i.e. NOT use ADMM in cross-validation as ADMM will cost more time.

  • use_likelihood (bool, optional) – whether use negative log-likelihood as performance metric in cross-validation. The default is True, if False use RMSE of predicted gene expression.

  • k (int, optional) – the number of folds in cross-validation, The default value is 5.

  • diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection.

  • verbose (bool, optional) – if True, print more information.

Returns:

optimal lambda_r value

Return type:

float

imputation

Created on Sat Jan 28 07:15:15 2023

@author: hill103

this script perform imputation on cell-types and gene expression of spatial spots

Steps:
  1. Identify edge spots (both outer and inner edges) at the original resolution

  2. Generate grid at a specified higher resolution (location of imputed smaller spots)

  3. Initialize cell type proportions for grid points

  4. Impute cell type proportions theta at higher resolution

  5. Further impute gene expression X at higher resolution

imputation.check_technique(df)[source]

check the input spatial tissue map is derived from ST or 10x Visium technique. The distance of two neighboring spots in the same row is 1 for ST and 2 for 10x. We only select 5 rows with most spots for checking.

Parameters:

df (dataframe) – dataframe of spatial spots, with columns ‘x’, ‘y’.

Returns:

tech_mode – the technique for this dataset, either ‘st’ or ‘visium’

Return type:

str

imputation.construct_M_matrix(df, sigma, phi, lazy_rw=False, alpha=0.5)[source]
construct nearest neighbor random walk matrix M, also called transition probability matrix in compute science. The value in M represents the probability that, given a node, the random walker will stay on that node in the next step

W(x,y) = exp(-||x-y||^2 / 2*sigma^2)

matrix M is based on a Gaussian kernel W with SD/bandwidth sigma and distance threshold phi

M = D^−1 * W

if lazy_rw is True, i.e. use lazy random walk with a probability alpha of staying at the current node, an additional lazy transition matrix will be calculated

P_lazy = alpha * I + (1−alpha) * P

Parameters:
  • df (dataframe) – dataframe of generated grid points at higher resolution, with columns ‘x’, ‘y’.

  • sigma (float) – standard deviation of Gaussian kernel W.

  • phi (float) – threshold of Euclidean distance in Gaussian kernel W. If two points with distance > phi, the corresponding weight in W will be 0.

  • lazy_rw (bool, optional) – if True, return lazy random walk matrix. The default is False.

  • alpha (float, optional) – the probability alpha of staying at the current node. The default is 0.5.

Returns:

M – the transition probalility matrix for random walk.

Return type:

2-D numpy array

imputation.create_grid(df, contours, hierarchy, contour_info, grid_step, miss_spot_threshold, add_edge=False)[source]

create a high resolution grid inside the contours. we first create a grid on the bounding box of the contour with any shape, then filter out any points outside the contour. we also consider the generated points falling inside the holes and determine whether to filter them. add_edge controls whether add (x,y) coordinates of edge points to the grid, which will cause the step of grid to be uneven. we filter the generated points row-by-row. creatied grid are independent across multiple outer contours if they exist, so in the imputed heatmap, there may be some empty lines inside one contour shape as some points at the same line are presented in other contours

Parameters:
  • df (dataframe) – dataframe of spatial spots, with columns ‘x’, ‘y’, and extra columns: ‘border’, ‘inner_border’, ‘outer_border’.

  • contours (tuple) – contours variable returned by cv2.findContours, used for creating grid.

  • hierarchy (3-D numpy array (1 * #contours * 4)) – hierarchy variable returned by cv2.findContours, used for creating grid.

  • contour_info (list of dicts) – parsed contour information used for creating grid.

  • grid_step (float) – a float number in (0,1), served as the steps of grid.

  • miss_spot_threshold (int) – if the number of missing (uncaptured) spots in an identified hole is less or equal this value, this hole will be ignored as if there is no hole at all.

  • add_edge (bool, optional) – if True, add (x,y) coordinates of contour edge points to the grid. Default is False.

Returns:

grid_df – dataframe of generated poins (passed filtering), with columns ‘x’, ‘y’, ‘contour_idx’.

Return type:

dataframe

imputation.do_imputation(loc_file, theta_file, spatial_file, grid_step=0.5, miss_spot_threshold=1, add_edge=False, sigma=0.35, phi=1, diagnosis=False)[source]

perform imputation given location, cell type proportions and nUMI counts of original spatial spots.

optimal hyperparameters in the Gaussian kernel sigma and phi are selected based on analysis on coarse-grained data.

this function can be called in Python to perform imputation.

Parameters:
  • loc_file (string) – full path of input csv file of spot locations in spatial transcriptomic data, with columns x and y (spots * 2).

  • theta_file (string) – full path of input csv file of cell type proportions of spots in spatial transcriptomic data (spots * cell types).

  • spatial_file (string) – full path of input csv file of raw nUMI counts in spatial transcriptomic data (spots * genes).

  • grid_step (float, optional) – step size of generated grid at higher resolution.It also equals the distance of two neighboring imputed spots, given the distance of two neighboring spatial spots is 1 for ST or 2 for 10x Visium techniques. The default is 0.5.

  • miss_spot_threshold (int, optional) – if the number of missing (uncaptured) spots in an identified hole is less or equal this value, this hole will be ignored as if there is no hole at all. The default is 1.

  • add_edge (bool, optional) – if True, add (x,y) coordinates of contour edge points to the grid. Default is False.

  • sigma (float, optional) – standard deviation of Gaussian kernel W. The default is 0.35.

  • phi (float, optional) – threshold of Euclidean distance in Gaussian kernel W. If two points with distance > phi, the corresponding weight in W will be 0. The default is 1.

  • diagnosis (bool, optional) – if True save the scatter plot including spatial spots and imputed spots.

Returns:

  • impute_df (dataframe) – ataframe of generated grid poins at higher resolution (passed filtering), with columns ‘x’, ‘y’, ‘contour_idx’.

  • impute_theta (dataframe) – imputed cell type proportions of generated grid points at higher resolution, with columns are cell types.

  • impute_X (dataframe) – imputed normalized gene expression of generated grid points at higher resolution, rows are grid points, columns are gene expressions.

imputation.do_random_walk(df, theta, sigma, phi, n_step=1, lazy_rw=False, alpha=0.5)[source]

impute cell type proportion theta for generated grid points at higher resolution using one step random walk.

note if the input cell type proportions are normalized (i.e., each row sums to 1), the imputed results using a valid transition probability matrix will also be normalized.

Parameters:
  • df (dataframe) – dataframe of generated grid points at higher resolution, with columns ‘x’, ‘y’.

  • theta (dataframe) – initial cell type proportions of generated grid points at higher resolution, with columns are cell types.

  • sigma (float) – standard deviation of Gaussian kernel W.

  • phi (float) – threshold of Euclidean distance in Gaussian kernel W. If two points with distance > phi, the corresponding weight in W will be 0.

  • n_step (int, optional) – steps for random walk. The default is 1.

  • lazy_rw (bool, optional) – if True, return lazy random walk matrix. The default is False.

  • alpha (float, optional) – the probability alpha of staying at the current node. The default is 0.5.

Returns:

impute_theta – imputed cell type proportions of generated grid points at higher resolution, with columns are cell types.

Return type:

dataframe

imputation.identify_edges(df, mode, keep_filled=False)[source]

identify edge spots. Column ‘border’ of Edge spots are True. NOTE edge spot identification is based on ‘x’ and ‘y’ coordinate indexes not pixels. when keep_filled=True, filled spots for 10x will be kept for creating grid.

Parameters:
  • df (dataframe) – dataframe of spatial spots, with columns ‘x’, ‘y’.

  • mode (string) – either ‘st’ for Spatial Transcriptomics technique, or ‘visium’ for 10x Genomics Visium technique.

  • keep_filled (bool, optional) – if True, filled spots for 10x will be kept for creating grid.

Returns:

  • df (dataframe) – dataframe with extra columns: ‘border’, ‘inner_border’, ‘outer_border’, ‘filled’.

  • contours (tuple) – contours variable returned by cv2.findContours, used for creating grid.

  • hierarchy (3-D numpy array (1 * #contours * 4)) – hierarchy variable returned by cv2.findContours, used for creating grid.

  • contour_info (list of dicts) – parsed contour information used for creating grid.

imputation.impute_expression(X, theta, grid_theta)[source]

impute gene expression for generated grid points at higher resolution. The imputed expression are assured to sum to 1 for each spot/row. And all values in imputed expression are non-negative, as multiplying non-negative numbers will always yield a non-negative result.

X_imputed = theta_imputed * theta^+ * X

theta^+ = (theta^T * theta)^(−1) * theta^T

Parameters:
  • X (dataframe) – nUMI counts of spatial spots, rows are spots and columns are gene expressions

  • theta (dataframe) – cell type proportions of spatial spots at original resolution, with columns are cell types.

  • grid_theta (dataframe) – imputed cell type proportions of generated grid points at higher resolution, with columns are cell types.

Returns:

X_imputed – imputed normalized gene expression of generated grid points at higher resolution, rows are grid points, columns are gene expressions.

Return type:

dataframe

imputation.impute_spots_with_theta(df, theta, grid_step=0.5, miss_spot_threshold=1, add_edge=False, sigma=0.35, phi=1, diagnosis=False)[source]

given the spatial spots location and cell type proportions, impute smaller spots at a higher resolution, return the imputed location and cell type proportions for these smaller spots.

Parameters:
  • df (dataframe) – dataframe of spatial spots at original resolution, with columns ‘x’, ‘y’.

  • theta (dataframe) – cell type proportions of spatial spots at original resolution, with columns are cell types.

  • grid_step (float, optional) – step size of generated grid at higher resolution.It also equals the distance of two neighboring imputed spots, given the distance of two neighboring spatial spots is 1 for ST or 2 for 10x Visium techniques. The default is 0.5.

  • miss_spot_threshold (int, optional) – if the number of missing (uncaptured) spots in an identified hole is less or equal this value, this hole will be ignored as if there is no hole at all. The default is 1.

  • add_edge (bool, optional) – if True, add (x,y) coordinates of contour edge points to the grid. Default is False.

  • sigma (float, optional) – standard deviation of Gaussian kernel W. The default is 0.35.

  • phi (float, optional) – threshold of Euclidean distance in Gaussian kernel W. If two points with distance > phi, the corresponding weight in W will be 0. The default is 1.

  • diagnosis (bool, optional) – if True save the scatter plot including spatial spots and imputed spots.

Returns:

  • impute_df (dataframe) – dataframe of generated grid points at higher resolution, with columns ‘x’, ‘y’, ‘contour_idx’.

  • impute_theta (dataframe) – imputed cell type proportions of generated grid points at higher resolution, with columns are cell types.

imputation.initialize_theta_grid_points(df, theta, grid_df)[source]

initialize cell type proportions grid_theta for generated grid points in higher resolution. for one grid point, grid_theta is set as the theta value of spatial spot in original resolution with the smallest Euclidean distance. if there are multiple spatial spots with the same smallest Euclidean distance, then take average of theta of those spots finally we will normalize all grid_theta to sum to 1.

Parameters:
  • df (dataframe) – dataframe of spatial spots at original resolution, with columns ‘x’, ‘y’.

  • theta (dataframe) – cell type proportions of spatial spots at original resolution, with columns are cell types.

  • grid_df (dataframe) – dataframe of generated grid points at higher resolution, with columns ‘x’, ‘y’.

Returns:

grid_theta – cell type proportions of generated grid points at higher resolution, with columns are cell types.

Return type:

dataframe

imputation.main()[source]
imputation.parseOpt()[source]

parse the command line parameters and return them as a dict

Parameters:

None.

Returns:

paramdict

parsed command line parameters for model, including:

spatial_file : full file path of spatial transcriptomic data

loc_file : full file path of spot locations in spatial transcriptomic data

prop_file : full file path of cell type proportions of spots in spatial transcriptomic data

diameter : the physical diameter of spatial spots

impute_diameter : target spot diameter for imputation

hole_min_spots : threshold of number of uncaptured spots to validate holes

preserve_shape : whether to preserve the exact shape of tissue map

diagnosis : whether to draw plot for diagnosis

Return type:

Dict

imputation.usage()[source]

help information of the main function

Parameters:

None.

Return type:

None.

local_fit_numba

Created on Thu May 5 22:17:11 2022

@author: hill103

this script implement the optimization of theta and sigma^2 using Poisson log-normal distribution + heavy-tail

use Numba parallel for best performance

we calculate the likelihood for each gene sequentially, utilizing Numba’s parallel capabilities to accelerate the computation within each gene’s calculation.

NOTE: simplifying the loss-computation code and removing unneeded nested wrappers yielded a speedup

and print not supported in Numba function

functions for likelihood calculation:

calc_hv_numba: Numba function so parallel computing by Numba; calculate likelihoods given an array of y and mu then return; NOT related with cache dict

hv_comb: return sum of negative log-likelihoods + gradient of w; used for theta optimization; directly call hv_numba

hv_wrapper: return sum of negative log-likelihoods; used for sigma^2 optimization; directly call hv_numba

hv_numba: call calc_hv_numba and return array of likelihoods; support caching (DELETED)

sequence of function calls for parameter optimization:

update_theta -> objective_loss_theta -> hv_comb -> calc_hv_numba

update_sigma2 -> objective_loss_sigma2 -> hv_wrapper -> calc_hv_numba

calcHVBaseModelLoss -> hv_wrapper in hyper parameter tunning

class local_fit_numba.RandomDisplacementBounds(xmin, xmax, stepsize=0.5)[source]

Bases: object

A class to perform random displacement with bounds on parameters during optimization.

This step function is designed to modify the basinhopping algorithm’s step-taking behavior by ensuring that new positions remain within specified bounds. This implementation uses a direct calculation to determine the minimum and maximum allowable steps rather than using acceptance-rejection sampling, which can be found in the more general approach discussed on StackOverflow (https://stackoverflow.com/a/21967888/2320035). The approach has been modified for enhanced performance with specific bounds.

ref: https://stackoverflow.com/questions/47055970/how-can-i-write-bounds-of-parameters-by-using-basinhopping

xmin

The lower bounds of the parameters. None indicates unbounded.

Type:

np.ndarray or list

xmax

The upper bounds of the parameters. None indicates unbounded.

Type:

np.ndarray or list

stepsize

The maximum step size to take in any parameter direction.

Type:

float

local_fit_numba.adaptive_lasso(nu, rho, lambda_r=1.0, lasso_weight=None)[source]

update theta_tilde by Adaptive Lasso and ADMM loss

theta_tilde = argmin lambda_r*lasso_weight*theta_tilde + 1/(2*rho) * (||theta_tilde-theta_hat+u_tilde||_2)^2

that is

  • if theta_tilde>=0, theta_tilde = theta_hat - u_tilde - lambda_r*rho*lasso_weight

  • if theta_tilde<0, theta_tilde = theta_hat - u_tilde + lambda_r*rho*lasso_weight

change it to

  • if theta_hat-u_tilde-lambda_r*rho*lasso_weight>=0, theta_tilde = theta_hat - u_tilde - lambda_r*rho*lasso_weight

  • if theta_hat-u_tilde+lambda_r*rho*lasso_weight<=0, theta_tilde = theta_hat - u_tilde + lambda_r*rho*lasso_weight

  • if theta_hat-u_tilde-lambda_r*rho*lasso_weight<0 or theta_hat-u_tilde+lambda_r*rho*lasso_weight>0, not defined, let theta_tilde = 0

and nu = theta_hat - u_tilde

WANRING: negative theta tilde value observed, so also clip it to >= 0

Parameters:
  • nu (3-D numpy array (spots * celltypes * 1)) – variable for ADMM penalty of all spots in 3 part ADMM nu = theta_hat (used for regularization/penalty) - u_tilde (scaled dual variables to make theta_tilde = theta_hat)

  • rho (float) – parameter for the strength of ADMM loss to make theta_tilde equals theta_hat

  • lambda_r (float, optional) – strength for Adaptive Lasso penalty. The default is 1.0.

  • lasso_weight (3-D numpy array (spots * celltypes * 1), optional) – calculated weight for adaptive lasso. The default is None.

Returns:

updated theta_tilde.

Return type:

3-D numpy array (spots * celltypes * 1)

local_fit_numba.calc_hv_numba(MU, y_vec, hv_x, hv_log_p, output, gamma)[source]

calculate the likelihood value given the distribution parameters considering heavy-tail and assign them to the output variable

note that add a small value to avoid log(0)

Parameters:
  • MU (1-D numpy array) – mean value of log-normal distribution for all genes.

  • y_vec (1-D numpy array) – observed gene counts of one spot.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • output (1-D numpy array) – empty array used to store the calculated values.

  • gamma (float) – increment to calculate the heavy-tail probabilities.

Returns:

likelihood of input genes in one spot.

Return type:

1-D numpy array

local_fit_numba.fit_base_model_plus_laplacian(data, L, theta, e_alpha, gamma_g, sigma2, lambda_r=None, lasso_weight=None, hv_x=None, hv_log_p=None, theta_mask=None, opt_method='L-BFGS-B', global_optimize=False, hybrid_version=True, verbose=False)[source]

fit base model plus Laplacian penalty, this function is to replace the whole ADMM framework in GLRM stage two

Note: To keep consistent, the output of the fit result is still 3-Dimensional. The dimension transform will be performed outside this function if needed

Parameters:
  • data (Dict) –

    a Dict contains all info need for modeling:

    X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

    Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

    A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

    N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

    non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

    spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

    gene_names: a list of string of gene symbols. Only keep actually used marker genes.

    celltype_names: a list of string of celltype names.

    initial_guess: initial guess of cell-type proportions of spatial spots.

  • L (scipy sparse matrix (spots * spots)) – Laplacian matrix. Note the strenth lambda_g is already absorbed in the L

  • theta (3-D numpy array (spots * celltypes * 1)) – initial guess of theta (celltype proportion).

  • e_alpha (1-D numpy array) – initial guess of e_alpha (spot-specific effect).

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – initial guess of variance paramter of the lognormal distribution of ln(lambda). All genes and spots share the same variance. it may not be updated during this optimization, i.e. sigma2 is treated as an already optimized value.

  • lambda_r (float, optional) – strength for Adaptive Lasso penalty. The default is None here.

  • lasso_weight (3-D numpy array (spots * celltypes * 1), optional) – calculated weight for adaptive lasso. The default is None.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • theta_mask (3-D numpy array (spots * celltypes * 1), optional) – mask for cell-type proportions (1: present, 0: not present). Only used for stage 2 theta optmization.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • verbose (bool, optional) – if True, print more information.

Returns:

estimated model coefficients, including:

theta : celltype proportions (#spots * #celltypes * 1)

e_alpha : spot-specific effect (1-D array with length #spot)

sigma2 : variance paramter of the lognormal distribution (float)

gamma_g : gene-specific platform effect for all genes (1-D array with length #gene)

Return type:

Dict

local_fit_numba.generate_log_heavytail_array(x, sigma)[source]

generate a array of log density values of standard normal distribution + heavy-tail given a specific sigma

we assume the normal distribution is N(0, sigma^2), and transform to N(0, 1) by divided by sigma, and calculate the heavy-tail density values, then transform back to N(0, sigma^2) and take log. We can re-use these pre-calculated values for all genes and spots

after divided by sigma, the normal + heavy-tail integrate to 1, i.e. sum(p*gamma)=1, gamma is the interval

Parameters:
  • x (1-D numpy array) – data points served as x for calculation of probability density values

  • sigma (float) – SD of normal distribution, corresponding to the dispertion in GLRM. All genes and all spots share the same SD

Returns:

log density values of normal distribution N(0, sigma^2) + heavy-tail.

Return type:

1-D numpy array

local_fit_numba.hv_comb(w_vec, y_vec, mu, gamma_g, sigma2, hv_x, hv_log_p, N)[source]

a function to calculate the negative likelihood value given the distribution parameters considering heavy-tail plus corresponding gradient vector

also return 1st order derivative vector of w

this function will be used as target function for theta optimization

we assume:

y ~ Poisson(N*lambda)

ln(lambda) follow epsilon’s distribution ~ N(mu, sigma2)

then ln(N*lambda) ~ N(mu+ln(N), sigma2)

N is sequencing depth for this spot mu is alpha + log(theta*marker gene) + gamma

Now we also consider the heavy-tail instead of just normal distribution

Parameters:
  • w_vec (1-D numpy array) – e_alpha (spot-specific effect) * theta (celltype proportion) of one spot.

  • y_vec (1-D numpy array) – observed gene counts of one spot.

  • mu (2-D numpy matrix) – celltype specific marker genes (celltypes * genes)

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • N (float) – sequencing depth for this spot. If is None, use sum(y_vec) instead.

Returns:

sum of negative log-likelihood across all genes in one spot + gradient vector of w.

Return type:

tuple of (float, 1-D numpy array)

local_fit_numba.hv_wrapper(w_vec, y_vec, mu, gamma_g, sigma2, hv_x, hv_log_p, N)[source]

a wrapper to calculate the negative log-likelihood value given the distribution parameters considering heavy-tail

this wrapper will be used as target function for sigma^2 optimization

we assume:

y ~ Poisson(N*lambda)

ln(lambda) follow epsilon’s distribution ~ N(mu, sigma2)

then ln(N*lambda) ~ N(mu+ln(N), sigma2)

N is sequencing depth for this spot mu is alpha + log(theta*marker gene) + gamma

Now we also consider the heavy-tail instead of just normal distribution

Parameters:
  • w_vec (1-D numpy array) – e_alpha (spot-specific effect) * theta (celltype proportion) of one spot.

  • y_vec (1-D numpy array) – observed gene counts of one spot.

  • mu (2-D numpy matrix) – celltype specific marker genes (celltypes * genes)

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • N (float) – sequencing depth for this spot. If is None, use sum(y_vec) instead.

Returns:

sum of negative log-likelihood across all genes in one spot.

Return type:

float

local_fit_numba.objective_loss_theta(w_vec, y_vec, mu, gamma_g, sigma2, nu_vec, rho, lambda_r, lasso_weight_vec, lambda_l2, hybrid_version, hv_x, hv_log_p, N)[source]

calculate loss function for updating theta (celltype proportion)

the loss function contains four parts (defined for each spot separately)

also return gradient of w

  1. negative log-likelihood of the base model given observed data and initial parameter value. It sums across all genes

  2. a loss of ADMM to make theta equals theta_hat (used for regularization/penalty; optional, controlled by nu_vec and rho)

    1/(2*rho) (||w/sum(w) - theta_hat + u||_2)^2

u is the scaled dual variables to make theta = theta_hat and nu = theta_hat - u

and we did re-parametrization w = e^alpha * theta, so theta = w / sum(w)

  1. a loss of Adaptive Lasso to shrink theta (optional, controlled by lambda_r and lasso_weight_vec)

    lambda_r * (inner product(w/sum(w), lasso_weight))

  2. a loss of L2 penalty to shrink theta (optional, controlled by lambda_l2)

    lambda_l2 * (sum(squared(w/sum(w))))

Parameters:
  • w_vec (1-D numpy array) – e_alpha (spot-specific effect) * theta (celltype proportion) of one spot.

  • y_vec (1-D numpy array) – observed gene counts of one spot.

  • mu (2-D numpy matrix) – celltype specific marker genes (celltypes * genes)

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.

  • nu_vec (1-D numpy array) – variable for ADMM penalty of one spot in 3 part ADMM nu = theta_hat (used for regularization/penalty) - u (scaled dual variables to make theta = theta_hat)

  • rho (float) – parameter for the strength of ADMM loss to make theta equals theta_hat

  • lambda_r (float) – parameter for the strength of Adaptive Lasso loss to shrink theta

  • lasso_weight_vec (1-D numpy array) – weight of Adaptive Lasso, 1 ./ theta

  • lambda_l2 (float) – parameter for the strength of L2 panealty to shrink theta

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • N (float) – sequencing depth for this spot. If is None, use sum(y_vec) instead.

Returns:

the loss function (base model loss + ADMM loss + Adaptive Lasso loss + L2 loss) of one spot to update w (e_alpha*theta) + gradient vector

Return type:

a tuple (float, 1-D numpy array)

local_fit_numba.optimize_one_theta(mu, y_vec, N, this_warm_start_theta, this_warm_start_e_alpha, gamma_g, sigma2, spot_name, nu_vec=None, rho=None, lambda_r=None, lasso_weight_vec=None, lambda_l2=None, global_optimize=False, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, this_theta_mask=None, skip_opt=True, verbose=False)[source]

update theta (celltype proportion) sigma2 (variance paramter of the log-normal distribution) and gamma_g (gene-specific platform effect) by MLE in ONE SPOT

Parameters:
  • mu (2-D numpy matrix) – matrix of celltype specific marker gene expression (celltypes * genes).

  • y_vec (1-D numpy array) – spatial gene expression (length #genes).

  • N (int or None) – sequencing depth of this spot. If it’s None, use sum of observed marker gene expressions as sequencing depth.

  • spot_name (string) – name of this spot.

  • this_warm_start_theta (1-D numpy array (length #celltypes)) – initial guess of theta (celltype proportion).

  • this_warm_start_e_alpha (float) – initial guess of e_alpha (spot-specific effect).

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.

  • nu_vec (1-D numpy array (length #celltypes), optional) – variable for ADMM penalty of all spots in 3 part ADMM nu = theta_hat (used for regularization/penalty) - u (scaled dual variables to make theta = theta_hat)

  • rho (float, optional) – parameter for the strength of ADMM loss to make theta equals theta_hat

  • lambda_r (float, optional) – parameter for the strength of Adaptive Lasso loss to shrink theta

  • lasso_weight_vec (1-D numpy array (length #celltypes), optional) – weight of Adaptive Lasso, 1 ./ theta

  • lambda_l2 (float) – parameter for the strength of L2 panealty to shrink theta

  • global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • this_theta_mask (1-D numpy array (length #celltypes), optional) – mask for cell-type proportions (1: present, 0: not present).

  • skip_opt (bool, optional) – if True, when only one cell type present, we skip optimization and directly return theta.

  • verbose (bool, optional) – if True, print more information.

Returns:

w_result – updated w, i.e. theta (celltype proportion) * e_alpha.

Return type:

1-D numpy array (length #celltypes)

local_fit_numba.update_sigma2(data, theta, e_alpha, gamma_g, sigma2, opt_method='L-BFGS-B', global_optimize=False, hv_x=None, verbose=False)[source]

update sigma2 (variance paramter of the lognormal distribution) given theta (celltype proportion), e_alpha (spot-specific effect) and gamma_g (gene-specific platform effect) by MLE

we assume

ln(lambda) = alpha + gamma_g + ln(sum(theta*mu_X)) + epsilon

subject to sum(theta)=1, theta>=0

mu_X is marker genes from data[‘X’]

then the mean parameter of the log-normal distribution of ln(lambda) is alpha + gamma_g + ln(sum(theta*mu_X))

we did re-parametrization w = e^alpha * theta, then

ln(lambda) = gamma_g + ln(sum([e^alpha*theta]*mu_X)) + epsilon

subject to w>=0, it will imply sum(theta)=1 and theta>=0

currently optimization of sigma2 DO NOT support parallel computing

Parameters:
  • data (Dict) –

    a Dict contains all info need for modeling:

    X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

    Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

    A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

    N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

    non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

    spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

    gene_names: a list of string of gene symbols. Only keep actually used marker genes.

    celltype_names: a list of string of celltype names.

  • theta (3-D numpy array (spots * celltypes * 1)) – theta (celltype proportion).

  • e_alpha (1-D numpy array) – e_alpha (spot-specific effect).

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – initial guess of variance paramter of the lognormal distribution of ln(lambda). All genes and spots share the same variance.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • verbose (bool, optional) – if True, print more information.

Returns:

updated sigma2

Return type:

float

local_fit_numba.update_theta(data, warm_start_theta, warm_start_e_alpha, gamma_g, sigma2, nu=None, rho=None, lambda_r=None, lasso_weight=None, lambda_l2=None, global_optimize=False, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, theta_mask=None, verbose=False)[source]

update theta (celltype proportion) and e_alpha (spot-specific effect) given sigma2 (variance paramter of the log-normal distribution) and gamma_g (gene-specific platform effect) by MLE

we assume

ln(lambda) = alpha + gamma_g + ln(sum(theta*mu_X)) + epsilon

subject to sum(theta)=1, theta>=0

mu_X is marker genes from data[‘X’]

then the mean parameter of the lognormal distribution of ln(lambda) is alpha + gamma_g + ln(sum(theta*mu_X))

we did re-parametrization w = e^alpha * theta, then

ln(lambda) = gamma_g + ln(sum([e^alpha*theta]*mu_X)) + epsilon

subject to w>=0, it will imply sum(theta)=1 and theta>=0

the steps to update theta and e_alpha:
  1. dimension change of theta, theta_hat, u from 3-D (spots * celltypes * 1) to 1-D (celltypes), and do re-parametrization to get w

  2. solve w for each spot in parallel

  3. extract updated theta and e_alpha from w, and change the dimension of updated theta, theta_hat, u back

Parameters:
  • data (Dict) –

    a Dict contains all info need for modeling:

    X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

    Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

    A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

    N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

    non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

    spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

    gene_names: a list of string of gene symbols. Only keep actually used marker genes.

    celltype_names: a list of string of celltype names.

  • warm_start_theta (3-D numpy array (spots * celltypes * 1)) – initial guess of theta (celltype proportion).

  • warm_start_e_alpha (1-D numpy array) – initial guess of e_alpha (spot-specific effect).

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.

  • nu (3-D numpy array (spots * celltypes * 1), optional) – variable for ADMM penalty of all spots in 3 part ADMM nu = theta_hat (used for regularization/penalty) - u (scaled dual variables to make theta = theta_hat)

  • rho (float, optional) – parameter for the strength of ADMM loss to make theta equals theta_hat

  • lambda_r (float, optional) – parameter for the strength of Adaptive Lasso loss to shrink theta

  • lasso_weight (3-D numpy array (spots * celltypes * 1), optional) – weight of Adaptive Lasso, 1 ./ theta

  • lambda_l2 (float) – parameter for the strength of L2 panealty to shrink theta

  • global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • theta_mask (3-D numpy array (spots * celltypes * 1), optional) – mask for cell-type proportions (1: present, 0: not present). Only used for stage 2 theta optmization.

  • verbose (bool, optional) – if True, print more information.

Returns:

  • theta_results (3-D numpy array (spots * celltypes * 1)) – updated theta (celltype proportion).

  • e_alpha_results (1-D numpy array) – updated e_alpha (spot-specific effect).

model_fit

Created on Sat May 7 22:36:00 2022

@author: hill103

this script stores functions of model fitting currently we use a two-stage implement to fit GLRM

model_fit.estimating_gamma_g(data, hybrid_version=True, opt_method='L-BFGS-B', verbose=False)[source]

estimate platform effect gamma_g by fit model on pseudo-bulk measurement

Parameters:
  • data (Dict) –

    a Dict contains all info need for modeling:

    X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

    Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

    A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

    N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

    non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

    spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

    gene_names: a list of string of gene symbols. Only keep actually used marker gene symbols.

    celltype_names: a list of string of celltype names.

    initial_guess: initial guess of cell-type proportions of spatial spots.

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • verbose (bool, optional) – if True, print more information in program running

Returns:

estimated gamma_g

Return type:

1-D numpy array (#genes)

model_fit.fit_base_model(data, gamma_g=None, global_optimize=False, hybrid_version=True, opt_method='L-BFGS-B', verbose=False, use_initial_guess=False)[source]

fit local or base model without any Adaptive Lasso constrain or Graph Laplacian constrain

this fitting is only used for gamma_g estimation and MLE theta estimation in GLRM stage 1

when fitting the base model, we will perform fitting iteratively until the sigma^2 change <= 0.001

Note: the output of the fit result is 3-Dimensional. The dimension transform will be performed outside this function if needed

Parameters:
  • data (Dict) –

    a Dict contains all info need for modeling:

    X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

    Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

    A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

    N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

    non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

    spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

    gene_names: a list of string of gene symbols. Only keep actually used marker gene symbols.

    celltype_names: a list of string of celltype names.

    initial_guess: initial guess of cell-type proportions of spatial spots.

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • verbose (bool, optional) – if True, print more information in program running.

  • use_initial_guess (bool, optional) – if True, use initial guess instead of uniform distribution for theta initialization.

Returns:

a tuple of estimated theta, e_alpha and sigma2:

theta : celltype proportions (3-D numpy array (spots * celltypes * 1))

e_alpha : spot-specific effect (1-D array with length #spot)

sigma2 : variance paramter of the lognormal distribution (float)

Return type:

tuple

model_fit.fit_model_two_stage(data, gamma_g=None, lambda_r=None, lambda_g=None, global_optimize=False, hybrid_version=True, opt_method='L-BFGS-B', verbose=False, diagnosis=False, use_admm=False)[source]

fit Graph Laplacian Regularized Stratified Model (GLRM) in a two-stage way

Parameters:
  • data (Dict) –

    a Dict contains all info need for modeling:

    X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

    Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

    A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

    N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

    non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

    spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

    gene_names: a list of string of gene symbols. Only keep actually used marker genes.

    celltype_names: a list of string of celltype names.

    initial_guess: initial guess of cell-type proportions of spatial spots.

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • lambda_r (float) – strength for Adaptive Lasso penalty. If None, use cross-validation to determine optimal lambda_r

  • lambda_g (float) – edge weight for graph, and will affect the strength of Graph Laplacian constrain. If None, use cross-validation to determine optimal graph_weight.

  • global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • verbose (bool, optional) – if True, print more information in program running.

  • diagnosis (bool, optional) – if True save more information to files for diagnosis CVAE and hyper-parameter selection.

  • use_admm (bool, optinal) – if True, use ADMM framework in optimization, default is False.

Returns:

estimated model coefficients, including:

theta : celltype proportions (#spots * #celltypes * 1)

e_alpha : spot-specific effect (1-D array with length #spot)

sigma2 : variance paramter of the lognormal distribution (float)

gamma_g : gene-specific platform effect for all genes (1-D array with length #gene)

Return type:

Dict

model_fit.fit_stage1_spotwise_lambda_r(data, warm_start_theta, warm_start_e_alpha, gamma_g, sigma2, lambda_r, lasso_weight, global_optimize=False, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, verbose=False, diagnosis=False)[source]

update theta (celltype proportion) and e_alpha (spot-specific effect) given sigma2 (variance paramter of the log-normal distribution) and gamma_g (gene-specific platform effect) by MLE

UPDATE: here we assume the hyperparameter lambda_r for Adaptive Lasso is spot-wise, i.e. for each spot, we search for the optimal lambda_r, then fit the model

UPDATE2: we apply Nested Subset Selection Strategy, thus no lambda_r anymore

we assume

ln(lambda) = alpha + gamma_g + ln(sum(theta*mu_X)) + epsilon

subject to sum(theta)=1, theta>=0

mu_X is marker genes from data[‘X’]

then the mean parameter of the lognormal distribution of ln(lambda) is alpha + gamma_g + ln(sum(theta*mu_X))

we did re-parametrization w = e^alpha * theta, then

ln(lambda) = gamma_g + ln(sum([e^alpha*theta]*mu_X)) + epsilon

subject to w>=0, it will imply sum(theta)=1 and theta>=0

the steps to update theta and e_alpha:
  1. dimension change of theta from 3-D (spots * celltypes * 1) to 1-D (celltypes), and do re-parametrization to get w

  2. solve w for each spot in parallel

  3. extract updated theta and e_alpha from w, and change the dimension of updated theta back

Parameters:
  • data (Dict) –

    a Dict contains all info need for modeling:

    X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

    Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

    A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

    N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

    non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

    spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

    gene_names: a list of string of gene symbols. Only keep actually used marker genes.

    celltype_names: a list of string of celltype names.

  • warm_start_theta (3-D numpy array (spots * celltypes * 1)) – initial guess of theta (celltype proportion).

  • warm_start_e_alpha (1-D numpy array) – initial guess of e_alpha (spot-specific effect).

  • gamma_g (1-D numpy array) – gene-specific platform effect for all genes.

  • sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.

  • lambda_r (float or 1-D numpy array) – parameter for the strength of Adaptive Lasso loss to shrink theta

  • lasso_weight (3-D numpy array (spots * celltypes * 1)) – weight of Adaptive Lasso, 1 ./ theta

  • global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.

  • hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.

  • opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.

  • hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.

  • hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.

  • verbose (bool, optional) – if True, print more information.

  • diagnosis (bool, optional) – if True print more information for diagnosis of hyper-parameter selection.

Returns:

  • theta_results (3-D numpy array (spots * celltypes * 1)) – updated theta (celltype proportion).

  • e_alpha_results (1-D numpy array) – updated e_alpha (spot-specific effect).

parse_opt

Created on Tue May 10 03:23:09 2022

@author: hill103

this script stores functions related to receive and parse the command line parameters checking on parameters will also be performed

parse_opt.parseOpt()[source]

parse the command line parameters and return them as a dict

Parameters:

None.

Returns:

paramdict

parsed command line parameters for model, including:

spatial_file : full file path of spatial transcriptomic data

ref_file : full file path of scRNA-seq data

ref_celltype_file : full file path of the corresponding cell-type annotation of scRNA-seq data

marker_file : full file path of user curated cell-type marker gene expression

loc_file : full file path of spot locations in spatial transcriptomic data

A_file : full file path Adjacency Matrix of spots in spatial transcriptomic data

n_cores : number of CPU cores used for parallel computing

lambda_r : hyper-parameter for Adaptive Lasso

lambda_g : hyper-parameter for graph weight, affecting the Laplacian Matrix

use_cvae : whether to build CVAE

threshold : threshold for hard thresholding estimated cell-type proportion theta

n_hv_gene : number of highly variable genes for CVAE

n_marker_per_cmp : number of TOP marker genes for each comparison in DE

n_pseudo_spot : number of pseudo-spots

pseudo_spot_min_cell : minimum value of cells in pseudo-spot

pseudo_spot_max_cell : maximum value of cells in pseudo-spot

seq_depth_scaler : a scaler of scRNA-seq sequencing depth

cvae_input_scaler : maximum value of the scaled input for CVAE

cvae_init_lr : initial learning rate for training CVAE

num_hidden_layer : number of hidden layers in encoder and decoder

use_batch_norm : whether to use Batch Normalization

cvae_train_epoch : max number of training epochs for the CVAE

use_spatial_pseudo : whether to generate “pseudo-spots” in spatial condition

redo_de : whether to redo DE after CVAE transformation

seed : seed value for random in building CVAE

diagnosis : True or False, if True save more information to files for diagnosis CVAE and hyper-parameter selection

verbose : True or False, if True print more information during program running

use_fdr : whether to use FDR adjusted p value for filtering and sorting

p_val_cutoff : threshold of p value (or FDR if –use_fdr is true) in marker genes filtering

fc_cutoff : threshold of fold change (without log transform!) in marker genes filtering

pct1_cutoff : threshold of pct.1 in marker genes filtering

pct2_cutoff : threshold of pct.2 in marker genes filtering

sortby_fc : whether to sort marker genes by fold change

filter_cell : whether to filter cells before DE

filter_gene : whether to filter genes before DE

use_imputation : whether to perform imputation

diameter : the physical diameter of spatial spots

impute_diameter : target spot diameter for imputation

hole_min_spots : threshold of number of uncaptured spots to validate holes

preserve_shape : whether to preserve the exact shape of tissue map

Return type:

Dict

parse_opt.usage()[source]

help information of the main function

Parameters:

None.

Return type:

None.

preprocess

Created on Tue May 10 22:13:23 2022

@author: hill103

this script stores functions related to pre-processing including:

  1. read data from CSV files

  2. determine whether to perform DE for cell-type specific marker genes:

    if user specified marker file, then DE step will be skipped

  3. determine whether to build CVAE to adjust the platform effect:

    if user specified use_cvae as True and also provide original scRNA-seq data and corresponding cell-type annotation, then CVAE will be built

  4. filter genes and spots before GLRM modeling

  5. check datasets to avoid some mistakes

preprocess.preprocess(spatial_file, ref_file, ref_anno_file, marker_file, A_file, use_cvae, n_hv_gene, n_marker_per_cmp, n_pseudo_spot, pseudo_spot_min_cell, pseudo_spot_max_cell, seq_depth_scaler, cvae_input_scaler, cvae_init_lr, num_hidden_layer, use_batch_norm, cvae_train_epoch, use_spatial_pseudo, redo_de, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, filter_cell, filter_gene, diagnosis)[source]

preprocess files

Parameters:
  • spatial_file (string) – full path of input csv file of raw nUMI counts in spatial transcriptomic data (spots * genes).

  • ref_file (string) – full path of input csv file of raw nUMI counts in scRNA-seq data (cells * genes).

  • ref_anno_file (string) – full path of input csv file of cell-type annotations for all cells in scRNA-seq data.

  • marker_file (string) – full path of input csv file of cell-typee marker gene expression (cell-types * genes).

  • A_file (string) – full path of input csv file of Adjacency Matrix of spots in spatial transcriptomic data (spots * spots).

  • use_cvae (bool) – whether to build CVAE to adjust platform effect.

  • n_hv_gene (int) – number of highly variable genes for CVAE.

  • n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE.

  • n_pseudo_spot (int) – number of pseudo-spots.

  • pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot.

  • pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot.

  • seq_depth_scaler (int) – a scaler of scRNA-seq sequencing depth.

  • cvae_input_scaler (int) – maximum value of the scaled input for CVAE.

  • cvae_init_lr (float) – initial learning rate for training CVAE.

  • num_hidden_layer (int) – number of hidden layers in encoder and decoder.

  • use_batch_norm (bool) – whether to use Batch Normalization.

  • cvae_train_epoch (int) – max number of training epochs for the CVAE.

  • use_spatial_pseudo (int) – whether to generate “pseudo-spots” in spatial condition.

  • redo_de (bool) – whether to redo DE after CVAE transformation.

  • use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting.

  • p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering.

  • fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering.

  • pct1_cutoff (float) – threshold of pct.1 in marker genes filtering.

  • pct2_cutoff (float) – threshold of pct.2 in marker genes filtering.

  • sortby_fc (bool) – whether to sort marker genes by fold change.

  • diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection.

  • filter_cell (bool) – whether to filter cells before DE.

  • filter_gene (bool) – whether to filter genes before DE.

Returns:

data

a Dict contains all info need for modeling:

X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

gene_names: a list of string of gene symbols. Only keep actually used marker gene symbols.

celltype_names: a list of string of cell-type names.

initial_guess: initial guess of cell-type proportions of spatial spots.

Return type:

Dict

run_model

Created on Tue May 10 04:52:02 2022

@author: hill103

this script stores function to define a Graph Laplacian Regularized Stratified Model (GLRM) and call functions to fit the coefficients

It receieve the input data, build graph, build GLRM and fit coefficients

run_model.run_GLRM(data, lambda_r=None, lambda_g=None, estimate_gamma_g=True, n_jobs=1, threshold=0, diagnosis=False, verbose=False)[source]

run GLRM, and fit coefficients

Parameters:
  • data (Dict) –

    a Dict contains all info need for modeling:

    X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).

    Y: a 2-D numpy matrix of spatial gene expression (spots * genes).

    A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.

    N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.

    non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.

    spot_names: a list of string of spot barcodes. Only keep spots passed filtering.

    gene_names: a list of string of gene symbols. Only keep actually used marker genes.

    celltype_names: a list of string of celltype names.

    initial_guess: initial guess of cell-type proportions of spatial spots.

  • lambda_r (float, optional) – strength for Adaptive Lasso penalty. The default is None, i.e. use cross-validation to determine optimal value

  • lambda_g (float, optional) – edge weight for graph, and will affect the strength of Graph Laplacian constrain. The default is None, i.e. use cross-validation to determine optimal value

  • estimate_gamma_g (bool, optional) – whether to estimate gamma_g (gene-specific platform effect). The default is True.

  • n_jobs (int, optional) – number of jobs to spawn. The default is 1.

  • threshold (float, optional) – threshold value for hard thresholding estimated cell-type proportion theta. The default is 0, i.e. no hard thresholding.

  • diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection

  • verbose (bool, optional) – if True, print variables in each ADMM loop. The default is True.

Returns:

result

estimated model coefficients, including (note theta dimension changed back to 2-D):

theta : celltype proportions (#spots * #celltypes)

e_alpha : spot-specific effect (1-D array with length #spot)

sigma2 : variance paramter of the lognormal distribution (float)

gamma_g : gene-specific platform effect for all genes (1-D array with length #gene)

Return type:

Dict

utils

Created on Thu May 5 23:28:03 2022

@author: hill103

this script stores utils functions

utils.calcRMSE(truth, predicted)[source]

calculate RMSE

Parameters:
  • truth (1-D numpy array) – true values.

  • predicted (1-D numpy array) – predictions.

Returns:

RMSE

Return type:

float

utils.check_decoder(cvae, decoder, data, labels)[source]

since we first create a decoder then update its weights based on the corresponding weights in CVAE, we need to double check the weights are updated correctly, and the decoded output matchs the CVAE output

Parameters:
  • cvae (Keras model) – already trained CVAE model

  • decoder (Keras model) – a separated decoder whose weights are already updated, i.e. it should give the same decoded output with CVAE

  • data (2-D numpy array) – data used for checking decoder (columns are genes, rows are cells, spatial spots or pseudo-spots)

  • labels (1-D numpy array) – corresponding conditional variables for each row in data

Return type:

None.

utils.check_numba()[source]

this function checks the active threading layer and thread count in Numba

Parameters:

None.

Return type:

None.

utils.numba_info()[source]

this function checks the Numba information, especially the Threading Layer Information

It turns out TBB or OMP gives similiar running speed

Parameters:

None.

Return type:

None.

utils.read_scRNA_data(ref_file, ref_anno_file, filter_cell, filter_gene)[source]

read scRNA-seq data saved as a CSV file by Scanpy, also read cell-type annotation, then subset cells with cell-type annotation

Parameters:
  • ref_file (string) – full path of input csv file of raw nUMI counts in scRNA-seq data (cells * genes).

  • ref_anno_file (string) – full path of input csv file of cell-type annotations for all cells in scRNA-seq data.

  • filter_cell (bool) – whether to filter cells before DE.

  • filter_gene (bool) – whether to filter genes before DE.

Return type:

a AnnData object

utils.read_spatial_data(spatial_file, filter_gene, n_hv_gene=0)[source]

read spatial data saved as a CSV file by Scanpy

Parameters:
  • spatial_file (string) – full path of input csv file of raw nUMI counts in spatial transcriptomic data (spots * genes).

  • filter_gene (bool) – whether to filter genes before DE.

  • n_hv_gene (int) – number of highly variable genes to be kept in spatial data. If equals 0, all genes are kept.

Returns:

  • spatial_spot_obj (a AnnData object) – AnnData object of spatial spot data.

  • tmp_df (dataframe) – dataframe of raw nUMI of spatial data (spots * genes).

  • N (series) – sequencing depth per spot.

utils.reparameterTheta(theta, e_alpha)[source]

re-parametrization w = e^alpha * theta

Parameters:
  • theta (3-D numpy array (#spot * #celltype * 1)) – theta (celltype proportion).

  • e_alpha (1-D numpy array) – e^alpha (spot-specific effect).

Returns:

w – re-parametrization w = e^alpha * theta.

Return type:

3-D numpy array (#spot * #celltype * 1)

utils.reportRMSE(true_theta, pred_theta)[source]

calculate the RMSE of theta (celltype proportion) across all spots

we first calculate the RMSE of each spot, then calculate the MEAN of RMSE values across all spots

Parameters:
  • true_theta (2-D numpy array (spots * celltypes)) – true values.

  • pred_theta (2-D numpy array (spots * celltypes)) – predictions.

Returns:

RMSE across all spots

Return type:

float

utils.rerun_DE(scRNA_df, scRNA_celltype, n_marker_per_cmp, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, save_result=False, filter_gene=True)[source]

rerun DE on CVAE transformed scRNA-seq data

genes are only overlapped genes between spatial and scRNA-seq data

gene expression values are already normalized by sequencing depth

Parameters:
  • scRNA_df (dataframe) – normalized gene expression after CVAE tranforming on scRNA-seq data (cells * genes).

  • scRNA_celltype (dataframe) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>

  • n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE

  • use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting

  • p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering

  • fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering

  • pct1_cutoff (float) – threshold of pct.1 in marker genes filtering

  • pct2_cutoff (float) – threshold of pct.2 in marker genes filtering

  • sortby_fc (bool) – whether to sort marker genes by fold change

  • save_result (bool) – if true, save dataframe of DE result to csv file

  • filter_gene (bool) – whether to filter genes before DE.

Returns:

marker_gene_list – a list of cell-type specific marker genes based on DE on CVAE tranformed gene expressions.

Return type:

list

utils.run_DE(sc_obj, n_marker_per_cmp, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, save_result=False, save_file_name=None)[source]

differential on cell-types in scRNA-seq data.

we compare each cell-type with another one cell-type at a time.

only choose TOP X marker genes for one comparison with one cell-type vs another one cell-type, with filtering (the FDR adjusted p value <= 0.05 + fold change >= 1.2 + pct.1 >= 0.3 + pct.2 <= 0.1, and sorting by fold change (by default). Then combine all marker genes from all comparisons.

Note: the genes in object are overlapped genes with spatial data only.

cell-type annotation is saved in object metadata <celltype>

gene expression in AnnData object has already been normalized by sequencing depth

Parameters:
  • sc_obj (AnnData object) – scRNA-seq data object.

  • n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE

  • use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting

  • p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering

  • fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering

  • pct1_cutoff (float) – threshold of pct.1 in marker genes filtering

  • pct2_cutoff (float) – threshold of pct.2 in marker genes filtering

  • sortby_fc (bool) – whether to sort marker genes by fold change

  • save_result (bool) – if true, save dataframe of DE result to csv file

  • save_file_name (string) – file name (without path) for saving DE result

Returns:

marker_gene_list – identified cell-type specific marker gene list

Return type:

list

utils.run_DE_only(ref_file, ref_anno_file, spatial_genes, n_marker_per_cmp, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, save_result=False, filter_cell=True, filter_gene=True)[source]

read scRNA-seq raw nUMI and cell-type annotation, then perform DE analysis.

Note: the genes in scRNA-seq data need to be subsetted to overlapped genes with spatial data only.

Parameters:
  • ref_file (string) – full path of input csv file of raw nUMI counts in scRNA-seq data (cells * genes).

  • ref_anno_file (string) – full path of input csv file of cell-type annotations for all cells in scRNA-seq data.

  • spatial_genes (list) – genes included in spatial dataset.

  • n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE

  • use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting

  • p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering

  • fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering

  • pct1_cutoff (float) – threshold of pct.1 in marker genes filtering

  • pct2_cutoff (float) – threshold of pct.2 in marker genes filtering

  • sortby_fc (bool) – whether to sort marker genes by fold change

  • save_result (bool) – if true, save dataframe of DE result to csv file

  • filter_cell (bool) – whether to filter cells before DE

  • filter_gene (bool) – whether to filter genes before DE

Returns:

  • scrna_obj (AnnData object) – a AnnData object for scRNA-seq data

  • marker_gene_profile (DataFrame) – average gene expressions of identified cell-type specific marker genes from refer scRNA-seq data

utils.total_size(o, handlers={}, verbose=False)[source]

Returns the approximate memory footprint of an object and all of its contents.

Automatically finds the contents of several built-in containers and their subclasses, including tuple, list, deque, dict, set, and frozenset. To search other containers, add handlers that iterate over their contents.

Parameters:
  • o (object) – The object whose memory footprint is to be calculated.

  • handlers (dict, optional) – A dictionary of handler functions keyed by container types. These handlers are used to iterate over container contents. Defaults to an empty dictionary.

  • verbose (bool, optional) – If True, the function prints more information about what it’s doing. Defaults to False.

Returns:

obj_size – An approximation of the total memory footprint of the object in bytes.

Return type:

int

version