Modules
admm_fit
Created on Thu May 5 23:18:26 2022
@author: hill103
this script stores functions related to ADMM framework for fitting model
ADMM framework based on https://github.com/cvxgrp/strat_models
- admm_fit.one_admm_fit(data, L, theta, e_alpha, gamma_g, sigma2, lambda_r=1.0, lasso_weight=None, hv_x=None, hv_log_p=None, theta_mask=None, abs_tol=0.001, rel_tol=0.001, rho=1, mu=10, tau_incr=2, tau_decr=2, max_rho=10.0, min_rho=0.1, maxiter=100, max_cg_iterations=10, dynamic_rho=True, queue_len=3, diff_threshold=0.05, rho_incr=2, rho_decr=2, diff_scale=5, diff_stop=5e-05, opt_method='L-BFGS-B', global_optimize=False, hybrid_version=True, verbose=False)[source]
perform a whole ADMM iterations once as one fitting procedure in GLRM
Note: the output of the fit result is 3-Dimensional. The dimension transform will be performed outside this function if needed
- Parameters:
data (Dict) –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker genes.
celltype_names: a list of string of celltype names.
initial_guess: initial guess of cell-type proportions of spatial spots.
L (scipy sparse matrix (spots * spots)) – Laplacian matrix
theta (3-D numpy array (spots * celltypes * 1)) – initial guess of theta (celltype proportion).
e_alpha (1-D numpy array) – initial guess of e_alpha (spot-specific effect).
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – initial guess of variance paramter of the lognormal distribution of ln(lambda). All genes and spots share the same variance. it may not be updated during the ADMM iterations, i.e. sigma2 is treated as an already optimized value.
lambda_r (float, optional) – strength for Adaptive Lasso penalty. The default is 1.0.
lasso_weight (3-D numpy array (spots * celltypes * 1), optional) – calculated weight for adaptive lasso. The default is None.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
theta_mask (3-D numpy array (spots * celltypes * 1), optional) – mask for cell-type proportions (1: present, 0: not present). Only used for stage 2 theta optmization.
abs_tol (float, optional) – Absolute tolerance. The default is 1e-3.
rel_tol (float, optional) – Relative tolerance. The default is 1e-3.
rho (float, optional) – Initial penalty parameter. The default is 1. Actually used in code is 1/rho, and turned out to fasten the converge than using rho
mu (float, optional) – Adaptive penalty parameters. The default is 10.
tau_incr (float, optional) – Adaptive penalty parameters. The default is 2.
tau_decr (float, optional) – Adaptive penalty parameters. The default is 2.
max_rho (float, optional) – Adaptive penalty parameters. The default is 1e1.
min_rho (float, optional) – Adaptive penalty parameters. The default is 1e-1.
maxiter (int, optional) – Maximum number of ADMM iterations. The default is 100.
max_cg_iterations (int, optional) – Max number of CG iterations for graph Laplacian contrain per ADMM iteration. The default is 10.
dynamic_rho (bool, optional) – if True, dynamically increasing min_rho and max_rho. The default is True.
queue_len (int, optional) – the length of queue to record the mean of theta-theta_tilde ‘RMSE’ and theta-theta_hat ‘RMSE’. The default is 3.
diff_threshold (float, optional) – the threshold of ‘RMSE change’ for rho adjustment. If all ‘RMSE change’ values in queue are less than or equal the threshold, then increasing min_rho and max_rho. The default is 0.05.
rho_incr (float, optional) – the multiplier of min_rho and max_rho increasing. The default is 2.
rho_decr (float, optional) – the divider of min_rho and max_rho decreasing. The default is 2.
diff_scale (float, optional) – if current theta tilde and hat ‘RMSE’ > diff_scale * previous ‘RMSE’, which means current rho value is too large and cause unexpected oscillation, then decreasing min_rho and max_rho. The default is 5.
diff_stop (float, optional) – if average of theta tilde and hat ‘RMSE’ <= diff_stop, stop ADMM iteration. The default is 5e-5.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
verbose (bool, optional) – if True, print information in each ADMM loop
- Returns:
- estimated model coefficients, including:
theta : celltype proportions (#spots * #celltypes * 1)
e_alpha : spot-specific effect (1-D array with length #spot)
sigma2 : variance paramter of the lognormal distribution (float)
gamma_g : gene-specific platform effect for all genes (1-D array with length #gene)
theta_tilde : celltype proportions for Adaptive Lasso (#spots * #celltypes * 1)
theta_hat : celltype proportions for Graph Laplacian constrain (#spots * #celltypes * 1)
- Return type:
Dict
config
Created on Thu May 5 21:58:08 2022
@author: hill103
this script configure the environment variables for GLRM
- config.diagnosis_path = '/data/diagnosis'
define a small value to avoid divided by 0 or log(0)
- config.gamma = 0.004
define eps in optmization for theta and sigma2 (default 1e-08)
- config.min_sigma2 = 1e-06
define the integration range and increment to calculate the heavy-tail probabilities
- config.min_theta = 1e-09
define a small value to make sure sigma^2 > 0
- config.min_val = 1e-12
define a small value to make sure theta (w) > 0
- config.print(*objects, sep=' ', end='\n', file=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, flush=True)[source]
- config.sigma2_eps = 1e-08
configuration for reproducible results in keras + TensorFlow skipped here, we configure the environment in main function
cvae
Created on Thu May 19 21:43:08 2022
@author: hill103
this script stores functions to build a CVAE for platform effect adjustment
- cvae.CVAE_keras_model(p, p_cond, latent_dim, p_encoder_lst, p_decoder_lst, hidden_act='elu', output_act='relu', use_batch_norm=True, cvae_init_lr=0.01)[source]
define a standard CVAE model based on Keras need to build a decoder separately as can not extract it from the whole model
- Parameters:
p (int) – number of nodes in input layer
p_cond (int) – number of conditional nodes in input layer
latent_dim (int) – number of nodes in latent space
p_encoder_lst (list of integers) – including number of nodes in each hidden layer of encoder, the length of list is the number of hidden layers
p_decoder_lst (list of integers) – including number of nodes in each hidden layer of decoder, the length of list is the number of hidden layers
hidden_act (string, optional) – activation function of hidden layers. Default is elu function
output_act (string, optional) – activation function of output layer. Default is relu function
use_batch_norm (bool, optional) – whether to use batch normalization. Default if True, i.e. use batch normalization
cvae_init_lr (float) – initial learning rate for training CVAE
- Returns:
cvae (Keras model) – CVAE model. Encoder can be extracted from it
decoder (Keras model) – corresponding decoder, reset its weights after CVAE training
- cvae.augment_sc(exp_df, celltype_anno, target_count, pseudo_spot_min_cell, pseudo_spot_max_cell)[source]
augment single cells and balance #cells of cell types
generate pseudo-spots by randomly combining scRNA-seq cells within the same cell type
original single cells are put at the end
- Parameters:
exp_df (dataframe) – normalized gene expression (cell * genes)
celltype_anno (dataframe or None) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>
target_count (int) – target number of cells per cell type
pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot
pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot
- Returns:
pseudo_spots_df (dataframe) – pseudo-spot gene expression (pseudo-spots * genes; original cells included first)
pseudo_spots_celltype_prop (dataframe) – pseudo-spot cell-type proportions (pseudo-spots * cell-types; original cells included first)
n_cell_in_spot (list) – number of cells/spots in pseudo-spots (original cells included first)
- cvae.build_CVAE(spatial_df, scRNA_df, scRNA_celltype, n_marker_per_cmp, n_pseudo_spot, pseudo_spot_min_cell, pseudo_spot_max_cell, seq_depth_scaler, cvae_input_scaler, cvae_init_lr, num_hidden_layer, use_batch_norm, cvae_train_epoch, use_spatial_pseudo, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, diagnosis, rerun_DE=True, filter_gene=True)[source]
build CVAE to adjust platform effect, return transformed spatial gene expression and scRNA-seq cell-type gene signature
input gene expression in datasets only included genes needed for downstream analysis and already been normalized by sequencing depth
- Parameters:
spatial_df (dataframe) – normalized gene expression in spatial transcriptomic data (spots * genes).
scRNA_df (dataframe) – normalized gene expression in scRNA-seq data (cells * genes).
scRNA_celltype (dataframe) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>.
n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE.
n_pseudo_spot (int) – number of pseudo-spots.
pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot.
pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot.
seq_depth_scaler (int) – a scaler of scRNA-seq sequencing depth.
cvae_input_scaler (int) – maximum value of the scaled input for CVAE.
cvae_init_lr (float) – initial learning rate for training CVAE.
num_hidden_layer (int) – number of hidden layers in encoder and decoder.
use_batch_norm (bool) – whether to use Batch Normalization.
cvae_train_epoch (int) – max number of training epochs for the CVAE.
use_spatial_pseudo (int) – whether to generate “pseudo-spots” in spatial condition.
use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting.
p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering.
fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering.
pct1_cutoff (float) – threshold of pct.1 in marker genes filtering.
pct2_cutoff (float) – threshold of pct.2 in marker genes filtering.
sortby_fc (bool) – whether to sort marker genes by fold change.
diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection.
rerun_DE (bool, optional) – whether to rerun DE on the CVAE transformed scRNA-seq data, since the DE genes might be different with before CVAE transforming.
filter_gene (bool) – whether to filter genes before DE.
- Returns:
spatial_transformed_numi (dataframe) – CVAE transformed (platform effect adjusted) spatial spot gene raw nUMI counts (spots * genes).
scRNA_decode_avg_df (dataframe) – CVAE decodered average gene expression (normalized) of cell-types in scRNA-seq data (cell-types * genes).
new_markers (list or None) – marker genes from re-run DE on CVAE transformed scRNA-seq data. It will be None if not re-run DE (rerun_DE=False).
cvae_pred (dataframe or None) – cell-type proportions of spatial spots predicted or transferred by CVAE. It will be None if no way to got initial guess of cell-type proportions (spots * cell-types).
- cvae.build_CVAE_whole(spatial_file, ref_file, ref_anno_file, marker_file, n_hv_gene, n_marker_per_cmp, n_pseudo_spot, pseudo_spot_min_cell, pseudo_spot_max_cell, seq_depth_scaler, cvae_input_scaler, cvae_init_lr, num_hidden_layer, use_batch_norm, cvae_train_epoch, use_spatial_pseudo, redo_de, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, diagnosis, filter_cell, filter_gene)[source]
read related CSV files, build CVAE to adjust platform effect, return transformed spatial gene expression and scRNA-seq cell-type gene signature
- Parameters:
spatial_file (string) – full path of input csv file of raw nUMI counts in spatial transcriptomic data (spots * genes).
ref_file (string) – full path of input csv file of raw nUMI counts in scRNA-seq data (cells * genes).
ref_anno_file (string) – full path of input csv file of cell-type annotations for all cells in scRNA-seq data.
marker_file (string) – full path of input csv file of cell-typee marker gene expression (cell-types * genes).
n_hv_gene (int) – number of highly variable genes for CVAE.
n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE.
n_pseudo_spot (int) – number of pseudo-spots.
pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot.
pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot.
seq_depth_scaler (int) – a scaler of scRNA-seq sequencing depth.
cvae_input_scaler (int) – maximum value of the scaled input for CVAE.
cvae_init_lr (float) – initial learning rate for training CVAE.
num_hidden_layer (int) – number of hidden layers in encoder and decoder.
use_batch_norm (bool) – whether to use Batch Normalization.
cvae_train_epoch (int) – max number of training epochs for the CVAE.
use_spatial_pseudo (int) – whether to generate “pseudo-spots” in spatial condition.
redo_de (bool) – whether to redo DE after CVAE transformation.
use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting.
p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering.
fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering.
pct1_cutoff (float) – threshold of pct.1 in marker genes filtering.
pct2_cutoff (float) – threshold of pct.2 in marker genes filtering.
sortby_fc (bool) – whether to sort marker genes by fold change.
diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection.
filter_cell (bool) – whether to filter cells before DE.
filter_gene (bool) – whether to filter genes before DE.
- Returns:
spatial_transformed_numi (dataframe) – CVAE transformed (platform effect adjusted) spatial spot gene raw nUMI counts (spots * genes).
scRNA_decode_avg_df (dataframe) – CVAE decodered average gene expression (normalized) of cell-types in scRNA-seq data (cell-types * genes).
new_markers (list or None) – marker genes from re-run DE on CVAE transformed scRNA-seq data. It will be None if not re-run DE.
cvae_pred (dataframe or None) – cell-type proportions of spatial spots predicted or transferred by CVAE. It will be None if no way to got initial guess of cell-type proportions (spots * cell-types).
- cvae.celltype2props(celltype_anno, celltype_order)[source]
calculate cell-type proportions matrix given one cell-type annotation
- Parameters:
celltype_anno (dataframe) – cell-type annotations. Only 1 column named <celltype>
celltype_order (list) – already sorted unique cell-types. Its order matters, and will be the order in cell-type proportions (columns) and cell-type gene expression profile (rows)
- Returns:
celltype_prop – cell-type proportions, columns are already sorted cell-types
- Return type:
dataframe
- cvae.combine_spatial_spots(exp_df, n_spot, pseudo_spot_min_cell, pseudo_spot_max_cell)[source]
we also generate “pseudo-spots” by combining spatial spots
here we do not know the true cell-type proportions in spatial spots, so we also do not know the proportions in generated “pseudo-spots”. We just ignore it, and only use the expressions for CVAE training
- Parameters:
exp_df (dataframe) – normalized gene expression (spots * genes)
n_spot (int) – number of pseudo spots need to be generated, including training and validation set
pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot
pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot
- Returns:
pseudo_spots_df – pseudo-spot gene expression (pseudo-spots * genes; NO original cells included)
- Return type:
dataframe
- cvae.generate_pseudo_spots(exp_df, celltype_anno, n_spot, celltype_order, pseudo_spot_min_cell, pseudo_spot_max_cell)[source]
generate pseudo-spots for CVAE training by randomly combining scRNA-seq cells
UPDATE:
now we separate the pseudo-spots and all scRNA-seq cells, i.e. we DO NOT add all cells to the end of the dataframe as special pseudo-spots with only one cell
if n_spot=0, i.e. no pseudo-spots, then return an empty dataframe
- Parameters:
exp_df (dataframe) – normalized gene expression (cell * genes)
celltype_anno (dataframe or None) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>
n_spot (int) – number of pseudo spots need to be generated, including training and validation set
celltype_order (list) – already sorted unique cell-types. Its order matters, and will be the order in cell-type proportions (columns) and cell-type gene expression profile (rows)
pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot
pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot
- Returns:
pseudo_spots_df (dataframe) – pseudo-spot gene expression (pseudo-spots * genes; NO original cells included)
pseudo_spots_celltype_prop (dataframe) – pseudo-spot cell-type proportions (pseudo-spots * cell-types; NO original cells included)
n_cell_in_spot (list) – number of cells/spots in pseudo-spots (NO original cells included)
- cvae.transferProps(query, ref, ref_props, n_neighbors=10, sigma=1, use_embedding='PCA', pca_dimension=None)[source]
transfer cell-type proportions by select K Nearest Neighbors in ref and take Gaussian weighted average of ref proportions
- Parameters:
query (2-D numpy matrix) – encoder embeddings of spatial spots (spots * latent layer neurons).
ref (2-D numpy matrix) – encoder embeddings of scRNA-seq cells and pseudo-spots in scRNA-seq condition (cells+pseudo-spots * latent layer neurons).
ref_props (2-D numpy matrix) – cell-type proportion matrix of scRNA-seq cells and pseudo-spots (cells+pseudo-spots * cell-types).
n_neighbors (int, optional) – Number of neighbors to use. The default is 10.
sigma (float, optional) – Standard deviation for the Gaussian weighting function. The default is 1.
use_embedding (str, optional) – which embedding to use, either PCA, UMAP or none. The default is PCA.
pca_dimension (int, optinal) – specify the number of dimensions for PCA reduction. If set to None, the reduced dimension will be one-third of the input dimension.
- Returns:
query_props – cell-type proportion matrix for spatial spots.
- Return type:
2-D numpy matrix
cvaeglrm
Created on Tue May 10 03:15:36 2022
@author: hill103
this script is the main function of the whole CVAE-GLRM pipeline
pipeline steps:
receive and parse the command line parameters and then pass the parameters to CVAE-GLRM model
pre-process
building CVAE (if available)
GLRM model fitting
post-process and save the results
diagnosis_plots
Created on Thu Apr 25 06:02:34 2024
@author: hill103
We move all functions for generating diagnosis plots to here.
- diagnosis_plots.defineColor(n_spatial_spot, scRNA_celltype)[source]
generate n visually distinct colours for cell-types. Use these colours across whole pipeline for consistency
- Parameters:
n_spatial_spot (int) – number of spatial spots.
scRNA_celltype (dataframe) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>.
- Returns:
plot_colors – generated color palette, keys are cell-types together with the number of cells with this cell-type, and values are RGB colors.
- Return type:
dict
- diagnosis_plots.diagnosisCVAE(cvae, encoder, decoder, spatial_embed, spatial_transformed_df, spatial_transformed_numi, pseudo_spatial_embed, scRNA_celltype, celltype_order, celltype_count_dict, scrna_cell_celltype_prop, scRNA_embed, scrna_n_cell, pseudo_spots_celltype_prop, n_cell_in_spot, pseudo_spot_embed, scRNA_decode_df, scRNA_decode_avg_df, new_markers, plot_colors)[source]
save CVAE related Keras models to h5 file, generate figures to diagnosis the training of CVAE
- Parameters:
cvae (Keras model) – already trained CVAE model
encoder (Keras model) – encoder part extract from CVAE model
decoder (Keras model) – a separated decoder whose weights are already updated, i.e. it should give the same decoded output with CVAE
spatial_embed (2-D numpy array) – mu in latent space of spatial spots (spots * latent neurons)
spatial_transformed_df (dataframe) – CVAE transformed (platform effect adjusted) spatial spot gene normalized expressions (spots * genes)
spatial_transformed_numi (dataframe) – CVAE transformed (platform effect adjusted) spatial spot gene raw nUMI counts (spots * genes)
pseudo_spatial_embed (dataframe) – mu in latent space of spatial pseudo-spots (spatial pseudo-spots * latent neurons). It will have 0 rows if no spatial pseudo-spots generated
scRNA_celltype (dataframe) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>
celltype_order (list) – already sorted unique cell-types. Its order matters, and will be the order in pseudo_spots_celltype (columns) and cell-type gene expression profile (rows)
celltype_count_dict (dict) – number of cells in reference scRNA-seq data for each cell-type
scrna_cell_celltype_prop (dataframe) – scRNA-seq cells cell-type proportions (cells * cell-types)
scRNA_embed (2-D numpy array) – mu in latent space of scRNA-seq cells (cells * latent neurons)
scrna_n_cell (list) – number of cells in scRNA-seq data. Note augmented single cells also included
pseudo_spots_celltype_prop (dataframe) – pseudo-spot cell-type proportions (pseudo-spots * cell-types; NO scRNA-seq cells included)
n_cell_in_spot (list) – number of cells in pseudo-spots (no scRNA-seq cells included)
pseudo_spot_embed (dataframe) – mu in latent space of pseudo spots (pseudo-spots * latent neurons); NO scRNA-seq cells included. It will have 0 rows if no scRNA-seq pseudo-spots generated
scRNA_decode_df (dataframe) – CVAE decodered gene expression (normalized) of scRNA-seq cells (cells * genes)
scRNA_decode_avg_df (dataframe) – CVAE decodered average gene expression (normalized) of cell-types in scRNA-seq data (cell-types * genes)
new_markers (list or None) – marker genes from re-run DE on CVAE transformed scRNA-seq data. It will be None if not re-run DE (rerun_DE=False)
plot_colors (dict) – color palette for plot
- Return type:
None.
- diagnosis_plots.diagnosisParamsSpotwiseLambdarTuning(x, strategy)[source]
generate histogram plot show optimal lambda_r in hyper parameter tuning by BIC
- Parameters:
x (list) – a list of optimal lambda_r
strategy (string) – the used strategy for tuning lambda_r
- Return type:
None.
- diagnosis_plots.diagnosisParamsTuning(x, y, optimal_idx, x_label, y_label)[source]
generate scatter plot show performance in hyper parameter tuning by cross-validation
- Parameters:
x (list) – a list of candidates of hyper parameter
y (list) – corresponding performance for each hyper parameter candidate. NOTE not all candidates have corresponding performance due to early stop
optimal_idx (int) – index of list to indicate the optimal hyper parameter value which achieve the best performance
x_label (string) – label of x axis in the plot, representing the number of hyper parameter
y_label (string) – label of y axis in the plot, representing the performance metric used
- Return type:
None.
- diagnosis_plots.plotCVAELoss(history)[source]
plot training and validation loss in CVAE training
NOTE validation loss may be unavailable if no validation data in training
- Parameters:
history (History object) – History object returned by Keras Model.fit in CVAE training.
- Return type:
None.
- diagnosis_plots.plot_imputation(df, grid_df, contours, hierarchy, figname, figsize=(6.4, 4.8))[source]
draw scatter plot of spatial spots and imputed spots for diagnosis.
- Parameters:
df (dataframe) – dataframe of spatial spots at original resolution, with columns ‘x’, ‘y’, ‘border’.
grid_df (dataframe) – dataframe of generated grid points at higher resolution, with columns ‘x’, ‘y’.
contours (tuple) – contours variable returned by cv2.findContours, used for creating grid.
hierarchy (3-D numpy array (1 * #contours * 4)) – hierarchy variable returned by cv2.findContours, used for creating grid.
figname (string) – name of figure.
figsize (tuple, optional) – figure size. The default is (6.4, 4.8).
- Return type:
None.
- diagnosis_plots.rawInputUMAP(spatial_df, scRNA_df, scRNA_celltype, plot_colors)[source]
generate UMAP of spatial spots together with scRNA-seq cells
- Parameters:
spatial_df (dataframe) – normalized gene expression in spatial transcriptomic data (spots * genes).
scRNA_df (dataframe) – normalized gene expression in scRNA-seq data (cells * genes).
scRNA_celltype (dataframe) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>.
plot_colors (dict) – color palette for plot.
- Return type:
None.
hyper_parameter_optimization
Created on Sat May 7 23:32:08 2022
@author: hill103
- this script stores functions of cross-validation to find optimal values for hyper-parameters:
lambda_r: for Adaptive Lasso
lambda_g: for graph edge weight, will affect the Laplacian Matrix
in k-fold cross-validation, we random subset the GENEs, then predict the gene expression in validation fold the performance metric is RMSE or negative log-likelihood of predicted gene expressions in validation fold
UPDATE: we add BIC for lambda_r selection
- hyper_parameter_optimization.BIC_find_lambda_r_one_spot(mu, y_vec, N, spot_name, mle_theta, mle_e_alpha, gamma_g, sigma2, lasso_weight, candidate_list, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, use_AIC=False, reinitialize_theta=False, verbose=False)[source]
find optimal value for hyper-parameter lambda_r by BIC
- Parameters:
mu (2-D numpy matrix) – matrix of celltype specific marker gene expression (celltypes * genes).
y_vec (1-D numpy array) – spatial gene expression (length #genes).
N (int or None) – sequencing depth of this spot. If it’s None, use sum of observed marker gene expressions as sequencing depth.
spot_name (string) – name of this spot.
mle_theta (1-D numpy array (length #celltypes)) – estimated theta (celltype proportion) by MLE.
mle_e_alpha (float) – estimated e_alpha by MLE.
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.
lasso_weight (1-D numpy array (length #celltypes)) – weight of Adaptive Lasso, 1 ./ MLE theta
candidate_list (list) – candidates for the hyper-parameter
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
use_AIC (bool, optional) – if True, calculate and return AIC.
reinitialize_theta (bool, optional) – if True, reinitialize theta as 1/#celltype and e_alpha as 1 for optimization.
verbose (bool, optional) – if True, print more information.
- Returns:
optimal lambda_r value and corresponding optimized theta (for refit)
- Return type:
tuple of (float, 1-D numpy array)
- hyper_parameter_optimization.BIC_find_theta_subset_one_spot(mu, y_vec, N, spot_name, mle_theta, mle_e_alpha, gamma_g, sigma2, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, use_AIC=False, reinitialize_theta=False, verbose=False)[source]
Select sparse theta for one spot by nested subset selection and BIC.
Strategy: - Start from the MLE theta. - Cell types with mle_theta == 0 are fixed to zero and never considered. - Among cell types with mle_theta > 0, sort by mle_theta (ascending). - Construct nested active sets by sequentially setting the smallest MLE-positive thetas to zero. - For each active set, refit the (unpenalized) local model with the remaining theta entries free (sum-to-one constraint), compute BIC, and select the active set with minimal BIC.
- Parameters:
mu (2-D numpy matrix) – matrix of celltype specific marker gene expression (celltypes * genes).
y_vec (1-D numpy array) – spatial gene expression (length #genes).
N (int or None) – sequencing depth of this spot. If it’s None, use sum of observed marker gene expressions as sequencing depth.
spot_name (string) – name of this spot.
mle_theta (1-D numpy array (length #celltypes)) – estimated theta (celltype proportion) by MLE.
mle_e_alpha (float) – estimated e_alpha by MLE.
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
use_AIC (bool, optional) – if True, calculate and return AIC.
reinitialize_theta (bool, optional) – if True, reinitialize theta as 1/#celltype and e_alpha as 1 for optimization.
verbose (bool, optional) – if True, print more information.
- Returns:
optimal theta (1-D numpy array (length #celltypes)) – updated theta (celltype proportion); already after re-fit.
optimal e_alpha (float) – updated e_alpha (spot-specific effect); already after re-fit.
- hyper_parameter_optimization.calcHVBaseModelLoss(theta, e_alpha, y, mu, gamma_g, sigma2, hv_x, hv_log_p, N, non_zero_mtx)[source]
calculate the negative log-likelihood of Poisson log-normal distribution + heavy_tail for all spots
- Parameters:
theta (2-D numpy matrix) – estimated cell-type proportions (spots * celltypes).
e_alpha (1-D numpy array) – spot-specific effect for all spots.
y (2-D numpy matrix) – observed gene nUMI (spots * genes).
mu (2-D numpy matrix) – celltype specific marker genes (celltypes * genes).
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – estimated variance paramter of the lognormal distribution of ln(lambda). All genes share the same variance.
hv_x (1-D numpy array) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
N (1-D numpy array) – sequencing depth for all spots.
non_zero_mtx (None or 2-D numpy matrix (spots * genes)) – If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
- Returns:
sum of negative log-likelihood across all spots.
- Return type:
float
- hyper_parameter_optimization.calc_AIC_BIC(w, y_vec, mu, gamma_g, sigma2, hv_x, hv_log_p, N, use_AIC=False)[source]
calculate BIC or AIC
- Parameters:
w (1-D numpy array (length #celltypes)) – estimated theta (celltype proportion) multiply e_alpha.
y_vec (1-D numpy array) – spatial gene expression (length #genes).
mu (2-D numpy matrix) – matrix of celltype specific marker gene expression (celltypes * genes).
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
N (int or None) – sequencing depth of this spot. If it’s None, use sum of observed marker gene expressions as sequencing depth.
use_AIC (bool, optional) – if True, calculate and return AIC.
- Returns:
value (float) – calculated BIC or AIC.
likelihood_part (float) – negative log-likelihood.
dof (float) – degree of freedom, number of non-zeros minus 1.
- hyper_parameter_optimization.checkEarlyStop(x, num=3)[source]
check whether we can early stop hyper-parameter tuning without test on rest candidates
if we observe num successive increasings, then do early stop
- Parameters:
x (list) – the performance metric of already tested candidates.
- Return type:
either None (means continuing test next candidate) or index of the optimal candidate with smallest performance metric value.
- hyper_parameter_optimization.cv_find_lambda_g(data, L, stage1_theta, stage1_e_alpha, theta_mask, gamma_g, sigma2, candidate_list, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, use_admm=False, use_likelihood=True, k=5, diagnosis=False)[source]
find optimal value for hyper-parameter lambda_g by k fold cross-validation
- Parameters:
data (Dict) –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker genes.
celltype_names: a list of string of celltype names.
initial_guess: initial guess of cell-type proportions of spatial spots.
L (scipy sparse matrix) – already calculated Laplacian Matrix (spots * spots).
stage1_theta (3-D numpy array (spots * celltypes * 1)) – estimated theta (celltype proportion) in stage 1.
stage1_e_alpha (1-D numpy array) – estimated e_alpha (spot-specific effect) in stage 1.
theta_mask (3-D numpy array (spots * celltypes * 1)) – mask for cell-type proportions (1: present, 0: not present). Only used for stage 2 theta optmization.
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.
candidate_list (list) – candidates for the hyper-parameter
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
use_admm (bool, optional) – whether use ADMM iteration or directly use Laplacian loss function. The default is False, i.e. NOT use ADMM in cross-validation.
use_likelihood (bool, optional) – whether use negative log-likelihood as performance metric in cross-validation. The default is True, if False use RMSE of predicted gene expression.
k (int, optional) – the number of folds in cross-validation, The default value is 5.
diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection
- Returns:
optimal lambda_g value
- Return type:
float
- hyper_parameter_optimization.cv_find_lambda_r(data, mle_theta, mle_e_alpha, gamma_g, sigma2, lasso_weight, candidate_list, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, use_admm=False, use_likelihood=True, k=5, diagnosis=False, verbose=False)[source]
find optimal value for hyper-parameter lambda_r by k fold cross-validation
- Parameters:
data (Dict) –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker genes.
celltype_names: a list of string of celltype names.
initial_guess: initial guess of cell-type proportions of spatial spots.
mle_theta (3-D numpy array (spots * celltypes * 1)) – estimated theta (celltype proportion) by MLE.
mle_e_alpha (1-D numpy array) – estimated e_alpha (spot-specific effect) by MLE.
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.
lasso_weight (3-D numpy array (spots * celltypes * 1)) – weight of Adaptive Lasso, 1 ./ MLE theta
candidate_list (list) – candidates for the hyper-parameter
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
use_admm (bool, optional) – whether use ADMM iteration or directly use Adative Lasso loss function. The default is False, i.e. NOT use ADMM in cross-validation as ADMM will cost more time.
use_likelihood (bool, optional) – whether use negative log-likelihood as performance metric in cross-validation. The default is True, if False use RMSE of predicted gene expression.
k (int, optional) – the number of folds in cross-validation, The default value is 5.
diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection.
verbose (bool, optional) – if True, print more information.
- Returns:
optimal lambda_r value
- Return type:
float
imputation
Created on Sat Jan 28 07:15:15 2023
@author: hill103
this script perform imputation on cell-types and gene expression of spatial spots
- Steps:
Identify edge spots (both outer and inner edges) at the original resolution
Generate grid at a specified higher resolution (location of imputed smaller spots)
Initialize cell type proportions for grid points
Impute cell type proportions theta at higher resolution
Further impute gene expression X at higher resolution
- imputation.check_technique(df)[source]
check the input spatial tissue map is derived from ST or 10x Visium technique. The distance of two neighboring spots in the same row is 1 for ST and 2 for 10x. We only select 5 rows with most spots for checking.
- Parameters:
df (dataframe) – dataframe of spatial spots, with columns ‘x’, ‘y’.
- Returns:
tech_mode – the technique for this dataset, either ‘st’ or ‘visium’
- Return type:
str
- imputation.construct_M_matrix(df, sigma, phi, lazy_rw=False, alpha=0.5)[source]
- construct nearest neighbor random walk matrix M, also called transition probability matrix in compute science. The value in M represents the probability that, given a node, the random walker will stay on that node in the next step
W(x,y) = exp(-||x-y||^2 / 2*sigma^2)
- matrix M is based on a Gaussian kernel W with SD/bandwidth sigma and distance threshold phi
M = D^−1 * W
- if lazy_rw is True, i.e. use lazy random walk with a probability alpha of staying at the current node, an additional lazy transition matrix will be calculated
P_lazy = alpha * I + (1−alpha) * P
- Parameters:
df (dataframe) – dataframe of generated grid points at higher resolution, with columns ‘x’, ‘y’.
sigma (float) – standard deviation of Gaussian kernel W.
phi (float) – threshold of Euclidean distance in Gaussian kernel W. If two points with distance > phi, the corresponding weight in W will be 0.
lazy_rw (bool, optional) – if True, return lazy random walk matrix. The default is False.
alpha (float, optional) – the probability alpha of staying at the current node. The default is 0.5.
- Returns:
M – the transition probalility matrix for random walk.
- Return type:
2-D numpy array
- imputation.create_grid(df, contours, hierarchy, contour_info, grid_step, miss_spot_threshold, add_edge=False)[source]
create a high resolution grid inside the contours. we first create a grid on the bounding box of the contour with any shape, then filter out any points outside the contour. we also consider the generated points falling inside the holes and determine whether to filter them. add_edge controls whether add (x,y) coordinates of edge points to the grid, which will cause the step of grid to be uneven. we filter the generated points row-by-row. creatied grid are independent across multiple outer contours if they exist, so in the imputed heatmap, there may be some empty lines inside one contour shape as some points at the same line are presented in other contours
- Parameters:
df (dataframe) – dataframe of spatial spots, with columns ‘x’, ‘y’, and extra columns: ‘border’, ‘inner_border’, ‘outer_border’.
contours (tuple) – contours variable returned by cv2.findContours, used for creating grid.
hierarchy (3-D numpy array (1 * #contours * 4)) – hierarchy variable returned by cv2.findContours, used for creating grid.
contour_info (list of dicts) – parsed contour information used for creating grid.
grid_step (float) – a float number in (0,1), served as the steps of grid.
miss_spot_threshold (int) – if the number of missing (uncaptured) spots in an identified hole is less or equal this value, this hole will be ignored as if there is no hole at all.
add_edge (bool, optional) – if True, add (x,y) coordinates of contour edge points to the grid. Default is False.
- Returns:
grid_df – dataframe of generated poins (passed filtering), with columns ‘x’, ‘y’, ‘contour_idx’.
- Return type:
dataframe
- imputation.do_imputation(loc_file, theta_file, spatial_file, grid_step=0.5, miss_spot_threshold=1, add_edge=False, sigma=0.35, phi=1, diagnosis=False)[source]
perform imputation given location, cell type proportions and nUMI counts of original spatial spots.
optimal hyperparameters in the Gaussian kernel sigma and phi are selected based on analysis on coarse-grained data.
this function can be called in Python to perform imputation.
- Parameters:
loc_file (string) – full path of input csv file of spot locations in spatial transcriptomic data, with columns x and y (spots * 2).
theta_file (string) – full path of input csv file of cell type proportions of spots in spatial transcriptomic data (spots * cell types).
spatial_file (string) – full path of input csv file of raw nUMI counts in spatial transcriptomic data (spots * genes).
grid_step (float, optional) – step size of generated grid at higher resolution.It also equals the distance of two neighboring imputed spots, given the distance of two neighboring spatial spots is 1 for ST or 2 for 10x Visium techniques. The default is 0.5.
miss_spot_threshold (int, optional) – if the number of missing (uncaptured) spots in an identified hole is less or equal this value, this hole will be ignored as if there is no hole at all. The default is 1.
add_edge (bool, optional) – if True, add (x,y) coordinates of contour edge points to the grid. Default is False.
sigma (float, optional) – standard deviation of Gaussian kernel W. The default is 0.35.
phi (float, optional) – threshold of Euclidean distance in Gaussian kernel W. If two points with distance > phi, the corresponding weight in W will be 0. The default is 1.
diagnosis (bool, optional) – if True save the scatter plot including spatial spots and imputed spots.
- Returns:
impute_df (dataframe) – ataframe of generated grid poins at higher resolution (passed filtering), with columns ‘x’, ‘y’, ‘contour_idx’.
impute_theta (dataframe) – imputed cell type proportions of generated grid points at higher resolution, with columns are cell types.
impute_X (dataframe) – imputed normalized gene expression of generated grid points at higher resolution, rows are grid points, columns are gene expressions.
- imputation.do_random_walk(df, theta, sigma, phi, n_step=1, lazy_rw=False, alpha=0.5)[source]
impute cell type proportion theta for generated grid points at higher resolution using one step random walk.
note if the input cell type proportions are normalized (i.e., each row sums to 1), the imputed results using a valid transition probability matrix will also be normalized.
- Parameters:
df (dataframe) – dataframe of generated grid points at higher resolution, with columns ‘x’, ‘y’.
theta (dataframe) – initial cell type proportions of generated grid points at higher resolution, with columns are cell types.
sigma (float) – standard deviation of Gaussian kernel W.
phi (float) – threshold of Euclidean distance in Gaussian kernel W. If two points with distance > phi, the corresponding weight in W will be 0.
n_step (int, optional) – steps for random walk. The default is 1.
lazy_rw (bool, optional) – if True, return lazy random walk matrix. The default is False.
alpha (float, optional) – the probability alpha of staying at the current node. The default is 0.5.
- Returns:
impute_theta – imputed cell type proportions of generated grid points at higher resolution, with columns are cell types.
- Return type:
dataframe
- imputation.identify_edges(df, mode, keep_filled=False)[source]
identify edge spots. Column ‘border’ of Edge spots are True. NOTE edge spot identification is based on ‘x’ and ‘y’ coordinate indexes not pixels. when keep_filled=True, filled spots for 10x will be kept for creating grid.
- Parameters:
df (dataframe) – dataframe of spatial spots, with columns ‘x’, ‘y’.
mode (string) – either ‘st’ for Spatial Transcriptomics technique, or ‘visium’ for 10x Genomics Visium technique.
keep_filled (bool, optional) – if True, filled spots for 10x will be kept for creating grid.
- Returns:
df (dataframe) – dataframe with extra columns: ‘border’, ‘inner_border’, ‘outer_border’, ‘filled’.
contours (tuple) – contours variable returned by cv2.findContours, used for creating grid.
hierarchy (3-D numpy array (1 * #contours * 4)) – hierarchy variable returned by cv2.findContours, used for creating grid.
contour_info (list of dicts) – parsed contour information used for creating grid.
- imputation.impute_expression(X, theta, grid_theta)[source]
impute gene expression for generated grid points at higher resolution. The imputed expression are assured to sum to 1 for each spot/row. And all values in imputed expression are non-negative, as multiplying non-negative numbers will always yield a non-negative result.
X_imputed = theta_imputed * theta^+ * X
theta^+ = (theta^T * theta)^(−1) * theta^T
- Parameters:
X (dataframe) – nUMI counts of spatial spots, rows are spots and columns are gene expressions
theta (dataframe) – cell type proportions of spatial spots at original resolution, with columns are cell types.
grid_theta (dataframe) – imputed cell type proportions of generated grid points at higher resolution, with columns are cell types.
- Returns:
X_imputed – imputed normalized gene expression of generated grid points at higher resolution, rows are grid points, columns are gene expressions.
- Return type:
dataframe
- imputation.impute_spots_with_theta(df, theta, grid_step=0.5, miss_spot_threshold=1, add_edge=False, sigma=0.35, phi=1, diagnosis=False)[source]
given the spatial spots location and cell type proportions, impute smaller spots at a higher resolution, return the imputed location and cell type proportions for these smaller spots.
- Parameters:
df (dataframe) – dataframe of spatial spots at original resolution, with columns ‘x’, ‘y’.
theta (dataframe) – cell type proportions of spatial spots at original resolution, with columns are cell types.
grid_step (float, optional) – step size of generated grid at higher resolution.It also equals the distance of two neighboring imputed spots, given the distance of two neighboring spatial spots is 1 for ST or 2 for 10x Visium techniques. The default is 0.5.
miss_spot_threshold (int, optional) – if the number of missing (uncaptured) spots in an identified hole is less or equal this value, this hole will be ignored as if there is no hole at all. The default is 1.
add_edge (bool, optional) – if True, add (x,y) coordinates of contour edge points to the grid. Default is False.
sigma (float, optional) – standard deviation of Gaussian kernel W. The default is 0.35.
phi (float, optional) – threshold of Euclidean distance in Gaussian kernel W. If two points with distance > phi, the corresponding weight in W will be 0. The default is 1.
diagnosis (bool, optional) – if True save the scatter plot including spatial spots and imputed spots.
- Returns:
impute_df (dataframe) – dataframe of generated grid points at higher resolution, with columns ‘x’, ‘y’, ‘contour_idx’.
impute_theta (dataframe) – imputed cell type proportions of generated grid points at higher resolution, with columns are cell types.
- imputation.initialize_theta_grid_points(df, theta, grid_df)[source]
initialize cell type proportions grid_theta for generated grid points in higher resolution. for one grid point, grid_theta is set as the theta value of spatial spot in original resolution with the smallest Euclidean distance. if there are multiple spatial spots with the same smallest Euclidean distance, then take average of theta of those spots finally we will normalize all grid_theta to sum to 1.
- Parameters:
df (dataframe) – dataframe of spatial spots at original resolution, with columns ‘x’, ‘y’.
theta (dataframe) – cell type proportions of spatial spots at original resolution, with columns are cell types.
grid_df (dataframe) – dataframe of generated grid points at higher resolution, with columns ‘x’, ‘y’.
- Returns:
grid_theta – cell type proportions of generated grid points at higher resolution, with columns are cell types.
- Return type:
dataframe
- imputation.parseOpt()[source]
parse the command line parameters and return them as a dict
- Parameters:
None.
- Returns:
paramdict –
- parsed command line parameters for model, including:
spatial_file : full file path of spatial transcriptomic data
loc_file : full file path of spot locations in spatial transcriptomic data
prop_file : full file path of cell type proportions of spots in spatial transcriptomic data
diameter : the physical diameter of spatial spots
impute_diameter : target spot diameter for imputation
hole_min_spots : threshold of number of uncaptured spots to validate holes
preserve_shape : whether to preserve the exact shape of tissue map
diagnosis : whether to draw plot for diagnosis
- Return type:
Dict
local_fit_numba
Created on Thu May 5 22:17:11 2022
@author: hill103
this script implement the optimization of theta and sigma^2 using Poisson log-normal distribution + heavy-tail
use Numba parallel for best performance
we calculate the likelihood for each gene sequentially, utilizing Numba’s parallel capabilities to accelerate the computation within each gene’s calculation.
NOTE: simplifying the loss-computation code and removing unneeded nested wrappers yielded a speedup
and print not supported in Numba function
functions for likelihood calculation:
calc_hv_numba: Numba function so parallel computing by Numba; calculate likelihoods given an array of y and mu then return; NOT related with cache dict
hv_comb: return sum of negative log-likelihoods + gradient of w; used for theta optimization; directly call hv_numba
hv_wrapper: return sum of negative log-likelihoods; used for sigma^2 optimization; directly call hv_numba
hv_numba: call calc_hv_numba and return array of likelihoods; support caching (DELETED)
sequence of function calls for parameter optimization:
update_theta -> objective_loss_theta -> hv_comb -> calc_hv_numba
update_sigma2 -> objective_loss_sigma2 -> hv_wrapper -> calc_hv_numba
calcHVBaseModelLoss -> hv_wrapper in hyper parameter tunning
- class local_fit_numba.RandomDisplacementBounds(xmin, xmax, stepsize=0.5)[source]
Bases:
objectA class to perform random displacement with bounds on parameters during optimization.
This step function is designed to modify the basinhopping algorithm’s step-taking behavior by ensuring that new positions remain within specified bounds. This implementation uses a direct calculation to determine the minimum and maximum allowable steps rather than using acceptance-rejection sampling, which can be found in the more general approach discussed on StackOverflow (https://stackoverflow.com/a/21967888/2320035). The approach has been modified for enhanced performance with specific bounds.
- xmin
The lower bounds of the parameters. None indicates unbounded.
- Type:
np.ndarray or list
- xmax
The upper bounds of the parameters. None indicates unbounded.
- Type:
np.ndarray or list
- stepsize
The maximum step size to take in any parameter direction.
- Type:
float
- local_fit_numba.adaptive_lasso(nu, rho, lambda_r=1.0, lasso_weight=None)[source]
update theta_tilde by Adaptive Lasso and ADMM loss
theta_tilde = argmin lambda_r*lasso_weight*theta_tilde + 1/(2*rho) * (||theta_tilde-theta_hat+u_tilde||_2)^2
that is
if theta_tilde>=0, theta_tilde = theta_hat - u_tilde - lambda_r*rho*lasso_weight
if theta_tilde<0, theta_tilde = theta_hat - u_tilde + lambda_r*rho*lasso_weight
change it to
if theta_hat-u_tilde-lambda_r*rho*lasso_weight>=0, theta_tilde = theta_hat - u_tilde - lambda_r*rho*lasso_weight
if theta_hat-u_tilde+lambda_r*rho*lasso_weight<=0, theta_tilde = theta_hat - u_tilde + lambda_r*rho*lasso_weight
if theta_hat-u_tilde-lambda_r*rho*lasso_weight<0 or theta_hat-u_tilde+lambda_r*rho*lasso_weight>0, not defined, let theta_tilde = 0
and nu = theta_hat - u_tilde
WANRING: negative theta tilde value observed, so also clip it to >= 0
- Parameters:
nu (3-D numpy array (spots * celltypes * 1)) – variable for ADMM penalty of all spots in 3 part ADMM nu = theta_hat (used for regularization/penalty) - u_tilde (scaled dual variables to make theta_tilde = theta_hat)
rho (float) – parameter for the strength of ADMM loss to make theta_tilde equals theta_hat
lambda_r (float, optional) – strength for Adaptive Lasso penalty. The default is 1.0.
lasso_weight (3-D numpy array (spots * celltypes * 1), optional) – calculated weight for adaptive lasso. The default is None.
- Returns:
updated theta_tilde.
- Return type:
3-D numpy array (spots * celltypes * 1)
- local_fit_numba.calc_hv_numba(MU, y_vec, hv_x, hv_log_p, output, gamma)[source]
calculate the likelihood value given the distribution parameters considering heavy-tail and assign them to the output variable
note that add a small value to avoid log(0)
- Parameters:
MU (1-D numpy array) – mean value of log-normal distribution for all genes.
y_vec (1-D numpy array) – observed gene counts of one spot.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
output (1-D numpy array) – empty array used to store the calculated values.
gamma (float) – increment to calculate the heavy-tail probabilities.
- Returns:
likelihood of input genes in one spot.
- Return type:
1-D numpy array
- local_fit_numba.fit_base_model_plus_laplacian(data, L, theta, e_alpha, gamma_g, sigma2, lambda_r=None, lasso_weight=None, hv_x=None, hv_log_p=None, theta_mask=None, opt_method='L-BFGS-B', global_optimize=False, hybrid_version=True, verbose=False)[source]
fit base model plus Laplacian penalty, this function is to replace the whole ADMM framework in GLRM stage two
Note: To keep consistent, the output of the fit result is still 3-Dimensional. The dimension transform will be performed outside this function if needed
- Parameters:
data (Dict) –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker genes.
celltype_names: a list of string of celltype names.
initial_guess: initial guess of cell-type proportions of spatial spots.
L (scipy sparse matrix (spots * spots)) – Laplacian matrix. Note the strenth lambda_g is already absorbed in the L
theta (3-D numpy array (spots * celltypes * 1)) – initial guess of theta (celltype proportion).
e_alpha (1-D numpy array) – initial guess of e_alpha (spot-specific effect).
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – initial guess of variance paramter of the lognormal distribution of ln(lambda). All genes and spots share the same variance. it may not be updated during this optimization, i.e. sigma2 is treated as an already optimized value.
lambda_r (float, optional) – strength for Adaptive Lasso penalty. The default is None here.
lasso_weight (3-D numpy array (spots * celltypes * 1), optional) – calculated weight for adaptive lasso. The default is None.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
theta_mask (3-D numpy array (spots * celltypes * 1), optional) – mask for cell-type proportions (1: present, 0: not present). Only used for stage 2 theta optmization.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
verbose (bool, optional) – if True, print more information.
- Returns:
- estimated model coefficients, including:
theta : celltype proportions (#spots * #celltypes * 1)
e_alpha : spot-specific effect (1-D array with length #spot)
sigma2 : variance paramter of the lognormal distribution (float)
gamma_g : gene-specific platform effect for all genes (1-D array with length #gene)
- Return type:
Dict
- local_fit_numba.generate_log_heavytail_array(x, sigma)[source]
generate a array of log density values of standard normal distribution + heavy-tail given a specific sigma
we assume the normal distribution is N(0, sigma^2), and transform to N(0, 1) by divided by sigma, and calculate the heavy-tail density values, then transform back to N(0, sigma^2) and take log. We can re-use these pre-calculated values for all genes and spots
after divided by sigma, the normal + heavy-tail integrate to 1, i.e. sum(p*gamma)=1, gamma is the interval
- Parameters:
x (1-D numpy array) – data points served as x for calculation of probability density values
sigma (float) – SD of normal distribution, corresponding to the dispertion in GLRM. All genes and all spots share the same SD
- Returns:
log density values of normal distribution N(0, sigma^2) + heavy-tail.
- Return type:
1-D numpy array
- local_fit_numba.hv_comb(w_vec, y_vec, mu, gamma_g, sigma2, hv_x, hv_log_p, N)[source]
a function to calculate the negative likelihood value given the distribution parameters considering heavy-tail plus corresponding gradient vector
also return 1st order derivative vector of w
this function will be used as target function for theta optimization
we assume:
y ~ Poisson(N*lambda)
ln(lambda) follow epsilon’s distribution ~ N(mu, sigma2)
then ln(N*lambda) ~ N(mu+ln(N), sigma2)
N is sequencing depth for this spot mu is alpha + log(theta*marker gene) + gamma
Now we also consider the heavy-tail instead of just normal distribution
- Parameters:
w_vec (1-D numpy array) – e_alpha (spot-specific effect) * theta (celltype proportion) of one spot.
y_vec (1-D numpy array) – observed gene counts of one spot.
mu (2-D numpy matrix) – celltype specific marker genes (celltypes * genes)
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
N (float) – sequencing depth for this spot. If is None, use sum(y_vec) instead.
- Returns:
sum of negative log-likelihood across all genes in one spot + gradient vector of w.
- Return type:
tuple of (float, 1-D numpy array)
- local_fit_numba.hv_wrapper(w_vec, y_vec, mu, gamma_g, sigma2, hv_x, hv_log_p, N)[source]
a wrapper to calculate the negative log-likelihood value given the distribution parameters considering heavy-tail
this wrapper will be used as target function for sigma^2 optimization
we assume:
y ~ Poisson(N*lambda)
ln(lambda) follow epsilon’s distribution ~ N(mu, sigma2)
then ln(N*lambda) ~ N(mu+ln(N), sigma2)
N is sequencing depth for this spot mu is alpha + log(theta*marker gene) + gamma
Now we also consider the heavy-tail instead of just normal distribution
- Parameters:
w_vec (1-D numpy array) – e_alpha (spot-specific effect) * theta (celltype proportion) of one spot.
y_vec (1-D numpy array) – observed gene counts of one spot.
mu (2-D numpy matrix) – celltype specific marker genes (celltypes * genes)
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
N (float) – sequencing depth for this spot. If is None, use sum(y_vec) instead.
- Returns:
sum of negative log-likelihood across all genes in one spot.
- Return type:
float
- local_fit_numba.objective_loss_theta(w_vec, y_vec, mu, gamma_g, sigma2, nu_vec, rho, lambda_r, lasso_weight_vec, lambda_l2, hybrid_version, hv_x, hv_log_p, N)[source]
calculate loss function for updating theta (celltype proportion)
the loss function contains four parts (defined for each spot separately)
also return gradient of w
negative log-likelihood of the base model given observed data and initial parameter value. It sums across all genes
a loss of ADMM to make theta equals theta_hat (used for regularization/penalty; optional, controlled by nu_vec and rho)
1/(2*rho) (||w/sum(w) - theta_hat + u||_2)^2
u is the scaled dual variables to make theta = theta_hat and nu = theta_hat - u
and we did re-parametrization w = e^alpha * theta, so theta = w / sum(w)
a loss of Adaptive Lasso to shrink theta (optional, controlled by lambda_r and lasso_weight_vec)
lambda_r * (inner product(w/sum(w), lasso_weight))
a loss of L2 penalty to shrink theta (optional, controlled by lambda_l2)
lambda_l2 * (sum(squared(w/sum(w))))
- Parameters:
w_vec (1-D numpy array) – e_alpha (spot-specific effect) * theta (celltype proportion) of one spot.
y_vec (1-D numpy array) – observed gene counts of one spot.
mu (2-D numpy matrix) – celltype specific marker genes (celltypes * genes)
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.
nu_vec (1-D numpy array) – variable for ADMM penalty of one spot in 3 part ADMM nu = theta_hat (used for regularization/penalty) - u (scaled dual variables to make theta = theta_hat)
rho (float) – parameter for the strength of ADMM loss to make theta equals theta_hat
lambda_r (float) – parameter for the strength of Adaptive Lasso loss to shrink theta
lasso_weight_vec (1-D numpy array) – weight of Adaptive Lasso, 1 ./ theta
lambda_l2 (float) – parameter for the strength of L2 panealty to shrink theta
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
N (float) – sequencing depth for this spot. If is None, use sum(y_vec) instead.
- Returns:
the loss function (base model loss + ADMM loss + Adaptive Lasso loss + L2 loss) of one spot to update w (e_alpha*theta) + gradient vector
- Return type:
a tuple (float, 1-D numpy array)
- local_fit_numba.optimize_one_theta(mu, y_vec, N, this_warm_start_theta, this_warm_start_e_alpha, gamma_g, sigma2, spot_name, nu_vec=None, rho=None, lambda_r=None, lasso_weight_vec=None, lambda_l2=None, global_optimize=False, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, this_theta_mask=None, skip_opt=True, verbose=False)[source]
update theta (celltype proportion) sigma2 (variance paramter of the log-normal distribution) and gamma_g (gene-specific platform effect) by MLE in ONE SPOT
- Parameters:
mu (2-D numpy matrix) – matrix of celltype specific marker gene expression (celltypes * genes).
y_vec (1-D numpy array) – spatial gene expression (length #genes).
N (int or None) – sequencing depth of this spot. If it’s None, use sum of observed marker gene expressions as sequencing depth.
spot_name (string) – name of this spot.
this_warm_start_theta (1-D numpy array (length #celltypes)) – initial guess of theta (celltype proportion).
this_warm_start_e_alpha (float) – initial guess of e_alpha (spot-specific effect).
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.
nu_vec (1-D numpy array (length #celltypes), optional) – variable for ADMM penalty of all spots in 3 part ADMM nu = theta_hat (used for regularization/penalty) - u (scaled dual variables to make theta = theta_hat)
rho (float, optional) – parameter for the strength of ADMM loss to make theta equals theta_hat
lambda_r (float, optional) – parameter for the strength of Adaptive Lasso loss to shrink theta
lasso_weight_vec (1-D numpy array (length #celltypes), optional) – weight of Adaptive Lasso, 1 ./ theta
lambda_l2 (float) – parameter for the strength of L2 panealty to shrink theta
global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
this_theta_mask (1-D numpy array (length #celltypes), optional) – mask for cell-type proportions (1: present, 0: not present).
skip_opt (bool, optional) – if True, when only one cell type present, we skip optimization and directly return theta.
verbose (bool, optional) – if True, print more information.
- Returns:
w_result – updated w, i.e. theta (celltype proportion) * e_alpha.
- Return type:
1-D numpy array (length #celltypes)
- local_fit_numba.update_sigma2(data, theta, e_alpha, gamma_g, sigma2, opt_method='L-BFGS-B', global_optimize=False, hv_x=None, verbose=False)[source]
update sigma2 (variance paramter of the lognormal distribution) given theta (celltype proportion), e_alpha (spot-specific effect) and gamma_g (gene-specific platform effect) by MLE
we assume
ln(lambda) = alpha + gamma_g + ln(sum(theta*mu_X)) + epsilon
subject to sum(theta)=1, theta>=0
mu_X is marker genes from data[‘X’]
then the mean parameter of the log-normal distribution of ln(lambda) is alpha + gamma_g + ln(sum(theta*mu_X))
we did re-parametrization w = e^alpha * theta, then
ln(lambda) = gamma_g + ln(sum([e^alpha*theta]*mu_X)) + epsilon
subject to w>=0, it will imply sum(theta)=1 and theta>=0
currently optimization of sigma2 DO NOT support parallel computing
- Parameters:
data (Dict) –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker genes.
celltype_names: a list of string of celltype names.
theta (3-D numpy array (spots * celltypes * 1)) – theta (celltype proportion).
e_alpha (1-D numpy array) – e_alpha (spot-specific effect).
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – initial guess of variance paramter of the lognormal distribution of ln(lambda). All genes and spots share the same variance.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
verbose (bool, optional) – if True, print more information.
- Returns:
updated sigma2
- Return type:
float
- local_fit_numba.update_theta(data, warm_start_theta, warm_start_e_alpha, gamma_g, sigma2, nu=None, rho=None, lambda_r=None, lasso_weight=None, lambda_l2=None, global_optimize=False, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, theta_mask=None, verbose=False)[source]
update theta (celltype proportion) and e_alpha (spot-specific effect) given sigma2 (variance paramter of the log-normal distribution) and gamma_g (gene-specific platform effect) by MLE
we assume
ln(lambda) = alpha + gamma_g + ln(sum(theta*mu_X)) + epsilon
subject to sum(theta)=1, theta>=0
mu_X is marker genes from data[‘X’]
then the mean parameter of the lognormal distribution of ln(lambda) is alpha + gamma_g + ln(sum(theta*mu_X))
we did re-parametrization w = e^alpha * theta, then
ln(lambda) = gamma_g + ln(sum([e^alpha*theta]*mu_X)) + epsilon
subject to w>=0, it will imply sum(theta)=1 and theta>=0
- the steps to update theta and e_alpha:
dimension change of theta, theta_hat, u from 3-D (spots * celltypes * 1) to 1-D (celltypes), and do re-parametrization to get w
solve w for each spot in parallel
extract updated theta and e_alpha from w, and change the dimension of updated theta, theta_hat, u back
- Parameters:
data (Dict) –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker genes.
celltype_names: a list of string of celltype names.
warm_start_theta (3-D numpy array (spots * celltypes * 1)) – initial guess of theta (celltype proportion).
warm_start_e_alpha (1-D numpy array) – initial guess of e_alpha (spot-specific effect).
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.
nu (3-D numpy array (spots * celltypes * 1), optional) – variable for ADMM penalty of all spots in 3 part ADMM nu = theta_hat (used for regularization/penalty) - u (scaled dual variables to make theta = theta_hat)
rho (float, optional) – parameter for the strength of ADMM loss to make theta equals theta_hat
lambda_r (float, optional) – parameter for the strength of Adaptive Lasso loss to shrink theta
lasso_weight (3-D numpy array (spots * celltypes * 1), optional) – weight of Adaptive Lasso, 1 ./ theta
lambda_l2 (float) – parameter for the strength of L2 panealty to shrink theta
global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
theta_mask (3-D numpy array (spots * celltypes * 1), optional) – mask for cell-type proportions (1: present, 0: not present). Only used for stage 2 theta optmization.
verbose (bool, optional) – if True, print more information.
- Returns:
theta_results (3-D numpy array (spots * celltypes * 1)) – updated theta (celltype proportion).
e_alpha_results (1-D numpy array) – updated e_alpha (spot-specific effect).
model_fit
Created on Sat May 7 22:36:00 2022
@author: hill103
this script stores functions of model fitting currently we use a two-stage implement to fit GLRM
- model_fit.estimating_gamma_g(data, hybrid_version=True, opt_method='L-BFGS-B', verbose=False)[source]
estimate platform effect gamma_g by fit model on pseudo-bulk measurement
- Parameters:
data (Dict) –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker gene symbols.
celltype_names: a list of string of celltype names.
initial_guess: initial guess of cell-type proportions of spatial spots.
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
verbose (bool, optional) – if True, print more information in program running
- Returns:
estimated gamma_g
- Return type:
1-D numpy array (#genes)
- model_fit.fit_base_model(data, gamma_g=None, global_optimize=False, hybrid_version=True, opt_method='L-BFGS-B', verbose=False, use_initial_guess=False)[source]
fit local or base model without any Adaptive Lasso constrain or Graph Laplacian constrain
this fitting is only used for gamma_g estimation and MLE theta estimation in GLRM stage 1
when fitting the base model, we will perform fitting iteratively until the sigma^2 change <= 0.001
Note: the output of the fit result is 3-Dimensional. The dimension transform will be performed outside this function if needed
- Parameters:
data (Dict) –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker gene symbols.
celltype_names: a list of string of celltype names.
initial_guess: initial guess of cell-type proportions of spatial spots.
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
verbose (bool, optional) – if True, print more information in program running.
use_initial_guess (bool, optional) – if True, use initial guess instead of uniform distribution for theta initialization.
- Returns:
- a tuple of estimated theta, e_alpha and sigma2:
theta : celltype proportions (3-D numpy array (spots * celltypes * 1))
e_alpha : spot-specific effect (1-D array with length #spot)
sigma2 : variance paramter of the lognormal distribution (float)
- Return type:
tuple
- model_fit.fit_model_two_stage(data, gamma_g=None, lambda_r=None, lambda_g=None, global_optimize=False, hybrid_version=True, opt_method='L-BFGS-B', verbose=False, diagnosis=False, use_admm=False)[source]
fit Graph Laplacian Regularized Stratified Model (GLRM) in a two-stage way
- Parameters:
data (Dict) –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker genes.
celltype_names: a list of string of celltype names.
initial_guess: initial guess of cell-type proportions of spatial spots.
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
lambda_r (float) – strength for Adaptive Lasso penalty. If None, use cross-validation to determine optimal lambda_r
lambda_g (float) – edge weight for graph, and will affect the strength of Graph Laplacian constrain. If None, use cross-validation to determine optimal graph_weight.
global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
verbose (bool, optional) – if True, print more information in program running.
diagnosis (bool, optional) – if True save more information to files for diagnosis CVAE and hyper-parameter selection.
use_admm (bool, optinal) – if True, use ADMM framework in optimization, default is False.
- Returns:
- estimated model coefficients, including:
theta : celltype proportions (#spots * #celltypes * 1)
e_alpha : spot-specific effect (1-D array with length #spot)
sigma2 : variance paramter of the lognormal distribution (float)
gamma_g : gene-specific platform effect for all genes (1-D array with length #gene)
- Return type:
Dict
- model_fit.fit_stage1_spotwise_lambda_r(data, warm_start_theta, warm_start_e_alpha, gamma_g, sigma2, lambda_r, lasso_weight, global_optimize=False, hybrid_version=True, opt_method='L-BFGS-B', hv_x=None, hv_log_p=None, verbose=False, diagnosis=False)[source]
update theta (celltype proportion) and e_alpha (spot-specific effect) given sigma2 (variance paramter of the log-normal distribution) and gamma_g (gene-specific platform effect) by MLE
UPDATE: here we assume the hyperparameter lambda_r for Adaptive Lasso is spot-wise, i.e. for each spot, we search for the optimal lambda_r, then fit the model
UPDATE2: we apply Nested Subset Selection Strategy, thus no lambda_r anymore
we assume
ln(lambda) = alpha + gamma_g + ln(sum(theta*mu_X)) + epsilon
subject to sum(theta)=1, theta>=0
mu_X is marker genes from data[‘X’]
then the mean parameter of the lognormal distribution of ln(lambda) is alpha + gamma_g + ln(sum(theta*mu_X))
we did re-parametrization w = e^alpha * theta, then
ln(lambda) = gamma_g + ln(sum([e^alpha*theta]*mu_X)) + epsilon
subject to w>=0, it will imply sum(theta)=1 and theta>=0
- the steps to update theta and e_alpha:
dimension change of theta from 3-D (spots * celltypes * 1) to 1-D (celltypes), and do re-parametrization to get w
solve w for each spot in parallel
extract updated theta and e_alpha from w, and change the dimension of updated theta back
- Parameters:
data (Dict) –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker genes.
celltype_names: a list of string of celltype names.
warm_start_theta (3-D numpy array (spots * celltypes * 1)) – initial guess of theta (celltype proportion).
warm_start_e_alpha (1-D numpy array) – initial guess of e_alpha (spot-specific effect).
gamma_g (1-D numpy array) – gene-specific platform effect for all genes.
sigma2 (float) – variance paramter of the lognormal distribution of ln(lambda). All gene share the same variance.
lambda_r (float or 1-D numpy array) – parameter for the strength of Adaptive Lasso loss to shrink theta
lasso_weight (3-D numpy array (spots * celltypes * 1)) – weight of Adaptive Lasso, 1 ./ theta
global_optimize (bool, optional) – if is True, use basin-hopping algorithm to find the global minimum. The default is False.
hybrid_version (bool, optional) – if True, use the hybrid_version of GLRM, i.e. in ADMM local model loss function optimization for w but adaptive lasso constrain on theta. If False, local model loss function optimization and adaptive lasso will on the same w. The default is True.
opt_method (string, optional) – specify method used in scipy.optimize.minimize for local model fitting. The default is ‘L-BFGS-B’, a default method in scipy for optimization with bounds. Another choice would be ‘SLSQP’, a default method in scipy for optimization with constrains and bounds.
hv_x (1-D numpy array, optional) – data points served as x for calculation of probability density values. Only used for heavy-tail.
hv_log_p (1-D numpy array, optional) – log density values of normal distribution N(0, sigma^2) + heavy-tail. Only used for heavy-tail.
verbose (bool, optional) – if True, print more information.
diagnosis (bool, optional) – if True print more information for diagnosis of hyper-parameter selection.
- Returns:
theta_results (3-D numpy array (spots * celltypes * 1)) – updated theta (celltype proportion).
e_alpha_results (1-D numpy array) – updated e_alpha (spot-specific effect).
parse_opt
Created on Tue May 10 03:23:09 2022
@author: hill103
this script stores functions related to receive and parse the command line parameters checking on parameters will also be performed
- parse_opt.parseOpt()[source]
parse the command line parameters and return them as a dict
- Parameters:
None.
- Returns:
paramdict –
- parsed command line parameters for model, including:
spatial_file : full file path of spatial transcriptomic data
ref_file : full file path of scRNA-seq data
ref_celltype_file : full file path of the corresponding cell-type annotation of scRNA-seq data
marker_file : full file path of user curated cell-type marker gene expression
loc_file : full file path of spot locations in spatial transcriptomic data
A_file : full file path Adjacency Matrix of spots in spatial transcriptomic data
n_cores : number of CPU cores used for parallel computing
lambda_r : hyper-parameter for Adaptive Lasso
lambda_g : hyper-parameter for graph weight, affecting the Laplacian Matrix
use_cvae : whether to build CVAE
threshold : threshold for hard thresholding estimated cell-type proportion theta
n_hv_gene : number of highly variable genes for CVAE
n_marker_per_cmp : number of TOP marker genes for each comparison in DE
n_pseudo_spot : number of pseudo-spots
pseudo_spot_min_cell : minimum value of cells in pseudo-spot
pseudo_spot_max_cell : maximum value of cells in pseudo-spot
seq_depth_scaler : a scaler of scRNA-seq sequencing depth
cvae_input_scaler : maximum value of the scaled input for CVAE
cvae_init_lr : initial learning rate for training CVAE
num_hidden_layer : number of hidden layers in encoder and decoder
use_batch_norm : whether to use Batch Normalization
cvae_train_epoch : max number of training epochs for the CVAE
use_spatial_pseudo : whether to generate “pseudo-spots” in spatial condition
redo_de : whether to redo DE after CVAE transformation
seed : seed value for random in building CVAE
diagnosis : True or False, if True save more information to files for diagnosis CVAE and hyper-parameter selection
verbose : True or False, if True print more information during program running
use_fdr : whether to use FDR adjusted p value for filtering and sorting
p_val_cutoff : threshold of p value (or FDR if –use_fdr is true) in marker genes filtering
fc_cutoff : threshold of fold change (without log transform!) in marker genes filtering
pct1_cutoff : threshold of pct.1 in marker genes filtering
pct2_cutoff : threshold of pct.2 in marker genes filtering
sortby_fc : whether to sort marker genes by fold change
filter_cell : whether to filter cells before DE
filter_gene : whether to filter genes before DE
use_imputation : whether to perform imputation
diameter : the physical diameter of spatial spots
impute_diameter : target spot diameter for imputation
hole_min_spots : threshold of number of uncaptured spots to validate holes
preserve_shape : whether to preserve the exact shape of tissue map
- Return type:
Dict
preprocess
Created on Tue May 10 22:13:23 2022
@author: hill103
this script stores functions related to pre-processing including:
read data from CSV files
- determine whether to perform DE for cell-type specific marker genes:
if user specified marker file, then DE step will be skipped
- determine whether to build CVAE to adjust the platform effect:
if user specified use_cvae as True and also provide original scRNA-seq data and corresponding cell-type annotation, then CVAE will be built
filter genes and spots before GLRM modeling
check datasets to avoid some mistakes
- preprocess.preprocess(spatial_file, ref_file, ref_anno_file, marker_file, A_file, use_cvae, n_hv_gene, n_marker_per_cmp, n_pseudo_spot, pseudo_spot_min_cell, pseudo_spot_max_cell, seq_depth_scaler, cvae_input_scaler, cvae_init_lr, num_hidden_layer, use_batch_norm, cvae_train_epoch, use_spatial_pseudo, redo_de, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, filter_cell, filter_gene, diagnosis)[source]
preprocess files
- Parameters:
spatial_file (string) – full path of input csv file of raw nUMI counts in spatial transcriptomic data (spots * genes).
ref_file (string) – full path of input csv file of raw nUMI counts in scRNA-seq data (cells * genes).
ref_anno_file (string) – full path of input csv file of cell-type annotations for all cells in scRNA-seq data.
marker_file (string) – full path of input csv file of cell-typee marker gene expression (cell-types * genes).
A_file (string) – full path of input csv file of Adjacency Matrix of spots in spatial transcriptomic data (spots * spots).
use_cvae (bool) – whether to build CVAE to adjust platform effect.
n_hv_gene (int) – number of highly variable genes for CVAE.
n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE.
n_pseudo_spot (int) – number of pseudo-spots.
pseudo_spot_min_cell (int) – minimum value of cells in pseudo-spot.
pseudo_spot_max_cell (int) – maximum value of cells in pseudo-spot.
seq_depth_scaler (int) – a scaler of scRNA-seq sequencing depth.
cvae_input_scaler (int) – maximum value of the scaled input for CVAE.
cvae_init_lr (float) – initial learning rate for training CVAE.
num_hidden_layer (int) – number of hidden layers in encoder and decoder.
use_batch_norm (bool) – whether to use Batch Normalization.
cvae_train_epoch (int) – max number of training epochs for the CVAE.
use_spatial_pseudo (int) – whether to generate “pseudo-spots” in spatial condition.
redo_de (bool) – whether to redo DE after CVAE transformation.
use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting.
p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering.
fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering.
pct1_cutoff (float) – threshold of pct.1 in marker genes filtering.
pct2_cutoff (float) – threshold of pct.2 in marker genes filtering.
sortby_fc (bool) – whether to sort marker genes by fold change.
diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection.
filter_cell (bool) – whether to filter cells before DE.
filter_gene (bool) – whether to filter genes before DE.
- Returns:
data –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker gene symbols.
celltype_names: a list of string of cell-type names.
initial_guess: initial guess of cell-type proportions of spatial spots.
- Return type:
Dict
run_model
Created on Tue May 10 04:52:02 2022
@author: hill103
this script stores function to define a Graph Laplacian Regularized Stratified Model (GLRM) and call functions to fit the coefficients
It receieve the input data, build graph, build GLRM and fit coefficients
- run_model.run_GLRM(data, lambda_r=None, lambda_g=None, estimate_gamma_g=True, n_jobs=1, threshold=0, diagnosis=False, verbose=False)[source]
run GLRM, and fit coefficients
- Parameters:
data (Dict) –
- a Dict contains all info need for modeling:
X: a 2-D numpy matrix of celltype specific marker gene expression (celltypes * genes).
Y: a 2-D numpy matrix of spatial gene expression (spots * genes).
A: a 2-D numpy matrix of Adjacency matrix (spots * spots), or is None. Adjacency matrix of spatial sptots (1: connected / 0: disconnected). All 0 in diagonal.
N: a 1-D numpy array of sequencing depth of all spots (length #spots). If it’s None, use sum of observed marker gene expressions as sequencing depth.
non_zero_mtx: If it’s None, then do not filter zeros during regression. If it’s a bool 2-D numpy matrix (spots * genes) as False means genes whose nUMI=0 while True means genes whose nUMI>0 in corresponding spots. The bool indicators can be calculated based on either observerd raw nUMI counts in spatial data, or CVAE transformed nUMI counts.
spot_names: a list of string of spot barcodes. Only keep spots passed filtering.
gene_names: a list of string of gene symbols. Only keep actually used marker genes.
celltype_names: a list of string of celltype names.
initial_guess: initial guess of cell-type proportions of spatial spots.
lambda_r (float, optional) – strength for Adaptive Lasso penalty. The default is None, i.e. use cross-validation to determine optimal value
lambda_g (float, optional) – edge weight for graph, and will affect the strength of Graph Laplacian constrain. The default is None, i.e. use cross-validation to determine optimal value
estimate_gamma_g (bool, optional) – whether to estimate gamma_g (gene-specific platform effect). The default is True.
n_jobs (int, optional) – number of jobs to spawn. The default is 1.
threshold (float, optional) – threshold value for hard thresholding estimated cell-type proportion theta. The default is 0, i.e. no hard thresholding.
diagnosis (bool) – if True save more information to files for diagnosis CVAE and hyper-parameter selection
verbose (bool, optional) – if True, print variables in each ADMM loop. The default is True.
- Returns:
result –
- estimated model coefficients, including (note theta dimension changed back to 2-D):
theta : celltype proportions (#spots * #celltypes)
e_alpha : spot-specific effect (1-D array with length #spot)
sigma2 : variance paramter of the lognormal distribution (float)
gamma_g : gene-specific platform effect for all genes (1-D array with length #gene)
- Return type:
Dict
utils
Created on Thu May 5 23:28:03 2022
@author: hill103
this script stores utils functions
- utils.calcRMSE(truth, predicted)[source]
calculate RMSE
- Parameters:
truth (1-D numpy array) – true values.
predicted (1-D numpy array) – predictions.
- Returns:
RMSE
- Return type:
float
- utils.check_decoder(cvae, decoder, data, labels)[source]
since we first create a decoder then update its weights based on the corresponding weights in CVAE, we need to double check the weights are updated correctly, and the decoded output matchs the CVAE output
- Parameters:
cvae (Keras model) – already trained CVAE model
decoder (Keras model) – a separated decoder whose weights are already updated, i.e. it should give the same decoded output with CVAE
data (2-D numpy array) – data used for checking decoder (columns are genes, rows are cells, spatial spots or pseudo-spots)
labels (1-D numpy array) – corresponding conditional variables for each row in data
- Return type:
None.
- utils.check_numba()[source]
this function checks the active threading layer and thread count in Numba
- Parameters:
None.
- Return type:
None.
- utils.numba_info()[source]
this function checks the Numba information, especially the Threading Layer Information
It turns out TBB or OMP gives similiar running speed
- Parameters:
None.
- Return type:
None.
- utils.read_scRNA_data(ref_file, ref_anno_file, filter_cell, filter_gene)[source]
read scRNA-seq data saved as a CSV file by Scanpy, also read cell-type annotation, then subset cells with cell-type annotation
- Parameters:
ref_file (string) – full path of input csv file of raw nUMI counts in scRNA-seq data (cells * genes).
ref_anno_file (string) – full path of input csv file of cell-type annotations for all cells in scRNA-seq data.
filter_cell (bool) – whether to filter cells before DE.
filter_gene (bool) – whether to filter genes before DE.
- Return type:
a AnnData object
- utils.read_spatial_data(spatial_file, filter_gene, n_hv_gene=0)[source]
read spatial data saved as a CSV file by Scanpy
- Parameters:
spatial_file (string) – full path of input csv file of raw nUMI counts in spatial transcriptomic data (spots * genes).
filter_gene (bool) – whether to filter genes before DE.
n_hv_gene (int) – number of highly variable genes to be kept in spatial data. If equals 0, all genes are kept.
- Returns:
spatial_spot_obj (a AnnData object) – AnnData object of spatial spot data.
tmp_df (dataframe) – dataframe of raw nUMI of spatial data (spots * genes).
N (series) – sequencing depth per spot.
- utils.reparameterTheta(theta, e_alpha)[source]
re-parametrization w = e^alpha * theta
- Parameters:
theta (3-D numpy array (#spot * #celltype * 1)) – theta (celltype proportion).
e_alpha (1-D numpy array) – e^alpha (spot-specific effect).
- Returns:
w – re-parametrization w = e^alpha * theta.
- Return type:
3-D numpy array (#spot * #celltype * 1)
- utils.reportRMSE(true_theta, pred_theta)[source]
calculate the RMSE of theta (celltype proportion) across all spots
we first calculate the RMSE of each spot, then calculate the MEAN of RMSE values across all spots
- Parameters:
true_theta (2-D numpy array (spots * celltypes)) – true values.
pred_theta (2-D numpy array (spots * celltypes)) – predictions.
- Returns:
RMSE across all spots
- Return type:
float
- utils.rerun_DE(scRNA_df, scRNA_celltype, n_marker_per_cmp, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, save_result=False, filter_gene=True)[source]
rerun DE on CVAE transformed scRNA-seq data
genes are only overlapped genes between spatial and scRNA-seq data
gene expression values are already normalized by sequencing depth
- Parameters:
scRNA_df (dataframe) – normalized gene expression after CVAE tranforming on scRNA-seq data (cells * genes).
scRNA_celltype (dataframe) – cell-type annotations for cells in scRNA-seq data. Only 1 column named <celltype>
n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE
use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting
p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering
fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering
pct1_cutoff (float) – threshold of pct.1 in marker genes filtering
pct2_cutoff (float) – threshold of pct.2 in marker genes filtering
sortby_fc (bool) – whether to sort marker genes by fold change
save_result (bool) – if true, save dataframe of DE result to csv file
filter_gene (bool) – whether to filter genes before DE.
- Returns:
marker_gene_list – a list of cell-type specific marker genes based on DE on CVAE tranformed gene expressions.
- Return type:
list
- utils.run_DE(sc_obj, n_marker_per_cmp, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, save_result=False, save_file_name=None)[source]
differential on cell-types in scRNA-seq data.
we compare each cell-type with another one cell-type at a time.
only choose TOP X marker genes for one comparison with one cell-type vs another one cell-type, with filtering (the FDR adjusted p value <= 0.05 + fold change >= 1.2 + pct.1 >= 0.3 + pct.2 <= 0.1, and sorting by fold change (by default). Then combine all marker genes from all comparisons.
Note: the genes in object are overlapped genes with spatial data only.
cell-type annotation is saved in object metadata <celltype>
gene expression in AnnData object has already been normalized by sequencing depth
- Parameters:
sc_obj (AnnData object) – scRNA-seq data object.
n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE
use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting
p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering
fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering
pct1_cutoff (float) – threshold of pct.1 in marker genes filtering
pct2_cutoff (float) – threshold of pct.2 in marker genes filtering
sortby_fc (bool) – whether to sort marker genes by fold change
save_result (bool) – if true, save dataframe of DE result to csv file
save_file_name (string) – file name (without path) for saving DE result
- Returns:
marker_gene_list – identified cell-type specific marker gene list
- Return type:
list
- utils.run_DE_only(ref_file, ref_anno_file, spatial_genes, n_marker_per_cmp, use_fdr, p_val_cutoff, fc_cutoff, pct1_cutoff, pct2_cutoff, sortby_fc, save_result=False, filter_cell=True, filter_gene=True)[source]
read scRNA-seq raw nUMI and cell-type annotation, then perform DE analysis.
Note: the genes in scRNA-seq data need to be subsetted to overlapped genes with spatial data only.
- Parameters:
ref_file (string) – full path of input csv file of raw nUMI counts in scRNA-seq data (cells * genes).
ref_anno_file (string) – full path of input csv file of cell-type annotations for all cells in scRNA-seq data.
spatial_genes (list) – genes included in spatial dataset.
n_marker_per_cmp (int) – number of TOP marker genes for each comparison in DE
use_fdr (bool) – whether to use FDR adjusted p value for filtering and sorting
p_val_cutoff (float) – threshold of p value (or FDR if –use_fdr is true) in marker genes filtering
fc_cutoff (float) – threshold of fold change (without log transform!) in marker genes filtering
pct1_cutoff (float) – threshold of pct.1 in marker genes filtering
pct2_cutoff (float) – threshold of pct.2 in marker genes filtering
sortby_fc (bool) – whether to sort marker genes by fold change
save_result (bool) – if true, save dataframe of DE result to csv file
filter_cell (bool) – whether to filter cells before DE
filter_gene (bool) – whether to filter genes before DE
- Returns:
scrna_obj (AnnData object) – a AnnData object for scRNA-seq data
marker_gene_profile (DataFrame) – average gene expressions of identified cell-type specific marker genes from refer scRNA-seq data
- utils.total_size(o, handlers={}, verbose=False)[source]
Returns the approximate memory footprint of an object and all of its contents.
Automatically finds the contents of several built-in containers and their subclasses, including tuple, list, deque, dict, set, and frozenset. To search other containers, add handlers that iterate over their contents.
- Parameters:
o (object) – The object whose memory footprint is to be calculated.
handlers (dict, optional) – A dictionary of handler functions keyed by container types. These handlers are used to iterate over container contents. Defaults to an empty dictionary.
verbose (bool, optional) – If True, the function prints more information about what it’s doing. Defaults to False.
- Returns:
obj_size – An approximation of the total memory footprint of the object in bytes.
- Return type:
int