Tutorial 5: Batch-learning on large-scale dataset

Here we will use scATAC-seq dataset `10XBlood’ as an example to illustrate how to train large-scale scATAC-seq data with batch-learning strategy in an end-to-end style.

1. Read and preprocess data

We first read ‘.h5ad’ data file using Scanpy package

[2]:
import scanpy as sc
adata = sc.read_h5ad("data/10XBlood.h5ad")

We can use Scanpy to further filter data. In our case, we pass this step because the loaded dataset has been preprocessed. Some codes for filtering are copied below for easy reference:

[3]:
# sc.pp.filter_cells(adata, min_genes=100)
# min_cells = int(adata.shape[0] * 0.01)
# sc.pp.filter_genes(adata, min_cells=min_cells)
[4]:
adata
[4]:
AnnData object with n_obs × n_vars = 17736 × 39565
    obs: 'orig.ident', 'nCount_scATACseq', 'nFeature_scATACseq', 'celltype', 'n_genes'
    var: 'features', 'n_cells'
    uns: 'neighbors', 'pca', 'tsne', 'umap'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

2. Setup and train scAGDE model

Now we can initialize the trainer with the AnnData object, which will ensure settings for model are in place for training.

We can specify the outdir to the dir path where we want to save the output file (mainly the model weights file).

n_centroids represents the cluster number of dataset. If this information is unknown, we can set n_centroids=None and in this case, scAGDE will apply the estimation strategy to estimate the optimal cluster number for the initialization of its cluster layer. Here, we set n_centroids=9.

We can train scAGDE on specified device by setting gpu. For example, train scAGDE on CPUs by gpu=None and trian it on GPU #0 by gpu="0"

To stop early once the model converges, we set early_stopping=True, and patience=50 representing epochs to wait for improvement before early stopping.

[5]:
import scAGDE
trainer = scAGDE.Trainer_scale(adata,outdir="output",n_centroids=9,gpu="1",early_stopping=True,patience=50)
device used: cuda:1

Now we can train scAGDE model in end-to-end style. The whole pipeline behind the function of fit() mainly consists of three stages, as below:

  1. scAGDE first trained an chromatin accessibility-based autoencoder to measure the importance of the peaks and select the key peaks. The number of selected peaks is set to 10,000 in default, or you can change it by setting top_n. In the meanwhile, the initial cell representations for cell graph construction are stored in adata.obs[embed_init_key], which is "latent_init" in default.

  2. scAGDE then constructed cell graph and trains the GCN-based embedded model to extract essential structural information from both count and cell graph data.

  3. scAGDE finally yiels robust and discriminative cell embeddings which are stored in adata.obsm[embed_key], which is "latent" in default. Also, scAGDE enables imputation task if impute_key is not None and the imputed data will be stored in adata.obsm[impute_key], which is "impute" in default.

scAGDE performs clustering on final embeddings if cluster_key is not None, and the cluster assignments will be in adata.obs[cluster_key], which is "cluster" in default. The cluster number is the value of n_centroids and if estimation is used, the cluster number is the value of estimated cluster number.

[6]:
embed_key = "latent"
adata = trainer.fit(topn=10000,embed_key=embed_key)
print(adata)
Cell number: 17736
Peak number: 39565
n_centroids: 9


## Training CountModel ##
CountModel:   0%|          | 0/200 [06:53<?, ?it/s, loss=8358.6855]

## Constructing Cell Graph ##
Cell number: 17736
Peak number: 10000
n_centroids: 9


## Training GraphModel ##
Pretrain Epochs: 100%|██████████| 200/200 [03:11<00:00,  1.04it/s]
Epochs: 100%|██████████| 300/300 [04:48<00:00,  1.04it/s]
R[write to console]:                    __           __
   ____ ___  _____/ /_  _______/ /_
  / __ `__ \/ ___/ / / / / ___/ __/
 / / / / / / /__/ / /_/ (__  ) /_
/_/ /_/ /_/\___/_/\__,_/____/\__/   version 6.0.0
Type 'citation("mclust")' for citing this R package in publications.

AnnData object with n_obs × n_vars = 17736 × 39565
    obs: 'orig.ident', 'nCount_scATACseq', 'nFeature_scATACseq', 'celltype', 'n_genes', 'cluster'
    var: 'features', 'n_cells', 'is_selected'
    uns: 'neighbors', 'pca', 'tsne', 'umap'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'latent_init', 'impute', 'latent'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

3. Visualizing and evaluation

We can now use Scanpy to visualize our latent space.

[7]:
sc.pp.neighbors(adata, use_rep="latent")
sc.tl.umap(adata, min_dist=0.2)
sc.pl.umap(adata,color=["celltype","cluster"])
/home/haogaoyang/anaconda3/envs/torch/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/haogaoyang/anaconda3/envs/torch/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../_images/notebooks_tutorial5_15_1.png

We can evaluate the clustering performance with multiple metrics as below:

[8]:
y = adata.obs["celltype"].astype("category").cat.codes.values
res = scAGDE.utils.cluster_report(y, adata.obs["cluster"].astype(int))

## Clustering Evaluation Report ##
# Confusion matrix: #
[[2913    0    3    0    1    0    3    1    1]
 [  12  642  242   32    2    3  377    6   74]
 [   1 1060 1029   24    3    3  302    1  136]
 [   1   46    0   39    1    1    0  971    0]
 [ 113    1    3   46 1648    8    3    3    0]
 [   0   18    2   50    0 1014    0    0    0]
 [   1  590    3    7    3    0 1855    0    2]
 [   5    0    3  376    1    0    1 1512    0]
 [   0  137    8    6    0    0    6    0 2381]]
# Metric values: #
Adjusted Rand Index: 0.6619
Normalized Mutual Info: 0.7354
F1 score: 0.7348