{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 5: Batch-learning on large-scale dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we use the scATAC-seq dataset `10XBlood` as an example to illustrate how to train scAGDE on large-scale scATAC-seq data with a batch-learning strategy in an end-to-end style." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Read and preprocess data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first read the `.h5ad` data file using the [Scanpy](https://github.com/scverse/scanpy) package." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import scanpy as sc\n", "adata = sc.read_h5ad(\"data/10XBlood.h5ad\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use Scanpy to further filter the data. In our case, we skip this step because the loaded dataset has already been preprocessed. The filtering code is reproduced below for reference:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# sc.pp.filter_cells(adata, min_genes=100)\n", "# min_cells = int(adata.shape[0] * 0.01)\n", "# sc.pp.filter_genes(adata, min_cells=min_cells)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 17736 × 39565\n", " obs: 'orig.ident', 'nCount_scATACseq', 'nFeature_scATACseq', 'celltype', 'n_genes'\n", " var: 'features', 'n_cells'\n", " uns: 'neighbors', 'pca', 'tsne', 'umap'\n", " obsm: 'X_pca', 'X_tsne', 'X_umap'\n", " varm: 'PCs'\n", " obsp: 'connectivities', 'distances'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Set up and train the scAGDE model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can initialize the trainer with the AnnData object, which ensures that the model settings are in place for training.\n", "\n", "`outdir` specifies the directory where the output files (mainly the model weights file) are saved.\n", "\n", "`n_centroids` is the number of clusters in the dataset. If this is unknown, we can set `n_centroids=None`, in which case scAGDE applies its estimation strategy to estimate the optimal cluster number for initializing its cluster layer. Here, we set `n_centroids=9`.\n", "\n", "We can train scAGDE on a specific device by setting `gpu`. For example, `gpu=None` trains scAGDE on the CPU and `gpu=\"0\"` trains it on GPU #0.\n", "\n", "To stop once the model converges, we set `early_stopping=True`; `patience=50` is the number of epochs to wait for improvement before stopping early." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "device used: cuda:1\n", "\n" ] } ], "source": [ "import scAGDE\n", "trainer = scAGDE.Trainer_scale(adata,outdir=\"output\",n_centroids=9,gpu=\"1\",early_stopping=True,patience=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can train the scAGDE model in an end-to-end style. The pipeline behind `fit()` consists of three main stages:\n", "\n", "1. scAGDE first trains a chromatin accessibility-based autoencoder to measure the importance of the peaks and select the key peaks. The number of selected peaks is 10,000 by default and can be changed by setting `top_n`. Meanwhile, the initial cell representations used for cell graph construction are stored in `adata.obs[embed_init_key]`, where `embed_init_key` is `\"latent_init\"` by default.\n", "\n", "2. 
scAGDE then constructs the cell graph and trains the GCN-based embedded model to extract essential structural information from both the count data and the cell graph.\n", "\n", "3. scAGDE finally yields robust and discriminative cell embeddings, stored in `adata.obsm[embed_key]`, where `embed_key` is `\"latent\"` by default. scAGDE also performs imputation if `impute_key` is not None; the imputed data are stored in `adata.obsm[impute_key]`, where `impute_key` is `\"impute\"` by default.\n", "\n", "scAGDE performs clustering on the final embeddings if `cluster_key` is not None, and the cluster assignments are stored in `adata.obs[cluster_key]`, where `cluster_key` is `\"cluster\"` by default. The number of clusters is the value of `n_centroids` or, if estimation is used, the estimated cluster number.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cell number: 17736\n", "Peak number: 39565\n", "n_centroids: 9\n", "\n", "\n", "## Training CountModel ##\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "CountModel: 0%| | 0/200 [06:53" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sc.pp.neighbors(adata, use_rep=\"latent\")\n", "sc.tl.umap(adata, min_dist=0.2)\n", "sc.pl.umap(adata,color=[\"celltype\",\"cluster\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can evaluate the clustering performance with several metrics:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "## Clustering Evaluation Report ##\n", "# Confusion matrix: #\n", "[[2913 0 3 0 1 0 3 1 1]\n", " [ 12 642 242 32 2 3 377 6 74]\n", " [ 1 1060 1029 24 3 3 302 1 136]\n", " [ 1 46 0 39 1 1 0 971 0]\n", " [ 113 1 3 46 1648 8 3 3 0]\n", " [ 0 18 2 50 0 1014 0 0 0]\n", " [ 1 590 3 7 3 0 1855 0 2]\n", " [ 5 0 3 376 1 0 1 1512 0]\n", " [ 0 137 8 6 0 0 6 0 2381]]\n", "# Metric 
values: #\n", "Adjusted Rand Index: 0.6619\n", "Normalized Mutual Info: 0.7354\n", "F1 score: 0.7348\n" ] } ], "source": [ "y = adata.obs[\"celltype\"].astype(\"category\").cat.codes.values\n", "res = scAGDE.utils.cluster_report(y, adata.obs[\"cluster\"].astype(int))" ] } ], "metadata": { "kernelspec": { "display_name": "torch", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.19" } }, "nbformat": 4, "nbformat_minor": 2 }