分享是一种态度
作者 | 周运来
男,
一个长大了才会遇到的帅哥,
稳健,潇洒,大方,靠谱。
一段生信缘,一棵技能树,
一枚大型测序工厂的螺丝钉,
一个随机森林中提灯觅食的津门旅客。
轨迹分析
说起轨迹推断,很多人的第一印象就是monocle的轨迹图,大概率是长这样子的:
如果说单细胞转录组数据分析中的分群是寻找细胞的离散属性,那么轨迹推断就是寻找细胞分化连续性的尝试。为什么细胞的分化既有离散性又有连续性呢?这是一个历史问题,细胞的分化当然是连续的,之所以用分群的方法来解释异质性,实在是一种无奈之举。每一个细胞都是独一无二的,没有一个细胞是孤岛,这是我们的口号,但是理想与现实总是不能统一。
monocle提供了一套具有启发意义的轨迹方法,以简单粗暴的方式试图弥补这理想与现实的大峡谷。在monocle的世界里轨迹与图谱是分离的,即图谱是tsne/umap的,轨迹是另一个降维空间。那么有没有一种降维技术能够再走一步呢?今天我们介绍的scanpy的PAGA(graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells(https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1663-x))就是这方面的一个尝试:在保留细胞图谱的基础上完成细胞轨迹的推断:
它是如何实现的呢?其实是统一了聚类和轨迹推断的空间结构。
基于分区的图抽象(Partition-based graph abstraction )生成单个细胞的拓扑结构并保留映射。高维基因表达数据降维后计算邻域关系的相关距离度量来表示kNN图(pca和欧氏距离)。kNN图按所需的分辨率进行分群,其中分群表示连续细胞群。为此,我们通常使用Louvain算法,然而,分群也可以通过其他任何方式获得。PAGA图是通过将一个节点与每个亚群关联起来,并通过表示亚群之间连接性的统计度量的加权边连接每个节点得到。通过丢弃低权重的假边,PAGA图揭示了数据在选定分辨率下的去噪拓扑结构,并揭示了其连接和断开的区域。
好了,我们来看看scanpy中PAGA是如何实现的吧,好不好?
导入数据
import numpy as np import pandas as pd import scanpy as sc import seaborn as sns sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3) sc.logging.print_versions() sc.settings.set_figure_params(dpi=80)
scanpy==1.4.5.1 anndata==0.7.1 umap==0.3.10 numpy==1.16.5 scipy==1.3.1 pandas==0.25.1 scikit-learn==0.21.3 statsmodels==0.10.1 python-igraph==0.8.0
results_file = 'E:/learnscanpy/write/pbmc3k.h5ad' help(sc.read_10x_mtx) adata = sc.read_10x_mtx( 'E:/learnscanpy/data/filtered_feature_bc_matrix', # the directory with the `.mtx` file var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index) cache=True) adata.var_names_make_unique() # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
adata Out[282]: AnnData object with n_obs × n_vars = 5025 × 33694 var: 'gene_ids', 'feature_types'
这里我们使用zheng17的数据预处理方法,仅仅是因为简单,您也可以像scanpy教程:预处理与聚类一样自己一步一步地执行预处理。
sc.pp.recipe_zheng17(adata) sc.tl.pca(adata, svd_solver='arpack') sc.pp.neighbors(adata, n_neighbors=4, n_pcs=20) sc.tl.leiden(adata)
也许,你会问这到底是一种怎样的cell QC的过程,不难,来看看它的帮助文档啊,都给你写好了呢!
help(sc.pp.recipe_zheng17) Help on function recipe_zheng17 in module scanpy.preprocessing._recipes: recipe_zheng17(adata: anndata._core.anndata.AnnData, n_top_genes: int = 1000, log: bool = True, plot: bool = False, copy: bool = False) -> Union[anndata._core.anndata.AnnData, NoneType] Normalization and filtering as of [Zheng17]_. Reproduces the preprocessing of [Zheng17]_ – the Cell Ranger R Kit of 10x Genomics. Expects non-logarithmized data. If using logarithmized data, pass `log=False`. The recipe runs the following steps .. code:: python sc.pp.filter_genes(adata, min_counts=1) # only consider genes with more than 1 count sc.pp.normalize_per_cell( # normalize with total UMI count per cell adata, key_n_counts='n_counts_all' ) filter_result = sc.pp.filter_genes_dispersion( # select highly-variable genes adata.X, flavor='cell_ranger', n_top_genes=n_top_genes, log=False ) adata = adata[:, filter_result.gene_subset] # subset the genes sc.pp.normalize_per_cell(adata) # renormalize after filtering if log: sc.pp.log1p(adata) # log transform: adata.X = log(adata.X + 1) sc.pp.scale(adata) # scale to unit variance and shift to zero mean Parameters ---------- adata Annotated data matrix. n_top_genes Number of genes to keep. log Take logarithm. plot Show a plot of the gene dispersion vs. mean relation. copy Return a copy of `adata` instead of updating it. Returns ------- Returns or updates `adata` depending on `copy`.
加个plot参数试试,好使!
sc.tl.draw_graph(adata) sc.pl.draw_graph(adata, color='leiden', legend_loc='on data')
这里你不要问下没有执行降维,这个细胞图谱是在哪个空间里的呢?这就要问一下sc.tl.draw_graph了:
adata Out[292]: AnnData object with n_obs × n_vars = 5025 × 1000 obs: 'n_counts_all', 'leiden' var: 'gene_ids', 'feature_types', 'n_counts' uns: 'log1p', 'pca', 'neighbors', 'leiden', 'draw_graph', 'leiden_colors' obsm: 'X_pca', 'X_draw_graph_fr' varm: 'PCs'
help(sc.tl.draw_graph) Help on function draw_graph in module scanpy.tools._draw_graph: draw_graph(adata: anndata._core.anndata.AnnData, layout: scanpy._compat.Literal_ = 'fa', init_pos: Union[str, bool, NoneType] = None, root: Union[int, NoneType] = None, random_state: Union[int, mtrand.RandomState, NoneType] = 0, n_jobs: Union[int, NoneType] = None, adjacency: Union[scipy.sparse.base.spmatrix, NoneType] = None, key_added_ext: Union[str, NoneType] = None, copy: bool = False, **kwds) Force-directed graph drawing [Islam11]_ [Jacomy14]_ [Chippada18]_. An alternative to tSNE that often preserves the topology of the data better. This requires to run :func:`~scanpy.pp.neighbors`, first. The default layout ('fa', `ForceAtlas2`) [Jacomy14]_ uses the package |fa2|_ [Chippada18]_, which can be installed via `pip install fa2`. `Force-directed graph drawing`_ describes a class of long-established algorithms for visualizing graphs. It has been suggested for visualizing single-cell data by [Islam11]_. Many other layouts as implemented in igraph [Csardi06]_ are available. Similar approaches have been used by [Zunder15]_ or [Weinreb17]_. .. |fa2| replace:: `fa2` .. _fa2: https://github.com/bhargavchippada/forceatlas2 .. _Force-directed graph drawing: https://en.wikipedia.org/wiki/Force-directed_graph_drawing Parameters ---------- adata Annotated data matrix. layout 'fa' (`ForceAtlas2`) or any valid `igraph layout <http://igraph.org/c/doc/igraph-Layout.html>`__. Of particular interest are 'fr' (Fruchterman Reingold), 'grid_fr' (Grid Fruchterman Reingold, faster than 'fr'), 'kk' (Kamadi Kawai', slower than 'fr'), 'lgl' (Large Graph, very fast), 'drl' (Distributed Recursive Layout, pretty fast) and 'rt' (Reingold Tilford tree layout). root Root for tree layouts. random_state For layouts with random initialization like 'fr', change this to use different intial states for the optimization. If `None`, no seed is set. adjacency Sparse adjacency matrix of the graph, defaults to `adata.uns['neighbors']['connectivities']`. key_added_ext By default, append `layout`. proceed Continue computation, starting off with 'X_draw_graph_`layout`'. init_pos `'paga'`/`True`, `None`/`False`, or any valid 2d-`.obsm` key. Use precomputed coordinates for initialization. If `False`/`None` (the default), initialize randomly. copy Return a copy instead of writing to adata. **kwds Parameters of chosen igraph layout. See e.g. `fruchterman-reingold`_ [Fruchterman91]_. One of the most important ones is `maxiter`. .. _fruchterman-reingold: http://igraph.org/python/doc/igraph.Graph-class.html#layout_fruchterman_reingold Returns ------- Depending on `copy`, returns or updates `adata` with the following field. **X_draw_graph_layout** : `adata.obsm` Coordinates of graph layout. E.g. for layout='fa' (the default), the field is called 'X_draw_graph_fa'
这是tsne的一种代替方案,由fa2库完成,绘图用的是Force-directed_graph_drawing,说明文档中给出了相关的链接。
那么我们看看umap的结构是怎么样的:
sc.tl.umap(adata) sc.pl.umap(adata, color=['leiden'])
umap都做了,再看看tsne的结果吧:
sc.tl.tsne(adata) sc.pl.tsne(adata, color=['leiden'])
果然tsne要慢很多,等了好一会。。。
可选操作:图形去噪
为了去噪,用扩散映射空间来表示它(而不是PCA空间)。计算几个扩散分量内的距离相当于图像去噪——我们只取几个第一个光谱分量。这与使用PCA去噪数据矩阵非常相似。注意,对于PAGA、聚类或伪时间估计,这都不是必要的步骤。
sc.tl.diffmap(adata) computing Diffusion Maps using n_comps=15(=n_dcs) computing transitions finished (0:00:00) eigenvalues of transition matrix [1/. 1/. 1/. 0.9997476 0.9994019 0.9990253 0.9986317 0.99379796 0.9936346 0.9925993 0.99070626 0.9898019 0.98812026 0.9873447 0.9869766 ] finished: added 'X_diffmap', diffmap coordinates (adata.obsm) 'diffmap_evals', eigenvalues of transition matrix (adata.uns) (0:00:00)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X_diffmap') sc.tl.draw_graph(adata) sc.pl.draw_graph(adata, color='leiden', legend_loc='on data')
虽然还是有些凌乱,但还是清爽了一些的:
Clustering and PAGA
其实我们已经聚类过了啊,不然哪来的标签呢。
# sc.tl.louvain(adata, resolution=1.0) # sc.tl.leiden(adata) sc.tl.paga(adata, groups='leiden') running PAGA finished: added 'paga/connectivities', connectivities adjacency (adata.uns) 'paga/connectivities_tree', connectivities subtree (adata.uns) (0:00:00) adata.var_names Out[304]: Index(['AP006222.2', 'MEGF6', 'RP11-46F15.2', 'GPR153', 'RP5-1113E3.3', 'RP4-734G22.3', 'MASP2', 'RP3-467K16.2', 'RP4-680D5.2', 'AKR7A3', ... 'MT-CO1', 'MT-CO2', 'MT-ATP6', 'MT-CO3', 'MT-ND3', 'MT-ND4', 'MT-ND5', 'MT-CYB', 'AC145212.2', 'AC011043.1'], dtype='object', length=1000) adata Out[308]: AnnData object with n_obs × n_vars = 5025 × 1000 obs: 'n_counts_all', 'leiden' var: 'gene_ids', 'feature_types', 'n_counts' uns: 'log1p', 'pca', 'neighbors', 'leiden', 'draw_graph', 'leiden_colors', 'umap', 'diffmap_evals', 'paga', 'leiden_sizes' obsm: 'X_pca', 'X_draw_graph_fr', 'X_umap', 'X_tsne', 'X_diffmap' varm: 'PCs' sc.pl.paga(adata, color=['leiden', 'MEGF6', 'MT-CO2', 'AKR7A3']) --> added 'pos', the PAGA positions (adata.uns['paga'])
我们看到一个框架图,其实是一个无向的网络图,点的颜色代表不同分群,点的大小代表该群细胞数的大小,连线只有一种颜色,粗细代表两个群之间连接更密切。这里并没有一个拟时的概念,轨迹是一种相互关系,可以是时间的,也可以不是,而往往以人类的直觉,时间都是单向的,但是每个细胞都有自己的分化方向。一个简单点的框架图,让我们重新思考细胞分化这一基本事实!但是下面我们会在这个基础上推断出一个拟时结构。
adata.obs['leiden'].cat.categories Out[306]: Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35'], dtype='object')
遇到核心函数,就算没有什么疑问也是要看说明文档的啊:
help(sc.tl.paga) Help on function paga in module scanpy.tools._paga: paga(adata: anndata._core.anndata.AnnData, groups: Union[str, NoneType] = None, use_rna_velocity: bool = False, model: scanpy._compat.Literal_ = 'v1.2', copy: bool = False) Mapping out the coarse-grained connectivity structures of complex manifolds [Wolf19]_. By quantifying the connectivity of partitions (groups, clusters) of the single-cell graph, partition-based graph abstraction (PAGA) generates a much simpler abstracted graph (*PAGA graph*) of partitions, in which edge weights represent confidence in the presence of connections. By tresholding this confidence in :func:`~scanpy.pl.paga`, a much simpler representation of the manifold data is obtained, which is nonetheless faithful to the topology of the manifold. The confidence should be interpreted as the ratio of the actual versus the expected value of connetions under the null model of randomly connecting partitions. We do not provide a p-value as this null model does not precisely capture what one would consider "connected" in real data, hence it strongly overestimates the expected value. See an extensive discussion of this in [Wolf19]_. .. note:: Note that you can use the result of :func:`~scanpy.pl.paga` in :func:`~scanpy.tl.umap` and :func:`~scanpy.tl.draw_graph` via `init_pos='paga'` to get single-cell embeddings that are typically more faithful to the global topology. Parameters ---------- adata An annotated data matrix. groups Key for categorical in `adata.obs`. You can pass your predefined groups by choosing any categorical annotation of observations. Default: The first present key of `'leiden'` or `'louvain'`. use_rna_velocity Use RNA velocity to orient edges in the abstracted graph and estimate transitions. Requires that `adata.uns` contains a directed single-cell graph with key `['velocity_graph']`. This feature might be subject to change in the future. model The PAGA connectivity model. copy Copy `adata` before computation and return a copy. Otherwise, perform computation inplace and return `None`. Returns ------- **connectivities** : :class:`numpy.ndarray` (adata.uns['connectivities']) The full adjacency matrix of the abstracted graph, weights correspond to confidence in the connectivities of partitions. **connectivities_tree** : :class:`scipy.sparse.csr_matrix` (adata.uns['connectivities_tree']) The adjacency matrix of the tree-like subgraph that best explains the topology. Notes ----- Together with a random walk-based distance measure (e.g. :func:`scanpy.tl.dpt`) this generates a partial coordinatization of data useful for exploring and explaining its variation. .. currentmodule:: scanpy See Also -------- pl.paga pl.paga_path pl.paga_compare
我们看到use_rna_velocity参数,scanpy也可以用rna速率数据啊。
假装我们已经做好了细胞定义:
adata.obs['leiden_anno'] = adata.obs['leiden'] adata.obs['leiden_anno'].cat.categories = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10/Ery', '11', '12', '13', '14', '15', '16/Stem', '17', '18', '19/Neu', '20/Mk', '21', '22/Baso', '23', '24/Mo','25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35'] sc.tl.paga(adata, groups='leiden_anno') sc.pl.paga(adata, threshold=0.03, show=False)
用PAGA的结果重新计算嵌入,也就是使得细胞的结构与PAGA的结构一致:
sc.tl.draw_graph(adata, init_pos='paga') sc.pl.draw_graph(adata, color=['leiden_anno','MEGF6', 'MT-CO2', 'AKR7A3'], legend_loc='on data') drawing single-cell graph using layout 'fa' WARNING: Package 'fa2' is not installed, falling back to layout 'fr'.To use the faster and better ForceAtlas2 layout, install package 'fa2' (`pip install fa2`). finished: added 'X_draw_graph_fr', graph_drawing coordinates (adata.obsm) (0:00:18)
当然这个图并不是我们期望,看上去依然略显混乱,根据waring的提示也许我应该pip install fa2
,于是我们就pip install fa2
,再来看一下结果:
forceatlas2(https://github.com/bhargavchippada/forceatlas2)将Gephi的Force Atlas 2布局算法移植到python2和python3(带有NetworkX和igraph的包装)。这是最快的python实现,大部分功能已经完成。它还支持Barnes Hut的最大加速近似值。ForceAtlas2是一种非常快速的面向力的图形布局算法。它用于在2D中对一个加权的无向图进行空间化(边缘权值定义了连接的强度)。实现基于本文和相应的gephi-java代码。与networkx的fruchterman reingold算法(spring layout)相比,它确实非常快,并且可以很好地扩展到大量节点(>10000)。
回到我们的help(sc.tl.draw_graph),看到这layout的选项支持不同的igraph layout主题:
layout 'fa' (`ForceAtlas2`) or any valid `igraph layout <http://igraph.org/c/doc/igraph-Layout.html>`__. Of particular interest are 'fr' (Fruchterman Reingold), 'grid_fr' (Grid Fruchterman Reingold, faster than 'fr'), 'kk' (Kamadi Kawai', slower than 'fr'), 'lgl' (Large Graph, very fast), 'drl' (Distributed Recursive Layout, pretty fast) and 'rt' (Reingold Tilford tree layout).
sc.tl.draw_graph(adata, init_pos='paga',layout= 'lgl') sc.pl.draw_graph(adata, color=['leiden_anno','MEGF6', 'MT-CO2', 'AKR7A3'], legend_loc='on data')
sc.tl.draw_graph(adata, init_pos='paga',layout= 'kk') sc.pl.draw_graph(adata, color=['leiden_anno','MEGF6', 'MT-CO2', 'AKR7A3'], legend_loc='on data')
我们还是用fa吧。我们不禁会想,monocle的layout是什么呢,为什么看起来那么直白?
pl.figure(figsize=(8, 2)) for i in range(28): pl.scatter(i, 1, c=sc.pl.palettes.zeileis_28[i], s=200) pl.show()
zeileis_colors = np.array(sc.pl.palettes.zeileis_28) new_colors = np.array(adata.uns['leiden_anno_colors']) new_colors[[16]] = zeileis_colors[[12]] # Stem colors / green new_colors[[10, 17, 5, 3, 15, 6, 18, 13, 7, 12]] = zeileis_colors[[5, 5, 5, 5, 11, 11, 10, 9, 21, 21]] # Ery colors / red new_colors[[20, 8]] = zeileis_colors[[17, 16]] # Mk early Ery colors / yellow new_colors[[4, 0]] = zeileis_colors[[2, 8]] # lymph progenitors / grey new_colors[[22]] = zeileis_colors[[18]] # Baso / turquoise new_colors[[19, 14, 2]] = zeileis_colors[[6, 6, 6]] # Neu / light blue new_colors[[24, 9, 1, 11]] = zeileis_colors[[0, 0, 0, 0]] # Mo / dark blue new_colors[[21, 23]] = zeileis_colors[[25, 25]] # outliers / grey adata.uns['leiden_anno_colors'] = new_colors sc.pl.paga_compare( adata, threshold=0.03, title='', right_margin=0.2, size=10, edge_width_scale=0.5, legend_fontsize=12, fontsize=12, frameon=False, edges=True, save=True)
接下来我们在这个细胞图谱上绘制拟时信息,虽说是拟时推断,却需要指定一个亚群作为起点。但是真正的样本中有时并不存在一个明确的起点,有时是同时发生发育的。尽管每个生命从长远来看所有的细胞都来自一个细胞。这个起点如何确定,就像牛顿的宇宙第一推动者的问题一样,必须跳出数据本身来思考它。这里我们为了完成任务还是选择一个发育的起点吧。
adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden_anno'] == '16/Stem')[0] sc.tl.dpt(adata) adata Out[332]: AnnData object with n_obs × n_vars = 5025 × 1000 obs: 'n_counts_all', 'leiden', 'leiden_anno', 'dpt_pseudotime' var: 'gene_ids', 'feature_types', 'n_counts' uns: 'log1p', 'pca', 'neighbors', 'leiden', 'draw_graph', 'leiden_colors', 'umap', 'diffmap_evals', 'paga', 'leiden_sizes', 'leiden_anno_sizes', 'leiden_anno_colors', 'iroot' obsm: 'X_pca', 'X_draw_graph_fa', 'X_umap', 'X_tsne', 'X_diffmap', 'X_draw_graph_lgl', 'X_draw_graph_kk' varm: 'PCs' adata.obs['dpt_pseudotime'] Out[333]: AAACCCAAGCGTATGG-1 0.661545 AAACCCAGTCCTACAA-1 0.670825 AAACCCATCACCTCAC-1 0.000000 AAACGCTAGGGCATGT-1 0.839952 AAACGCTGTAGGTACG-1 0.816287 TTTGTTGCAGGTACGA-1 0.826858 TTTGTTGCAGTCTCTC-1 0.792751 TTTGTTGGTAATTAGG-1 0.329366 TTTGTTGTCCTTGGAA-1 0.905214 TTTGTTGTCGCACGAC-1 0.771131 Name: dpt_pseudotime, Length: 5025, dtype: float32
我们不禁思考,既然每个生命从长远来看所有的细胞都来自一个细胞,是不是在一套数据集中可以设置一个遥远的点(实际本不存在的)作为发育的起点呢?这样是不是更能反映轨迹推断的实际呢?据我所知好像还没有这样的算法出现。
sc.tl.draw_graph(adata, init_pos='paga') sc.pl.draw_graph(adata, color=['leiden_anno', 'dpt_pseudotime'], legend_loc='on data')
可以自定义发育路径
paths = [('erythrocytes', [16, 12, 7, 13, 18, 6, 5, 10]), ('neutrophils', [16, 0, 4, 2, 14, 19]), ('monocytes', [16, 0, 4, 11, 1, 9, 24])] adata.obs['distance'] = adata.obs['dpt_pseudotime'] adata.obs['clusters'] = adata.obs['leiden_anno'] adata.uns['clusters_colors'] = adata.uns['leiden_anno_colors'] gene_names = adata.var_names[1:10] adata Out[338]: AnnData object with n_obs × n_vars = 5025 × 1000 obs: 'n_counts_all', 'leiden', 'leiden_anno', 'dpt_pseudotime', 'distance', 'clusters' var: 'gene_ids', 'feature_types', 'n_counts' uns: 'log1p', 'pca', 'neighbors', 'leiden', 'draw_graph', 'leiden_colors', 'umap', 'diffmap_evals', 'paga', 'leiden_sizes', 'leiden_anno_sizes', 'leiden_anno_colors', 'iroot', 'clusters_colors' obsm: 'X_pca', 'X_draw_graph_fa', 'X_umap', 'X_tsne', 'X_diffmap', 'X_draw_graph_lgl', 'X_draw_graph_kk' varm: 'PCs'
_, axs = pl.subplots(ncols=3, figsize=(6, 2.5), gridspec_kw={'wspace': 0.05, 'left': 0.12}) pl.subplots_adjust(left=0.05, right=0.98, top=0.82, bottom=0.2) for ipath, (descr, path) in enumerate(paths): _, data = sc.pl.paga_path( adata, path, gene_names, show_node_names=False, ax=axs[ipath], ytick_fontsize=12, left_margin=0.15, n_avg=50, annotations=['distance'], show_yticks=True if ipath==0 else False, show_colorbar=False, color_map='Greys', groups_key='clusters', color_maps_annotations={'distance': 'viridis'}, title='{} path'.format(descr), return_data=True, show=False) data.to_csv('./write/paga_path_{}.csv'.format(descr)) pl.savefig('./figures/paga_path_paul15.pdf') pl.show()
仔细看看这个图有没有想起来monocle的,基因在不同命运中的变化情况:
paga(https://scanpy-tutorials.readthedocs.io/en/latest/paga-paul15.html)
原创文章,作者:kepupublish,如若转载,请注明出处:https://blog.ytso.com/211993.html