scanpy教程：预处理与聚类

分享是一种态度

作者 | 周运来

男，

一个长大了才会遇到的帅哥，

稳健，潇洒，大方，靠谱。

一段生信缘，一棵技能树，

一枚大型测序工厂的螺丝钉，

一个随机森林中提灯觅食的津门旅客。

scanpy 是一个用于分析单细胞转录组（single cell rna sequencing)数据的python库，文章2018发表在Genome Biology(https://genomebiology.biomedcentral.com/)。其实它的许多分析思路借鉴了以seurat为中心的R语言单细胞转录数据分析生态的，scanpy以一己之力在python生态构建了单细胞转录组数据分析框架。我相信借助python的工业应用实力，其扩展性大于R语言分析工具。当然，选择走一遍scanpy的原因，不是因为它的强大，只是因为喜欢。

在我搬这块砖之前，中文世界关于scanpy的介绍已经很多了，这当然是好事。不知道谁会以怎样的方式遇见谁，所以，还是让我们开始吧。

所做的第一步就是配置好python环境，我建议是用conda来构建，这样软件管理起来很方便。然后是安装scanpy这个库，当然可能会遇到一些问题，但是花点时间总是可以Google掉的。在Windows、mac、linux平台scanpy都是可以运行的。

在学习新的库时，文档是不可不看的。有统计表明，程序员读代码的时间一般三倍于写代码的时间。所以这基本上是一次阅读体验。我们假装可爱的读者朋友，已经配置好scanpy的环境，也许花了两三天的时间。会不会python没关系，英语不好没关系，安装个某道词典就可以了。

三天后。。。

我们打开scanpy的教程官网：https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html ，对，就是这么直白。开始在我们的编辑器里复制粘贴教程代码，并尝试运行，并理解结果。

首先我们导入需要的库，读入10X标准的数据filtered_feature_bc_matrix，当然也可以是h5格式的文件，或者一个二维的表达谱。

读入数据

# -*- coding: utf-8 -*-
"""
Spyder Editor
https://scanpy.readthedocs.io/en/stable/
This is a temporary script file.
"""
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns 

sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80)
sc.logging.print_versions()

scanpy==1.4.5.1 
anndata==0.7.1 
umap==0.3.10 
numpy==1.16.5 
scipy==1.3.1 
pandas==0.25.1 
scikit-learn==0.21.3 
statsmodels==0.10.1 
python-igraph==0.8.0

不用担心这些库，我们都会慢慢熟悉的，这里python-igraph要注意一下，不是igraph。这些库在《利用python进行数据分析》这本书里面都有介绍，特别是pandas和numpy，以后的日子里我们会看到它们几乎是构成scanpy数据结构的核心。

results_file = 'E:/learnscanpy/write/pbmc3k.h5ad' 
help(sc.read_10x_mtx)
adata = sc.read_10x_mtx(
    'E:/learnscanpy/data/filtered_feature_bc_matrix',  # the directory with the `.mtx` file
    var_names='gene_symbols',                  # use gene symbols for the variable names (variables-axis index)
    cache=True) 
adata.var_names_make_unique()  # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`

这样就把数据读进来了，他的名字叫做adata，那么这是个什么东西呢？我认为搞清楚这个基本上学会一半scanpy了。

adata
Out[21]: 
AnnData object with n_obs × n_vars = 5025 × 33694 
    var: 'gene_ids', 'feature_types'

adata.var["gene_ids"]
Out[22]: 
RP11-34P13.3    ENSG00000243485
FAM138A         ENSG00000237613
OR4F5           ENSG00000186092
RP11-34P13.7    ENSG00000238009
RP11-34P13.8    ENSG00000239945

AC233755.2      ENSG00000277856
AC233755.1      ENSG00000275063
AC240274.1      ENSG00000271254
AC213203.1      ENSG00000277475
FAM231B         ENSG00000268674
Name: gene_ids, Length: 33694, dtype: object

adata.var["feature_types"]
Out[23]: 
RP11-34P13.3    Gene Expression
FAM138A         Gene Expression
OR4F5           Gene Expression
RP11-34P13.7    Gene Expression
RP11-34P13.8    Gene Expression

AC233755.2      Gene Expression
AC233755.1      Gene Expression
AC240274.1      Gene Expression
AC213203.1      Gene Expression
FAM231B         Gene Expression
Name: feature_types, Length: 33694, dtype: object

adata.to_df()
Out[24]: 
                    RP11-34P13.3  FAM138A  ...  AC213203.1  FAM231B
AAACCCAAGCGTATGG-1           0.0      0.0  ...         0.0      0.0
AAACCCAGTCCTACAA-1           0.0      0.0  ...         0.0      0.0
AAACCCATCACCTCAC-1           0.0      0.0  ...         0.0      0.0
AAACGCTAGGGCATGT-1           0.0      0.0  ...         0.0      0.0
AAACGCTGTAGGTACG-1           0.0      0.0  ...         0.0      0.0
                         ...      ...  ...         ...      ...
TTTGTTGCAGGTACGA-1           0.0      0.0  ...         0.0      0.0
TTTGTTGCAGTCTCTC-1           0.0      0.0  ...         0.0      0.0
TTTGTTGGTAATTAGG-1           0.0      0.0  ...         0.0      0.0
TTTGTTGTCCTTGGAA-1           0.0      0.0  ...         0.0      0.0
TTTGTTGTCGCACGAC-1           0.0      0.0  ...         0.0      0.0

[5025 rows x 33694 columns]

第一句AnnData object with n_obs × n_vars = 5025 × 33694 ，告诉我们adata是一个AnnData 对象，就像seurat也是一个对象一样。什么叫对象呢？对象就是一个实体、物体，它是一种存在而不是一种动作。当然，我们可以对它做一些操作，一个对象可以通过具体的属性为人们感知。

具体地，anndata(https://anndata.readthedocs.io/en/stable/)是这样一种构象：

单细胞转录组的核心就是一个cell X gene的二维表，但是分群后要存储cell的分群结果，特征选择是要记录gene的信息，降维后要存储降维后的结果。所以，这张表.X的对象cell相关的信息记录在.obs中，属性gene的信息记录在.var中，其他的信息在.uns中。那么每一部分是什么呢？

type(adata.var["gene_ids"])
Out[205]: pandas.core.series.Series

哦，原来是pandas的Series，下面是这个数据结构的详细介绍，这个数据结构能存储的信息一点也不亚于seurat啊。

Type:        AnnData
String form:
AnnData object with n_obs × n_vars = 5025 × 33694 
    var: 'gene_ids', 'feature_types'
Length:      5025
File:        d:/program files (x86)/anconda/lib/site-packages/anndata/_core/anndata.py
Docstring:  
An annotated data matrix.

:class:`~anndata.AnnData` stores a data matrix :attr:`X` together with annotations
of observations :attr:`obs` (:attr:`obsm`, :attr:`obsp`),
variables :attr:`var` (:attr:`varm`, :attr:`varp`),
and unstructured annotations :attr:`uns`.

.. figure:: https://falexwolf.de/img/scanpy/anndata.svg
   :width: 350px

An :class:`~anndata.AnnData` object `adata` can be sliced like a
:class:`~pandas.DataFrame`,
for instance `adata_subset = adata[:, list_of_variable_names]`.
:class:`~anndata.AnnData`’s basic structure is similar to R’s ExpressionSet
[Huber15]_. If setting an `.h5ad`-formatted HDF5 backing file `.filename`,
data remains on the disk but is automatically loaded into memory if needed.
See this `blog post`_ for more details.

.. _blog post: http://falexwolf.de/blog/171223_AnnData_indexing_views_HDF5-backing/

Parameters
----------
X
    A #observations × #variables data matrix. A view of the data is used if the
    data type matches, otherwise, a copy is made.
obs
    Key-indexed one-dimensional observations annotation of length #observations.
var
    Key-indexed one-dimensional variables annotation of length #variables.
uns
    Key-indexed unstructured annotation.
obsm
    Key-indexed multi-dimensional observations annotation of length #observations.
    If passing a :class:`~numpy.ndarray`, it needs to have a structured datatype.
varm
    Key-indexed multi-dimensional variables annotation of length #variables.
    If passing a :class:`~numpy.ndarray`, it needs to have a structured datatype.
layers
    Key-indexed multi-dimensional arrays aligned to dimensions of `X`.
dtype
    Data type used for storage.
shape
    Shape tuple (#observations, #variables). Can only be provided if `X` is `None`.
filename
    Name of backing file. See :class:`File`.
filemode
    Open mode of backing file. See :class:`File`.

See Also
--------
read_h5ad
read_csv
read_excel
read_hdf
read_loom
read_zarr
read_mtx
read_text
read_umi_tools

Notes
-----
:class:`~anndata.AnnData` stores observations (samples) of variables/features
in the rows of a matrix.
This is the convention of the modern classics of statistics [Hastie09]_
and machine learning [Murphy12]_,
the convention of dataframes both in R and Python and the established statistics
and machine learning packages in Python (statsmodels_, scikit-learn_).

Single dimensional annotations of the observation and variables are stored
in the :attr:`obs` and :attr:`var` attributes as :class:`~pandas.DataFrame`/ s.
This is intended for metrics calculated over their axes.
Multi-dimensional annotations are stored in :attr:`obsm` and :attr:`varm`,
which are aligned to the objects observation and variable dimensions respectively.
Square matrices representing graphs are stored in :attr:`obsp` and :attr:`varp`,
with both of their own dimensions aligned to their associated axis.
Additional measurements across both observations and variables are stored in
:attr:`layers`.

Indexing into an AnnData object can be performed by relative position
with numeric indices (like pandas’ :attr:`~pandas.DataFrame.iloc`),
or by labels (like :attr:`~pandas.DataFrame.loc`).
To avoid ambiguity with numeric indexing into observations or variables,
indexes of the AnnData object are converted to strings by the constructor.

Subsetting an AnnData object by indexing into it will also subset its elements
according to the dimensions they were aligned to.
This means an operation like `adata[list_of_obs, :]` will also subset :attr:`obs`,
:attr:`obsm`, and :attr:`layers`.

Subsetting an AnnData object returns a view into the original object,
meaning very little additional memory is used upon subsetting.
This is achieved lazily, meaning that the constituent arrays are subset on access.
Copying a view causes an equivalent “real” AnnData object to be generated.
Attempting to modify a view (at any attribute except X) is handled
in a copy-on-modify manner, meaning the object is initialized in place.
Here’s an example::

    batch1 = adata[adata.obs["batch"] == "batch1", :]
    batch1.obs["value"] = 0  # This makes batch1 a “real” AnnData object

At the end of this snippet: `adata` was not modified,
and `batch1` is its own AnnData object with its own data.

Similar to Bioconductor’s `ExpressionSet` and :mod:`scipy.sparse` matrices,
subsetting an AnnData object retains the dimensionality of its constituent arrays.
Therefore, unlike with the classes exposed by :mod:`pandas`, :mod:`numpy`,
and `xarray`, there is no concept of a one dimensional AnnData object.
AnnDatas always have two inherent dimensions, :attr:`obs` and :attr:`var`.
Additionally, maintaining the dimensionality of the AnnData object allows for
consistent handling of :mod:`scipy.sparse` matrices and :mod:`numpy` arrays.

.. _statsmodels: http://www.statsmodels.org/stable/index.html
.. _scikit-learn: http://scikit-learn.org/

绘制高表达的基因：

sc.pl.highest_expr_genes(adata, n_top=20)

本着本质的好奇心，我们不禁要问一句：这张图是用什么画的？一定要养成阅读文档的习惯：

help(sc.pl.highest_expr_genes)
Help on function highest_expr_genes in module scanpy.plotting._qc:

highest_expr_genes(adata: anndata._core.anndata.AnnData, n_top: int = 30, show: Union[bool, NoneType] = None, save: Union[str, bool, NoneType] = None, ax: Union[matplotlib.axes._axes.Axes, NoneType] = None, gene_symbols: Union[str, NoneType] = None, log: bool = False, **kwds)
   Fraction of counts assigned to each gene over all cells.

   Computes, for each gene, the fraction of counts assigned to that gene within
   a cell. The `n_top` genes with the highest mean fraction over all cells are
   plotted as boxplots.

   This plot is similar to the `scater` package function `plotHighestExprs(type
   = "highest-expression")`, see `here
   <https://bioconductor.org/packages/devel/bioc/vignettes/scater/inst/doc/vignette-qc.html>`__. Quoting
   from there:

       *We expect to see the “usual suspects”, i.e., mitochondrial genes, actin,
       ribosomal protein, MALAT1. A few spike-in transcripts may also be
       present here, though if all of the spike-ins are in the top 50, it
       suggests that too much spike-in RNA was added. A large number of
       pseudo-genes or predicted genes may indicate problems with alignment.*
       -- Davis McCarthy and Aaron Lun

   Parameters
   ----------
   adata
       Annotated data matrix.
   n_top
       Number of top
   show
        Show the plot, do not return axis.
   save
       If `True` or a `str`, save the figure.
       A string is appended to the default filename.
       Infer the filetype if ending on {`'.pdf'`, `'.png'`, `'.svg'`}.
   ax
       A matplotlib axes object. Only works if plotting a single component.
   gene_symbols
       Key for field in .var that stores gene symbols if you do not want to use .var_names.
   log
       Plot x-axis in log scale
   **kwds
       Are passed to :func:`~seaborn.boxplot`.

   Returns
   -------
   If `show==False` a :class:`~matplotlib.axes.Axes`.

可以看出这个功能借鉴了知名R包scater的一个函数：plotHighestExprs，绘图用的是python里面两个响当当的seaborn和matplotlib库。那我们不禁要自己微调一下了。

current_palette = sns.color_palette("dark")
sns.palplot(current_palette)

 sc.pl.highest_expr_genes(adata, n_top=20,palette="dark" )
normalizing counts per cell
    finished (0:00:00)

借助seaborn的绘图系统图我们还可以这样设置颜色：

with sns.color_palette("PuBuGn_d"):
    sc.pl.highest_expr_genes(adata, n_top=20, )

help(sns.set_style)
sns.set_style("white")
sc.pl.highest_expr_genes(adata, n_top=20, )

cell QC

提到cell QC ，我们需要明白在10的体系里面什么是cell。有的同学说了，我上机6000个细胞，为什么cellranger检出的细胞不是6000，有时候差的还不小。因为在cellranger的逻辑里，cell不过是每个umi对barcode的支持程度。它说的cell是基于序列比对的，尽管这个理应和上机的cell保持一致，但是这种一致性是通过序列比对来保持的。其实我们说的cell QC指的是对barcode的质控，使留下的barcode能表征下游分析的cell。

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=0)
mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

参数的意思就是字面的意思，有不明白的可以直接help相应的函数，看它的说明文档。我基本上就是靠说明文档来学习的。下面让我们看看过滤后的adata发生了哪些变化。

adata
Out[9]: 
AnnData object with n_obs × n_vars = 4960 × 33694 
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'feature_types', 'n_cells'

 adata.obs['percent_mito']
Out[11]: 
AAACCCAAGCGTATGG-1    0.106588
AAACCCAGTCCTACAA-1    0.056101
AAACCCATCACCTCAC-1    0.532847
AAACGCTAGGGCATGT-1    0.106913
AAACGCTGTAGGTACG-1    0.078654

TTTGTTGCAGGTACGA-1    0.071226
TTTGTTGCAGTCTCTC-1    0.055394
TTTGTTGGTAATTAGG-1    0.183441
TTTGTTGTCCTTGGAA-1    0.081037
TTTGTTGTCGCACGAC-1    0.070987
Name: percent_mito, Length: 4960, dtype: float32

adata.var['n_cells']
Out[12]: 
RP11-34P13.3     0
FAM138A          0
OR4F5            0
RP11-34P13.7    43
RP11-34P13.8     0
                ..
AC233755.2       1
AC233755.1       1
AC240274.1      83
AC213203.1       0
FAM231B          0
Name: n_cells, Length: 33694, dtype: int32

细胞属性obs中有了percent_mito；有了基因属性的var。

像seurat一样，我们看看cell三个属性的小提琴图：

sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
             jitter=0.4, multi_panel=True)

sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')

显然这是分别画的两张图，我们能不能把它们组合到一起呢？显然是可以的：

help(sns.FacetGrid)
adata.obs['percent_mito1'] = adata.obs['percent_mito']*10000
adata.obs.melt(id_vars=['n_counts','percent_mito'], var_name='myVariable', value_name='myValue')
import matplotlib.pyplot as plt
sns.set(style="ticks")
g = sns.FacetGrid(adata.obs.melt(id_vars=['n_counts','percent_mito'], var_name='myVariable', value_name='myValue'),col = 'myVariable')
g.map(plt.scatter,'n_counts',"myValue", alpha=.7)
g.add_legend();

有不少人问这两张图怎么看。就拿n_counts与n_genes的散点图（scatter）来说吧：每个点代表一个细胞，斜率代表随着count的增加gene的增加程度。count和gene一般呈现线性关系，斜率越大也就是较少的count就可以检出较多的gene，说明这批细胞基因的丰度偏低（普遍贫穷）；反之，斜率较小，说明这批细胞基因丰度较高（少数富有）。有的时候不是一条可拟合的线，或者是两条可拟合的线，也反映出细胞的异质性。总之，他就是一个散点图，描述的是两个变量的关系。

根据统计的n_genes 和线粒体基因含量percent_mito 进一步过滤：

adata = adata[adata.obs.n_genes < 2500, :]
adata = adata[adata.obs.percent_mito < 0.1, :]

过滤掉基因数大于2500，线粒体基因含量大于0.1的barcode。这个过滤方法就像filter一样，操作的是adata.obs这张表。

adata.obs
Out[82]: 
                    n_genes  percent_mito  n_counts  percent_mito1
AAACGCTGTGTGATGG-1     2348      0.099821    6131.0     998.205811
AAACGCTTCTAGGCAT-1     2470      0.065564    8465.0     655.640869
AAAGAACCACCGTCTT-1      420      0.021944     638.0     219.435730
AAAGAACTCACCTGGG-1     2479      0.074067   10801.0     740.672119
AAAGAACTCGGAACTT-1     2318      0.079713    6837.0     797.133240
                    ...           ...       ...            ...
TTTGGTTTCCGGTAAT-1     2277      0.079324    9291.0     793.240723
TTTGTTGCAGGTACGA-1     1986      0.071226    8101.0     712.257751
TTTGTTGCAGTCTCTC-1     2485      0.055394    9604.0     553.935852
TTTGTTGTCCTTGGAA-1     1998      0.081037    5479.0     810.366882
TTTGTTGTCGCACGAC-1     2468      0.070987    7438.0     709.868225

[2223 rows x 4 columns]

特征提取

cellQC可以看做是对细胞的过滤，即保留可用的细胞。但是每个细胞又有成千的基因（细胞的特征），所以一般需要做特征的选择和提取（也就是降维）。什么意思呢？就是比如每个人都有很多特征，年龄、性别、身高、国籍、学历、生日、爱好、血型。。。我要对一百个人分类，所有的特征都拿来比较，可能得不到准确的划分。而用几个关键的特征可以分出某个国家某个年龄段的人。特征的选择是选择一部分特征，比如选择高变基因；特征提取是提取主要的特征结构，如主成分分析。

在特征提取之前要保证细胞之间是有可比性的，一般用的是归一化的方法，得到高变基因之后，为了使同一个基因在不同细胞之间具有可比性采用标准化。

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata

adata
AnnData object with n_obs × n_vars = 2223 × 33694 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells'
    uns: 'log1p'

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
adata = adata[:, adata.var.highly_variable]

#时刻关注我们数据的变化
adata
Out[117]: 
View of AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p'

回归每个细胞的总计数和线粒体基因表达百分比的影响，将数据缩放到单位方差。

sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')

adata
Out[121]: 
AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca'
    obsm: 'X_pca'
    varm: 'PCs

adata.obsm['X_pca']
adata.varm['PCs']

sc.pl.pca(adata, color='CST3')
sc.pl.pca_variance_ratio(adata, log=True)
adata.write(results_file)

特征选择我们用高变基因，特征提取我们用pca。这里就有一个都的问题：如何选择高变基因以及pc的维度？

聚类

scanpy提供leiden和louvain两种图聚类算法，图聚类在开始之前要先找邻居：

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
adata

Out[128]: 
AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca', 'neighbors'
    obsm: 'X_pca'
    varm: 'PCs'

sc.tl.leiden(adata)

adata.obs['leiden']
Out[188]: 
AAACGCTGTGTGATGG-1    10
AAACGCTTCTAGGCAT-1     6
AAAGAACCACCGTCTT-1     4
AAAGAACTCACCTGGG-1     2
AAAGAACTCGGAACTT-1     6
                      ..
TTTGGTTTCCGGTAAT-1     3
TTTGTTGCAGGTACGA-1     1
TTTGTTGCAGTCTCTC-1     3
TTTGTTGTCCTTGGAA-1     7
TTTGTTGTCGCACGAC-1     3
Name: leiden, Length: 2223, dtype: category
Categories (12, object): [0, 1, 2, 3, ..., 8, 9, 10, 11]
#sc.tl.louvain(adata)
#sc.tl.paga(adata)

和Seurat等人一样，scanpy推荐Traag *等人(2018)提出的Leiden graph-clustering方法(基于优化模块化的社区检测)。注意，Leiden集群直接对cell的邻域图进行聚类，我们在sc.pp.neighbors已经计算过了。
读一下leiden的说明文档：

 help(sc.tl.leiden)
Help on function leiden in module scanpy.tools._leiden:

leiden(adata: anndata._core.anndata.AnnData, resolution: float = 1, *, restrict_to: Union[Tuple[str, Sequence[str]], NoneType] = None, random_state: Union[int, mtrand.RandomState, NoneType] = 0, key_added: str = 'leiden', adjacency: Union[scipy.sparse.base.spmatrix, NoneType] = None, directed: bool = True, use_weights: bool = True, n_iterations: int = -1, partition_type: Union[Type[leidenalg.VertexPartition.MutableVertexPartition], NoneType] = None, copy: bool = False, **partition_kwargs) -> Union[anndata._core.anndata.AnnData, NoneType]
    Cluster cells into subgroups [Traag18]_.

    Cluster cells using the Leiden algorithm [Traag18]_,
    an improved version of the Louvain algorithm [Blondel08]_.
    It has been proposed for single-cell analysis by [Levine15]_.

    This requires having ran :func:`~scanpy.pp.neighbors` or
    :func:`~scanpy.external.pp.bbknn` first.

    Parameters
    ----------
    adata
        The annotated data matrix.
    resolution
        A parameter value controlling the coarseness of the clustering.
        Higher values lead to more clusters.
        Set to `None` if overriding `partition_type`
        to one that doesn’t accept a `resolution_parameter`.
    random_state
        Change the initialization of the optimization.
    restrict_to
        Restrict the clustering to the categories within the key for sample
        annotation, tuple needs to contain `(obs_key, list_of_categories)`.
    key_added
        `adata.obs` key under which to add the cluster labels.
    adjacency
        Sparse adjacency matrix of the graph, defaults to
        `adata.uns['neighbors']['connectivities']`.
    directed
        Whether to treat the graph as directed or undirected.
    use_weights
        If `True`, edge weights from the graph are used in the computation
        (placing more emphasis on stronger edges).
    n_iterations
        How many iterations of the Leiden clustering algorithm to perform.
        Positive values above 2 define the total number of iterations to perform,
        -1 has the algorithm run until it reaches its optimal clustering.
    partition_type
        Type of partition to use.
        Defaults to :class:`~leidenalg.RBConfigurationVertexPartition`.
        For the available options, consult the documentation for
        :func:`~leidenalg.find_partition`.
    copy
        Whether to copy `adata` or modify it inplace.
    **partition_kwargs
        Any further arguments to pass to `~leidenalg.find_partition`
        (which in turn passes arguments to the `partition_type`).

    Returns
    -------
    `adata.obs[key_added]`
        Array of dim (number of samples) that stores the subgroup id
        (`'0'`, `'1'`, ...) for each cell.
    `adata.uns['leiden']['params']`
        A dict with the values for the parameters `resolution`, `random_state`,
        and `n_iterations`.

seurat的findclusters中聚类的算法亦是有Leiden的，关于leiden和louvain两者的区别，可以看看From Louvain to Leiden: guaranteeing well-connected communities(https://www.nature.com/articles/s41598-019-41695-z)

algorithm   
Algorithm for modularity optimization (
1 = original Louvain algorithm;
2 = Louvain algorithm with multilevel refinement; 
3 = SLM algorithm;
4 = Leiden algorithm). Leiden requires the leidenalg python.

为了可视化聚类的结果我们还是先做一下umap降维吧，然后看看分群结果。

sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])

adata
Out[139]: 
AnnData object with n_obs × n_vars = 2223 × 2208 
    obs: 'n_genes', 'percent_mito', 'n_counts', 'percent_mito1', 'leiden'
    var: 'gene_ids', 'feature_types', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'pca', 'neighbors', 'leiden', 'paga', 'leiden_sizes', 'umap', 'leiden_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'

差异分析

分群后我们得到以数值为标识的亚群编号，我们急于知道每一个数值背后的含义，意义还得从基因上找。差异分析就是为了这目的的。

 sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
ranking genes
D:/Program Files (x86)/anconda/lib/site-packages/scanpy/tools/_rank_genes_groups.py:252: RuntimeWarning: invalid value encountered in log2
  rankings_gene_logfoldchanges.append(np.log2(foldchanges[global_indices]))
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:00)

为了使我们的文章图文并茂一些，来看看't-test'检验的每个亚群的差异基因排序：

sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
sc.settings.verbosity = 2  # reduce the verbosity

help(sc.tl.rank_genes_groups)
Help on function rank_genes_groups in module scanpy.tools._rank_genes_groups:

rank_genes_groups(adata: anndata._core.anndata.AnnData, groupby: str, use_raw: bool = True, groups: Union[scanpy._compat.Literal_, Iterable[str]] = 'all', reference: str = 'rest', n_genes: int = 100, rankby_abs: bool = False, key_added: Union[str, NoneType] = None, copy: bool = False, method: scanpy._compat.Literal_ = 't-test_overestim_var', corr_method: scanpy._compat.Literal_ = 'benjamini-hochberg', layer: Union[str, NoneType] = None, **kwds) -> Union[anndata._core.anndata.AnnData, NoneType]
    Rank genes for characterizing groups.

    Parameters
    ----------
    adata
        Annotated data matrix.
    groupby
        The key of the observations grouping to consider.
    use_raw
        Use `raw` attribute of `adata` if present.
    layer
        Key from `adata.layers` whose value will be used to perform tests on.
    groups
        Subset of groups, e.g. [`'g1'`, `'g2'`, `'g3'`], to which comparison
        shall be restricted, or `'all'` (default), for all groups.
    reference
        If `'rest'`, compare each group to the union of the rest of the group.
        If a group identifier, compare with respect to this group.
    n_genes
        The number of genes that appear in the returned tables.
    method
        The default 't-test_overestim_var' overestimates variance of each group,
        `'t-test'` uses t-test, `'wilcoxon'` uses Wilcoxon rank-sum,
        `'logreg'` uses logistic regression. See [Ntranos18]_,
        `here <https://github.com/theislab/scanpy/issues/95>`__ and `here
        <http://www.nxn.se/valent/2018/3/5/actionable-scrna-seq-clusters>`__,
        for why this is meaningful.
    corr_method
        p-value correction method.
        Used only for `'t-test'`, `'t-test_overestim_var'`, and `'wilcoxon'`.
    rankby_abs
        Rank genes by the absolute value of the score, not by the
        score. The returned scores are never the absolute values.
    key_added
        The key in `adata.uns` information is saved to.
    **kwds
        Are passed to test methods. Currently this affects only parameters that
        are passed to :class:`sklearn.linear_model.LogisticRegression`.
        For instance, you can pass `penalty='l1'` to try to come up with a
        minimal set of genes that are good predictors (sparse solution meaning
        few non-zero fitted coefficients).

    Returns
    -------
    **names** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Structured array to be indexed by group id storing the gene
        names. Ordered according to scores.
    **scores** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Structured array to be indexed by group id storing the z-score
        underlying the computation of a p-value for each gene for each
        group. Ordered according to scores.
    **logfoldchanges** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Structured array to be indexed by group id storing the log2
        fold change for each gene for each group. Ordered according to
        scores. Only provided if method is 't-test' like.
        Note: this is an approximation calculated from mean-log values.
    **pvals** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        p-values.
    **pvals_adj** : structured `np.ndarray` (`.uns['rank_genes_groups']`)
        Corrected p-values.

    Notes
    -----
    There are slight inconsistencies depending on whether sparse
    or dense data are passed. See `here <https://github.com/theislab/scanpy/blob/master/scanpy/tests/test_rank_genes_groups.py>`__.

    Examples
    --------
import scanpy as sc
adata = sc.datasets.pbmc68k_reduced()
sc.tl.rank_genes_groups(adata, 'bulk_labels', method='wilcoxon')

    # to visualize the results
sc.pl.rank_genes_groups(adata)

scanpy差异分析的方法没有seurat丰富了，除了t-test，还有wilcoxon和logreg。

查看差异分析的结果：

sc.get.rank_genes_groups_df(adata, group="0")
Out[148]: 
       scores     names  logfoldchanges         pvals     pvals_adj
0   15.179968      IL7R             NaN  1.472705e-43  1.711438e-42
1   13.268551     RGS10             NaN  6.462721e-35  5.595956e-34
2   11.459023     LRRN3             NaN  9.208903e-27  5.465930e-26
3   10.671810       CD7             NaN  3.213348e-24  1.665510e-23
4    8.615908     INTS6             NaN  1.053249e-16  3.856673e-16
..        ...       ...             ...           ...           ...
95   1.755937    DUSP16             NaN  7.978160e-02  1.094148e-01
96   1.724632      GALM             NaN  8.523017e-02  1.163811e-01
97   1.713933  KCNQ1OT1             NaN  8.719264e-02  1.188403e-01
98   1.706563      PASK             NaN  8.851861e-02  1.202764e-01
99   1.696579      SUCO             NaN  9.047560e-02  1.227089e-01

[100 rows x 5 columns]

指定哪两组进行差异分析：

sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)

sc.get.rank_genes_groups_df(adata, group="0")
Out[169]: 
       scores     names  logfoldchanges         pvals     pvals_adj
0   19.284122     EFNB2        2.265986  7.301753e-83  4.030568e-81
1   19.205927  C11orf96             NaN  3.301754e-82  1.350050e-80
2   19.158936     NPDC1       -2.627818  8.152369e-82  2.686632e-80
3   18.702019      EGR4             NaN  4.766577e-78  6.190942e-77
4   18.405581     FBXL8       -0.640331  1.185097e-75  1.104091e-74
..        ...       ...             ...           ...           ...
95  11.469543     AP4S1       -3.958649  1.876464e-30  3.019849e-30
96  11.459700    FBXO33        3.057788  2.102432e-30  3.378580e-30
97  11.423295   LDLRAD4        0.052481  3.198732e-30  5.121683e-30
98  11.368316      H1F0             NaN  6.013658e-30  9.587116e-30
99  11.310737    ZCWPW1        0.455119  1.161100e-29  1.836468e-29

[100 rows x 5 columns]

差异基因的可视化

marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']}).head(5)

        0_n           0_p
0     EFNB2  7.301753e-83
1  C11orf96  3.301754e-82
2     NPDC1  8.152369e-82
3      EGR4  4.766577e-78
4     FBXL8  1.185097e-75

sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')

new_cluster_names = [
    'CD4 T', 'CD14 Monocytes',
    'B', 'CD8 T',
    'NK', 'FCGR3A Monocytes',
    'Dendritic', 'Megakaryocytes',
   'NewCluster1','NewCluster2','NewCluster3','NewCluster4' ]
adata.rename_categories('leiden', new_cluster_names)

/#Omitting rank_genes_groups/scores as old categories do not match.
/#Omitting rank_genes_groups/names as old categories do not match.
/#Omitting rank_genes_groups/logfoldchanges as old categories do not match.
/#Omitting rank_genes_groups/pvals as old categories do not match.
/#Omitting rank_genes_groups/pvals_adj as old categories do not match.

ax = sc.pl.dotplot(adata, marker_genes, groupby='leiden')

ax = sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90)

像seurat一样，scanpy也有丰富的数据可视化的方式，太多以至于需要单独再写另一篇文章，这里就不介绍了。

保存分析结果以供下一次调用

adata.write(results_file, compression='gzip')  # `compression='gzip'` saves disk space, but slows down writing and subsequent reading

adata.X = None
adata.write('E:/learnscanpy/write/pbmc3k_withoutX.h5ad', compression='gzip')
adata.obs[['n_counts', 'leiden']].to_csv(
     'E:/learnscanpy/write/pbmc3k_corrected_lei_groups.csv')

/# Export single fields of the annotation of observations
/# adata.obs[['n_counts', 'louvain_groups']].to_csv(
/#     './write/pbmc3k_corrected_louvain_groups.csv')

/# Export single columns of the multidimensional annotation
/# adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(
/#     './write/pbmc3k_corrected_X_pca.csv')

/# Or export everything except the data using `.write_csvs`.
/# Set `skip_data=False` if you also want to export the data.
/# adata.write_csvs(results_file[:-5], )

今天，我们学习了scanpy的一般流程，我们发现不管工具如何变,单细胞转录组的数据分析的大框架是没有变化的，几个分析的工具也是相互借鉴的。但是python的就不值得一学了吗？

SCANPY: large-scale single-cell gene expression data analysis（https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1382-0）

原创文章，作者：kepupublish，如若转载，请注明出处：https://blog.ytso.com/212007.html