Literature analysis: cell type annotation strategies based on supervised learning. Evaluation of some aspects in supervised cell type identification for single-ce

Link to the original PDF

Abstract

	Progress	Challenge	Demand
Background

Solve	What	How	Effect
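	Evaluates different combinations of strategies through real data analysis	The impact of the reference data and the strategies for processing it	Provides guidelines and rules of thumb for using supervised cell typing methods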
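Result	Study design
	Methods under comparison
		3 off-the-shelf methods: random forest, SVM with linear kernel, and SVM with radial basis function kernel
		2 correlation-based methods for scRNA-seq: scmap and CHETAH
		2 supervised deep learning methods: multi-layer perceptron (MLP) and graph-embedded deep neural network (GEDFN)
		2 semi-supervised deep learning methods: ItClust with transfer-learning and MARS with meta-learning concepts
		Other methods exist, but according to previous studies SVM with rejection, scmap, and CHETAH are the best among them.
		GEDFN is included to test whether gene network information helps.
		ItClust uses the ref data only to obtain reference values for unsupervised clustering.
	Feature selection methods
		Key point: many genes are not cell-type specific and should be removed.
		3 unsupervised feature selection methods: Seurat, FEAST, F-test
		No selection
		Selection on ref and selection on tar
		No selection on ref, selection on tar
		Selection on ref, no selection on tar
	Datasets
		Human PBMC
			10X: lupus patients
			10X, Smart-seq2, CEL-seq2: pbmc1, fresh
			10X, Smart-seq2, CEL-seq2: pbmc2, fresh
		Human pancreas
		3 mouse brain
			Drop-seq, frontal cortex: “Mouse brain FC”
			Drop-seq, hippocampus regions: “Mouse brain HC”
			10X, prefrontal cortex region: “Mouse brain pFC”
			DroNc-seq, cortex samples: “Mouse brain cortex”
			10X, frontal cortex regions: “Mouse brain Allen”
		Sorted human PBMC data
		What happens when ref and tar come from different platforms?
		What happens when ref and tar come from different sample states? From different laboratories? From different tissue regions? From different physiological states?
		Study: does integrating multiple datasets improve performance?
		Study: does removing noisy cells improve performance?
	Evaluation metrics
		Accuracy: proportion of correctly annotated cells among all cells
		Adjusted Rand Index: clustering similarity
		Macro F1: used only to assess the recall rate when cell type proportions are imbalanced
		Running time
	Summary of the study design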
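The three label-based metrics listed above can be sketched in plain Python. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, macro F1 is averaged over the cell types present in the true labels, and at least two cells are assumed.

```python
from collections import Counter
from math import comb


def accuracy(true_labels, pred_labels):
    """Fraction of cells whose predicted type matches the true type."""
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)


def macro_f1(true_labels, pred_labels):
    """Unweighted mean of per-type F1, so rare types weigh as much as common ones.

    Assumption: the average runs over the types present in true_labels.
    """
    pairs = list(zip(true_labels, pred_labels))
    scores = []
    for cell_type in set(true_labels):
        tp = sum(t == cell_type and p == cell_type for t, p in pairs)
        fp = sum(t != cell_type and p == cell_type for t, p in pairs)
        fn = sum(t == cell_type and p != cell_type for t, p in pairs)
        # tp + fn >= 1 because cell_type occurs in true_labels.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index between two labelings of the same cells."""
    n = len(true_labels)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case: both labelings are trivial and identical in structure.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy example: six cells, one NK cell mislabeled as T.
true = ["B", "B", "T", "T", "NK", "NK"]
pred = ["B", "B", "T", "T", "T", "NK"]
print(accuracy(true, pred), macro_f1(true, pred), adjusted_rand_index(true, pred))
```

In the toy example the single mislabeled NK cell lowers the F1 of both NK and T, which is why macro F1 is informative when cell type proportions are imbalanced.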
	F-test on reference datasets + MLP: the combination of feature selection and MLP
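The feature selection half of this combination can be sketched as a one-way ANOVA F statistic computed per gene on the labeled reference, after which the same genes are kept in the target. This is a rough sketch under assumptions, not the paper's implementation: the function names, the NumPy matrix layout (cells by genes), and the choice to score zero-variance genes as 0 are all illustrative. The classifier step (for example an MLP trained on the selected reference genes) is not shown.

```python
import numpy as np


def f_test_scores(expr, labels):
    """One-way ANOVA F statistic per gene.

    expr: (n_cells, n_genes) array of normalized expression (assumed layout).
    labels: length n_cells sequence of reference cell type labels.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    types = np.unique(labels)
    n_cells, n_types = expr.shape[0], len(types)
    grand_mean = expr.mean(axis=0)
    between = np.zeros(expr.shape[1])
    within = np.zeros(expr.shape[1])
    for cell_type in types:
        group = expr[labels == cell_type]
        group_mean = group.mean(axis=0)
        between += group.shape[0] * (group_mean - grand_mean) ** 2
        within += ((group - group_mean) ** 2).sum(axis=0)
    ms_between = between / (n_types - 1)
    ms_within = within / (n_cells - n_types)
    # Assumption: genes with zero within-type variance get a score of 0.
    return np.divide(ms_between, ms_within,
                     out=np.zeros_like(ms_between), where=ms_within > 0)


def select_top_genes(expr, labels, n_top):
    """Column indices of the n_top genes with the largest F statistic."""
    scores = f_test_scores(expr, labels)
    return np.argsort(scores)[::-1][:n_top]


# Synthetic reference: gene 3 marks B cells, gene 7 marks T cells.
rng = np.random.default_rng(0)
ref_labels = np.array(["B"] * 50 + ["T"] * 50)
ref_expr = rng.normal(size=(100, 20))
ref_expr[:50, 3] += 5.0
ref_expr[50:, 7] += 5.0
tar_expr = rng.normal(size=(30, 20))

selected = select_top_genes(ref_expr, ref_labels, n_top=2)
ref_selected = ref_expr[:, selected]  # would be used to train the classifier
tar_selected = tar_expr[:, selected]  # same genes, so the classifier can be applied
```

Selecting on the reference and reusing the indices on the target keeps the two matrices column-aligned, which any downstream classifier needs.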
	Impact of the reference data size: learning-based methods (MLP, SVM) rank higher as the number of cells increases.
	Impact of number of cell types: a tissue has a few major types and many subtypes; the subtypes are similar to one another and hard to distinguish.
	Impact of cell type annotations: the annotations above come from markers; here the sorted data serve as the gold standard.
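	Impact of data preprocessing: evaluates the effect of batch effect removal and data imputation. Three imputation methods were evaluated first: no clear improvement, so imputation is judged unnecessary. Batch effect removal was evaluated next with Harmony and fastMNN: batch effects do not affect prediction performance and may not need correction, so we directly concatenate the datasets for the following analyses.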
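Direct concatenation without batch correction could look like the sketch below. It assumes each dataset is a cells-by-genes NumPy array with its own gene list, and that only genes shared by all datasets are kept; the restriction to shared genes and the function name are assumptions for illustration, not details given in the notes.

```python
import numpy as np


def concatenate_references(datasets):
    """Stack several reference datasets on their shared genes.

    datasets: list of (expr, genes, labels) tuples, where expr is a
    (n_cells, n_genes) array, genes names the columns, and labels names
    the cell type of each row. No batch correction is applied.
    """
    shared = set(datasets[0][1])
    for _, genes, _ in datasets[1:]:
        shared &= set(genes)
    shared_genes = sorted(shared)

    blocks, all_labels = [], []
    for expr, genes, labels in datasets:
        column_of = {gene: i for i, gene in enumerate(genes)}
        columns = [column_of[gene] for gene in shared_genes]
        # Reorder columns so every block uses the same gene order.
        blocks.append(np.asarray(expr)[:, columns])
        all_labels.extend(labels)
    return np.vstack(blocks), shared_genes, all_labels


# Two toy references with partly overlapping, differently ordered genes.
ref_a = (np.array([[1, 2, 3], [4, 5, 6]]), ["CD3E", "CD19", "NKG7"], ["T", "B"])
ref_b = (np.array([[7, 8], [9, 10]]), ["NKG7", "CD3E"], ["NK", "T"])
combined, genes, labels = concatenate_references([ref_a, ref_b])
```

Sorting the shared gene names gives a deterministic column order, so the combined matrix can be aligned with a target dataset in the same way.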
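	Drop-seq, frontal cortex: “Mouse brain FC”
	Drop-seq, hippocampus regions: “Mouse brain HC”
	10X, prefrontal cortex region: “Mouse brain pFC”
	DroNc-seq, cortex samples: “Mouse brain cortex”
	10X, frontal cortex regions: “Mouse brain Allen”
	Condition effect
		Individual differences: differences between samples
		Condition differences: technical differences
	Comparing individual effect (only the samples differ), region effect, and dataset effect in mouse brain data
		The tar data are fixed to “Mouse brain FC”
		individual effect: same data source, “Mouse brain FC”
		biological effect (region effect): “Mouse brain HC”
		dataset effect: “Mouse brain cortex” and “Mouse brain pFC”
	Comparing batch effect and clinical difference in Human PBMC
		The diseased “Human PBMC lupus” is chosen as the tar data
		individual effect: different individuals in the same batch
		batch effect: different batches
		clinical difference: different batches and different physiological states (different treatments)
		Performance on the human data is worse because it includes subtypes; with many subtypes the signal is weak.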




Conclusion & Discussion
Method