看一下数据是什么样子(一般是公司给的):

10×:

10X Genomics的单细胞测序数据为例,当公司完成测序后,通常会提供一组原始数据文件。拿到数据后,我们可以在提供的文件夹(如“Data”文件夹)中找到四个示例数据集(如 R8081、R8082、R8083 等)。这些数据集的结构基本相同,主要包含FASTQ格式的原始测序数据。

服务公司通常会使用 Cell Ranger 进行标准化分析,将 FASTQ 文件比对到参考基因组(如 hg38hg19)。在进行数据处理时,需要明确基因组版本,因为这会影响后续的分析和文章撰写。一般来说,hg38 是目前常用的基因组版本

单细胞数据的关键文件

当 Cell Ranger 处理完成后,输出的结果文件夹中会包含多个文件和子文件夹。其中,最重要的文件位于 filtered_feature_bc_matrix 目录下

该目录包含三个核心文件:

  1. barcodes.tsv.gz —— 细胞条形码信息,每一行代表一个单细胞的唯一标识。
  2. features.tsv.gz —— 基因特征文件,列出了所有被检测到的基因名称。
  3. matrix.mtx.gz —— 表达矩阵,存储了细胞(barcode)与基因(features)之间的表达量数据,通常是一个稀疏矩阵(Sparse Matrix)。

这些文件共同构成了单细胞测序数据的基础信息,分析时通常只需要这三个文件即可。

BD:

BD(Becton Dickinson)也有自己的一套单细胞测序平台。经过预处理后,BD 会提供一个 RSEC 校正后的 Molecular Profile(分子特征)CSV 文件

由于该 CSV 文件较大,我们一般不会直接打开查看,但它的基本格式与 10X Genomics 输出的表达矩阵类似:

  • 行(row)代表基因
  • 列(column)代表细胞
  • 矩阵中的数值代表该细胞对该基因的表达水平(UMI count)

单细胞数据的稀疏矩阵特性

无论是 BD 还是 10X Genomics,它们的单细胞 RNA 测序(scRNA-seq)数据都呈现高度稀疏矩阵(Sparse Matrix)的特点。这意味着:

  1. 大部分值为零:由于 mRNA 转录的随机性以及技术噪声,在单个细胞中,很多基因根本不会被检测到,因此数据矩阵中大多数数值为 0。
  2. 每个细胞通常检测到几千个基因:尽管整张矩阵可能包含 2~3 万个基因,但每个单细胞通常只会表达其中的几千个。
  3. 表达量数据为整数(count data):通常是 UMI 计数值(Unique Molecular Identifier Count),如 1、5、10、50 等,而不是连续变量。这是因为 scRNA-seq 通过 UMI 去除 PCR 扩增偏倚,最终测得的是基因的真实拷贝数。

Smart-seq:

Smart-seq 是另一种单细胞 RNA 测序(scRNA-seq)技术,与 10X Genomics 和 BD 不同,它的测序方法更接近于传统的 bulk RNA-seq

Smart-seq 数据的基本格式

  1. 细胞命名方式不同
    • 在 10X Genomics 数据中,每个细胞通常使用条形码(Barcode)进行标记。
    • 在 Smart-seq 数据中,每个细胞有一个更具体的名称,而不是一个简单的条形码。
  2. 行(row)代表基因,列(column)代表细胞,与其他 scRNA-seq 数据类似。
  3. 数据存储格式为 RPKM/TPM/FPKM
    • 由于 Smart-seq 采用的是全长转录组测序(Full-length Transcript Sequencing),而不是 3' 端测序,因此它的基因表达数据通常经过标准化,表现为 RPKM(Reads Per Kilobase per Million Mapped Reads)或 TPM(Transcripts Per Million)
    • 这与 10X Genomics 使用的 UMI 计数(count-based data)不同。

Smart-seq 数据的处理方式

  • 在 R 语言环境中,我们可以使用 SeuratScanpy 进行 Smart-seq 数据分析。
  • 由于 Smart-seq 检测到的基因数量较多,因此它的处理方式更接近 bulk RNA-seq,需要额外的标准化步骤。

计算资源需求

  • 处理 Smart-seq 原始 FASTQ 数据需要进行 比对(Alignment)定量(Quantification),通常使用 STAR + RSEMHISAT2 + StringTie
  • 由于计算量较大,通常需要 Linux 服务器高性能计算集群(HPC) 进行数据分析。
  • 10X Genomics 提供的 Cell Ranger 只能运行在 Linux 系统上,Windows 和 macOS 用户无法直接运行该流程。

软件安装:

####安装R语言和Rstudio###

#R software: R: The R Project for Statistical Computing

#Rstudio:https://www.rstudio.com

####设置镜像,提高软件包安装速度###

options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor")

options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))

####安装软件包安装的协助包###

###从CRAN安装

install.packages("stringr")###处理字符文件

install.packages("future")##并行计算

install.packages("future.apply")##并行计算

install.packages("ggplot2")###安装作图软件

install.packages("Seurat")###单细胞处理包

install.packages("dplyr")###管道符号

install.packages("ggpubr")###作图与统计

install.packages("shinythemes")###Shiny服务器

install.packages("ggsci")###配色图

library(stringr)

library(future)

library(future.apply)

library(ggplot2)

library(Seurat)

library(dplyr)

library(ggpubr)

####从Bioconductor 安装软件

install.packages("BiocManager")####Bioconductor网站安装生物信息学软件包

library(BiocManager)

BiocManager::install("limma")

BiocManager::install("clusterProfiler")###进行GO分析

BiocManager::install("ComplexHeatmap")#,force = T)###heatmap作图

BiocManager::install("monocle")##使用此版本monocle2

BiocManager::install("org.Hs.eg.db")

BiocManager::install("infercnv",force = T)

BiocManager::install(c("AUCell", "RcisTarget"))

BiocManager::install(c("GENIE3")) # Optional. Can be replaced by GRNBoost

## Optional (but highly recommended):

# To score the network on cells (i.e. run AUCell):

BiocManager::install(c("zoo", "mixtools", "rbokeh"))

# For various visualizations and perform t-SNEs:

BiocManager::install(c("DT", "NMF", "ComplexHeatmap", "R2HTML", "Rtsne"))

# To support paralell execution (not available in Windows):

BiocManager::install(c("doMC", "doRNG"))

#如果出现Update all/some/none? [a/s/n]: 输入a,回车键

#如果出现Do you want to install from sources the packages which need compilation? (Yes/no/cancel) :输入no,回车键

library(infercnv)

library(limma)

library(clusterProfiler)

library(ComplexHeatmap)

###从Github安装软件

install.packages("devtools")####首先从github网站安装生物信息学软件包##

#library(devtools)

install_github("sqjin/CellChat")

###安装DoubletFinder检测Doublets###

#来自:https://github.com/ddiez/DoubletFinder

devtools::install_github("sqjin/CellChat")

devtools::install_github('chris-mcginnis-ucsf/DoubletFinder')

devtools::install_github("ggjlab/scHCL")

devtools::install_github("immunogenomics/harmony")

devtools::install_github("aertslab/SCopeLoomR", build_vignettes = TRUE)

devtools::install_github("aertslab/SCENIC")

devtools::install_github("sqjin/CellChat")##cell-cell interaction

library(DoubletFinder)

####存在并行计算时###

library(future)

library(future.apply)

plan("multiprocess", workers = 2) ###compute cores计算核心

options(future.globals.maxSize = 8000 * 1024^2)##最大设置内存占用大小

 

 

处理10×数据:

####首先认识数据结构####

###10XGenomics 的filtered_feature_bc_matrix

##https://www.ncbi.nlm.n[ih.gov/geo/query/acc.cgi?acc=GSE134520

###BD RSEC_MolsPerCell.csv

###SMART-seq RPKM.txt

rm(list = ls())

#install.packages("Seurat")

####首先我们先处理10X Genomics的数据

library(Seurat)##version 4.0.5

library(dplyr)

library(future)

library(future.apply)

library(DoubletFinder)

plan("multicore", workers = 3) #三个线程

options(future.globals.maxSize = 10000 * 1024^2)#10g内存

#getwd()

setwd("~/Documents_PC/scRNA-seq/Data")

setwd("./10X/")

setwd("~/Documents_PC/scRNA-seq/Data/10X")

list.files(path = "./",pattern = "^SRR")#看一下路径下的文件

####SRR780####

SRR780<-Read10X(data.dir = "SRR80_outs/filtered_feature_bc_matrix/")

SRR780[1:5,1:5]#取前五行和前五列看看

SRR780_object<- CreateSeuratObject(counts = SRR780, project = "SRR780", min.cells = 0, min.features = 200)#一些筛选标记,所有基因导入,一个细胞里必须有200个基因才行

SRR780_object[["percent.mt"]] <- PercentageFeatureSet(SRR780_object, pattern = "^MT-")#所有的线粒体基因都是以MT打头的,看看线粒体基因

hist(SRR780_object[["percent.mt"]]$percent.mt)

VlnPlot(SRR780_object, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)#画一个小提琴图

SRR780_object

plot1 <- FeatureScatter(SRR780_object, feature1 = "nCount_RNA", feature2 = "percent.mt")

plot2 <- FeatureScatter(SRR780_object, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")#画两个图看看质量怎么样,比较高counts和比较高gene数都要去掉

CombinePlots(plots = list(plot1,plot2))

SRR780_object

##select the cells with 300 genes at least and 4000 at most, the percent of mitochondrion genes is less than 10%

SRR780_val<- subset(SRR780_object, subset = nFeature_RNA > 200 & nFeature_RNA < 7000 & percent.mt < 20)#这个就是筛选过程

SRR780_val@meta.data[1:5,]#显示前五行

SRR780_val@meta.data[1:5,]

dim(SRR780_val@meta.data)

colnames(SRR780_val)

rownames(SRR780_val@meta.data)

colname<-paste("SRR780_",colnames(SRR780_val),sep="")

colname

SRR780_val<-RenameCells(object = SRR780_val,colname)

#rm(list = c("BC2","bc2_object"))

SRR780_val@meta.data[1:5,]

SRR780_val$tissue_type<-"Normal"#增加一列

SRR780_val@meta.data[1:5,]

SRR780_val <- NormalizeData(SRR780_val, normalization.method = "LogNormalize", scale.factor = 10000)

SRR780_val <- FindVariableFeatures(SRR780_val, selection.method = "vst", nfeatures = 2000)#找细胞间表达差异大的基因,top2000

SRR780_val@assays$RNA@var.features

length(SRR780_val@assays$RNA@var.features)

###scaling the data###

all.genes <- rownames(SRR780_val)

all.genes[1000:10010]

SRR780_val <- ScaleData(SRR780_val,vars.to.regress = c("percent.mt"))

#去除线粒体基因的一个占比,剩下的基因进行归一化

SRR780_val

###perform linear dimensional reduction###

SRR780_val <- RunPCA(SRR780_val, features = VariableFeatures(object = SRR780_val))#PCA降维

print(SRR780_val[["pca"]], dims = 1:5, nfeatures = 5)

#dev.off()

VizDimLoadings(SRR780_val, dims = 1:2, reduction = "pca")

ElbowPlot(SRR780_val,ndims = 50)

####cluster the cells###

SRR780_val <- FindNeighbors(SRR780_val, dims = 1:30)#取到20或者拐点之后,区别不大,在30维度区间上计算哪些细胞之间是更相似的,聚集在一块

####Run non-linear dimensional reduction (UMAP/tSNE)###

SRR780_val <- RunUMAP(SRR780_val, dims = 1:30)

SRR780_val<-RunTSNE(SRR780_val,dims=1:30)

SRR780_val <- FindClusters(SRR780_val, resolution = 1)#这块是1,根据基因表达来定

DimPlot(SRR780_val, reduction = "umap",label = T)

#画一个细胞分群图,靠的近在功能和基因表达上会相近

DimPlot(SRR780_val, reduction = "tsne",label = T)

table(Idents(SRR780_val))

getwd()

SRR780.markers <- FindAllMarkers(SRR780_val, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)#找高表达差异基因,在细胞里面比例25%

write.csv(SRR780.markers,"./SRR80_outs/SRR780.markers.csv")

marker <- FindMarkers(SRR780_val, ident.1 = "3",ident.2 = "7",only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

######biomarkers###

###epithelial cell

FeaturePlot(SRR780_val,features = c("EPCAM","CDH1","KRT7","KRT5","UPK1A","KRT20"),reduction = "tsne",label = T)

##TNK cell

FeaturePlot(SRR780_val,features = c("CD3D","CD8A","CD4","NCAM1"),reduction = "tsne",label = T)

##endothelial cell

FeaturePlot(SRR780_val,features = c("VWF","PECAM1"),reduction = "tsne",label = T)

####fibroblasts and pericytes

FeaturePlot(SRR780_val,features = c("RGS5","COL1A1","PDGFRA","PDGFRB","DES"),reduction = "tsne",label = T)

###B cell and plasma

FeaturePlot(SRR780_val,features = c("CD19","JCHAIN","CD79A"),reduction = "tsne",label = T)

###myeloid cell

FeaturePlot(SRR780_val,features = c("CD74","CD14","FCGR3A","S100A8"),reduction = "tsne",label = T)

###mast cell

FeaturePlot(SRR780_val,features = c("CPA3"),reduction = "tsne",label = T)

不同类型的细胞及其标志基因

细胞类型

标志基因

上皮细胞(epithelial cell)

EPCAM, CDH1, KRT7, KRT5, UPK1A, KRT20

T/NK细胞(TNK cell)

CD3D, CD8A, CD4, NCAM1

内皮细胞(endothelial cell)

VWF, PECAM1

成纤维细胞及周细胞(fibroblasts and pericytes)

RGS5, COL1A1, PDGFRA, PDGFRB, DES

B细胞和浆细胞(B cell and plasma cell)

CD19, JCHAIN, CD79A

髓系细胞(myeloid cell)

CD74, CD14, FCGR3A, S100A8

肥大细胞(mast cell)

CPA3

SRR780_val<-RenameIdents(SRR780_val,"3"="Epithelial","10"="Epithelial","1"="Fibroblast","4"="Pericytes","9"="Pericytes","13"="MAST","5"="B",

                              "11"="Plasma_B","6"="T/NK","2"="T/NK","8"="Myeloid","0"="Endothelial","12"="Endothelial","7"="Unknown")

DimPlot(SRR780_val,reduction = "tsne",label = FALSE)

DimPlot(SRR780_val,reduction = "umap",label = T)

table(Idents(SRR780_val))###查看有多少个细胞

SRR780_val@meta.data[1:5,]

table(Idents(SRR780_val))

Idents(SRR780_val)<-SRR780_val$RNA_snn_res.0.6

table(Idents(SRR780_val))

SRR780_val$cell_cluster_origin<-Idents(SRR780_val)

SRR780_val@meta.data[1:5,]

saveRDS(SRR780_val,file = "./SRR80_outs/SRR780_val.rds")

#save.image(file="./SRR780_outs/SRR780_val.RData")

#load("./SRR780_outs/SRR780_val.RData")

SRR780_val$cluster<-Idents(SRR780_val)

table(Idents(SRR780_val))

####auto cellular annotation#####

#install.packages("shinythemes")

#devtools::install_github("ggjlab/scHCL")###郭国冀老师团队

library(scHCL)#自动注释软件包

hcl_result <- scHCL(scdata = SRR780_val@assays$RNA@data, numbers_plot = 1)###time comsuming take several mins

hcl_result$scHCL#预测结果

cellassign<-table(Idents(SRR780_val),hcl_result$scHCL_probility$`Cell type`)

cellassign

cellassign<-table(Idents(SRR780_val),hcl_result$scHCL)

cellassign

#SRR780_val$RNA_snn_res.0.6

cellassign<-table(SRR780_val$RNA_snn_res.0.6,hcl_result$scHCL_probility$`Cell type`)

cellassign

library(stringr)

cell_predict<-data.frame(hcl_result$scHCL_probility)

SRR780_val@meta.data[1:5,]

table(rownames(SRR780_val@meta.data)==cell_predict$Cell)

str_split(cell_predict$Cell.type,"[.]",simplify = T)[,1]

cell_predict$cell_predict<-str_split(cell_predict$Cell.type,"[.]",simplify = T)[,1]

table(SRR780_val$RNA_snn_res.0.6,cell_predict$cell_predict)

SRR780_val$cell_predict<-cell_predict$cell_predict

SRR780_val@meta.data[1:3,]

table(Idents(SRR780_val))

SRR780_val$cluster<-Idents(SRR780_val)

####用singleR软件来注释细胞####  有空闲时跑

#devtools::install_github("dviraran/SingleR")

BiocManager::install("SingleR")

BiocManager::install("celldex")

library(SingleR)

library(celldex)

sc_data<-as.matrix(SRR780_val@assays$RNA@counts)

hpca.se <- HumanPrimaryCellAtlasData()

hpca.se$label.main

pred.hesc <- SingleR(test = sc_data, ref = hpca.se, assay.type.test=1,

                     labels = hpca.se$label.main)

pred.hesc$labels###预测的结果

####Doublets identification####

library(DoubletFinder)

###完整的分析:前面和建立单细胞Seurat对象一样,因为前面跑过了,这里都省略

if (TRUE){

SRR780<-Read10X(data.dir = "SRR80_outs//filtered_feature_bc_matrix/")

SRR780[1:5,1:5]

SRR780_object<- CreateSeuratObject(counts = SRR780, project = "SRR780", min.cells = 0, min.features = 200)

SRR780_object[["percent.mt"]] <- PercentageFeatureSet(SRR780_object, pattern = "^MT-")

hist(SRR780_object[["percent.mt"]]$percent.mt)

VlnPlot(SRR780_object, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

plot1 <- FeatureScatter(SRR780_object, feature1 = "nCount_RNA", feature2 = "percent.mt")

plot2 <- FeatureScatter(SRR780_object, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

CombinePlots(plots = list(plot1,plot2))

SRR780_object

##select the cells with 300 genes at least and 4000 at most, the percent of mitochondrion genes is less than 10%

SRR780_val<- subset(SRR780_object, subset = nFeature_RNA > 200 & nFeature_RNA < 7000 & percent.mt < 20)

SRR780_val@meta.data[1:5,]

colname<-paste("SRR780_",colnames(SRR780_val),sep="")

colname

SRR780_val<-RenameCells(object = SRR780_val,colname)

#rm(list = c("BC2","bc2_object"))

SRR780_val@meta.data[1:5,]

SRR780_val$tissue_type<-"Normal"

SRR780_val <- NormalizeData(SRR780_val, normalization.method = "LogNormalize", scale.factor = 10000)

SRR780_val <- FindVariableFeatures(SRR780_val, selection.method = "vst", nfeatures = 2000)

###scaling the data###

all.genes <- rownames(SRR780_val)

SRR780_val <- ScaleData(SRR780_val,vars.to.regress = c("percent.mt"))

SRR780_val

###perform linear dimensional reduction###

SRR780_val <- RunPCA(SRR780_val, features = VariableFeatures(object = SRR780_val))

print(SRR780_val[["pca"]], dims = 1:5, nfeatures = 5)

#dev.off()

VizDimLoadings(SRR780_val, dims = 1:2, reduction = "pca")

ElbowPlot(SRR780_val,ndims = 50)

####cluster the cells###

SRR780_val <- FindNeighbors(SRR780_val, dims = 1:30)

####Run non-linear dimensional reduction (UMAP/tSNE)###

SRR780_val <- RunUMAP(SRR780_val, dims = 1:30)

SRR780_val<-RunTSNE(SRR780_val,dims=1:30)

SRR780_val <- FindClusters(SRR780_val, resolution = 0.6)

}

SRR780_val@meta.data[1:5,]

####因为前面跑过,所以直接从这步开始计算###

saveRDS(SRR780_val,file="SRR780.rds")

SRR780_val_db<-SRR780_val

sweep.res.list_SRR780_val <- paramSweep_v3(SRR780_val_db, PCs = 1:30)

sweep.stats_SRR780_val <- summarizeSweep(sweep.res.list_SRR780_val, GT = FALSE)

bcmvn_SRR780_val<-find.pK(sweep.stats_SRR780_val)

pK_value <- as.numeric(as.character(bcmvn_SRR780_val$pK[bcmvn_SRR780_val$BCmetric == max(bcmvn_SRR780_val$BCmetric)]))

annotations <- SRR780_val_db@meta.data$RNA_snn_res.0.6

homotypic.prop <- modelHomotypic(annotations)  ####get the estimated number

nExp_poi <- round(homotypic.prop*length(SRR780_val_db@meta.data$orig.ident))

nExp_poi.adj <- round(nExp_poi*(1-homotypic.prop))

pN_value <- 0.25

pANN_value <- paste0("pANN_",pN_value,"_",pK_value,'_',nExp_poi)

SRR780_val_db <- doubletFinder_v3(SRR780_val_db, PCs = 1:30, pN = pN_value, pK = pK_value, nExp = nExp_poi, reuse.pANN = FALSE)

SRR780_val_db <- doubletFinder(SRR780_val_db, PCs = 1:30, pN = pN_value, pK = pK_value, nExp = nExp_poi.adj, reuse.pANN = pANN_value)

SRR780_val_db@meta.data[1:5,]

SRR780_val$doublets<-SRR780_val_db$DF.classifications_0.25_0.3_83

SRR780_val@meta.data[1:5,]

####saveRDS###

saveRDS(SRR780_val,file = "./SRR80_outs/SRR780_val.rds")

DimPlot(SRR780_val,reduction = "tsne",group.by = "doublets")

DimPlot(SRR780_val,reduction = "tsne",group.by = "cluster",label = T)

DimPlot(SRR780_val,reduction = "tsne",label = T)

 

Logo

有“AI”的1024 = 2048,欢迎大家加入2048 AI社区

更多推荐