发布日期:2024-11-04 15:18 点击次数:158
图片
前言Hello小伙伴们大家好,我是生信技能树的小学徒”我才不吃蛋黄“。今天是胃癌单细胞数据集GSE163558复现系列第六期。第五期我们通过绘制饼图、堆积柱状图、气泡图等,比较不同分组之间细胞比例差异。本期,我们将下载胃癌bulk数据(TCGA-STAD),筛选胃癌相关差异表达基因,根据差异基因list,使用AddModuleScore_UCell函数给上皮细胞亚群Seurat对象添加恶性评分,以此协助区分正常上皮和恶性上皮。
1.背景介绍众所周知,胃癌细胞是胃上皮细胞发生了恶性病变。在第三期对细胞分群注释时,我们并未对上皮细胞进行细分,这里面就包括非恶性上皮和恶性上皮细胞。那么该如何识别恶性上皮细胞?本文作者根据TCGA数据库中胃癌和正常胃组织之间的差异表达基因,定义了每个上皮细胞的恶性和非恶性评分,从而区分恶性上皮和非恶性上皮细胞。当然,我们还可以利用copykat和infercnv从单细胞转录组数据推断肿瘤拷贝数变异,从而区分上皮恶性与否(后期更新,敬请期待)。
2.数据下载在正式分析之前,我们先从TCGA数据库获取胃癌患者的表达数据和临床数据。 首先加载工作目录,创建新的文件夹,加载R包:
rm(list=ls())getwd()setwd("/home/data/t020505/GSE163558-GC代码公众号复现版")dir.create("6-TCGA_STAD")setwd('6-TCGA_STAD/')source('../scRNA_scripts/mycolors.R')library(tidyverse)library(data.table) #如果需要下载TCGA其他癌种,请更换projproj = "TCGA-STAD"dir.create("input")
在R中利用代码一键下载TCGA数据库胃癌患者的表达数据及临床数据,并将下载好了数据存放在input文件夹:
if(T){ download.file(url = paste0("https://gdc.xenahubs.net/download/",proj, ".htseq_counts.tsv.gz"),destfile = paste0("input/",proj,".htseq_counts.tsv.gz")) ##表达数据 download.file(url = paste0("https://gdc.xenahubs.net/download/",proj, ".GDC_phenotype.tsv.gz"),destfile = paste0("input/",proj,".GDC_phenotype.tsv.gz")) ##临床数据 download.file(url = paste0("https://gdc.xenahubs.net/download/",proj, ".survival.tsv"),destfile = paste0("input/",proj,".survival.tsv")) ##生存数据}clinical = read.delim(paste0("input/",proj,".GDC_phenotype.tsv.gz"),fill = T,header = T,sep = "\t")surv = read.delim(paste0("input/",proj,".survival.tsv"),header = T) head(surv) #生存数据os和os.time
生存数据包括os和os.time:
图片
3.数据整理在下载完数据之后,我们需要对数据进行初步的整理。
3.1 处理表达矩阵首先处理表达矩阵:
dat = read.table(paste0("input/",proj,".htseq_counts.tsv.gz"),check.names = F,row.names = 1,header = T)dat <- data.table::fread(paste0("input/",proj,".htseq_counts.tsv.gz"), data.table = F) #全部是symbola = dat[60483:60488,]dat = dat[1:60483,]rownames(dat) = dat$Ensembl_IDa = data = a[,-1]exp = a
去除在所有样本里表达量都为零的基因:
exp = exp[rowSums(exp)>0,]nrow(exp)
仅保留在一半以上样本里表达的基因:
exp = exp[apply(exp, 1, function(x) sum(x > 0) > 0.5*ncol(exp)), ]nrow(exp)
表达矩阵行名ID转换:
library(stringr)head(rownames(exp))library(AnnoProbe)rownames(exp) = substr(rownames(exp), 1, 15)##rownames(exp) = str_split(rownames(exp),"\\.",simplify = T)[,1];head(rownames(exp))#annoGene(rownames(exp),ID_type = "ENSEMBL") re = annoGene(rownames(exp),ID_type = "ENSEMBL");head(re)#if(!require(tinyarray))devtools::install_github("xjsun1221/tinyarray",upgrade = FALSE,dependencies = TRUE)library(tinyarray)exp = trans_array(exp,ids = re,from = "ENSEMBL",to = "SYMBOL")exp[1:4,1:4]3.2 整理分组信息
然后整理分组信息:
根据样本ID的第14-15位,给样本分组(tumor和normal):
01A:癌症组织01B:福尔马林浸泡样本02A:复发组织06A:转移组织11A:癌旁组织table(str_sub(colnames(exp),14,15)) table(str_sub(colnames(exp),14,16)) Group = ifelse(as.numeric(str_sub(colnames(exp),14,15)) < 10,'tumor','normal') Group = factor(Group,levels = c("normal","tumor"))table(Group)group = make_tcga_group(exp)table(group)
保存数据:
save(exp,Group,proj,clinical,surv,file = paste0(proj,".Rdata"))4.数据分析4.1 TCGA差异分析
首先清除系统环境变量,加载TCGA数据,加载差异分析R包“edgeR”:
rm(list = ls())load("TCGA-STAD.Rdata")table(Group)library(edgeR)
开始进行胃癌和癌旁组织差异基因分析:
dge <- DGEList(counts=exp,group=Group)dge$samples$lib.size <- colSums(dge$counts)dge <- calcNormFactors(dge) design <- model.matrix(~Group)dge <- estimateGLMCommonDisp(dge, design)dge <- estimateGLMTrendedDisp(dge, design)dge <- estimateGLMTagwiseDisp(dge, design)fit <- glmFit(dge, design)fit <- glmLRT(fit) DEG2=topTags(fit, n=Inf)class(DEG2)DEG2=as.data.frame(DEG2)head(DEG2)
图片
添加change列标记基因上调下调,在这里,我们对差异分析筛选的阈值为:“| log2 fold change |>1, pvalue <0.05”,最终在胃癌中筛选到了2213个上调基因,142个下调基因。
logFC_t = 1pvalue_t = 0.05k1 = (DEG2$PValue < pvalue_t)&(DEG2$logFC < -logFC_t)k2 = (DEG2$PValue < pvalue_t)&(DEG2$logFC > logFC_t)DEG2$change = ifelse(k1,"DOWN",ifelse(k2,"UP","NOT"))head(DEG2)table(DEG2$change)
图片
图片
保存差异分析结果:
save(DEG2,Group,file = paste0(proj,"_DEG.Rdata"))4.2.差异基因可视化
差异基因可视化选择了热图和火山图:
library(ggplot2)library(tinyarray)dat = log2(cpm(exp)+1)pca.plot = draw_pca(dat,Group);pca.plotsave(pca.plot,file = paste0(proj,"_pcaplot.Rdata"))cg2 = rownames(DEG2)[DEG2$change !="NOT"]h2 = draw_heatmap(dat[cg2,],Group)v2 = draw_volcano(DEG2,pkg = 2,logFC_cutoff = logFC_t)library(patchwork)h2 + v2 +plot_layout(guides = 'collect') &theme(legend.position = "none")ggsave(paste0(proj,"_heat_vo.png"),width = 16,height = 9)
图片
4.3 提取上皮细胞亚群设置随机种子,读取第三期细胞分群注释后的单细胞文件:
set.seed(12345)sce.all=readRDS( "../3-Celltype/sce_celltype.rds")table(sce.all$celltype)
提取出其中的上皮细胞:
sce1 = sce.all[, sce.all$celltype %in% c( 'Epithelial' )]
然后对上皮细胞亚群走Seurat V5标准流程(同第二期):
names(sce1@assays$RNA@layers)sce1[["RNA"]]$counts LayerData(sce1, assay = "RNA", layer = "counts")dim(sce1[["RNA"]]$counts )as.data.frame(sce1@assays$RNA$counts[1:10, 1:2])head(sce1@meta.data, 10)table(sce1$orig.ident) sce = sce1sce <- NormalizeData(sce, normalization.method = "LogNormalize", scale.factor = 1e4)GetAssay(sce,assay = "RNA")sce <- FindVariableFeatures(sce, selection.method = "vst", nfeatures = 2000) sce <- ScaleData(sce) sce <- RunPCA(object = sce, pc.genes = VariableFeatures(sce)) DimHeatmap(sce, dims = 1:12, cells = 100, balanced = TRUE)ElbowPlot(sce)
图片
DimHeatmap图片
ElbowPlot分辨率设置为0.1:
sce <- FindNeighbors(sce, dims = 1:15)sce <- FindClusters(sce, resolution = 0.1)table(sce@meta.data$RNA_snn_res.0.1)
数据降维:
set.seed(321)sce <- RunTSNE(object = sce, dims = 1:15, do.fast = TRUE)sce <- RunUMAP(object = sce, dims = 1:5, do.fast = TRUE)
分群可视化:
p = DimPlot(sce,reduction = "tsne",label=T,cols = mycolors)
图片
然后保存上皮细胞rds文件:
saveRDS(sce,file = "Epithelial.rds")4.4 AddModuleScore_UCell函数恶性评分
AddModuleScore_UCell函数可以实现基因集合打分,通过评估单细胞数据集中特定基因特征的表达模式,我们可以量化生物学信号的强度,并鉴定细胞的亚型和状态。在胃癌组织中取上调排名前50的差异基因用作恶性评分,下调排名前50的差异基因用作正常评分。
#sce = readRDS('Epithelial.rds')head(DEG2)table(DEG2$change)UP = DEG2[DEG2$change == 'UP' & DEG2$FDR <0.01,]DOWN = DEG2[DEG2$change == 'DOWN' & DEG2$FDR <0.01,]##用作恶性评分sorted_up = UP[order(UP$FDR),]top_50_upgene = rownames(sorted_up[1:50, ]) ##用作正常评分sorted_down = DOWN[order(DOWN$FDR),]top_50_downgene = rownames(sorted_down[1:50, ])library(UCell)marker <- list(top_50_upgene,top_50_downgene)#将基因整成listnames(marker)[1] <- 'tumor'names(marker)[2] <- 'normal'score <- AddModuleScore_UCell(sce,features=marker)
计算每个上皮细胞的恶性评分减去非恶性评分,并将评分从小到大排序以拟合生长曲线。然后,根据生长曲线拐点附近的最大间隙,将细胞分为得分较高的恶性细胞(≥0.02)或得分较低的非恶性细胞(≤0.02)。
sce2 = scoredf = sce2@meta.datadf$score = df$tumor_UCell - df$normal_UCelltable(df$score)dim(sce)df$score1 = 'unknown'df[df$score > -0.02,]$score1 = 'malignant'table(df$score1)df[df$score < -0.02,]$score1 = 'nonmalignant'table(df$score1)sce2@meta.data = dfDimPlot(sce2, reduction = "tsne",cols = mycolors,pt.size = 0.8, group.by = "score1",label = T,label.box = T)
图片
然后绘制气泡图和堆积性柱状图可视化分组细胞比例(同第五期 胃癌单细胞数据集GSE163558复现(五):细胞比例):
tb=table(sce2$tissue, sce2$score1)head(tb)library (gplots) balloonplot(tb)
图片
bar_data <- as.data.frame(tb)bar_per <- bar_data %>% group_by(Var1) %>% mutate(sum(Freq)) %>% mutate(percent = Freq / `sum(Freq)`)head(bar_per) #write.csv(bar_per,file = "celltype_by_group_percent.csv")col =c("#3176B7","#F78000","#3FA116","#CE2820","#9265C1", "#885649","#DD76C5","#BBBE00","#41BED1")colnames(bar_per)library(ggthemes)p1 = ggplot(bar_per, aes(x = percent, y = Var1)) + geom_bar(aes(fill = Var2) , stat = "identity") + coord_flip() + theme(axis.ticks = element_line(linetype = "blank"), legend.position = "top", panel.grid.minor = element_line(colour = NA,linetype = "blank"), panel.background = element_rect(fill = NA), plot.background = element_rect(colour = NA)) + labs(y = " ", fill = NULL)+labs(x = 'Relative proportion(%)')+ scale_fill_manual(values=col)+ theme_few()+ theme(plot.title = element_text(size=12,hjust=0.5))p1
图片
p1tb=table(sce2$sample, sce2$score1)head(tb) malignant nonmalignant adjacent_nontumor 84 92 GC 2805 4 GC_liver_metastasis 39 0 GC_lymph_metastasis 82 0 GC_ovary_metastasis 578 0 GC_peritoneum_metastasis 258 0library (gplots) balloonplot(tb)
图片
bar_data <- as.data.frame(tb)bar_per <- bar_data %>% group_by(Var1) %>% mutate(sum(Freq)) %>% mutate(percent = Freq / `sum(Freq)`)head(bar_per) colnames(bar_per)library(ggthemes)p2 = ggplot(bar_per, aes(x = percent, y = Var1)) + geom_bar(aes(fill = Var2) , stat = "identity") + coord_flip() + theme(axis.ticks = element_line(linetype = "blank"), legend.position = "top", panel.grid.minor = element_line(colour = NA,linetype = "blank"), panel.background = element_rect(fill = NA), plot.background = element_rect(colour = NA)) + labs(y = " ", fill = NULL)+labs(x = 'Relative proportion(%)')+ scale_fill_manual(values=col)+ theme_few()+ theme(plot.title = element_text(size=12,hjust=0.5)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))p2p1+p2ggsave(filename="prop.pdf",width = 12,height = 8)saveRDS(sce2,file = "tumorEpithelial.rds")
图片
p2图片
p1+p24.5 提取恶性上皮亚群在添加恶性评分之后,我们可以在从上皮细胞中提取恶性上皮亚群,继续走Seurat V5标准流程:
table(sce2$score1)tumor = sce2[, sce2$score1 %in% c( 'malignant' )]set.seed(1234567)tumor[["RNA"]]$counts LayerData(tumor, assay = "RNA", layer = "counts")tumor <- JoinLayers(tumor)dim(tumor[["RNA"]]$counts )as.data.frame(tumor@assays$RNA$counts[1:10, 1:2])head(tumor@meta.data, 10)table(tumor$orig.ident) tumor <- NormalizeData(tumor, normalization.method = "LogNormalize", scale.factor = 1e4)GetAssay(tumor,assay = "RNA")tumor <- FindVariableFeatures(tumor, selection.method = "vst", nfeatures = 2000) tumor <- ScaleData(tumor) tumor <- RunPCA(object = tumor, pc.genes = VariableFeatures(tumor)) DimHeatmap(tumor, dims = 1:12, cells = 100, balanced = TRUE)ElbowPlot(tumor)
图片
tumor <- FindNeighbors(tumor, dims = 1:15)tumor <- FindClusters(tumor, resolution = 0.1)table(tumor@meta.data$RNA_snn_res.0.1) set.seed(321)tumor <- RunTSNE(object = tumor, dims = 1:15, do.fast = TRUE)tumor <- RunUMAP(object = tumor, dims = 1:5, do.fast = TRUE)mycolors <-c('#E64A35','#4DBBD4' ,'#01A187' ,'#3C5588' ,'#F29F80' , '#8491B6','#91D0C1','#7F5F48','#AF9E85','#4F4FFF','#CE3D33', '#739B57','#EFE685','#446983','#BB6239','#5DB1DC','#7F2268','#6BD66B','#800202','#D8D8CD','pink')p2 = DimPlot(tumor,reduction = "tsne",label=T,cols = mycolors)+ theme_few()p2
图片
p2保存恶性上皮细胞rds文件
table(tumor$seurat_clusters) 0 1 2 3 4 2275 579 482 314 196 tumor$celltype = paste0('G',tumor$seurat_clusters)table(tumor$celltype) G0 G1 G2 G3 G4 2275 579 482 314 196 saveRDS(tumor,file = "malignant.rds")4.6 富集分析
提取恶性上皮细胞之后,我们接着从单细胞层面筛选出了特征基因,并在此基础上进行了富集分析。
首先读取恶性上皮细胞rds文件,加载R包:
#sce =readRDS('malignant.rds') sce = tumor###热图表示GO富集分析library(grid)library(ggplot2)library(gridExtra)library(clusterProfiler)library(org.Hs.eg.db)library(enrichplot)library(ggplot2)library(org.Hs.eg.db)library(GOplot)
然后使用Seurat内置函数FindAllMarkers筛选特征基因:
markers <- FindAllMarkers(object = sce, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25)write.csv(markers,file='markers.csv')#markers = read.csv('markers.csv',row.names = 1)table(markers$cluster)library(dplyr) gene = markers[markers$p_val <0.01 & markers$avg_log2FC >2,]$geneentrezIDs = bitr(gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb= "org.Hs.eg.db", drop = TRUE)gene<- entrezIDs$ENTREZIDmarker_new = markers[markers$gene %in% entrezIDs$SYMBOL,]marker_new = marker_new[!duplicated(marker_new$gene),]identical(marker_new$gene, entrezIDs$SYMBOL)p = identical(marker_new$gene, entrezIDs$SYMBOL);pif(!p) entrezIDs = entrezIDs[match(marker_new$gene,entrezIDs$SYMBOL),]marker_new$ID = entrezIDs$ENTREZID
GO富集分析:
gcSample=split(marker_new$ID, marker_new$cluster) ###参数可以更改,看看?compareClusterxx <- compareCluster(gcSample, fun = "enrichGO", OrgDb = "org.Hs.eg.db", ont = "BP", pAdjustMethod = "BH", pvalueCutoff = 0.01, qvalueCutoff = 0.05)p <- dotplot(xx)library(ggthemes)p + theme(axis.text.x = element_text( angle = 45, vjust = 0.5, hjust = 0.5))+theme_few()
图片
还可以使用热图展示GO富集:
a = xx@compareClusterResultcolnames(a)a$p = -log10(a$pvalue)b = a[,c(1,3,11)]library(reshape2)colnames(b)library(tidyr)wide_data <- pivot_wider(b, names_from = Cluster, values_from = p)wide_data = as.data.frame(wide_data)rownames(wide_data) = wide_data$Descriptionwide = wide_data[,-1]wide = wide[c(1:13),]rownames(wide) <- sapply(rownames(wide), function(x) paste0(substr(x, 1, 40), "..."))wide = as.matrix(wide)library(pheatmap)B <- pheatmap(wide, show_rownames = T, show_colnames = T, col = colorRampPalette(c("white","firebrick3"))(255), annotation_names_row = T,#不显示行注释信息 annotation_names_col = T ,#不显示列注释信息 column_title = NULL,#不显示列标题 cluster_rows = F, cluster_cols = F, scale = 'column', row_title = NULL)#不显示行标题scale = "row"B
图片
B结语本期,我们根据TCGA数据库中胃癌和正常胃组织之间的差异表达基因,定义了每个上皮细胞的恶性和非恶性评分。下一期,我们将在本期基础上,分析恶性上皮细胞G0-G4的Marker基因并绘制热图和小提琴图,此外,我们还将使用AddModuleScore_UCell函数计算细胞的增殖和迁移评分。干货满满,欢迎大家持续追更,谢谢!
图片
本站仅提供存储服务,所有内容均由用户发布,如发现有害或侵权内容,请点击举报。上一篇:RNAseq|Mime代码版-终极101 种机器学习算法组合构建最优预后模型
下一篇:没有了