• 最新版本: 1.9.0

  • 更新日期: 20260208

  • 可以对着视频教程学习和使用:R包fastGEO-GEO数据库快速下载探针注释

  • 没有安装包的需要付费获取安装包:

    • 直接添加微信转账100, 发送fastR和fastGEO的压缩包
    • 以前是20, 老用户依然可免费获取
    • 很多资料后面都会慢慢涨价, 早买会便宜很多!
    • 购买后可进入交流群,后续版本更新和问题答疑都在群里进行

R包介绍

为什么开发这个包

  • 你在下载和分析GEO数据库或者使用GEOquery包时是否遇到以下问题?

    • getGEO总是下载文件失败
    • 网页中series文件或者补充文件下载总是失败
    • 无法确定表达矩阵是否已经log化
    • 不会进行探针ID到基因SYBBOL的注释转换
    • GPL注释文件下载总是中断
    • GPL注释文件不包含基因SYMBOL
    • GPL注释文件的SYMBOL存在一对多和多对一的情况
    • GPL注释文件仅能查看但没有下载按钮
    • 下载的表达谱基因包含日期
    • 检索到很多GSE数据集, 但是逐个看太费时间
  • 如果有以上这些问题, 那么我开发的R包fastGEO大概率能解决你的问题!

实现的功能

  • 一行代码实现对芯片表达谱数据的下载、标准化、样本间矫正、探针注释、去除重复基因、恢复日期基因等等操作
  • 最终你会直接得到行名是基因列名是样本的标准化之后的表达谱, 以及对应的样本临床信息
  • 探针注释是fastGRO开发的主要目的, 可以实现对多种情况的GPL平台进行下载和处理
  • 对于文件下载失败的问题提供多种解决办法
  • 提供简单的下游分析和可视化快捷函数, 如PCA、差异分析、火山图、热图、差异箱线图、去批次等分析
  • 支持多个数据集批量进行下载与注释
  • 快速执行PCA、差异分析、火山图、热图, 展示合并图, 快速了解数据情况
  • V1.9更新自动检索数据集并提取数据集信息, 结合AI极大地提高数据集筛选效率

适用场景

  • 首先, GEO网站在国外, 下载网速受限制是难以避免的, 会科学上网会更好一点, 但偶尔还是下载失败, 需要多尝试
  • fastGEO下载表达谱是针对于芯片数据, 不支持高通量数据, 因为NGS数据不存放在series里, 而是在补充文件里
  • 探针注释能够成功的前提是GPL平台要提供了注释文件, 并且文件包含基因ID信息
    • SYMBOL/ENSEMBL/ENTREZ/REFSEQ等等都可以
    • 如果没有提供基因信息, 那也不能无中生有
  • 对于任何测序类型, 都支持一行代码获取样本临床信息

功能演示

  • 安装在下一节介绍
rm(plot_boxplot)
Warning message in rm(plot_boxplot):
"object 'plot_boxplot' not found"
# 加载R包
library(fastGEO)
Loading required package: fastR

Loading required package: ggplot2

Warning message:
"package 'ggplot2' was built under R version 4.4.3"
Loading required package: GEOquery

Loading required package: Biobase

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, table, tapply, union, unique, unsplit, which.max, which.min


Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for
    packages 'citation("pkgname")'.


Setting options(‘download.file.method.GEOquery’=‘auto’)

Setting options('GEOquery.inmemory.gpl'=FALSE)

Loading required package: limma


Attaching package: ‘limma’


The following object is masked from ‘package:BiocGenerics’:

    plotMA


The following object is masked from ‘package:fastR’:

    strsplit2


Loading required package: AnnoProbe

AnnoProbe v 0.1.0  welcome to use AnnoProbe!
If you use AnnoProbe in published research, please acknowledgements:
We thank Dr.Jianming Zeng(University of Macau), and all the members of his bioinformatics team, biotrainee, for generously sharing their experience and codes.

Loading required package: stringr

Loading required package: httr


Attaching package: ‘httr’


The following object is masked from ‘package:Biobase’:

    content


Loading required package: XML

Loading required package: openxlsx

Warning message:
"package 'openxlsx' was built under R version 4.4.2"
Loading required package: dplyr


Attaching package: ‘dplyr’


The following object is masked from ‘package:Biobase’:

    combine


The following objects are masked from ‘package:BiocGenerics’:

    combine, intersect, setdiff, union


The following object is masked from ‘package:fastR’:

    case_when


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: rvest


========================================
欢迎使用 fastGEO (版本 1.9.0)
- 作者: 生信摆渡
- 微信: bioinfobaidu1
- 帮助文档: help(package = fastGEO)
========================================


完整流程

# 为了演示, 先将处理好的文件删除
final_file = "test/00_GEO_data_GSE18105/GSE18105_GPL570_annoted.RData"
if(file.exists(final_file)) unlink(final_file) 
# invisible(sapply(list.files("../../functions", full.names = T), source))
# 下载数据
obj = download_GEO("GSE18105", out_dir = "test/00_GEO_data_GSE18105")
>>>>> 识别到本地注释文件, 将优先使用!

>>>>> getGEO2已经执行过, 将跳过, 设置'force_getGEO2 = TRUE'强制重新运行!

>>>>> 执行探针-基因注释, 平台: GPL570 ...

>>>>> 使用R包内置的注释文件...

>>>>> 去除重复的基因注释, 使用方法: max...

>>>>> 处理结束!

# 获取分组
# 注意这里尽量先指定实验组再指定对照组, 因为后续分析在版本1.8.0之后会默认将第一个分组作为实验组进行分析
group_list = get_GEO_group(obj, group_name = "characteristics_ch1.1", "cancer, homogenized" = "Cancer", "normal, homogenized" = "Normal")
group = group_list$group 
expM = group_list$expM
SID = group_list$SID
table(group)
group
Cancer Normal 
    17     17 
# PCA
set_image(6, 6) # 仅在jupyter里使用, 其他软件可忽略
run_PCA(expM, group)
Warning message:
"[1m[22mUsing `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
[36mℹ[39m Please use `linewidth` instead."

# 加载本地函数, 用于我做测试, 用户不用运行
# invisible(sapply(list.files("../../functions", full.names = T), source))
# 差异分析
DEG_tb = run_DEG_limma(expM, group = group, Case = "Cancer", Control = "Normal", out_name = "test/00_GEO_data_GSE18105/GSE18105_DEG")
Warning message:
"Parameter 'Case' is deprecated since fastGEO 1.8.0, please use 'case' instead!"
Warning message:
"Parameter 'Control' is deprecated since fastGEO 1.8.0, please use 'control' instead!"
Number of Up regulated genes: 1006/2102(47.86%)

Number of Down regulated genes: 1096/2102(52.14%)

Number of Not Change genes: 20779/22881

  • 版本1.8.0之后参数名均改为小写了, 但是仍保留首字母大写的写法, 不会报错只会提醒你, 记得改一下吧
DEG_tb = run_DEG_limma(expM, group = group, case = "Cancer", control = "Normal", out_name = "test/00_GEO_data_GSE18105/GSE18105_DEG")
Number of Up regulated genes: 1006/2102(47.86%)

Number of Down regulated genes: 1096/2102(52.14%)

Number of Not Change genes: 20779/22881

  • 而且这个版本也不需要指定分组, 会默认将第一类分组作为实验组, 这在get_GEO_group的时候有提到过, 自己要清楚默认情况对不对
DEG_tb = run_DEG_limma(expM, group, out_name = "test/00_GEO_data_GSE18105/GSE18105_DEG")
Not specify 'case' and 'control', will detect automatically!

!!! Please make sure contrast: Cancer vs. Normal

Number of Up regulated genes: 1006/2102(47.86%)

Number of Down regulated genes: 1096/2102(52.14%)

Number of Not Change genes: 20779/22881

  • 也增加了差异基因的比例和结果报告的输出, 更方便写文章了
readLines("test/00_GEO_data_GSE18105/GSE18105_DEG_report.txt")

‘结果表明,与Normal对照组相比,Cancer组共有2,102个差异表达基因,其中上调有1,006(47.86%),下调有1,096(52.14%)’

# 加载本地函数, 用于我做测试, 用户不用运行
# invisible(sapply(list.files("../../functions", full.names = T), source))
# 火山图
plot_volcano_limma(DEG_tb, out_name = "test/00_GEO_data_GSE18105/GSE18105_volcano")



  • 版本1.8.0的火山图也更新了一点, 对指定标注的基因的点进行强调显示
  • 点的color和fill可以通过label.point.color和label.point.fill进行自定义
# 火山图
plot_volcano_limma(DEG_tb, out_name = "test/00_GEO_data_GSE18105/GSE18105_volcano", label.point.color = "green", label.point.fill = "yellow")



# 觉得标签的太挤?可通过xlims和ylims调整
plot_volcano_limma(DEG_tb, out_name = "test/00_GEO_data_GSE18105/GSE18105_volcano", xlims = c(-6,  6))



# 想自己指定基因?指定label即可
# 自定义label基因, 每组随机挑选5个
set.seed(1234)
gene_select = as.character(aggregate(rownames(DEG_tb), by = list(DEG_tb$DEG), function(x) sample(x, 5))[, 2])
gene_select
  1. 'CPM'
  2. 'STX16'
  3. 'TTC27'
  4. 'FFAR4'
  5. 'RP11-31K23.2'
  6. 'MCM2'
  7. 'GNA11'
  8. 'PIK3IP1'
  9. 'RFC3'
  10. 'CAPN9'
  11. 'TFPI'
  12. 'ASCC3'
  13. 'DMXL2'
  14. 'GK5'
  15. 'SLC2A1'
plot_volcano_limma(DEG_tb, out_name = "test/00_GEO_data_GSE18105/GSE18105_volcano", label = gene_select)



# 热图
plot_heatmap_DEG(expM, DEG_tb, group)



# detach("package:fastGEO", unload = TRUE)
# 查看帮助文档, 在RStudio里可以点击查看每一个函数的帮助文档
help(package = fastGEO)
Documentation for package 'fastGEO'


Information on package ‘fastGEO’

Description:

Package:            fastGEO
Title:              Quickly download and annotate GEO data chip datasets
Version:            1.9.0
Authors@R:          person("Jiahao", "Wang", , "wjhyyds@sina.cn", role = c("aut", "cre"), comment = c(ORCID = "YOUR-ORCID-ID"))
Description:        You can use the fastqGEO to easily download GEO chip expression profiles and clinical data and automate probe-gene
                    annotation.
License:            `use_mit_license()`, `use_gpl3_license()` or friends to pick a license
Encoding:           UTF-8
Roxygen:            list(markdown = TRUE)
RoxygenNote:        7.3.2
Depends:            R (>= 3.5.0), fastR (>= 1.5.0), ggplot2 (>= 3.5.2), GEOquery (>= 2.72.0), limma (>= 3.60.4), AnnoProbe (>= 0.1.0),
                    stringr (>= 1.5.1), httr (>= 1.4.7), XML (>= 3.99.0.17), openxlsx (>= 4.2.7.1), dplyr (>= 1.1.4), rvest (>= 1.0.4)
Imports:            data.table, ggplot2, GEOquery, limma, readr
Suggests:           org.Hs.eg.db, org.Mm.eg.db, clusterProfiler
LazyData:           true
NeedsCompilation:   no
Packaged:           2026-02-08 09:25:42 UTC; Jia
Author:             Jiahao Wang [aut, cre] (YOUR-ORCID-ID)
Maintainer:         Jiahao Wang <wjhyyds@sina.cn>
RemoteType:         local
RemoteUrl:          W:\02_Study\R_build\build\02_fastGEO\version\1.9.0\fastGEO_1.9.0.tar.gz
Built:              R 4.4.1; ; 2026-02-08 09:25:44 UTC; windows

Index:

analyse_GEO                                  Comprehensive GEO Expression Data Analysis Pipeline
anno_GEO                                     Annotate GEO Expression Matrix
anno_GEO_AnnoProbe                           Annotate GEO Data Using AnnoProbe Package
anno_GEO_online                              Annotate GEO Data Using Online Annotation
anno_NGS                                     Annotate NGS Expression Matrix
anno_obj                                     GEO Platform Annotation Reference Object
check_expr_log_normalized                    Automatically detect and normalize log transformation status of expression profile
count_to_exp                                 Convert Counts to Normalized Expression
data_date_to_gene                            Convert table for date to gene convertion
date_to_gene                                 Convert Date-Based IDs to Gene Symbols
download_GEO                                 Download and Process GEO Dataset
download_GEO_file                            Download GEO Files (Series Matrix or Supplementary Files)
download_GEO_multi                           Download and Process GEO Dataset multiply
download_GEO_series_multi                    Download GEO series multiply
extract_GPL_anno                             Extract Gene Symbol Annotations from GEO Platform (GPL) Data
extract_GSE_info_single                      Extract Metadata from Single GSE HTML File
getGEO2                                      Download and Process GEO Dataset
get_GEO_group                                Create Sample Groups from GEO Data
get_GEO_pheno                                Get GEO Phenotype Data
kill_curl                                    Force Terminate All curl Processes
plot_boxplot                                 Create Boxplots for Gene Expression
plot_heatmap_DEG                             Heatmap of DE Genes
plot_volcano_limma                           Volcano Plot for DE Results
read_GPL                                     Read GEO Platform (GPL) File
read_GPL_file                                Read and Process GPL Annotation File
read_GPL_url                                 Download and Process GPL Annotation from GEO Website
read_gds_result                              Read GSE Accessions from GDS Result TXT
run_ComBat                                   Batch Effect Correction with ComBat
run_DEG_deseq2                               Differential Expression Analysis with DESeq2
run_DEG_limma                                Differential Expression Analysis with limma
run_PCA                                      Perform and Visualize Principal Component Analysis
run_fastGEO_app                              启动 fastGEO 的 Shiny 应用
search_GEO                                   Search GEO, Retrieve GSE Accessions, and Extract Metadata
table_GEO_clinical                           Tabulate Clinical Characteristics from GEO Sample Annotations
  • 每一个函数的帮助文档可以使用?函数名进行自行查看, 比如?download_GEO

analyse_GEO

  • 快速执行PCA、差异分析、火山图、热图, 展示合并图, 快速了解数据情况
# 加载本地函数, 用于我做测试, 用户不用运行
# invisible(sapply(list.files("../../functions", full.names = T), source))
set_image(15, 6)
analyse_GEO(expM, group, out_name = "test/00_GEO_data_GSE18105/GSE18105_pipeline")
Not specify 'case' and 'control', will detect automatically!

!!! Please make sure contrast: Cancer vs. Normal

Step1. PCA ...

Step2. DEG ...

Number of Up regulated genes: 1006/2102(47.86%)

Number of Down regulated genes: 1096/2102(52.14%)

Number of Not Change genes: 20779/22881

Step3. volcano ...

Step4. heatmap ...



download_GEO

  • 主函数, 输入GSE号, 实现自动化下载和处理数据
# 查看参数
args(download_GEO)
function (GSE, out_dir = "00_GEO_data", anno_file = NULL, method = "max", 
    timeout = 3600 * 24, download_method = "getGEO", force = FALSE, 
    force_getGEO2 = FALSE, symbol_name = NULL) 
NULL
  • GSE: GSE号
  • out_dir: 文件保存目录, 默认为当前目录的00_GEO_data文件夹, 可自行修改
  • anno_file: 注释文件, 现在已经内置了注释文件, 可以不用指定
  • method: 重复基因的处理方法
    • max: 默认, 取最大表达量做为该基因的表达水平, 相对较快且合理
    • mean: 取均值, 太慢
    • remove: 直接去除重复基因, 最快, 但不推荐
    • sum: 求和, 用的比较少, 暂不推荐
  • download_method: 下载文件的方式, 如果存在本地curl则用本地的, 不存在则使用getGEO的curl
  • V1.9更新: download_method默认为getGEO, 即不使用curl进行下载, 因为GEO数据库不允许这样了
  • symbol_name: 注释文件中含有基因信息的那一列的列名
# 这种下载文件的方式可能也比较慢, 看运气
# 如果下载失败, 看下面的download_GEO_file函数
obj = download_GEO("GSE294904", out_dir = "test/00_GEO_data_GSE294904")
>>>>> 访问数据集: GSE294904...

>>>>> 下载尝试次数: 1 ... FALSE

下载成功!

>>>>> 处理测序平台: GPL570

>>>>> 识别到6个样本, 34个注释信息

>>>>> 输出样本注释表至: test/00_GEO_data_GSE294904/GSE294904_GPL570_sample_anno.csv


Attaching package: ‘xlsx’


The following objects are masked from ‘package:openxlsx’:

    createWorkbook, loadWorkbook, read.xlsx, saveWorkbook, write.xlsx


>>>>> 识别到数据未log化, 执行log(expM+1)

>>>>> 执行样本间矫正...

>>>>> 处理成功文件保存至: test/00_GEO_data_GSE294904/GSE294904_GPL570_raw.RData

>>>>> 执行探针-基因注释, 平台: GPL570 ...

>>>>> 使用R包内置的注释文件...

>>>>> 去除重复的基因注释, 使用方法: max...

>>>>> 处理结束!

# 结果包含以下文件
cbind(dir("test/00_GEO_data_GSE294904/"))
A matrix: 10 × 1 of type chr
GSE294904.RData
GSE294904_GPL570.RData
GSE294904_GPL570_annoted.RData
GSE294904_GPL570_expM.csv
GSE294904_GPL570_expM_annoted.csv
GSE294904_GPL570_Normalization.pdf
GSE294904_GPL570_raw.RData
GSE294904_GPL570_sample_anno.csv
GSE294904_GPL570_sample_anno.xlsx
GSE294904_series_matrix.txt.gz
  • GSE294904_GPL570.RData是最终保存的文件, 也是返回的obj的内容, 可以load进行使用
  • sample_anno是样本临床信息, 可打开快速浏览和筛选样本
# 保存的对象
names(obj)
  1. 'GPL'
  2. 'expM'
  3. 'expM_raw'
  4. 'sample_anno'
obj$GPL

‘GPL570’

# 完美的表达谱
hd(obj$expM)
Type: data.frame  Dim: 22881 × 6 
         GSM8928113 GSM8928114 GSM8928115 GSM8928116 GSM8928117
A1BG       7.664415   7.834532   8.045193   8.715155   8.702467
A1BG-AS1   4.182379   6.486369   3.594159   4.888389   5.456329
A1CF       3.311157   6.515058   6.397253   7.288225   6.671245
A2M        5.580024   6.934858   5.895105   6.041618   6.338498
A2M-AS1    2.398807   5.313860   5.330637   5.697162   5.843399
boxplot(obj$expM[, 1:5])



# 样本信息
hd(obj$sample_anno)
Type: data.frame  Dim: 6 × 34 
                         title geo_accession                status submission_date last_update_date
GSM8928113 NCI-ADR-RES control    GSM8928113 Public on Apr 30 2025     Apr 17 2025      Apr 30 2025
GSM8928114 NCI-ADR-RES treated    GSM8928114 Public on Apr 30 2025     Apr 17 2025      Apr 30 2025
GSM8928115         GCS control    GSM8928115 Public on Apr 30 2025     Apr 17 2025      Apr 30 2025
GSM8928116         GCS treated    GSM8928116 Public on Apr 30 2025     Apr 17 2025      Apr 30 2025
GSM8928117       asGCS control    GSM8928117 Public on Apr 30 2025     Apr 17 2025      Apr 30 2025
  • 如果提供的数据集是高通量数据, 则只会返回样本注释, 表达谱需要自己去补充文件里下载处理
sample_anno = download_GEO("GSE173955")
>>>>> 访问数据集: GSE173955...

>>>>> 下载尝试次数: 1 ... FALSE

下载成功!

>>>>> 处理测序平台: GPL18460

>>>>> 识别到18个样本, 43个注释信息

>>>>> 输出样本注释表至: 00_GEO_data/GSE173955_GPL18460_sample_anno.csv

>>>>> 未识别到表达谱, 提供的数据集可能是高通量测序, 请确认!

    https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE173955

>>>>> 高通量数据的表达谱请自行从补充文件下载处理!

>>>>> 将仅返回样本注释表!

hd(sample_anno)
Type: data.frame  Dim: 18 × 43 
                                                title geo_accession                status submission_date last_update_date
GSM5283449 Alzheimer's Disease_Biological replicate 1    GSM5283449 Public on Dec 23 2021     May 05 2021      Dec 23 2021
GSM5283450 Alzheimer's Disease_Biological replicate 2    GSM5283450 Public on Dec 23 2021     May 05 2021      Dec 23 2021
GSM5283451 Alzheimer's Disease_Biological replicate 3    GSM5283451 Public on Dec 23 2021     May 05 2021      Dec 23 2021
GSM5283452 Alzheimer's Disease_Biological replicate 4    GSM5283452 Public on Dec 23 2021     May 05 2021      Dec 23 2021
GSM5283453 Alzheimer's Disease_Biological replicate 5    GSM5283453 Public on Dec 23 2021     May 05 2021      Dec 23 2021

download_GEO_file

  • 专门用来下载文件的
  • 如果download_GEO下载series文件失败, 可先尝试这种方式, 然后再运行download_GEO
  • 还可以用来下载所有的补充文件或其中一个
# 如果download_GEO下载文件失败, 可下面这样专门下载series文件
download_GEO_file("GSE294904", out_dir = "test/00_GEO_data_GSE294904/")
All file already downloaded in test/00_GEO_data_GSE294904/ 
# 显示已经下载好了, 因为上面的download_GEO我下载成功了, 嘿嘿
# 我们可以换个输出目录再演示一下
download_GEO_file("GSE294904", out_dir = "test/00_GEO_data_GSE294904-2/")
All file already downloaded in test/00_GEO_data_GSE294904-2/ 
# 结果包含以下文件
cbind(dir("test/00_GEO_data_GSE294904-2/"))
A matrix: 1 × 1 of type chr
GSE294904_series_matrix.txt.gz
# 有时候还是比较慢的, 很离谱, 不过直接在浏览器下载却出奇的快
# 所以也支持跳转到浏览器进行下载, 逆天!
# 设置 method = "browser" 就行了
# 同时也可以设置输出目录, 程序会自动粘贴输出目录, 然后在浏览器里直接复制这个目录, 进行保存文件
# 所以浏览器的文件下载要设置每次询问下载位置
download_GEO_file("GSE294904", out_dir = "test/00_GEO_data_GSE294904-4/", method = "browser")
All file already downloaded in test/00_GEO_data_GSE294904-4/ 
# 很快啊, 几秒就完成了
cbind(dir("test/00_GEO_data_GSE294904-3/"))
A matrix: 2 × 1 of type chr
GSE294904_RAW.tar
GSE294904_series_matrix.txt.gz
# 也可以用来下载补充文件
download_GEO_file(GSE = "GSE294904", 
                  file = "GSE294904_RAW.tar", 
                  out_dir = "test/00_GEO_data_GSE294904-3/", 
                  method = "browser")
All file already downloaded in test/00_GEO_data_GSE294904-3/ 

download_GEO_series_multi

  • 批量从浏览器下载series文件, 因为浏览器下载比在R里下载要稳定很多很多, 所以又提供了这个快捷函数
# 在浏览器下载series文件
datasets = c("GSE179285", "GSE75214", "GSE87466")
download_GEO_series_multi(datasets, out_dir = "test/download_GEO_series_multi")
All file already downloaded in test/download_GEO_series_multi 
All file already downloaded in test/download_GEO_series_multi 
All file already downloaded in test/download_GEO_series_multi 

download_GEO_multi

  • 同样的, 可以提供多个GSE批量下载、预处理、注释, 其中下载建议采用上面的浏览器的方法
  • 另一个能正常运行的重要前提就是注释这一步能够通过, 如果不能处理注释的话, 还是需要使用后续的一系列的方法先解决然后再次批量运行
download_GEO_multi(datasets, out_dir = "test/download_GEO_series_multi")
>>>>> 准备并行下载...

  • 结果很快就出来了, 这里不是循环处理, 而是并行处理, 同时对每个数据集执行download_GEO操作, 速度快很多!

  • 后面的V1.8.0就没有更新的内容了, 跟老版本一样

  • 更新维护这个R包真的消耗我巨大的精力, 但是看到能让数据分析方便很多很多, 成就感还是满满的!

get_GEO_pheno

  • 快速获取样本临床信息
  • V1.9之后, 可以舍弃这个函数了, 主函数download_GEO将兼容这个功能!
# 芯片数据
sample_anno = get_GEO_pheno("GSE294904", out_dir = "test/00_GEO_data_GSE294904")
>>>>> 访问数据集: GSE294904...

>>>>> 下载尝试次数: 1 ... FALSE

下载成功!

>>>>> 处理测序平台: GPL570

>>>>> 识别到6个样本, 34个注释信息

>>>>> 输出样本注释表至: test/00_GEO_data_GSE294904/GSE294904_GPL570_sample_anno.csv

>>>>> 识别到数据未log化, 执行log(expM+1)

>>>>> 执行样本间矫正...

>>>>> 处理成功文件保存至: test/00_GEO_data_GSE294904/GSE294904_GPL570_raw.RData

hd(sample_anno)
Type: NULL  Dim: 0 
NULL
# 其他测序类型也支持
sample_anno = get_GEO_pheno("GSE121950", out_dir = "test/00_GEO_data_GSE121950")
>>>>> 访问数据集: GSE121950...

>>>>> 下载尝试次数: 1 ... FALSE

下载成功!

>>>>> 处理测序平台: GPL11154

>>>>> 识别到6个样本, 41个注释信息

>>>>> 输出样本注释表至: test/00_GEO_data_GSE121950/GSE121950_GPL11154_sample_anno.csv

>>>>> 未识别到表达谱, 提供的数据集可能是高通量测序, 请确认!

    https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121950

>>>>> 高通量数据的表达谱请自行从补充文件下载处理!

>>>>> 将仅返回样本注释表!

hd(sample_anno)
Type: data.frame  Dim: 6 × 41 
                                                title geo_accession                status submission_date last_update_date
GSM3450401               villus_induced abortion_rep1    GSM3450401 Public on Nov 27 2018     Oct 29 2018      Nov 27 2018
GSM3450402               villus_induced abortion_rep2    GSM3450402 Public on Nov 27 2018     Oct 29 2018      Nov 27 2018
GSM3450403               villus_induced abortion_rep3    GSM3450403 Public on Nov 27 2018     Oct 29 2018      Nov 27 2018
GSM3450404 villus_recurrent spontaneous abortion_rep1    GSM3450404 Public on Nov 27 2018     Oct 29 2018      Nov 27 2018
GSM3450405 villus_recurrent spontaneous abortion_rep2    GSM3450405 Public on Nov 27 2018     Oct 29 2018      Nov 27 2018

getGEO2

  • 用于下载和预处理数据的函数

  • 不过这个函数不需要直接使用, 因为download_GEO和get_GEO_pheno会进行自动调用

  • 这里主要讲解下过程和原理

  • getGEO2有默认的下载方式, 其中curl是更推荐的一种方式, 但仍存在超时的情况, V1.9之后已舍弃curl下载的功能

  • 所以我这里优先使用本地的curl进行下载, 如果你还没有安装则需安装一下, 并添加系统环境变量

  • 下载完后也会自动保存样本临床信息

  • 自动判断是否需要进行log2(expM + 1)处理

  • 自动进行样本间矫正

  • 最终返回临床信息和矫正后的表达谱, 但此时探针ID还没有进行注释

# 相当于 download_GEO 的中间步骤
obj = getGEO2("GSE294904", out_dir = "test/00_GEO_data_GSE294904")
>>>>> 访问数据集: GSE294904...

>>>>> 下载尝试次数: 1 ... FALSE

下载成功!

>>>>> 处理测序平台: GPL570

>>>>> 识别到6个样本, 34个注释信息

>>>>> 输出样本注释表至: test/00_GEO_data_GSE294904/GSE294904_GPL570_sample_anno.csv

>>>>> 识别到数据未log化, 执行log(expM+1)

>>>>> 执行样本间矫正...

>>>>> 处理成功文件保存至: test/00_GEO_data_GSE294904/GSE294904_GPL570_raw.RData

obj$GPL
NULL
hd(obj$expM)
Type: NULL  Dim: 0 
NULL
hd(obj$sample_anno)
Type: NULL  Dim: 0 
NULL

read_GPL 系列

  • 下载完之后就需要进行探针ID注释了, 这里是最麻烦的一步, 因为探针注释文件格式千变万化 -_-
  • 所以第一步就是下载和读取GPL注释文件
  • 这几个函数也是提供给大家, 可以自行构建注释表, 因为总注释表我更新肯定不及时, 而且平台太多了, 总是遇到没有注释过的平台
read_GPL
  • 直接读取本地的GPL文件, 这种情况前提是你要能下载下来
  • 先不做处理
# https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570
# 下载至 test/GPL570/GPL570-55999.txt
GPL570_raw = read_GPL("test/GPL570/GPL570-55999.txt")
hd(GPL570_raw)
Type: data.frame  Dim: 54675 × 16 
         ID GB_ACC SPOT_ID Species Scientific Name Annotation Date
1 1007_s_at U48705                    Homo sapiens     Oct 6, 2014
2   1053_at M87338                    Homo sapiens     Oct 6, 2014
3    117_at X51757                    Homo sapiens     Oct 6, 2014
4    121_at X69699                    Homo sapiens     Oct 6, 2014
5 1255_g_at L36861                    Homo sapiens     Oct 6, 2014
extract_GPL_anno
  • 从读取好的注释数据框里提取探针ID列和包含基因名的列
  • 这是最麻烦的一点, 这一列有很多不同的写法, 我们总结了以下几类列名
colnames_SYMBOL = c("Gene symbol", "Gene Symbol", "GENE_SYMBOL", "GeneSymbol", "Symbol", "SYMBOL", "gene_assignment")
colnames_ENTREZ = c("ENTREZ_GENE_ID")
colnames_REFSEQ = c("GB_ACC")
colnames_ENSEMBL = c("GENE_ID", "SPOT_ID")
colnames_candicate = c(colnames_SYMBOL, colnames_ENTREZ, colnames_REFSEQ, colnames_ENSEMBL)
  • 只要存在这些列就说明是可以注释的, 只需要使用bitr函数进行ID转换就行了
load2("./test/00_GEO_data_GSE294904/GSE294904.RData")
Loading objects:
  GSE294904
GPL = GSE294904$GPL
GPL

‘GPL570’

expM_raw = GSE294904$expM_raw
hd(expM_raw)
Type: matrix  Dim: 5 × 5 
          GSM8928113 GSM8928114 GSM8928115 GSM8928116 GSM8928117
1007_s_at  10.014355  10.135966  11.008001  11.071630  10.268858
1053_at    11.633831  11.610780  11.346647  11.354608  10.919310
117_at      6.661878   6.857272   4.313413   3.496718   5.036547
121_at     10.278252  10.507507  10.323595  10.588387   9.398323
1255_g_at   2.666673   3.210481   3.041658   5.902110   2.303889
GPL570_anno = extract_GPL_anno(GPL570_raw)
	Found annotation column: Gene Symbol 
hd(GPL570_anno)
Type: data.frame  Dim: 54675 × 2 
         ID SYMBOL
1 1007_s_at   DDR1
2   1053_at   RFC2
3    117_at  HSPA6
4    121_at   PAX8
5 1255_g_at GUCA1A
  • 这样我们就得到了ID转换表了, 已经成功一大半了
read_GPL_file
  • 从本地GPL读取注释文件(read_GPL实现)
  • 随后使用extract_GPL_anno提取探针ID列和包含基因名的列
  • 返回包含两列的注释表: ID列和SYMBOL列
# 读取和提取一步到位
anno_tb = read_GPL_file("./test/GPL570/GPL570-55999.txt", out_dir = "test/GPL570")
	Processing: GPL570 ...
	Found annotation column: Gene Symbol 
	Annotation file Save to C:/software/R-4.4.1/library/fastGEO/data/anno_obj.RData 

hd(anno_tb[[1]])
Type: data.frame  Dim: 54675 × 2 
         ID SYMBOL
1 1007_s_at   DDR1
2   1053_at   RFC2
3    117_at  HSPA6
4    121_at   PAX8
5 1255_g_at GUCA1A
  • 成功生成注释表后, 将自动与总注释表进行合并
  • 这也是这里返回的是列表的原因, 因为总的注释数据就是一个列表, 每个元素就是一个平台的注释数据框
  • 合并完之后会自动保存到fastGEO的安装目录下的data目录
  • 下次就不需要再为这个平台准备注释文件了, 因为download_GEO 默认首先会从这里读取总注释数据
  • 所以自己遇到我还没整理的平台可以运行一下这几个函数, 成功后再次运行 download_GEO可能就好了
read_GPL_url
  • 很多小伙伴会遇到一个问题, 就是GPL注释文件虽然可以查看, 但是不能下载
  • 而且查看久了浏览器会卡死, 因为几万行的数据浏览器是扛不住的
  • 后来我发现这个查看的网页的注释内容是固定且完整的, 所以将网页下载下来并提取注释表格就行了
  • 嘿, 我真他娘的天才!还真可以!
  • 你只需要快速将网页的链接粘贴下来, 然后就能关闭网页了, 不然等会浏览器就崩溃了
  • 拿到链接之后直接扔给read_GPL_url就能获取处理完之后的注释表格了
  • 所以这里包含从HTML提取注释表格并执行extract_GPL_anno两个步骤
url = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?view=data&acc=GPL10379&id=5393&db=GeoDb_blob45"
anno_tb = read_GPL_url(url, out_dir = "test/GPL1037")
	Processing: GPL10379 ...
	File downloaded successfully!
	Found annotation column: ORF 
	New annotation file save to: C:/software/R-4.4.1/library/fastGEO/data/anno_obj.RData 
hd(anno_tb[[1]], 10:20, 2)
Type: data.frame  Dim: 52378 × 2 
                 ID SYMBOL
20 100147785_TGI_at       
21 100138290_TGI_at CEP350
22 100135090_TGI_at       
23 100310799_TGI_at       
24 100123722_TGI_at       
25 100311433_TGI_at       
26 100137544_TGI_at       
27 100142142_TGI_at       
28 100129675_TGI_at       
29 100307002_TGI_at       
30 100144445_TGI_at SEMA6A

anno_GEO 系列

  • 那问题又来了, 如果平台没有提供GPL注释文件或不包含基因名信息怎么办?
  • 好在我发现还有其他的获取注释文件的方法
    • anno_GEO_AnnoProbe: Jimmy Zeng写的R包, 整理了很多R包的注释信息, 尤其是将一些只提供碱基序列的探针给比对了一遍
    • anno_GEO_online: 少数情况下, 有些GSE自带注释文件, 可以直接在线下载读取
  • 所以我们有三种注释的方式, 自己准备注释文件(或者我的包已经自带的)、AnnoProbe和online三种
anno_GEO_AnnoProbe
  • 使用AnnoProbe包获取注释信息
anno_GEO_online
  • 使用GSE里的在线注释文件获取注释信息
anno_GEO
  • 准备好注释文件后就可以对表达谱进行注释了
  • 由于存在多个探针对应同一个基因的情况, 所以需要进行去重
  • 前面有讲过, 通过method参数指定重复基因的处理方法
    • max: 默认, 取最大表达量做为该基因的表达水平, 相对较快且合理
    • mean: 取均值, 太慢
    • remove: 直接去除重复基因, 最快, 但不推荐
    • sum: 求和, 用的比较少, 暂不推荐
anno_NGS
  • 这个是独立的功能, 用于对行名是ENSEMBL ID的二代测序表达谱的注释
  • 二代测序表达谱数据一般存放在补充文件

date_to_gene

  • 最后还有一个功能, 就是我发现有些表达谱的基因名是日期, 不排除是Excel搞的鬼: R语言-将日期恢复为基因
  • 所以最后再检查以下是否包含日期名, 如果有则自动进行转换

run_fastGEO_app

  • 最后还有个更逆天的功能, 你可以在浏览器执行这些操作
  • 其实就是使用Shiny基于我的这些函数搭建的一个分析平台
  • 其实我并不会Shiny的语法, 主要是是ChatGPT帮我写的, 我稍微改了一下
  • 大家可以体验一下😅, 目前还是要在R里面操作, 更灵活一点, 这个是搞着玩的暂时就是
# 执行下面的函数打开Shiny App
# 我这里就不运行了, 只展示截图
# run_fastGEO_app()

下游分析

get_GEO_group
  • 根据样本信息提取并整理表达谱和分组信息
  • 获取表达谱和分组信息后就可以做一系列的分析了
  • 指定作为分组依据的临床信息表的列名, 及每个分组的匹配模式和分组名称即可
  • 可指定两组或多组
obj = download_GEO("GSE18105", out_dir = "test/00_GEO_data_GSE18105")
>>>>> GSE18105已处理完毕, 将直接返回结果, 设置'force = TRUE'强制重新运行!

expM = obj$expM
sample_anno = obj$sample_anno
hd(expM)
Type: data.frame  Dim: 22881 × 111 
         GSM452552 GSM452553 GSM452554 GSM452555 GSM452556
A1BG      3.884167  3.805943  3.779392  3.961285  3.980595
A1BG-AS1  4.755635  5.021991  4.992824  5.002680  4.811076
A1CF      7.603906  8.907048  8.035191  8.603941  7.788007
A2M      10.972595  7.664711  9.703405  9.684616  9.976007
A2M-AS1   4.349903  4.127930  4.240315  4.537712  4.449675
hd(sample_anno)
Type: data.frame  Dim: 111 × 39 
                             title geo_accession                status submission_date last_update_date
GSM452552 patient 100, cancer, LCM     GSM452552 Public on Feb 04 2010     Sep 14 2009      Aug 28 2018
GSM452553 patient 101, cancer, LCM     GSM452553 Public on Feb 04 2010     Sep 14 2009      Aug 28 2018
GSM452554 patient 102, cancer, LCM     GSM452554 Public on Feb 04 2010     Sep 14 2009      Aug 28 2018
GSM452555 patient 103, cancer, LCM     GSM452555 Public on Feb 04 2010     Sep 14 2009      Aug 28 2018
GSM452556 patient 104, cancer, LCM     GSM452556 Public on Feb 04 2010     Sep 14 2009      Aug 28 2018
table_GEO_clinical(obj$sample_anno)
$characteristics_ch1

           metastasis: metastasis metastasis: metastatic recurrence                  metastasis: none 
                               26                                18                                67 

$characteristics_ch1.1

tissue: cancer, homogenized         tissue: cancer, LCM tissue: normal, homogenized 
                         17                          77                          17 
# 两组
group_list = get_GEO_group(obj, group_name = "characteristics_ch1.1", "cancer, homogenized" = "Cancer", "normal, homogenized" = "Normal")
group = group_list$group 
table(group)
group
Cancer Normal 
    17     17 
SID = group_list$SID
hd(SID)
Type: character  Dim: 34 
[1] "GSM452646" "GSM452647" "GSM452648" "GSM452649" "GSM452650"
expM = group_list$expM
hd(expM)
Type: data.frame  Dim: 22881 × 34 
         GSM452646 GSM452647 GSM452648 GSM452649 GSM452650
A1BG      4.218375  4.102014  4.080164  4.017133  3.958630
A1BG-AS1  4.813436  4.965083  4.854558  4.685886  4.796381
A1CF      8.866299  7.978485  8.947088  9.373371  9.222967
A2M       6.513392  8.517808 11.518616  8.276835  7.645857
A2M-AS1   3.880756  3.910804  5.114558  4.081589  3.817199
sample_anno = group_list$sample_anno
hd(sample_anno)
Type: data.frame  Dim: 34 × 39 
                                     title geo_accession                status submission_date last_update_date
GSM452646 patient 006, cancer, homogenized     GSM452646 Public on Feb 04 2010     Sep 14 2009      Aug 28 2018
GSM452647 patient 011, cancer, homogenized     GSM452647 Public on Feb 04 2010     Sep 14 2009      Aug 28 2018
GSM452648 patient 024, cancer, homogenized     GSM452648 Public on Feb 04 2010     Sep 14 2009      Aug 28 2018
GSM452649 patient 027, cancer, homogenized     GSM452649 Public on Feb 04 2010     Sep 14 2009      Aug 28 2018
GSM452650 patient 028, cancer, homogenized     GSM452650 Public on Feb 04 2010     Sep 14 2009      Aug 28 2018
# 多组
group_list = get_GEO_group(obj, group_name = "characteristics_ch1.1", 
                            "cancer, homogenized" = "Cancer", 
                            "normal, homogenized" = "Normal",
                            "cancer, LCM" = "Cancer_LCM"
                           )
table(group_list$group)


Cancer Cancer_LCM Normal
17 77 17

run_PCA
  • 执行PCA分析并绘图
  • 默认添加椭圆、默认不添加样本标签
  • 可输出图片(PDF/PNG), 返回ggplot对象, 可自行修改
group = group_list$group 
expM = group_list$expM
SID = group_list$SID
table(group)
group
    Cancer Cancer_LCM     Normal 
        17         77         17 
set_image(6, 6)
run_PCA(expM, group, out_name = "./test/00_GEO_data_GSE18105/01_PCA")



# 修改属性
run_PCA(expM, group, 
        title = "PCA Plot", 
        legend.title = "Group", 
        text.size = 30, 
        pt.size = 5, 
        key.size = 5, 
        color = c("red", "green", "blue")
       )



  • 添加label, 适合样本较少的情况, 可以更清晰地展示离群样本名
  • 这里每组只提取前5个样本展示
SID2 = unlist(lapply(group_list$SID_list, function(x) x[1:5]))
group2 = group[match(SID2, SID)]
expM2 = expM[, SID2]
table(group2)                     
group2
    Cancer Cancer_LCM     Normal 
         5          5          5 
run_PCA(expM2, group2, label = TRUE)
Warning message in geom_text_repel(aes(label = Sample), linewidth = ifelse(is.null(label.size), :
"[1m[22mIgnoring unknown parameters: `linewidth`"

run_DEG_limma
  • 快速进行limma差异分析, 适合标准化的芯片数据或高通量数据
  • 可设置阈值, 默认 |log2FC| > 1, Padj < 0.05
  • 可设置不使用P值矫正
  • 可添加基因label, 可自定义基因
DEG_tb = run_DEG_limma(expM, group, Case = "Cancer", Control = "Normal", out_name = "./test/00_GEO_data_GSE18105/02_DEG")
Warning message:
"Parameter 'Case' is deprecated since fastGEO 1.8.0, please use 'case' instead!"
Warning message:
"Parameter 'Control' is deprecated since fastGEO 1.8.0, please use 'control' instead!"
Number of Up regulated genes: 1006/2102(47.86%)

Number of Down regulated genes: 1096/2102(52.14%)

Number of Not Change genes: 20779/22881

head(DEG_tb)
A data.frame: 6 × 7
logFC AveExpr t P.Value adj.P.Val B DEG
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
FOXQ1 5.716354 8.777030 13.702008 2.117724e-15 8.075941e-12 24.97073 Up
CLDN1 4.640807 8.502369 19.330354 6.543012e-20 1.497107e-15 34.74020 Up
LGR5 4.485429 7.354803 10.223319 6.637827e-12 1.421583e-09 17.14899 Up
KRT23 4.445303 6.613692 8.937107 1.917386e-10 1.589992e-08 13.84479 Up
DPEP1 4.260363 8.240490 10.122520 8.572521e-12 1.662270e-09 16.89826 Up
CEMIP 4.147864 8.048877 12.218476 5.487202e-14 5.879765e-11 21.82719 Up
# 最低阈值
DEG_tb2 = run_DEG_limma(expM, group, Case = "Cancer", Control = "Normal", log2FC = 0.5, padj = FALSE, out_name = "./test/00_GEO_data_GSE18105/03_DEG")
Warning message:
"Parameter 'Case' is deprecated since fastGEO 1.8.0, please use 'case' instead!"
Warning message:
"Parameter 'Control' is deprecated since fastGEO 1.8.0, please use 'control' instead!"
Number of Up regulated genes: 2969/5693(52.15%)

Number of Down regulated genes: 2724/5693(47.85%)

Number of Not Change genes: 17188/22881

plot_volcano_limma
  • 根据差异表格绘制火山图
group_list = get_GEO_group(obj, group_name = "characteristics_ch1.1", 
                            "cancer, homogenized" = "Cancer", 
                            "normal, homogenized" = "Normal",
                            "cancer, LCM" = "Cancer_LCM"
                           )
table(group_list$group)


Cancer Cancer_LCM Normal
17 77 17

group = group_list$group 
expM = group_list$expM
SID = group_list$SID
table(group)
group
    Cancer Cancer_LCM     Normal 
        17         77         17 
DEG_tb = run_DEG_limma(expM, group, Case = "Cancer", Control = "Normal", out_name = "./test/00_GEO_data_GSE18105/02_DEG")
Warning message:
"Parameter 'Case' is deprecated since fastGEO 1.8.0, please use 'case' instead!"
Warning message:
"Parameter 'Control' is deprecated since fastGEO 1.8.0, please use 'control' instead!"
Number of Up regulated genes: 1006/2102(47.86%)

Number of Down regulated genes: 1096/2102(52.14%)

Number of Not Change genes: 20779/22881

plot_volcano_limma(DEG_tb, out_name = "./test/00_GEO_data_GSE18105/04_volcano")



# 显示阈值
plot_volcano_limma(DEG_tb, caption = TRUE)



# 选择 top logFC 前10的基因进行显示
plot_volcano_limma(DEG_tb, caption = TRUE, method = "logFC")



# 不显示基因名
plot_volcano_limma(DEG_tb, caption = TRUE, label = FALSE)



# 自定义label基因, 每组随机挑选5个
set.seed(1234)
gene_slect = as.character(aggregate(rownames(DEG_tb), by = list(DEG_tb$DEG), function(x) sample(x, 5))[, 2])
gene_slect
  1. 'CPM'
  2. 'STX16'
  3. 'TTC27'
  4. 'FFAR4'
  5. 'RP11-31K23.2'
  6. 'MCM2'
  7. 'GNA11'
  8. 'PIK3IP1'
  9. 'RFC3'
  10. 'CAPN9'
  11. 'TFPI'
  12. 'ASCC3'
  13. 'DMXL2'
  14. 'GK5'
  15. 'SLC2A1'
plot_volcano_limma(DEG_tb, caption = TRUE, label = gene_slect)



  • 如果差异分析的阈值发生改变, 这里也要进行一样的设置
# 最低阈值
DEG_tb2 = run_DEG_limma(expM, group, Case = "Cancer", Control = "Normal", log2FC = 0.5, padj = FALSE, out_name = "./test/00_GEO_data_GSE18105/03_DEG")
Warning message:
"Parameter 'Case' is deprecated since fastGEO 1.8.0, please use 'case' instead!"
Warning message:
"Parameter 'Control' is deprecated since fastGEO 1.8.0, please use 'control' instead!"
Number of Up regulated genes: 2969/5693(52.15%)

Number of Down regulated genes: 2724/5693(47.85%)

Number of Not Change genes: 17188/22881

plot_volcano_limma(DEG_tb2, log2FC = 0.5, padj = FALSE)



plot_heatmap_DEG
  • 根据差异分析结果和表达谱绘制差异基因热图
# 仅支持输入两种分组
group2 = group[group != "Cancer_LCM"]
expM2 = expM[, group != "Cancer_LCM"]
table(group2)
group2
Cancer Normal 
    17     17 
plot_heatmap_DEG(expM2, DEG_tb, group2, out_name = "./test/00_GEO_data_GSE18105/04_heatmap")



set_image(7, 10)
plot_heatmap_DEG(expM2, DEG_tb, group2, ntop = 30, w = 15)



plot_boxplot
gene_select = c(head(rownames(DEG_tb), 3), tail(rownames(DEG_tb), 3))
gene_select
  1. 'FOXQ1'
  2. 'CLDN1'
  3. 'LGR5'
  4. 'GCG'
  5. 'ZG16'
  6. 'CLCA4'
set_image(14, 6)
plot_boxplot(expM2, group2, 
             genes = gene_select,
             out_name = "./test/00_GEO_data_GSE18105/06_boxplot", w = 10, h = 5)



set_image(14, 6)
plot_boxplot(expM2, group2, 
             genes = gene_select, 
             breaks = c("Normal", "Cancer"), # 修改x轴顺序
             ylab = "log2(FPKM + 1)", # 修改y轴标题
             color = c("blue", "red"), # 修改颜色
             w = 10, h = 5)



# 多组
set_image(14, 6)
plot_boxplot(expM, group, 
             breaks = c("Normal", "Cancer", "Cancer_LCM"), # 修改x轴顺序
             comparisons = list(c("Cancer", "Normal"), c("Cancer_LCM", "Normal")), # 手动指定分组
             genes = gene_select)



run_DEG_deseq2
  • 快速使用DESeq2进行差异分析, 适合count数据
  • 有count数据优先使用DESeq2
load2("./test/TCGA/00_TCGA_data.RData")
Loading objects:
  expM
  expM_FPKM
  group
  sample_anno
hd(expM)
Type: data.frame  Dim: 59427 × 562 
          TCGA-60-2712-01A-01R-0851-07 TCGA-56-7221-01A-11R-2045-07 TCGA-21-A5DI-01A-31R-A26W-07 TCGA-43-7657-01A-31R-2125-07
5_8S_rRNA                            0                            0                            0                            0
5S_rRNA                              0                            0                            0                            4
7SK                                  0                            3                            0                            0
A1BG                                 7                            1                            0                            6
A1BG-AS1                            81                           34                           31                           41
          TCGA-94-7033-01A-11R-1949-07
5_8S_rRNA                            0
5S_rRNA                              4
7SK                                  0
A1BG                                 5
A1BG-AS1                            24
table(group)
group
  LUAD Normal 
   511     51 
sum(group == "Normal")

51

# 取子集测试
set.seed(1234)
SID_tumor = sample(colnames(expM)[group == "LUAD"], sum(group == "Normal"))
expM2 = expM[, c(SID_tumor, colnames(expM)[group == "Normal"])]
group2 = group[match(colnames(expM2), colnames(expM))]
table(group2)
group2
  LUAD Normal 
    51     51 
DEG_tb_count = run_DEG_deseq2(expM2, group2, Case = "LUAD", Control = "Normal", out_name = "./test/TCGA/TCGA_LUAD_DEG")
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

-- replacing outliers and refitting for 3453 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)

estimating dispersions

fitting model and testing

Number of Up regulated genes: 6032 
Number of Down regulated genes: 11035 
Number of Not Change regulated genes: 37154 
count_to_exp
  • 用于将count矩阵转为标准化的表达谱, 可用于绘制PCA、热图和箱线图等
  • 不依赖于基因长度, 省去计算TPM/FPKM的麻烦
  • 使用 DESeq2 的 VST 标准化方法
expM_vst = count_to_exp(expM2)
hd(expM_vst)
Type: data.frame  Dim: 59427 × 102 
          TCGA-58-A46K-01A-11R-A24H-07 TCGA-68-A59J-01A-21R-A26W-07 TCGA-66-2753-01A-01R-0980-07 TCGA-77-8136-01A-11R-2247-07
5_8S_rRNA                     1.885298                     1.885298                     1.885298                     1.885298
5S_rRNA                       3.184227                     1.885298                     2.586642                     1.885298
7SK                           1.885298                     1.885298                     1.885298                     1.885298
A1BG                          3.452268                     3.881663                     3.250346                     4.269259
A1BG-AS1                      5.402377                     5.804480                     5.639623                     6.755954
          TCGA-46-3765-01A-01R-0980-07
5_8S_rRNA                     1.885298
5S_rRNA                       1.885298
7SK                           1.885298
A1BG                          3.959024
A1BG-AS1                      5.297581
expM_FPKM2 = expM_FPKM[, colnames(expM2)]
hd(expM_FPKM2)
Type: data.frame  Dim: 59427 × 102 
          TCGA-58-A46K-01A-11R-A24H-07 TCGA-68-A59J-01A-21R-A26W-07 TCGA-66-2753-01A-01R-0980-07 TCGA-77-8136-01A-11R-2247-07
5_8S_rRNA                    0.0000000                    0.0000000                   0.00000000                    0.0000000
5S_rRNA                      1.7187452                    0.0000000                   0.63133714                    0.0000000
7SK                          0.0000000                    0.0000000                   0.00000000                    0.0000000
A1BG                         0.1257833                    0.1797658                   0.08161232                    0.2777468
A1BG-AS1                     0.8414901                    0.9455328                   0.87018715                    1.5912964
          TCGA-46-3765-01A-01R-0980-07
5_8S_rRNA                    0.0000000
5S_rRNA                      0.0000000
7SK                          0.0000000
A1BG                         0.1615653
A1BG-AS1                     0.5865966
  • 测试vst算法与TCGA官网提供的FPKM的相关性
  • 结果显示Sperman相关性达到0.984, 说明相关性极高, 可以等价
# 加载必需的包(确保版本兼容)
if (!require("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
  library(dplyr)
}
if (!require("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
  library(ggplot2)
}

# 确保两个矩阵的基因和样本匹配
common_genes <- intersect(rownames(expM_vst), rownames(expM_FPKM2))
common_samples <- intersect(colnames(expM_vst), colnames(expM_FPKM2))

expM_vst_filtered <- expM_vst[common_genes, common_samples]
expM_FPKM_filtered <- expM_FPKM2[common_genes, common_samples]

# 转换为长格式(不使用rownames_to_column)
# 处理VST数据:手动添加基因名列
vst_df <- as.data.frame(expM_vst_filtered)
vst_df$gene <- rownames(vst_df)  # 手动将行名转为gene列
vst_long <- tidyr::pivot_longer(vst_df, cols = -gene, 
                               names_to = "sample", values_to = "vst_value")

# 处理FPKM数据:手动添加基因名列
fpkm_df <- as.data.frame(expM_FPKM_filtered)
fpkm_df$gene <- rownames(fpkm_df)  # 手动将行名转为gene列
fpkm_long <- tidyr::pivot_longer(fpkm_df, cols = -gene, 
                                names_to = "sample", values_to = "fpkm_value")

# 合并数据
combined_data <- inner_join(vst_long, fpkm_long, by = c("gene", "sample"))

# 计算相关性
cor_pearson <- cor(combined_data$vst_value, combined_data$fpkm_value, method = "pearson")
cor_spearman <- cor(combined_data$vst_value, combined_data$fpkm_value, method = "spearman")
hd(combined_data)
Type: tbl_df  Dim: 6061554 × 4 
       gene                       sample vst_value fpkm_value
1 5_8S_rRNA TCGA-58-A46K-01A-11R-A24H-07  1.885298          0
2 5_8S_rRNA TCGA-68-A59J-01A-21R-A26W-07  1.885298          0
3 5_8S_rRNA TCGA-66-2753-01A-01R-0980-07  1.885298          0
4 5_8S_rRNA TCGA-77-8136-01A-11R-2247-07  1.885298          0
5 5_8S_rRNA TCGA-46-3765-01A-01R-0980-07  1.885298          0
# 绘制散点图
set_image(6, 6)
set.seed(1234)
combined_data = data.frame(combined_data)
combined_data2 = combined_data[sample(1:nrow(combined_data), 10000), ]
# 计算相关性
cor_pearson <- cor(combined_data2$vst_value, combined_data2$fpkm_value, method = "pearson")
cor_spearman <- cor(combined_data2$vst_value, combined_data2$fpkm_value, method = "spearman")
ggplot(combined_data2, aes(x = fpkm_value, y = vst_value)) +
  geom_point(alpha = 0.3, size = 1) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    x = "FPKM Expression",
    y = "VST Normalized Expression",
    title = paste0("Correlation between VST and FPKM\n",
                  "Pearson: ", round(cor_pearson, 3), " | ",
                  "Spearman: ", round(cor_spearman, 3))
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
[1m[22m`geom_smooth()` using formula = 'y ~ x'

  • 绘制的热图也看不出差异
plot_heatmap_DEG(expM_vst, DEG_tb_count, group2)



plot_heatmap_DEG(expM_FPKM2, DEG_tb_count, group2)



run_ComBat
  • 去除多个数据集之间的批次效应, 并绘制去批次前后的PCA图
# 下载处理 GSE108109 芯片数据
obj = download_GEO("GSE108109", out_dir = "./test/run_ComBat")
>>>>> 识别到本地注释文件, 将优先使用!

>>>>> 访问数据集: GSE108109...

>>>>> 下载尝试次数: 1 ... FALSE

下载成功!

>>>>> 处理测序平台: GPL19983

>>>>> 识别到111个样本, 34个注释信息

>>>>> 输出样本注释表至: ./test/run_ComBat/GSE108109_GPL19983_sample_anno.csv

>>>>> 执行样本间矫正...

>>>>> 处理成功文件保存至: ./test/run_ComBat/GSE108109_GPL19983_raw.RData

>>>>> 执行探针-基因注释, 平台: GPL19983 ...

>>>>> 使用R包内置的注释文件...

>>>>> 去除重复的基因注释, 使用方法: max...

>>>>> 处理结束!

obj_list = get_GEO_group(obj, group_name = "source_name_ch1",
                         "Membranous Nephropathy" = "MN",
                         "Living donor" = "Control"
                        )
expM_GSE108109 = obj_list$expM
group_GSE108109 = obj_list$group
#下载处理 GSE216841 高通量
sample_anno = get_GEO_pheno("GSE216841")
>>>>> 访问数据集: GSE216841...

>>>>> 下载尝试次数: 1 ... FALSE

下载成功!

>>>>> 处理测序平台: GPL20301

>>>>> 识别到34个样本, 40个注释信息

>>>>> 输出样本注释表至: 00_GEO_data/GSE216841_GPL20301_sample_anno.csv

>>>>> 未识别到表达谱, 提供的数据集可能是高通量测序, 请确认!

    https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE216841

>>>>> 高通量数据的表达谱请自行从补充文件下载处理!

>>>>> 将仅返回样本注释表!

hd(sample_anno)
Type: data.frame  Dim: 34 × 40 
                      title geo_accession                status submission_date last_update_date
GSM6696604 Normal control 1    GSM6696604 Public on Jan 25 2023     Oct 28 2022      Jan 25 2023
GSM6696605 Normal control 2    GSM6696605 Public on Jan 25 2023     Oct 28 2022      Jan 25 2023
GSM6696606 Normal control 3    GSM6696606 Public on Jan 25 2023     Oct 28 2022      Jan 25 2023
GSM6696607 Normal control 4    GSM6696607 Public on Jan 25 2023     Oct 28 2022      Jan 25 2023
GSM6696608 Normal control 5    GSM6696608 Public on Jan 25 2023     Oct 28 2022      Jan 25 2023
expM_raw = read.faster("./test/run_ComBat/GSE216841_MNd_ncounts_annot.txt.gz")
expM_raw = col2rownames(expM_raw)
hd(expM_raw)
Type: data.frame  Dim: 20321 × 37 
                ensembl_gene_id hgnc_GRCh38p12                                                                                    description
ENSG00000000003 ENSG00000000003         TSPAN6                                              tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
ENSG00000000005 ENSG00000000005           TNMD                                                tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
ENSG00000000419 ENSG00000000419           DPM1 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
ENSG00000000457 ENSG00000000457          SCYL3                                   SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
ENSG00000000460 ENSG00000000460       C1orf112                        chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
                   N_CTRL1    N_CTRL2
ENSG00000000003  92.874228 309.359955
ENSG00000000005   1.020596   1.041616
ENSG00000000419 255.148977   1.041616
ENSG00000000457 373.538103 887.456841
ENSG00000000460 110.224358 289.569251
countM = expM_raw[, -c(1:3)]
name_list = c("N_CTRL" = "Normal control", "MEMBR" = "Idiopathic membranous nephropathy", "MCD" = "Minimal change disease")
new_name = paste(name_list[paste0(gsub("[0-9]", "", colnames(countM)))], gsub("[A-Z_]", "", colnames(countM)))
colnames(countM) = sample_anno$geo_accession[match(new_name, sample_anno$title)]
countM = ceiling(countM)
countM = data.frame(SYMBOL = expM_raw$hgnc_GRCh38p12, countM)
countM = aggregate(. ~ SYMBOL, data = countM, function(x) x[which.max(x)])
countM = col2rownames(countM)
expM = count_to_exp(countM)
converting counts to integer mode

group = subString(sample_anno[colnames(countM), "title"], -1, " ", rev = TRUE, collapse = " ")
table(group)
group
Idiopathic membranous nephropathy            Minimal change disease                    Normal control 
                               12                                14                                 8 
SID_MN = colnames(expM)[group == "Idiopathic membranous nephropathy"]
SID_Control = colnames(expM)[group == "Normal control"]
expM_GSE216841 = expM[, c(SID_MN, SID_Control)]
group_GSE216841 = rep_by_len(c("MN", "Control"), list(SID_MN, SID_Control))
expM_list = list2(expM_GSE108109, expM_GSE216841)
names(expM_list) = subString(names(expM_list), 2, "_")
names(expM_list)
  1. 'GSE108109'
  2. 'GSE216841'
group_merge = Reduce(c, sapply(paste0("group_", names(expM_list)), get))
table(group_merge)
group_merge
Control      MN 
     14      56 
set_image(9, 10)
obj = run_ComBat(expM_list, group_merge, out_name = "./test/run_ComBat/01_Combat")
Warning message in MASS::cov.trob(data[, vars]):
"Probable convergence failure"
Warning message in MASS::cov.trob(data[, vars]):
"Probable convergence failure"
Warning message in MASS::cov.trob(data[, vars]):
"Probable convergence failure"

# 提取结果
expM = obj$expM
group = obj$group
boxplot(expM)



table(group)
group
Control      MN 
     14      56 

search_GEO: 数据集检索

  • 这个功能是V1.9最主要和重要的功能, 它可以让数据集检索结果的筛选变得极其高效!
  • 结合AI人工智能, 只需简单的三步就能获取GEO数据库中你想要的数据集
第一步: 获取检索词
  • 比如你需要检索以下几个关键词的数据: 人类阿尔兹海默症海马体组织的bulk转录组数据
  • 首先定义keywords, 然后使用get_GEO_query_promot生成AI提示词, 把提示词粘贴给AI即可获得检索词
  • ChatGPT生成的检索词感觉最为靠谱
  • 当然如果你自己会写检索词也可以自己写
  • 检索词只有一个要求: 不能包含单引号!
keywords = c("阿尔兹海默症", "转录组(RANseq和微阵列)", "海马体组织", "人类")
get_GEO_query_promot(keywords)
我需要检索GEO数据集,请编写符合下面条件的英文单行检索词,同义词尽可能地多,宁愿多检索也不能漏掉,物种要这样写"Homo sapiens"[porgn], 条件: 阿尔兹海默症、转录组(RANseq和微阵列)、海马体组织、人类。 回答仅给出检索词就行, 放在代码块里检索词, 最后一定要仔细检查去除检索词里的单引号
第二步: 下载数据集信息
  • 之后, 将AI或自己编写的检索词提供给函数search_GEO, 它就可以全自动获取满足检索条件的所有数据集的信息, 基本上是GSE界面的所有信息
  • 并且分别保存为txt和xlsx格式, 前者提供给AI解析筛选, 后者我们也可以进行查看和筛选
search_term = '("Alzheimer disease") AND (transcriptome OR transcriptional OR "gene expression" OR "high throughput sequencing" OR microarray) AND (hippocampus) AND "Homo sapiens"[porgn]'
search_GEO(search_term, out_name = "./test/阿尔兹海默_海马体_转录组数据")
>>>>> 步骤1: 构建检索链接,获取符合条件的UID ...

>>>>> 成功检索到81个符合条件的UID

>>>>> 步骤2: 提取每个UID对应的完整GSE Accession

>>>>> 步骤3: 下载每个GSE的HTML文件

>>>>> 步骤4: 提取每个GSE的元数据信息

Warning message in file.create(to[okay]):
"cannot create file './test/阿尔兹海默_海马体_转录组数据.xlsx', reason 'Permission denied'"
>>>>> 检索结果已保存到: ./test/阿尔兹海默_海马体_转录组数据.xlsx

>>>>> 检索结果已保存到: ./test/阿尔兹海默_海马体_转录组数据.txt
第三步: 让AI精确筛选想要的数据集
  • 获取每个数据集的信息后, 就可以扔给AI让它帮我们筛选数据了, 几十分钟甚至几个小时的工作AI十几秒就可以完成了
  • 首先贴心地为大家准备了生成提示词地函数: get_GEO_summary_promot, 大家也可以跟AI随便说你想要的数据
  • 然后将上面生成的txt文件和提示词扔给AI, 它就很快的帮我们把需要的数据和介绍整理成表格了
  • 这里我用的AI是豆包, 因为豆包对文件的处理速度最快, 然最后需要我们手动一一校验是否满足我们的需求
get_GEO_summary_promot(keywords)
从提供的文件中筛选整理严格符合下面条件的数据集成表格, 首先总结满足条件的数据集, 然后给出表格包含下面几列:GSE编号、核心研究内容、关键技术、样本信息: 条件: 阿尔兹海默症、转录组(RANseq和微阵列)、海马体组织、人类
  • txt

  • xlsx

  • AI筛选的结果

  • 做之前根本不敢想, 这竟然真的能够实现, 谁懂这个R语言+AI的含金量!

安装

# 当前目录的文件
cbind(dir())
A matrix: 22 × 1 of type chr
00_GEO_data
01_安装.R
02_使用.R
AnnoProbe.zip
annot_dir
fastGEO安装和使用教程.html
fastGEO安装和使用教程.ipynb
fastGEO安装和使用教程.md
fastGEO安装和使用教程.R
fastGEO安装和使用教程.Rmd
fastGEO安装和使用教程.Rproj
fastR_1.5.0.tar.gz
fastR_1.8.0.tar.gz
Heatmap_DEG.pdf
Heatmap_DEG.png
test
test.R
test.txt
tmp_GSE_html
阿尔兹海默_海马体_转录组数据.txt
阿尔兹海默_海马体_转录组数据.xlsx
测试案例
  • 双击html文档可以打开网页版的此教程, 可以浏览, 但不能运行
  • 安装和使用视频教程: R包fastGEO-GEO数据库快速下载探针注释
  • 双击项目文件fastGEO安装和使用教程.Rproj, 打开项目
  • 右下角Files面板点击R脚本, 运行安装代码

必须安装的

  • 执行基础的功能所必须的
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("GEOquery")
# BiocManager::install("limma")
# devtools::install_github("jmzeng1314/AnnoProbe")
# install.packages("data.table", repos="https://rdatatable.gitlab.io/data.table")
# devtools::install_github("hadley/readr")
# install.packages("stringr")
# install.packages("httr")
# install.packages("rvest")
# install.packages("shiny")
# install.packages("DT")

建议安装的

  • 这些可能不太容易安装, 尽量安装一下
# BiocManager::install("org.Hs.eg.db")
# BiocManager::install("org.Mm.eg.db")
# BiocManager::install("clusterProfiler")

安装我的工具包 fastR

  • 有很多函数是依赖我的这个工具包的
  • 当前文件夹没有安装包的需要付费获取安装包:
    • 直接添加微信转账50, 发送fastR和fastGEO的压缩包
    • 以前是20, 老用户依然可免费获取
    • 很多资料后面都会慢慢涨价, 早买会便宜很多!
    • 购买后可进入交流群,后续版本更新和问题答疑都在群里进行

# 自动寻找当前目录下的压缩包
file_fastR = grep("^fastR.*.gz", dir(), value = TRUE)
file_fastR
  1. 'fastR_1.5.0.tar.gz'
  2. 'fastR_1.8.0.tar.gz'
# 如果你正在使用要安装或更新的R包, 需要先取消加载, 取消占用
# detach("package:fastGEO", unload = TRUE)
# 强制进行安装更新
# install.packages(file_fastR, repos = NULL, upgrade = FALSE, force = TRUE)

安装 fastGEO

# ## 如果以前安装过, 可能本地存在注释文件, 如果有则需要备份
# ## 重新安装前保存预先存在的注释文件
# anno_file = paste0(.libPaths(), "/fastGEO/data/anno_obj.RData")
# anno_file_temp = paste0("temp-", basename(anno_file))
# anno_file_exists = FALSE
# if(file.exists(anno_file)){
#     file.copy(anno_file, anno_file_temp)
#     anno_file_exists = TRUE
# }
# anno_file_exists
# ## 安装fastGEO
# file_fastGEO = grep("^fastGEO.*.gz", dir(), value = TRUE)
# install.packages(file_fastGEO, repos = NULL, upgrade = FALSE, force = TRUE)

# ## 重新拷贝注释文件
# if(anno_file_exists){
#     file.copy(anno_file_temp, anno_file)
#     unlink(anno_file_temp)
# }
  • 安装完之后按照上一节的内容自行测试就行了

更新版本的安装

  • 对于小版本的细微改动, 我将只提供更新后的R包压缩包, 大家可以将压缩包放到教程目录下, 重新运行安装代码即可
  • 或者直接指定安装包路径进行安装
  • 其中file_pkgs是R包压缩包的绝对路径
# install.packages(file_pkgs, repos = NULL, upgrade = FALSE, force = TRUE)




Logo

有“AI”的1024 = 2048,欢迎大家加入2048 AI社区

更多推荐