Within the biomarkers research field, miRNA have become a promising molecule for diagnosis of different diseases like cancer (Loke et al. 2019) or neurodegenerative diseases (Pan et al. 2016; Kumar, Vijayan, and Reddy 2017), among others. One of the advantages of these molecules is that they are present on reachable tissues like blood, which is easier to sample than tumors or cerebrospinal fluid. Moreover, the analysis of small RNA could also provide insights of disease regulation and reveal new therapeutic approaches.
Therefore, it is important to extract as maximum information as possible from its analysis. In this page I would provide with an analysis done by myself that could serve as model or inspiration for others. If so, please refer to the web page or the article published (Not published yet, under review).
We used docker4seq
version 2.4.0 in order to process fastq files and obtain the count matrix (Calogero et al. 2020).
library(docker4seq)
downloadContainers(group="docker")
mirnaCounts(group = "docker", fastq.folder = getwd(),
scratch.folder = "/data/scratch/", mirbase.id = "hsa",
download.status = FALSE, adapter.type = "ILLUMINA", trimmed.fastq = TRUE)
This function generates, among others, 2 text files for further processing: all.counts.txt and cpm.txt. As we had 4 fastq for each sample (that is, technical replicates), we unified same samples into a single column, summing up the counts. We also changed the identifier of each sample. Both archives were stored in the corresponding project folder for further processing.
mexpr <- read.table("all.counts.txt", header = TRUE, sep = "\t")
cpms <- read.table("cpm.txt", header = TRUE, sep = "\t", dec = ".")
indices <- seq(2,193, by = 4)
muestras <- character()
conteos <- numeric()
srna.cpm <- numeric()
for (i in indices){
j <- i-1
muestra <- strsplit(colnames(mexpr)[i], "_")[[1]][2]
muestras <- c(muestras, muestra)
conteos <- cbind(conteos,rowSums(mexpr[,i:(i+3)]))
srna.cpm <- cbind(srna.cpm, rowSums(cpms[,j:(j+3)]))
}
conteos <- as.data.frame(conteos)
srna.cpm <- as.data.frame(srna.cpm)
colnames(conteos) <- muestras
colnames(srna.cpm) <- muestras
rownames(conteos) <- mexpr$X
save(conteos, file = "resultados/1_matriz-expresion-docker.Rdata")
save(srna.cpm, file = "resultados/1_cpm-docker.Rdata")
In the present project, there were 48 samples coming from patients belonging to 3 different groups: Control (n=20), Early-Disease (n=8), Late-Disease (n=20). To further identify differentially expressed miRNA in both disease cases, we used 3 differential expression packages from Bioconductor: DESeq2, edgeR and NOISeq. Both DESeq2 and edgeR are based on a parametric assumption that NGS generates counts that follow a negative binomial distribution defined by the mean (μ) and the overdispersion parameter (α). In contrast, NOISeq is a package based on a non-parametric assumption. We used all of them to generate a suitable set of miRNA candidates.
Kij ∼ NB(μij, αi)
μij = β0Controlj + β1Earlyj + β2Latej
As we had an unbalanced dataset, we performed the analysis in two parallel ways:
Using a binary outcome (composed of the 20 Controls and 20 Late cases).
Using the 3 outcomes (Controls, Early and Late).
First of all, we filetered out miRNAs to reduce noise, we selected these miRNA having more than 5 counts in at least 10 samples, which resulted in a total of 284 miRNA in case of the binary outcome and 347 miRNA in the dataset with all outcomes.
load("resultados/1_matriz-expresion-docker.Rdata")
grupo <- condicion$`grupo experimental` != "case_1"
grupo <- condicion$`grupo experimental`[grupo]
grupo <- factor(grupo)
conteos.bin <- conteos[,condicion$`grupo experimental` != "case_1"]
keep <- rowSums(conteos.bin >5) >= 10
conteos.bin <- conteos.bin[keep,]
sum(keep) #selects 284
## [1] 284
## [1] 347
Before differential expression testing, we performed an exploratory data analysis (EDA) in order to look for outliers. As it can be seen in both datasets (PCA plots), there may be an extreme point towards the left of the plots, belonging to a Late-Disease sample (S17). This was initially considered as outlier but then the analysis was repeated and it showed no differences among the differentially expressed miRNA (de-miRNA) detected. Thus, it was not identified as outlier.
Binary analysis identified 10 de-miRNA and multiple outcome identified 41 de-miRNA (pvalue < 0.05).
bin.model <- DESeq(bin.dds)
mcivscont.results.bin <- results(bin.model[order(results(bin.model)$pvalue),], name = resultsNames(bin.model)[2], pAdjustMethod = "fdr")
write.csv(mcivscont.results.bin[order(mcivscont.results.bin$pvalue),], file = "resultados/2-DEmiRNA_bin_MCIvsCONT.csv")
multi.model <- DESeq(multi.dds)
mcivscont.results.multi <- results(multi.model, name = resultsNames(multi.model)[2])
prevscont.results.multi <- results(multi.model, name = resultsNames(multi.model)[3])
write.csv(mcivscont.results.multi[order(mcivscont.results.multi$pvalue),], file = "resultados/2-DEmiRNA_MCIvsCONT.csv")
write.csv(prevscont.results.multi[order(prevscont.results.multi$pvalue),], file = "resultados/2-DEmiRNA_PREvsCONT.csv")
Similar to the DESeq2 procedure, we performed the differential expression analysis using default parameters in the edgeR documentation. As a result, binary analysis identified 4 de-miRNA (one in common to DESeq2) and multiclass analysis identified 15 de-miRNA.
library(edgeR)
bin.dge <- DGEList(counts = conteos.bin)
bin.dge$samples$group <- grupo
#sum(duplicated(rownames(bin.dge)))
bin.dge <- calcNormFactors(bin.dge)
desing <- model.matrix(~grupo, data = bin.dge$samples)
bin.dge <- estimateGLMCommonDisp(bin.dge,design = desing)
bin.dge <- estimateGLMTrendedDisp(bin.dge, design = desing)
bin.dge <- estimateGLMTagwiseDisp(bin.dge, design = desing)
bin.result <- glmFit(bin.dge)
bin.result <- glmLRT(bin.result)
bin.demir.glm <- topTags(bin.result,adjust.method = "fdr")
write.csv(bin.demir.glm, file = "resultados/4-DEmiRNA-glm_bin_MCIvsCONT.csv")
multi.dge <- DGEList(counts = conteos)
multi.dge$samples$group <- as.factor(condicion$`grupo experimental`)
#sum(duplicated(rownames(multi.dge))) #no duplicates
desing <- model.matrix(~group, data = multi.dge$samples)
multi.dge <- estimateGLMCommonDisp(multi.dge,design = desing)
multi.dge <- estimateGLMTrendedDisp(multi.dge, design = desing)
multi.dge <- estimateGLMTagwiseDisp(multi.dge, design = desing)
result.glm <- glmFit(multi.dge,design = desing)
result.MCI <- glmLRT(result.glm,coef = 2)
result.precli <- glmLRT(result.glm, coef = 3)
write.csv(topTags(result.MCI), file = "resultados/4-DEmiRNA-glm_MCIvsCONT.csv")
write.csv(topTags(result.precli), file = "resultados/4-DEmiRNA-glm_PRECvsCONT.csv")
Finally, we performed a third analysis following a non-parametric approach, resulting in 20 miRNAs with probability higher than 0.85.
library(NOISeq)
load("resultados/1_matriz-expresion-docker.Rdata")
conteos.bin <- conteos[,condicion$`grupo experimental` != "case_1"]
bin.nsq <- readData(data = conteos.bin, factors = data.frame(grupo = grupo))
#QCreport(bin.nsq, samples = NULL, factor = "grupo", file = "resultados/5-QCreport.pdf")
set.seed(2244) #needed for reproducible result given that it is based on simulation
result.nsq <- noiseq(bin.nsq, k = 0.5, norm = "rpkm", factor = "grupo", replicates = "no") #es un noiseq-sim
## [1] "Computing (M,D) values..."
## [1] "Computing probability of differential expression..."
## [1] "20 differentially expressed features"
In order to get insights regarding the biological meaning of this miRNAs, we first obtained the predicted gene targets, which are scored (rank_product
) from more probable to less probable (the lower the rank_product
value, the better the prediction). We did it twice, one annotation with the de-miRNAs identified in the binary analysis and other annotation with the de-miRNAs identified in the multiclass analysis.
library(miRNAtap)
library(miRNAtap.db)
library(topGO)
library(org.Hs.eg.db)
library(data.table)
rm(list = ls())
demirna.dsq <- read.csv(file = "resultados/2-DEmiRNA_bin_MCIvsCONT.csv", row.names = 1)
demirna.edg.glm <- read.csv(file = "resultados/4-DEmiRNA-glm_bin_MCIvsCONT.csv", row.names = 1)
demirna.nsq <- read.csv(file = "resultados/5-DEmiRNA_bin_MCI-Cont.csv", row.names = 1)
#sum(demirna.dsq$pvalue<0.05) # 10 selected
demirna.dsq <- demirna.dsq[demirna.dsq$pvalue<0.05,] #selects 10 miRNA
#sum(demirna.edg.glm$PValue < 0.05) # 4 selected
demirna.edg.glm <- demirna.edg.glm[demirna.edg.glm$PValue < 0.05,] #selecciona 4 miRNA
# sum(demirna.nsq$prob > 0.85) # 20 selected
demirna.bin.all <- unique(c(rownames(demirna.dsq), rownames(demirna.edg.glm), rownames(demirna.nsq)))
predictions.bin <- data.table(getPredictedTargets(demirna.bin.all[1], species = "hsa",
method = "geom", min_src = 3), keep.rownames = TRUE)
for (i in 2:length(demirna.bin.all)) { #
mir = demirna.bin.all[i]
predictions.bin <- rbind(predictions.bin,data.table(getPredictedTargets(mir, species = 'hsa',
method = 'geom', min_src = 3),
keep.rownames = TRUE), fill = TRUE)
}
#hay dups??
#sum(duplicated(predictions.bin$rn)) #check for duplicates
predictions.bin <- predictions.bin[,lapply(.SD, mean), by = rn] #join duplicates by rank_product mean (!)
# sum(duplicated(predictions.bin$rn)) #check for duplicates
predictions.bin <- predictions.bin[order(predictions.bin$rank_product)] #reordering
predictions.bin <- predictions.bin[predictions.bin$rank_product < 10,] #rank_product filtering
rankedGenes.bin <- predictions.bin$rank_product
rankedGenes.bin <- setNames(rankedGenes.bin, predictions.bin$rn)
selection = function(x) TRUE # we do not want to impose a cut off, instead we are using rank information
ontos <- c("BP", "CC", "MF")
# onto <- ontos[1]
for (onto in ontos) {
allGO2genes = annFUN.org(whichOnto=onto, feasibleGenes = NULL,
mapping="org.Hs.eg.db", ID = "entrez")
GOdata = new('topGOdata', ontology = onto, allGenes = rankedGenes.bin,
annot = annFUN.GO2genes, GO2genes = allGO2genes,
geneSel = selection, nodeSize=10)
#nodeSize es un parámetro que se usa para jerarquizar los términos GO. Es decir, se descartan aquellos GO que tengan menos de 10 genes en la lista de genes
results.ks = runTest(GOdata, algorithm = "weight01", statistic = "ks")
allRes = GenTable(GOdata, KS = results.ks, orderBy = "KS", topNodes = max(results.ks@geneData[[4]],20))
allRes[,c('GO.ID','Term','KS')]
write.csv(allRes[,c('GO.ID','Term','KS')], file = paste("resultados/6-GO-DEmi-bin_",onto,".csv", sep = ""))
}
rm(list = ls())
demirna.dsq1 <- read.csv(file = "resultados/2-DEmiRNA_MCIvsCONT.csv", row.names = 1)
demirna.dsq2 <- read.csv(file = "resultados/2-DEmiRNA_PREvsCONT.csv", row.names = 1)
demirna.edg.glm1 <- read.csv(file = "resultados/4-DEmiRNA-glm_MCIvsCONT.csv", row.names = 1)
demirna.edg.glm2 <- read.csv(file = "resultados/4-DEmiRNA-glm_PRECvsCONT.csv", row.names = 1)
demirna.dsq1 <- demirna.dsq1[demirna.dsq1$pvalue<0.05,]
demirna.dsq2 <- demirna.dsq2[demirna.dsq2$pvalue<0.05,]
demirna.edg.glm1 <- demirna.edg.glm1[demirna.edg.glm1$PValue < 0.05,]
demirna.edg.glm2 <- demirna.edg.glm2[demirna.edg.glm2$PValue < 0.05,]
demirna.all <- unique(c(rownames(demirna.dsq1),rownames(demirna.edg.glm1),rownames(demirna.dsq2), rownames(demirna.edg.glm2)))
predictions.multinom <- data.table(getPredictedTargets(demirna.all[1], species = "hsa", method = "geom", min_src = 3), keep.rownames = TRUE)
for (i in 2:length(demirna.all)) { #
mir = demirna.all[i]
predictions.multinom <- rbind(predictions.multinom,data.table(getPredictedTargets(mir, species = 'hsa',
method = 'geom', min_src = 3), keep.rownames = TRUE), fill = TRUE)
}
# sum(duplicated(predictions.multinom)) #check for duplicates
predictions.multinom <- predictions.multinom[order(predictions.multinom$rank_product)] #reordering
predictions.multinom <- predictions.multinom[predictions.multinom$rank_product < 10,] #rank_product filtering
rankedGenes <- predictions.multinom$rank_product
rankedGenes <- setNames(rankedGenes, predictions.multinom$rn)
selection = function(x) TRUE # we do not want to impose a cut off, instead we are using rank information
ontos <- c("BP", "CC", "MF")
# onto <- ontos[2]
for (onto in ontos) {
allGO2genes = annFUN.org(whichOnto=onto, feasibleGenes = NULL,
mapping="org.Hs.eg.db", ID = "entrez")
GOdata = new('topGOdata', ontology = onto, allGenes = rankedGenes,
annot = annFUN.GO2genes, GO2genes = allGO2genes,
geneSel = selection, nodeSize=10)
#nodeSize es un parámetro que se usa para jerarquizar los términos GO. Es decir, se descartan aquellos GO que tengan menos de 10 genes en la lista de genes
results.ks = runTest(GOdata, algorithm = "weight01", statistic = "ks")
allRes = GenTable(GOdata, KS = results.ks, orderBy = "KS", topNodes = max(results.ks@geneData[[4]],20))
allRes[,c('GO.ID','Term','KS')]
write.csv(allRes[,c('GO.ID','Term','KS')], file = paste("resultados/6-GO-DEmi-multi_",onto,".csv", sep = ""))
}
In order to visualize GO terms, we used the treeMap representation. But first, it was necessary to reduce the GO terms, grouping them by similarity. For this, we used rrvgo
package.
library(rrvgo)
rm(list = ls())
go.bin.bp <- read.csv("resultados/6-GO-DEmi-bin_BP.csv", row.names = 1)
go.bin.cc <- read.csv("resultados/6-GO-DEmi-bin_CC.csv", row.names = 1)
go.bin.mf <- read.csv("resultados/6-GO-DEmi-bin_MF.csv", row.names = 1)
go.multi.bp <- read.csv("resultados/6-GO-DEmi-multi_BP.csv", row.names = 1)
go.multi.cc <- read.csv("resultados/6-GO-DEmi-multi_CC.csv", row.names = 1)
go.multi.mf <- read.csv("resultados/6-GO-DEmi-multi_MF.csv", row.names = 1)
go.bp <- rbind(go.bin.bp, go.multi.bp)
go.cc <- rbind(go.bin.cc, go.multi.cc)
go.mf <- rbind(go.bin.mf, go.multi.mf)
Finally, we decided to select all de-miRNA identified by both analysis and build generalized linear models. That is a binomial (strictly bernoulli) regression and a multinomial regression, respectively. Then, in order to evaluate the prediction accuracy, we validated the area under the curve (AUC) with a bootstrap resampling using boot
package.
rm(list=ls())
library(nnet)
library(pROC)
library(boot)
load("resultados/1_matriz-expresion-docker.Rdata")
condicion <- readxl::read_xlsx("../Muestras_ConsueloChafer2.xlsx")
grupo <- factor(condicion$`grupo experimental`)
demirna.dsq <- read.csv(file = "resultados/2-DEmiRNA_bin_MCIvsCONT.csv", row.names = 1)
demirna.edg.glm <- read.csv(file = "resultados/4-DEmiRNA-glm_bin_MCIvsCONT.csv", row.names = 1)
demirna.nsq <- read.csv(file = "resultados/5-DEmiRNA_bin_MCI-Cont.csv", row.names = 1)
demirna.dsq <- demirna.dsq[demirna.dsq$pvalue<0.05,] #selecciona 10 miRNA
demirna.edg.glm <- demirna.edg.glm[demirna.edg.glm$PValue < 0.05,] #selecciona 4 miRNA
demirna.all <- c(rownames(demirna.dsq),rownames(demirna.edg.glm),rownames(demirna.nsq))
#demirna.all[duplicated(demirna.all)] #aparece el hsa-miR-4259, identificado por DESeq2 y edgeR
demirna.all <- unique(demirna.all)
conteos.bin <- conteos[grupo != "preclinico"]
glm1 <- glm(grupo ~ ., family = binomial(), data = datos)
glm2 <- step(glm1, trace = 0)
write.csv(coef(glm2), file = "resultados/9-coef_bin_inter.csv")
boot.glm <- function(data, indices){
data.b <- data[indices,]
glm.b <- glm(grupo ~ ., family = binomial(), data = data.b)
roc.b <- pROC::roc(as.numeric(data$grupo)-1, predict(glm.b, newdata = data ,type = "response"))
return(roc.b$auc)
}
glm.boot <- boot(datos, boot.glm, R = 1000, parallel = "multicore", ncpus = parallel::detectCores())
demirna.dsq1 <- read.csv(file = "resultados/2-DEmiRNA_MCIvsCONT.csv", row.names = 1)
demirna.dsq2 <- read.csv(file = "resultados/2-DEmiRNA_PREvsCONT.csv", row.names = 1)
demirna.edg.glm1 <- read.csv(file = "resultados/4-DEmiRNA-glm_MCIvsCONT.csv", row.names = 1)
demirna.edg.glm2 <- read.csv(file = "resultados/4-DEmiRNA-glm_PRECvsCONT.csv", row.names = 1)
demirna.dsq1 <- demirna.dsq1[demirna.dsq1$pvalue<0.05,]
demirna.dsq2 <- demirna.dsq2[demirna.dsq2$pvalue<0.05,]
demirna.edg.glm1 <- demirna.edg.glm1[demirna.edg.glm1$PValue < 0.05,]
demirna.edg.glm2 <- demirna.edg.glm2[demirna.edg.glm2$PValue < 0.05,]
demirna.all <- unique(c(rownames(demirna.dsq1),rownames(demirna.edg.glm1),rownames(demirna.dsq2), rownames(demirna.edg.glm2)))
datos <- data.frame(grupo = grupo, t(conteos[demirna.all,]))
modelo <- multinom(grupo ~ ., data = datos)
modelo2 <- step(modelo, trace = 0)
write.csv(coef(modelo2), file = "resultados/7-coef_multi_inter.csv")
##
## Call:
## multiclass.roc.default(response = as.numeric(grupo), predictor = as.numeric(predict(modelo)))
##
## Data: as.numeric(predict(modelo)) with 3 levels of as.numeric(grupo): 1, 2, 3.
## Multi-class area under the curve: 1
multinom.fun <- function(data, indices){
data.b <- data[indices,]
model.b <- multinom(grupo ~ ., data = data.b)
model.roc.b <- multiclass.roc(as.numeric(data$grupo), as.numeric(predict(model.b, newdata = data)))
return(model.roc.b$auc)
}
multinom.boot <- boot(datos, multinom.fun, R = 1000,
parallel = "multicore", ncpus = parallel::detectCores())
## # weights: 141 (92 variable)
## initial value 52.733390
## iter 10 value 18.419041
## iter 20 value 10.403683
## iter 30 value 8.581056
## iter 40 value 7.408737
## iter 50 value 6.723852
## iter 60 value 6.298944
## iter 70 value 5.588832
## iter 80 value 3.217361
## iter 90 value 0.961881
## iter 100 value 0.005292
## final value 0.005292
## stopped after 100 iterations
par(mfrow=c(1,2))
plot(density(glm.boot$t), main = "Bootstrap AUC distribution", xlab = "Binomial model")
abline(v=quantile(glm.boot$t, c(0.025, 0.95), names = FALSE), lty = 2)
text(x = mean(glm.boot$t), y = 0.5,labels = paste0("AUC=",signif(mean(glm.boot$t,digits = 3))), cex = 0.8)
mtext("A", adj = 0)
plot(density(multinom.boot$t), main = "Bootstrap AUC distribution", xlab = "Multinomial model")
text(x = mean(multinom.boot$t), y = 0.5,labels = paste0("AUC=",signif(mean(multinom.boot$t,digits = 3))), cex = 0.8)
abline(v = quantile(multinom.boot$t, probs = c(0.025,0.975)), lty = 2)
mtext("B", adj = 0)
Calogero, Raffaele, Neha Kulkarni, Luca Alessandri, Marco Beccuti, Greta Romano, and Francesca Cordero. 2020. Docker4seq: Running Ngs Computing Demanding Applications, E.g Reads Mapping and Counting, Wrapped in Docker Containers.
Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21. http://www.nature.com/nmeth/journal/v12/n2/full/nmeth.3252.html.
Kumar, Subodh, Murali Vijayan, and P. Hemachandra Reddy. 2017. “MicroRNA-455-3p as a potential peripheral biomarker for Alzheimer’s disease.” Human Molecular Genetics 26 (19): 3808–22. https://doi.org/10.1093/hmg/ddx267.
Loke, Sau Yeen, Prabhakaran Munusamy, Geok Ling Koh, Claire Hian Tzer Chan, Preetha Madhukumar, Jee Liang Thung, Kiat Tee Benita Tan, et al. 2019. “A Circulating miRNA Signature for Stratification of Breast Lesions Among Women with Abnormal Screening Mammograms.” Cancers 11 (12). https://doi.org/10.3390/cancers11121872.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.
Pajak, Maciej, and T. Ian Simpson. 2020. MiRNAtap: MiRNAtap: MicroRNA Targets - Aggregated Predictions.
Pan, Yaoqian, Ruizhu Liu, Erin Terpstra, Yanqing Wang, Fangfang Qiao, Jin Wang, Yigang Tong, and Bo Pan. 2016. “Dysregulation and Diagnostic Potential of microRNA in Alzheimer’s Disease.” Journal of Alzheimer’s Disease 49: 1–12. https://doi.org/10.3233/JAD-150451.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Sayols, Sergi. 2020. Rrvgo: A Bioconductor Package to Reduce and Visualize Gene Ontology Terms. https://ssayols.github.io/rrvgo.
Tarazona, Sonia, Pedro Furio-Tari, David Turra, Antonio Di Pietro, Maria Jose Nueda, Alberto Ferrer, and Ana Conesa. 2015. “Data Quality Aware Analysis of Differential Expression in Rna-Seq with Noiseq R/Bioc Package.” Nucleic Acids Research 43 (21): e140.
Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. https://www.stats.ox.ac.uk/pub/MASS4/.