How does DESeq2 normalize the counts?

March 19, 2021

Sequencing RNA-seq Normalization Differential expression DESeq2

Adrià Mitjavila on March 19, 2021 (edited on March 22, 2021)

When studying RNA-seq data, one of the most common approaches is to perform differential expression analysis (or DE analysis), a method in which two different biological conditions (i.e. different developmental stages or disease status) are compared in order to see if the expression of genes changes between them. Counts, which are sequencing reads (for single-end sequencing) or fragments (for paired-end sequencing) that map to a given genomic feature (i.e. exons), are a proxy to estimate gene expression but, in order to compare the expression of the genes across different condition or even to compare different genes within the same condition, we have to normalize these counts.

Recently, I explained some RNA-seq count normalization methods, and their uses. Here, I am going to explain some theory behind the normalization used by DESeq2, a common R package for DE analysis.

If you want to read more about normalization methods, read this post.

Why another normalization method?

Remember the FPKM and TPM normalization methods we explained in the this post? They are nice normalization methods in normalizing read counts by sequencing depth (also called library size) and gene length, which makes them perfect to compare the expression of different genes in the same sample, as well as for comparison of a gene expression between similar conditions or between condition replicates.

When doing DE analysis, the normalization by sequencing depth is extremely important, since is one sample has, overall, the double of reads than the other, you will see differences among sample that are not due to biological causes, but due to the sequencing. On the other hand, since we compare the two samples gene by gene, one should expect that in both samples the same gene has the same length, so normalization by gene length is not necessary.

However, there is still one issue that must be addressed: the library composition (or RNA composition).

To explain differences in library composition:

Let’s say, for example, that we have two samples A and B that come from different cell types.
To make things easy, assume that only 3 genes are expressed (gene1, gene2, and gene3).
Imagine that gene1 and gene2 have equal, but very low expression in sample A and sample B, and that gene3 is highly expressed in sample A, but not in sample B.
If we sequence the same number of reads, we will probably have a lot of reads for gene3 in sample A and less in sample B, obtaining that gene3 is differentially expressed. No problem since this is what happens biologically.
However, in gene1 and gene2, despite being equally expressed in sample A and B, in sample A they will have less reads, since gene3 captures most of the sequenced reads. Due to this, we will obtain that gene1 and gene2 are upregulated in sample B compared to sample A, when this is not happening biologically.

Diagram of the example explained above. Top panel shows the real exression of the genes: gene1 and 2 are low but equally expressed in sample A and B, while gene3 is highly expressed in sample A but not in sample B. Bottom panel shows the sequenced transcripts: gene 3 shows similar trend as the real expression, however, since in sample A most reads map to gene3, gene1 and gene2 recieve less reads than in sample B, despite having equal real expression.

Figure 1: Diagram of the example explained above. Top panel shows the real exression of the genes: gene1 and 2 are low but equally expressed in sample A and B, while gene3 is highly expressed in sample A but not in sample B. Bottom panel shows the sequenced transcripts: gene 3 shows similar trend as the real expression, however, since in sample A most reads map to gene3, gene1 and gene2 recieve less reads than in sample B, despite having equal real expression.

How does `DESeq2` adresses this differences?

To account for sequencing depth and library composition, DESeq2 calculates sample-specific scaling factors using the mean of rations method. For the user runing DESeq2 in R, this implies only one step but, the DESeq2 algorithm performs multiple steps:

Compute the pseudo-reference: geometric mean for each gene across all samples.
- For each gene, it computes the geometric mean of the raw counts accross all samples in order to create a pseudo-reference sample. The good thing about this mean is that is not as susceptible to outliers as arithmetic mean.
- This means that DESeq2 calculates the product of the raw counts in a gene accross all the samples and then, it does a root with base equal to the number of samples.
- The geometric mean can also be calculated by computing the aritmetic mean of the natural logarithm transformed values and then raising the e number to the calculated aritmetich mean if the log values.
Removing genes without counts.
- If a gene has 0 counts in one sample, its geometric mean accross all samples will be 0. This can will lead to errors in the next steps, hence this gene must be removed from the calculation of the normalization factors.
- Note that this gene will be introduced once more in the normalization step (step 5 here), since it’s only removed to calculate the normalization factors.
Ratio of each sample to the pseudo-reference.
- The next step is computing the ratio between the raw counts in each sample and the geometric mean previously calculated. This way, we will adress the differences in library composition.
- Have in mind that, since most genes won’t be differential expressed, the majority of genes should have ratios near to 1.
Sample-specific size normalization: median of the ratios.
- For each sample, take the median of the all the counts/reference ratios calculated before. With this median, we will address the sequence depth differences accross different samples which, added to the ratios, accounts for both, differences in library composition and sequence depth.
- A cool thing about the median is that, unlike the mean, it is not very much affected by the presence of very or very low expressed genes.
- Note that, since most genes won’t be differentiall expressed, the median of the ratios should be near one.
Normalization of the raw counts.
- For each gene in each samples, divide the raw counts by the median of the ratios calculated above.

A little example

To understand things better, let’s do a little example with an invented counts matrix with 6 genes and 3 samples. Note that a real counts matrix would have tenths of thousands of genes, hence this examples is not representative of a real situation.

Table 1: Raw (unnormalized) counts in genes from each sample. Note that the total number of counts has been added to illustrate differences in sequencing depth, but it won’t be used for the calculation of normalization (or scaling) factors. Moreover, no genes with 0 counts have been added to make the example easier.
	sampleA	sampleB	sampleC
gene1	27	15	119
gene2	79	3	128
gene3	21	33	127
gene4	8	38	63
gene5	4	21	105
gene6	37	25	111
Total	176	135	653

Note that in table 1 the total number of counts for each samples has been added to illustrate differences in sequencing depth, but it won’t be used for the calculation of normalization (or scaling) factors. Moreover, no genes with 0 counts have been added to make the example easier.

Step 1: build a pseudo-reference by computing the geometric mean

What we have to do here is to follow the equation given above to compute the geometric mean of the raw counts of each gene across all samples.

Table 2: Table with the raw counts in every gene in eac sample and the pseudo-reference for each gene, computed by using the geometric mean.
	sampleA	sampleB	sampleC	pseudoRef
gene1	27	15	119	36.39156
gene2	79	3	128	31.18790
gene3	21	33	127	44.48146
gene4	8	38	63	26.75499
gene5	4	21	105	20.66123
gene6	37	25	111	46.82613

Step 2: filter 0 count genes

This step is not necessary here since we don’t have any genes with 0 counts.

Step 3: compute the ratios of the raw counts and the pseudo-reference

What we have to do here is to divide the raw counts of every gene in each sample for the pseudo-reference calculated in the step 1 (table 2).

Table 3: Table showing, for each gene, the ratios of raw counts and the geometric mean.
	ratiosA	ratiosB	ratiosC
gene1	0.7419303	0.4121835	3.269989
gene2	2.5330338	0.0961912	4.104156
gene3	0.4721069	0.7418822	2.855122
gene4	0.2990097	1.4202961	2.354701
gene5	0.1935993	1.0163964	5.081982
gene6	0.7901572	0.5338900	2.370471

Step 4: retrieve the median of the ratios

In this step, we have to calculate, in each sample, the median value for the ratios obtained in step 3 (table 3=. These values will be the scaling factor for each sample.

Table 4: Normalization (or scaling) factors for each of the samples in the experimental design.
	Scaling factor
sampleA	0.6070186
sampleB	0.6378861
sampleC	3.0625557

As seen in table 3, the scaling factors are far from 1, specially for the sample3. This is not the real situation, since with thousands of genes, where most of them are not differentially expressed (similar in all conditions), the mean of the ratios would be near to 1.

Step 5. Normalize the raw counts.

To normalize the raw counts in each sample, we have to divide them by the sample scaling factor.

Table 5: Table of normalized counts for every gene in each sample
	sampleA	sampleB	sampleC
gene1	44.479693	23.515170	38.85644
gene2	130.144288	4.703034	41.79516
gene3	34.595317	51.733374	41.46863
gene4	13.179168	59.571764	20.57105
gene5	6.589584	32.921238	34.28509
gene6	60.953654	39.191950	36.24424
Total	289.941704	211.636531	213.22061

Note that the total number of normalized counts for each sample has is much more similar across samples than the non normalized counts. However, in a real example, the total number of counts would be even more similar across all samples

More information

For a complete explanation of how DESeq normalizes counts, look at this video from StatQuest (here, the author of the video does the calculations using logarithmic values, but is essentially the same):

You can also take a look at this tutorial from Harvard Bioinformatics Core.

Finally, you can take a look to this video from Chipster, which covers the differntial expression analysis from raw count normalization to multiple testing correction:

Bibliography

Journal articles

Anders et al (2010). Differential expression analysis for sequence count data. Genome Biol. Online here.

Love et al (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. Online here.

Other sources

Harvard Bioinformatics Core (2019). Introduction to DGE: Count normalization with DESeq2. DGE workshop. Online here. Consulted on March 19, 2021.

Renesh Bedre (2021). Gene expression units explained: RPM, RPKM, FPKM, TPM, DESeq, TMM, SCnorm, GeTMM, and ComBat-Seq. Renesh Bedre Data Science Blog. Online here. Consulted on March 19, 2021.

RNA-seq and count normalizations

Definitions, uses and misuses

RNA-seq Counts Normalization RPKM FPKM TPM

What is MNase-seq?

My first paper has been published

Sequencing MNase-seq Xrn1 Nucleosomes