By: Kobe De Ridder, Huiwen Che, Kaat Leroy & Bernard Thienpont
Quantifying the cell types that contribute to a tissue sample is important for studying disease mechanisms, as well as for accurate patient diagnosis and prognosis. This can be done with several techniques, including immunohistology, cell sorting, and single-cell RNA sequencing, but these are time-consuming and expensive. Moreover, these techniques cannot quantify the cellular contributions to cell-free DNA (extracellular DNA present in bodily fluids as a result of cell death or active secretion), despite the diagnostic and prognostic value of doing so.
An accurate and cost-effective alternative is reference-based computational deconvolution of DNA methylation profiles. However, multiple deconvolution algorithms have been described, and each requires a set of prespecified parameters, making their implementation anything but straightforward. In our study, we provide a comprehensive evaluation of 16 deconvolution algorithms, in combination with several normalization algorithms, on both array- and sequencing-based tissue and leukocyte mixtures constructed in silico, in vitro and in vivo.
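To make the idea concrete, reference-based deconvolution models a bulk methylation profile as a weighted average of cell-type-specific reference profiles and solves for the weights. The sketch below illustrates this with non-negative least squares; it is a minimal illustration, not one of the 16 benchmarked algorithms, and all function and variable names are ours.

```python
# Minimal sketch of reference-based deconvolution: given a reference
# matrix of methylation values (markers x cell types) and a bulk mixture
# profile, estimate cell-type fractions by constrained least squares.
# Illustrative only; not the paper's actual pipeline.
import numpy as np
from scipy.optimize import nnls

def deconvolve_nnls(reference, mixture):
    """Estimate cell-type proportions from a bulk methylation profile.

    reference : (n_markers, n_celltypes) array of reference beta values
    mixture   : (n_markers,) array of bulk beta values
    Returns proportions rescaled to sum to 1.
    """
    coefs, _ = nnls(reference, mixture)  # non-negative least squares
    total = coefs.sum()
    return coefs / total if total > 0 else coefs

# Toy example: three cell types, five marker CpGs.
rng = np.random.default_rng(0)
ref = rng.uniform(0, 1, size=(5, 3))
true_props = np.array([0.6, 0.3, 0.1])
bulk = ref @ true_props + rng.normal(0, 0.01, size=5)  # noisy mixture
print(deconvolve_nnls(ref, bulk))  # approximately [0.6, 0.3, 0.1]
```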
Furthermore, we assess several variables that influence deconvolution performance, including marker selection strategy, reference complexity, technical variability and cell-type similarity. Most of the heavy lifting for this project was performed using the VSC resources, both for storage of large whole-genome bisulfite sequencing datasets and for compute power during in silico mixture construction and deconvolution.
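In silico mixtures of the kind used here can be thought of as weighted averages of sorted-cell profiles combined at known, randomly drawn proportions. The sketch below shows one plausible construction; the Dirichlet sampling and the function name are our assumptions, not the paper's exact procedure.

```python
# Hedged sketch of in silico mixture construction: sample random
# cell-type proportions and mix sorted-cell methylation profiles
# as a weighted average.
import numpy as np

def make_in_silico_mixtures(profiles, n_mixtures, seed=0):
    """profiles: (n_markers, n_celltypes) sorted-cell beta values.
    Returns (mixtures, proportions) with shapes
    (n_markers, n_mixtures) and (n_celltypes, n_mixtures)."""
    rng = np.random.default_rng(seed)
    n_celltypes = profiles.shape[1]
    # Dirichlet draws are non-negative and sum to 1 per mixture.
    props = rng.dirichlet(np.ones(n_celltypes), size=n_mixtures).T
    return profiles @ props, props
```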
Key findings:
Deconvolution using the EpiDISH software package produced the most consistent and robust results. EpiDISH leverages a robust partial correlation (RPC) model, which might explain its superior performance: it is insensitive to outlier values while still picking up the subtle signals coming from rare cell types.
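The intuition behind RPC is that the bulk profile is regressed on the reference matrix with a robust loss, so aberrant CpGs are down-weighted rather than dominating the fit. EpiDISH itself is an R package built on robust linear regression; the Python sketch below only mirrors that idea with statsmodels' Huber-weighted RLM and is not EpiDISH's actual implementation.

```python
# Sketch of a robust-regression deconvolution in the spirit of RPC:
# outlier CpGs receive reduced weight via the Huber loss.
import numpy as np
import statsmodels.api as sm

def deconvolve_rpc_like(reference, mixture):
    """reference: (n_markers, n_celltypes); mixture: (n_markers,)."""
    fit = sm.RLM(mixture, reference, M=sm.robust.norms.HuberT()).fit()
    coefs = np.clip(fit.params, 0, None)  # proportions cannot be negative
    total = coefs.sum()
    return coefs / total if total > 0 else coefs
```

Because the Huber weights shrink the influence of CpGs with large residuals, a handful of noisy markers cannot swamp the small coefficients that correspond to rare cell types.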
The choice of normalization strategy does not impact deconvolution performance, whereas reference marker specificity is strongly associated with it. The algorithm used for marker selection and the similarity of the cell types to be deconvolved both influence performance as well. Lastly, a higher number of informative markers typically improves deconvolution performance, but the gains plateau at ~100 markers per cell type (see the sketch below).
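As one concrete example of a marker selection strategy, cell-type-specific markers can be ranked one-vs-rest by the methylation difference between a cell type and all others, keeping the top N per cell type. This is only an illustrative variant, not a specific algorithm from the benchmark; the default of 100 markers simply echoes the plateau noted above.

```python
# Hedged sketch of one-vs-rest marker selection: for each cell type,
# rank CpGs by their methylation difference against the mean of the
# other cell types and keep the top N.
import numpy as np

def select_markers(profiles, n_per_type=100):
    """profiles: (n_markers, n_celltypes) mean beta values per cell type.
    Returns sorted indices of selected marker rows."""
    n_markers, n_types = profiles.shape
    selected = set()
    for t in range(n_types):
        others = np.delete(profiles, t, axis=1)
        # Hyper- and hypomethylated markers both discriminate type t.
        delta = np.abs(profiles[:, t] - others.mean(axis=1))
        top = np.argsort(delta)[::-1][:n_per_type]
        selected.update(top.tolist())
    return np.array(sorted(selected))
```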
For sequencing-based deconvolution, both sequencing depth and evenness of coverage are important for effective deconvolution. In our experiments, deconvolution performance plateaued at 14× coverage. However, accurate deconvolution of low-abundance fractions will require higher sequencing depth.
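Depth effects such as the 14× plateau can be probed in silico by thinning read counts. The sketch below binomially downsamples per-CpG coverage and recomputes methylation levels; the exact subsampling used in the benchmark may differ, and all names are ours.

```python
# Hedged sketch of in silico coverage titration via binomial thinning
# of per-CpG read counts, followed by recomputing beta values.
import numpy as np

def downsample_coverage(meth_reads, total_reads, target_depth, seed=0):
    """meth_reads, total_reads: (n_cpgs,) integer read-count arrays.
    target_depth: desired mean coverage after thinning."""
    rng = np.random.default_rng(seed)
    # Fraction of reads to keep so mean coverage hits target_depth.
    keep = min(1.0, target_depth / total_reads.mean())
    new_total = rng.binomial(total_reads, keep)
    # Each retained read is methylated with the CpG's original rate.
    rate = np.divide(meth_reads, total_reads, where=total_reads > 0,
                     out=np.zeros(total_reads.shape))
    new_meth = rng.binomial(new_total, rate)
    # Recompute beta values; CpGs with no remaining coverage become NaN.
    beta = np.divide(new_meth, new_total, where=new_total > 0,
                     out=np.full(new_total.shape, np.nan))
    return beta, new_total

# Example: thin a ~30x dataset down to ~14x mean coverage.
rng = np.random.default_rng(1)
total = rng.poisson(30, size=1000)
meth = rng.binomial(total, 0.7)
beta, cov = downsample_coverage(meth, total, target_depth=14)
print(cov.mean())  # close to 14
```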
Figure 1: Schematic representation of benchmark workflow.
Figure 2. A: Heatmap of R² values between sorted cell samples, highlighting the similarity between different cell types. B: Boxplots of root mean square error between predicted and actual proportions for CD8+ T cells, natural killer cells and all cell types combined, for in silico and in vitro datasets. C: Scatterplots of true proportions (x-axis) against predicted proportions (y-axis) for in silico and in vitro datasets, using different normalizations (columns) and deconvolution algorithms (rows); colors represent cell types. D: Heatmap of R² values between predicted and actual proportions for all deconvolution algorithms (rows) at variable numbers of marker regions (columns); barplots show row and column R² averages. E: Line plots of R² values between predicted and actual proportions at variable numbers of marker regions (x-axis) and increasing sequencing depths (colors).
Read the full publication in the Nature Portfolio.