Preseq Benchmark
RustQC includes a built-in reimplementation of preseq lc_extrap for library complexity estimation. This page compares the output against preseq v3.2.0, which is the version used by the nf-core/rnaseq pipeline.
Performance
Section titled “Performance”Preseq runs as part of the single-pass rustqc rna analysis alongside all other
tools. The fragment counting is performed during the same BAM read that powers
dupRadar, featureCounts, and the 7 RSeQC tools. The extrapolation computation
itself is fast (under 1 second) — it is purely a function of the histogram data.
| Tool | Runtime |
|---|---|
| preseq lc_extrap (standalone) | ~2m 30s |
| RustQC (all tools, single pass) | 5m 13s |
RustQC produces preseq output at zero additional I/O cost — the fragment histogram is collected during the same BAM pass that produces all other outputs.
Output comparison
Section titled “Output comparison”Input: GM12878 REP1 — a 10 GB paired-end RNA-seq BAM (93M fragments).
The preseq reference was generated with:
preseq lc_extrap -bam -pe -seed 1 -seg_len 100000000 \ GM12878_REP1.markdup.sorted.bamComplexity curve
Section titled “Complexity curve”RustQC’s extrapolation matches preseq v3.2.0 within 0.1% across the entire range, from 1M to 10B reads. The difference is consistent with expected stochastic variation from the bootstrap confidence interval estimation.
| Total reads | preseq v3.2.0 | RustQC | Difference |
|---|---|---|---|
| 1,000,000 | 897,694 | 897,332 | -0.04% |
| 10,000,000 | 6,581,472 | 6,580,040 | -0.02% |
| 50,000,000 | 23,587,800 | 23,583,614 | -0.02% |
| 100,000,000 | 37,212,648 | 37,205,668 | -0.02% |
| 500,000,000 | 83,289,016 | 83,274,130 | -0.02% |
| 1,000,000,000 | 103,397,736 | 103,380,000 | -0.02% |
Fragment counting
Section titled “Fragment counting”The fragment histogram (frequency-of-frequencies table) is the input to the extrapolation algorithm. RustQC’s parallel hash-based counting matches preseq’s priority-queue-based mate merging within 0.001% for the singleton count (the most sensitive histogram bin).
| Metric | preseq v3.2.0 | RustQC |
|---|---|---|
| Total fragments | 93,003,768 | 93,003,768 |
| Distinct fragments | ~52.4M | ~52.4M |
| Singleton count | Matches within 0.001% | Matches within 0.001% |
Confidence intervals
Section titled “Confidence intervals”Bootstrap confidence intervals overlap between preseq and RustQC at every extrapolation point. RustQC’s intervals are marginally wider due to minor differences in bootstrap random sampling, but the point estimates (medians) are within 0.1% at all depths.
Algorithm details
Section titled “Algorithm details”RustQC implements the same Good-Toulmin rational function extrapolation as preseq v3.2.0:
- Fragment counting: PE reads are merged into fragments using
(tid, min_start, max_end)keys derived from CIGAR-based alignment coordinates, matching preseq’smerge_mates()behavior - Power series: Coefficients derived from the frequency-of-frequencies histogram
- Continued fraction: Evaluated using Euler’s forward recurrence with preseq-style rescaling
- Degree selection: Starts at degree 7, steps by 2, selects the most stable approximation
- Bootstrap: Categorical resampling with configurable seed for reproducibility
Key implementation choices matched to preseq v3.2.0:
- Read filtering:
!(flag & 0x100)only (secondary excluded, supplementary included) - Proper pair flag
0x2determines PE vs SE fragment construction - Non-proper-pair reads treated as individual fragments
Benchmark conditions
Section titled “Benchmark conditions”- BAM file: GM12878 ENCODE RNA-Seq, paired-end, ~185M reads, 10 GB
- preseq version: 3.2.0 (via Docker, matching nf-core/rnaseq)
- RustQC: 10 threads,
--gtfmode - Hardware: Apple M3 Max, 128 GB RAM