How does the 16S-SNAPP pipeline estimate the abundance of different organisms?
16S-SNAPP does not directly estimate abundance by organism.
Instead, it tallies read counts by mapped reference sequences (templates) and the resulting consensus sequences. The abundance table of different taxonomic entities (taxa) is generated after the consensus sequences are classified.
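As a rough illustration of that last step, the sketch below sums per-consensus read counts into a taxon-level table once each consensus sequence has a classification. It is not taken from the SNAPP code: the function name, the dictionary layout, and the "Unclassified" fallback label are all assumptions made for the example.

```python
from collections import Counter


def taxa_abundance(consensus_counts, classification):
    """Sum read counts of consensus sequences that share a taxon.

    consensus_counts: hypothetical mapping of consensus ID -> read count
    classification:   hypothetical mapping of consensus ID -> taxon label
    """
    table = Counter()
    for consensus_id, count in consensus_counts.items():
        # Assumption: consensus sequences without a label are pooled together.
        taxon = classification.get(consensus_id, "Unclassified")
        table[taxon] += count
    return dict(table)


# Made-up counts and labels, for illustration only.
counts = {"c1": 120, "c2": 30, "c3": 50}
labels = {"c1": "Bacteroides", "c2": "Bacteroides", "c3": "Prevotella"}
abundance = taxa_abundance(counts, labels)
```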
Allocating read counts to consensus sequences is, in general, a simple addition. However, if a unique read pair is aligned/mapped to more than one template, it becomes a challenge to split its count (also called abundance or size) among these templates, a problem similar to allocating multi-mapped read counts in RNA-seq analysis. We assume that reads from different amplicon regions are distributed more evenly within a template than between templates.
Although this assumption is idealized and may deviate from reality (for example, because PCR efficiency differs among amplicons), it gives a workable basis for a read count allocation scheme. In SNAPP, we implemented an iterative procedure using the SciPy optimization module in Python: for each multi-mapped read pair, it splits the count among the candidate templates so as to minimize the total sum of squared deviations, given the read count distribution already assigned.
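The idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the SNAPP implementation: it assumes the assigned counts are stored as a templates-by-regions matrix, that the "deviation" is each region count's difference from its template's mean, and that SLSQP is the solver. The function name and example numbers are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def split_multimapped(count, region, assigned):
    """Split one multi-mapped read pair's count among candidate templates.

    count:    read count of the multi-mapped pair
    region:   column index of the amplicon region the pair comes from
    assigned: (n_templates, n_regions) array of counts already assigned
              to the candidate templates (assumed layout)
    """
    base = np.asarray(assigned, dtype=float)
    n_templates = base.shape[0]

    def total_squared_deviation(x):
        trial = base.copy()
        trial[:, region] += x  # add the proposed shares
        # Deviation of each region count from its own template's mean.
        dev = trial - trial.mean(axis=1, keepdims=True)
        return np.sum(dev ** 2)

    # Start from an even split; shares must be non-negative and sum to count.
    x0 = np.full(n_templates, count / n_templates)
    result = minimize(
        total_squared_deviation,
        x0,
        method="SLSQP",
        bounds=[(0.0, count)] * n_templates,
        constraints=[{"type": "eq", "fun": lambda x: x.sum() - count}],
    )
    return result.x


# Template 0 is short of reads in region 2; template 1 is already even.
assigned = [[10, 10, 4],
            [10, 10, 10]]
shares = split_multimapped(count=6, region=2, assigned=assigned)
```

In this made-up case the optimizer should give nearly all six reads to the first template, since that makes both templates even across regions. An iterative version would add the shares to the assigned matrix and repeat over the remaining multi-mapped read pairs.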