America presents a wide range of ecoregions that were quickly explored and occupied by the first humans to reach the continent (1). This remarkable human migration likely required different genetic and cultural adaptation patterns to ensure successful subsistence (2 – 5). The Amazon rainforest is one of the main ecoregions of the American continent, which is well known for its tropical climate and extraordinary biological diversity. Despite being a nutrient-rich environment, it is also hostile, presenting different obstacles to long-term human survival. Several challenges, including the instability of food resources (6, 7), low light penetration (8), and high diversity of pathogens (9), probably contribute to strong selective pressures for human survival and reproduction in this ecoregion.
The Amazon region, which encompasses the single largest tropical rainforest and nine South American countries, is virtually unrivaled in scale, complexity, and opportunity. It is currently populated by 1 million indigenous people, divided into approximately 300 different ethnic groups (10). Ecological conditions in the region have historically been favorable for transmitting numerous tropical diseases, especially vector-borne diseases (11 – 13). Although there are numerous studies on postcontact epidemics (14) in the Americas, historical data on precontact diseases (i.e., diseases native to the Americas) are inadequate. However, it has been reported that tuberculosis (Mycobacterium tuberculosis) (15, 16) and Chagas disease (i.e., American trypanosomiasis) (17) were present long before the Europeans arrived. Chagas disease is a vector-borne disease caused by the protozoan Trypanosoma cruzi, and it is usually transmitted through different triatomine bugs in endemic areas. The oldest record of T. cruzi in South American human archeological remains dates back to 9000 years in mummies from northern Chile and southern Peru (18). Human remains infected with T. cruzi were also found in Brazil about 7000 years before the present (yr B.P.) (19). There are, however, not many studies on the adaptations to the rainforest including Amazonian populations. Most of these are limited to a few individuals from the western Amazonia (20, 21). To date, knowledge regarding genetic adaptations in humans within this complex ecosystem is largely unknown. Motivated by this lack of knowledge, we searched for possible footprints of genetic adaptation to the Amazon rainforest environment by analyzing the genomic data of 118 nonadmixed individuals belonging to 19 native populations (table S1). We were specifically interested in identifying signals for positive natural selection related to tropical diseases.
To search for signals of positive selection, we applied two distinct approaches: (i) Population Branch Statistics (PBS), which identify alleles that have experienced strong changes in frequency in one population relative to two reference populations (19), and (ii) Cross-Population Extended Homozygosity Haplotype (XP-EHH) statistics, which contrast the extended haplotype homozygosity within and between populations (22). We then explored these results through gene pathway enrichment analyses [the Mapping and Annotation of Genome-Wide Association Studies (FUMA GWAS) (23), Gene Ontology (GO) (24), Kyoto Encyclopedia of Genes and Genomes (KEGG) (25), and Reactome (26) and the Gene Set Analysis Toolkit (GESTALT) (27)]. To formally test whether natural selection underlies the cases of extreme differentiation, rejecting genetic drift as a cause, PBS values were compared against those obtained with neutral coalescent simulations generated according to a plausible demographic scenario for the peopling of South America. Last, we performed a functional follow-up analysis to characterize the role of the putative selected gene.
RESULTS AND DISCUSSION
Genetic adaptation to the Amazon rainforest
The upper distribution of the combined positive selection indices (PBS + XP-EHH) is highly correlated to cardiovascular and metabolic traits (MTRR, DNAJA4, KCMA1, and MTPN), immune-related traits (KCMA1 and GCA), and pathogen infection (table S2). Among the three genes that were highlighted in our analysis, PPP3CA and DYNC1I1 were suggestively associated with T. cruzi seropositivity and immune response (Fig. 1 and table S2) (28), while NOS1AP was related to the mosquito bite reaction (Fig. 1) (29). The strongest selection signature found in Amazonian populations was around the PPP3CA gene region (Fig. 1 and figs. S1 and S4). Neutral coalescent simulations indicate that these deviations in allele frequency (rs2659540 G>A) are statistically significant (P 0.99) native Amazonians (table S1 and fig. S9), 35 non-admixed Mesoamericans (Native American ancestry > 0.97), and 231 East Asians from the Human Genome Diversity Project (HGDP; dataset 11, http://cephb.fr/hgdp/). Samples were genotyped with the high-density SNP array Axiom Human Origins (~700K SNPs, Affymetrix/Thermo Fisher Scientific). Native American ancestry proportions were estimated using ADMIXTURE (54).
Data quality control and phasing
First, we used the liftOver software (https://genome.ucsc.edu/cgi-bin/hgLiftOver) to update the physical positions from GRCh37-hg19 to the current genome assembly, GRCh38-hg38. Individuals with more than 10% of missing data and those SNPs with minor allele frequency (MAF) lower than 5% or with more than 1% of missing data were removed. We further computationally phased the data using SHAPEIT2 (55) and the recombination map files from the 1000 Genomes Project (56) to phase the haplotypes. No reference panels were used because of the peculiarities of evolutionary history and the lack of representation of Native American populations in these panels. We used the ancestral and derived allele information annotated in the 1000 Genomes Project from EPO pipeline (56), according to Ensembl FTP site (http://ftp.ensembl.org/pub/release-71/fasta/ancestral_alleles/homo_sapiens_ancestor_GRCh37_e71.tar.bz2), and removed SNPs without such information so that our final phased and polarized datasets contain 517,984 SNPs.
Genome annotation
We used ANNOVAR (57) and in-house scripts for the genome annotation. We annotated gene symbols to each SNP from both datasets using the RefGene database from ANNOVAR itself, taking 1 Mb as a maximum distance threshold to generate the gene annotation. The SNP IDs were mapped to the publicly available reference from the Axiom Human Origins SNP Array (https://thermofisher.com/br/en/home/life-science/microarray-analysis/microarray-data-analysis/genechip-array-annotation-files.html).
Selection scans
To identify the signature of positive selection, we applied methods relying on extended haplotype homozygosity (XP-EHH) and population differentiation (PBS) as described below. We used the phased and polarized dataset to apply Cross Population Extended Haplotype Homozygosity (XP-EHH) statistics, using the rehh R package (22, 58). P value normalization was performed using a negative log 10 transformation. We structured the XP-EHH analysis using the Native Amazonians as the focal population and the Mesoamericans or East Asians in the comparative analysis.
We also applied the PBS as published by Yi et al. (59). Before calculating the PBS values, we removed all monomorphic sites in at least two of the three analyzed populations, as well as the SNPs with MAF lower than 5% when considering the three populations altogether. We structured our analysis by placing the Native Amazonians as the focal population, Mesoamericans as the sister population, and East Asians as the outgroup. We then applied a rolling-mean approach (window size = 20, steps = 5) to reduce the noise and avoid spurious outliers. For each window, only the SNP with the highest PBS value and its related annotation are reported.
In addition, assuming that selective sweeps leave signatures in adjacent genomic regions, we took a gene approach where we summarized the results by taking the mean of all the SNP PBS values of each gene. We considered as significant those SNPs or genes located in the extreme 99.9th percentile of the distribution (empirical P value from right tail ≤ 0.001). Candidate genes for positive selection were accessed by overlapping XP-EHH greater than 2 (Amazonians × East Asians and Amazonians × Mesoamericans) and PBS analysis upper percentile of distribution (99.5th percentile).
Demographic model and simulation
The demographic model (fig. S3A) was built assuming a standstill in Beringia at 26,000 yr B.P., the settlement of America at 15,000 yr B.P., and divergence between Central and South American populations at 13,000 yr B.P. The effective population size and its changes over time were based on the Castro e Silva et al. (60) dataset inferred through ASCEND (61) and IBDNe (62) softwares. We assume a mutation rate of 1 × 10 −8 and a generation time of 25 years. Demographic simulations for single SNPs were performed using the ms software (63). We used 10,000 simulations according to this scenario to generate a null distribution of the PBS values to which observed PBS values were compared. This procedure allowed us to evaluate the significance of the differences and get the outlier PBS values.
Selection time and intensity
First, to deal with the ascertainment bias of the array data, we imputed [IMPUTE4 (64)] SNPs from chromosome 4 using the HGDP Whole Genome Sequence data as the reference panel (65). Next, 500 kb flanking SNPs adjacent to each side of the candidate SNP (rs2659540) were selected, totaling a 1-Mb genomic region. On the basis of the SNPs observed in this region, DAF (derived allele frequency) and iHS (integrated haplotype score) were inferred (58).
To assess when and the intensity of the selective event in the candidate SNP, we used the msms software (66), with the demographic model described above, assuming that the allele already segregated in populations with a frequency of 0.25 (basal frequency observed in East Asian) and assuming uniform prior distribution for time periods (1000 to 12,500 years) and selection coefficient (0 to 0.2) to simulate 10,000 sequences of 1 Mb each. From the simulated sequences, we selected only the markers in positions also observed in the real data. With this subset of SNPs, the summary statistics DAF and iHS were inferred.
Using the ABC approach, implemented in the abc R package [v.2.2.1 (67)], the parameters that produced the simulated data are accepted or rejected according to the Euclidean distance (ε) between observed and simulated summary statistics. As a threshold for the rejection algorithm, for each simulated model, we retained 1% of the top simulations with Euclidean distance between observed and simulated summary statistics closer to zero. The posterior distribution was estimated on the basis of these retained simulations.
Overrepresentation and gene set enrichment analyses
We performed both overrepresentation analysis (ORA) and gene set enrichment analysis (GSEA) on our selection scans’ results using the GWAS Catalog (https://ebi.ac.uk/gwas/), KEGG Pathway (https://genome.jp/kegg/pathway.html), and GO (http://geneontology.org/docs/go-enrichment-analysis/) databases. We used WebGestalt (http://webgestalt.org/) to apply GSEA, and FUMA GWAS (https://fuma.ctglab.nl/), Enrichr (https://maayanlab.cloud/Enrichr/), and GO for ORA. The GWAS Catalog database was analyzed using both FUMA GWAS and Enrichr, because FUMA GWAS requires a custom gene background and Enrichr does not. We also used Enrichr alongside GOATOOLS to analyze ORA in the GO database. Last, we used WebGestalt to address GSEA in KEGG. For ORA, we selected genes mapped to SNPs with scan values above the 99.99th, 99.95th, and 99th percentiles, while we used all genes with mean scores for GSEA. Threshold values were set at the minimum necessary to return robust results in the enrichment analysis.
To verify the results found, we used the R package GOfuncR (68), which uses random sets to compute the family-wise error rate (FWER; probability of having at least one false positive when multiple comparisons are being tested), performing a hypergeometric test, taking the gene length and spatial clustering of genes (2-Mb physical distance) in account. We then used a threshold FWER of A) and Chagas disease, we performed two analyses. In the first, from the geographic coordinates of each population analyzed on the present study, we used data from geographic distribution models of the number of Chagas disease vectors (Triatomine species), described by (13) and (72) and correlated with the frequency of the putatively selected rs2659540 G>A in different geographic regions. In the second analysis, we correlated data on the incidence rate (cases per 100,000 population) of Chagas disease in different regions of Brazil (Sistema de Informação de Agravos de Notificação SINAN, Brazilian Ministry of Health) and Mexico (73), between the years 2003 and 2013, and the frequency of the candidate allele rs2659540 G>A. Correlation analysis was performed in R language with the stats package.
Acknowledgments
We thank C. E. G. Amorim and M. C. Bortolini for comments and suggestions, O. Lao for his help in the analysis, and T. Ferraz for graphic assistance. We also thank all the native communities who participated in the study.
Funding: This work was supported by São Paulo Research Foundation (FAPESP) grants 17/14916-8 (to C.M.C.-S.), 15/26875-9 (T.H.), and 19/11821-1 (to G.V.), NHLBI R01HL133165 (to A.P.), and NIH grants R01 GM075091(to K.N.) and 5U19AI098461 (to A.P.).
Author contributions: Conceptualization: T.H. Methodology: C.M.C.-S., K.N., L.V.P., and A.P. Investigation: C.M.C.-S., K.N., G.V., and T.H. Funding acquisition: D.C., A.P., and T.H. Project administration: T.H. Supervision: T.H. Writing—original draft: K.N., M.A.C.e.S., L.V.P., A.P., and T.H. Writing—review and editing: T.H.