Published on 09 May 2000

Historical perspective
High-throughput sequencing is a highly specialised trade, practised in a very limited number of laboratories in the developed world. It can be estimated that a dozen labs contribute over half of the total sequence data currently being deposited in the public databases, with another 50 or so accounting for the bulk of the rest. All of these labs are located in North America, the larger European countries, Australia and Japan. It may thus come as a surprise that the latest entrant in this select club hails from Brazil, and more specifically from the state of São Paulo. São Paulo has a law stating that 1% of the tax revenue collected by the state must be given to an independent agency that supports scientific research, known as FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo). As São Paulo is the richest state in Brazil, this amounts to a considerable sum (USD 250 million in 1998). By law, FAPESP is also forbidden to spend more than 5% of its money on administrative costs. The combination of ample funding and political independence gives the Foundation a lot of freedom to develop innovative scientific programs. In 1997, FAPESP decided that Brazil should not miss out on the scientific and economic opportunities that can be derived from genome sequencing, and should be able to produce its own data, analyse them, and use the results for local scientific projects. To start off, it was decided that a good target organism should be bacterial, and of interest to the local economy. The agency settled on Xylella fastidiosa, a bacterium that infects orange trees, a major source of income in São Paulo, and causes Citrus Variegated Chlorosis. This choice also brought in additional funding from the citrus growers' association (Fundecitrus).
ONSA
A major goal of this first genome project was to bring sequencing technology to as many laboratories as possible, thus propelling them into the genome age.
Therefore, the concept of setting up a single sequencing centre was rejected from the start. Instead, bids were invited from laboratories interested in participating in the project, and those that were selected received equipment (ABI370 sequencers), reagents, and ample technical advice. In total, 30 labs were selected for the Xylella project, dispersed geographically throughout the state of São Paulo. In addition to the sequencing labs, the project steering committee designated a DNA co-ordinator (for the handling and distribution of clones) and a bioinformatics centre. The bioinformatics group, located at the University of Campinas (about 80 km from São Paulo), was made responsible for all of the data handling, from base calling to final assembly verification. The sequencing labs submitted trace files only, and were paid on the basis of the amount of non-vector, high-quality sequence (based on phred scores) that could be extracted from their data. The entire process was automated using Web pages, which enabled the bioinformatics group to keep very close tabs on the daily progress of the project as a whole. The sequencing consortium that emerged from the Xylella project came to be known as ONSA, the Organisation for Nucleotide Sequencing and Analysis. It is not coincidental that onça is the Brazilian name for the jaguar, a slightly smaller but more nimble feline than TIGR or LION. Starting from scratch, i.e. with labs that had never done any sequencing before, ONSA managed to sequence over 90% of the Xylella genome in less than a year. As usual, gap closing and finishing took another year, but the genome is now complete, and was presented at a plant pathogen conference in February 2000. It is notable that this was the first plant pathogen whose genome was sequenced. The consortium is currently sequencing another citrus pathogen, Xanthomonas axonopodis pv. citri, which causes Citrus Canker.
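The payment criterion described above, counting only non-vector bases of high phred quality, can be sketched in a few lines of Python. This is an illustrative sketch and not the consortium's actual pipeline: the cutoff of Q20, the convention of masking vector bases with the letter X, and the function names are all assumptions made for the example. The only fixed point is the definition of the phred score itself, where a score Q corresponds to an estimated base-call error probability of 10^(-Q/10).

```python
def phred_to_error_probability(q):
    """Convert a phred quality score into an estimated base-call error probability."""
    return 10 ** (-q / 10)


def count_billable_bases(bases, qualities, min_quality=20, vector_mask="X"):
    """Count bases that are not vector-masked and meet the quality cutoff.

    min_quality and vector_mask are assumed values chosen for illustration.
    """
    if len(bases) != len(qualities):
        raise ValueError("bases and qualities must have the same length")
    return sum(
        1
        for base, q in zip(bases, qualities)
        if base.upper() != vector_mask and q >= min_quality
    )


# Hypothetical read: the first four bases are masked as vector sequence.
read = "XXXXACGTACGTNN"
quals = [30, 30, 30, 30, 35, 40, 12, 25, 20, 19, 33, 28, 5, 4]

print(count_billable_bases(read, quals))  # 6
print(phred_to_error_probability(20))     # about 0.01, i.e. one error in 100 bases
```

Counting bases rather than reads rewards a lab for trace quality and read length at once, which is presumably why a per-base criterion was attractive for a distributed consortium.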
ORESTES
Once the Xylella project was well under way, and the consortium had proven that it could produce sequence data quickly and efficiently, Andy Simpson of the São Paulo Branch of the Ludwig Institute (the Xylella DNA co-ordinator) started thinking of more ambitious projects to tackle. He had developed a technique for cDNA cloning using low-stringency PCR and applied it to gene discovery in Schistosoma mansoni. He reasoned that the technique, dubbed ORESTES (for Open Reading frame EST Sequencing), could be applied on a large scale to the generation of novel human ESTs. The novelty of the ORESTES approach lies in its use of defined sequence primers, used at very low stringency, to generate a large number of low-complexity cDNA libraries. Because the priming sites for the first and second cDNA strands are distributed randomly along their respective templates, the technique preferentially generates clones coming from the central portions of mRNAs, which are underrepresented in current EST collections. Andy proposed an ambitious new project, aiming to produce 1 million human EST sequences using the ORESTES technique, and thus to add substantially to our knowledge of the transcriptome. The Human Cancer Genome Project (HCGP) was funded jointly by FAPESP and the Ludwig Institute for Cancer Research, for a total of USD 10 million over two years. It uses some of the existing ONSA expertise, but also adds a number of new groups with a stronger interest in human biology and medicine. Technically, it is also different: six MegaBACE high-throughput capillary sequencers have been placed in sequencing centres, each of which is fed by a consortium of geographically clustered labs. Because of the higher complexity of the project, co-ordinators have been designated for tissue sampling, RNA preparation, and library construction, in addition to the overall project and sequencing co-ordination. The HCGP has been a resounding success.
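The bias of ORESTES towards the central portions of mRNAs, described above, can be illustrated with a toy simulation. The model is a deliberate simplification and an assumption of this sketch, not a description of the real protocol: the two priming sites are drawn independently and uniformly along a transcript, the clone is taken to span the region between them, and all names and parameter values are hypothetical. In reality, priming at low stringency depends on the primer and template sequences.

```python
import random


def simulate_coverage(mrna_length=1000, n_clones=20000, n_bins=10, seed=0):
    """Estimate how often each region of a transcript is covered by a clone.

    Each simulated clone spans the interval between two priming sites drawn
    uniformly at random. Returns, for each bin, the fraction of clones that
    cover the bin's midpoint.
    """
    rng = random.Random(seed)
    bin_size = mrna_length / n_bins
    covered = [0] * n_bins
    for _ in range(n_clones):
        a = rng.uniform(0, mrna_length)
        b = rng.uniform(0, mrna_length)
        start, end = min(a, b), max(a, b)
        for i in range(n_bins):
            midpoint = (i + 0.5) * bin_size
            if start <= midpoint <= end:
                covered[i] += 1
    return [count / n_clones for count in covered]


# Print the covered fraction for each tenth of the transcript, 5' to 3'.
for i, fraction in enumerate(simulate_coverage()):
    print(f"bin {i}: {fraction:.3f}")
```

Under these assumptions, a position at fractional distance x along the transcript is covered with probability 2x(1 - x), which peaks at 0.5 in the middle and falls to zero at both ends. This is the opposite of the coverage profile of end-primed EST libraries, which is why the two approaches complement each other.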
The MegaBACE machines were put into production in the summer of 1999, and after the usual teething pains have been running at full blast since last October. They have already produced over 200,000 new EST sequences, about 25% of which do not match any sequences in the public EST databases, and about half of which contain novel sequence information. The collection was used to update the annotation of chromosome 22, where it identified about 100 genes for which there was no previous experimental evidence. It is expected that the project will reach the million mark by the end of the current year; it is already producing data at a much higher rate than NCI's CGAP project. It is also sampling a number of cancer types that were not included in CGAP. The sequences are being deposited in the public databases, as the two funding institutions had pledged to do. The bioinformatics of the HCGP is being handled by a new group at the Ludwig Institute in São Paulo, headed by Sandro de Souza. The group has done an excellent job not only in data management, but also in annotating and databasing the sequences, thus allowing project participants to quickly find sequences based on a number of criteria, including library of origin, annotation class, and similarity to sequences in the public databases. They are also integrating the HCGP sequences into contigs with ESTs already in the public databases, and using their data to complete the annotation of emerging human genome sequences. It is very likely that the São Paulo group will become a member of the Ensembl annotation initiative.
Credit
In two years, Brazil (or at least São Paulo state) has gone from essentially nothing to being one of the larger producers of sequence data in the world. It has done so not by investing massively in a large sequencing facility, but by bringing together a large number of individual labs, many of which are already using these new data and know-how in their own research.
In this way, the genome projects have already had a major impact on Brazilian science. The world has not really taken notice yet, but I would bet that within another year or two ONSA and the HCGP will have achieved the same recognition as TIGR and CGAP. Bioinformaticians and genome scientists take note!