Systems biology is the study of an organism as a single system. Instead of analysing the individual components of a cell one at a time, the cell is viewed as an interacting network of genes, proteins and biochemical reactions, and these are studied as a whole. In the 20th century the focus was on molecular biology. A ‘reductionist’ approach was followed, in which individual components, such as the cell nucleus or sugar metabolism, were studied in isolation. We have since progressed to an era in which systems biology plays a leading role. A ‘holistic’ approach is followed, in which the components and their interactions are studied simultaneously. These cellular interactions are ultimately responsible for an organism’s form and function. For example, the role of the human immune system is not defined by one single cellular component or mechanism. Instead, it is composed of numerous genes, proteins, cells and mechanisms which work together to produce a response and fight off pathogens and disease.

As science progressed in recent years, tools and technologies were developed which allowed us to examine the foundations of biological activity: genes and proteins. It was learnt that these fundamental cellular components rarely act alone; they interact with each other or with other complex molecules. The systems approach looks at:
· The parts that make up the system
· How these parts interact
· The placement of these interactions in space and time, i.e. where and when they occur

The technologies used for systems biology are high throughput in nature. The ‘omics’ technologies provide information on the parts of the system. These include genomics (high-throughput DNA sequencing), transcriptomics (gene chips, microarrays), proteomics (mass spectrometry, 2D-PAGE, protein chips, yeast two-hybrid) and metabolomics (NMR, X-ray). These technologies are still the focus today, and the real challenge of systems biology is integrating all the ‘omics’ data. Once this has been done, the result is a model on which perturbation experiments can be carried out. Currently, a common problem is that too much of the ‘omics’ data is qualitative. Ideally, one should be able to quantify the data obtained from ‘omics’ technologies. The holy grail of systems biology is a quantitative model which is able to predict the response of an entire system to any perturbation. Hopefully, with continued effort and new discoveries, this will eventually be achieved.
The fundamental basis of genomics is determining the DNA sequence of the entire genome of an organism. The most common way of determining a genome sequence is to fragment the DNA, clone the fragments into a suitable vector, determine the DNA sequence of each fragment (i.e. which bases make up the sequence), assemble the sequences into a single large molecule and fill in the gaps. There are a number of strategies for carrying out the first steps of fragmenting and cloning the DNA.

1. Production of an ordered library
In summary, DNA is fragmented into smaller pieces in a predictable manner and order, and the pieces are cloned into a large, suitable vector. They are then subcloned into multicopy vectors, so that you know exactly what you are sequencing. This helps later in assembly.

Large fragments of DNA are cloned into a large vector. A vector commonly used for this is a Bacterial Artificial Chromosome (BAC). It is a plasmid constructed from the functional fertility plasmid (F-plasmid) of E. coli. It possesses a number of regulatory genes which originated from the F-plasmid, including oriS, which mediates unidirectional replication, and parA, which is involved in plasmid partitioning and helps to maintain the copy number at one or two. A low copy number is ideal because it prevents recombination between DNA fragments in the vector. Furthermore, the BAC possesses a chloramphenicol resistance marker. It is an extremely useful tool due to its ease of manipulation and its ability to stably maintain large fragments of DNA. Generally, it can maintain inserts of up to 300 kb. The host of a BAC is usually a mutant E. coli, which is modified to ensure that the BAC is not destroyed, i.e. its restriction and modification systems are disabled.

The chromosome to be sequenced is cloned into BACs. The whole collection of BACs representing the genome is referred to as a library. The inserts are then subcloned into small multicopy vectors for sequencing. The result is an ordered library of overlapping clones; a set of overlapping clones covering a continuous stretch of DNA is referred to as a contig. The entire genome sequence is determined by placing these contigs in the correct order. In other words, the sequences are assembled into a consensus sequence, which gives the most abundant residue in the alignment at each position. Sequencing proceeds by chromosome walking.

BACs were used for genome projects such as the Human Genome Project.

2. Shotgun sequencing
This method is the most commonly used approach to DNA sequencing. It is rapid and does not require any prior knowledge (e.g. of the order of the fragments). Firstly, multiple copies of the DNA are randomly broken up, usually using a nebuliser. A nebuliser is a pressurised syringe which uses compressed air to atomise compounds, in this case DNA. Another method of random fragmentation is random digestion. In this method, the DNA is cut with a restriction enzyme which recognises a short sequence that occurs frequently in the DNA. Each of the random fragments is then cloned into a suitable vector, and the inserts are sequenced.

For shotgun sequencing to work, the sequencing itself must be efficient and there must be a high number of clones. Because multiple copies of the genome are fragmented, a very large number of sequenced clones is obtained. Much of the sequencing is redundant, and many of the clones may be identical or nearly identical, but a large number of clones must be sequenced to ensure that all sequences are obtained. This redundancy, referred to as 7-9 fold coverage, greatly reduces the errors in the sequence, because it allows a consensus nucleotide to be selected at any point in the sequence where uncertainty exists.

In contrast to the previous method, the order of the clones is not known. Therefore, a computer programme searches for overlapping sequences, and the individual sequences are assembled by joining overlapping fragments together, generating a genome suitable for annotation. This method of sequencing is very high throughput, i.e. rapid, with genomes being sequenced in a matter of days. However, a drawback of this method is that some areas of the genome are not covered when it comes to assembly, and gaps are left. This is because some genes have a toxic effect on E. coli and therefore cannot be cloned. Also, repetitive stretches of DNA cannot be joined together properly. Filling in these gaps can take longer than sequencing the other 99% of the genome. The strength of shotgun sequencing is therefore its ability to work in the absence of a genetic or physical map, while its main limitation is the data analysis required for identifying overlaps.
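The two computational ideas in shotgun assembly, finding overlaps between reads and choosing a consensus base from redundant reads, can be sketched in Python as follows. This is a toy illustration under simplifying assumptions (error-free exact overlaps, reads already aligned for the consensus step), not how a real assembler works; the function names and the minimum overlap of 3 bases are made up for the example.

```python
from collections import Counter


def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge(left, right, min_len=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    k = overlap(left, right, min_len)
    return left + right[k:] if k else None


def consensus(aligned_reads):
    """Most common base in each column; '-' means the read does not cover that position."""
    called = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        called.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(called)


def mean_coverage(n_reads, read_len, genome_len):
    """Average number of times each base of the genome is sequenced."""
    return n_reads * read_len / genome_len
```

For example, `merge("ATGCGT", "CGTAAC")` joins the two reads on their shared `CGT`, and `consensus(["ATGCA", "ATGTA", "ATGCA"])` resolves the disagreement at the fourth position by majority vote. With hypothetical numbers, 40,000 reads of 500 bp from a 2.5 Mb genome give `mean_coverage(40000, 500, 2500000)`, i.e. 8-fold coverage, which falls in the 7-9 fold range mentioned above.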
Pyrosequencing is a relatively new method of DNA sequencing, often referred to as next generation sequencing. It is based on the idea of ‘sequencing by synthesis’. Different fragments of DNA are sequenced simultaneously in a microplate format, with no cloning of DNA required. A primer is annealed to a single-stranded PCR amplicon, which serves as the template. This is incubated with the enzymes DNA polymerase, ATP sulfurylase, luciferase and apyrase, as well as adenosine 5’-phosphosulfate (APS). One of the four dNTPs (deoxyribonucleotide triphosphates) is then added. If it is complementary to the template, the polymerase incorporates the dNTP into the growing strand, with the release of pyrophosphate (PPi). This PPi is converted to ATP by ATP sulfurylase in the presence of APS. The ATP fuels the conversion of luciferin to oxyluciferin by luciferase, which produces light. The light is detected by a charge-coupled device (CCD) chip and is seen as a peak in the raw data output. The height of each peak corresponds to the number of nucleotides incorporated. Since the added dNTP is known, the sequence can be determined. Apyrase, a nucleotide-degrading enzyme, continuously degrades ATP and any unincorporated nucleotides. Once they have been degraded, another dNTP is added. As the process continues, the complementary DNA strand is synthesised and the sequence of nucleotides is determined from the signal peaks in the Pyrogram trace. Currently, 100 million bp can be determined in a 7 hour run. However, a drawback of the method is that the short reads of 100-500 bp make assembly of unknown sequences difficult. Therefore, this method is now primarily used for comparative genome sequencing.

The next step after determining the sequence of a genome is to locate probable genes. Most genes encode proteins, with the exception of those for rRNA, tRNA, and regulatory and structural RNA. Annotation is the process of identifying functional regions within a genome (i.e. genes) and assigning a probable function to them.
The first step in identifying genes is to identify open reading frames (ORFs). A functional ORF is a sequence of DNA which encodes a protein. An ORF is found by looking for a start codon (ATG in DNA, AUG in mRNA) followed in the same reading frame by a termination codon (TAA, TAG or TGA). The majority of cellular proteins consist of at least 100 amino acids, so most functional ORFs are longer than 100 codons. The majority of ORFs are non-functional, i.e. they do not code for proteins.

Gene prediction makes use of the non-randomness of genes compared to non-genes. Firstly, the sequence upstream of a gene is not random. It includes a ribosome binding site (RBS), promoter sequences and a defined spacing between the RBS and the start codon. The RBS contains a specific sequence, known as the Shine-Dalgarno sequence (AGGAGG), and this is looked for when attempting to locate genes. However, only highly expressed genes have a conserved RBS; poorly expressed genes have a poor RBS and non-consensus promoters.

Secondly, the DNA sequence of a gene has a non-random distribution. Genes encode proteins, and proteins are made up of amino acids that determine the protein structure and function. The DNA sequence of a gene is therefore constrained by the demands of the protein. For example, if a protein possesses an ATP-binding motif, this has a conserved amino acid sequence and hence a constrained DNA sequence.

Codon usage by bacteria is biased, especially in highly expressed genes. Codon usage bias refers to differences among organisms in their preference for different codons. Due to the redundancy of the genetic code, a single amino acid can be encoded by a number of triplets. For example, proline has four codons. M. tuberculosis uses CCC and CCG to code for this amino acid, while S. aureus prefers CCU and CCA. How these differences arise is a much debated area of molecular evolution. It is thought that optimal codons help to achieve faster translation rates and higher accuracy.
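The ORF scan and the codon-usage tally described above can be sketched in Python. This is a minimal illustration rather than a real gene finder: it scans only the forward strand, treats the first ATG in a frame as the start, and the function names are invented for the example. The default minimum of 100 codons follows the rule of thumb given above.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=100):
    """Return (start, end, frame) for forward-strand ORFs of at least `min_codons` codons.

    An ORF runs from an ATG to the first in-frame stop codon. `end` is
    exclusive and includes the stop codon; the codon count excludes it.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs


def codon_usage(orf_seq):
    """Count how often each codon occurs in an ORF sequence."""
    return Counter(orf_seq[i:i + 3] for i in range(0, len(orf_seq) - 2, 3))
```

Comparing the `codon_usage` counts of a candidate ORF with the codons the organism is known to prefer (for example CCC and CCG for proline in M. tuberculosis) gives a rough indication of whether the ORF looks like a real gene.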
Codon usage bias is another feature that helps to identify a gene: if an ORF is written in the codons that the organism prefers, the bacterium is most likely using it to produce a protein. Computer programmes have been developed which make use of these non-random distributions within genes. For example, Glimmer3 (Gene Locator and Interpolated Markov ModelER) is a system used to find genes in microbial DNA. It is used by The Institute for Genomic Research (TIGR) as well as the Sanger Institute to annotate the genomes of over 100 bacterial species.

Once genes have been identified by programmes such as Glimmer3, it is necessary to assign a probable function to these putative genes. The sequence is compared to the sequences of known proteins in GenBank. The most popular method of comparing probable proteins with known proteins is the Basic Local Alignment Search Tool, more commonly known as BLAST. This computer programme compares the sequence obtained from the putative gene to known sequences. BLAST can identify conserved domains shared by two sequences, which indicate function, or it can identify a full-length similarity to a known protein.

A BLAST search gives a number of output values. The bit score indicates how good the alignment is: the higher the score, the better. The bit score takes into account the alignment of similar or identical residues, as well as any gaps which have been introduced to align the sequences. The E-value describes how likely it is that the similarity between two sequences occurred by chance: the lower the E-value, the better. When considering these values, it is important to note that BLAST produces local alignments, i.e. it compares partial sequences, so the output values do not reflect the similarity between two complete sequences.

Other results from a BLAST search include the % similarity between two sequences, which counts both identical residues and conservative substitutions. A conservative substitution is one in which an amino acid has been replaced with another of similar chemical function, e.g. Asp and Glu. The % identity indicates how many positions carry identical residues in the two sequences. There is no such thing as % homology: two sequences are either homologous or they are not.

Paralogs and orthologs can be identified from a BLAST search. Paralogs are sequences which share a common evolutionary origin and are present within the same organism. They are most likely the result of gene duplication. Orthologs are sequences which share a common evolutionary origin but are present in different organisms. Although sequences may share a common ancestor, they do not necessarily have the same or similar functions.

Hypothetical proteins can also be identified. These are predicted to be functional but have no counterpart in other species, i.e. they are found in only one species. There is no experimental evidence that they are expressed in vivo. Conserved hypothetical proteins are groups of proteins which are predicted to be functional and are found in two or more species. Again, there is no experimental evidence that they are expressed in vivo.
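The difference between the % identity and % similarity values reported for a BLAST alignment can be made concrete with a short Python sketch. It assumes the two sequences are already aligned (gaps written as "-") and uses a small, illustrative grouping of chemically similar amino acids; that grouping and the function names are assumptions made for this example, whereas BLAST itself scores substitutions with a substitution matrix.

```python
# Illustrative groups of chemically similar amino acids (an assumption for
# this sketch, not the scoring scheme BLAST uses).
SIMILAR_GROUPS = [set("DE"), set("KRH"), set("ST"), set("NQ"), set("AVLIM"), set("FYW")]


def is_conservative(a, b):
    """True if two different residues fall in the same illustrative group."""
    return a != b and any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(seq_a, seq_b):
    """Return (% identity, % similarity) over the columns of a gapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = list(zip(seq_a, seq_b))
    identical = sum(a == b and a != "-" for a, b in pairs)
    conservative = sum(is_conservative(a, b) for a, b in pairs if "-" not in (a, b))
    length = len(pairs)
    return 100 * identical / length, 100 * (identical + conservative) / length
```

For the made-up alignment `MKDLAT-G` / `MKELSTAG`, five of the eight columns are identical and one (Asp aligned with Glu) is a conservative substitution, so the identity is 62.5% and the similarity 75%.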
Further functional investigation of hypothetical proteins

The problem with comparing putative genes to known sequences is that a large proportion (up to 40%) of genes bear no similarity to any known gene. Therefore, in order to obtain more information about the protein, the amino acid sequence should be examined. This can provide clues to the function as well as the structure and localisation of the protein.

Hydrophobicity plots display the distribution of polar and non-polar residues along a protein sequence, and provide a quantitative analysis of the degree of hydrophobicity or hydrophilicity. They plot the position in the amino acid sequence (x-axis) against the degree of hydrophobicity or hydrophilicity (y-axis). Highly hydrophobic segments are generally thought to be membrane-spanning domains. If the protein as a whole has a high level of hydrophobicity, this suggests that it is an integral membrane protein. Signalling proteins often have a high level of hydrophobicity. Meanwhile, if the protein or a segment of the protein is found to be hydrophilic, this suggests that it is surface exposed and potentially antigenic. The secondary structure of the protein can be predicted by analysing the shape of the plot.

Secreted proteins are often identified by the presence of a conserved signal sequence. It is found at the N-terminus of the protein and consists of approximately 15-30 amino acids. It is recognised by a signal recognition particle (SRP) while the protein is still being synthesised on the ribosome, and the protein is transported to the ER. The signal sequence is cleaved upon translocation of the protein. Computer programmes such as SignalP identify signal sequences.

Proteins which have a specific function are often organised into domains. For example, proteins which bind to DNA often have a helix-turn-helix motif. Regardless of whether the overall function of two proteins is the same or different, these domains are conserved. Programmes such as Blocks and InterPro search databases for the occurrence of sequence motifs. A sequence motif is a nucleotide or amino acid sequence pattern that is widespread and has been proved or is assumed to have biological significance. ATP-binding proteins have a specific motif known as the Walker motif (GXXXXGKT/S). This is detected by programmes such as Blocks, indicating that the protein binds ATP.

Using all these different techniques can provide a wealth of information regarding protein function, structure and localisation. For example, the VapA protein was analysed by Blocks and the aforementioned techniques, and it was determined that the protein is:
· an extracellular protein (hydrophobicity plot)
· a secreted protein, based on the fact that it has a signal peptide
  o which is cleaved at a threonine residue
· a potential ATPase
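Two of the sequence analyses described above, the hydrophobicity plot and the search for an ATP-binding motif, can be sketched in Python. The sketch makes several assumptions that are not in the text: it uses the Kyte-Doolittle hydropathy scale, a sliding window of 9 residues, and the standard Walker A consensus G-X(4)-G-K-[T/S] written as a regular expression. The function names are invented for the example.

```python
import re

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Walker A motif: G, any four residues, G, K, then T or S.
WALKER_A = re.compile(r"G.{4}GK[TS]")


def hydropathy_profile(protein, window=9):
    """Mean hydropathy of each full window along the sequence (the y-values of the plot)."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def walker_a_sites(protein):
    """Return the 0-based start positions of Walker A motif matches."""
    return [match.start() for match in WALKER_A.finditer(protein.upper())]
```

Plotting the values returned by `hydropathy_profile` against the window position gives the hydrophobicity plot: sustained positive stretches suggest membrane-spanning segments, and negative stretches suggest exposed, hydrophilic regions. A non-empty result from `walker_a_sites` marks the protein as a candidate ATP-binding protein, which would still need experimental confirmation.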
‘In silico’ analysis to understand biology: example using IupABC

Siderophores are small, high-affinity iron-binding molecules synthesised by bacteria, which use them to acquire iron. Iron is an essential element for growth, and at low iron concentrations the bacterium requires the help of a siderophore to bind it. Three proteins, IupA, IupB and IupC, form a complex which takes the siderophore up into the cell once it has bound iron. The genes for these proteins are contained in the same operon. A mutant (α5) was created which had a mutation inserted into iupA. This prevented all three genes from being transcribed, and hence the proteins were not expressed. Due to the absence of the uptake system, siderophores accumulated outside the cell and the bacteria could not obtain any iron, and therefore could not survive.
A new mutant (α6) was then generated which had an insert in a conserved hypothetical protein gene, iupR. This mutant could not grow at low iron concentrations. The gene encodes a protein which possesses a protease domain and transmembrane-spanning helices. It appears to be contained in an operon with another gene, iupI, which encodes a hypothetical protein with a DNA-binding domain. The predicted connection between the two genes was that the protein encoded by iupR is a membrane-bound protease which cleaves a repressor protein, IupI. It was postulated that when IupR is produced, it cleaves IupI, which is then no longer functional and therefore cannot bind to and repress any genes. However, when iupR was disrupted, the protease was not produced and therefore did not cleave IupI. IupI then repressed a number of genes, which are most likely involved in iron uptake, considering the phenotype of the mutant.

Experiments supported the suggested model. A western blot was carried out using IupI antibodies. If the protein was cleaved, two bands would be expected. In the wild type, there was partial cleavage of IupI. In the α6 mutant, IupI was not cleaved at all and only one band was present. These differences showed that IupR and IupI are functionally linked. In the α5 mutant, IupI was completely cleaved by IupR, indicating that IupR was hyperactive because the organism’s other mechanism of iron acquisition had been disabled.

An overwhelming amount of data has been obtained from genome sequence analysis. It has given an insight into the metabolic capabilities of an organism, how genomes evolved, the similarities and differences between virulent and avirulent strains, and the minimum number of essential genes it takes to make an organism. However, it does not provide answers to numerous other questions, such as which genes are required for specific metabolic activities, virulence and cell differentiation, which genes are involved in oncogenesis, and which genes are expressed in specific organs. In order to obtain answers to these questions, functional analysis is required.
Further Functional Analysis
1. Northern Blot
A northern blot is a technique used in molecular biology to study gene expression by detecting RNA in a sample. The expression of genes under various conditions can be analysed using this method. If mRNA is detected, this means that the gene has been expressed. The level of expression at different developmental stages of the cell, as well as under normal and diseased conditions, are just some of the influences on gene expression which can be analysed.

The mRNA produced by the cell is isolated. It is then applied to an agarose gel, and gel electrophoresis is carried out to separate the mRNA molecules by size. The separated mRNA molecules are transferred to a nylon membrane by a capillary or vacuum blotting system. Once on the membrane, the RNA is immobilised by UV light or heat, which creates a covalent linkage between the RNA and the membrane. A radioactive probe is then added which hybridises to the mRNA. This probe consists of a sequence complementary only to that of the RNA of interest, and therefore it will only bind to this RNA. The probes are labelled with radioactive isotopes, which can be detected with X-ray film.

This method is useful in determining differences in gene expression between different tissues, organs and developmental stages, as well as in examining the effect of different environmental parameters on expression. It provides information on a transcript’s length (size) and its abundance. However, it is quite labour intensive as the process takes some time, and only one probe can be used per hybridisation. It is also difficult to quantify the level of expression using this method.
Quantitative PCR (qPCR) is used to measure the abundance of DNA, i.e. how much of it there is. It relies on the observation that the amount of product from a PCR reaction depends on the amount of template provided. The PCR is run with primers annealing to the gene of interest and to an internal control.

· Real-Time PCR

Real-time PCR involves simultaneous amplification and quantification of a targeted DNA molecule. The formation of PCR product is monitored throughout the reaction. The method follows the principle of traditional PCR, with the difference that as the DNA is amplified, it is quantified in real time as the reaction progresses.

Primers specific for the gene of interest are added to a reaction containing template DNA, polymerase and dNTPs. When new DNA is synthesised, it is double stranded. Fluorescent molecules such as SYBR Green detect this double-stranded DNA (dsDNA) by binding to the minor groove. SYBR Green is only fluorescent when bound to dsDNA, and is therefore only detected in the presence of dsDNA. As the PCR progresses and more DNA is synthesised, the fluorescence intensity increases. This is measured after each cycle, enabling the DNA concentration to be quantified. The use of fluorescent probes is the most reliable and accurate method of quantifying DNA, but it is also expensive.

The results of qPCR are presented in graph form, with the cycle number plotted against the level of fluorescence. An important parameter in qPCR is the Ct (threshold cycle) value. This is the cycle number at which the fluorescence first rises above a set threshold level. The higher the amount of starting DNA, the lower the Ct value.

An example of how qPCR can be used is measuring the expression of iupA. This gene is required by R. equi to acquire iron when there is very little of it in the environment. RT-PCR simply tells us that iupA is not expressed at high levels of Fe2+, whereas it is expressed at low levels. Real-time PCR, however, measures the difference in RNA abundance under the different conditions, showing that the iupA mRNA is 180-fold more abundant under iron-limiting conditions.
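A fold difference such as the one quoted for iupA is commonly derived from Ct values by normalising the gene of interest to the internal control in each condition and comparing the two conditions. The Python sketch below uses the widely used 2^-ΔΔCt calculation, which is an assumption here since the text does not say how the 180-fold figure was obtained. It also assumes that the product doubles every cycle, and the Ct values in the usage note are hypothetical.

```python
def fold_change(ct_target_test, ct_ref_test, ct_target_control, ct_ref_control):
    """Relative expression (test vs control) by the 2^-ddCt calculation.

    Assumes the amount of product doubles every cycle for both the gene of
    interest (target) and the internal control (ref).
    """
    delta_test = ct_target_test - ct_ref_test           # normalise to the internal control
    delta_control = ct_target_control - ct_ref_control
    return 2 ** -(delta_test - delta_control)
```

For instance, if the target reaches the threshold 5 cycles after the internal control under iron limitation but 12.5 cycles after it under iron-rich conditions, `fold_change(20.0, 15.0, 27.5, 15.0)` gives about 181, i.e. the transcript is roughly 180-fold more abundant under iron limitation. This also illustrates why a lower Ct means more starting template: each cycle of difference corresponds to a factor of two.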
Gene disruption is a method of ‘in vitro’ mutagenesis. It involves knocking out a gene and examining the resulting phenotype. In this manner, the function of the gene can possibly be elucidated. The method involves inserting an antibiotic resistance marker into the gene of interest. This disrupts the gene, rendering it non-functional, and the mutants become resistant to that specific antibiotic. The mutant DNA is then transformed into an organism, usually E. coli, which is grown in the presence of the specific antibiotic. Only mutants will grow, and these can then be used in different experiments to examine the effect of the disruption. This method is extremely useful in determining whether a gene is required for, or plays a role in, a particular cellular process.
Other methods of determining whether genes are expressed under various conditions include the use of reporter genes, such as the gene for GFP, to determine whether a gene is transcribed. The reporter gene is fused to the regulatory sequences of the gene of interest. If the gene is transcribed, so is the reporter gene. The reporter produces green fluorescent protein (GFP), so if the gene is transcribed, the colonies appear green under UV light. However, a drawback is that only a few genes can be analysed at one time, so the approach is time consuming and labour intensive.
The methods used in genomics, although successful in what they were trying to achieve, are difficult to use when trying to analyse systems as a whole. Bacterial genomes have approximately 4,000 genes, while the human genome has roughly 20,000 to 25,000 protein-coding genes. The methods described so far are inadequate for measuring gene expression and the effects of mutagenesis in systems of this complexity. New methodologies were required for the global analysis of gene expression. This is what led to the development of gene array systems.

DNA microarrays are simple, comprehensive and high throughput, and the data they provide are consistent. They enable global gene expression profiling, which allows us to determine the locations and conditions under which a gene is expressed. This in turn allows inferences about its function. The principle of these gene array systems is that each gene is represented by a short nucleotide sequence (a probe), which is immobilised on a solid substrate. Labelled nucleic acid molecules hybridise with high specificity and sensitivity to complementary sequences on the substrate, and when they do, the label is detected. This method allows the parallel, quantitative measurement of many different sequences in a complex mixture.
The earliest gene arrays were made manually and contained from 10 to a few hundred probes. The probes representing genes of the genome are spotted onto a nylon filter.

Next, mRNA is isolated and converted into cDNA by reverse transcription. cDNA is used rather than RNA because it is more stable and can be labelled as it is synthesised. The cDNA is labelled with [32P] dATP. The labelled cDNA is then added to the array, where it hybridises to any complementary DNA (genes) on the filter. It is then detected with X-ray film or with a phosphor imaging system. Spots where the cDNA has hybridised appear darker than the other spots. This indicates that the gene represented by that spot is expressed under those specific conditions.

The drawback of this method is that the nylon membrane has a limited capacity, i.e. it can only hold a certain number of spots. Therefore the whole genome could not be represented on a single array, which hindered the analysis of global gene expression. Demands for higher throughput resulted in new, automated techniques.
The next progression in gene array systems was the use of glass slides instead of nylon filters. The glass slides are 1 x 3 inches in size, and PCR fragments or oligonucleotides representing all the genes of a genome are spotted onto the slide by capillary printing using a robotic arrayer. These small slides can hold up to ~20,000 spots.

Glass has a number of advantages over nylon filters. Firstly, it provides an optically flat and uniform surface. The probes can be immobilised onto the surface effectively, including by covalent attachment to the glass, and the hybridisation is robust. Glass is a durable material which can withstand high temperatures and washes of high ionic strength. Glass has a low intrinsic level of fluorescence, meaning that it does not contribute much to background noise, and this allows fluorescent instead of radioactive labels to be used for detection. Two samples can be labelled separately and incubated simultaneously on the slide. Finally, very small volumes (1-40 µl) can be used.

The process involved in using glass microarrays differs very little from that of the traditional method. Firstly, cDNA is synthesised from control and experimental cell lines. These two cDNA samples are then labelled with different fluorescent dyes, such as Cy3 (green) and Cy5 (red). Both cDNA samples are hybridised to the DNA on the array, and the cDNA from each sample competes for hybridisation to a particular probe. The slide is then scanned and the Cy3 and Cy5 signals are overlaid. A computer programme calculates the logarithm of the ratio of Cy5 intensity to Cy3 intensity. If the logarithm of (Cy5/Cy3) is positive, the gene is more highly expressed in the sample labelled with Cy5. If the logarithm of (Cy5/Cy3) is negative, the gene is more highly expressed in the sample labelled with Cy3.

Visually, if a spot appears red, this indicates that mainly the Cy5-labelled cDNA has hybridised there, while if it is green, mainly the Cy3-labelled cDNA has hybridised. If the spot appears yellow, or the logarithm of (Cy5/Cy3) is close to zero, this indicates equal levels of hybridisation of both cDNA samples, i.e. the gene is expressed equally under both conditions.
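The log-ratio calculation for a two-colour array can be sketched in Python as follows. The text does not specify the base of the logarithm, so base 2 is assumed here (one unit then corresponds to a two-fold difference). The cut-off used to call a spot red or green is an arbitrary choice for this sketch, and the function names are invented.

```python
import math


def log_ratio(cy5, cy3):
    """log2(Cy5 / Cy3) for one spot; both intensities must be positive."""
    return math.log2(cy5 / cy3)


def classify_spot(cy5, cy3, threshold=1.0):
    """Label a spot 'red', 'green' or 'yellow' from its log ratio.

    `threshold` (in log2 units) is an arbitrary cut-off chosen for this sketch.
    """
    ratio = log_ratio(cy5, cy3)
    if ratio >= threshold:
        return "red"      # more abundant in the Cy5-labelled sample
    if ratio <= -threshold:
        return "green"    # more abundant in the Cy3-labelled sample
    return "yellow"       # roughly equal in both samples
```

A spot with intensities Cy5 = 800 and Cy3 = 200 has a log ratio of 2 (a four-fold excess in the Cy5-labelled sample) and is called red, while a spot with 300 and 310 is close to zero and is called yellow. Using the logarithm makes up- and down-regulation symmetric: a four-fold excess of Cy3 gives -2.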
Applications of glass microarrays
An example of how this microarray system could be used is to examine the effect of heat shock on gene expression in E. coli. First, mRNA is isolated from cells which have undergone heat shock and from cells which have not. It is converted to cDNA, and one sample is labelled with Cy3 (heat shock) and the other with Cy5 (no heat shock). Both cDNA samples are added to the microarray. If a spot appears green, this indicates that the gene is expressed or upregulated during heat shock. If a spot appears red, this indicates that the gene is expressed at a lower level, or not at all, during heat shock. If a spot appears yellow, this indicates that the gene is equally expressed under both conditions and is not affected by heat shock. When this experiment was carried out, 119 genes were identified as having an altered level of expression under heat shock conditions. Of these 119, 35 were not previously known to be involved in the heat shock response in E. coli. The array provides quantitative information on the relative expression levels of a gene in a test and a control sample.

Another example of the use of a DNA microarray is in R. equi. R. equi is an important pathogen of foals and immunosuppressed humans, as well as other mammals. It produces disease with pulmonary manifestations such as pneumonia. Inside the body, R. equi infects macrophages and kills them. An experiment was designed to examine the genes involved in the virulence of R. equi. The virulence plasmid was taken from a virulent and an avirulent strain of R. equi. The plasmids were 99.9% identical, differing in one specific region. It was postulated that this region was a pathogenicity island, considering that it was flanked by transposon resolvase genes and had a lower GC content than the rest of the plasmid. A DNA microarray representing the entire plasmid was constructed. cDNA was then synthesised from the virulent and avirulent strains and labelled separately. The analysis showed that the genes present in the pathogenicity island were upregulated when the pathogen infected the macrophage, indicating that they do contribute to the virulence of R. equi.
In more recent years, new technologies in gene array systems have come to the forefront. Firstly, Affymetrix use a unique ‘in silico’ system to build a gene chip. Two techniques are used in the synthesis of the chip-photolithography and solid-phase DNA synthesis. In summary, a solid substrate made of silica is photo-activated, allowing for the attachment of nucleotides. A nucleotide is then photoactivated and a chemical linkage occurs between the nucleotide and the solid substrate. A second nucleotide is then photoactivated and attached to the first. Repeated cycles of photoactivation and chemical linkage allow one to ‘build’ an oligonucleotide. To go into more detail, synthetic linkers are attached to a glass substrate. These linkers are modified with photochemically removable protecting groups. Specific areas of the substrate area exposed to light using a photolithographic mask, leading to localised photodeprotection. Hydroxyl protected nucleotides are then incubated with the substrate and chemical coupling occurs between the nucleotide and sites which are now photodeprotected. The nucleotide is now attached to the glass slide at those specific locations. Light is then directed to different regions of the slide with a different mask, again leading to photodeprotection. The chemical cycle with the nucleotide is repeated. Multiple cycles of photodeprotection using a mask and chemical linkage allow for the synthesis of oligonucleotides in various locations on the slide. Each gene is represented by about 60 oligonucleotides, with different oligonucleotides for each gene. 1.4 million different oligonucleotides can be incorporated into a single Affymetrix chip. This method has a number of advantages. Firstly, it is very high density. There is no need to collect and store clones DNA products, which can be time-consuming. Most importantly, it is very high throughput. 
It allows for the simultaneous, rapid characterisation of thousands of genes and is therefore a useful tool in categorising diseases such as cancer by determining the presence or absence of particular genes. This provides extremely important biological and prognostic information. The drawback, however, is that it is still quite expensive to synthesise a custom array, so this method is not as flexible as the glass slide method.
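The mask-and-couple cycle of light-directed synthesis described above can be illustrated with a small simulation. This is only a toy sketch under stated assumptions, not the actual Affymetrix process: the function name `synthesise_array` and the three probe sequences are hypothetical, and each ‘mask’ is simply modelled as the set of spots whose next required base matches the nucleotide being offered.

```python
def synthesise_array(targets, bases="ACGT"):
    """Toy model of light-directed oligonucleotide synthesis.

    targets: desired probe sequence for each spot (hypothetical data).
    Returns the sequences built on each spot and the number of
    deprotection/coupling cycles that were needed.
    """
    if any(ch not in bases for t in targets for ch in t):
        raise ValueError("targets may only contain the bases " + bases)
    built = [""] * len(targets)
    cycles = 0
    while built != list(targets):
        for base in bases:
            # Mask: expose (deprotect) only spots whose next base is `base`.
            mask = [
                len(b) < len(t) and t[len(b)] == base
                for b, t in zip(built, targets)
            ]
            if not any(mask):
                continue  # no spot needs this base right now
            # Coupling: the nucleotide attaches only at deprotected spots.
            built = [b + base if m else b for b, m in zip(built, mask)]
            cycles += 1
    return built, cycles


probes = ["ACGT", "AAGC", "TTGA"]  # hypothetical probe sequences
built, cycles = synthesise_array(probes)
print(built, cycles)
```

Because every spot is processed in parallel within a cycle, the number of cycles depends on probe length rather than on the number of spots, which is why the approach scales to very high densities.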
In recent years, an alternative to microarrays has been developed: RNASeq. Using this method, RNA is converted to cDNA. The cDNA is sequenced at random, and in doing so the identity of the cDNA (i.e. which part of the genome it belongs to) and the abundance of individual transcripts are determined. This method is dependent upon very high throughput next generation sequencing such as Illumina sequencing. Currently, 1300 Mb can be sequenced per run, costing approximately €4 per million bp. In contrast to arrays, where it needs to be known what gene each spot represents, no prior assumptions are needed to analyse the transcriptome. In other words, a design is not required, similar to the shotgun approach of DNA sequencing. The basic techniques used in RNASeq technology are bridge PCR, which generates templates ‘in situ’ for sequencing, and sequencing by synthesis.
Bridge PCR
1. DNA is randomly fragmented and adaptors are ligated to each end
2. These single-stranded fragments are randomly bound to the inside surface of a flow-cell channel
3. Unlabelled nucleotides and a polymerase enzyme are added to initiate solid-phase bridge amplification
a. Nucleotides are incorporated to build double-stranded bridges on the solid-phase substrate
4. The dsDNA is denatured, leaving single-stranded templates attached to the substrate
5. Steps 3 and 4 are repeated millions of times
a. Several million dense clusters of dsDNA are generated in each channel of the flow cell
Sequencing by Synthesis
1. The first sequencing cycle begins with the addition of fluorescently labelled reversible terminator nucleotides, primers and DNA polymerase
2. The flow cell is subjected to laser excitation, leading to the emission of fluorescence from each cluster and determination of the first base
3. Sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time
4. The data are aligned and compared to a reference, and sequence differences are identified. Reads are usually 35-100 bp long.
(This method helps to identify mRNA (mono- and polycistronic), non-coding RNA, antisense RNA and cis-acting RNA.)
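The final step, in which reads are aligned to a reference to give both transcript abundance and sequence differences, can be sketched as follows. This is a deliberately naive illustration: the function `best_hit`, the two reference transcripts and the reads are all hypothetical, and the ungapped sliding-window comparison stands in for the dedicated alignment software that real pipelines use.

```python
from collections import Counter


def best_hit(read, transcripts, max_mismatches=1):
    """Return (name, offset, mismatch positions) of the best ungapped match.

    Returns None if no window has at most `max_mismatches` mismatches.
    """
    best = None
    for name, seq in transcripts.items():
        for offset in range(len(seq) - len(read) + 1):
            window = seq[offset:offset + len(read)]
            mismatches = [
                offset + i
                for i, (a, b) in enumerate(zip(read, window))
                if a != b
            ]
            if len(mismatches) > max_mismatches:
                continue
            if best is None or len(mismatches) < len(best[2]):
                best = (name, offset, mismatches)
    return best


# Hypothetical reference transcripts and short reads.
transcripts = {
    "geneA": "ATGGCGTACGTTAGC",
    "geneB": "ATGCCCTTTGGGAAA",
}
reads = ["GCGTACG", "TACGTTA", "CCCTTTG", "GCGTTCG", "GGGGGGG"]

counts = Counter()   # reads assigned to each transcript (abundance)
differences = []     # (transcript, position) of each mismatch found
for read in reads:
    hit = best_hit(read, transcripts)
    if hit is None:
        continue  # read could not be placed on any reference
    name, offset, mismatches = hit
    counts[name] += 1
    differences.extend((name, pos) for pos in mismatches)

print(dict(counts), differences)
```

The read counts per transcript play the role that spot intensity plays on an array, while the recorded mismatch positions correspond to the sequence differences identified against the reference.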
Applications of GeneChip Technology and RNASeq
· Detection of mutations
o Two genes, BRCA1 and BRCA2, are known to predispose to breast and ovarian cancer. They each contain more than 80,000 bp, making them 8 times larger than the average gene. In each of these genes, there are more than 600 mutations which can increase a person’s risk of breast cancer. Screening can be done to identify mutations in either of these genes. It is usually limited to women who are at risk of developing either type of cancer based on their family history. Using gene array technology, this screening can be carried out quickly and efficiently.
· Detection of polymorphisms
o When microarrays are used to detect mutations or polymorphisms, the target is usually a single gene. One type of sequence variation commonly targeted in microarrays is a single nucleotide polymorphism (SNP). This is a small genetic variation that can occur within the DNA sequence. SNPs have been proposed as a basis for predicting traits such as IQ and behaviours such as alcoholism.
· Analysis of gene expression
o Response to growth conditions
o Response to therapeutics/different drugs → novel method for identifying new
· Identification of pathogens
o Identification of species and resistance factors is of critical importance in biology, medicine and agriculture
o Techniques allow for comprehensive and unbiased analysis of viral prevalence in a given biological setting
o Take sample of microbes from intestinal tract
o Isolate DNA
o Probe for gene specific to species of interest
These applications are only possible because a very large number of sequences can be analysed, the methods used are low cost and rapid due to automation, and chips can be custom designed to suit the requirements of the experiment.
Transcriptomics and genomics technologies have enabled numerous advances in recent years. Expression profiling using DNA arrays provides a wealth of knowledge by showing which genes are expressed or regulated in response to different stimuli. However, it does not provide any clues as to the effect of the stimulus on protein modification or stability, nor does it give any insight into alternative splicing. To answer these questions, the proteins themselves must be examined. Proteomics is therefore the study of the proteome, the entire ensemble of proteins within a cell. It is important to note that while the genome of an organism is essentially constant, the proteome is not. Proteins can undergo numerous modifications at various stages. When the Mycobacterium tuberculosis genome was analysed, a wealth of information was obtained. A BLAST search identified two interesting ORFs which bore a resemblance to tyrosine phosphatases found in other organisms, mainly eukaryotes. Tyrosine phosphatases are enzymes responsible for the removal of phosphate groups from tyrosine residues. Reversible phosphorylation of tyrosine residues is a key mechanism for the transduction of signals that regulate a number of cellular activities, including cell growth and differentiation, motility, metabolism and survival. While tyrosine signalling has been characterised in a number of bacterial species, it has not been identified in M. tuberculosis to date. So the question was: what were these proteins doing in this bacterium? Further questions arose as to what the proteins’ role in virulence was, what macrophage proteins they interacted with and what proteins were modified by these phosphatases. It was speculated that they may have been interfering with tyrosine signalling in human cells once the pathogen had infected them. However, to prove this theory and provide an answer to all the other questions, analysis of the proteins themselves was required. Proteomics is dependent upon 2-dimensional gel electrophoresis.
Using this technique, proteins are separated by two properties in two dimensions on gels. In the first dimension, proteins are separated based on their isoelectric point (pI), a process known as isoelectric focusing (IEF). The pI is the pH at which the protein has no net electrical charge. The separation is carried out in a small tube containing a gel with a pH gradient. An electric potential is applied to the gel, making one end more positive than the other. At any pH other than the isoelectric point, proteins will have a charge. If they are positively charged, they will move toward the more negative pole, whereas if they are negatively charged they will move toward the more positive pole. The proteins move along the gel until they reach their pI, where they accumulate. In the second dimension, proteins are separated based on their size. The tube gel containing the proteins separated in the first dimension is transferred to a denaturing polyacrylamide gel. Sodium dodecyl sulfate (SDS) is the denaturing agent used. This linearises the proteins, and a number of SDS molecules bind to each protein. The number of molecules that bind is proportional to the protein’s length. A protein’s length is roughly proportional to its mass, so the number of SDS molecules that bind is roughly proportional to the mass of the protein. Since SDS bears a negative charge, binding gives all the proteins approximately the same mass-to-charge ratio. Furthermore, it confers a negative charge onto the proteins, therefore enabling migration (the proteins had no net charge after IEF). An electric potential is applied to the gel, but at a 90° angle to the potential used in the first dimension. The proteins migrate toward the positive pole and are separated on the basis of their molecular weight. The polyacrylamide acts as a sieve: larger proteins are retained higher in the gel, while smaller proteins move further down the gel. The resulting pattern from 2D gel electrophoresis is very complex.
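To make the idea of an isoelectric point more concrete, the net charge of a protein can be estimated from the pKa values of its ionisable groups, and the pI found as the pH at which that charge is zero. The sketch below is an approximation under stated assumptions: the pKa values are rounded, textbook-style figures (published scales differ), the function names are hypothetical, and effects of protein folding and modification are ignored.

```python
# Approximate side-chain pKa values (an assumption; published scales differ).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.6, 3.6


def net_charge(sequence, ph):
    """Estimate net charge at a given pH (Henderson-Hasselbalch)."""
    charge = 1 / (1 + 10 ** (ph - N_TERM_PKA))    # protonated N-terminus
    charge -= 1 / (1 + 10 ** (C_TERM_PKA - ph))   # deprotonated C-terminus
    for aa in sequence:
        if aa in POSITIVE_PKA:
            charge += 1 / (1 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1 / (1 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(sequence, tol=1e-4):
    """Find the pH of zero net charge by bisection between pH 0 and 14."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid    # still positively charged: pI lies at higher pH
        else:
            high = mid   # negatively charged: pI lies at lower pH
    return (low + high) / 2


acidic = "DDDEEE"  # hypothetical peptides
basic = "KKKRRR"
print(isoelectric_point(acidic), isoelectric_point(basic))
```

Below its pI a protein carries a net positive charge and above it a net negative charge, which is why each protein stops migrating at a single position in the pH gradient.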
The gels are evaluated by image analysis using systems such as the BioRad Multiimaging System. Typically, these systems quantify proteins and can distinguish between two different protein spots. 2D gel electrophoresis can be used to examine the effect of a stimulus, such as a pathogen or an environmental stimulus, on protein expression and modification. For example, if a protein sample isolated from cells infected with a pathogen is compared to that of uninfected cells, and spots are seen to be present in one and not the other, this indicates that those specific proteins may be involved in the pathogenesis of the bacterium. These proteins can be isolated and further analysed. In principle this can be done by expression profiling, by examining whether the gene which codes for that protein is differentially expressed under the two different conditions. However, RNA expression levels do not always correspond to protein expression levels. As mentioned above, proteins can undergo numerous modifications. A high proportion of proteins undergo post-translational processing. Common modifications include adenylation, phosphorylation, methylation and glycosylation. These can have a dramatic effect on protein catalytic activity, biological activity, stability and interaction with other proteins. They will also affect properties such as isoelectric point, mass and charge, and will therefore alter the position of the protein in a 2D gel. Transcript profiling does not reveal or identify any of these modifications, yet the effect they have on cell function is dramatic. 2D gel electrophoresis can be used to detect these simple modifications. Often two gels which carry different protein samples and have undergone 2D gel electrophoresis are compared. It is important to be able to differentiate between these two gels. To do so, identification of the spots is required.
Traditionally, this was done using radioactive labels, epitope tagging, western blots and N-terminal sequencing. However, all these methods are time-consuming and so are not suited to high-throughput analysis. The method used instead is mass spectrometry (MS). Using this technology, the MW of proteins can be determined to within 1-2 Da. In the process of MS, protein spots are cut out of the gel. The proteins are treated with a protease enzyme such as trypsin, which yields peptide fragments. The mass of these peptide fragments is then measured in a mass analyser such as matrix-assisted laser desorption ionisation-time of flight (MALDI-TOF) or electrospray ionisation-time of flight (ESI-TOF). The masses obtained are then compared to a database of known protein sequences and their peptide masses. By comparison, the protein can be identified. This technique incorporates both proteomics and genomics, and proves to be an extremely powerful combination.
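The database comparison described above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the two ‘database’ proteins, the observed masses and the 0.5 Da tolerance are hypothetical, the function names are invented for this sketch, and trypsin is modelled with the common rule that it cleaves after K or R unless the next residue is P. Real search tools also handle missed cleavages, modifications and statistical scoring.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of a neutral peptide: residues plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def identify(observed, database, tol=0.5):
    """Score each protein by how many observed masses it can explain."""
    scores = {}
    for name, seq in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_peptides(seq)]
        scores[name] = sum(
            any(abs(obs - theo) <= tol for theo in theoretical)
            for obs in observed
        )
    return max(scores, key=scores.get), scores


# Hypothetical database and hypothetical measured peptide masses.
database = {
    "protein_X": "MKTAYIAKQRPLGSR",
    "protein_Y": "GASPVKLLNDERWFFK",
}
observed = [277.15, 665.37, 812.46]
best, scores = identify(observed, database)
print(best, scores)
```

The set of peptide masses acts as a fingerprint: a single mass may match several proteins by chance, but the combination of several matching masses is usually enough to single out one database entry.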
Proteins rarely, if ever, act alone. The majority of the time, they function in complexes. These complexes can range from simple heterodimers to extremely sophisticated assemblies, such as ribosomes and secretion systems. Establishing a network of protein-protein interactions is extremely important. It provides insight into relationships between proteins, it may provide clues to the function of novel proteins through guilt by association, and it helps build a model. A very popular method of identifying protein interactions is the yeast-2-hybrid system. Other methods include co-immunoprecipitation, protein chips and affinity chromatography. The yeast-2-hybrid system is based on the Gal4 transcriptional regulator. Gal4 has two domains, an activation domain (AD) and a DNA binding domain (BD). The premise behind this technology is that downstream reporter genes can be expressed by the binding of a transcriptional regulator onto an upstream activating sequence (UAS). The BD of Gal4 is responsible for binding to the UAS, and the AD is responsible for activating transcription. These two domains can be physically separated and expressed as fusion proteins. In order to activate transcription, the two domains must be brought together. Two plasmids are introduced into yeast. One plasmid encodes a fusion protein containing the BD, the other encodes a fusion protein containing the AD. The BD is fused to a protein referred to as the bait, while the AD is fused to a protein referred to as the prey. These bait and prey proteins are the proteins that are being investigated, i.e. we are trying to determine whether the bait and the prey interact. The fusion proteins are expressed in the yeast. If the bait and the prey proteins interact or bind, the AD is brought to the UAS-bound BD and the downstream gene, which is often a reporter gene such as lacZ, will be expressed. Yeast colonies in which the bait and prey interact will therefore appear blue when grown on X-gal.
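The logic of a yeast-2-hybrid screen can be summarised in a short sketch. Everything here is hypothetical and purely illustrative: the protein names, the set of ‘true’ interactions and the function names are assumptions, and the model ignores the false positives that occur in real screens. The point is simply that the reporter is switched on only for bait-prey pairs that bind.

```python
def reporter_expressed(bait, prey, interactions):
    """lacZ is transcribed only if bait and prey bind, which brings the
    AD (fused to the prey) to the UAS-bound BD (fused to the bait)."""
    return frozenset((bait, prey)) in interactions


def screen(baits, preys, interactions):
    """Return the colony colour on X-gal for every bait/prey combination."""
    return {
        (bait, prey): "blue" if reporter_expressed(bait, prey, interactions)
        else "white"
        for bait in baits
        for prey in preys
    }


# Hypothetical ground truth: which proteins really bind each other.
known = {frozenset(("ProtA", "ProtB")), frozenset(("ProtB", "ProtC"))}
baits = ["ProtA", "ProtB"]
preys = ["ProtB", "ProtC", "ProtD"]

colonies = screen(baits, preys, known)
hits = sorted(pair for pair, colour in colonies.items() if colour == "blue")
print(hits)
```

Collecting the blue colonies from many such pairwise tests gives the edges of a protein-protein interaction network, which is how a screen contributes to the model-building described above.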
This technique of determining protein interactions is quite a popular one. It is cheap, no expensive equipment is required, and it is suitable for high throughput screens. However, it does suffer from a number of drawbacks. It is prone to false-positive results. This can be due to the fact that the proteins being investigated are overexpressed in yeast, and overexpression can result in non-specific interactions. Also, if a mammalian protein is being examined, it could be modified differently in yeast, leading to false results. Another drawback is that only two interacting proteins can be identified, whereas in many cases complexes are composed of numerous proteins. Finally, additional confirmation of the interaction is required. To verify an interaction that has been identified by the yeast-2-hybrid system, the proteins which are suggested to be involved in the interaction are tagged with radiolabelled epitopes by ‘in vitro’ transcription/translation. An epitope is a portion of a molecule to which an antibody binds and can be composed of sugars, lipids or amino acids. Tags used include c-Myc and HA tags. The proteins then undergo co-immunoprecipitation using specific c-Myc or HA antibodies. These antibodies target a protein which is believed to be a member of a larger protein complex. By targeting one member, it is possible to pull multiple members out of solution. The resulting complexes are then analysed by denaturing gel electrophoresis.