Acknowledgment

I am indebted to Dr. Kiran Kharat, Head of Department, Centre for Advanced Life Sciences, Deogiri College, Aurangabad, for providing me the platform to carry out this study and for his constant encouragement and kindness. I consider myself very fortunate to have worked under the guidance of Mr. Mahadev Jadhav, Assistant Professor, Department of Bioinformatics, Deogiri College, Aurangabad. This project is an outcome of his constant help, timely suggestions and guidance. I offer my sincere thanks to Mr. Deepak Chavhan, Ms. Yogini Pathak and Ms. Anisha for their guidance and help. I also want to thank my friends Abeer Naqvi, Nikki Bhati and Mr. Sachin Thorat for their invaluable help.

Muktar Shaikh

Index

Serial no. Topic
1 Figure legend
2 Abstract
3 Introduction
4 About MARK
5 Algorithms
6 Result
7 Discussion
8 Conclusion
9 Future Aspects
10 References

Figure Legend

Fig. no. Fig. name
1 Strategy For Development
2 Flow Of Program
3 Home Page of MARK
4 D-MARK
5 MARK's File chooser
6 MARK with input
7 GC content by MARK
8 GC content by Standard Tool
9 AT content by MARK
10 Melting Temperature by MARK
11 Melting Temperature by Standard Tool
12 Luciferase Protein Sequence by MARK
13 Luciferase Protein sequence
14 Translated Protein Sequence as input to MARK
15 Hydrophobicity calculated by MARK
16 Hydrophobicity calculated by Standard Tool
17 Hydrophilicity calculated by MARK
18 Iso-electric Point calculated by MARK
19 Iso-electric Point calculated by Standard Tool
20 Molecular weight by MARK
21 Molecular weight by Standard Tool

Abstract:- Biological concepts provide diverse challenges in the field of Computational Biology. Various biological properties need to be explained and computed using mathematical and statistical parameters.
We have attempted to calculate some of these properties using JDK 1.8 and NetBeans 8.0. Our tool is named MARK, meaning Multi Action Research Kit. It is divided into two modules, D-MARK and P-MARK, for DNA and protein respectively. The DNA module computes GC content, AT content, melting temperature and the translated protein sequence. The protein module computes the molecular weight, the iso-electric point, and the hydrophobic and hydrophilic properties of a protein. MARK is a bio-tool that is usable by first-time users and is flexible with respect to the system on which it runs.

Keywords:- MARK, D-MARK, P-MARK, jdk 1.8, NetBeans 8.0, Computational Biology.

Introduction:- Computer-based methods are increasingly used to improve the quality of various biological services. The biological data (genes and proteins) produced by biological and medical research is immense and requires software professionals to mine it for new knowledge. Combining the programming concepts of Java with an understanding of a wide range of biological concepts opens a new career challenge for many IT professionals. This project is an attempt to do so.

In molecular biology and genetics, GC-content (or guanine-cytosine content) is the percentage of nitrogenous bases on a DNA molecule that are either guanine or cytosine (out of four possible bases, the others being adenine and thymine) [1]. It may refer to a specific fragment of DNA or RNA, or to the whole genome. When it refers to a fragment of the genetic material, it may denote the GC-content of part of a gene (domain), a single gene, a group of genes (gene cluster), or even a non-coding region. G (guanine) and C (cytosine) pair through specific hydrogen bonding [2]. The GC pair is bound by three hydrogen bonds. DNA with high GC-content is more stable than DNA with low GC-content.
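The GC-content definition above can be sketched in a few lines of Java. This is only an illustration: the report does not show MARK's source code, so the class name `GcContentSketch`, the method `gcPercent`, and the decision to ignore any character other than A, C, G, T or U are assumptions made for this example.

```java
// Hypothetical helper; MARK's own implementation is not shown in the report.
final class GcContentSketch {

    /** Returns the percentage of G and C among the A/C/G/T/U bases of seq. */
    static double gcPercent(String seq) {
        int gc = 0;
        int total = 0;
        for (char c : seq.toUpperCase().toCharArray()) {
            if (c == 'G' || c == 'C') {
                gc++;
                total++;
            } else if (c == 'A' || c == 'T' || c == 'U') {
                total++;
            }
            // Assumption: whitespace, digits and ambiguity codes are skipped.
        }
        if (total == 0) {
            throw new IllegalArgumentException("no A/C/G/T/U bases in input");
        }
        return 100.0 * gc / total;
    }

    public static void main(String[] args) {
        // Four of the eight bases are G or C.
        System.out.println(gcPercent("GGCCAATT")); // prints 50.0
    }
}
```

Counting only recognised bases in the denominator keeps line breaks or FASTA-style numbering in pasted input from distorting the percentage.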
Because GC pairs contribute more hydrogen bonds between the two strands, GC-content affects the stability of DNA, and the melting temperature of DNA depends on it.

In molecular biology and genetics, AT-content is the percentage of nitrogenous bases on a DNA molecule that are either adenine or thymine, out of the four possible bases [3]. It may refer to a specific fragment of DNA or to a genome. Adenine and thymine are joined by two hydrogen bonds. Because there are only two bonds, AT-content also affects the stability and the melting temperature of DNA, making it less stable [3]. AT regions play an important role in the genome or DNA sequence of an organism. Repetitive (AT-rich) sequences protect genes or chromosomes from restriction enzymes.

The melting temperature (Tm) of DNA is the temperature at which half of the double-stranded DNA has separated into single strands. The Tm of DNA depends on its nitrogenous base composition. If DNA has a high GC content, it is more stable and its melting temperature is higher. If it has less GC and more AT content, the DNA is less able to stabilize its structure and the melting temperature is lower.

RNA transcribed from DNA attaches to a ribosome, where the information contained in the RNA is translated into a corresponding sequence of amino acids to form a new protein molecule. The translation process begins at the start codon (AUG) on the mRNA and runs until one of the stop codons is reached (UAG, UGA and UAA on the mRNA, corresponding to TAG, TGA and TAA in the DNA sequence) [4]. A codon is the transcribed base triplet coding for a particular amino acid. For example, the base triplet AUG codes for the amino acid methionine. In other words, methionine is always the first amino acid in a growing polypeptide. For detailed information on codons and amino acids, refer to the standard genetic code table. The algorithm involved in translating an RNA sequence to protein is much more complicated than the transcription process.
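The melting-temperature and translation steps described above can be sketched as follows. The report does not state which Tm formula MARK uses, so the Wallace rule (2 degrees C per A or T, 4 degrees C per G or C, intended for short oligonucleotides) is swapped in here as an assumption. The class and method names are hypothetical, and the translation simply starts at the first ATG and stops at the first in-frame stop codon of the standard genetic code.

```java
// Hypothetical sketch; not MARK's actual code.
final class DnaTranslationSketch {

    private static final String BASES = "TCAG";
    // Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
    // '*' marks a stop codon.
    private static final String AMINO =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /** Wallace rule (assumed formula): Tm = 2*(A+T) + 4*(G+C), in degrees C. */
    static int wallaceTm(String dna) {
        int tm = 0;
        for (char c : dna.toUpperCase().toCharArray()) {
            if (c == 'A' || c == 'T') {
                tm += 2;
            } else if (c == 'G' || c == 'C') {
                tm += 4;
            }
        }
        return tm;
    }

    /** Maps one DNA codon to a one-letter amino acid, '*' for stop, 'X' if unknown. */
    static char translateCodon(String codon) {
        int i1 = BASES.indexOf(codon.charAt(0));
        int i2 = BASES.indexOf(codon.charAt(1));
        int i3 = BASES.indexOf(codon.charAt(2));
        if (i1 < 0 || i2 < 0 || i3 < 0) {
            return 'X';
        }
        return AMINO.charAt(16 * i1 + 4 * i2 + i3);
    }

    /** Translates from the first ATG up to, but not including, the first stop codon. */
    static String translate(String sequence) {
        // Accept mRNA input by mapping U to T, then drop anything that is not a base.
        String s = sequence.toUpperCase().replace('U', 'T').replaceAll("[^ACGT]", "");
        int start = s.indexOf("ATG");
        if (start < 0) {
            return "";
        }
        StringBuilder protein = new StringBuilder();
        for (int i = start; i + 3 <= s.length(); i += 3) {
            char aa = translateCodon(s.substring(i, i + 3));
            if (aa == '*') {
                break;
            }
            protein.append(aa);
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGAAATAG")); // prints MK
        System.out.println(wallaceTm("ATGC"));      // prints 12
    }
}
```

Storing the codon table as a 64-character string indexed by base position keeps the sketch short. A map from codon to amino acid would be equally valid and perhaps easier to read.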
The hydrophobic effect is the observed tendency of non-polar substances to aggregate in aqueous solution and exclude water molecules [5][6]. More specifically, the term applies when the affinity of an apolar small molecule or moiety for the aqueous phase shows a particular temperature dependence. The word hydrophobic literally means "water-fearing", and it describes the segregation and apparent repulsion between water and non-polar substances. At the molecular level, the hydrophobic effect is important in driving protein folding [7][8], the formation of lipid bilayers and micelles, the insertion of membrane proteins into the non-polar lipid environment, and protein-small molecule interactions [9][10]. Substances for which this effect is observed are known as "hydrophobes".

In the case of protein folding, the hydrophobic effect is important for understanding the structure of proteins in which hydrophobic amino acids, such as alanine, valine, leucine, isoleucine, phenylalanine, tryptophan and methionine, are clustered together within the protein. Water-soluble proteins have a hydrophobic core in which side chains are buried away from water, which stabilizes the folded state, while charged and polar side chains are situated on the solvent-exposed surface, where they interact with the surrounding water molecules. Minimizing the number of hydrophobic side chains exposed to water is the principal driving force behind the folding process [11], although the formation of hydrogen bonds within the protein also stabilizes protein structure [12].

Hydrophobic interaction chromatography (HIC) is a key technique for protein separation and purification. Different methodologies for estimating the hydrophobicity of a protein have been related to the chromatographic behavior of proteins in HIC. These methodologies consider either knowledge of the three-dimensional structure or the amino acid composition of proteins.
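A composition-based estimate of the kind mentioned above can be sketched as the percentage of hydrophobic residues in a sequence. This is a minimal illustration under stated assumptions: the hydrophobic set is limited to the seven amino acids named in the text (A, V, L, I, F, W, M), and the class name `HydrophobicitySketch` is hypothetical. MARK and online tools may classify residues differently, so their percentages need not match this sketch.

```java
// Hypothetical sketch; the residue set is an assumption taken from the text above.
final class HydrophobicitySketch {

    // Ala, Val, Leu, Ile, Phe, Trp, Met in one-letter code.
    private static final String HYDROPHOBIC = "AVLIFWM";

    /** Returns the percentage of residues that belong to the hydrophobic set. */
    static double hydrophobicPercent(String protein) {
        int hydrophobic = 0;
        int total = 0;
        for (char c : protein.toUpperCase().toCharArray()) {
            if (c < 'A' || c > 'Z') {
                continue; // skip spaces, digits and stop markers
            }
            total++;
            if (HYDROPHOBIC.indexOf(c) >= 0) {
                hydrophobic++;
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("no residues in input");
        }
        return 100.0 * hydrophobic / total;
    }

    public static void main(String[] args) {
        // A and A are hydrophobic, K and K are not.
        System.out.println(hydrophobicPercent("AAKK")); // prints 50.0
    }
}
```

Under the same assumption, a hydrophilic percentage could be computed in the same way from a second residue set.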
Despite some restrictions, these methodologies have proven useful in predicting protein retention time in HIC.

Hydrophilic molecules are polar and can join the hydrogen-bond network that polar water molecules form, so they dissolve in water. Hydrophobic molecules do not dissolve, because the hydrogen-bond network forces non-polar molecules away.

The isoelectric point (pI, pH(I), IEP) is the pH at which a particular molecule or surface carries no net electrical charge. The standard nomenclature for the isoelectric point is pH(I), although pI is also commonly seen. The net charge on the molecule is affected by the pH of its surrounding environment, and the molecule can become more positively or negatively charged through the gain or loss, respectively, of protons (H+). The pI value can affect the solubility of a molecule at a given pH. Such molecules have minimum solubility in water or salt solutions at the pH that corresponds to their pI and often precipitate out of solution. Biological amphoteric molecules such as proteins contain both acidic and basic functional groups. The amino acids that make up proteins may be positive, negative, neutral or polar in nature, and together they give a protein its overall charge. At a pH below their pI, proteins carry a net positive charge; above their pI they carry a net negative charge. Proteins can thus be separated according to their isoelectric point (overall charge) on a polyacrylamide gel using either QPNC-PAGE or a technique called isoelectric focusing, which uses a pH gradient to separate proteins. Isoelectric focusing is also the first step in 2-D polyacrylamide gel electrophoresis. Amphoteric molecules called zwitterions contain both positive and negative charges, depending on the functional groups present in the molecule.
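The relation between pH, net charge and pI described above can be sketched numerically. The example below sums the Henderson-Hasselbalch charge of each ionizable group and finds by bisection the pH at which the net charge is zero. It is only a sketch: the pKa values are illustrative round numbers chosen for this example (published tables differ, and the report does not say which values MARK uses), and the class and method names are hypothetical.

```java
// Hypothetical sketch; pKa values are illustrative assumptions, not MARK's.
final class IsoelectricPointSketch {

    // Fraction of a basic group that is protonated (charge +1) at the given pH.
    private static double positive(double pH, double pKa) {
        return 1.0 / (1.0 + Math.pow(10.0, pH - pKa));
    }

    // Fraction of an acidic group that is deprotonated (charge -1) at the given pH.
    private static double negative(double pH, double pKa) {
        return 1.0 / (1.0 + Math.pow(10.0, pKa - pH));
    }

    /** Net charge of the peptide at the given pH. */
    static double netCharge(String protein, double pH) {
        // Free N-terminus (assumed pKa 9.0) and free C-terminus (assumed pKa 2.0).
        double charge = positive(pH, 9.0) - negative(pH, 2.0);
        for (char c : protein.toUpperCase().toCharArray()) {
            switch (c) {
                case 'K': charge += positive(pH, 10.5); break;
                case 'R': charge += positive(pH, 12.5); break;
                case 'H': charge += positive(pH, 6.0);  break;
                case 'D': charge -= negative(pH, 3.9);  break;
                case 'E': charge -= negative(pH, 4.1);  break;
                case 'C': charge -= negative(pH, 8.3);  break;
                case 'Y': charge -= negative(pH, 10.1); break;
                default: break; // residues without an ionizable side chain
            }
        }
        return charge;
    }

    /** Bisection on pH 0..14: the net charge falls as pH rises. */
    static double isoelectricPoint(String protein) {
        double lo = 0.0;
        double hi = 14.0;
        for (int i = 0; i < 60; i++) {
            double mid = (lo + hi) / 2.0;
            if (netCharge(protein, mid) > 0) {
                lo = mid; // still positive, so the pI lies at a higher pH
            } else {
                hi = mid;
            }
        }
        return (lo + hi) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(isoelectricPoint("ACDKHEY"));
    }
}
```

A lysine-rich peptide comes out with a pI above 7 and an aspartate-rich peptide with a pI below 7, in line with the statement that proteins are positively charged below their pI and negatively charged above it.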
For an amino acid with only one amine and one carboxyl group, the pI can be calculated as the mean of the two pKa values of the molecule [6]. For amino acids with more than two ionizable groups, such as lysine, the same formula is used, but the two pKa values used are those of the two groups that lose and gain a charge from the neutral form of the amino acid. Lysine has a single carboxylic pKa and two amine pKa values (one of which is on the R-group), so fully protonated lysine has a +2 net charge. To reach a neutral charge, lysine must be deprotonated twice, and therefore the R-group and amine pKa values are used (these are tabulated in standard lists of amino acid properties).

Java is a powerful object-oriented programming language that has dominated many other programming languages for more than a decade. It is well designed and is available through many technologies for software development, such as Java Swing, Java Beans, Java Applets, Java Web Start, Java Database Connectivity (JDBC), Java Server Pages (JSP) and Java 2 Enterprise Edition (J2EE). Beyond its use in the IT sector, the language is prominent in newly emerging fields, including bioinformatics and computational biology. Java has been the language of choice for introductory computing science courses at many universities and for project and product development at many companies. Java inherits many of its object-oriented programming concepts from C++, a direct descendant of C.
Even though Java derives many of its features from C and C++, it has many significant practical and philosophical differences [13]. The language is platform independent (architecture-neutral) and supports the development of both online and offline applications. It is also well known for security and portability. Java achieves security by confining a Java program to the Java execution environment and not allowing it access to other parts of the computer. Likewise, it supports the portability of executable code across the internet, where many types of computers and operating systems are in use throughout the world. In other words, Java's solution to the two major problems of security and portability is both elegant and efficient. The key to solving both problems is that the output of a Java compiler is not native executable code but bytecode. During execution, the Java run-time system, called the Java virtual machine (JVM), interprets the bytecode to produce the output [14]. The major advantages of developing software in Java are that it offers a simple, object-oriented, robust, multithreaded, high-performance, distributed and dynamic way of programming.

Apart from general software development, Java has made its way into the computational design and development of many core concepts of molecular biology. Bioinformatics and computational biology are the two major research fields that merge the concepts of biology and software programming. A major driving force behind research in biology and medicine over the past decade or two has been the availability of huge amounts of DNA and protein data. The Human Genome Project is one of the major initiatives [15]. It led to massive improvements in the efficiency of DNA sequencing and inspired numerous other genome projects.
The National Center for Biotechnology Information (NCBI) and the National Institutes of Health (NIH) in the US, the European Bioinformatics Institute (EBI) in the UK, the European Molecular Biology Laboratory (EMBL) in Germany and the Swiss Institute of Bioinformatics (SIB) in Switzerland are the chief organizations involved in genome research. Besides these, many small organizations are building software portals to collaborate with one another and to share data and research results.

A complete biological application depends on five basic concepts:
(i) A source of data.
(ii) An application programming language to access and analyze the data.
(iii) Reuse of software tools and libraries developed by others.
(iv) A web application platform to provide an HTML user interface for the data and analysis results.
(v) A data store, such as a relational database, to store results or the user's data.
Here the latter two notions are optional, the middle one is highly desirable and the first two are considered mandatory.

Among the various programming languages, Java has been widely accepted for biological software development. Java was designed with code reuse in mind, and its regular expression library, although not part of the basic syntax of the language, is extensive and well defined. Java is more verbose than Perl, an equally accepted programming language for developing biological software. This project applies Java programming concepts in the field of biology and medicine, together with a basic understanding of genes and proteins.

About MARK - Multi Action Research Kit

We have built the tool using Java JDK 1.8 and the NetBeans IDE. We have tried to build an efficient offline tool that works like the online tools available to the user for performing various operations on DNA and protein sequences.
MARK is a single-stop solution for multiple operations:
FOR DNA:- AT content, GC content, melting temperature of DNA, and translation of the given sequence.
FOR PROTEIN:- Hydrophobicity, hydrophilicity, molecular weight and isoelectric point.

We used the features of NetBeans for application development and wrote the tool in the Java language. When the application starts, MARK's welcome screen appears and directs the user on what to do next. The welcome screen contains three options: 1] DNA 2] Protein 3] Exit. The DNA and Protein buttons link to the next frame. If the user clicks on DNA, the home screen of DNA MARK appears. If the user clicks on Protein, the options for PROTEIN MARK appear on screen. If the user clicks on the Exit button, the application exits. Each module of MARK offers the four operations listed above.

Strategy for Application development:-

Fig 1 - Strategy for Application development.

In the process of development, we first identify the basic needs of the user. According to those needs we set a goal, i.e. WHAT we want from the tool. Imagination is the next and most important part of our process: in this step we imagine how the tool must work. After imagination, the next step is development, in which we try to convert imagination into reality. After developing the tool in accordance with the goals, needs and imagination, we test the tool's performance.

Flow of Program:-

Fig 2 - Flow Of Program.

Home Class:- This is the main class of the program. When the program executes, the Home class window appears. It contains three buttons. If the user clicks on DNA, control goes to the DNA class. If the user clicks on Protein, control goes to the Protein class. If the user clicks on Exit, execution of the application terminates.

DNA Class:- When the user clicks on the DNA button of the Home class, the DNA class executes and a new window opens. The frame of the DNA class contains one text area and several buttons, namely GC content, AT content, Melting temperature and Translation. The user can perform the various operations by clicking these buttons.

Protein Class:- When the user clicks on the Protein button of the Home class, the Protein class executes and a new window opens. The frame of the Protein class contains one text area and several buttons, namely Hydrophobicity, Hydrophilicity, Molecular weight and Isoelectric point. The user can perform the various operations by clicking these buttons.

Algorithms:-

For GC content:-
For AT content:-
For Melting temperature:-
For Translation:-
For Hydrophobicity:-
For Hydrophilicity:-
For Iso-electric point:-
For Molecular weight:-

Result:- By implementing the above algorithms we have developed MARK, which is now ready to use. When MARK starts, the home page appears.

Fig 3 - Home Page of MARK.

The next action of MARK depends on the user's choice. If the user clicks on DNA, DNA MARK opens. If the user clicks on Protein, Protein MARK opens.

DNA MARK:- DNA MARK is the DNA Multi Action Research Kit. It has various options for processing a DNA sequence.

Fig 4 - D-MARK.

The user can give input by typing or pasting into the text area, or by browsing for a file.

Input to MARK:-

Fig 5 - MARK's File chooser.

Fig 6 - MARK with input.

We used the Luciferase gene sequence to check the accuracy of D-MARK.

GC Content:-

D-MARK result
Fig 7 - GC content by MARK.

ENDMEMO DNA/RNA GC Content Calculator (http://www.endmemo.com/bio/gc.php)
Fig 8 - GC content by Standard Tool.

AT Content:-

D-MARK result
Fig 9 - AT content by MARK.

Melting Temperature:-

D-MARK result
Fig 10 - Melting Temperature by MARK.

Oligo Calc result (http://www.basic.northwestern.edu/biotools/oligocalc.html)
Fig 11 - Melting Temperature by Standard Tool.

Translation:-

Result of MARK
Fig 12 - Luciferase Protein sequence by MARK.

Result of Expasy Translate (http://web.expasy.org/translate/)

In the comparison that follows, the two results alternate segment by segment: each segment of the MARK result is followed by the corresponding segment of the Expasy Translate result.
MARK:   M S L Q L S I L D Q T P I R R G S N A A E A L Q E S I E L V R K A D E W G Y T R
Expasy: M S L Q L S I L D Q T P I R R G S N A A E A L Q E S I E L V R K A D E W G Y T R

MARK:   Y W L S E H H N T I T L A G A A P E I L I A R L A S E S K R I R L G S G G I M L
Expasy: Y W L S E H H N T I T L A G A A P E I L I A R L A S E S K R I R L G S G G I M L

MARK:   P N H S T L K V A E N F K L L E A L Y P N R I D L G V G R A P G G D R I T A Q L L
Expasy: P N H S T L K V A E N F K L L E A L Y P N R I D L G V G R A P G G D R I T A Q L L

MARK:   N P S N T F D P Q E Y I Q Q I S D L H D F L T D N P N Y N N I Q G K V R A I P Q I D
Expasy: N P S N T F D P Q E Y I Q Q I S D L H D F L T D N P N Y N N I Q G K V R A I P Q I D

MARK:   T V P E M W M L T S S G E S A Y L A A H S G M A L S F A Q F I N P V G G K E A M
Expasy: T V P E M W M L T S S G E S A Y L A A H S G M A L S F A Q F I N P V G G K E A M

MARK:   A I Y K Q R F K P S A Q L K A P K A S V G V F A F C S E D E Q K A A Q V Q A V M D
Expasy: A I Y K Q R F K P S A Q L K A P K A S V G V F A F C S E D E Q K A A Q V Q A V M D

MARK:   Y R L L S F E K G R Y D E I P T Y E A A S K Y K Y T E G E W Q R V L F N R Q R T V
Expasy: Y R L L S F E K G R Y D E I P T Y E A A S K Y K Y T E G E W Q R V L F N R Q R T V

MARK:   V G T P D I V K E K I T S L A A E M E V N E V I L S T F T E S Q K D R F S S Y E L L A
Expasy: V G T P D I V K E K I T S L A A E M E V N E V I L S T F T E S Q K D R F S S Y E L L A

MARK:   K L F N L T A N T N N K Q A --STOP--
Expasy: K L F N L T A N T N N K Q A --STOP--

Fig 13 - Luciferase Protein sequence.

We passed the translated output as input to P-MARK.

Fig 14 - Translated Protein Sequence as input to MARK.

Hydrophobicity:-

P-MARK result
Fig 15 - Hydrophobicity calculated by MARK.

Peptide 2.0 result (http://www.peptide2.com/N_peptide_hydrophobicity_hydrophilicity.php)
Hydrophobic: 46.4%
Fig 16 - Hydrophobicity calculated by Standard Tool.

Hydrophilicity:-

P-MARK result
Fig 17 - Hydrophilicity calculated by MARK.

Iso-electric Point:-

P-MARK result
Fig 18 - Iso-electric Point calculated by MARK.
Result by ProtParam (http://web.expasy.org/protparam/)
Fig 19 - Iso-electric Point calculated by Standard Tool.

Molecular weight:-

P-MARK result
Fig 20 - Molecular weight calculated by MARK.

ProtParam result (http://web.expasy.org/protparam/)
Fig 21 - Molecular weight calculated by Standard Tool.

Discussion:- Biological concepts provide diverse challenges for researchers. Basic properties such as the GC content, the AT content and the melting temperature are important parameters that define the stability of the DNA molecule. Proteins, on the other hand, have their own set of parameters, such as the iso-electric point, the molecular weight, and the hydrophobic and hydrophilic nature. These properties give a basic idea of how the protein would behave in a biological system. Because MARK combines these properties, a user who supplies a DNA sequence can compute the entire set of DNA and protein parameters and so obtain a concept map of a novel DNA.

MARK (Multi Action Research Kit) is a tool that stands as an off-line interface and a single-stop solution for finding the basic properties of the biological macromolecules, i.e. DNA and protein. These properties are otherwise scattered across different tools; MARK combines them into one unit, enabling researchers to study the biological macromolecules in a short time and to intensify their research further. In the comparisons shown in the Result section, the output of MARK agreed with that of the standard online tools.

Conclusion:- MARK stands as a desktop tool with an easy user interface that is usable by first-time users and offers a single-stop solution for the basic properties of DNA and proteins.

Future aspects:- MARK can be further extended for Secondary Structure Prediction, Motif Finding and Epitope Prediction. MARK was named with this future scope in mind: extending the tool beyond the basic properties of DNA and proteins.
It aims at the manipulation of DNA and protein sequences.

Reference:-

1. Definition of GC-content on CancerWeb of Newcastle University, UK.
2. Yakovchuk P, Protozanova E, Frank-Kamenetskii MD (2006). "Base-stacking and base-pairing contributions into thermal stability of the DNA double helix". Nucleic Acids Res. 34 (2): 564-74. doi:10.1093/nar/gkj454. PMC 1360284. PMID 16449200.
3. Levin RE, Van Sickle C (1976). "Autolysis of high-GC isolates of Pseudomonas putrefaciens". Antonie Van Leeuwenhoek 42 (1-2): 145-55. doi:10.1007/BF00399459. PMID 7999.
4. Journal of biomedicine and biotechnology, 2010, 1-12.
5. IUPAC, Compendium of Chemical Terminology, 2nd ed. (the "Gold Book") (1997). Online corrected version: (2006-) "hydrophobic interaction".
6. "Interfaces and the driving force of hydrophobic assembly". Nature, Volume 437, Issue 7059, pp. 640-647 (2005). doi:10.1038/nature04162.
7. Kauzmann W (1959). "Some factors in the interpretation of protein denaturation". Advances in Protein Chemistry 14: 1-63. doi:10.1016/S0065-3233(08)60608-7. PMID 14404936.
8. Charton M, Charton BI (1982). "The structural dependence of amino acid hydrophobicity parameters". Journal of Theoretical Biology 99 (4): 629-644. doi:10.1016/0022-5193(82)90191-6. PMID 7183857.
9. "The Binding of Benzoarylsulfonamide Ligands to Human Carbonic Anhydrase is Insensitive to Formal Fluorination of the Ligand". Angew. Chem., Int. Ed., Volume 52, Issue 30, pp. 7714-7717 (2013). doi:10.1002/anie.201301813.
10. "Water Networks Contribute to Enthalpy/Entropy Compensation in Protein-Ligand Binding". J. Am. Chem. Soc., 2013, 135 (41), pp. 15579-15584. doi:10.1021/ja4075776.
11. Pace C, Shirley B, McNutt M, Gajiwala K (1 January 1996). "Forces contributing to the conformational stability of proteins". FASEB J. 10 (1): 75-83. PMID 8566551.
12. Rose G, Fleming P, Banavar J, Maritan A (2006). "A backbone-based theory of protein folding". Proc. Natl. Acad. Sci. U.S.A. 103 (45): 16623-33. doi:10.1073/pnas.0606843103. PMC 1636505. PMID 17075053.
13. Schildt H. The complete reference - Java 2, Seventh Edition.
14. H. M. Deitel, P. J. Deitel (Deitel & Associates, Inc.). Java How to Program.
15. Alain Trottier. Java 2 Core Language Little Black Book.

Websites and Books:-

Regular Expressions in Java, http://www.tutorialspoint.com/java/java_regular_expressions.htm (13 Nov 2011).
Intro to Java Programming, by Y. Daniel Liang.
http://www.endmemo.com/bio/gc.php
http://www.basic.northwestern.edu/biotools/oligocalc.html
http://web.expasy.org/translate/
http://www.peptide2.com/N_peptide_hydrophobicity_hydrophilicity.php
http://web.expasy.org/protparam/

MARK 2014 CENTER FOR ADVANCED LIFE SCIENCES