Computer methods


I. The FINDER Program. This was written in 1991 and published in 1993, so it's a little old now. The program uses amino acid composition data to identify proteins, and here is the reference and a link to Medline [Shaw, G. Rapid identification of proteins. Proc. Natl. Acad. Sci. USA 90:5138-5142 (1993)]. You can download a *.pdf version of this here (the NAS server). The original version of the program was written for IBM-PC compatibles running MS-DOS, using the Borland C compiler, and required, in 1993, about 7 MB of hard disc space to run. This was a lot in those days, representing heavily compressed data from the ~40,000 full protein sequences known at that time, but of course is nothing now. On an early Pentium class computer it would run in a few seconds. The approach works very well if you can get a pure protein and good quality amino acid composition data. There are at least two interactive programs on the WWW which are similar to, and possibly influenced by, mine, and they somewhat supersede my program, so you might want to go there instead, especially as you can run them over the web. These are at ExPASy and EMBL Heidelberg. I can supply you with the original executable program, the C source code, and the program you need to make the database files if you are interested; just send me an email. The method is particularly useful for characterizing proteins as part of genome/proteome projects; since microbial genomes only contain a few thousand ORFs, even very poor quality composition data can firmly match a microbial protein to the appropriate gene.
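
As a rough illustration of matching a measured composition against a sequence database, here is a minimal Python sketch. It is not the FINDER code: the function names, the toy database and the choice of a sum of squared differences as the score are all assumptions made for the example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the fraction of each standard amino acid in seq."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids in sequence")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def composition_distance(comp_a, comp_b):
    """Sum of squared differences between two compositions (assumed score)."""
    return sum((comp_a[aa] - comp_b[aa]) ** 2 for aa in AMINO_ACIDS)


def rank_candidates(measured, database):
    """Return (name, distance) pairs sorted from best to worst match."""
    scored = [
        (name, composition_distance(measured, composition(seq)))
        for name, seq in database.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy database; real use would load thousands of sequences.
TOY_DATABASE = {
    "protein_a": "MKKLLEEAVKK",
    "protein_b": "GGGSSSPPGGS",
    "protein_c": "WWFFYYLLIIV",
}

if __name__ == "__main__":
    measured = composition("GSGSPGGSPSG")  # stands in for laboratory data
    for name, dist in rank_candidates(measured, TOY_DATABASE):
        print(f"{name}\t{dist:.4f}")
```

With only a few thousand candidates, as in a microbial genome, even a crude score like this one can separate the best match from the rest, which is the point made above.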


II. Motifer Program. One of the problems of searching protein and nucleic acid sequence databases is that some sequences are related but only very distantly. What can often happen in the case of proteins is that there are a few conserved peptides which are separated by long unconserved segments of variable length. This is a feature of some important signaling modules such as Src-homology 2 (SH2) and particularly Pleckstrin homology (PH) domains. To deal with sequences of this kind I wrote a program called Motifer in 1993-1994. This was a C program written with the Borland C/C++ compiler for the IBM-PC. You can input up to four different peptide sequences characteristic of your loosely defined domain of choice. Each peptide can be defined in as tight or loose a fashion as you want; for instance you can say that for one of the peptides the first amino acid has to be a Cys, the second can be anything, the third has to be a Cys or a His, the fourth has to be Glu or Asp etc., then there is a gap of between 5 and 10 residues, then another Cys. Sequences of this type occur commonly in many important protein modules but defeat regular search programs like BLAST and FASTA, which are good at finding regions of long, fairly continuous alignment. The program will find every sequence in a protein sequence database that has a peptide matching each of the criteria, and you can inspect them at your leisure. Early versions of this program identified the only SH2 domain yet found in yeast (Medline Link) and several particularly informative examples of previously unrecognized PH domains in several important signaling molecules (Medline Link). The program works with any FASTA format sequence file and can be emailed to you if you want it. It also doesn't care whether you're looking for protein or nucleic acid sequences; it can deal with both.
Being somewhat old now it prefers to work in a DOS environment, but will run without problems in a DOS window in Windows 95, 98, XP or whatever.
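
The kind of loosely defined peptide described above maps quite naturally onto a regular expression. The following Python sketch shows the idea only; it is not the Motifer source, and the function names and the two example records are made up for the illustration. The pattern encodes the example in the text: Cys, anything, Cys or His, Glu or Asp, a gap of 5 to 10 residues, then Cys.

```python
import re


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def search(fasta_text, patterns):
    """Return records in which every pattern matches somewhere.

    Each hit is (header, [(1-based start, matched peptide), ...]).
    """
    compiled = [re.compile(p) for p in patterns]
    hits = []
    for header, seq in read_fasta(fasta_text):
        found = [rx.search(seq) for rx in compiled]
        if all(found):
            hits.append((header, [(m.start() + 1, m.group()) for m in found]))
    return hits


# Cys, any residue, Cys or His, Glu or Asp, 5-10 of anything, Cys.
EXAMPLE_PATTERN = r"C.[CH][ED].{5,10}C"

# Hypothetical records: the first has a 6-residue gap, the second only 3.
EXAMPLE_FASTA = """>hit example
MKCACEAAAAAACGG
>miss example
MKCACEAAACGG
"""

if __name__ == "__main__":
    for header, matches in search(EXAMPLE_FASTA, [EXAMPLE_PATTERN]):
        print(header, matches)
```

Passing up to four patterns in the list mirrors the "up to four peptides" input; a record is reported only if all of them are found.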


III. Spreadsheet Programs. A few years ago I described a method of using the considerable capabilities built into Microsoft Excel and similar spreadsheets in the analysis of protein and nucleic acid sequence data, and here is the reference and a link to Medline [Shaw, G. "Protein sequence interpretation with a spreadsheet program". BioTechniques, 19:978-983 (1995)]. Since I worked on that paper I have played around some more with this approach and I think it has a good deal of potential. In fact I do almost all my protein and nucleic acid sequence analysis in Excel and Word (see below). Some time later I wrote a chapter for the book ["Spreadsheets in Science and Engineering" edited by Gordon Filby (Springer-Verlag 1998)], which added some refinements to the method, and I have had various input and improvements from all over since then. Below are some of the latest spreadsheets, which can do a fairly astonishing range of things surprisingly well. You can cut a sequence out of a database and paste it into the spreadsheet, have it manipulated in a variety of ways and obtain a nice printout which you can put into your grant, research paper or whatever. The approach is particularly suitable for teaching purposes, since students can see exactly how the calculations are performed, and because they probably all have Excel already, there's no need to buy any expensive specialized software.

For teaching purposes I made these Excel files, which contain examples of spreadsheet use to calculate charge, hydrophobicity, coiled coils etc. Using a PC connected to a projection screen you can paste in sequences and show the result in real time, and the students can also see exactly what the program is doing. Here they are:

Simple example; calculating charge along a protein sequence: Charged.xls
Amino acid composition and isoelectric point of a protein (very useful): Aacomp.xls
Search for membrane spanning domains, antigenic regions: Memb.xls
Chou and Fasman implementation: Chofas.xls
Alpha-helical Coiled Coil Predictor: Coilcoil.xls

Simple DNA example: DNA.xls
Dot Plot programs for nucleic acid and protein: Dotplot.xls

Unfortunately you can't get these files by direct FTP anymore, because of the risk that you might be a terrorist or some such. However, you can download them by browser-based FTP from this site: http://plaza.ufl.edu/gshaw/. Just select the one you want from the list with the mouse and it should download directly to your computer and be loaded into Excel. All of these spreadsheets work in the same kind of way; the last sheet contains a set of sequences, which can be formatted in any of a variety of ways. I put the name, accession number or other identifier into the top cell, and that should then appear in the graphs. The sequence can then be pasted into the next cell below, and it does not matter if the sequence is on more than one line or if it contains blanks, tabs or other whitespace characters. Simply select the whole column containing the sequence you are interested in (or paste in a new sequence and select that column). Copy the column and paste it into the first column of the first page of the spreadsheet. The spreadsheet should recalculate for the new sequence and plot out whatever data that particular spreadsheet produces.

Specific instructions for each spreadsheet.

Charged.xls; This is the simplest one. It takes a sequence and translates acidic residues (Glutamic and Aspartic acids, E and D) to -1, basic amino acids (Lysine and Arginine, K and R) to +1 and Histidine (H) to +0.5. All other amino acids are translated to 0, and the average score for each peptide segment over a window of 28 amino acids is plotted out. The spreadsheet is quite simple, using one column to see if the amino acid at a particular position is, for example, E, and if so assigning a value of -1. This is easy to follow but somewhat cumbersome. It is possible to nest up to 7 of those kinds of assignments in one cell, and the command

=IF(E2="D",-1,IF(E2="E",-1,IF(E2="K",1,IF(E2="R",1,IF(E2="H",0.5,0)))))

will effectively replace all those columns. The spreadsheet also uses the =abs() command, which returns the absolute value of each charge, i.e. makes them all positive, so you can plot out the charge density as well as the absolute amount of charge.
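
For readers who would rather see the same calculation outside Excel, here is a minimal Python sketch of the scoring and the 28-residue window average. The function names are invented for the example, and how the spreadsheet positions each window along the sequence is not stated here, so the simple left-aligned window below is an assumption.

```python
# Per-residue scores as described for Charged.xls; everything else is 0.
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0, "H": 0.5}


def residue_charges(seq):
    """Translate each residue to its charge score."""
    return [CHARGE.get(aa, 0.0) for aa in seq.upper()]


def window_average(values, window=28):
    """Average of each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def charge_profiles(seq, window=28):
    """Return (net charge profile, charge density profile) for seq."""
    charges = residue_charges(seq)
    density = [abs(c) for c in charges]  # same role as =abs() in the sheet
    return window_average(charges, window), window_average(density, window)
```

Calling charge_profiles on a pasted sequence gives two lists that can be plotted against residue number, one signed and one unsigned.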

Aacomp.xls; This can do various manipulations such as calculating the molecular weight, amino acid composition, isoelectric point, charge at different pH values and other things from a protein sequence. The "AAcomp" page shows the composition of the protein in comparison with various average composition values, and is useful since the average amount of each amino acid in proteins is quite variable; Leucine is generally about 10% while Tryptophan is only about 1%, so a protein with equal amounts of Leu and Trp is likely Leu poor and Trp rich. The values shown are ones I calculated some years ago from various different genome projects as well as older versions of Genbank and the PIR database, and, as you can see, they are fairly similar. The value the spreadsheet uses is the one in the "in use" column, and you can paste whatever values you want into this column. The "AA Comp Plot" page displays this data in various ways, including a radar diagram which immediately points out which amino acids are unusually common and which are rare in your particular protein. The "IEP" and "IEP Plot" pages show calculations of the charge of the protein at different pH values.
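
A charge-versus-pH calculation of the kind shown on the "IEP" pages can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of the spreadsheet: the pKa values below are approximate, commonly quoted figures chosen for the example, the function names are invented, and the isoelectric point is found by simple bisection.

```python
# Approximate pKa values, ASSUMED for illustration only; the spreadsheet's
# own values may differ.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5, "N_term": 8.6}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "C_term": 3.6}


def net_charge(seq, ph):
    """Net charge of a protein at a given pH (Henderson-Hasselbalch)."""
    # One free amino terminus and one free carboxyl terminus.
    charge = 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA["N_term"]))
    charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA["C_term"] - ph))
    for aa in seq.upper():
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(seq, tol=0.001):
    """pH at which the net charge crosses zero, found by bisection."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid  # still positive, so the pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    example = "ACDKEHR"  # arbitrary test peptide
    for ph in range(2, 13, 2):
        print(ph, round(net_charge(example, ph), 2))
    print("pI about", round(isoelectric_point(example), 2))
```

Bisection works here because the net charge can only fall as the pH rises, so there is exactly one crossing point between pH 0 and pH 14.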

Memb.xls; This spreadsheet assigns numbers to amino acids like the Charged.xls spreadsheet, but can do this for a variety of methods of looking for membrane spanning segments, leader sequences, antigenic sites etc. For example, Kyte and Doolittle described a program for finding membrane spanning segments, in which hydrophobic amino acids were given somewhat arbitrary positive scores, hydrophilic amino acids were given negative scores, and small, neutral amino acids were given scores close to 0. Plotting the scores over a window of about 20 amino acids, the size of a transmembrane α-helix, gives peaks for groups of hydrophobic amino acids, an approach which has been remarkably accurate in predicting the membrane topology of quite a lot of proteins. Several other scoring matrices are built into the "Calculations" page, and you can flip between them by typing the appropriate number in the "Type number" box, and you can add your own values if you want to. The first four are methods of finding hydrophobic regions; the last two are supposed to find antigenic regions.
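
Here is a minimal Python sketch of a Kyte and Doolittle style hydropathy plot with a 20-residue window. The scale is the published Kyte and Doolittle hydropathy index; the function names and the left-aligned window are assumptions of the example, and the other scoring matrices in the spreadsheet are not reproduced.

```python
# Kyte and Doolittle hydropathy index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=20, scale=KYTE_DOOLITTLE):
    """Mean hydropathy of each run of `window` residues.

    Characters that are not in the scale (spaces, digits) are skipped.
    """
    scores = [scale[aa] for aa in seq.upper() if aa in scale]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def best_window(seq, window=20, scale=KYTE_DOOLITTLE):
    """Return (0-based start, mean score) of the most hydrophobic window."""
    profile = hydropathy_profile(seq, window, scale)
    if not profile:
        raise ValueError("sequence is shorter than the window")
    start = max(range(len(profile)), key=profile.__getitem__)
    return start, profile[start]
```

Passing a different dictionary as `scale` plays the same role as typing another number into the "Type number" box.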

Chofas.xls and Coilcoil.xls; These are implementations of the Chou and Fasman and Lupas et al. algorithms for the prediction of protein secondary structure and of α-helical coiled coils respectively. Both of them work as well as any other version of these programs I have seen, and a lot of people have commented to me that they did not think it would be possible to implement such complex algorithms in Excel.

DNA.xls; This shows how you can do similar things with DNA sequences, looking for GC rich regions etc. If you have the energy you can look for restriction sites, calculate melting temperature and so on. Finally, Dotplot.xls shows a method of comparing two sequences (or one sequence with itself) which is good for looking for repeats. Basically a big table is produced with one sequence running horizontally and the other vertically. Any position in the table at which the residue in the horizontal sequence is the same as that in the vertical sequence is marked, and if the two sequences are identical this generates a diagonal line from top left to bottom right of the table. A diagonal line parallel to but offset from this line indicates a sequence repeat, and a diagonal line running perpendicular to it, from top right to bottom left, indicates inverted repeated sequences. Inverted repeats don't have any special meaning in proteins as far as anyone knows, but in DNA they often signify binding sites for transcription factors. The program works fine for DNA, RNA or protein sequences, and is quite a nice way of looking for and displaying sequence repeats.
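
The dot plot table is easy to reproduce in a few lines of Python. This is a bare sketch of the idea, with invented function names and plain text output in place of the spreadsheet's display; it marks single matching positions only and does no windowing or filtering.

```python
def dot_plot(horizontal, vertical):
    """Return a grid where grid[i][j] is True if vertical[i] == horizontal[j]."""
    return [[h == v for h in horizontal] for v in vertical]


def render(grid, mark="*", blank="."):
    """Turn the grid into text, one row per line."""
    return "\n".join(
        "".join(mark if cell else blank for cell in row) for row in grid
    )


if __name__ == "__main__":
    seq = "ABCDABCD"  # made-up sequence containing a direct repeat
    print(render(dot_plot(seq, seq)))
```

Comparing the made-up sequence with itself prints the main diagonal plus shorter diagonals offset by four positions, which is how the direct repeat shows up.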

The original BioTechniques article was more recently reprinted in a collection of BioTechniques articles in the book "Biocomputing: computer tools for biologists", edited by Stuart Brown (Biotechniques press, 2003). This article is the first chapter in this book, and is reprinted along with a short update of the general method.


IV. Macros. Microsoft Word Version 6 and above contains a Basic interpreter allowing the creation of useful macros for those of us who work with protein and nucleic acid sequences. Word 6 uses Word Basic while Word 97 changed over to Visual Basic. Both are fully functional programming languages, and Visual Basic in particular is extremely versatile. I thought it would be nice to be able to use the mouse to select a piece of protein or nucleic acid sequence from within a Word document and run a program which would give me whatever kind of data I wanted. So I wrote macros in these two languages, which work on both PC and Macintosh and calculate the exact MW of a mouse-selected protein or nucleic acid sequence; for nucleic acids they will also give you GC content, melting temperature etc. For proteins you can get isoelectric points and composition data, and it would be easy to write macros to do whatever else you were interested in. I wrote this up and it was published in the June 2000 edition of BioTechniques, and here is the Medline link: [Useful Microsoft Word Macros for Molecular Biologists and Protein Chemists, BioTechniques 28:1198-1201 (2000)]. Here is a page I wrote for them (with some corrections and additions). The original BioTechniques article was more recently reprinted in a collection of BioTechniques articles in the book "Biocomputing: computer tools for biologists", edited by Stuart Brown (Biotechniques press, 2003). This article is the second chapter in this book, and is reprinted along with a short update of the general method.
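
To give a flavour of what such a macro computes for a selected piece of DNA, here is a small sketch written in Python rather than in Word Basic or Visual Basic. It is not the published macro code. The function names are invented, and the melting temperature uses the Wallace rule (2 degrees C per A or T, 4 degrees C per G or C), a common rule of thumb for short oligonucleotides; which formula the macros actually use is not stated here.

```python
def clean_dna(selection):
    """Keep only A, C, G and T from selected text (drops spaces, digits)."""
    return "".join(b for b in selection.upper() if b in "ACGT")


def gc_content(selection):
    """Fraction of G and C bases in the selection."""
    seq = clean_dna(selection)
    if not seq:
        raise ValueError("no DNA bases in selection")
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(selection):
    """Rule-of-thumb melting temperature in degrees C (ASSUMED formula)."""
    seq = clean_dna(selection)
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    selected = "ATGC ATGC 10"  # stands in for text selected with the mouse
    print(gc_content(selected), wallace_tm(selected))
```

Cleaning the selection first means that line numbers and spaces copied along with the sequence do no harm.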


V. Odd Fonts. Recently I developed a series of fonts for the IBM-PC and compatibles which can display sequence alignments with ease. Large hydrophobic amino acids (WFYLIVCM) are linked to blocked out characters, charged amino acids (EDKRH) are linked to bold and shaded characters, while the remaining small and neutral amino acids (GAPSTNQ) are printed out in normal Courier. All you have to do is type in your alignment in any font, then select my font, and the major types of amino acids are automatically displayed. This produces a very informative and aesthetic figure with no effort on your part. If you want this font send me an email. Final versions of this and Macintosh versions will be available for FTP download in the near future if I get round to it.
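
The three-way grouping behind the fonts can be imitated in plain text. The Python sketch below is only an analogy to what the fonts do visually: the marker symbols and function name are invented, and gap characters are passed through unchanged.

```python
# Residue groups as listed above; together they cover all 20 amino acids.
GROUPS = {
    "hydrophobic": "WFYLIVCM",
    "charged": "EDKRH",
    "small": "GAPSTNQ",
}
CLASS_OF = {aa: name for name, members in GROUPS.items() for aa in members}

# Hypothetical plain-text stand-ins for blocked, shaded and normal glyphs.
SYMBOL = {"hydrophobic": "#", "charged": "+", "small": "."}


def class_line(aligned_seq):
    """Replace each residue by its group symbol; keep gaps and other text."""
    return "".join(
        SYMBOL[CLASS_OF[c]] if c in CLASS_OF else c
        for c in aligned_seq.upper()
    )


if __name__ == "__main__":
    for row in ["MKV-LEG", "MRIALDG"]:  # made-up alignment rows
        print(row, class_line(row))
```

Printing the symbol line under each row of an alignment makes conserved hydrophobic or charged columns stand out in much the same way.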


VI. A HEK293 database. Recently we found that the widely used HEK293 cell line, derived by adenovirus transduction of human embryonic kidney cells (hence HEK), surprisingly has attributes of neuronal lineage cells. Here is some of the Affymetrix data we generated as a result of this finding.


VII. JavaScript Programs. I got into JavaScript in the last couple of years, and wrote versions of some of the above spreadsheets as well as some newer stuff. Check out the protocols section of my company website at http://www.encorbio.com/protocols.htm, or, for example, look at http://www.encorbio.com/protocols/Codon.htm.

