- MEGAN MEtaGenome ANalyzer. A stand-alone metagenome analysis tool.
- Metagenomics and Our Microbial Planet A website on metagenomics and the vital role of microbes on Earth from the National Academies.
- The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet A report released by the National Research Council in March 2007. Also, see the Report In Brief.
- IMG/M The Integrated Microbial Genomes system, for metagenome analysis by the DOE-JGI.
- CAMERA Cyberinfrastructure for Metagenomics, data repository and tools for metagenomics research.
- A good overview of metagenomics from the Science Creative Quarterly
- list of Metagenome Projects from genomesonline.org
- MG-RAST publicly available, free, metagenomics annotation pipeline and repository for pyrosequences, Sanger sequences, and other sequence approaches.
- Human microbiome project
- MetaHIT Official website for the EU-funded project Metagenomics of the Human Intestinal Tract
- Annotathon Bioinformatics Training Through Metagenomic Sequence Annotation
- Metagenomics Metagenomics research and applications
The format of Windows and Unix text files differs slightly. In Windows, lines end with both a carriage return and a line feed ASCII character, but Unix uses only a line feed. As a consequence, some Windows applications will not show the line breaks in Unix-format files. Likewise, Unix programs may display the carriage returns in Windows text files as Ctrl-m (^M) characters at the end of each line.
There are many ways to solve this problem. This document provides instructions for using FTP, screen capture, unix2dos and dos2unix, tr, awk, Perl, and vi to do the conversion. Before you use these utilities, the files you are converting must first be on a Unix computer.
Note: In the instructions below, replace unixfile.txt with the name of the Unix file you are transferring, and replace winfile.txt with the name of the Windows file you are transferring.
FTP
When using an FTP program to move a text file between Unix and Windows, be sure the file is transferred in ASCII format. This will ensure that the document is transformed into a text format appropriate for the host. Some FTP programs, especially graphical applications like Hummingbird FTP, do this automatically. If you are using FTP from the command line, however, before you begin the file transfer, be sure to enter at the FTP prompt:
ascii
Note: You need to use a client that supports secure FTP to transfer files to and from Indiana University’s central systems. For more, see At IU, what SSH/SFTP clients are supported and where can I get them?
Screen capture
You can also convert files from Unix to Windows format when transferring them to a PC with a communications program by selecting ASCII text download. Select this option with your communications program to capture all the text subsequently displayed to your screen, and then enter at the Unix prompt:
cat unixfile.txt
Most communications programs will add carriage returns to the stream of text as they save it to your computer’s hard drive. Once the file has finished displaying, abort the text download.
Note: This method may be slow for large text files. Also, no error checking is performed on the file as it is transferred.
dos2unix and unix2dos
On systems using Solaris, the utilities dos2unix and unix2dos are available. These utilities provide a straightforward method for converting files from the Unix command line.
To use either command, simply type the command followed by the name of the file you wish to convert, and the name of a file which will contain the converted results. Thus, to convert a Windows file to a Unix file, at the Unix prompt, enter:
dos2unix winfile.txt unixfile.txt
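Where dos2unix and unix2dos are not installed, the same conversions can be approximated with tr and awk, two of the tools listed at the top of this document. The following is a minimal sketch rather than the documented procedure; the function names crlf_to_lf and lf_to_crlf are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper names; each function reads stdin and writes stdout.

# Windows -> Unix: delete every carriage return.
# Note: this also removes any stray CR that is not at the end of a line.
crlf_to_lf() {
    tr -d '\r'
}

# Unix -> Windows: re-emit each line terminated by CR LF.
# Do not run this on a file that already has CR LF endings,
# or each line will end up with a doubled carriage return.
lf_to_crlf() {
    awk '{ printf "%s\r\n", $0 }'
}

# Usage:
#   crlf_to_lf < winfile.txt > unixfile.txt
#   lf_to_crlf < unixfile.txt > winfile.txt
```

Because both helpers work as filters, the original file is left untouched and the converted text goes to whatever file you redirect it to.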
Linux? You geeks use Linux?
If you work in science, and you work on big datasets (such as analyzing next generation sequencing data), chances are that you use Linux for some of your work. I frequent several of our lab’s Red Hat servers for data analysis and code development purposes. However, these aren’t just my servers to use. Other lab members and, depending on the server, IT staff use them too. I try to remember to check and see who is on and what they’re running before getting too involved with something that’s going to hog memory or processor time. But, of course, I don’t always remember.
I decided to automate this process to take the remembering part out. By adding in a shell script + some code in my profile file, my ssh login immediately displays relevant information without having to invoke it manually.
Shell Script
The code is based on the Bash shell, so it may or may not apply to your ssh login. I keep the shell script in my /home/user directory with the name “.greeting.sh”. Adding the leading period just makes it invisible to standard “ls” queries so it doesn’t add to the clutter in my home directory. The code for “.greeting.sh” follows between the lines of # signs:
##################################################
#!/bin/bash
# Gather user, host, and load information for the login greeting.
UNAME=`whoami`
TIME=`date`
HOST=`hostname`
UCNT=`users | wc -w`
ULST=`users`
# Sum the %CPU column over all processes (header row skipped), rounded.
PROC=`ps aux | awk 'NR > 1 { s += $3 }; END { printf("%d\n", s + 0.5); }'`
# Used memory as a percentage of total memory, rounded.
MPCT=`free | grep Mem | awk '{ printf("%d\n", $3 / $2 * 100 + 0.5); }'`
MYSHELL=`echo $SHELL`
echo
echo "$TIME"
echo "Shell: $MYSHELL"
echo "Hello $UNAME! Welcome to $HOST!"
# Report other logged-in users, if any.
if [ $UCNT -ge 2 ]
then
  echo "$UCNT users are currently logged into $HOST:"
  echo "$ULST"
else
  echo "No other users currently logged in."
fi
echo "System Status:"
# Classify processor usage as high (>= 80), medium (>= 50), or low.
if [ $PROC -ge 80 ]
then
  echo "High processor usage at ${PROC}%"
elif [ $PROC -ge 50 ]
then
  echo "Medium processor usage at ${PROC}%"
else
  echo "Low processor usage at ${PROC}%"
fi
# Classify memory usage with the same thresholds.
if [ $MPCT -ge 80 ]
then
  echo "High memory usage at ${MPCT}%"
elif [ $MPCT -ge 50 ]
then
  echo "Medium memory usage at ${MPCT}%"
else
  echo "Low memory usage at ${MPCT}%"
fi
echo
exit 0
##################################################
When I log in, the code above prints the date, a greeting, the hostname, my current shell, whether other users are logged in (and the list of users if others are on), and information about current processor and memory usage. I customize this script depending on the primary use of the server. If you have a server that should always be running a certain program, add a line that looks for that program. If it were called “myprogram” you could add the following line to the script:
PROG=`ps aux | grep -v grep | grep myprogram | wc -l`
PROG then holds the number of matching processes: 1 if a single instance is running, or 0 if the program isn’t running. By adding a test later in the script for whether $PROG -ge 1, you can print a message saying whether or not the program is running.
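That check might look like the sketch below. The function names and the wording of the messages are placeholders of mine, and myprogram stands in for whatever the server is supposed to be running.

```shell
#!/bin/sh
# Count running processes whose ps line mentions the given name.
# Same ps/grep pipeline as the PROG line; "grep -v grep" drops the grep itself.
count_procs() {
    ps aux | grep -v grep | grep "$1" | wc -l
}

# Turn a program name ($1) and a process count ($2) into a one-line status message.
program_status() {
    if [ "$2" -ge 1 ]
    then
        echo "$1 is running ($2 instance(s))"
    else
        echo "WARNING: $1 is not running"
    fi
}

# In .greeting.sh this could be used as:
#   PROG=`count_procs myprogram`
#   program_status myprogram "$PROG"
```

Keeping the counting and the message in separate functions makes it easy to reuse the message for several programs on the same server.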
Take note! Don’t forget to alter the permissions on the script to allow execution, using something like “chmod +x .greeting.sh”. Also note that the variables are defined using backticks (same key as the ~ on standard US QWERTY keyboards), not single quotes.
Automatically running
The script isn’t much use if you have to run it manually (if I remembered to do that, why would I need a script?), so I like to set the script to run automatically immediately following an ssh login. As I said before, I use Bash on most of the Linux servers I use. For this shell, there is a file called “.bash_profile” in the home directory of each user. This profile file is executed on every ssh connection to set some common environment variables, like PATH. By adding in code to run the greeting script, the output from the script will be displayed immediately after login. Example code to add to the bottom of your profile file:
##################################################
if [ -e "/home/user/.greeting.sh" ]
then
/home/user/.greeting.sh
fi
##################################################
That’s all there is to it. A simple but powerful script to automatically give you information on server login. Feel free to adapt it to your system and purpose.
Next Generation Seq Tools
Something I came across.
Integrated solutions
* CLCbio Genomics Workbench – de novo and reference assembly of Sanger, 454, Solexa, Helicos, and SOLiD data. Commercial next-gen-seq software that extends the CLCbio Main Workbench software. Includes SNP detection, browser and other features. Runs on Windows, Mac OS X and Linux.
* NextGENe – de novo and reference assembly of Illumina and SOLiD data. Uses a novel Condensation Assembly Tool approach where reads are joined via “anchors” into mini-contigs before assembly. Requires Win or MacOS.
* SeqMan Genome Analyser – Software for Next Generation sequence assembly of Illumina, 454 Life Sciences and Sanger data integrating with Lasergene Sequence Analysis software for additional analysis and visualization capabilities. Can use a hybrid templated/de novo approach. Early release commercial software. Compatible with Windows® XP X64 and Mac OS X 10.4.
Firefox?!?!
I know what you’re thinking. “Come on. A browser? As a bioinformatics tool?” You might actually be surprised. I think that most people who do research spend at least some amount of time online trying to track down information. Maybe it’s a protein name, or DNA elements in a chromosome segment. Maybe it’s a certain paper or topic through PubMed. Personally, I spend a good amount of time searching out answers. Furthermore, I switch between databases / websites between tabs to get information from different sources. Could there be a way to search faster?
Keyword Search To The Rescue!
Luckily, there is a faster way: the keyword search. Basically the keyword search will allow you to make a bookmark shortcut to any search box using a keyword. Once a keyword search has been saved that particular search can be invoked with just the keyword. I frequently use the UCSC Genome Browser for research, so I’ll use this as an example.
How To
- Navigate to the UCSC Genome Browser main page.
- In the top navigation panel click “Genomes”
- The default page should be the Human genome browser. If you are interested in a different organism you can certainly change it using the drop-down boxes. There should be an input box labeled “position or search term”. Right click in the box.
- In the pop-up menu select “Add a Keyword for This Search…”. An “Add Bookmark” window will appear.
- In the “Name” box type a descriptive name. In this case use “UCSC Human Search”.
- In the “Keyword” box type the keyword you want to use. In this case use “ucsc”.
- Press the “Add” button to save this search.
Let’s test the keyword. Open a new blank Firefox tab by pressing CTRL+T or File -> New Tab. In the address bar type “ucsc MECP2” and press enter. The “ucsc” keyword triggers the query “MECP2” to be run through the search box we saved. After a few seconds a window for the UCSC browser should appear listing possible genes matching the symbol MECP2. If you had navigated to the UCSC Browser directly and typed MECP2 directly in the search box the results would have been the same.
What about direct chromosome positions? Let’s try it. Clear the text from the URL bar, type “ucsc chr1:1-20000000”, and press enter. The page should change to show the first 20,000,000 base pairs of chromosome 1.
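Under the hood, a Firefox keyword bookmark is a URL containing a %s placeholder, which the browser replaces with whatever follows the keyword. The sketch below imitates that substitution in shell. The example.org template is a made-up stand-in (Firefox records the real UCSC URL for you when you save the search), and expand_keyword is a name chosen just for this example.

```shell
#!/bin/sh
# expand_keyword TEMPLATE QUERY
# Replace the first %s in TEMPLATE with QUERY, roughly as a keyword bookmark does.
expand_keyword() {
    template=$1
    # Minimal encoding: spaces only. A real browser encodes other characters too.
    query=$(printf '%s' "$2" | sed 's/ /%20/g')
    case $template in
        *%s*)
            prefix=${template%%\%s*}   # text before the first %s
            suffix=${template#*\%s}    # text after the first %s
            printf '%s%s%s\n' "$prefix" "$query" "$suffix"
            ;;
        *)
            # No placeholder: the bookmark URL is used unchanged.
            printf '%s\n' "$template"
            ;;
    esac
}

# Hypothetical template standing in for a saved search bookmark.
expand_keyword 'https://example.org/search?q=%s' 'chr1:1-20000000'
expand_keyword 'https://example.org/search?q=%s' 'Rett Syndrome'
```

The two calls print the template with the query filled in, the second with the space replaced by %20.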
What other uses could it have? What about a “pubmed” keyword search? Or an Ensembl search? It can be particularly powerful if you combine these searches. If you were researching Rett Syndrome, you could in one tab search for “pubmed Rett Syndrome”. After reading a few papers and finding information on MECP2 in Rett Syndrome all you have to do is hit CTRL+T to open another tab. Then type “ucsc MECP2” to find it in the genome browser. If you had a saved search for the NCBI Protein database you could go even further by opening yet another tab and typing “protein MECP2_HUMAN” (assuming your keyword was protein). The result would be a page about the MECP2 protein in humans where you could get the amino acid sequence. Your specific search set would depend on what databases you search most frequently in your research.
This kind of time savings can really add up. Plus you can show off your cool new hack to friends when they’re trying to search for something.
VNTI Is Dead
The golden age of Vector NTI has ended, and free software licenses are no longer available to academics. This move has been disturbing to many, and support for deactivated licenses hasn’t been the best so far. But after I sent a plea to the tech support services associated with VNTI, they’ve come through with some help.
To answer an oft-asked question, DNA/RNA/Protein sequences CANNOT be exported after a license has expired. I know, I know, bad programming practice and bad PR practice. BUT if your data is locked in you can get a temporary license to export everything. For DNA / RNA molecules you can export into GenBank, EMBL, and FASTA file formats. For protein sequences you can export into GenPept, SWISS-PROT, or Protein FASTA format. File export DOES NOT work for Enzymes, Oligos, Gel Markers, Citations, BLAST Results, or Analysis Results. Those of you with extensive Oligo libraries will want to contact Tech Support directly for assistance in exporting or moving these files. Sorry guys. It may or may not be supported.
Exporting DNA/RNA Molecules
- Open your VNTI Database.
- Go to ‘DNA/RNA Molecules’ from the drop down box.
- Select all the molecules you want to export. For everything, select one molecule and either press CTRL+A or use ‘Edit’ -> ‘Select All’.
- Go to ‘Edit’ -> ‘Copy To’ -> ‘File…’. Make sure to choose the format you want. If you want all three, just repeat the process for each one.
Exporting Protein Sequences
The process is identical to exporting DNA / RNA molecules, except the Protein Molecules library must be used.
Getting a Temporary License
To get your temporary license e-mail Technical Support at bioinfosupport[AT]invitrogen.com. In your message just explain that you’ve been a user of the VNTI free license, but the license expired and you need a temporary one to export all your data.
Now, I’m glad that Life Sciences / Invitrogen has come through with some help for the community. Do I agree with the change in marketing? No. Do I think the transition was handled gracefully? No. But they could have elected to lock everyone’s data in permanently, and have instead elected to extend the olive branch. Hope this helps some of you out there with trapped data.

Following up on the previous bioinformatics tool chest post, I thought I’d cover Bioconductor next. Bioconductor is actually an off-shoot of the R-project.
Now hold on, I know what you’re thinking. “But you talked about R last time, why do we have to talk about R again?!?” It’s simple really. Though Bioconductor is a derivative of R, its purpose truly is unique enough to deserve its own post.
Bioconductor (or BioC) is an open-source derivative of R focused on facilitating the analysis of genomic data. One might ask, why should I care? If you perform any kind of high-throughput SNP genotyping or gene expression analysis, this software suite gives you immediate access to free, open-source, extremely powerful data analysis options. Got Affymetrix CEL files for expression data? No problem. Bioconductor can load, normalize, analyze, and summarize that data for you. How about SNP genotyping data? Again no problem. Want to check the copy number of your SNP data? You’ll have several options. Many Bioconductor packages are built using S4 methods and classes (the exact definition of which is unimportant for this article). The advantage of that coding system is that you can use and extend existing classes to perform your own, custom designed analysis methods. And even better, once you’ve worked out a new method, you can incorporate it into a package and submit it to Bioconductor for everyone to use!
The bottom line is this: if you need powerful, customizable, freely available analysis software (and who doesn’t after spending ridiculous amounts of money running many samples on high-throughput technology) then Bioconductor is a viable choice. If you have genomic data give BioC a try, and if it’s useful to you build your own packages for the whole community.

Data
Scientists love data. Call it a character flaw, but most of us can’t get enough. More data, more! But the data alone are just the start. To really be useful, we have to do something with the data. Model. Summarize. Evangelize it. Something. Who hasn’t needed to plot a standard curve? Or find the mean value of a series of numbers? What should you do when you have these questions?
The Problem
Many scientists turn to our friend Excel to solve these problems. It’s easy to work with, and you can even make graphs easily. That isn’t necessarily a good thing, as perfectly nice people make really bad graphs because those fancy 3D features are so tantalizing. Everyone interested in bioinformatics or computational biology needs a tool in their tool chest that can handle:
- statistics
- figure and graph creation
- very large data
The Solution
Look no further friends, your savior has arrived, and its name is R. R is a free, cross-platform, open-source derivative of the S language. In case you didn’t catch that last part: R is free. You can download R from the nearest mirror to get started.
The Good
- Freely available
- Open-source — can compile it to your needs (OS, cpu, available memory, optimization levels)
- Tons of add on packages
- Scriptable
- Ability to write own functions and packages
- Able to handle large datasets
- Interfaces with compiled languages
- Can save plots as PostScript (print quality)
- Extensive tutorials online along with mailing lists and archives for trouble shooting
The Bad
- Command-line interface
- Can be slow reading large files
- Interpreted language (can be slower than compiled code)
- No tech support line
- Steep learning curve for beginners, especially non-programmers
Computational Tools For Glycomics Studies
Sugars are involved in almost every aspect of biology, from recognising pathogens to blood clotting. The glycome’s basic building blocks are far more numerous and varied than the four letters of the DNA alphabet or the score of amino acids that make proteins. In the late 1980s, researchers isolated the first gene for a glycosyl transferase, an enzyme that adds sugars to fats and proteins. The discovery gave scientists the first opportunity to study this process, which is usually called glycosylation, by manipulating the activity of such enzymes.

Glycomics, or glycobiology, is a discipline of biology that deals with the structure and function of oligosaccharides (chains of sugars). The entirety of carbohydrates in an organism is collectively referred to as the glycome. Ongoing glycomics projects are expected to dramatically accelerate the understanding of the roles of carbohydrates in cell communication and hopefully lead to novel therapeutic approaches for the treatment of human disease.
The Functional Glycomics Gateway is a comprehensive and free online resource that is the result of a collaboration between the Consortium for Functional Glycomics (CFG) and Nature Publishing Group. It is aimed at keeping you abreast of developments in the emerging field of functional glycomics.
http://www.functionalglycomics.org/static/index.shtml
SWEET-DB annotates and cross-references carbohydrate-related data collections, which makes it possible to find important data for compounds of interest in a compact and well-structured representation.
http://www.glycosciences.de/sweetdb/
Many PDB files contain carbohydrate structures. Because there is no standard nomenclature for carbohydrates like the one that exists for amino acids, it is difficult to find the carbohydrate information. Sometimes entire oligosaccharides are encoded in one single residue. Information about carbohydrate linkages is often missing, and when it is present, it is not in a uniform format and is therefore also difficult to find. pdb2linucs automatically extracts carbohydrate information from PDB files.
http://www.dkfz-heidelberg.de/spec/pdb2linucs/
GlycoSuite comprises GlycoSuiteDB, the leading curated and annotated glycan database, and new bioinformatic tools which interface mass spectrometric data with the database.
https://glycosuite.proteomesystems.com/glycosuite/glycodb
A Complex Carbohydrate Structure Database, also known as CarbBank, is available, but due to lack of funding it is no longer updated.

Sequencing of a genome often starts with a random shotgun sequencing strategy or with direct sequencing on genomic DNA. The DNA sequences of the clones or sequenced genome fragments often overlap, yielding enlarged DNA sequences (contigs).
Genome Assembly
The genomic sequences are assembled into a series of genomic sequence contigs. These are then ordered, oriented with respect to each other, and placed along each chromosome with appropriately sized gaps inserted between adjacent contigs. The resulting genome assembly thus consists of a set of genomic sequence contigs and a specification for how to arrange the sequence contigs along each chromosome.
Finished Chromosomes
A chromosome sequence is considered finished when any gaps that remain cannot be closed using current cloning and sequencing technology. In practice, therefore, the sequence for a finished chromosome usually consists of a small number of genomic sequence contigs.
Unfinished Chromosomes
Genomic sequence contigs for unfinished chromosomes are assembled and laid out based largely on the clone tiling path. However, the tiling paths do not specify the orientation of the clone sequences or how they should be joined; therefore, data on the alignment of the input genomic sequences to each other and to other sequences are also used to guide the assembly. Genomic sequences that augment the initial set of genomic contigs based on the tiling path clones are also incorporated.
To download complete human chromosome sequences:
Each chromosome can be downloaded as a whole sequence in FASTA format from the NCBI FTP site, which maintains a section called Assembled chromosomes. To download a chromosome sequence, click the file whose name starts with hs_ref.
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/
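After downloading, it is worth checking that a chromosome file arrived intact. As a rough sketch, the function below counts the records and total sequence characters in an uncompressed FASTA file with awk. fasta_stats is a hypothetical name, and the file name in the usage line is a placeholder, since the actual file names on the FTP site may differ.

```shell
#!/bin/sh
# fasta_stats FILE
# Print the number of FASTA records (header lines starting with ">")
# and the total number of sequence characters in FILE.
fasta_stats() {
    awk '
        # Header line: count the record and move on.
        /^>/ { records++; next }
        # Sequence line: drop a trailing carriage return, then add its length.
        { sub(/\r$/, ""); bases += length($0) }
        END { printf "records=%d bases=%d\n", records, bases }
    ' "$1"
}

# Usage (placeholder file name):
#   fasta_stats some_chromosome.fa
```

A whole-chromosome file should report a single record, so a different count is a quick hint that the download is incomplete or that the file holds contigs instead.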
Manual Annotation:
The Vega site maintained by the Sanger Institute presents data from the manual annotation of the human genome.
High-quality annotated human chromosome sequences
To download all annotated human contigs in one FASTA file:
ftp://ftp.sanger.ac.uk/pub/vega/human/
Identification of genes
Genes are found using three complementary approaches: (a) known genes are placed primarily by aligning mRNAs to the assembled genomic contigs; (b) additional genes are located based on alignment of ESTs to the assembled genomic contigs; and (c) previously unknown genes are predicted using hints provided by protein homologies.
