Converting between Unix and Windows text files?

Posted by: Abbas  :  Category: General, Technical

The format of Windows and Unix text files differs slightly. In Windows, each line ends with a carriage return followed by a line feed (the ASCII characters CR and LF), but Unix uses only a line feed. As a consequence, some Windows applications will not show the line breaks in Unix-format files. Likewise, Unix programs may display the carriage returns in Windows text files as Ctrl-M (^M) characters at the end of each line.
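To check which convention a file uses before converting it, you can count the line endings directly. The following is a minimal Python sketch, not one of the tools covered in this document; the function name count_line_endings and the sample strings are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count Windows (CR LF), Unix (LF), and bare CR line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    # Line feeds that are not part of a CR LF pair.
    lf = data.count(b"\n") - crlf
    # Carriage returns that are not part of a CR LF pair.
    cr = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr}


# Hypothetical sample data standing in for the contents of real files.
windows_text = b"first line\r\nsecond line\r\n"
unix_text = b"first line\nsecond line\n"

print(count_line_endings(windows_text))
print(count_line_endings(unix_text))
```

A file that reports only "crlf" endings is in Windows format, and one that reports only "lf" endings is in Unix format. Read the file in binary mode (for example with open(name, "rb")) so that Python does not translate the line endings before you count them.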

There are many ways to solve this problem. This document provides instructions for using FTP, screen capture, unix2dos and dos2unix, tr, awk, Perl, and vi to do the conversion. Before you use these utilities, the files you are converting must first be on a Unix computer.

Note: In the instructions below, replace unixfile.txt with the name of the Unix file you are transferring, and replace winfile.txt with the name of the Windows file you are transferring.

FTP

When using an FTP program to move a text file between Unix and Windows, be sure the file is transferred in ASCII format. This will ensure that the document is transformed into a text format appropriate for the host. Some FTP programs, especially graphical applications like Hummingbird FTP, do this automatically. If you are using FTP from the command line, however, before you begin the file transfer, be sure to enter at the FTP prompt:

ascii
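If you script your transfers, the same ASCII-mode download can be sketched with Python's standard ftplib module, whose retrlines method requests the file in ASCII mode and hands each line to a callback with its line ending removed. This is only an illustration under stated assumptions: the helper name fetch_text_file and the host in the usage comment are hypothetical, and it uses plain FTP, which will not work on systems that require secure FTP.

```python
from ftplib import FTP


def fetch_text_file(ftp: FTP, remote_name: str, local_path: str) -> None:
    """Download remote_name in ASCII mode and save it with Unix line endings."""
    # newline="\n" stops Python from translating the "\n" we write,
    # so the local file always ends up with bare line feeds.
    with open(local_path, "w", encoding="utf-8", newline="\n") as out:
        # retrlines() performs the transfer in ASCII mode and calls the
        # callback once per line, with the trailing CR LF already stripped.
        ftp.retrlines("RETR " + remote_name, lambda line: out.write(line + "\n"))


# Usage (hypothetical host; requires a reachable plain-FTP server):
#   ftp = FTP("ftp.example.org")
#   ftp.login()
#   fetch_text_file(ftp, "winfile.txt", "unixfile.txt")
#   ftp.quit()
```

To produce a Windows-format file instead, open the local file with newline="\r\n" so that each line is written with a carriage return and a line feed.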

Note: You need to use a client that supports secure FTP to transfer files to and from Indiana University’s central systems. For more, see "At IU, what SSH/SFTP clients are supported and where can I get them?"

Screen capture

You can also convert files from Unix to Windows format when transferring them to a PC with a communications program by selecting ASCII text download. Select this option with your communications program to capture all the text subsequently displayed to your screen, and then enter at the Unix prompt:

cat unixfile.txt

Most communications programs will add carriage returns to the stream of text as they save it to your computer’s hard drive. Once the file has finished displaying, abort the text download.

Note: This method may be slow for large text files. Also, no error checking is performed on the file as it is transferred.

dos2unix and unix2dos

On systems using Solaris, the utilities dos2unix and unix2dos are available. These utilities provide a straightforward method for converting files from the Unix command line.

To use either command, simply type the command followed by the name of the file you wish to convert, and the name of a file which will contain the converted results. Thus, to convert a Windows file to a Unix file, at the Unix prompt, enter:

dos2unix winfile.txt unixfile.txt
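On systems where dos2unix and unix2dos are not installed, the same conversion can be approximated in a few lines of Python. This is a rough sketch of the idea rather than a reimplementation of the Solaris utilities; the function names dos_to_unix and unix_to_dos are invented for this example, and it reads the whole file into memory, so it suits modestly sized files.

```python
import tempfile
from pathlib import Path


def dos_to_unix(src, dst) -> None:
    """Write a copy of src to dst with every CR LF replaced by LF."""
    data = Path(src).read_bytes()
    Path(dst).write_bytes(data.replace(b"\r\n", b"\n"))


def unix_to_dos(src, dst) -> None:
    """Write a copy of src to dst with every line ending as CR LF."""
    data = Path(src).read_bytes()
    # Normalize to LF first so existing CR LF pairs are not turned into CR CR LF.
    data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
    Path(dst).write_bytes(data)


# Demonstration in a temporary directory, mirroring the command above.
with tempfile.TemporaryDirectory() as tmp:
    winfile = Path(tmp) / "winfile.txt"
    unixfile = Path(tmp) / "unixfile.txt"
    winfile.write_bytes(b"line one\r\nline two\r\n")
    dos_to_unix(winfile, unixfile)
    print(unixfile.read_bytes())  # b'line one\nline two\n'
```

Working on bytes rather than text avoids any guesswork about the file's character encoding, since only the line-ending bytes are touched.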

Read more…

Craig Venter: On the verge of creating synthetic life

Posted by: Abbas  :  Category: General, News, Science

The New York Times Nods to R

Posted by: Eli Roberson  :  Category: General

In a previous Bioinformatics Toolchest post I discussed R, a statistical programming environment that I’m a big fan of. Today the New York Times had a business computing article discussing the use of R in academia and business. From my own experience, I think the use of R is pretty extensive in the right academic circles. But there are some corporate giants, such as Google and Pfizer, that are listed as users of R as well. Historically, most of these types of statistical analyses for academic, government, and corporate use were performed using SAS. I don’t have any problem with SAS, but I’m a proponent of the current shift toward using free software. Still being a student, I can’t afford to pay for software. Having free alternatives like R available gives me a chance to use powerful software at an affordable cost, while still getting a wealth of features.

The grant environment for academic research is pretty hostile right now. In a time when funds are drying up, I know that I’d rather use a free alternative than pay for an individual license for each computer in my lab.

Vector NTI Is Dead // Long Live Vector NTI

Posted by: Eli Roberson  :  Category: General

Previously in the Bioinformatics Toolchest series I talked about Vector NTI as a great tool available free to researchers. Unfortunately I’m going to have to reverse that recommendation. The cornerstones of my reasoning were that the tool worked well and was freely available to academic researchers. However, Vector NTI is no longer free for researchers.

Vector NTI was offered by Invitrogen. But Invitrogen will not exist much longer. Invitrogen and Applied Biosystems have finalized a merger to become the mega-company Life Technologies. I’m personally not a big fan of big companies swallowing each other to create even bigger companies with less competition. But I thought, hey, probably not that bad. A few weeks ago I received an e-mail, since I am a registered Vector NTI user. It stated that on December 15, 2008, Vector NTI 11 would be out. This was exciting for me. New Vector features, bug fixes, streamlining: should be good. Wrong. The new version comes with the discontinuation of free licenses for researchers. Why, you ask? Good question. The original FAQ they published has since disappeared, but is still available courtesy of the Google cache. An excerpt from that FAQ follows:

6. Why has Invitrogen discontinued the free v10 license program?
Over the past three years, the v10 free license program has been an overwhelming success by the sheer number of researchers using this version of the software. In that time, you have told us very clearly you want added features, easier licensing, and more personalized technical support. In response, we have completely redesigned both the software and our licensing options for academic researchers. Vector NTI Advance™ 11 contains major new cloning, design and search functionality, a completely updated interface, support on Intel-based Macs as well as Windows® Vista, and new, cost-effective 1-year and 3-year license options exclusively for academic researchers. These new license options also include personalized Technical Support by email, and are delivered directly to you by email without the need to register or log in separately. At significantly reduced prices compared with our Commercial Licenses, these new options respect the current grant funding and other realities of academic research.

Personally, I would rather have the free license with the option to purchase a tech support contract, or pay a higher rate for per-use support. Who knows the real reason the licenses were discontinued? Maybe too many researchers were asking for assistance. Maybe it was restructuring for additional money during and after the merger. Either way, Vector NTI is no longer a viable option for those looking for free tools. However, if you have liked the tool in the past and need its features for your research, it’s still a good application, if you’re willing to pay the price. Any suggestions for alternative free tools are encouraged in the comments.

Thanks for the informative NIH Access Battle

Posted by: Yasmeen  :  Category: General

Dear ER,

Your blurb was very informative, and I agree that public access is a good thing. It can demonstrate progress and reassure skeptics that scientists are using funding and research for the public good and not simply for themselves. My only caveat is this: can other groups copy and steal information? I know it’s unethical, but we know people have done it in the past and, well, human nature is a strange thing.

What is a HapMap (we all have heard of it)?

Posted by: Abbas  :  Category: General

What Is the HapMap?

The HapMap is a catalog of common genetic variants that occur in human beings. It describes what these variants are, where they occur in our DNA, and how they are distributed among people within populations and among populations in different parts of the world. The International HapMap Project is not using the information in the HapMap to establish connections between particular genetic variants and diseases. Rather, the Project is designed to provide information that other researchers can use to link genetic variants to the risk for specific illnesses, which will lead to new methods of preventing, diagnosing, and treating disease.

Read the rest here: The International HapMap Project, http://www.hapmap.org/.

Bioinformatics Tool Chest: Vector NTI

Posted by: Eli Roberson  :  Category: General

Continuing on the topic of bioinformatics tools for researchers, I thought I’d move away from R for a bit. The tool for today is Vector NTI. Formerly, Vector NTI was a product of Informax. To use Vector NTI, one had to purchase a license costing thousands of dollars. Then Vector was purchased by Invitrogen. This was great for researchers, because Invitrogen now offers annual, renewable licenses for free to academic researchers. Basically, all you need to do now is sign up on the Invitrogen site for the Vector NTI User Community and confirm that you are actually an academic researcher to get your free license.

Enough of the “how to get it” spiel. Why would you want to get Vector? Think of it as a Swiss Army knife of research software. A central feature of the software is the local database. The database stores DNA, RNA, and protein molecule sequences, restriction enzymes with recognition sequences, oligos with sequences, gel markers, citations, BLAST results, and analysis results. The database comes prepopulated with many molecules (especially from the Invitrogen product line), oligos, markers, etc. Furthermore, the database doesn’t just store sequences, but also annotations, such as genes and other key features. New molecules are easily added to the database. To add any of the molecules in the database to the current Vector tool, all you have to do is drag it from the database window into the tool window.

Okay, okay. I know what you’re saying, “Yeah yeah yeah, I can already store my data in a database.” But that isn’t all. Say you have a molecule that you want to design PCR primers for. Vector can do that for you, and help analyze multiplex PCR primers. Want to clone a DNA segment? Not a problem. Use the database to figure out the best vector and electronically create the molecule ahead of time. There are even cloning wizards!

What about sequencing? Say you run some standard dye-terminator capillary sequencing on an ABI machine. You can actually load the *.abi file directly into Vector to analyze and edit the chromatogram. Say you’ve cloned the DNA you were interested in and sequenced it in both directions. Load the *.abi file into the Contig module of Vector, edit the chromatograms, and then align them with the electronic molecule to see if your product matches expectation.

And all of these things are just the beginning of what you can do with Vector NTI. If you want to find out more, get it yourself and try it out.

Bioinformatics Tool Chest: Bioconductor

Posted by: Eli Roberson  :  Category: General, Technical

Following up on the previous bioinformatics tool chest post, I thought I’d cover Bioconductor next. Bioconductor is actually an off-shoot of the R-project.

Now hold on, I know what you’re thinking. “But you talked about R last time, why do we have to talk about R again?!?” It’s simple, really. Though Bioconductor is a derivative of R, its purpose is unique enough to deserve its own post.

Bioconductor (or BioC) is an open-source derivative of R focused on facilitating the analysis of genomic data. One might ask, why should I care? If you perform any kind of high-throughput SNP genotyping or gene expression analysis, this software suite gives you immediate access to free, open-source, extremely powerful data analysis options. Got Affymetrix CEL files for expression data? No problem. Bioconductor can load, normalize, analyze, and summarize that data for you. How about SNP genotyping data? Again, no problem. Want to check the copy number of your SNP data? You’ll have several options. Many Bioconductor packages are built using S4 methods and classes (the exact definitions of which are unimportant for this article). The advantage of that coding system is that you can use and extend existing classes to perform your own custom-designed analysis methods. And even better, once you’ve worked out a new method, you can incorporate it into a package and submit it to Bioconductor for everyone to use!

The bottom line is this: if you need powerful, customizable, freely available analysis software (and who doesn’t after spending ridiculous amounts of money running many samples on high-throughput technology), then Bioconductor is a viable choice. If you have genomic data, give BioC a try, and if it’s useful to you, build your own packages for the whole community.

How Perl Saved the Human Genome Project

Posted by: Abbas  :  Category: General, News

The helix graphic is reproduced from Dr. Lincoln Stein’s article “How Perl Saved the Human Genome Project” as published in the September 1996 issue of The Perl Journal.
Reprinted courtesy of the Perl Journal, http://www.tpj.com Archive.
Lincoln Stein’s website is http://stein.cshl.org

DATE: Early February, 1996

LOCATION: Cambridge, England, in the conference room of the largest DNA sequencing center in Europe.

OCCASION: A high level meeting between the computer scientists of this center and the largest DNA sequencing center in the United States.

THE PROBLEM: Although the two centers use almost identical laboratory techniques, almost identical databases, and almost identical data analysis tools, they still can’t interchange data or meaningfully compare results.

THE SOLUTION: Perl.

The human genome project was inaugurated at the beginning of the decade as an ambitious international effort to determine the complete DNA sequence of human beings and several experimental animals. The justification for this undertaking is both scientific and medical. By understanding the genetic makeup of an organism in excruciating detail, it is hoped that we will be better able to understand how organisms develop from single eggs into complex multicellular beings, how food is metabolized and transformed into the constituents of the body, how the nervous system assembles itself into a smoothly functioning ensemble. From the medical point of view, the wealth of knowledge that will come from knowing the complete DNA sequence will greatly accelerate the process of finding the causes of (and potential cures for) human diseases.

Six years after its birth, the genome project is ahead of schedule. Detailed maps of the human and all the experimental animals have been completed (mapping out the DNA using a series of landmarks is an obligatory first step before determining the complete DNA sequence). The sequence of the smallest model organism, yeast, is nearly completed, and the sequence of the next smallest, a tiny soil-dwelling worm, isn’t far behind. Large scale sequencing efforts for human DNA started at several centers a number of months ago and will be in full swing within the year.

read more…

Bioinformatics Tool Chest: R Programming Language

Posted by: Eli Roberson  :  Category: General, Technical

Data

Scientists love data. Call it a character flaw, but most of us can’t get enough. More data, more! But the data alone are just the start. To really be useful, we have to do something with the data. Model. Summarize. Evangelize it. Something. Who hasn’t needed to plot a standard curve? Or find the mean value of a series of numbers? What should you do when you have these questions?

The Problem

Many scientists turn to our friend Excel to solve these problems. It’s easy to work with, and you can even make graphs easily. That isn’t necessarily a good thing, as perfectly nice people make really bad graphs because those fancy 3D features are so tantalizing. Everyone interested in bioinformatics or computational biology needs a tool in their tool chest that can handle:

  1. statistics
  2. figure and graph creation
  3. very large data sets

The Solution

Look no further, friends: your savior has arrived, and its name is R. R is a free, cross-platform, open-source derivative of the S language. In case you didn’t catch that last part: R is free. You can download R from the nearest mirror to get started.

The Good

  • Freely available
  • Open-source — can compile it to your needs (OS, cpu, available memory, optimization levels)
  • Tons of add-on packages
  • Scriptable
  • Ability to write own functions and packages
  • Able to handle large datasets
  • Interfaces with compiled languages
  • Can save plots as PostScript files (print quality)
  • Extensive tutorials online along with mailing lists and archives for troubleshooting

The Bad

  • Command-line interface
  • Can be slow reading large files
  • Interpreted language (can be slower than compiled code)
  • No tech support line
  • Steep learning curve for beginners, especially non-programmers