Converting between Unix and Windows text files?

Posted by: Abbas  :  Category: General, Technical

The format of Windows and Unix text files differs slightly. In Windows, each line ends with a carriage return followed by a line feed (two ASCII characters), but Unix uses only a line feed. As a consequence, some Windows applications will not show the line breaks in Unix-format files. Likewise, Unix programs may display the carriage returns in Windows text files as Ctrl-M (^M) characters at the end of each line.
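The one-byte difference per line is easy to see on a small sample. The following is a minimal sketch, assuming a POSIX shell with printf and wc; the file names winfile_sample.txt and unixfile_sample.txt are made up for illustration:

```shell
# Build one-line sample files in each format (file names are illustrative).
printf 'hello\r\n' > winfile_sample.txt   # Windows: carriage return + line feed
printf 'hello\n'   > unixfile_sample.txt  # Unix: line feed only

# Count the bytes: the Windows file is one byte longer per line.
# (tr strips the padding that some wc implementations add.)
win_bytes=$(wc -c < winfile_sample.txt | tr -d ' ')
unix_bytes=$(wc -c < unixfile_sample.txt | tr -d ' ')
echo "Windows: $win_bytes bytes, Unix: $unix_bytes bytes"
```

The extra byte on each line is the carriage return that Unix programs show as ^M.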

There are many ways to solve this problem. This document provides instructions for using FTP, screen capture, unix2dos and dos2unix, tr, awk, Perl, and vi to do the conversion. Before you use these utilities, the files you are converting must first be on a Unix computer.

Note: In the instructions below, replace unixfile.txt with the name of the Unix file you are transferring, and replace winfile.txt with the name of the Windows file you are transferring.

FTP

When using an FTP program to move a text file between Unix and Windows, be sure the file is transferred in ASCII format. This will ensure that the document is transformed into a text format appropriate for the host. Some FTP programs, especially graphical applications like Hummingbird FTP, do this automatically. If you are using FTP from the command line, however, before you begin the file transfer, be sure to enter at the FTP prompt:

ascii

Note: You need to use a client that supports secure FTP to transfer files to and from Indiana University’s central systems. For more, see At IU, what SSH/SFTP clients are supported and where can I get them?

Screen capture

You can also convert files from Unix to Windows format when transferring them to a PC with a communications program by selecting ASCII text download. Select this option with your communications program to capture all the text subsequently displayed to your screen, and then enter at the Unix prompt:

cat unixfile.txt

Most communications programs will add carriage returns to the stream of text as they save it to your computer’s hard drive. Once the file has finished displaying, abort the text download.

Note: This method may be slow for large text files. Also, no error checking is performed on the file as it is transferred.

dos2unix and unix2dos

On systems using Solaris, the utilities dos2unix and unix2dos are available. These utilities provide a straightforward method for converting files from the Unix command line.

To use either command, simply type the command followed by the name of the file you wish to convert, and the name of a file which will contain the converted results. Thus, to convert a Windows file to a Unix file, at the Unix prompt, enter:

dos2unix winfile.txt unixfile.txt
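On systems without dos2unix, tr (one of the other tools listed above) can do roughly the same job. This is only a sketch, not a drop-in replacement: it deletes every carriage return in the file, not just those at line ends. The sample file is created here purely for illustration:

```shell
# Create a small Windows-format sample to convert.
printf 'line one\r\nline two\r\n' > winfile.txt

# Delete every carriage return to get Unix line endings.
tr -d '\r' < winfile.txt > unixfile.txt

# Count the carriage returns remaining in the result.
cr_left=$(tr -cd '\r' < unixfile.txt | wc -c | tr -d ' ')
echo "carriage returns left: $cr_left"
```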

Read more…

Linux: Who’s on the server???

Posted by: Eli Roberson  :  Category: Science, Technical

Linux? You geeks use Linux?

If you work in science, and you work on big datasets (such as analyzing next generation sequencing data), chances are that you use Linux for some of your work. I frequent several of our lab’s Red Hat servers for data analysis and code development purposes. However, these aren’t just my servers to use. Other lab members and, depending on the server, IT staff use them too. I try to remember to check and see who is on and what they’re running before getting too involved with something that’s going to hog memory or processor time. But, of course, I don’t always remember.

I decided to automate this process to take the remembering part out. By adding in a shell script + some code in my profile file, my ssh login immediately displays relevant information without having to invoke it manually.

Shell Script

The code is based on the Bash shell, so it may or may not apply to your ssh login. I keep the shell script in my /home/user directory with the name “.greeting.sh”. The leading period makes it invisible to standard “ls” queries, so it doesn’t add to the clutter in my home directory. The code for “.greeting.sh” follows between the lines of # signs:

##################################################
#!/bin/bash

# Gather login, host, and load information.
UNAME=`whoami`
TIME=`date`
HOST=`hostname`
UCNT=`users | wc -w`
ULST=`users`
# Sum the %CPU column of ps, rounded to the nearest integer.
PROC=`ps aux | awk 'NR > 0 { s += $3 }; END {printf("%d\n", s + 0.5);}'`
# Used memory as a rounded percentage of total memory.
MPCT=`free | grep Mem | awk '{printf("%d\n", $3 / $2 * 100 + 0.5);}'`
MYSHELL=`echo $SHELL`

echo
echo "$TIME"
echo "Shell: $MYSHELL"
echo "Hello $UNAME! Welcome to $HOST!"

if [ "$UCNT" -ge 2 ]
then
    echo "$UCNT users are currently logged into $HOST:"
    echo "$ULST"
else
    echo "No other users currently logged in."
fi

echo "System Status:"

if [ "$PROC" -ge 80 ]
then
    echo "High processor usage at ${PROC}%"
elif [ "$PROC" -ge 50 ]
then
    echo "Medium processor usage at ${PROC}%"
else
    echo "Low processor usage at ${PROC}%"
fi

if [ "$MPCT" -ge 80 ]
then
    echo "High memory usage at ${MPCT}%"
elif [ "$MPCT" -ge 50 ]
then
    echo "Medium memory usage at ${MPCT}%"
else
    echo "Low memory usage at ${MPCT}%"
fi

echo

exit 0
##################################################

The code above prints the following when logging in: the date, my current shell, a greeting with the hostname, whether other users are logged in (and the list of users if others are on), and information about current processor and memory usage. I customize this script depending on the primary use of the server. If you have a server that should always be running a certain program, add a line that looks for that program. If it were called “myprogram” you could add the following line to the script:

PROG=`ps aux | grep -v grep | grep myprogram | wc -l`

If the program is running, then it will return 1 (if only one instance is running), or 0 if it isn’t running. By adding in some language later testing if $PROG -ge 1, a message could print saying the program was running or not.
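A minimal sketch of that later test might look like the following. The function name report_program and the message wording are my own placeholders, not part of the original script:

```shell
#!/bin/bash
# report_program is a hypothetical helper; pass it the count stored in PROG.
report_program() {
    if [ "$1" -ge 1 ]
    then
        echo "myprogram is running ($1 instance(s))"
    else
        echo "myprogram is NOT running"
    fi
}

# Count matching processes, excluding the grep command itself.
PROG=`ps aux | grep -v grep | grep myprogram | wc -l`
report_program "$PROG"
```

Wrapping the check in a function keeps the main greeting readable if you end up watching several programs.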

Take note! Don’t forget to alter the permissions on the script to allow execution, using something like “chmod +x .greeting.sh”. Also note that the variables are defined using backticks (same key as the ~ on standard US QWERTY keyboards), not single quotes.
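If you want to check the memory calculation without waiting for a busy server, the awk rounding step can be run on made-up numbers. This is a sketch only: round_pct is a hypothetical helper, and the input imitates the “Mem:” line of free with the total in the second field and the used amount in the third:

```shell
# round_pct TOTAL USED: used / total as a percentage, rounded to nearest integer.
round_pct() {
    echo "Mem: $1 $2" | awk '{printf("%d\n", $3 / $2 * 100 + 0.5);}'
}

round_pct 3000 2000   # 66.67% rounds up to 67
round_pct 8000 2000   # exactly 25%
```

Adding 0.5 before the %d conversion truncates is what turns truncation into rounding.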

Automatically running

The script isn’t much use if you have to run it manually (if I remembered to do that, why would I need a script?), so I like to set the script to run automatically immediately following an ssh login. As I said before, I use Bash on most of the Linux servers I use. For this shell, there is a file called “.bash_profile” in the home directory of each user. This profile file is read by every login shell, including the one started by an ssh login, to set some common environment variables, like PATH. By adding in code to run the greeting script, the output from the script will be displayed immediately after login. Example code to add to the bottom of your profile file:

##################################################
if [ -e "/home/user/.greeting.sh" ]
then
/home/user/.greeting.sh
fi
##################################################

That’s all there is to it: a simple but powerful script that automatically gives you information on server login. Feel free to adapt it to your system and purpose.

Craig Venter: On the verge of creating synthetic life

Posted by: Abbas  :  Category: General, News, Science

Next Generation Seq Tools

Posted by: admin  :  Category: Technical

Something I came across.

Integrated solutions
* CLCbio Genomics Workbench - de novo and reference assembly of Sanger, 454, Solexa, Helicos, and SOLiD data. Commercial next-gen-seq software that extends the CLCbio Main Workbench software. Includes SNP detection, browser and other features. Runs on Windows, Mac OS X and Linux.

* NextGENe - de novo and reference assembly of Illumina and SOLiD data. Uses a novel Condensation Assembly Tool approach where reads are joined via “anchors” into mini-contigs before assembly. Requires Win or MacOS.

* SeqMan Genome Analyser - Software for Next Generation sequence assembly of Illumina, 454 Life Sciences and Sanger data integrating with Lasergene Sequence Analysis software for additional analysis and visualization capabilities. Can use a hybrid templated/de novo approach. Early release commercial software. Compatible with Windows® XP X64 and Mac OS X 10.4.

Read more…

Bioinformatics Tool Chest: Why You Should Be Using Firefox

Posted by: Eli Roberson  :  Category: Science, Technical

Firefox?!?!

I know what you’re thinking. “Come on. A browser? As a bioinformatics tool?” You might actually be surprised. I think that most people who do research spend at least some amount of time online trying to track down information. Maybe it’s a protein name, or DNA elements in a chromosome segment. Maybe it’s a certain paper or topic through PubMed. Personally, I spend a good amount of time searching out answers. Furthermore, I switch between databases and websites in different tabs to get information from different sources. Could there be a way to search faster?

Keyword Search To The Rescue!

Luckily, there is a faster way: the keyword search. Basically the keyword search will allow you to make a bookmark shortcut to any search box using a keyword. Once a keyword search has been saved that particular search can be invoked with just the keyword. I frequently use the UCSC Genome Browser for research, so I’ll use this as an example.

How To

  1. Navigate to the UCSC Genome Browser main page.
  2. In the top navigation panel click “Genomes”
  3. The default page should be the Human genome browser. If you are interested in a different organism you can certainly change it using the drop-down boxes. There should be an input box labeled “position or search term”. Right click in the box.
  4. In the pop-up menu select “Add a Keyword for This Search…”. An “Add Bookmark” window will appear.
  5. In the “Name” box type a descriptive name. In this case use “UCSC Human Search”.
  6. In the “Keyword” box type the keyword you want to use. In this case use “ucsc”.
  7. Press the “Add” button to save this search.

Let’s test the keyword. Open a new blank Firefox tab by pressing CTRL+T or File -> New Tab. In the address bar type “ucsc MECP2” and press enter. The “ucsc” keyword triggers the query “MECP2” to be run through the search box we saved. After a few seconds a window for the UCSC browser should appear listing possible genes matching the symbol MECP2. If you had navigated to the UCSC Browser directly and typed MECP2 directly in the search box the results would have been the same.

What about direct chromosome positions? Let’s try it. Clear the text from the URL bar, type “ucsc chr1:1-20000000”, and press enter. The page should change to show the first 20,000,000 base pairs of chromosome 1.
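Under the hood, a keyword bookmark is just a saved URL with a %s placeholder that the browser replaces with whatever you type after the keyword. The sketch below imitates that substitution in the shell. The function name expand_keyword is hypothetical, and the UCSC URL template is my assumption about what the saved bookmark looks like; check the bookmark’s properties in Firefox for the real one:

```shell
# expand_keyword TEMPLATE QUERY: substitute the query for %s in the template.
# Simplification: a real browser also URL-encodes the query, and this sed
# call would break on queries containing |, & or backslashes.
expand_keyword() {
    printf '%s\n' "$1" | sed "s|%s|$2|"
}

# Assumed template for the "ucsc" keyword (not taken from the browser itself).
UCSC_TEMPLATE='http://genome.ucsc.edu/cgi-bin/hgTracks?position=%s'

expand_keyword "$UCSC_TEMPLATE" 'MECP2'
expand_keyword "$UCSC_TEMPLATE" 'chr1:1-20000000'
```

Seen this way, any site whose search results live at a predictable URL can be given a keyword.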

What other uses could it have? What about a “pubmed” keyword search? Or an Ensembl search? It can be particularly powerful if you combine these searches. If you were researching Rett Syndrome, you could in one tab search for “pubmed Rett Syndrome”. After reading a few papers and finding information on MECP2 in Rett Syndrome, all you have to do is hit CTRL+T to open another tab. Then type “ucsc MECP2” to find it in the genome browser. If you had a saved search for the NCBI Protein database you could go even further by opening yet another tab and typing “protein MECP2_HUMAN” (assuming your keyword was protein). The result would be a page about the MECP2 protein in humans where you could get the amino acid sequence. Your specific search set would depend on what databases you search most frequently in your research.

This kind of time savings can really add up. Plus you can show off your cool new hack to friends when they’re trying to search for something.

Exporting Vector NTI Data — The Hail Mary

Posted by: Eli Roberson  :  Category: Science, Technical

VNTI Is Dead

The golden age of Vector NTI has ended, and free software licenses are no longer available to academics. This move has been disturbing to many, and support for deactivated licenses hasn’t been the best so far. But after I sent a plea to the tech support services associated with VNTI, they came through with some help.

To answer an oft-asked question, DNA/RNA/Protein sequences CANNOT be exported after a license has expired. I know, I know, bad programming practice and bad PR practice. BUT if your data is locked in you can get a temporary license to export everything. For DNA / RNA molecules you can export into GenBank, EMBL, and FASTA file formats. For protein sequences you can export into GenPept, SWISS-PROT, or Protein FASTA format. File export DOES NOT work for Enzymes, Oligos, Gel Markers, Citations, BLAST Results, or Analysis Results. Those of you with extensive Oligo libraries will want to contact Tech Support directly for assistance in exporting or moving these files. Sorry guys. It may or may not be supported.

Exporting DNA/RNA Molecules

  1. Open your VNTI Database.
  2. Go to ‘DNA/RNA Molecules’ from the drop down box.
  3. Select all the molecules you want to export. For everything, select one molecule and either press CTRL+A or use ‘Edit’ -> ‘Select All’.
  4. Go to ‘Edit’ -> ‘Copy To’ -> ‘File…’. Make sure to choose the format you want. If you want all three, just repeat the process for each one.

Exporting Protein Sequences

The process is identical to exporting DNA / RNA molecules, except the Protein Molecules library must be used.

Getting a Temporary License

To get your temporary license e-mail Technical Support at bioinfosupport[AT]invitrogen.com. In your message just explain that you’ve been a user of the VNTI free license, but the license expired and you need a temporary one to export all your data.

Now, I’m glad that Life Technologies / Invitrogen has come through with some help for the community. Do I agree with the change in marketing? No. Do I think the transition was handled gracefully? No. But they could have elected to lock everyone’s data in permanently, and have instead elected to extend the olive branch. Hope this helps some of you out there with trapped data.

The New York Times Nods to R

Posted by: Eli Roberson  :  Category: General

In a previous Bioinformatics Toolchest post I discussed R, a statistical programming environment that I’m a big fan of. Today the New York Times had a business computing article discussing the use of R in academics and business. From my own experience, I think the use of R is pretty extensive in the right academic circles. But there are some corporate giants such as Google and Pfizer that are listed as users of R as well. Historically most of these types of statistical analyses for academic, government, and corporate uses were performed using SAS. I don’t have any problem with SAS, but I’m a proponent of the current shift toward using free software. Still being a student, I can’t afford to pay for software. Having free alternatives like R available gives me a chance to use powerful software at an affordable cost, while still getting a wealth of features.

The grant environment for academic research is pretty hostile right now. In a time when funds are drying up, I know that I’d rather use a free alternative than pay for an individual license for each computer in my lab.

Vector NTI Is Dead // Long Live Vector NTI

Posted by: Eli Roberson  :  Category: General

Previously in the Bioinformatics Toolchest series I talked about Vector NTI as a great tool available free to researchers. Unfortunately I’m going to have to reverse that recommendation. The cornerstones of my reasoning were that the tool worked well and was freely available to academic researchers. However, Vector NTI is no longer free for researchers.

Vector NTI was offered by Invitrogen. But Invitrogen will not exist much longer. Invitrogen and Applied Biosystems have finalized a merger to become the mega-company Life Technologies. I’m personally not a big fan of the big companies swallowing each other to create even bigger companies with less competition. But I thought, hey, probably not that bad. A few weeks ago I received an e-mail since I am a registered Vector NTI user. It stated that on December 15, 2008 Vector NTI 11 would be out. This was exciting for me. New Vector features, bug fixes, streamlining, should be good. Wrong. The new version comes with the discontinuation of free licenses for researchers. Why, you ask? Good question. The original FAQ they published has disappeared since then, but is here courtesy of the Google cache. An excerpt from that FAQ follows:

6. Why has Invitrogen discontinued the free v10 license program?
Over the past three years, the v10 free license program has been an overwhelming success by the sheer number of researchers using this version of the software.  In that time, you have told us very clearly you want added features, easier licensing, and more personalized technical support.  In response, we have completely redesigned both the software and our licensing options for academic researchers.  Vector NTI Advance™ 11 contains major new cloning, design and search functionality, a completely updated interface, support on Intel-based Macs as well as Windows® Vista, and new, cost-effective 1-year and 3-year license options exclusively for academic researchers.  These new license options also include personalized Technical Support by email, and are delivered directly to you by email without the need to register or log in separately.  At significantly reduced prices compared with our Commercial Licenses, these new options respect the current grant funding and other realities of academic research.

Personally I would rather have the free license with the option to purchase a tech support contract, or pay a higher rate for per-use support. Who knows the real reason the licenses were discontinued? Maybe too many researchers asking for assistance. Maybe restructuring for additional money during and after the merger. Either way, Vector NTI is no longer a viable option for those looking for free tools. However, if you have liked the tool in the past and need its features for your research, it’s still a good application, if you’re willing to pay the price. Any suggestions for alternative free tools are encouraged in the comments.

Thanks for the informative NIH Access Battle

Posted by: Yasmeen  :  Category: General

Dear ER,

Your blurb was very informative and I agree, access for the public is a good thing.  It can demonstrate progress and reassure skeptics that scientists are using funding and research for the public good and not simply for themselves.  My only caveat is this: can other groups copy and steal information? I know it’s unethical, but we know people have done it in the past and, well, human nature is a strange thing.

NIH Public Access: The Battle Begins

Posted by: Eli Roberson  :  Category: News, Science

Previously I tried to get the word out on a change to the NIH policy for grant supported research that required researchers to transfer a copy of the final work to a repository (PMC) that provides free access to the article. My personally biased opinion is the policy was a great move, and that making scientific knowledge more highly available to everyone is a good thing.

Some publishers have already stepped up to embrace the new policy by transferring the paper to PMC for you, some well before the 1 year deadline. Others have no coherent plan and charge large fees for a paper to be transferred to PMC. For example, the American Psychological Society charges Wellcome Trust supported researchers $4,000 to send a copy of their paper to PubMed Central.

There already is controversy about the policy in Congress. House Bill HR 6845 was introduced (you can find it by querying ‘HR 6845’ here) as the Fair Copyright in Research Works Act. After glancing over it, it seems that the bill intends to reverse the NIH policy decision by making sure funding agencies can’t force the funded individuals to put their works in a public archive. While at the outset that may sound like it’s protecting the researcher by not ‘forcing’ them to make their work available, it seems to me it’s actually protection for publishers that don’t want to modify their business model. You publish with us, you transfer copyright to us, we get paid for others to view the work. That worked fine for a long time. But the world has changed. We now live in a world where information is instantly available. How about instead of reversing a policy that makes more information available to more people we try to work out a new publishing model?

Who knows where this whole thing will end up? I don’t have a clue. What I do know is that making scientific works available (even if after a waiting period) to a wider audience of researchers is a good thing that spurs more research and greater innovation. But that’s just my two cents.