Handy One Liners (awk)

Posted by: Abbas  :  Category: Technical
Compiled by Eric Pement

Latest version of this file (in English) is usually at:


This file will also be available in other languages:
   Chinese  - http://ximix.org/translation/awk1line_zh-CN.txt   


   Unix: awk '/pattern/ {print "$1"}'    # standard Unix shells
DOS/Win: awk '/pattern/ {print "$1"}'    # compiled with DJGPP, Cygwin
         awk "/pattern/ {print \"$1\"}"  # GnuWin32, UnxUtils, Mingw

Note that the DJGPP compilation (for DOS or Windows-32) permits an awk
script to follow Unix quoting syntax '/like/ {"this"}'. HOWEVER, if the
command interpreter is CMD.EXE or COMMAND.COM, single quotes will not
protect the redirection arrows (<, >) nor do they protect pipes (|).
These are special symbols which require "double quotes" to protect them
from interpretation as operating system directives. If the command
interpreter is bash, ksh or another Unix shell, then single and double
quotes will follow the standard Unix usage.

Users of MS-DOS or Microsoft Windows must remember that the percent
sign (%) is used to indicate environment variables, so this symbol must
be doubled (%%) to yield a single percent sign visible to awk.

If a script will not need to be quoted in Unix, DOS, or CMD, then I
normally omit the quote marks. If an example is peculiar to GNU awk,
the command 'gawk' will be used. Please notify me if you find errors or
new commands to add to this list (total length under 65 characters). I
usually try to put the shortest script first. To conserve space, I
normally use '1' instead of '{print}' to print each line. Either one
will work.


 # double space a file
 awk '1;{print ""}'
 awk 'BEGIN{ORS="\n\n"};1'
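
As a quick sanity check of the note that '1' and '{print}' are interchangeable, the sketch below runs both on a made-up two-line input and confirms that each double-spacing script turns two lines into four. The sample text and the variable names are illustrative only and are not part of the original list.

```shell
# Made-up sample input (two lines).
sample=$(printf 'alpha\nbeta\n')

# '1' is a pattern that is always true, so awk falls back to its
# default action, which is to print the current line.
with_one=$(printf '%s\n' "$sample" | awk '1')
with_print=$(printf '%s\n' "$sample" | awk '{print}')

# Both double-spacing scripts emit one blank line after every input line.
# tr strips the padding that some wc implementations add.
ds_a=$(printf '%s\n' "$sample" | awk '1;{print ""}' | wc -l | tr -d ' ')
ds_b=$(printf '%s\n' "$sample" | awk 'BEGIN{ORS="\n\n"};1' | wc -l | tr -d ' ')

echo "$with_one"
echo "lines after double spacing: $ds_a and $ds_b"
```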

Ggplot2 Tutelage

Posted by: Abbas  :  Category: Technical

For those interested in some ggplot2 tutelage, Hadley Wickham (the creator of ggplot2) recently posted a two-hour short course on data visualization with R (via ggplot2).

A blog post describing it is here:

The actual video is here:

The supposed slides (because you really can’t see the details all too well in the video) are here:

Metagenomics Resources

Posted by: Abbas  :  Category: Science, Technical

Next Gen. Sequencing

Posted by: Abbas  :  Category: Science

With IBM tossing its hat into the ring of “next-next-generation” sequencing, I’m starting to get lost as to which generation is which. For the moment, I’m sort of lumping things together, while I wait to see how the field plays out. In my mind, first generation is anything that requires chain termination, second generation is chemical-based pyrosequencing, and third generation is single-molecule sequencing based on a nano-scale mechanical process. It’s a crude divide, but it seems to have some consistency.

At any rate, I decided I’d collect a few videos to illustrate each one. For Sanger, there are a LOT of videos, many of which are quite excellent, but I only wanted one. (Sorry if I didn’t pick yours.) For second and third generation DNA sequencing videos, the selection kind of flattens out, and two of them come from corporate sites rather than YouTube, which seems to be the general consensus repository of technology videos.

Personally, I find it interesting to see how each group is selling themselves. You’ll notice some videos press heavily on the technology, while others focus on the workflow.

As an aside, I also find it interesting to look for places where the illustrations don’t make sense… there’s a lovely place in the 454 video where two strands of DNA split from each other on the bead, leaving the two full strands and a complete primer sequence… mysterious! (Yes, I do enjoy looking for inconsistencies when I go to the movies.)

Ok, get out your popcorn.

First Generation:
Sanger Entry: Link

Second Generation:
Pyrosequencing Entry: Link


NYTimes on Probiotics

Posted by: Abbas  :  Category: News, Science

There was an article on probiotics in the New York Times today. Written by Tara Parker-Pope, it addresses some important issues about probiotics that are rarely covered in the press (see Well – Probiotics – Looking Underneath the Yogurt Label – NYTimes.com).

On the one hand, the article does a decent job of pointing out that there is great strain-to-strain variation among microbes labelled as probiotics. In this regard there is a great quote by Gregor Reid:

“Lactobacillus is just the bacterium,” said Gregor Reid, director of the Canadian Research and Development Center for Probiotics. “To say a product contains Lactobacillus is like saying you’re bringing George Clooney to a party. It may be the actor, or it may be an 85-year-old guy from Atlanta who just happens to be named George Clooney. With probiotics, there are strain-to-strain differences.”


New Look at Cancer Biology

Posted by: Abbas  :  Category: News, Science

Sure, James Watson has been known, especially recently, to say some outrageous things. But here is something I think everyone, scientists and the public alike, should read: an opinion piece in the NY Times today by Watson (Op-Ed Contributor – To Fight Cancer, Know the Enemy – NYTimes.com).

This piece is worth reading because it contains some critical ideas and wisdom that have been missing in discussions of the fight against cancer.

First, Watson discusses the critical importance of basic science and says that when he expressed this importance to the National Cancer Institute advisory board many years ago, he was eventually booted off.

Second, he discusses how we have only recently begun to understand the basic biology of cancer (he also mentions how the human genome project has helped in this). The genome project will, he says, allow for the determination of most/all of the major genetic changes that occur in cancer cells.


The Pervasive Effects of an Antibiotic on the Gut

Posted by: Abbas  :  Category: Science

The Pervasive Effects of an Antibiotic on the Human Gut Microbiota, as Revealed by Deep 16S rRNA Sequencing

Dethlefsen L, Huse S, Sogin ML, Relman DA
PLoS Biology Vol. 6, No. 11, e280 doi:10.1371/journal.pbio.0060280
A paper in PLOS Biology from the Relman lab investigates the effect of a treatment with the antibiotic ciprofloxacin on the bacteria in the intestine. They collected over 7,000 full-length 16S rDNA sequences (1100-1400 bp) by Sanger sequencing and over 900,000 reads (~250 bp) from 454 sequencing of the V3 and the V6 regions.
There are many important results in this paper, but it is particularly relevant that 454 sequencing reveals more taxonomic variation with greater stability than traditional sequencing. In my own work, I have found that sequence variants that occur only once in the experiment cannot be used to differentiate samples. Deep sequencing reveals more taxa, and also reduces the frequency of singletons. A rare sequence variant (OTU) that occurs only once in the ~7000 full-length sequences occurs about 65 times in the 454 data set, providing more than enough “probability of detection” to be used for comparisons between samples.
“This set of 7,208 sequences is among the largest datasets of full-length 16S rRNA sequences from the human microbiota (or any environment), the rarefaction curves for V6 and V3 tag pyrosequencing eventually rise higher and display more curvature toward the horizontal than the OTU0.01 curve. These features show that a single run of the [454] FLX sequencer targeting V6 or V3 tags from the human gut microbiota can reveal more taxa, and capture a larger proportion of the detectable taxa, than a more extensive effort directed toward full-length 16S rRNA clone sequencing.”

Converting between Unix and Windows text files?

Posted by: Abbas  :  Category: General, Technical

The format of Windows and Unix text files differs slightly. In Windows, each line ends with a carriage return followed by a line feed (the ASCII characters CR and LF), but Unix uses only a line feed. As a consequence, some Windows applications will not show the line breaks in Unix-format files. Likewise, Unix programs may display the carriage returns in Windows text files as Ctrl-M (^M) characters at the end of each line.
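
To make the difference concrete, here is a small sketch that writes one file in each format and counts the lines that end in a carriage return. The file names reuse the placeholders from this post, the temporary directory is created only for the demonstration, and count_cr is a hypothetical helper name.

```shell
# Build one Windows-style (CR LF) and one Unix-style (LF) file.
tmpdir=$(mktemp -d)
printf 'one\r\ntwo\r\n' > "$tmpdir/winfile.txt"
printf 'one\ntwo\n'     > "$tmpdir/unixfile.txt"

# Count the lines whose last character is a carriage return.
# "n + 0" forces a numeric 0 when no line matched.
count_cr() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$1"
}

win_cr=$(count_cr "$tmpdir/winfile.txt")
unix_cr=$(count_cr "$tmpdir/unixfile.txt")
echo "winfile.txt: $win_cr CR-terminated lines; unixfile.txt: $unix_cr"

rm -rf "$tmpdir"
```

A non-zero count is a reasonable hint that a file came from Windows and would show ^M characters in a Unix editor.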

There are many ways to solve this problem. This document provides instructions for using FTP, screen capture, unix2dos and dos2unix, tr, awk, Perl, and vi to do the conversion. Before you use these utilities, the files you are converting must first be on a Unix computer.

Note: In the instructions below, replace unixfile.txt with the name of the Unix file you are transferring, and replace winfile.txt with the name of the Windows file you are transferring.


FTP

When using an FTP program to move a text file between Unix and Windows, be sure the file is transferred in ASCII format. This will ensure that the document is transformed into a text format appropriate for the host. Some FTP programs, especially graphical applications like Hummingbird FTP, do this automatically. If you are using FTP from the command line, however, before you begin the file transfer, be sure to enter at the FTP prompt:

ascii


Note: You need to use a client that supports secure FTP to transfer files to and from Indiana University’s central systems. For more, see At IU, what SSH/SFTP clients are supported and where can I get them?

Screen capture

You can also convert files from Unix to Windows format when transferring them to a PC with a communications program by selecting ASCII text download. Select this option with your communications program to capture all the text subsequently displayed to your screen, and then enter at the Unix prompt:

cat unixfile.txt

Most communications programs will add carriage returns to the stream of text as they save it to your computer’s hard drive. Once the file has finished displaying, abort the text download.

Note: This method may be slow for large text files. Also, no error checking is performed on the file as it is transferred.

dos2unix and unix2dos

On systems using Solaris, the utilities dos2unix and unix2dos are available. These utilities provide a straightforward method for converting files from the Unix command line.

To use either command, simply type the command followed by the name of the file you wish to convert, and the name of a file which will contain the converted results. Thus, to convert a Windows file to a Unix file, at the Unix prompt, enter:

dos2unix winfile.txt unixfile.txt
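
Where dos2unix and unix2dos are not installed, tr and awk (both listed above as alternatives) can do the same job. The commands below are a minimal sketch under that assumption, not necessarily the exact commands from the full article. The name roundtrip.txt is made up for the demonstration.

```shell
tmpdir=$(mktemp -d)
printf 'one\r\ntwo\r\n' > "$tmpdir/winfile.txt"

# Windows -> Unix: delete every carriage return.
# Note that this also removes any stray CR in the middle of a line.
tr -d '\r' < "$tmpdir/winfile.txt" > "$tmpdir/unixfile.txt"

# Unix -> Windows: have awk end each output record with CR LF.
awk 'BEGIN { ORS = "\r\n" } { print }' "$tmpdir/unixfile.txt" > "$tmpdir/roundtrip.txt"

# Check the result: the Unix file is two bytes shorter (one CR per line),
# and converting back reproduces the original Windows file exactly.
unix_bytes=$(wc -c < "$tmpdir/unixfile.txt" | tr -d ' ')
same=no
cmp -s "$tmpdir/winfile.txt" "$tmpdir/roundtrip.txt" && same=yes
echo "unixfile.txt is $unix_bytes bytes; round trip matches original: $same"

rm -rf "$tmpdir"
```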


Linux: Who’s on the server???

Posted by: Eli Roberson  :  Category: Science, Technical

Linux? You geeks use Linux?

If you work in science, and you work on big datasets (such as analyzing next generation sequencing data), chances are that you use Linux for some of your work. I frequent several of our lab’s Red Hat servers for data analysis and code development purposes. However, these aren’t just my servers to use. Other lab members and, depending on the server, IT staff use them too. I try to remember to check and see who is on and what they’re running before getting too involved with something that’s going to hog memory or processor time. But, of course, I don’t always remember.

I decided to automate this process to take the remembering part out. By adding in a shell script + some code in my profile file, my ssh login immediately displays relevant information without having to invoke it manually.

Shell Script

The code is based on the Bash shell, so it may or may not apply to your ssh login. I keep the shell script in my /home/user directory with the name “.greeting.sh”. Adding the leading period just makes it invisible to standard “ls” queries so it doesn’t add to the clutter in my home directory. The code for the “.greeting.sh” follows between the lines of # signs:


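########################################
#!/bin/bash
# NOTE: the definitions of TIME, MYSHELL, UNAME, HOST and ULST were missing
# from this listing; the five lines below are assumed reconstructions.
TIME=`date`
MYSHELL=$SHELL
UNAME=`whoami`
HOST=`hostname`
ULST=`users`
UCNT=`users | wc -w`
# Sum the %CPU column over all processes (NR > 1 skips the header), rounded.
PROC=`ps aux | awk 'NR > 1 { s += $3 }; END { printf("%d\n", s + 0.5); }'`
# Used memory as a rounded percentage of total memory.
MPCT=`free | grep Mem | awk '{ printf("%d\n", $3 / $2 * 100 + 0.5); }'`

echo "$TIME"
echo "Shell: $MYSHELL"
echo "Hello $UNAME! Welcome to $HOST!"

if [ "$UCNT" -ge 2 ]; then
    echo "$UCNT users are currently logged into $HOST:"
    echo "$ULST"
else
    echo "No other users currently logged in."
fi

echo "System Status:"

if [ "$PROC" -ge 80 ]; then
    echo "High processor usage at ${PROC}%"
elif [ "$PROC" -ge 50 ]; then
    echo "Medium processor usage at ${PROC}%"
else
    echo "Low processor usage at ${PROC}%"
fi

if [ "$MPCT" -ge 80 ]; then
    echo "High memory usage at ${MPCT}%"
elif [ "$MPCT" -ge 50 ]; then
    echo "Medium memory usage at ${MPCT}%"
else
    echo "Low memory usage at ${MPCT}%"
fi

exit 0
########################################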
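
The processor and memory checks use the same 80 and 50 thresholds, so one possible tidy-up is to move the comparison into a helper function. This is only a sketch and not part of the original script: usage_level is a hypothetical name, and the sample_free text is made-up output in the layout that “free” prints, used here so the arithmetic can be followed by hand.

```shell
# Map a whole-number percentage to the labels used in the greeting.
usage_level() {
    if [ "$1" -ge 80 ]; then
        echo "High"
    elif [ "$1" -ge 50 ]; then
        echo "Medium"
    else
        echo "Low"
    fi
}

# Made-up "free" output: 8000000 total, 4100000 used.
sample_free='              total        used        free
Mem:        8000000     4100000     3900000'

# Same awk expression as the script: used / total * 100, plus 0.5 so that
# the %d truncation rounds to the nearest integer (51.25 + 0.5 -> 51).
MPCT=`echo "$sample_free" | grep Mem | awk '{ printf("%d\n", $3 / $2 * 100 + 0.5); }'`

echo "`usage_level "$MPCT"` memory usage at ${MPCT}%"
```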

When I log in, the code above prints the date, my current shell, a greeting with the hostname, whether other users are logged in (and the list of users if others are on), and information about current processor and memory usage. I customize this script depending on the primary use of the server. If you have a server that should always be running a certain program, add a line that looks for that program. If it were called “myprogram”, you could add the following line to the script:

PROG=`ps aux | grep -v grep | grep myprogram | wc -l`

This sets PROG to the number of matching processes: 1 if exactly one instance is running, or 0 if it isn’t running. A later test of whether $PROG -ge 1 can then print a message saying whether the program is running or not.
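
A minimal sketch of that later test, assuming the placeholder name “myprogram” from above. The function name report_program and the message wording are made up for illustration.

```shell
# $1 = program name, $2 = number of matching processes
report_program() {
    if [ "$2" -ge 1 ]; then
        echo "$1 is running ($2 instances)"
    else
        echo "$1 is NOT running"
    fi
}

# Same counting line as in the text; replace myprogram with a real name.
PROG=`ps aux | grep -v grep | grep myprogram | wc -l`
report_program myprogram "$PROG"
```

Keep in mind that grep matches any command line containing the string, so a short or common program name may be counted more often than expected.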

Take note! Don’t forget to alter the permissions on the script to allow execution, using something like “chmod +x .greeting.sh”. Also note that the variables are defined using backticks (same key as the ~ on standard US QWERTY keyboards), not single quotes.

Automatically running

The script isn’t much use if you have to run it manually (if I remembered to do that, why would I need a script?), so I like to set the script to run automatically immediately following an ssh login. As I said before, I use Bash on most of the Linux servers I use. For this shell, there is a file called “.bash_profile” in the home directory of each user. This profile file is executed on every ssh connection to set some common environment variables, like PATH. By adding in code to run the greeting script, the output from the script will be displayed immediately after login. Example code to add to the bottom of your profile file:

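# Run (rather than source) the script so its final "exit 0" does not end the login shell.
if [ -e "/home/user/.greeting.sh" ]; then /home/user/.greeting.sh; fi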

That’s all there is to it. A simple but powerful script to automatically give you information on server login. Feel free to adapt it to your system and purpose.

Craig Venter: On the verge of creating synthetic life

Posted by: Abbas  :  Category: General, News, Science