<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Bioinformatics Blog &#187; Technical</title>
	<atom:link href="http://bioinformatics.whatheblog.com/category/technical/feed/" rel="self" type="application/rss+xml" />
	<link>http://bioinformatics.whatheblog.com</link>
	<description>One base pair at a time...</description>
	<lastBuildDate>Sun, 28 Mar 2010 16:51:14 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.5</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Handy One Liners (awk)</title>
		<link>http://bioinformatics.whatheblog.com/2010/03/handy-one-liners-awk/</link>
		<comments>http://bioinformatics.whatheblog.com/2010/03/handy-one-liners-awk/#comments</comments>
		<pubDate>Sun, 28 Mar 2010 16:47:01 +0000</pubDate>
		<dc:creator>Abbas</dc:creator>
				<category><![CDATA[Technical]]></category>
		<category><![CDATA[awk]]></category>

		<guid isPermaLink="false">http://bioinformatics.whatheblog.com/?p=44</guid>
		<description><![CDATA[Compiled by Eric Pement

Latest version of this file (in English) is usually at:
   http://www.pement.org/awk/awk1line.txt

This file will also be available in other languages:
   Chinese  - http://ximix.org/translation/awk1line_zh-CN.txt   

USAGE:

   Unix: awk '/pattern/ {print "$1"}'    # standard Unix shells
DOS/Win: awk '/pattern/ {print "$1"}'    # [...]]]></description>
			<content:encoded><![CDATA[<pre>Compiled by Eric Pement

Latest version of this file (in English) is usually at:
   http://www.pement.org/awk/awk1line.txt

This file will also be available in other languages:
   Chinese  - http://ximix.org/translation/awk1line_zh-CN.txt   

USAGE:

   Unix: awk '/pattern/ {print "$1"}'    # standard Unix shells
DOS/Win: awk '/pattern/ {print "$1"}'    # compiled with DJGPP, Cygwin
         awk "/pattern/ {print \"$1\"}"  # GnuWin32, UnxUtils, Mingw

Note that the DJGPP compilation (for DOS or Windows-32) permits an awk
script to follow Unix quoting syntax '/like/ {"this"}'. HOWEVER, if the
command interpreter is CMD.EXE or COMMAND.COM, single quotes will not
protect the redirection arrows (&lt;, &gt;) nor do they protect pipes (|).
These are special symbols which require "double quotes" to protect them
from interpretation as operating system directives. If the command
interpreter is bash, ksh or another Unix shell, then single and double
quotes will follow the standard Unix usage.

Users of MS-DOS or Microsoft Windows must remember that the percent
sign (%) is used to indicate environment variables, so this symbol must
be doubled (%%) to yield a single percent sign visible to awk.

If a script will not need to be quoted in Unix, DOS, or CMD, then I
normally omit the quote marks. If an example is peculiar to GNU awk,
the command 'gawk' will be used. Please notify me if you find errors or
new commands to add to this list (total length under 65 characters). I
usually try to put the shortest script first. To conserve space, I
normally use '1' instead of '{print}' to print each line. Either one
will work.

FILE SPACING:

 # double space a file
 awk '1;{print ""}'
 awk 'BEGIN{ORS="\n\n"};1'
<span id="more-44"></span>
 # double space a file which already has blank lines in it. Output file
 # should contain no more than one blank line between lines of text.
 # NOTE: On Unix systems, DOS lines which have only CRLF (\r\n) are
 # often treated as non-blank, and thus 'NF' alone will return TRUE.
 awk 'NF{print $0 "\n"}'

 # triple space a file
 awk '1;{print "\n"}'

NUMBERING AND CALCULATIONS:

 # precede each line by its line number FOR THAT FILE (left alignment).
 # Using a tab (\t) instead of space will preserve margins.
 awk '{print FNR "\t" $0}' files*

 # precede each line by its line number FOR ALL FILES TOGETHER, with tab.
 awk '{print NR "\t" $0}' files*

 # number each line of a file (number on left, right-aligned)
 # Double the percent signs if typing from the DOS command prompt.
 awk '{printf("%5d : %s\n", NR,$0)}'

 # number each line of file, but only print numbers if line is not blank
 # Remember caveats about Unix treatment of \r (mentioned above)
 awk 'NF{$0=++a " :" $0};1'
 awk '{print (NF? ++a " :" :"") $0}'

 # count lines (emulates "wc -l")
 awk 'END{print NR}'

 # print the sums of the fields of every line
 awk '{s=0; for (i=1; i&lt;=NF; i++) s=s+$i; print s}'

 # add all fields in all lines and print the sum
 awk '{for (i=1; i&lt;=NF; i++) s=s+$i}; END{print s}'

 # print every line after replacing each field with its absolute value
 awk '{for (i=1; i&lt;=NF; i++) if ($i &lt; 0) $i = -$i; print }'
 awk '{for (i=1; i&lt;=NF; i++) $i = ($i &lt; 0) ? -$i : $i; print }'

 # print the total number of fields ("words") in all lines
 awk '{ total = total + NF }; END {print total}' file

 # print the total number of lines that contain "Beth"
 awk '/Beth/{n++}; END {print n+0}' file

 # print the largest first field and the line that contains it
 # Intended for finding the longest string in field #1
 awk '$1 &gt; max {max=$1; maxline=$0}; END{ print max, maxline}'

 # print the number of fields in each line, followed by the line
 awk '{ print NF ":" $0 } '

 # print the last field of each line
 awk '{ print $NF }'

 # print the last field of the last line
 awk '{ field = $NF }; END{ print field }'

 # print every line with more than 4 fields
 awk 'NF &gt; 4'

 # print every line where the value of the last field is &gt; 4
 awk '$NF &gt; 4'
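
The "Beth" counter above ends with 'n+0' so that input with no matches
still prints a number. A minimal sketch of the difference, assuming
made-up sample records and a name ("Kathy") that never matches; the
shell variable names are illustrative only:

```shell
# Sample records are illustrative, not taken from any real file.
sample='Beth 4.00
Dan 3.75'

# With n+0 the uninitialized counter is forced to a number and prints 0.
with_zero=$(printf '%s\n' "$sample" | awk '/Kathy/{n++}; END {print n+0}')

# Without +0 the uninitialized counter prints as an empty string.
without_zero=$(printf '%s\n' "$sample" | awk '/Kathy/{n++}; END {print n}')

echo "with n+0: [$with_zero]   with n alone: [$without_zero]"
```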

STRING CREATION:

 # create a string of a specific length (e.g., generate 513 spaces)
 awk 'BEGIN{while (a++&lt;513) s=s " "; print s}'

 # insert a string of specific length at a certain character position
 # Example: insert 49 spaces after column #6 of each input line.
 gawk --re-interval 'BEGIN{while(a++&lt;49)s=s " "};{sub(/^.{6}/,"&amp;" s)};1'

ARRAY CREATION:

 # These next 2 entries are not one-line scripts, but the technique
 # is so handy that it merits inclusion here.

 # create an array named "month", indexed by numbers, so that month[1]
 # is 'Jan', month[2] is 'Feb', month[3] is 'Mar' and so on.
 split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", month, " ")

 # create an array named "mdigit", indexed by strings, so that
 # mdigit["Jan"] is 1, mdigit["Feb"] is 2, etc. Requires "month" array
 for (i=1; i&lt;=12; i++) mdigit[month[i]] = i
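
The two array entries above are fragments, so they need a surrounding
program to run. A minimal sketch, assuming they are placed in a BEGIN
block; the variable "name", its sample value "Mar", and the shell
variable "month_number" are illustrative. The loop is written here as
"for (i in month)", which fills mdigit with the same pairs as the
counted loop above.

```shell
# Build both lookup tables, then convert a month name to its number.
# "Mar" is sample input passed in through the awk variable "name".
month_number=$(awk -v name="Mar" 'BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", month, " ")
    for (i in month) mdigit[month[i]] = i
    print mdigit[name]
}')
echo "$month_number"
```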

TEXT CONVERSION AND SUBSTITUTION:

 # IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
 awk '{sub(/\r$/,"")};1'   # assumes EACH line ends with Ctrl-M

 # IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format
 awk '{sub(/$/,"\r")};1'

 # IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format
 awk 1

 # IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
 # Cannot be done with DOS versions of awk, other than gawk:
 gawk -v BINMODE="w" '1' infile &gt;outfile

 # Use "tr" instead.
 tr -d \r &lt;infile &gt;outfile            # GNU tr version 1.22 or higher

 # delete leading whitespace (spaces, tabs) from front of each line
 # aligns all text flush left
 awk '{sub(/^[ \t]+/, "")};1'

 # delete trailing whitespace (spaces, tabs) from end of each line
 awk '{sub(/[ \t]+$/, "")};1'

 # delete BOTH leading and trailing whitespace from each line
 awk '{gsub(/^[ \t]+|[ \t]+$/,"")};1'
 awk '{$1=$1};1'           # also removes extra space between fields

 # insert 5 blank spaces at beginning of each line (make page offset)
 awk '{sub(/^/, "     ")};1'

 # align all text flush right on a 79-column width
 awk '{printf "%79s\n", $0}' file*

 # center all text on a 79-character width
 awk '{l=length();s=int((79-l)/2); printf "%"(s+l)"s\n",$0}' file*

 # substitute (find and replace) "foo" with "bar" on each line
 awk '{sub(/foo/,"bar")}; 1'           # replace only 1st instance
 gawk '{$0=gensub(/foo/,"bar",4)}; 1'  # replace only 4th instance
 awk '{gsub(/foo/,"bar")}; 1'          # replace ALL instances in a line

 # substitute "foo" with "bar" ONLY for lines which contain "baz"
 awk '/baz/{gsub(/foo/, "bar")}; 1'

 # substitute "foo" with "bar" EXCEPT for lines which contain "baz"
 awk '!/baz/{gsub(/foo/, "bar")}; 1'

 # change "scarlet" or "ruby" or "puce" to "red"
 awk '{gsub(/scarlet|ruby|puce/, "red")}; 1'

 # reverse order of lines (emulates "tac")
 awk '{a[i++]=$0} END {for (j=i-1; j&gt;=0;) print a[j--] }' file*

 # if a line ends with a backslash, append the next line to it (fails if
 # there are multiple lines ending with backslash...)
 awk '/\\$/ {sub(/\\$/,""); getline t; print $0 t; next}; 1' file*

 # print and sort the login names of all users
 awk -F ":" '{print $1 | "sort" }' /etc/passwd

 # print the first 2 fields, in opposite order, of every line
 awk '{print $2, $1}' file

 # switch the first 2 fields of every line
 awk '{temp = $1; $1 = $2; $2 = temp; print}' file

 # print every line, deleting the second field of that line
 awk '{ $2 = ""; print }'

 # print in reverse order the fields of every line
 awk '{for (i=NF; i&gt;0; i--) printf("%s ",$i);print ""}' file

 # concatenate every 5 lines of input, using a comma separator
 # between fields
 awk 'ORS=NR%5?",":"\n"' file
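
The 5-line concatenation entry above relies on an assignment used as a
pattern. ORS is set to "," or "\n" before the record is printed, and
either value is a non-empty string, so the pattern is true and the
default print action runs. A minimal sketch with ten numbered lines as
made-up input (the shell variable "joined" is illustrative):

```shell
# Ten numbered lines stand in for a real input file.
joined=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | awk 'ORS=NR%5?",":"\n"')

# Every fifth record ends with a newline, the others with a comma.
printf '%s\n' "$joined"
```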

SELECTIVE PRINTING OF CERTAIN LINES:

 # print first 10 lines of file (emulates behavior of "head")
 awk 'NR &lt; 11'

 # print first line of file (emulates "head -1")
 awk 'NR&gt;1{exit};1'

 # print the last 2 lines of a file (emulates "tail -2")
 awk '{y=x "\n" $0; x=$0};END{print y}'

 # print the last line of a file (emulates "tail -1")
 awk 'END{print}'

 # print only lines which match regular expression (emulates "grep")
 awk '/regex/'

 # print only lines which do NOT match regex (emulates "grep -v")
 awk '!/regex/'

 # print any line where field #5 is equal to "abc123"
 awk '$5 == "abc123"'

 # print only those lines where field #5 is NOT equal to "abc123"
 # This will also print lines which have less than 5 fields.
 awk '$5 != "abc123"'
 awk '!($5 == "abc123")'

 # matching a field against a regular expression
 awk '$7  ~ /^[a-f]/'    # print line if field #7 matches regex
 awk '$7 !~ /^[a-f]/'    # print line if field #7 does NOT match regex

 # print the line immediately before a regex, but not the line
 # containing the regex
 awk '/regex/{print x};{x=$0}'
 awk '/regex/{print (NR==1 ? "match on line 1" : x)};{x=$0}'

 # print the line immediately after a regex, but not the line
 # containing the regex
 awk '/regex/{getline;print}'

 # grep for AAA and BBB and CCC (in any order on the same line)
 awk '/AAA/ &amp;&amp; /BBB/ &amp;&amp; /CCC/'

 # grep for AAA and BBB and CCC (in that order)
 awk '/AAA.*BBB.*CCC/'

 # print only lines of 65 characters or longer
 awk 'length &gt; 64'

 # print only lines of less than 65 characters
 awk 'length &lt; 65'

 # print section of file from regular expression to end of file
 awk '/regex/,0'
 awk '/regex/,EOF'

 # print section of file based on line numbers (lines 8-12, inclusive)
 awk 'NR==8,NR==12'

 # print line number 52
 awk 'NR==52'
 awk 'NR==52 {print;exit}'          # more efficient on large files

 # print section of file between two regular expressions (inclusive)
 awk '/Iowa/,/Montana/'             # case sensitive

SELECTIVE DELETION OF CERTAIN LINES:

 # delete ALL blank lines from a file (same as "grep '.' ")
 awk NF
 awk '/./'

 # remove duplicate, consecutive lines (emulates "uniq")
 awk 'NR==1 || $0 != a ""; {a=$0}'

 # remove duplicate, nonconsecutive lines
 awk '!a[$0]++'                     # most concise script
 awk '!($0 in a){a[$0];print}'      # most efficient script
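
How '!a[$0]++' works: a[$0] counts how many times a line has been seen.
The post-increment returns the count before adding one, so it is 0
(false) on first sight, and '!' turns that into true, which prints the
line. Later copies see a nonzero count and are skipped. A minimal
sketch with made-up input (the shell variable "deduped" is illustrative):

```shell
# Sample input with a nonconsecutive duplicate ("alpha" appears twice).
deduped=$(printf 'alpha\nbeta\nalpha\ngamma\n' | awk '!a[$0]++')

# Only the first occurrence of each line survives, in original order.
printf '%s\n' "$deduped"
```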

CREDITS AND THANKS:

Special thanks to the late Peter S. Tillier (U.K.) for helping me with
the first release of this FAQ file, and to Daniel Jana, Yisu Dong, and
others for their suggestions and corrections.

For additional syntax instructions, including the way to apply editing
commands from a disk file instead of the command line, consult:

  "sed &amp; awk, 2nd Edition," by Dale Dougherty and Arnold Robbins
  (O'Reilly, 1997)

  "UNIX Text Processing," by Dale Dougherty and Tim O'Reilly (Hayden
  Books, 1987)

  "GAWK: Effective awk Programming," 3d edition, by Arnold D. Robbins
  (O'Reilly, 2003) or at http://www.gnu.org/software/gawk/manual/

To fully exploit the power of awk, one must understand "regular
expressions." For detailed discussion of regular expressions, see
"Mastering Regular Expressions, 3d edition" by Jeffrey Friedl (O'Reilly,
2006).

The info and manual ("man") pages on Unix systems may be helpful (try
"man awk", "man nawk", "man gawk", "man regexp", or the section on
regular expressions in "man ed").

USE OF '\t' IN awk SCRIPTS: For clarity in documentation, I have used
'\t' to indicate a tab character (0x09) in the scripts.  All versions of
awk should recognize this abbreviation.

#---end of file---</pre>
]]></content:encoded>
			<wfw:commentRss>http://bioinformatics.whatheblog.com/2010/03/handy-one-liners-awk/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Ggplot2 Tutelage</title>
		<link>http://bioinformatics.whatheblog.com/2010/03/ggplot2-tutelage/</link>
		<comments>http://bioinformatics.whatheblog.com/2010/03/ggplot2-tutelage/#comments</comments>
		<pubDate>Sat, 27 Mar 2010 19:12:52 +0000</pubDate>
		<dc:creator>Abbas</dc:creator>
				<category><![CDATA[Technical]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://bioinformatics.whatheblog.com/?p=40</guid>
		<description><![CDATA[For those interested in some ggplot2 tutelage, Hadley Wickham (the creator of ggplot2) recently posted a 2 hour short course on data visualization with R (via ggplot2).
A blog post describing it is here:
 http://blog.revolution-computing.com/2010/03/video-hadley-wickham-gives-a-short-course-on-graphics-with-r.html
The actual video is here:
 http://had.blip.tv/
The supposed slides (because you really can&#8217;t see the details all too well in the video) are [...]]]></description>
			<content:encoded><![CDATA[<p>For those interested in some ggplot2 tutelage, Hadley Wickham (the creator of ggplot2) recently posted a 2 hour short course on data visualization with R (via ggplot2).</p>
<p>A blog post describing it is here:<br />
<a href="http://blog.revolution-computing.com/2010/03/video-hadley-wickham-gives-a-short-course-on-graphics-with-r.html"> http://blog.revolution-computing.com/2010/03/video-hadley-wickham-gives-a-short-course-on-graphics-with-r.html</a></p>
<p>The actual video is here:<br />
<a href="http://had.blip.tv/"> http://had.blip.tv/</a></p>
<p>The supposed slides (because you really can&#8217;t see the details all too well in the video) are here:<br />
<a href="http://had.co.nz/rice-vis/"> http://had.co.nz/rice-vis/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://bioinformatics.whatheblog.com/2010/03/ggplot2-tutelage/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Metagenomics Resources</title>
		<link>http://bioinformatics.whatheblog.com/2009/10/metagenomics-resources/</link>
		<comments>http://bioinformatics.whatheblog.com/2009/10/metagenomics-resources/#comments</comments>
		<pubDate>Tue, 20 Oct 2009 05:37:10 +0000</pubDate>
		<dc:creator>Abbas</dc:creator>
				<category><![CDATA[Science]]></category>
		<category><![CDATA[Technical]]></category>
		<category><![CDATA[genome. microbiome]]></category>
		<category><![CDATA[metagenomics]]></category>

		<guid isPermaLink="false">http://bioinformatics.whatheblog.com/?p=39</guid>
		<description><![CDATA[
MEGAN MEtaGenome ANalyzer. A stand-alone metagenome analysis tool.
Metagenomics and Our Microbial Planet A website on metagenomics and the vital role of microbes on Earth from the National Academies.
The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet A report released by the National Research Council in March 2007. Also, see the Report In [...]]]></description>
			<content:encoded><![CDATA[<ul>
<li><a class="external text" rel="nofollow" href="http://www-ab.informatik.uni-tuebingen.de/software/megan/">MEGAN</a> MEtaGenome ANalyzer. A stand-alone metagenome analysis tool.</li>
<li><a class="external text" rel="nofollow" href="http://dels.nas.edu/metagenomics/">Metagenomics and Our Microbial Planet</a> A website on metagenomics and the vital role of microbes on Earth from the <a class="external text" rel="nofollow" href="http://nationalacademies.org/">National Academies.</a></li>
<li><a class="external text" rel="nofollow" href="http://books.nap.edu/catalog.php?record_id=11902">The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet</a> A report released by the National Research Council in March 2007. Also, see the <a class="external text" rel="nofollow" href="http://dels.nas.edu/dels/rpt_briefs/metagenomics_brief_final.pdf">Report In Brief.</a></li>
<li><a class="external text" rel="nofollow" href="http://img.jgi.doe.gov/m">IMG/M</a> The Integrated Microbial Genomes system, for metagenome analysis by the DOE-JGI.</li>
<li><a class="external text" rel="nofollow" href="http://camera.calit2.net/index.php">CAMERA</a> Cyberinfrastructure for Metagenomics, data repository and tools for metagenomics research.</li>
<li><a class="external text" rel="nofollow" href="http://www.scq.ubc.ca/?p=509">A good overview of metagenomics from the Science Creative Quarterly</a></li>
<li><a class="external text" rel="nofollow" href="http://www.genomesonline.org/gold.cgi?want=Metagenomes">list of Metagenome Projects from genomesonline.org</a></li>
<li><a class="external text" rel="nofollow" href="http://metagenomics.nmpdr.org/">MG-RAST</a> publicly available, free, metagenomics annotation pipeline and repository for pyrosequences, Sanger sequences, and other sequence approaches.</li>
<li><a class="mw-redirect" title="Human microbiome project" href="http://en.wikipedia.org/wiki/Human_microbiome_project">Human microbiome project</a></li>
<li><a class="external text" rel="nofollow" href="http://www.metahit.eu/">MetaHIT</a> official website for the EU-funded project : Metagenomics of the Human Intestinal Tract</li>
<li><a class="external text" rel="nofollow" href="http://annotathon.univ-mrs.fr/">Annotathon</a> Bioinformatics Training Through Metagenomic Sequence Annotation</li>
<li><a class="external text" rel="nofollow" href="http://www.highveld.com/pages/metagenomics.html">Metagenomics</a> Metagenomics research and applications</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://bioinformatics.whatheblog.com/2009/10/metagenomics-resources/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Converting between Unix and Windows text files?</title>
		<link>http://bioinformatics.whatheblog.com/2009/04/how-do-i-convert-between-unix-and-windows-text-files/</link>
		<comments>http://bioinformatics.whatheblog.com/2009/04/how-do-i-convert-between-unix-and-windows-text-files/#comments</comments>
		<pubDate>Sat, 11 Apr 2009 22:09:48 +0000</pubDate>
		<dc:creator>Abbas</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Technical]]></category>
		<category><![CDATA[converting]]></category>
		<category><![CDATA[mac]]></category>
		<category><![CDATA[newline]]></category>
		<category><![CDATA[unix]]></category>
		<category><![CDATA[windows]]></category>

		<guid isPermaLink="false">http://bioinformatics.whatheblog.com/?p=34</guid>
		<description><![CDATA[The format of Windows and Unix text files differs slightly. In Windows, lines end with both the line feed and carriage return ASCII characters, but Unix uses only a line feed.  As a consequence, some Windows applications will not show the line breaks in Unix-format files.  Likewise, Unix programs may display the carriage [...]]]></description>
			<content:encoded><![CDATA[<p>The format of Windows and <a href="http://kb.iu.edu/data/agat.html">Unix</a> text files differs slightly. In Windows, lines end with both the line feed and carriage return <a href="http://kb.iu.edu/data/afht.html">ASCII</a> characters, but Unix uses only a line feed.  As a consequence, some Windows applications will not show the line breaks in Unix-format files.  Likewise, Unix programs may display the carriage returns in Windows text files with <code>Ctrl-m</code> (<code> ^M </code>) characters at the end of each line.</p>
<p>There are many ways to solve this problem. This document provides instructions for using <a href="http://kb.iu.edu/data/aerg.html">FTP</a>, screen capture, <a href="http://kb.iu.edu/data/acux.html">unix2dos</a> and <a href="http://kb.iu.edu/data/acux.html">dos2unix</a>, <code>tr</code>, <a href="http://kb.iu.edu/data/afja.html">awk</a>, <a href="http://kb.iu.edu/data/afhp.html">Perl</a>, and <a href="http://kb.iu.edu/data/adxz.html">vi</a> to do the conversion.  Before you use these utilities, the files you are converting must first be on a Unix computer.</p>
<p><strong>Note:</strong> In the instructions below, replace <code>unixfile.txt</code> with the name of the Unix file you are transferring, and replace <code>winfile.txt</code> with the name of the Windows file you are transferring.</p>
<h3>FTP</h3>
<p>When using an FTP program to move a text file between Unix and Windows, be sure the file is transferred in <a href="http://kb.iu.edu/data/afht.html">ASCII</a> format. This will ensure that the document is transformed into a text format appropriate for the host.  Some FTP programs, especially graphical applications like Hummingbird FTP, do this automatically.  If you are using FTP from the command line, however, before you begin the file transfer, be sure to enter at the FTP prompt:</p>
<p><span class="example"> ascii</span></p>
<p><strong>Note:</strong> You need to use a client that supports secure FTP to transfer files to and from Indiana University&#8217;s central systems. For more, see <a href="http://kb.iu.edu/data/ahjh.html">At IU, what SSH/SFTP clients are supported and where can I get them?</a></p>
<h3>Screen capture</h3>
<p>You can also convert files from Unix to Windows format when transferring them to a PC with a communications program by selecting ASCII text download.  Select this option with your communications program to capture all the text subsequently displayed to your screen, and then enter at the Unix prompt:</p>
<p><span class="example"> cat unixfile.txt</span></p>
<p>Most communications programs will add carriage returns to the stream of text as they save it to your computer&#8217;s hard drive.  Once the file has finished displaying, abort the text download.</p>
<p><strong>Note:</strong> This method may be slow for large text files. Also, no error checking is performed on the file as it is transferred.</p>
<h3><code>dos2unix</code> and <code>unix2dos</code></h3>
<p>On systems using <a href="http://kb.iu.edu/data/agjq.html">Solaris</a>, the utilities <code>dos2unix</code> and <code>unix2dos</code> are available.  These utilities provide a straightforward method for converting files from the Unix command line.</p>
<p>To use either command, simply type the command followed by the name of the file you wish to convert, and the name of a file which will contain the converted results.  Thus, to convert a Windows file to a Unix file, at the Unix prompt, enter:</p>
<p><span class="example"> dos2unix winfile.txt unixfile.txt</span></p>
<p><span id="more-34"></span></p>
<p>To convert a Unix file to Windows, enter:</p>
<p><span class="example"> unix2dos unixfile.txt winfile.txt</span></p>
<p><strong>Note:</strong> These utilities are available only on Solaris systems.  To determine what variety of Unix is running on your computer, see <a href="http://kb.iu.edu/data/aaya.html">In Unix, how can I display information about the operating system?</a></p>
<h3><code>tr</code></h3>
<p>You can use <code>tr</code> to remove all carriage returns and <code>Ctrl-z</code> (<code> ^Z </code>) characters from a Windows file by entering:</p>
<p><span class="example"> tr -d '\15\32' &lt; winfile.txt &gt; unixfile.txt</span></p>
<p>You cannot use <code>tr</code> to convert a document from Unix format to Windows.</p>
<h3><code>awk</code></h3>
<p>To use <a href="http://kb.iu.edu/data/afja.html">awk</a> to convert a Windows file to Unix, at the Unix prompt, enter:</p>
<p><span class="example"> awk '{ sub("\r$", ""); print }' winfile.txt &gt; unixfile.txt</span></p>
<p>To convert a Unix file to Windows using <code>awk</code>, at the command line, enter:</p>
<p><span class="example"> awk 'sub("$", "\r")' unixfile.txt &gt; winfile.txt</span></p>
<p>On some systems, the version of <code>awk</code> may be old and not include the function <code>sub</code>.  If so, try the same command, but with <code>gawk</code> or <code>nawk</code> replacing <code>awk</code>.</p>
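<p>A minimal sketch for checking both <code>awk</code> conversions without touching real files. It assumes a two-line sample string instead of <code>winfile.txt</code>, and the shell variable names are illustrative. It counts the carriage returns left after each direction of the conversion.</p>

```shell
# Two-line sample in Windows (CRLF) format; the text is illustrative.
unix_text=$(printf 'one\r\ntwo\r\n' | awk '{ sub("\r$", ""); print }')

# Count the carriage returns that survive the Windows-to-Unix conversion.
cr_left=$(printf '%s\n' "$unix_text" | tr -cd '\r' | wc -c | tr -d ' ')

# Convert back to Windows format and count the carriage returns again.
cr_back=$(printf '%s\n' "$unix_text" | awk 'sub("$", "\r")' | tr -cd '\r' | wc -c | tr -d ' ')

echo "after Windows-to-Unix: $cr_left, after Unix-to-Windows: $cr_back"
```

<p>A count of 0 after the first conversion and one carriage return per line after the second suggests both commands behave as described.</p>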
<h3>Perl</h3>
<p>To convert a Windows text file to a Unix text file using <a href="http://kb.iu.edu/data/afhp.html">Perl</a>, at the Unix <a href="http://kb.iu.edu/data/agvf.html">shell</a> prompt, enter:</p>
<p><span class="example"> perl -p -e 's/\r$//' &lt; winfile.txt &gt; unixfile.txt</span></p>
<p>To convert from a Unix text file to a Windows text file with Perl, at the Unix shell prompt, enter:</p>
<p><span class="example"> perl -p -e 's/\n/\r\n/' &lt; unixfile.txt &gt; winfile.txt</span></p>
<p>You must use single quotation marks in either command line.  This prevents your shell from trying to evaluate anything inside.  Perl is installed on all <a href="http://kb.iu.edu/data/ahaw.html">UITS</a> shared central Unix systems.</p>
<h3>vi</h3>
<p>In <a href="http://kb.iu.edu/data/adxz.html">vi</a>, you can remove the carriage return ( <code>^M</code> ) characters with the following command:</p>
<p><span class="example"> :1,$s/^M//g</span></p>
<p><strong>Note:</strong> To input the <code>^M</code> character, press <code>Ctrl-v </code>, then press <code>Enter</code> or <code>return</code>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bioinformatics.whatheblog.com/2009/04/how-do-i-convert-between-unix-and-windows-text-files/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Linux: Who&#8217;s on the server???</title>
		<link>http://bioinformatics.whatheblog.com/2009/03/linux-whos-on-the-server/</link>
		<comments>http://bioinformatics.whatheblog.com/2009/03/linux-whos-on-the-server/#comments</comments>
		<pubDate>Thu, 19 Mar 2009 15:19:16 +0000</pubDate>
		<dc:creator>Eli Roberson</dc:creator>
				<category><![CDATA[Science]]></category>
		<category><![CDATA[Technical]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[ssh]]></category>

		<guid isPermaLink="false">http://bioinformatics.whatheblog.com/?p=33</guid>
		<description><![CDATA[Linux? You geeks use Linux?
If you work in science, and you work on big datasets (such as analyzing next generation sequencing data), chances are that you use Linux for some of your work. I frequent several of our lab&#8217;s Red Hat servers for data analysis and code development purposes. However, these aren&#8217;t just my servers [...]]]></description>
			<content:encoded><![CDATA[<h2>Linux? You geeks use Linux?</h2>
<p>If you work in science, and you work on big datasets (such as <a href="http://bioinformatics.whatheblog.com/2009/02/next-gen-tools/">analyzing</a> <a href="http://solid.appliedbiosystems.com">next</a> <a href="http://www.454.com">generation</a> <a href="http://www.illumina.com/pages.ilmn?ID=203">sequencing</a> data), chances are that you use <a href="http://en.wikipedia.org/wiki/Linux">Linux</a> for some of your work. I frequent several of our lab&#8217;s Red Hat servers for data analysis and code development purposes. However, these aren&#8217;t just my servers to use. Other lab members and, depending on the server, IT staff use them too. I try to remember to check and see who is on and what they&#8217;re running before getting too involved with something that&#8217;s going to hog memory or processor time. But, of course, I don&#8217;t always remember.</p>
<p>I decided to automate this process to take the remembering part out. By adding in a shell script + some code in my profile file, my ssh login immediately displays relevant information without having to invoke it manually.</p>
<h2>Shell Script</h2>
<p>The code is based on the Bash shell, so it may or may not apply to your ssh login. I keep the shell script in my /home/user directory with the name &#8220;.greeting.sh&#8221;. Adding the leading period just makes it invisible to standard &#8220;ls&#8221; queries so it doesn&#8217;t add to the clutter in my home directory. The code for the &#8220;.greeting.sh&#8221; follows between the lines of # signs:</p>
<p>##################################################<br />
#!/bin/bash</p>
<p>UNAME=`whoami`<br />
TIME=`date`<br />
HOST=`hostname`<br />
UCNT=`users | wc -w`<br />
ULST=`users`<br />
PROC=`ps aux|awk 'NR &gt; 1 { s += $3 }; END {printf("%d\n", s + 0.5);}'`<br />
MPCT=`free | grep Mem | awk '{printf("%d\n", $3 / $2 * 100 + 0.5);}'`<br />
MYSHELL=`echo $SHELL`</p>
<p>echo<br />
echo "$TIME"<br />
echo "Shell: $MYSHELL"<br />
echo "Hello $UNAME! Welcome to $HOST!"</p>
<p>if [ $UCNT -ge 2 ]<br />
then<br />
echo "$UCNT users are currently logged into $HOST:"<br />
echo "$ULST"<br />
else<br />
echo "No other users currently logged in."<br />
fi</p>
<p>echo "System Status:"</p>
<p>if [ $PROC -ge 80 ]<br />
then<br />
echo "High processor usage at ${PROC}%"<br />
elif [ $PROC -ge 50 ]<br />
then<br />
echo "Medium processor usage at ${PROC}%"<br />
else<br />
echo "Low processor usage at ${PROC}%"<br />
fi</p>
<p>if [ $MPCT -ge 80 ]<br />
then<br />
echo "High memory usage at ${MPCT}%"<br />
elif [ $MPCT -ge 50 ]<br />
then<br />
echo "Medium memory usage at ${MPCT}%"<br />
else<br />
echo "Low memory usage at ${MPCT}%"<br />
fi</p>
<p>echo</p>
<p>exit 0<br />
##################################################</p>
<p>The code above prints the following at login: the date, my current shell, a greeting with the hostname, whether other users are logged in (and the list of users if others are on), and information about current processor and memory usage. I customize this script depending on the primary use of the server. If you have a server that should always be running a certain program, add a line that looks for that program. If it were called &#8220;myprogram&#8221; you could add the following line to the script:</p>
<p>PROG=`ps aux | grep -v grep | grep myprogram | wc -l`</p>
<p>PROG then holds the number of matching processes: 1 if a single instance is running, more if there are several, and 0 if the program isn&#8217;t running. A later test of whether $PROG -ge 1 can then print a message saying whether the program is running.</p>
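<p>A minimal sketch of such a test, in the same style as the script above. The function name <code>program_status</code> is a hypothetical helper, not part of the original script, and &#8220;myprogram&#8221; remains a placeholder for whatever the server should be running.</p>

```shell
# Hypothetical helper: report whether a named program has running instances.
# Arguments: the program name and the instance count found for it.
program_status() {
    name=$1
    count=$2
    if [ "$count" -ge 1 ]
    then
        echo "$name is running ($count instance(s))"
    else
        echo "$name is NOT running"
    fi
}

# "myprogram" is a placeholder; PROG is counted exactly as in the line above.
PROG=`ps aux | grep -v grep | grep myprogram | wc -l`

# Left unquoted on purpose so any padding spaces from wc are dropped.
program_status myprogram $PROG
```

<p>Keeping the message logic in a function makes it easy to repeat the check for several programs by calling it once per name.</p>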
<p>Take note! Don&#8217;t forget to alter the permissions on the script to allow execution, using something like &#8220;chmod +x .greeting.sh&#8221;. Also note that the variables are defined using backticks (same key as the ~ on standard US QWERTY keyboards), not single quotes.</p>
<h2>Automatically running</h2>
<p>The script isn&#8217;t much use if you have to run it manually (if I remembered to do that, why would I need a script?), so I like to set the script to run automatically immediately following an ssh login. As I said before, I use Bash on most of the Linux servers I use. For this shell, there is a file called &#8220;.bash_profile&#8221; in the home directory of each user. This profile file is executed by every Bash login shell, which includes interactive ssh logins, to set some common environment variables, like PATH. By adding in code to run the greeting script, the output from the script will be displayed immediately after login. Example code to add to the bottom of your profile file:</p>
<p>##################################################<br />
if [ -e "/home/user/.greeting.sh" ]<br />
then<br />
/home/user/.greeting.sh<br />
fi<br />
##################################################</p>
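<p>A slightly more defensive variant is sketched below. It assumes the script lives in your own home directory, so it uses $HOME instead of a hard-coded /home/user path, and it tests with -x (exists and is executable) instead of -e (merely exists). The helper name run_if_executable is hypothetical:</p>

```shell
#!/bin/sh
# Run a script only if it exists and is executable; otherwise do nothing.
run_if_executable() {
    if [ -x "$1" ]; then
        "$1"
    fi
}

# Assumes the greeting script sits in the home directory of the current user.
run_if_executable "$HOME/.greeting.sh"
```

<p>With the -x test, a forgotten chmod simply means no greeting at login instead of a &#8220;Permission denied&#8221; error.</p>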
<p>That&#8217;s all there is to it. A simple but powerful script to automatically give you information on server login. Feel free to adapt it to your system and purpose.</p>
]]></content:encoded>
			<wfw:commentRss>http://bioinformatics.whatheblog.com/2009/03/linux-whos-on-the-server/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Next Generation Seq Tools</title>
		<link>http://bioinformatics.whatheblog.com/2009/02/next-gen-tools/</link>
		<comments>http://bioinformatics.whatheblog.com/2009/02/next-gen-tools/#comments</comments>
		<pubDate>Wed, 25 Feb 2009 07:46:50 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Technical]]></category>
		<category><![CDATA[next-gen]]></category>
		<category><![CDATA[sequencing]]></category>

		<guid isPermaLink="false">http://bioinformatics.whatheblog.com/?p=31</guid>
		<description><![CDATA[Something I came across.
Integrated solutions
* CLCbio Genomics Workbench &#8211; de novo and reference assembly of Sanger, 454, Solexa, Helicos, and SOLiD data. Commercial next-gen-seq software that extends the CLCbio Main Workbench software. Includes SNP detection, browser and other features. Runs on Windows, Mac OS X and Linux.
* NextGENe &#8211; de novo and reference assembly of [...]]]></description>
			<content:encoded><![CDATA[<p>Something I came across.</p>
<p>Integrated solutions<br />
* CLCbio Genomics Workbench &#8211; de novo and reference assembly of Sanger, 454, Solexa, Helicos, and SOLiD data. Commercial next-gen-seq software that extends the CLCbio Main Workbench software. Includes SNP detection, browser and other features. Runs on Windows, Mac OS X and Linux.</p>
<p>* NextGENe &#8211; de novo and reference assembly of Illumina and SOLiD data. Uses a novel Condensation Assembly Tool approach where reads are joined via &#8220;anchors&#8221; into mini-contigs before assembly. Requires Win or MacOS.</p>
<p>* SeqMan Genome Analyser &#8211; Software for Next Generation sequence assembly of Illumina, 454 Life Sciences and Sanger data integrating with Lasergene Sequence Analysis software for additional analysis and visualization capabilities. Can use a hybrid templated/de novo approach. Early release commercial software. Compatible with Windows® XP X64 and Mac OS X 10.4.</p>
<p><span id="more-31"></span></p>
<p>Align/Assemble to a reference</p>
<p>* Bowtie &#8211; Ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of 25 million reads per hour on a typical workstation with 2 gigabytes of memory. Link to discussion thread here. Written by Ben Langmead and Cole Trapnell.</p>
<p>* ELAND &#8211; Efficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genome. Written by Illumina author Anthony J. Cox for the Solexa 1G machine.</p>
<p>* EULER &#8211; Short read assembly. By Mark J. Chaisson and Pavel A. Pevzner from UCSD (published in Genome Research).</p>
<p>* Exonerate &#8211; Various forms of alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors are Guy St C Slater and Ewan Birney from EMBL. C for POSIX.</p>
<p>* GMAP &#8211; GMAP (Genomic Mapping and Alignment Program) for mRNA and EST Sequences. Developed by Thomas Wu and Colin Watanabe at Genentech. C/Perl for Unix.</p>
<p>* MOSAIK &#8211; Reference guided aligner/assembler. Written by Michael Strömberg at Boston College.</p>
<p>* MAQ &#8211; Mapping and Assembly with Qualities (renamed from MAPASS2). Particularly designed for Illumina-Solexa 1G Genetic Analyzer, and has preliminary functions to handle ABI SOLiD data. Written by Heng Li from the Sanger Centre.</p>
<p>* MUMmer &#8211; MUMmer is a modular system for the rapid whole genome alignment of finished or draft sequence. Released as a package providing an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. Version 3.0 was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu and Steven L Salzberg &#8211; most of whom are at The Institute for Genomic Research in Maryland, USA. POSIX OS required.</p>
<p>* Novocraft &#8211; Tools for reference alignment of paired-end and single-end Illumina reads. Uses a Needleman-Wunsch algorithm. Available free for evaluation, educational use and for use on open not-for-profit projects. Requires Linux or Mac OS X.</p>
<p>* RMAP &#8211; Assembles 20 &#8211; 64 bp Solexa reads to a FASTA reference genome. By Andrew D. Smith and Zhenyu Xuan at CSHL. (published in BMC Bioinformatics). POSIX OS required.</p>
<p>* SeqMap &#8211; Works like ELAND, can do 3 or more bp mismatches and also INDELs. Written by Hui Jiang from the Wong lab at Stanford. Builds available for most operating systems.</p>
<p>* SHRiMP &#8211; Assembles to a reference sequence. Developed with Applied Biosystem&#8217;s colourspace genomic representation in mind. Authors are Michael Brudno and Stephen Rumble at the University of Toronto.</p>
<p>* Slider &#8211; An application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences. Authors are from BCGSC. Paper is here.</p>
<p>* SOAP &#8211; SOAP (Short Oligonucleotide Alignment Program). A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. Author is Ruiqiang Li at the Beijing Genomics Institute. C++ for Unix.</p>
<p>* SSAHA &#8211; SSAHA (Sequence Search and Alignment by Hashing Algorithm) is a tool for rapidly finding near exact matches in DNA or protein databases using a hash table. Developed at the Sanger Centre by Zemin Ning, Anthony Cox and James Mullikin. C++ for Linux/Alpha.</p>
<p>* SXOligoSearch &#8211; SXOligoSearch is a commercial platform offered by the Malaysian based Synamatix. Will align Illumina reads against a range of Refseq RNA or NCBI genome builds for a number of organisms. Web Portal. OS independent.</p>
<p>de novo Align/Assemble<br />
* MIRA2 &#8211; MIRA (Mimicking Intelligent Read Assembly) is able to perform true hybrid de-novo assemblies using reads gathered through 454 sequencing technology (GS20 or GS FLX). Compatible with 454, Solexa and Sanger data. Linux OS required.</p>
<p>* SHARCGS &#8211; De novo assembly of short reads. Authors are Dohm JC, Lottaz C, Borodina T and Himmelbauer H. from the Max-Planck-Institute for Molecular Genetics.</p>
<p>* SSAKE &#8211; Version 2.0 of SSAKE (23 Oct 2007) can now handle error-rich sequences. Authors are René Warren, Granger Sutton, Steven Jones and Robert Holt from Canada&#8217;s Michael Smith Genome Sciences Centre. Perl/Linux.</p>
<p>* VCAKE &#8211; De novo assembly of short reads with robust error correction. An improvement on early versions of SSAKE.</p>
<p>* Velvet &#8211; Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454. Need about 20-25X coverage and paired reads. Developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI).</p>
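<p>As a rough way to check a data set against the 20-25X coverage figure quoted for Velvet, average coverage can be estimated as (number of reads x read length) / genome size. The sketch below is only an illustration; the function name and the example numbers are hypothetical:</p>

```shell
#!/bin/sh
# Estimate average coverage: (number of reads * read length) / genome size.
# Arguments: number of reads, read length in bp, genome size in bp.
estimate_coverage() {
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.1f\n", n * len / g }'
}

# Hypothetical run: 4 million 36 bp reads against a 5 Mb genome.
estimate_coverage 4000000 36 5000000
```

<p>For these made-up numbers the estimate is 28.8X, which would sit just above the quoted range.</p>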
<p>SNP/Indel Discovery<br />
* ssahaSNP &#8211; ssahaSNP is a polymorphism detection tool. It detects homozygous SNPs and indels by aligning shotgun reads to the finished genome sequence. Highly repetitive elements are filtered out by ignoring those kmer words with high occurrence numbers. More tuned for ABI Sanger reads. Developers are Adam Spargo and Zemin Ning from the Sanger Centre. Compaq Alpha, Linux-64, Linux-32, Solaris and Mac.</p>
<p>* PolyBayesShort &#8211; A re-incarnation of the PolyBayes SNP discovery tool developed by Gabor Marth at Washington University. This version is specifically optimized for the analysis of large numbers (millions) of high-throughput next-generation sequencer reads, aligned to whole chromosomes of model organism or mammalian genomes. Developers at Boston College. Linux-64 and Linux-32.</p>
<p>* PyroBayes &#8211; PyroBayes is a novel base caller for pyrosequences from the 454 Life Sciences sequencing machines. It was designed to assign more accurate base quality estimates to the 454 pyrosequences. Developers at Boston College.</p>
<p>Genome Annotation/Genome Browser/Alignment Viewer/Assembly Database<br />
* STADEN &#8211; Includes GAP4. GAP5 once completed will handle next-gen sequencing data. A partially implemented test version is available here<br />
* EagleView &#8211; An information-rich genome assembler viewer. EagleView can display a dozen different types of information including base quality and flowgram signal. Developers at Boston College.</p>
<p>* XMatchView &#8211; A visual tool for analyzing cross_match alignments. Developed by Rene Warren and Steven Jones at Canada&#8217;s Michael Smith Genome Sciences Centre. Python/Win or Linux.</p>
<p>* SAM &#8211; Sequence Assembly Manager. Whole Genome Assembly (WGA) Management and Visualization Tool. It provides a generic platform for manipulating, analyzing and viewing WGA data, regardless of input type. Developers are Rene Warren, Yaron Butterfield, Asim Siddiqui and Steven Jones at Canada&#8217;s Michael Smith Genome Sciences Centre. MySQL backend and Perl-CGI web-based frontend/Linux.</p>
<p>ChIP-Seq/BS-Seq<br />
* FindPeaks &#8211; Performs analysis of ChIP-Seq experiments. It uses a naive algorithm for identifying regions of high coverage, which represent Chromatin Immunoprecipitation enrichment of sequence fragments, indicating the location of a bound protein of interest. Original algorithm by Matthew Bainbridge, in collaboration with Gordon Robertson. Current code and implementation by Anthony Fejes. Authors are from Canada&#8217;s Michael Smith Genome Sciences Centre. Java/OS independent. Latest versions available as part of the Vancouver Short Read Analysis Package.</p>
<p>* CHiPSeq &#8211; Program used by Johnson et al. (2007) in their Science publication</p>
<p>* BS-Seq &#8211; The source code and data for the &#8220;Shotgun Bisulphite Sequencing of the Arabidopsis Genome Reveals DNA Methylation Patterning&#8221; Nature paper by Cokus et al. (Steve Jacobsen&#8217;s lab at UCLA). POSIX.</p>
<p>* SISSRs &#8211; Site Identification from Short Sequence Reads. BED file input. Raja Jothi @ NIH. Perl.</p>
<p>* QuEST &#8211; Quantitative Enrichment of Sequence Tags. Sidow and Myers Labs at Stanford. From the 2008 publication Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. (C++)</p>
<p>Alternate Base Calling<br />
* Rolexa &#8211; R-based framework for base calling of Solexa data. Project publication</p>
<p>* Alta-cyclic &#8211; &#8220;a novel Illumina Genome-Analyzer (Solexa) base caller&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://bioinformatics.whatheblog.com/2009/02/next-gen-tools/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Bioinformatics Tool Chest: Why You Should Be Using Firefox</title>
		<link>http://bioinformatics.whatheblog.com/2009/02/bioinformatics-tool-chest-why-you-should-be-using-firefox/</link>
		<comments>http://bioinformatics.whatheblog.com/2009/02/bioinformatics-tool-chest-why-you-should-be-using-firefox/#comments</comments>
		<pubDate>Sat, 07 Feb 2009 04:05:48 +0000</pubDate>
		<dc:creator>Eli Roberson</dc:creator>
				<category><![CDATA[Science]]></category>
		<category><![CDATA[Technical]]></category>

		<guid isPermaLink="false">http://bioinformatics.whatheblog.com/?p=30</guid>
		<description><![CDATA[Firefox?!?!
I know what you&#8217;re thinking. &#8220;Come on. A browser? As a bioinformatics tool?&#8221; You might actually be surprised. I think that most people who do research spend at least some amount of time online trying to track down information. Maybe it&#8217;s a protein name, or DNA elements in a chromosome segment. Maybe it&#8217;s a certain paper [...]]]></description>
			<content:encoded><![CDATA[<h3>Firefox?!?!</h3>
<p>I know what you&#8217;re thinking. &#8220;Come on. A browser? As a bioinformatics tool?&#8221; You might actually be surprised. I think that most people who do research spend at least some amount of time online trying to track down information. Maybe it&#8217;s a protein name, or DNA elements in a chromosome segment. Maybe it&#8217;s a certain paper or topic through PubMed. Personally, I spend a good amount of time searching out answers. Furthermore, I switch between databases and websites in different tabs to get information from different sources. Could there be a way to search faster?</p>
<h3>Keyword Search To The Rescue!</h3>
<p>Luckily, there is a faster way: the keyword search. Basically the keyword search will allow you to make a bookmark shortcut to any search box using a keyword. Once a keyword search has been saved that particular search can be invoked with just the keyword. I frequently use the UCSC Genome Browser for research, so I&#8217;ll use this as an example.</p>
<h3>How To</h3>
<ol>
<li>Navigate to the <a href="http://genome.ucsc.edu">UCSC Genome Browser</a> main page.</li>
<li>In the top navigation panel click &#8220;Genomes&#8221;</li>
<li>The default page should be the Human genome browser. If you are interested in a different organism you can certainly change it using the drop-down boxes. There should be an input box labeled &#8220;position or search term&#8221;. Right click in the box.</li>
<li>In the pop-up menu select &#8220;Add a Keyword for This Search&#8230;&#8221;. An &#8220;Add Bookmark&#8221; window will appear.</li>
<li>In the &#8220;Name&#8221; box type a descriptive name. In this case use &#8220;UCSC Human Search&#8221;.</li>
<li>In the &#8220;Keyword&#8221; box type the keyword you want to use. In this case use &#8220;ucsc&#8221;.</li>
<li>Press the &#8220;Add&#8221; button to save this search.</li>
</ol>
<p>Let&#8217;s test the keyword. Open a new blank Firefox tab by pressing CTRL+T or File -&gt; New Tab. In the address bar type &#8220;ucsc MECP2&#8221; and press enter. The &#8220;ucsc&#8221; keyword triggers the query &#8220;MECP2&#8221; to be run through the search box we saved. After a few seconds a window for the UCSC browser should appear listing possible genes matching the symbol MECP2. If you had navigated to the UCSC Browser and typed MECP2 directly in the search box the results would have been the same.</p>
<p>What about direct chromosome positions? Let&#8217;s try it. Clear the text from the URL bar, type &#8220;ucsc chr1:1-20000000&#8221;, and press enter. The page should change to show the first 20,000,000 base pairs of chromosome 1.</p>
<p>What other uses could it have? What about a &#8220;pubmed&#8221; keyword search? Or an Ensembl search? It can be particularly powerful if you combine these searches. If you were researching Rett Syndrome, you could in one tab search for &#8220;pubmed Rett Syndrome&#8221;. After reading a few papers and finding information on MECP2 in Rett Syndrome all you have to do is hit CTRL+T to open another tab. Then type &#8220;ucsc MECP2&#8221; to find it in the genome browser. If you had a saved search for the NCBI Protein database you could go even further by opening yet another tab and typing &#8220;protein MECP2_HUMAN&#8221; (assuming your keyword was protein). The result would be a page about the MECP2 protein in humans where you could get the amino acid sequence. Your specific search set would depend on what databases you search most frequently in your research.</p>
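<p>Under the hood, a keyword bookmark is just a URL with a %s placeholder that Firefox swaps for whatever you type after the keyword. The sketch below imitates that substitution in a small shell script. The two URL templates and the function names are illustrative assumptions on my part, so check the URL that Firefox actually saved in your own bookmark:</p>

```shell
#!/bin/sh
# Sketch of how "keyword query" expands into a URL.
# The URL templates are illustrative assumptions, not taken from Firefox.

# Print the URL template saved under a keyword (%s marks the query slot).
template_for() {
    case "$1" in
        ucsc)   printf '%s\n' 'http://genome.ucsc.edu/cgi-bin/hgTracks?position=%s' ;;
        pubmed) printf '%s\n' 'http://www.ncbi.nlm.nih.gov/pubmed?term=%s' ;;
        *)      return 1 ;;
    esac
}

# Expand "keyword query words" into the final URL.
# The body runs in a subshell so its variables stay private.
expand_keyword() (
    keyword=$1
    shift
    # Crude encoding: join the remaining words and turn spaces into "+".
    query=$(printf '%s' "$*" | tr ' ' '+')
    template=$(template_for "$keyword") || {
        echo "unknown keyword: $keyword" >&2
        exit 1
    }
    prefix=${template%%'%s'*}   # text before the placeholder
    suffix=${template#*'%s'}    # text after the placeholder
    printf '%s\n' "${prefix}${query}${suffix}"
)

expand_keyword ucsc MECP2
expand_keyword pubmed Rett Syndrome
```

<p>Adding another database is then one more line in the case statement, which mirrors saving one more keyword bookmark in the browser.</p>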
<p>This kind of time savings can really add up. Plus you can show off your cool new hack to friends when they&#8217;re trying to search for something.</p>
]]></content:encoded>
			<wfw:commentRss>http://bioinformatics.whatheblog.com/2009/02/bioinformatics-tool-chest-why-you-should-be-using-firefox/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Exporting Vector NTI Data &#8212; The Hail Mary</title>
		<link>http://bioinformatics.whatheblog.com/2009/01/exporting-vector-nti-data-the-hail-mary/</link>
		<comments>http://bioinformatics.whatheblog.com/2009/01/exporting-vector-nti-data-the-hail-mary/#comments</comments>
		<pubDate>Thu, 29 Jan 2009 00:09:59 +0000</pubDate>
		<dc:creator>Eli Roberson</dc:creator>
				<category><![CDATA[Science]]></category>
		<category><![CDATA[Technical]]></category>

		<guid isPermaLink="false">http://bioinformatics.whatheblog.com/?p=29</guid>
		<description><![CDATA[VNTI Is Dead
The golden age of Vector NTI has ended, and free software licenses are no longer available to academics. This move has been disturbing to many, and support for deactivated licenses hasn&#8217;t been the best so far. But after I sent a plea to the tech support services associated with VNTI, they&#8217;ve come through with [...]]]></description>
			<content:encoded><![CDATA[<h3>VNTI Is Dead</h3>
<p>The <a href="http://bioinformatics.whatheblog.com/?p=27">golden age of Vector NTI</a> has ended, and free software licenses are no longer available to academics. This move has been disturbing to many, and support for deactivated licenses hasn&#8217;t been the best so far. But after I sent a plea to the tech support services associated with VNTI, they&#8217;ve come through with some help.</p>
<p>To answer an oft-asked question, DNA/RNA/Protein sequences CANNOT be exported after a license has expired. I know, I know, bad programming practice and bad PR practice. BUT if your data is locked in you can get a temporary license to export everything. For DNA / RNA molecules you can export into GenBank, EMBL, and FASTA file formats. For protein sequences you can export into GenPept, SWISS-PROT, or Protein FASTA format. File export DOES NOT work for Enzymes, Oligos, Gel Markers, Citations, BLAST Results, or Analysis Results. Those of you with extensive Oligo libraries will want to contact Tech Support directly for assistance in exporting or moving these files. Sorry guys. It may or may not be supported.</p>
<h3>Exporting DNA/RNA Molecules</h3>
<ol>
<li> Open your VNTI Database.</li>
<li> Go to &#8216;DNA/RNA Molecules&#8217; from the drop down box.</li>
<li> Select all the molecules you want to export. For everything, select one molecule and either press CTRL+A or use &#8216;Edit&#8217; -&gt; &#8216;Select All&#8217;.</li>
<li> Go to &#8216;Edit&#8217; -&gt; &#8216;Copy To&#8217; -&gt; &#8216;File&#8230;&#8217;. Make sure to choose the format you want. If you want all three, just repeat the process for each one.</li>
</ol>
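<p>Once the export is done, it is worth checking that nothing was left behind. A minimal sketch, assuming you exported to FASTA and that the file has one &#8220;&gt;&#8221; header line per molecule (the file name below is hypothetical), is to count the headers and compare the number with the molecule count shown in the VNTI database:</p>

```shell
#!/bin/sh
# Count the sequence records in a FASTA file (one ">" header line per record).
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so ignore its status.
    grep -c '^>' "$1" || true
}

# Hypothetical file name; substitute the file you exported.
if [ -f exported_molecules.fasta ]; then
    count_fasta_records exported_molecules.fasta
fi
```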
<h3>Exporting Protein Sequences</h3>
<p>The process is identical to exporting DNA / RNA molecules, except the Protein Molecules library must be used.</p>
<h3>Getting a Temporary License</h3>
<p>To get your temporary license e-mail Technical Support at bioinfosupport[AT]invitrogen.com. In your message just explain that you&#8217;ve been a user of the VNTI free license, but the license expired and you need a temporary one to export all your data.</p>
<p>Now, I&#8217;m glad that Life Sciences / Invitrogen has come through with some help for the community. Do I agree with the change in marketing? No. Do I think the transition was handled gracefully? No. But they could have elected to lock everyone&#8217;s data in permanently, and have instead elected to extend the olive branch. Hope this helps some of you out there with trapped data.</p>
]]></content:encoded>
			<wfw:commentRss>http://bioinformatics.whatheblog.com/2009/01/exporting-vector-nti-data-the-hail-mary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bioinformatics Tool Chest: Bioconductor</title>
		<link>http://bioinformatics.whatheblog.com/2008/07/bioinformatics-tool-chest-bioconductor/</link>
		<comments>http://bioinformatics.whatheblog.com/2008/07/bioinformatics-tool-chest-bioconductor/#comments</comments>
		<pubDate>Sat, 12 Jul 2008 02:48:23 +0000</pubDate>
		<dc:creator>Eli Roberson</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Technical]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://bioinformatics.whatheblog.com/?p=17</guid>
		<description><![CDATA[
Following up on the previous bioinformatics tool chest post, I thought I&#8217;d cover Bioconductor next. Bioconductor is actually an off-shoot of the R-project.
Now hold on, I know what you&#8217;re thinking. &#8220;But you talked about R last time, why do we have to talk about R again?!?&#8221; It&#8217;s simple really. Though Bioconductor is a derivative of [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><img src="http://wiki.fhcrc.org/wiki/bioc/images/bioclogo-small.jpg" alt="Image of Bioconductor Logo" width="99" height="79" /></p>
<p>Following up on the <a href="http://bioinformatics.whatheblog.com/?p=15">previous</a> bioinformatics tool chest post, I thought I&#8217;d cover <a href="http://www.bioconductor.org">Bioconductor</a> next. Bioconductor is actually an off-shoot of the <a href="http://www.r-project.org/">R-project</a>.</p>
<p>Now hold on, I know what you&#8217;re thinking. &#8220;But you talked about R <strong>last</strong> time, why do we have to talk about R again?!?&#8221; It&#8217;s simple really. Though Bioconductor is a derivative of R, its purpose truly is unique enough to deserve its own post.</p>
<p>Bioconductor (or BioC) is an open-source derivative of R focused on facilitating the analysis of genomic data. One might ask, why should I care? If you perform any kind of high-throughput SNP genotyping or gene expression analysis, this software suite gives you immediate access to free, open-source, extremely powerful data analysis options. Got Affymetrix CEL files for expression data? No problem. Bioconductor can load, normalize, analyze, and summarize that data for you. How about SNP genotyping data? Again no problem. Want to check the copy number of your SNP data? You&#8217;ll have several options. Many Bioconductor packages are built using S4 methods and classes (the exact definitions of which are unimportant for this article). The advantage of that coding system is that you can use and extend existing classes to perform your own, custom designed analysis methods. And even better, once you&#8217;ve worked out a new method, you can incorporate it into a package and submit it to Bioconductor for everyone to use!</p>
<p>The bottom line is this: if you need powerful, customizable, freely available analysis software (and who doesn&#8217;t after spending ridiculous amounts of money running many samples on high-throughput technology) then Bioconductor is a viable choice. If you have genomic data give BioC a try, and if it&#8217;s useful to you build your own packages for the whole community.</p>
]]></content:encoded>
			<wfw:commentRss>http://bioinformatics.whatheblog.com/2008/07/bioinformatics-tool-chest-bioconductor/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bioinformatics Tool Chest: R Programming Language</title>
		<link>http://bioinformatics.whatheblog.com/2008/07/bioinformatics-tool-chest-r-programming-language/</link>
		<comments>http://bioinformatics.whatheblog.com/2008/07/bioinformatics-tool-chest-r-programming-language/#comments</comments>
		<pubDate>Thu, 03 Jul 2008 08:17:14 +0000</pubDate>
		<dc:creator>Eli Roberson</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Technical]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://bioinformatics.whatheblog.com/?p=15</guid>
		<description><![CDATA[
Data
Scientists love data. Call it a character flaw, but most of us can&#8217;t get enough. More data, more! But the data alone are just the start. To really be useful, we have to do something with the data. Model. Summarize. Evangelize it. Something. Who hasn&#8217;t needed to plot a standard curve? Or find the mean [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><img src="http://cran.r-project.org/Rlogo.jpg" alt="" width="100" height="76" /></p>
<h2>Data</h2>
<p style="text-align: left;">Scientists love data. Call it a character flaw, but most of us can&#8217;t get enough. More data, more! But the data alone are just the start. To really be useful, we have to <strong>do</strong> something with the data. Model. Summarize. Evangelize it. Something. Who hasn&#8217;t needed to plot a standard curve? Or find the mean value of a series of numbers? What should you do when you have these questions?</p>
<h2>The Problem</h2>
<p style="text-align: left;">Many scientists turn to our friend Excel to solve these problems. It&#8217;s easy to work with, and you can even make graphs easily. That isn&#8217;t necessarily a good thing, as perfectly nice people make <a title="Bad Graphs" href="http://www.biostat.wisc.edu/~kbroman/presentations/graphs_uwpath08_handout.pdf">really bad graphs</a> because those fancy 3D features are so tantalizing. Everyone interested in bioinformatics or computational biology needs a tool in their tool chest that can handle:</p>
<ol>
<li>statistics</li>
<li>figure and graph creation</li>
<li>very large data</li>
</ol>
<h2>The Solution</h2>
<p>Look no further friends, your savior has arrived, and its name is <a title="R Homepage" href="http://cran.r-project.org/">R</a>. R is a free, cross-platform, open-source derivative of the S language. In case you didn&#8217;t catch that last part: <strong>R is free</strong>. You can download R from the nearest mirror to get started.</p>
<h3>The Good</h3>
<ul>
<li>Freely available</li>
<li>Open-source &#8212; can compile it to your needs (OS, cpu, available memory, optimization levels)</li>
<li>Tons of add-on packages</li>
<li>Scriptable</li>
<li>Ability to write your own functions and packages</li>
<li>Able to handle large datasets</li>
<li>Interfaces with compiled languages</li>
<li>Can save plots as PostScript (print quality)</li>
<li>Extensive tutorials online along with mailing lists and archives for troubleshooting</li>
</ul>
<h3>The Bad</h3>
<ul>
<li>Command-line interface</li>
<li>Can be slow reading large files</li>
<li>Interpreted language (can be slower than compiled code)</li>
<li>No tech support line</li>
<li>Steep learning curve for beginners, especially non-programmers</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://bioinformatics.whatheblog.com/2008/07/bioinformatics-tool-chest-r-programming-language/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
