Bio Database Access and Sequence Alignment

with Python and BioPython

Hit space for overview. Navigate with arrow keys.

What's an API?

Lots of Accessible Data

Over 150 Biological APIs listed at ProgrammableWeb
That means even more exist!
Over 20% are JSON APIs, denoting recent growth

Bioinformatics has it good

- Data from Programmable Web

But wait...

Why should I learn to use

APIs

when I can already access the data via the

web?

APIs can help if...

>3
You're doing something more than a few times

You're chaining together steps

You want structured and consistent research results

Tools

"Click" into the tiers!
- Data from Bioinformatics Survey

Our Focus

Python

BioPython

A Brief Introduction

Inspiration for Examples

Searching

from Bio import Entrez
Entrez.email = "bjorgep@students.wwu.edu"
search_handle = Entrez.esearch(db="nucleotide", term="Opuntia[orgn] and rpl16", usehistory="y")
opuntia_rpl16_search_results = Entrez.read(search_handle)
search_handle.close()

View Fullscreen | Opuntia (Wiki) | rpl16 (Wiki) | Search Results | Entrez Technical Docs

Retrieving

file = open("opuntia_rpl16.fasta", "w")
opuntia_rpl16_fasta_data = Entrez.efetch(db="nucleotide",rettype="fasta", retmode="text", webenv=opuntia_rpl16_search_results["WebEnv"], query_key=opuntia_rpl16_search_results["QueryKey"]).read()
file.write(opuntia_rpl16_fasta_data)
file.close()

View Fullscreen | Results File | Equivalent Web Data (Click FASTA for each result)

Functions

from Bio import Entrez
Entrez.email = "bjorgep@students.wwu.edu"

def search_and_retrieve_fasta(database, searchterm, filename, batch_size=3):
  # SEARCH
  search_results = Entrez.read( Entrez.esearch(db=database, term=searchterm, usehistory="y") )
  # RETRIEVE
  file = open(filename, "w")
  count = int(search_results['Count'])
  for start in range(0,count,batch_size):
    end = min(count, start+batch_size)
    print "Going to download record %i to %i" % (start+1, end)
    data = Entrez.efetch(db=database,rettype="fasta", retmode="text", retstart=start, retmax=batch_size, webenv=search_results["WebEnv"], query_key=search_results["QueryKey"]).read()
    file.write(data)
  file.close()

# You can now do all these steps in just one line!
search_and_retrieve_fasta("nucleotide", "Blossfeldia[orgn] and rpl16", "blossfeldia_rpl16.fasta")

View Fullscreen | Blossfeldia | Results File | Equivalent Web Data (Click FASTA for each result)

Parsing

from Bio import SeqIO
blossfeldia_rpl16_sequences = list(SeqIO.parse("blossfeldia_rpl16.fasta", "fasta"))

View Fullscreen | SeqIO Docs

Sequence Objects

first_blossfeldia_rpl16_sequence = blossfeldia_rpl16_sequences[0].seq
# GC %
from Bio.SeqUtils import GC
GC(first_blossfeldia_rpl16_sequence)

# DNA --> RNA --> DNA
first_blossfeldia_rpl16_sequence.transcribe()
first_blossfeldia_rpl16_sequence.back_transcribe()

# DNA Coding Strand --> Protein
first_blossfeldia_rpl16_sequence.translate()

View Fullscreen | Sequence Docs | SeqUtils Docs

Blasting

from Bio.Blast import NCBIWWW
from Bio import SeqIO
first_blossfeldia_rpl16_sequence = blossfeldia_rpl16_sequences[0].seq
result_handle = NCBIWWW.qblast("blastn", "nt", _first_blossfeldia_rpl16_sequence)
save_file = open("blast_search_on_first_blossfeldia_rpl16_sequence.xml", "w")
save_file.write(result_handle.read())
save_file.close()
result_handle.close()

View Fullscreen | XML Results

Sequence Alignment

# Clustal
import os
from Bio.Align.Applications import ClustalwCommandline

clustalw_cmd_line = ClustalwCommandline("clustalw2", infile="opuntia_rpl16.fasta")
stdout, stderr = clustalw_cmd_line()   #outputs two files opuntia_rpl16.aln, opuntia_rpl16.dnd

# Read Multiple Alignment
from Bio import AlignIO
opuntia_rpl16_alignment = AlignIO.read("opuntia_rpl16.aln", "clustal")
print opuntia_rpl16_alignment

# Draw Phylo Tree
from Bio import Phylo
opuntia_rpl16_tree = Phylo.read("opuntia_rpl16.dnd", "newick")
Phylo.draw(opuntia_rpl16_tree)

Do you have what it takes?

Absolutely Yes

Non-CS people know the tools

CS People know how to mashup the tools

Some more Resources

By Philip Bjorge

Put Together With:

Bioinformatics Database Access Presentation by Philip Bjorge is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Based on a work at github.com.

Bio Database Access and Sequence Alignment

with Python and BioPython

Hit space for overview. Navigate with arrow keys.

What's an API?

Lots of Accessible Data

Bioinformatics has it good

But wait...

Why should I learn to use

APIs

when I can already access the data via the

web?

APIs can help if...

>3

Tools

Our Focus

Python

BioPython

A Brief Introduction

Searching

Retrieving

Functions

Parsing

Sequence Objects

Blasting

Sequence Alignment

Do you have what it takes?

Absolutely Yes

Some more Resources

By Philip Bjorge

Put Together With: