<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[bio4j]]></title>
  <link href="http://bio4j.com/atom.xml" rel="self"/>
  <link href="http://bio4j.com/"/>
  <updated>2015-03-23T09:56:09+01:00</updated>
  <id>http://bio4j.com/</id>
  <author>
    <name><![CDATA[oh no sequences!]]></name>
    
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>

  
  <entry>
    <title type="html"><![CDATA[Bio4j preprint available]]></title>
    <link href="http://bio4j.com/blog/2015/03/bio4j-preprint-available/"/>
    <updated>2015-03-22T18:20:00+01:00</updated>
    <id>http://bio4j.com/blog/2015/03/bio4j-preprint-available</id>
    <content type="html"><![CDATA[<p>A citable preprint describing Bio4j went online on <a href="http://biorxiv.org/">bioRxiv</a> yesterday:</p>

<ul>
  <li><strong><a href="http://biorxiv.org/content/early/2015/03/20/016758">Bio4j: a high-performance cloud-enabled graph-based data platform</a></strong></li>
</ul>

<p>It serves (we hope) as a good introduction to what Bio4j is and what it has to offer, especially if, to get a general idea of Bio4j, you would rather read prose than code. If you are using Bio4j for something that you want to publish, citing it is much easier now: all bioRxiv preprints are assigned a DOI. Comments, thoughts, and opinions are all more than welcome! We will submit a paper based on this preprint to an open access journal. For completeness, here’s the citation info and the abstract:</p>

<hr />
<p><br /></p>

<h3 id="bio4j-a-high-performance-cloud-enabled-graph-based-data-platform">Bio4j: a high-performance cloud-enabled graph-based data platform</h3>

<p><em>Pablo Pareja-Tobes, Raquel Tobes, Marina Manrique, Eduardo Pareja, Eduardo Pareja-Tobes</em> <br />
<strong>bioRxiv</strong> – <strong>doi</strong>: <a href="http://dx.doi.org/10.1101/016758">10.1101/016758</a></p>

<!-- ### Abstract -->

<p><strong>Background.</strong> Next Generation Sequencing and other high-throughput technologies have brought a revolution to the bioinformatics landscape, by offering sheer amounts of data about previously unaccessible domains in a cheap and scalable way. However, fast, reproducible, and cost-effective data analysis at such scale remains elusive. A key need for achieving it is being able to access and query the vast amount of publicly available data, specially so in the case of knowledge-intensive, semantically rich data: incredibly valuable information about proteins and their functions, genes, pathways, or all sort of biological knowledge encoded in ontologies remains scattered, semantically and physically fragmented.</p>

<p><strong>Methods and Results.</strong> Guided by this, we have designed and developed Bio4j. It aims to offer a platform for the integration of semantically rich biological data using typed graph models. We have modeled and integrated most publicly available data linked with proteins into a set of interdependent graphs. Data querying is possible through a data model aware Domain Specific Language implemented in Java, letting the user write typed graph traversals over the integrated data. A ready to use cloud-based data distribution, based on the Titan graph database engine is provided; generic data import code can also be used for in-house deployment.</p>

<p><strong>Conclusion.</strong> Bio4j represents a unique resource for the current Bioinformatician, providing at once a solution for several key problems: data integration; expressive, high performance data access; and a cost-effective scalable cloud deployment model.</p>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j: updates]]></title>
    <link href="http://bio4j.com/blog/2015/03/bio4j-updates/"/>
    <updated>2015-03-11T17:18:00+01:00</updated>
    <id>http://bio4j.com/blog/2015/03/bio4j-updates</id>
    <content type="html"><![CDATA[<p>We’ve spent the past few months working <em>really</em> hard on Bio4j. There have not been many updates here, basically because there were too many new things happening :)</p>

<p>But now things are stabilizing and it’s about time we start to introduce all the new features and improvements we have in store. In this first post I just want to give an overview of Bio4j’s current state, going into more detail in subsequent posts.</p>

<h2 id="bio4j-now">Bio4j now</h2>

<h3 id="a-new-graph-schema-and-api">A new graph schema and API</h3>

<p>We now have a strongly typed graph schema and traversal API in <strong><a href="https://github.com/bio4j/bio4j">bio4j/bio4j</a></strong>, based on <strong><a href="https://github.com/bio4j/angulillos">angulillos</a></strong> (more about angulillos later). With it, you can write traversals over Bio4j data abstractly, and then execute them over any implementation. These queries are checked to be correct both structurally (for example, you cannot ask for the source of a vertex) and with respect to the Bio4j schema. Vertices and edges are now part of graphs, which can declare dependencies; writing your own extensions to the model is now much easier than before. As part of these changes we did a thorough graph-by-graph review of the Bio4j model, which resulted in some significant improvements.</p>

<p>Of course a schema is not that useful without actual data conforming to it; we also wrote generic importers for all graphs. These importers can be executed using any implementation of the angulillos API.</p>

<h3 id="a-titan-based-implementation-and-data-distribution">A Titan-based implementation and data distribution</h3>

<p>With much of the work already done at the level of bio4j/bio4j, providing a data distribution of Bio4j becomes pretty simple; you just need to</p>

<ol>
  <li>implement angulillos for your database technology of choice; this is what you have for <a href="http://thinkaurelius.github.io/titan/">Titan</a> in <strong><a href="https://github.com/bio4j/angulillos-titan">angulillos-titan</a></strong>.</li>
  <li>if your database has support for type definitions and schemas, create those corresponding to the Bio4j schema; this is what we do for each graph in <strong><a href="https://github.com/bio4j/bio4j-titan">bio4j-titan</a></strong>.</li>
</ol>

<p>We finished running the import process for all graphs just a few hours ago. A pretty sizable <code>.tar</code> containing all the Titan files is available from an S3 bucket. With that, you just need to spin up an EC2 instance, download and extract the archive, and start using Bio4j. Or, if you don’t want to use AWS, you can of course run the import process on your own infrastructure.</p>

<h3 id="angulillos-generic-typed-property-graphs-in-java">Angulillos: generic typed property graphs in Java</h3>

<p>Writing <em>correct</em> queries for Bio4j was becoming harder and harder as we integrated more databases and resources, and we had no way of expressing the graph schemas, even for documentation purposes. That is what <strong><a href="https://github.com/bio4j/angulillos">angulillos</a></strong> strives to solve. You can think of angulillos as a strongly typed version of the property graph model: first you describe a graph schema in terms of types, and then you can write generic traversals over it, which are guaranteed to be well-typed. This means that for example</p>

<ul>
  <li>you cannot retrieve the outgoing edges of an edge</li>
  <li>and you can get the tweets that a user tweeted, but not the users that a tweet follows!</li>
</ul>

<p>The API is really straightforward to implement, and its only dependency is Java 8 (for Streams and lambdas). <strong><a href="https://github.com/bio4j/angulillos-titan">angulillos-titan</a></strong> can serve as an example of how this can be done.</p>
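<p>To make the typing idea concrete, here is a minimal Java sketch. It is <em>not</em> the angulillos API: the names <code>Vertex</code>, <code>Edge</code>, <code>outE</code>, <code>User</code>, <code>Tweet</code>, <code>Tweeted</code> and <code>Follows</code> are hypothetical, and the sketch only illustrates how generics can let the compiler reject ill-typed traversals.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, NOT the angulillos API: every name here is illustrative.

// A vertex knows its own concrete type V, so its outgoing edges are typed by their source.
abstract class Vertex<V extends Vertex<V>> {
  private final List<Edge<V, ?>> outgoing = new ArrayList<>();

  void addOut(Edge<V, ?> edge) { outgoing.add(edge); }

  // Only edge types whose source is V are accepted by the compiler.
  <E extends Edge<V, ?>> List<E> outE(Class<E> edgeType) {
    return outgoing.stream()
        .filter(edgeType::isInstance)
        .map(edgeType::cast)
        .collect(Collectors.toList());
  }
}

// An edge is not a vertex, so it has no outE method at all.
abstract class Edge<S extends Vertex<S>, T extends Vertex<T>> {
  private final S source;
  private final T target;

  Edge(S source, T target) {
    this.source = source;
    this.target = target;
    source.addOut(this); // register the edge with its source vertex
  }

  S source() { return source; }
  T target() { return target; }
}

final class User extends Vertex<User> {
  final String name;
  User(String name) { this.name = name; }
}

final class Tweet extends Vertex<Tweet> {
  final String text;
  Tweet(String text) { this.text = text; }
}

final class Tweeted extends Edge<User, Tweet> {
  Tweeted(User user, Tweet tweet) { super(user, tweet); }
}

final class Follows extends Edge<User, User> {
  Follows(User follower, User followed) { super(follower, followed); }
}

final class TypedGraphSketch {
  // "The tweets that a user tweeted": well-typed, because Tweeted starts at a User.
  static List<String> tweetsOf(User user) {
    return user.outE(Tweeted.class).stream()
        .map(edge -> edge.target().text)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    User ana = new User("ana");
    User bob = new User("bob");
    new Tweeted(ana, new Tweet("hello graphs"));
    new Follows(ana, bob);
    System.out.println(tweetsOf(ana)); // prints [hello graphs]

    // Both of these would be rejected at compile time:
    // new Tweet("x").outE(Follows.class);          // Follows does not start at a Tweet
    // new Follows(ana, bob).outE(Tweeted.class);   // edges have no outgoing edges
  }
}
```

<p>In this sketch the schema lives entirely in the type parameters, so an invalid traversal never reaches the database.</p>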

<h3 id="the-future">The future</h3>

<p>The next post will be dedicated to a tentative roadmap, explaining what we are working on now. A (really nice) Scala API, data distribution and AWS deployment improvements, and new integrations of genomic data sources are coming in the following months!</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j goes to GSoC mentor summit 2014]]></title>
    <link href="http://bio4j.com/blog/2014/10/bio4j-goes-to-gsoc-mentor-summit-2014/"/>
    <updated>2014-10-29T17:18:00+01:00</updated>
    <id>http://bio4j.com/blog/2014/10/bio4j-goes-to-gsoc-mentor-summit-2014</id>
    <content type="html"><![CDATA[<p><img src="http://bio4j.com/images/bio4jGsoc.png" /></p>

<p>I just got home yesterday from San Francisco after attending the 10th edition of the Google Summer of Code mentor summit together with <a href="https://twitter.com/eparejatobes">@eparejatobes</a>. It&#8217;s been a great experience that I would like to share with you all in this blog post ;)
For those of you who still don&#8217;t know what <a href="https://developers.google.com/open-source/soc/?csw=1">GSoC</a> is, here&#8217;s a quick summary:</p>

<blockquote>
  <p>Google Summer of Code is a program that offers student developers stipends to write code for various open source projects. We work with many open source, free software, and technology-related groups to identify and fund projects over a three month period. </p>
</blockquote>

<p>This was Bio4j&#8217;s first year as a GSoC organization and we got three students who worked on the following projects:</p>

<ul>
  <li><a href="https://github.com/bio4j/dynamograph">dynamograph</a></li>
  <li><a href="https://github.com/bio4j/exporter">exporter</a></li>
  <li><a href="https://github.com/bio4j/el-grafo">el-grafo</a></li>
</ul>

<p>It was also my first experience as a mentor, and I must say that I learned a lot and really enjoyed the process.</p>

<p>The events started on Friday with a complimentary visit to the theme park <em>Great America</em> (nice!), followed by a really cool dinner reception at the <a href="http://www.thetech.org/">San Jose Tech Museum of Innovation</a>, where we had surprise speakers such as Linus Torvalds plus the opportunity to explore the museum&#8217;s geeky exhibits while having some drinks.</p>

<p>We were supposed to dress smart for a change, which was interesting, seeing all these people wearing nice clothes :)</p>

<p><img class="right" src="http://bio4j.com/images/fotoTechMuseum.jpg" width="280" /></p>

<blockquote>
  <p>I must say that I had to watch around 20 minutes of YouTube videos before I managed to get the tie knot right&#8230; xD</p>
</blockquote>

<p>Sessions started early the next day with more than eight simultaneous rooms <em>(without taking into account the impromptu sessions that were organized at the ballroom from time to time)</em> and went on till the evening.</p>

<p>It was the first time that I went to an <strong><a href="http://en.wikipedia.org/wiki/Unconference">unconference</a></strong> and I just loved it. 
It is actually great to have the opportunity to explore the different sessions and meet up with people on the way spontaneously, without all the rigidity that so many times comes with <em>&#8220;standard&#8221;</em> conferences. </p>

<p><img class="left" src="http://bio4j.com/images/stickers.jpg" width="260" /></p>

<p>Meeting people from the <a href="http://www.reactome.org/">Reactome database</a> project in person was cool, since we plan to include this data source in Bio4j in the near future. It was also nice to meet in person some of the people that I&#8217;ve been following on Twitter for a while, like <a href="https://twitter.com/braincode">@braincode</a> among others.
I also liked the idea of having both the sticker exchange table and the tea room filled with chocolates from all over the world! The day ended with a quiz show that I unfortunately couldn&#8217;t join, but I read on Twitter that it was a lot of fun.</p>

<p>On Sunday we opened the day with a trip to the <a href="http://en.wikipedia.org/wiki/Googleplex">Googleplex</a>, where we could see the actual place where the Google folks work.</p>

<p><img class="right" src="http://bio4j.com/images/chocolates.png" width="240" /></p>

<p>There was some time left for a couple more sessions and then we unfortunately had to say bye to all the new acquaintances we made after attending the closing session at the hotel. </p>

<p>I would like to end this post by thanking all the people who helped organize this awesome summit.
Also a special thanks to <a href="https://twitter.com/fossygrl">@fossygrl</a>, great job!</p>

<p>Stay tuned for the next post, we will be releasing a shiny new version of Bio4j based on Titan very soon ;)</p>

<p><img src="http://bio4j.com/images/fotoGoogleAndroid.png" /></p>

<p><a href="https://twitter.com/pablopareja">@pablopareja</a></p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j accepted for Google Summer of Code 2014]]></title>
    <link href="http://bio4j.com/blog/2014/02/bio4j-accepted-for-google-summer-of-code-2014/"/>
    <updated>2014-02-25T17:18:00+01:00</updated>
    <id>http://bio4j.com/blog/2014/02/bio4j-accepted-for-google-summer-of-code-2014</id>
    <content type="html"><![CDATA[<p><img class="right" src="http://bio4j.com/images/GoogleSummer_2014logo.jpg" width="300" height="270" /></p>

<p>We are really excited to announce that <strong>Bio4j</strong> has been <strong>accepted</strong> as a <a href="https://www.google-melange.com/gsoc/org2/google/gsoc2014/bio4j">mentoring organization</a> for <strong><a href="https://www.google-melange.com/gsoc/homepage/google/gsoc2014">Google Summer of Code 2014</a></strong>. This was the first year we applied for it, and it feels just great being part of this initiative!</p>

<p>We think this is a great opportunity for students to hack on pretty cool stuff around graph databases, bio big data and cloud computing.</p>

<h2 id="how-to-participate">How to participate</h2>

<p>If this sounds amazing and you are a student (PhD, masters, undergraduate, <a href="https://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2014/help_page#2._Whos_eligible_to_participate_as_a">whatever</a>) or know someone who is,</p>

<ol>
  <li><strong><a href="https://github.com/bio4j/gsoc14/wiki/ideas">check our ideas list</a></strong> and then</li>
  <li><strong>contact a potential mentor</strong> or, if you don&#8217;t know whom to contact, just reach out to <a href="https://github.com/eparejatobes">@eparejatobes</a> or <a href="https://github.com/pablopareja">@pablopareja</a></li>
</ol>

<p>You can read more about it in the <a href="https://github.com/bio4j/gsoc14/wiki">bio4j/gsoc14 wiki</a>.</p>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Berkeley Phylogenomics Group receives an NSF grant to develop a graph DB for Big Data challenges in genomics building on Bio4j]]></title>
    <link href="http://bio4j.com/blog/2013/11/new-bio4j-success-berkeley-phylogenomics-grant/"/>
    <updated>2013-11-12T12:00:00+01:00</updated>
    <id>http://bio4j.com/blog/2013/11/new-bio4j-success-berkeley-phylogenomics-grant</id>
    <content type="html"><![CDATA[<p>The <a href="http://phylogenomics.berkeley.edu/">Sjölander Lab</a> at the <a href="http://www.berkeley.edu/index.html">University of California, Berkeley</a>, has recently been awarded a <strong>250K</strong> US dollars <em>EAGER</em> grant from the National Science Foundation to build a graph database for Big Data challenges in genomics.  Naturally, <strong>they’re building on Bio4j</strong>.</p>

<p>The project “<strong>EAGER: Towards a self-organizing map and hyper-dimensional information network for the human genome</strong>” aims to create a graph database of genome and proteome data for the human genome and related species to allow biologists and computational biologists to mine the information in gene family trees, biological networks and other graph data that cannot be represented effectively in relational databases. For these goals, they will develop on top of the pioneering graph-based bioinformatics platform <strong>Bio4j</strong>. </p>

<p>“<em>We are excited to see how Bio4j is used by top research groups to build cutting-edge bioinformatics solutions</em>”, said <strong>Eduardo Pareja</strong>, <strong><a href="http://www.era7bioinformatics.com">Era7 Bioinformatics</a> CEO</strong>. “<em>To reach an even broader user base, we are pleased to announce that we now provide versions for both Neo4j and Titan graph databases, for which we have developed another layer of abstraction for the domain model using Blueprints</em>.”</p>

<p>“<em>EAGER stands for Early-concept Grants for Exploratory Research</em>”, explained <strong>Professor Kimmen Sjölander</strong>, <strong>head of the <a href="http://phylogenomics.berkeley.edu/">Berkeley Phylogenomics Group</a></strong>: “<em>NSF awards these grants to support exploratory work in its early stages on untested, but potentially transformative, research ideas or approaches</em>”. “<em>My lab’s focus is on machine learning methods for Big Data challenges in biology, particularly for graphical data such as gene trees, networks, pathways and protein structures. The limitations of relational database technologies for graph data, particularly BIG graph data, restrict scientists’ ability to get any real information from that data. When we decided to switch to a graph database, we did a lot of research into the options. When we found out about Bio4j, we knew we’d found our solution. The Bio4j team has made our development tasks so much easier, and we look forward to a long and fruitful collaboration in this open-source project</em>”.</p>

<p>You can find more information here:</p>

<ul>
  <li><a href="http://era7bioinformatics.com/en/download_file.cfm?file=1695&amp;news=17"><strong>PHYLOGENOMICS_BERKELEY_BIO4J_ERA7_BIOINFORMATICS.pdf</strong></a></li>
</ul>

<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j 0.9 the billion relationships are here!]]></title>
    <link href="http://bio4j.com/blog/2013/10/bio4j-09-the-billion-relationships-is-here/"/>
    <updated>2013-10-15T06:33:27+02:00</updated>
    <id>http://bio4j.com/blog/2013/10/bio4j-09-the-billion-relationships-is-here</id>
    <content type="html"><![CDATA[<p>Hi everyone!</p>

<p>So <a href="https://github.com/bio4j/Bio4j/wiki/Bio4j-0.9"><strong>Bio4j 0.9</strong></a> finally made its way out, and it’s here bringing you more than 1 billion relationships. These are its main numbers:</p>

<ul>
  <li><strong>1.216.993.547</strong> relationships</li>
  <li><strong>190.625.351</strong> nodes</li>
  <li><strong>584.436.429</strong> properties</li>
</ul>

<p>A lot of new features and improvements have been incorporated, including the following <em>(I will go into more detail in later posts specifically dedicated to each of them)</em>:</p>

<h2 id="refurbishing-the-domain-model">Refurbishing the domain model</h2>

<p><img src="http://bio4j.com/images/domainModelThumbnail.png" style="float:right" />We have introduced a new level of abstraction for the domain model by decoupling the inner database implementation from the relationships among the entities themselves. An interface has been developed for each node and relationship present in the database, including both methods to access the properties of the entity it represents and utility methods that make it easy to navigate to the entities linked to it.
All this can be found under the package <em>com.era7.bioinfo.bio4j.model</em>.</p>

<h2 id="new-blueprints-layer">New Blueprints layer</h2>

<p><img src="http://bio4j.com/images/blueprints.png" style="float:left" /> Apart from the set of interfaces, we’ve developed another layer for the domain model using <a href="http://blueprints.tinkerpop.com/"><strong>Blueprints</strong></a>. This way we’re going one step further in making the domain model independent of the choice of database technology.</p>

<h2 id="new-titan-implementation">New Titan implementation</h2>

<p><img src="http://bio4j.com/images/titan.png" style="float:right" /> After the problems we had with the so-called <a href="http://thinkaurelius.com/2012/10/25/a-solution-to-the-supernode-problem/"><em><strong>supernodes</strong></em></a> (which are quite common indeed), we decided to give the <a href="http://thinkaurelius.github.io/titan/"><strong>Titan Graph Database</strong></a> technology a try and see how it behaves in such situations. Both wrapper classes for each entity and importing programs have already been implemented. This new prototype still needs some testing, but be sure you’ll be hearing more about this pretty soon! ;)</p>

<h2 id="bye-bye-reference-node">Bye bye reference node</h2>

<p>We decided to finally stop using the reference node for indexing purposes <em>(actually there’s no use for it anymore in Bio4j)</em>. 
I have to admit it, I never was a fan of it and it was about time to do it. So now auxiliary relationships such as, for instance, <em>MainTaxonRel</em> or <em>MainDatasetRel</em> have been replaced by a standard node index.</p>

<h2 id="bug-fixes">Bug fixes</h2>

<p>This new release comes with many fixes including:</p>

<ol>
  <li><strong>EnzymeNode</strong>: The node type property was not stored in previous releases.</li>
  <li><strong>DatasetNode</strong>: Name property was not properly indexed. </li>
  <li><strong>OrganismNode</strong>: NCBI tax-id property was not stored in some scenarios.</li>
  <li>Redundant sequence conflict feature relationships have been fixed.</li>
  <li>Duplicated submissions fixed</li>
  <li>ProteinUnpublishedObservationCitation relationship was missing</li>
  <li>The following node types were not properly indexed by their type till now: <em>BookNode, ArticleNode, OnlineArticleNode, SubmissionNode, PatentNode, PublisherNode, OnlineJournalNode, JournalNode</em></li>
</ol>

<h2 id="java-7">Java 7</h2>

<p>Bio4j uses Java 7 now ;)</p>

<p>OK, so that’s all for now, I’ll be posting much more information about this new release soon.</p>

<p>Cheers!</p>

<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j modules, adapt the database to your own needs]]></title>
    <link href="http://bio4j.com/blog/2012/10/bio4j-modules-adapt-the-database-to-your-own-needs/"/>
    <updated>2012-10-30T05:33:27+01:00</updated>
    <id>http://bio4j.com/blog/2012/10/bio4j-modules-adapt-the-database-to-your-own-needs</id>
    <content type="html"><![CDATA[<p>Hi!</p>

<p><strong>Bio4j 0.8 includes</strong> a few <strong>different data sources</strong> and you may not always be interested in having all of them. For example you might be interested in playing around with the Gene Ontology DAG alone and let’s face it, having to import a ~105 GB database to do that wouldn’t make much sense…</p>

<p>That’s why <strong>the importing process is modular and customizable, allowing you to import just the data you are interested in</strong>.
Here’s the big picture of where entities and relationships come from in the general domain model:</p>

<p><a href="https://raw.github.com/bio4j/Bio4j/master/Bio4jDomainModelWithCardinality.jpg"><img src="http://bio4j.com/images/DomainModelWithDataSourceView.png" /></a></p>

<p>There’s however one thing that you have to <strong>keep in mind: you must be consistent when choosing the data sources</strong> you want to include in your database. That is to say, you cannot have, for example, the Uniref clusters without previously importing Uniprot KB; otherwise there wouldn’t be proteins to connect to when importing the clusters!</p>

<p>Here you have a basic schema showing the dependencies among the different modules:</p>

<p><a href="http://bio4j.com/images/ModuleDependencies.png"><img src="http://bio4j.com/images/ModuleDependencies.png" /></a></p>

<p><em>(Let me remind you that two data sources not being connected by an arrow here does NOT mean that they are unrelated; the arrows only indicate whether a data source can be imported on its own or needs other data sources to be already present in the database.)</em></p>
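<p>This kind of coherence rule is easy to check mechanically before starting an import. The following Java sketch is only an illustration: <code>ModuleSelection</code> is a hypothetical helper that is not part of Bio4j, the module names are made up, and the only dependency taken from this post is that Uniref requires Uniprot KB.</p>

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Bio4j: checks that a module selection is coherent.
final class ModuleSelection {
  // Assumed dependency table; only "uniref requires uniprot_kb" comes from the post.
  static final Map<String, Set<String>> REQUIRES = new HashMap<>();
  static {
    REQUIRES.put("uniprot_kb", Collections.emptySet());
    REQUIRES.put("gene_ontology", Collections.emptySet());
    REQUIRES.put("uniref", Collections.singleton("uniprot_kb"));
  }

  // Returns the modules that the selection depends on but does not contain.
  static Set<String> missingDependencies(Set<String> selected) {
    Set<String> missing = new TreeSet<>();
    for (String module : selected) {
      for (String dependency : REQUIRES.getOrDefault(module, Collections.emptySet())) {
        if (!selected.contains(dependency)) {
          missing.add(dependency);
        }
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    Set<String> goOnly = new LinkedHashSet<>(Arrays.asList("gene_ontology"));
    Set<String> unirefOnly = new LinkedHashSet<>(Arrays.asList("uniref"));
    System.out.println(missingDependencies(goOnly));     // prints []
    System.out.println(missingDependencies(unirefOnly)); // prints [uniprot_kb]
  }
}
```

<p>An empty result means the selection can be imported as it is; otherwise the returned modules have to be added to the selection first.</p>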

<p>I’m going to create a wiki page where I will be going into more detail regarding database size and importing process time depending on your modules choice, but meanwhile you can find some more information about how to do this in the <a href="https://github.com/bio4j/Bio4j/wiki/Importing-bio4j">Importing Bio4j wiki page</a>.</p>

<p>Have a good day!</p>

<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j 0.8, some numbers]]></title>
    <link href="http://bio4j.com/blog/2012/10/bio4j-0-8-some-numbers/"/>
    <updated>2012-10-18T06:33:27+02:00</updated>
    <id>http://bio4j.com/blog/2012/10/bio4j-0-8-some-numbers</id>
    <content type="html"><![CDATA[<p>Hi everyone!</p>

<p>Bio4j 0.8 was recently released and now it’s time to have a deeper look at its numbers <em>(as you can see we are quickly approaching the 1 billion relationships and 100M nodes)</em>:</p>

<ul>
  <li>Number of Relationships: <strong>717.484.649</strong></li>
  <li>Number of Nodes: <strong>92.667.745</strong></li>
  <li>Relationship types: <strong>144</strong></li>
  <li>Node types: <strong>42</strong></li>
</ul>

<p>Ok, but how are those relationships and nodes distributed among the different types?  In this chart you can see the <strong>first 20 Relationship types</strong>:</p>

<p><a href="http://bio4j.com/images/bio4j08first20RelTypes.png"><img src="http://bio4j.com/images/bio4j08first20RelTypes.png" /></a></p>

<p>Here, the same thing but for the <strong>first 20 Node types</strong>:</p>

<p><a href="http://bio4j.com/images/bio4j08first20NodeTypes.png"><img src="http://bio4j.com/images/bio4j08first20NodeTypes.png" /></a></p>

<p>You can also check these two files including the numbers for all the existing types:</p>

<ul>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/releases/0.8/statistics/Bio4j08NodeStatistics.txt">Node statistics</a></li>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/releases/0.8/statistics/Bio4j08RelStatistics.txt">Relationship statistics</a></li>
</ul>

<p>All this data was obtained with the program <a href="https://github.com/bio4j/Bio4jTools/blob/master/src/com/era7/bioinfo/bio4j/tools/GetNodeAndRelsStatistics.java"><strong>GetNodeAndRelsStatistics</strong></a>.</p>

<p>Have a good weekend!</p>

<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j 0.8 is here!]]></title>
    <link href="http://bio4j.com/blog/2012/09/bio4j-0-8-is-here/"/>
    <updated>2012-09-22T18:50:52+02:00</updated>
    <id>http://bio4j.com/blog/2012/09/bio4j-0-8-is-here</id>
    <content type="html"><![CDATA[<p>Hi everyone!</p>

<p>I’m glad to announce the release of <a href="https://github.com/bio4j/Bio4j/wiki/Bio4j-0.8"><strong>Bio4j 0.8</strong></a> including more than <strong>5.488.000 new proteins</strong> and <strong>3.233.000 genes</strong> among others,  plus the following improvements and features:</p>

<h2 id="pfam-families">Pfam families</h2>

<p>Bio4j now includes all Pfam families included in Uniprot KB (both Swiss-Prot and TrEMBL). For that, both a new node type and a new relationship type have been created:</p>

<ul>
  <li>
    <p><a href="http://www.bio4j.com/docs/bio4j/apidocs/com/era7/bioinfo/bio4j/model/nodes/PfamNode.html">PfamNode</a></p>
  </li>
  <li>
    <p><a href="http://www.bio4j.com/docs/bio4j/apidocs/com/era7/bioinfo/bio4j/model/relationships/protein/ProteinPfamRel.html">ProteinPfamRel</a> (this relationship connects a protein and the respective Pfam families associated to it)</p>
  </li>
</ul>

<p>The following properties have been added to the Pfam node:</p>

<ul>
  <li>ID</li>
  <li>Name</li>
</ul>

<p>Besides, an exact index for the Pfam family ID property has also been created <em>( pfam_id_index ).</em></p>

<h2 id="ncbi-taxonomy-tree-gi-index-improved">NCBI taxonomy tree GI index improved</h2>

<p>Old merged node IDs have been incorporated into the GenInfo Identifier (GI) &lt;–&gt; Taxonomy units index. That means that all GI-TaxID pairs that reference an old, merged Tax-ID are now also part of the index, resulting in a higher hit rate when using the index.
For that we used the file <strong>merged.dmp</strong> included in the <a href="ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz">official tax dump file</a> provided by the NCBI.</p>
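<p>As an illustration of the idea, here is a small Java sketch that maps retired tax IDs to their current ones. It is not the Bio4j importer code: <code>MergedTaxIds</code>, <code>parseMerged</code> and <code>resolve</code> are hypothetical names, and the sketch assumes the two-column layout of the NCBI dump file (old tax ID, new tax ID, with fields separated by a tab, a pipe and a tab).</p>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Bio4j importer: resolves merged NCBI tax IDs.
final class MergedTaxIds {
  // Parses rows laid out as: old_tax_id <tab>|<tab> new_tax_id <tab>|
  static Map<Integer, Integer> parseMerged(List<String> lines) {
    Map<Integer, Integer> oldToNew = new HashMap<>();
    for (String line : lines) {
      String[] fields = line.split("\\|");
      if (fields.length < 2) {
        continue; // skip blank or malformed rows
      }
      int oldId = Integer.parseInt(fields[0].trim());
      int newId = Integer.parseInt(fields[1].trim());
      oldToNew.put(oldId, newId);
    }
    return oldToNew;
  }

  // Returns the current tax ID for a possibly retired one; unknown IDs are returned unchanged.
  static int resolve(int taxId, Map<Integer, Integer> oldToNew) {
    return oldToNew.getOrDefault(taxId, taxId);
  }

  public static void main(String[] args) {
    // Sample rows in the assumed layout; the values are for illustration only.
    List<String> sample = Arrays.asList("12\t|\t74109\t|", "30\t|\t29\t|");
    Map<Integer, Integer> merged = parseMerged(sample);
    System.out.println(resolve(12, merged));   // prints 74109
    System.out.println(resolve(9606, merged)); // prints 9606
  }
}
```

<p>With such a map, a GI that is still paired with a retired Tax-ID can be pointed at the taxonomy node that replaced it instead of producing a miss.</p>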

<h2 id="bio4j-and-bio4jmodel-projects-unification">Bio4j and Bio4jModel projects unification</h2>

<p>The <a href="https://github.com/bio4j/Bio4j">Bio4j</a> project has absorbed the <a href="https://github.com/bio4j/Bio4jModel">Bio4jModel</a> project from this release on.</p>

<p>Until now, Bio4jModel library included the core classes for the manipulation and traversal of the graph while Bio4j project only included the importing programs. I’ve been thinking for a while that this could be confusing and, since there was no real need to keep them as independent projects, I decided to put it all under Bio4j <em>(you just need one jar file now ;) ).</em> </p>

<h2 id="new-script-for-the-importing-process">New script for the importing process</h2>

<p>You don’t have to worry anymore about manually downloading/decompressing/etc… the sources for the DB in case you want to import Bio4j in your own cluster/machine. Just run the script <strong><a href="https://github.com/bio4j/Bio4j/blob/master/DownloadAndPrepareBio4jSources.sh">DownloadAndPrepareBio4jSources.sh</a></strong> and it will do it all for you.</p>

<h2 id="bug-fixes">Bug fixes</h2>

<ol>
  <li><strong>MetalIonBindingSiteFeature</strong>: This feature relationship had an erroneous name assigned, which has now been fixed.</li>
</ol>

<p>Well, that’s all for now, I’ll be posting more information about this new release soon ;)</p>

<p>Cheers,</p>

<p><a href="http://www.twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[New Bio4j general domain model schema available]]></title>
    <link href="http://bio4j.com/blog/2012/05/new-bio4j-general-domain-model-schema-available/"/>
    <updated>2012-05-17T23:11:50+02:00</updated>
    <id>http://bio4j.com/blog/2012/05/new-bio4j-general-domain-model-schema-available</id>
    <content type="html"><![CDATA[<p>Hi everyone!</p>

<p>It’s been a few months since I published the last post, but that doesn’t mean that the development of Bio4j has stopped. On the contrary, I have been working on the integration of Bio4j with other DB-related projects as well as pipelines and tools. Actually, I’m staying in the US for a couple of months right now, working on the implementation and integration of a new database around Bio4j that includes genomic data from grasses, as part of a collaboration with the Ohio State University (I promise to give more details about this and more in upcoming posts).</p>

<p>Ok, but let’s get to the point of this post. Even though a web tool to explore the Bio4j data structure (<a href="http://gotools.bio4j.com:8080/Bio4jExplorerServer/Bio4jExplorer.html"><strong>Bio4jExplorer</strong></a>) is already available, I felt that something else was missing in order to get the big picture of all the data included and how it’s interrelated. So I got to work and created this general domain model including all node types and relationships (also specifying their cardinality).</p>

<p><a href="https://raw.github.com/bio4j/Bio4j/master/Bio4jDomainModelWithCardinality.jpg"><img src="http://bio4j.com/images/Bio4jDomainModelWithCardinality.png" /></a></p>

<p>I didn’t include “auxiliary” relationships linked to the reference node, in order not to pollute the schema with relationships that have no semantic meaning and only serve indexing purposes. Also, the text included in both boxes represents different relationships that all link the same nodes, specifically Protein with CommentType and FeatureType. I could have drawn them like the rest, but then I would have ended up with a hairball instead of a meaningful schema.</p>

<p>As always, any feedback is welcome!</p>

<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4jExplorer, new features and design!]]></title>
    <link href="http://bio4j.com/blog/2012/03/bio4jexplorer-new-features-and-design/"/>
    <updated>2012-03-09T21:57:56+01:00</updated>
    <id>http://bio4j.com/blog/2012/03/bio4jexplorer-new-features-and-design</id>
    <content type="html"><![CDATA[<p>Hello everyone,</p>

<p>I’m happy to announce a new set of features for our tool Bio4jExplorer plus some changes in its design. I hope this may help both potential and current users to get a better understanding of Bio4j DB structure and contents.</p>

<p><a href="http://gotools.bio4j.com:8080/Bio4jExplorerServer/Bio4jExplorer.html"><img src="http://bio4j.com/images/bio4jExplorerScreenshot-1024x712.png" /></a></p>

<h3 id="node--relationship-properties">Node &amp; Relationship properties</h3>

<p>With Bio4jExplorer you can now check the properties of either a node or a relationship in the table located in the lower part of the interface. Five columns are included:</p>

<ul>
  <li><strong>Name:</strong> property name</li>
  <li><strong>Type:</strong> property type (<code>String</code>, <code>int</code>, <code>float</code>, <code>String[]</code>, …)</li>
  <li><strong>Indexed:</strong> whether the property is indexed or not (yes/no)</li>
  <li><strong>Index name:</strong> name of the index associated to this property, if there’s any</li>
  <li><strong>Index type:</strong> type of the index associated to this property, if there’s any</li>
</ul>

<p><img src="http://bio4j.com/images/bio4jExplorerPropertiesTable.png" /></p>

<h3 id="node--relationship-data-source">Node &amp; Relationship Data source</h3>

<p>You can also see now from which source a Node or Relationship was imported, <em>some examples would be Uniprot, Uniref, GO, RefSeq…</em></p>

<p><img src="http://bio4j.com/images/bio4jExplorerDataSourceLabel.png" /></p>

<h3 id="relationships-name-property">Relationships Name property</h3>

<p>With this new version you can directly check here the “internal” name of relationships without having to go to the respective javadoc documentation. </p>

<p><img src="http://bio4j.com/images/bio4jExplorerRelationshipsNameProperty.png" /></p>

<p>This is quite useful when you are writing your Cypher or Gremlin queries: just check it, copy it, and paste it in your query. An example using the relationship shown in the picture would be this query, included in the <a href="https://github.com/bio4j/Bio4j/wiki/Bio4j-cypher-cheat-sheet">Bio4j Cypher cheatsheet</a>:</p>

<p><strong><em>Get proteins (accession and names) associated to an interpro motif (limited to 10 results)</em></strong></p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class=""><span class="line">START i=node:interpro_id_index(interpro_id_index = "IPR023306")
</span><span class="line"> MATCH i &lt;-[:PROTEIN_INTERPRO]- p
</span><span class="line"> return p.accession, p.fullname, p.name, p.short_name
</span><span class="line"> limit 10</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The url for Bio4jExplorer is the same as before:</p>

<ul>
  <li><a href="http://gotools.bio4j.com:8080/Bio4jExplorerServer/Bio4jExplorer.html"><strong>http://gotools.bio4j.com:8080/Bio4jExplorerServer/Bio4jExplorer.html</strong></a></li>
</ul>

<p>In case you are interested in how the tool is implemented, please go to <a href="http://bio4j.com/blog/2011/10/bio4jexplorer-familiarize-yourself-with-bio4j-nodes-and-relationships/">the previous post about Bio4jExplorer</a>, where you can find information about the different code repos and more.</p>

<p><strong>If you want to check the files containing the hard-coded information about how nodes, relationships, and indexes are organized</strong>, which are the input for the program that creates the AWS SimpleDB domain, I just uploaded them to the bio4j-public S3 bucket. Please click on their names to download them:</p>

<ul>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/NodesBio4j.txt"><strong>NodesBio4j.txt</strong></a></li>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/NodeIndexesBio4j.txt"><strong>NodeIndexesBio4j.txt</strong></a></li>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/NodePropertiesBio4j.txt"><strong>NodePropertiesBio4j.txt</strong></a></li>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/RelationshipsBio4j.txt"><strong>RelationshipsBio4j.txt</strong></a></li>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/RelationshipPropertiesBio4j.txt"><strong>RelationshipPropertiesBio4j.txt</strong></a></li>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/simple-db-files/RelationshipIndexesBio4j.txt"><strong>RelationshipIndexesBio4j.txt</strong></a></li>
</ul>

<p>I wish you all a great weekend!</p>

<p>I’ll have mine at the beach enjoying our great springy weather with lots of sun down here in Andalucia ;)</p>

<p><a href="http://www.twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j 0.7, some numbers]]></title>
    <link href="http://bio4j.com/blog/2012/03/bio4j-0-7-some-numbers/"/>
    <updated>2012-03-05T13:28:27+01:00</updated>
    <id>http://bio4j.com/blog/2012/03/bio4j-0-7-some-numbers</id>
    <content type="html"><![CDATA[<p>Hi everyone!</p>

<p>There have already been a good few posts showing different uses and applications of Bio4j, but what about Bio4j data itself?
Today I’m going to show you some <strong>basic statistics</strong> about the different types of nodes and relationships Bio4j is made up of.
Just as a heads up, here are the <strong>general numbers of Bio4j 0.7</strong>:</p>

<ul>
  <li>Number of Relationships: <strong>530,642,683</strong></li>
  <li>Number of Nodes: <strong>76,071,411</strong></li>
  <li>Relationship types: <strong>139</strong></li>
  <li>Node types: <strong>38</strong></li>
</ul>

<p>Ok, but how are those relationships and nodes distributed among the different types?  In this chart you can see the <strong>first 20 Relationship types</strong> (click on the image below to check the interactive chart):</p>

<p><a href="http://bio4j.com/imgs/release07/relsBarChart.html"><img src="http://bio4j.com/images/first20RelTypesChart-1024x797.png" /></a></p>

<p>Here, the same thing but for the <strong>first 20 Node types</strong> (click on the image below to check the interactive chart):</p>

<p><a href="http://bio4j.com/imgs/release07/nodesBarChart.html"><img src="http://bio4j.com/images/first20NodeTypesChart-1024x794.png" /></a></p>

<p>You can also check these two files including the numbers from all existing types:</p>

<ul>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/releases/0.7/statistics/Bio4j07NodeStatistics.txt">Node statistics</a></li>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/releases/0.7/statistics/Bio4j07RelStatistics.txt">Relationship statistics</a></li>
</ul>

<p>All this data was obtained with the program <a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/GetNodeAndRelsStatistics.java"><strong>GetNodeAndRelsStatistics</strong></a>.</p>

<p>Have a good day!</p>

<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>

<h2 id="comments">comments</h2>

<ul>
  <li>
    <p><strong>Patrick Durusau</strong>
Excellent!
Question: When I checked at PubMed, I did not find Neo4j cited in any of the medical literature. I am not a medical professional but am interested in what might promote Bio4j in the medical research community?
It is too good of a resource to be unnoticed.
Patrick</p>

    <ul>
      <li><strong>ppareja</strong>
Hi Patrick,
I’m glad you liked the post.
It’s true that Bio4j may not have caught the attention of many people yet who could definitely make a good use out of it. What are the reasons for that? Well, I think it could be a mixture of factors.
Some people don’t much like learning new technologies/strategies/workflows… and tend to stick to things they already know as long as possible, which is totally respectable and understandable. Other people, though, may simply not have found out about it yet… It’s also possible that, due to the lack of well-structured project documentation, potential users get lost along the way when trying to figure out what Bio4j is about and/or miss the parts they could be interested in.
I could keep going with more possible reasons that come to mind, but I still couldn’t be really objective: it’s me who created this project  :D
The point you bring up is actually one of the reasons why we value any sort of feedback on the project so much (especially constructive ‘bad’ feedback that helps us realize its weaknesses).
Let me know if you come up with an idea to let more people know about Bio4j !
Pablo</li>
    </ul>
  </li>
</ul>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j REST Server configures itself now thanks to the updated CF template]]></title>
    <link href="http://bio4j.com/blog/2012/02/bio4j-rest-server-configures-itself-now-thanks-to-the-updated-cf-template/"/>
    <updated>2012-02-24T14:21:39+01:00</updated>
    <id>http://bio4j.com/blog/2012/02/bio4j-rest-server-configures-itself-now-thanks-to-the-updated-cf-template</id>
    <content type="html"><![CDATA[<p>Hi all,</p>

<p>I just wanted to write a very short post to let you know about the changes in the <a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/Bio4jBasicRestServerTemplate.txt"><strong>Bio4jBasicRestServerTemplate</strong></a>.</p>

<blockquote>
  <p>Template what!? </p>
</blockquote>

<p>If that’s what you’re thinking, please go <a href="http://blog.ohnosequences.com/2011/12/neo4j-server-and-aws-become-good-friends/">here</a> to get an idea of what this is all about.</p>

<p>From now on, this CloudFormation template adapts the server configuration files:</p>

<ul>
  <li><code>neo4j-wrapper.conf</code></li>
  <li><code>neo4j.properties</code></li>
</ul>

<p>to the characteristics of the instance type the server is running on, so that it can make the most of it.</p>

<blockquote>
  <p>These configurations assume that the server is running alone in the machine.</p>
</blockquote>

<p>For that I created these two new mappings in the template:</p>

<ul>
  <li><code>AWSInstanceType2WrapperConfFile</code></li>
  <li><code>AWSInstanceType2Neo4jPropertiesFile</code></li>
</ul>

<p>Default configuration values are available in the <strong>bio4j-public S3 bucket</strong>. For example, to access the server configuration files for an <code>m1.xlarge</code> instance, just go to this URL:</p>

<ul>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/server/conf-files/m1.xlarge/neo4j-wrapper.conf">neo4j-wrapper.conf - m1.xlarge</a></li>
</ul>

<p>The same goes for the other file:</p>

<ul>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/server/conf-files/m1.xlarge/neo4j.properties">neo4j.properties - m1.xlarge</a></li>
</ul>

<p>If you want to check the conf files for any other instance type, you just have to change the <strong>instance type name</strong> in the urls linked above.</p>
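<p>As a rough illustration of that URL scheme, here is a minimal Python sketch that builds both links for a given instance type. The helper name <code>conf_file_urls</code> is made up for this example, and the layout is only inferred from the two <code>m1.xlarge</code> links above; the sketch does not check whether files actually exist for a given instance type.</p>

```python
# Sketch only: the URL layout is inferred from the two m1.xlarge links above.
BASE_URL = "https://s3-eu-west-1.amazonaws.com/bio4j-public/server/conf-files"
CONF_FILES = ("neo4j-wrapper.conf", "neo4j.properties")


def conf_file_urls(instance_type):
    """Return a dict mapping each conf file name to its URL for one instance type.

    `conf_file_urls` is a hypothetical helper, not part of the template.
    """
    return {
        name: "{}/{}/{}".format(BASE_URL, instance_type, name)
        for name in CONF_FILES
    }


if __name__ == "__main__":
    for name, url in conf_file_urls("m1.xlarge").items():
        print(name, url)
```

<p>Swapping <code>"m1.xlarge"</code> for another instance type name only changes the middle path segment, which is all the manual edit described above amounts to.</p>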

<p>Have a good weekend!</p>

<p><strong><a href="http://www.twitter.com/pablopareja">@pablopareja</a></strong></p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j]]></title>
    <link href="http://bio4j.com/blog/2012/02/finding-the-lowest-common-ancestor-of-a-set-of-ncbi-taxonomy-nodes-with-bio4j/"/>
    <updated>2012-02-08T21:20:53+01:00</updated>
    <id>http://bio4j.com/blog/2012/02/finding-the-lowest-common-ancestor-of-a-set-of-ncbi-taxonomy-nodes-with-bio4j</id>
    <content type="html"><![CDATA[<p>I don’t know if you have ever heard of the <a href="http://en.wikipedia.org/wiki/Lowest_common_ancestor"><strong>lowest common ancestor problem</strong></a> in graph theory and computer science but it’s actually pretty simple. As its name says, it consists of finding the common ancestor for two different nodes which has the lowest level possible in the tree/graph.</p>

<p>Even though it is normally defined for just two nodes, <strong>it can easily be extended to a set of nodes of arbitrary size</strong>. This is quite a common scenario that can be found across multiple fields, and <strong>taxonomy</strong> is one of them.</p>

<p>The reason I’m talking about all this is that today I ran into the need for such an algorithm as part of some improvements in our <strong>metagenomics</strong> <a href="http://www.era7bioinformatics.com/en/metagenomics_mg7.html">MG7 method</a>. After doing some research looking for existing solutions, I came to the conclusion that I should implement my own, since I couldn’t find any applicable implementation designed for more than just <strong>two</strong> nodes.</p>

<p>Ok, but let’s get into detail and see my algorithm:</p>

<p>We start from a set of nodes of arbitrary size (<em>4 in this example</em>), which are spread through the taxonomy tree:</p>

<p><img src="http://bio4j.com/images/LCAfirstStep.png" /></p>

<p>We then fetch the first node from the set and compute its whole ancestor list, all the way up to the root of the taxonomy.</p>

<p><img src="http://bio4j.com/images/LCAsecondStep.png" /></p>

<p>Now that we have the list, we take the second node of the set and check whether it’s contained in the list. If not, we keep going up through its ancestors until we find a hit. Once the hit has been found, we get rid of the elements that precede it in the list (if any) so that they are not taken into account in the next iterations of the algorithm.</p>

<p><img src="http://bio4j.com/images/LCAthirdStep.png" /></p>

<p>We keep going through our node set, and C also removes some elements of the list…</p>

<p><img src="http://bio4j.com/images/LCAfourthStep.png" /></p>

<p>Finally we reach the last node of our set, but no element is removed from our list as a result.</p>

<p><img src="http://bio4j.com/images/LCAfifthStep.png" /></p>

<p>The last thing we have to do is simply get the first element of the resulting list and there we have our lowest common ancestor!</p>

<p><img src="http://bio4j.com/images/LCAsixthStep.png" /></p>

<p>This algorithm is encapsulated in the class <a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/algo/TaxonomyAlgo.java"><strong>TaxonomyAlgo</strong></a>, specifically in the static method <code>lowestCommonAncestor()</code> that expects a list of <strong>NCBITaxonNode</strong> as parameter and returns their LCA node.</p>
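<p>For readers who prefer code to pictures, the steps above can be sketched in a few lines of Python. This is not the Java code from <strong>TaxonomyAlgo</strong>: it is a standalone illustration that assumes the taxonomy is given as a plain child-to-parent dictionary with a single root, and the function names and the toy tree (which skips most taxonomic ranks) are made up for the example.</p>

```python
def ancestor_path(node, parent):
    """Return [node, parent of node, ..., root] for a child-to-parent dict."""
    path = [node]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(nodes, parent):
    """Lowest common ancestor of a non-empty list of nodes in a rooted tree."""
    # Step 1: ancestor list of the first node (the node itself included).
    candidates = ancestor_path(nodes[0], parent)
    for node in nodes[1:]:
        # Step 2: climb from this node until we hit something in the list.
        current = node
        while current not in candidates:
            current = parent[current]
        # Step 3: drop everything that precedes the hit.
        candidates = candidates[candidates.index(current):]
    # Step 4: the first element left is the LCA.
    return candidates[0]


# Toy tree, heavily simplified (hypothetical data for illustration only).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Bacillus": "Firmicutes",
}

if __name__ == "__main__":
    print(lowest_common_ancestor(["Escherichia", "Salmonella"], PARENT))
    print(lowest_common_ancestor(["Escherichia", "Salmonella", "Bacillus"], PARENT))
```

<p>Since the candidate list only ever shrinks, each extra node costs at most one climb towards the root, stopping at the first ancestor that is still in the list.</p>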

<p>You can also check the class <a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/taxonomy/LowestCommonAncestorTest.java"><strong>LowestCommonAncestorTest</strong></a> where a simple test program that makes use of this method is implemented. </p>

<p>This program expects as parameters:</p>

<ol>
  <li>Bio4j DB folder</li>
  <li>An arbitrary number of NCBI taxonomy IDs representing the node set</li>
</ol>

<p>The Scientific name and the NCBI tax ID of the LCA are printed in the console as result.</p>

<p>Enjoy!</p>

<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>

<h2 id="comments">comments</h2>

<ul>
  <li>
    <p><strong>Paul Agapow</strong>
Oddly enough, I had to solve this exact problem a few years ago (to see how much of a tree is left after an extinction, for calculating the biodiversity impact) and then just a few weeks ago (but for the unrooted case). Both times I was sure this had to be a solved problem, but there were no obvious solution out there.</p>

    <ul>
      <li><strong>Pablo Pareja</strong>
Hi Paul,
I was also quite surprised there wasn’t any ‘official’/obvious solution for this, specially when I’d say it’s quite a common problem.
Now that you mention it, I think I’ll extend the implementation for the unrooted case as well.
By the way, just out of curiosity, what kind of solution did you come up with in the end?</li>
    </ul>
  </li>
  <li>
    <p><strong>Victor de Jager</strong>
Hi Pablo,
interesting post. I solved a very similar problem a few years ago using an early version of the ETE toolkit. http://ete.cgenomics.org/
It’s a well documented with plenty of examples.</p>

    <ul>
      <li><strong>ppareja</strong>
Hi Victor,
Thanks for the link. Just a quick question, is it open-source?</li>
    </ul>
  </li>
  <li>
    <p><strong>Jaime</strong>
Hi,
You may be interested in this python script based on the ETE library: https://github.com/jhcepas/ncbi_taxonomy
BTW, ETE is free software</p>
  </li>
  <li>
    <p><strong>Miguel</strong>
The LCA problem is closely related to the Range Minimum Query problem in graph theory. Working on metagenomics I had to implement a fast algorithm to search for LCA of an arbitrary number of leafs in a taxonomic tree. Given that the tree is always the same, you can pre-process it for fast searches. I ended up implemented the Sparse table algorithm for RMQ explained here:
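<a href="http://community.topcoder.com/tc?module=Static&amp;d1=tutorials&amp;d2=lowestCommonAncestor">http://community.topcoder.com/tc?module=Static&amp;d1=tutorials&amp;d2=lowestCommonAncestor</a>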
You say in your post that you couldn’t find any solution out there for more than 2 nodes. The reason is simple: the LCA of N nodes can be decomposed to N-1 times the LCAs of 2 nodes (for example, the LCA of 3 nodes is the LCA of one of them and the LCA of the other 2).</p>

    <ul>
      <li><strong>ppareja</strong>
Hi Miguel,
Thanks for the link ;)
In my case though I didn’t want to do any pre-processing on purpose. Having everything stored as a graph gives you a great advantage both in terms of speed and scalability and I didn’t want to throw that away. On the other hand this sort of algorithm is one that could be applied to other sub-graphs of Bio4j, not only the taxonomy tree, so once you implement it in this way it’d be trivial to adapt it to other such cases.
I know that the problem can be decomposed so that you end up with a set of 2-node problems. What I meant, however, was that I expected to find algorithms for this problem with some sort of specific optimizations for dealing with a big set of nodes, not only two. For example, somehow not passing again through nodes already visited, which will happen when you decompose the problem into “isolated” pairs of nodes.</li>
    </ul>
  </li>
</ul>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Mining Bio4j data: finding topological patterns in PPI networks]]></title>
    <link href="http://bio4j.com/blog/2012/01/mining-bio4j-data-finding-topological-patterns-in-ppi-networks/"/>
    <updated>2012-01-24T16:42:56+01:00</updated>
    <id>http://bio4j.com/blog/2012/01/mining-bio4j-data-finding-topological-patterns-in-ppi-networks</id>
    <content type="html"><![CDATA[<p>Hi everyone!</p>

<p>After writing <a href="http://blog.bio4j.com/2011/12/using-bio4j-neo4j-graph-algo-component-for-finding-protein-protein-interaction-paths/"><strong>this post</strong></a> in December, I’ve been thinking of doing something similar, yet different, using the Neo4j Cypher query language.</p>

<p>That’s how I came up with the idea of looking for <strong>topological patterns</strong> across a large <strong>sub-set of the Protein-Protein interactions network</strong> included in Bio4j, rather than focusing on a few proteins selected a priori.</p>

<p>I decided to mine the data in order to find <strong>circuits/simple cycles of length 3</strong> where <strong>at least one protein is from Swiss-Prot dataset</strong>:</p>

<p><img src="http://bio4j.com/images/PPICircuit.png" /></p>

<p>I would like to point out that the <strong>direction</strong> here <strong>is important</strong> and these two cycles:</p>

<ul>
  <li><code>A --&gt; B --&gt; C --&gt; A</code></li>
  <li><code>A --&gt; C --&gt; B --&gt; A</code></li>
</ul>

<p>are <strong>not</strong> the same. With that said, let’s see what the Cypher query looks like:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class=""><span class="line">START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
</span><span class="line">MATCH d &lt;-[r:PROTEIN_DATASET]- p, 
</span><span class="line">circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-&gt; (p2) -[:PROTEIN_PROTEIN_INTERACTION]-&gt; (p3) -[:PROTEIN_PROTEIN_INTERACTION]-&gt; (p)
</span><span class="line"> return p.accession, p2.accession, p3.accession</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>As you can see, it’s really simple and straightforward. The first two lines match the proteins from the Swiss-Prot dataset; the third line then retrieves the ones that form a cycle of length 3 as described before. Once the query has finished, you should get something like this:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
</pre></td><td class="code"><pre><code class=""><span class="line">cypher&gt; 
</span><span class="line">==&gt; +---------------------------------------------------------+
</span><span class="line">p.accession | p2.accession | p3.accession | 
</span><span class="line">==&gt; +---------------------------------------------------------+
</span><span class="line">Q08465 P35189 P3421
</span><span class="line">Q08465 P34218 P35189
</span><span class="line">Q8GXA4 Q8L7E5 Q9LE82
</span><span class="line">Q8GXA4 Q9FH18 Q8L7E5
</span><span class="line">....
</span><span class="line">==&gt; +---------------------------------------------------------+
</span><span class="line">==&gt; 6632 rows, 1019211 ms</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>As you can see, the query took <strong>about 17 minutes</strong> to complete on a <strong>100% fresh DB</strong> (no information was cached yet), running on an <a href="http://aws.amazon.com/ec2/instance-types/"><strong>m1.large</strong> AWS machine</a> with <strong>7.5GB</strong> of <strong>RAM</strong>.</p>

<p>Not bad, right!? </p>

<p>We have to beware of something, though: this query returns cycles such as:</p>

<ul>
  <li><code>A --&gt; B --&gt; C --&gt; A</code></li>
  <li><code>B --&gt; C --&gt; A --&gt; B</code></li>
</ul>

<p>as different cycles when they are actually the same one. That’s why I developed a <a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/RemoveRepetitionsFromPPICircuits.java"><strong>simple program</strong></a> to remove these repetitions and to gather some statistics.
After running the program you get two files:</p>

<ol>
  <li><strong>PPICircuitsLength3NoRepeats</strong> file: download it <a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/PPICircuitsBlogPost/PPICircuitsL3SwissProtNoRepeats.txt">here</a></li>
  <li><strong>PPICircuitsProteinsFreq</strong> file: download it <a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/PPICircuitsBlogPost/PPICircuitsL3SwissProtProteinsFreq.txt">here</a>.</li>
</ol>

<p>After filtering, the <strong>circuits found</strong> were reduced to <strong>2226 records</strong>.</p>
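<p>To give an idea of what removing those repetitions involves, here is a small Python sketch. It is not the Java program linked above, just one possible approach: every cycle is rotated so that it starts at its alphabetically smallest accession, and cycles that end up identical are counted once. Direction is preserved, so <code>A --&gt; B --&gt; C</code> and <code>A --&gt; C --&gt; B</code> stay distinct. The function names and the sample data are made up for the example.</p>

```python
def canonical_rotation(cycle):
    """Rotate a directed cycle so that it starts at its smallest element."""
    items = list(cycle)
    start = items.index(min(items))
    return tuple(items[start:] + items[:start])


def remove_rotations(cycles):
    """Keep one representative per directed cycle, in first-seen order."""
    seen = set()
    unique = []
    for cycle in cycles:
        key = canonical_rotation(cycle)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


if __name__ == "__main__":
    rows = [
        ("A", "B", "C"),  # A --> B --> C --> A
        ("B", "C", "A"),  # same cycle, starting at B
        ("A", "C", "B"),  # opposite direction: a different cycle
    ]
    for cycle in remove_rotations(rows):
        print(" ".join(cycle))
```

<p>Counting how often each accession appears in the deduplicated list would then give a per-protein frequency table like the one in the second file.</p>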

<p>Finally, I also created a really simple chart showing the absolute frequency of the 20 proteins with the most occurrences in the cycles that were found.</p>

<p><img src="http://bio4j.com/images/proteinsFrequencyChart.png" /></p>

<p>Well, that’s all for now. Have a good day!</p>

<p><a href="http://www.twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j release 0.7 is out!]]></title>
    <link href="http://bio4j.com/blog/2012/01/bio4j-release-0-7-is-out/"/>
    <updated>2012-01-11T17:50:52+01:00</updated>
    <id>http://bio4j.com/blog/2012/01/bio4j-release-0-7-is-out</id>
    <content type="html"><![CDATA[<p>Hi!</p>

<p>I’m happy to announce that the version 0.7 of Bio4j has been released. Check out the wide set of new features, tools and improvements:</p>

<h2 id="expasy-enzyme-database-integration">Expasy Enzyme database integration</h2>

<p>From now on you have the whole <a href="http://enzyme.expasy.org"><strong>Enzyme DB</strong></a> included in Bio4j. For that, both a new node type and relationship type have been created: </p>

<ul>
  <li><a href="http://www.bio4j.com/docs/bio4jmodel/apidocs/com/era7/bioinfo/bio4jmodel/nodes/EnzymeNode.html">EnzymeNode</a></li>
  <li><a href="http://www.bio4j.com/docs/bio4jmodel/apidocs/com/era7/bioinfo/bio4jmodel/relationships/protein/ProteinEnzymaticActivityRel.html">ProteinEnzymaticActivityRel</a> (this relationship connects a protein and the respective enzyme nodes associated to it)</li>
</ul>

<p>All properties found have been incorporated to the enzyme node including:</p>

<ul>
  <li>ID</li>
  <li>Official name</li>
  <li>Alternate names</li>
  <li>Cofactors</li>
  <li>Comments</li>
  <li>Catalytic activity</li>
  <li>Prosite cross-references</li>
</ul>

<h2 id="node-type-indexing">Node type indexing</h2>

<p>From now on, every node in the database has an indexed property <em><strong>nodeType</strong></em> holding its type. That way you can now access all nodes of a specific type really easily.</p>

<h2 id="availability-in-all-regions">Availability in all Regions</h2>

<p><a href="http://aws.amazon.com"><img class="right" src="http://d36cz9buwru1tt.cloudfront.net/logo_aws.gif" /></a></p>

<p>The AWS region you are based in won’t be a problem for using Bio4j anymore. EBS Snapshots have been created in all regions, and the CloudFormation templates have been updated so that they can be used regardless of the region where you want to create the stack.</p>

<blockquote>
  <p>Only the Asia Pacific (Singapore) <code>ap-southeast-1</code> region is not ready, due to ongoing issues on the AWS side with extremely slow S3 object downloads. I hope we can find a workaround for this soon!</p>
</blockquote>

<h2 id="new-cloudformation-templates">New CloudFormation templates</h2>

<h3 id="basic-bio4j-instance-updated">Basic Bio4j instance (updated)</h3>

<p>The basic Bio4j instance template has been updated so that you can now use it from all regions. Check out more info about this in the <a href="http://blog.bio4j.com/2011/12/bio4j-aws-cloudformation-your-own-fresh-baked-db-in-less-than-a-minute/"><strong>updated blog post</strong></a>.</p>

<h3 id="basic-bio4j-rest-server">Basic Bio4j REST server</h3>

<p>A new template has been developed so that you can easily deploy your Neo4j-Bio4j REST server in less than a minute.</p>

<p>This template is available in the following address:</p>

<ul>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/Bio4jBasicRestServerTemplate.txt"><strong>https://s3-eu-west-1.amazonaws.com/bio4j-public/Bio4jBasicRestServerTemplate.txt</strong></a></li>
</ul>

<p>The steps to create the stack are really simple. As a guide, you can follow <a href="http://blog.ohnosequences.com/2011/12/neo4j-server-and-aws-become-good-friends/"><strong>this blog post</strong></a> about the template I created for deploying a general Neo4j server; <em>only one or two parameters vary</em>.</p>

<h2 id="bio4j-rest-server">Bio4j REST server</h2>

<p>Once you get your server running thanks to the useful template I just mentioned before, using Neo4j WebAdmin with Bio4j as source you will be able to:</p>

<h3 id="explore-you-database-with-the-data-browser">Explore your database with the Data browser</h3>

<p>Using the data browser tab of the Web administration tool you can explore in real-time the contents of Bio4j!</p>

<p><img src="http://bio4j.com/images/Bio4jDataBrowser-1024x699.png" /></p>

<p>In order to get visualizations like the one shown above, you should make use of <strong>visualization profiles</strong>. There you can specify different styles associated to customizable rules, which can be expressed in terms of the node properties. Here’s a screenshot showing what the visualization profile I used for the visualization above looks like:</p>

<p><img src="http://bio4j.com/images/Bio4jDataBrowserVizProfile-1024x752.png" /></p>

<blockquote>
  <p>Just beware of one thing, the behavior of the tool is such that it does not distinguish between highly connected nodes and more isolated ones. Because of this, clicking nodes such as <strong>Trembl</strong> dataset node is not advisable unless you want to see it freeze forever -<em>this node has more than 15 million relationships linking it to proteins</em>.</p>
</blockquote>

<h2 id="run-queries-with-cypher">Run queries with Cypher</h2>

<p>Cypher what?!</p>

<blockquote>
  <p><a href="http://docs.neo4j.org/chunked/milestone/cypher-query-lang.html"><img class="right" src="http://a1.twimg.com/profile_images/195275920/square-logo-no-text-2_normal.png" /></a>
<strong>Cypher</strong> is a <strong>declarative language</strong> which allows for expressive and efficient querying of the graph store without having to write traversers in code. It <strong>focuses on the clarity of expressing what to retrieve</strong> from a graph, <strong>not how to do it</strong>, in contrast to imperative languages like Java, and scripting languages like Gremlin.</p>
</blockquote>

<p>A query to retrieve protein interaction circuits of length 3 with proteins belonging to Swiss-Prot dataset (limited to 5 results) would look like this in Cypher:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class=""><span class="line">START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
</span><span class="line"> MATCH d &lt;-[r:PROTEIN_DATASET]- p, 
</span><span class="line"> circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-&gt; (p2) -[:PROTEIN_PROTEIN_INTERACTION]-&gt; (p3) -[:PROTEIN_PROTEIN_INTERACTION]-&gt; (p)
</span><span class="line"> return p.accession, p2.accession, p3.accession, p.accession
</span><span class="line"> limit 5</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>If you want to check out more examples of Bio4j + Cypher, check our <a href="https://github.com/bio4j/Bio4j/wiki/Bio4j-cypher-cheat-sheet"><strong>Bio4j cypher cheat sheet</strong></a> that we will be updating from time to time.</p>

<h2 id="querying-bio4j-with-gremlin">Querying Bio4j with Gremlin</h2>

<p>Gremlins? What do they have to do with Bio4j!?</p>

<blockquote>
  <p><a href="https://github.com/tinkerpop/gremlin/wiki"><img class="right" src="https://raw.github.com/tinkerpop/gremlin/master/doc/images/gremlin-standing-small.png" /></a>
<strong>Gremlin is a graph traversal language that can be natively used in various JVM languages</strong> - it currently provides native support for Java, Groovy, and Scala. However, it can express in a few lines of code what it would take many, many lines of code in Java to express.</p>
</blockquote>

<p>Querying the proteins associated with the InterPro motif with id <code>IPR023306</code> in Bio4j with Gremlin would look like this (limited to 5 results):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class=""><span class="line">gremlin&gt; g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..4]
</span><span class="line">==&gt; E2GK26
</span><span class="line">==&gt; G3PMS4
</span><span class="line">==&gt; G3Q865
</span><span class="line">==&gt; G3PIL8
</span><span class="line">==&gt; G3NNA4
</span><span class="line">gremlin&gt; </span></code></pre></td></tr></table></div></figure></notextile></div>

<p>If you want to check out more examples of Bio4j + Gremlin, check our <a href="https://github.com/bio4j/Bio4j/wiki/Bio4j-gremlin-cheat-sheet"><strong>Bio4j gremlin cheat sheet</strong></a> that we will be updating from time to time.</p>

<h2 id="bug-fixes">Bug fixes</h2>

<ol>
  <li><strong>Dataset nodes:</strong> There was a bug in the importing process which resulted in the creation of a new dataset node every time a new Uniprot entry was stored. Now everything’s fine!</li>
</ol>

<p>So that’s all for now! I hope you enjoy all these changes and new features I’ve been working on over the last couple of months. As always, feel free to send any feedback you may have; I’m looking forward to it ;)</p>

<p><a href="http://www.twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Using Bio4j + Neo4j Graph-algo component for finding protein-protein interaction paths]]></title>
    <link href="http://bio4j.com/blog/2011/12/using-bio4j-neo4j-graph-algo-component-for-finding-protein-protein-interaction-paths/"/>
    <updated>2011-12-19T22:35:41+01:00</updated>
    <id>http://bio4j.com/blog/2011/12/using-bio4j-neo4j-graph-algo-component-for-finding-protein-protein-interaction-paths</id>
    <content type="html"><![CDATA[<p>Hi all!</p>

<p>Today I managed to find some time to check out the <a href="http://wiki.neo4j.org/content/Graph-algo"><strong>Graph-algo component</strong></a> from Neo4j, and after playing with it and Bio4j for a bit, I have to say it seems pretty cool.
For those who don’t know what I’m talking about, here is the description from the Neo4j wiki:</p>

<blockquote>
  <p>This is a component that offers implementations of common graph algorithms on top of Neo4j. It is mostly focused around finding paths, like finding the shortest path between two nodes, but it also contains a few different centrality measures, like betweenness centrality for nodes.</p>
</blockquote>

<p>The algorithm for finding the <strong>shortest path between two nodes</strong> caught my attention and I started to wonder how I could try it on the data included in Bio4j. I then realized that <strong>protein-protein interactions</strong> could be a good candidate, so I got down to work and created the utility method:</p>

<ul>
  <li><code>findShortestInteractionPath(ProteinNode proteinSource, ProteinNode proteinTarget, int maxDepth, int maxResultsNumber)</code></li>
</ul>

<p>which returns at most <code>maxResultsNumber</code> paths between <code>proteinSource</code> and <code>proteinTarget</code> with a maximum path depth of <code>maxDepth</code>.
You can check the <a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/algo/InteractionsPathFinder.java"><strong>source code here</strong></a>.</p>

<p>I also wrote a <strong><a href="https://github.com/bio4j/Bio4jTools/blob/develop/src/com/era7/bioinfo/bio4j/tools/algo/FindInteractionPaths.java">small test program</a></strong> which prints out the paths found between two proteins.</p>
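
<p>To make the role of the depth and result limits more concrete, here is a minimal sketch of a bounded shortest-path search in plain Python over an in-memory dictionary. It is only an illustration: it does not use the Neo4j Graph-algo API and is not the code from Bio4jTools. The function name, the dictionary layout and the accessions <code>P1</code> to <code>P5</code> are made up for the example, and it assumes interactions are directed and that <code>max_results</code> is at least 1.</p>

<pre><code class="python">from collections import deque


def find_shortest_interaction_paths(interactions, source, target, max_depth, max_results):
    """Return up to max_results shortest paths from source to target.

    interactions maps a protein accession to the list of accessions it
    interacts with (directed edges). Paths with more than max_depth
    edges are never returned. Hypothetical helper, for illustration only.
    """
    if source == target:
        return [[source]]
    found = []
    # Depth (number of edges) at which each protein was first reached.
    depth_of = {source: 0}
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        depth = len(path) - 1
        # Paths leave the queue in order of length, so we can stop as soon as
        # extending would exceed the depth limit or the shortest length found.
        if depth == max_depth or (found and depth == len(found[0]) - 1):
            break
        for neighbor in interactions.get(path[-1], []):
            if neighbor == target:
                found.append(path + [neighbor])
                if len(found) == max_results:
                    return found
            elif depth_of.setdefault(neighbor, depth + 1) == depth + 1:
                # Reached at its minimal depth, possibly via another parent,
                # so every distinct shortest path is kept.
                queue.append(path + [neighbor])
    return found


# Made-up interaction data, not real Bio4j content.
interactions = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P3": ["P4"],
    "P4": ["P5"],
    "P5": ["P1"],
}

for path in find_shortest_interaction_paths(interactions, "P1", "P5", 4, 10):
    print(" - ".join(path))
</code></pre>

<p>In this toy graph there are two equally short routes from <code>P1</code> to <code>P5</code>, so lowering <code>max_results</code> to 1 keeps only the first one found, while lowering <code>max_depth</code> to 2 returns nothing because both routes need three interactions.</p>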

<p>Even though I would have liked a wider choice of algorithms, it’s really cool to have at least this small set already implemented, saving you from the low-level coding. 
Apart from that, I’ve been thinking about how <strong>Bio4j could open a lot of doors for topology/network analysis around all the data it includes</strong>. Such analysis could otherwise be quite hard to perform for several reasons, such as the lack of data integration between different data sources and storage paradigms that limit topology/network analysis, among others… </p>

<p><strong>With Bio4j however, you just have to move around the nodes and get the information you’re looking for!</strong> ;)</p>

<p><a href="http://www.twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>

<h2 id="comments">comments</h2>

<ul>
  <li>
    <p><strong>alper yilmaz</strong>
it’s getting more interesting.. :)
that’s what I meant by “data-mining” during our skype conference.. cool..</p>
  </li>
  <li>
    <p><strong>Roji</strong>
I follow neo4j with much interest. It is a novel approach, however I think property searches are very important and neo4j is not very good at this. So for example, implementing a complete social website with millions of users would not be feasible with neo4j, I think. I am not sure, of course. What is also interesting is the emergence of native XML databases. They also solve the impedance mismatch to a certain extent. However their models are trees, not graphs; graphs are more general in this sense, but I think more optimizations are possible if you choose trees.</p>

    <ul>
      <li><strong>ppareja</strong>
  Hi Roji,
  Could you provide some reasons why you think property searches are not good with Neo4j?
  Regarding XML databases and other tree-oriented options, they definitely are great for many use cases, however when you have to deal with highly connected data they may not be enough. The case depicted in this blog post is a good example, even just modelling protein-protein interactions would not be possible with a tree – you get plenty of cycles which cannot be expressed with that paradigm…</li>
    </ul>
  </li>
</ul>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j + AWS CloudFormation = your own fresh baked DB in less than a minute!]]></title>
    <link href="http://bio4j.com/blog/2011/12/bio4j-aws-cloudformation-your-own-fresh-baked-db-in-less-than-a-minute/"/>
    <updated>2011-12-08T17:37:17+01:00</updated>
    <id>http://bio4j.com/blog/2011/12/bio4j-aws-cloudformation-your-own-fresh-baked-db-in-less-than-a-minute</id>
    <content type="html"><![CDATA[<p><strong>UPDATE:</strong> You can now use this template from <strong>all zones but <code>ap-southeast-1</code>!</strong></p>

<p>Hi!</p>

<p>So this week it was time to finally start using <strong><a href="http://aws.amazon.com/cloudformation/">CloudFormation</a></strong> together with Bio4j. For those not familiar with this <strong><a href="http://aws.amazon.com/">AWS</a></strong> service, quoting from their site: </p>

<blockquote>
  <p>AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.</p>
</blockquote>

<p>This is really useful because, thanks to CloudFormation templates, you don’t have to worry about manually launching an instance, creating a volume, attaching it, doing some stuff, and then freeing the resources… <strong>You can encapsulate all these tasks in a template</strong>, reducing them to <strong>just two</strong>: </p>

<ol>
  <li><strong>create</strong> the stack</li>
  <li><strong>delete</strong> the stack whenever you are done with it</li>
</ol>

<p>This template is available at the following address:</p>

<ul>
  <li><a href="https://s3-eu-west-1.amazonaws.com/bio4j-public/Bio4jBasicInstanceTemplate.txt"><strong>Bio4jBasicInstanceTemplate.txt</strong></a></li>
</ul>

<p>So, let’s see how easy it actually is to create your stack. First you should go to the <code>CloudFormation</code> tab in the amazon console and click the button: <code>Create New Stack</code>:</p>

<p><img src="http://bio4j.com/images/CloudFormationCreateStackScreenShot.jpg" /></p>

<p>You will now see a new window where you should choose the option <strong>Provide a template URL</strong> and paste in the URL given above. You should also give your stack a name by filling in the field <code>Stack name</code>. Then click <code>Continue</code>.</p>

<p><img src="http://bio4j.com/images/CreateStackSecondStepScreenShot.jpg" /></p>

<p>Ok, now you should be seeing this:</p>

<p><img src="http://bio4j.com/images/CreateStackThirdStepScreenShot1.jpg" /></p>

<p>Then provide your key-pair name and availability zone, and finally enter the type of instance you want to launch.
Once you have clicked <code>Continue</code> you’ll see a review of all the parameters you have entered so far, like this:</p>

<p><img src="http://bio4j.com/images/CreateStackFourthStepScreenShot1.jpg" /></p>

<p>Check that everything is as you wish and click <code>Continue</code>.
You should then see something like this:</p>

<p><img src="http://bio4j.com/images/CreateStackFifthStepScreenShot.jpg" /></p>

<p>Now you just have to wait about 30 seconds until, after refreshing, the stack state turns green and says “CREATE_COMPLETE”. Click on the output tab and you will see the IP address you need to connect to your new instance with SSH.</p>

<p><img src="http://bio4j.com/images/CreateStackSixthStepScreenShot.jpg" /></p>

<p>So now you just have to connect to your instance and you should have your fresh baked Bio4j DB under the folder <code>/mnt/bio4j_volume/bio4jdb</code> ;)</p>

<p>Whenever you are done, just select delete stack in the Amazon console and don’t worry about terminating your instance or deleting your volume: CloudFormation will do it for you!</p>

<blockquote>
  <p>The only thing you have to do is umount the volumes you have attached, it seems that CloudFormation cannot do that for you right now…</p>
</blockquote>

<p><a href="http://www.twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Cool GO annotation visualizations with Gephi + Bio4j]]></title>
    <link href="http://bio4j.com/blog/2011/11/cool-go-annotation-visualizations-with-gephi-bio4j/"/>
    <updated>2011-11-29T17:46:56+01:00</updated>
    <id>http://bio4j.com/blog/2011/11/cool-go-annotation-visualizations-with-gephi-bio4j</id>
    <content type="html"><![CDATA[<p>Hi everyone!</p>

<p>After a few months without finding the opportunity to play with <a href="http://gephi.org">Gephi</a>, it was about time to dedicate a lab day to it.
I thought that a good feature would be to offer the equivalent <strong>.gexf file</strong> for the graph representation currently available in the tab “GoAnnotation Graph Viz”, so that you could play around with it in Gephi and adapt it to your specific needs.
Then I got down to work and this is what happened:</p>

<p>First of all, I was really happy to see that there was a new version of Gephi (0.8), as well as a good bunch of new (at least for me… :D) layout algorithm plugins such as Parallel Force Atlas, Circular Layout or Layered Layout. So once I had downloaded and installed everything, I started to have some fun with it and got to know how filters work <em>(I hadn’t used them before).</em> 
Even though I got stuck a couple of times trying to figure out how to use some of them, I easily solved these small setbacks thanks to the great support in the <a href="https://forum.gephi.org/">Gephi forums</a>, where they quickly answered my newbie questions. Thanks, Gephi team!</p>

<p>As a source for the graph I used the <a href="https://s3-eu-west-1.amazonaws.com/pablo-tests/EHECAnnotationVersion2.xml">public EHEC GO annotations</a> we did for the <strong><a href="https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki">E. coli O104:H4 Genome Analysis Crowdsourcing</a></strong> we coordinated last summer and chose the <strong>Molecular Function sub-ontology</strong> for the visualization.</p>

<p>When I first loaded the gexf file in Gephi without applying any kind of filters this is what I got:</p>

<p><img src="http://bio4j.com/images/EHECMolFuncDraft.png" /></p>

<p>As you (maybe) can see, the size of GO term nodes is proportional to the number of proteins they annotate; still it pretty much looks just like a big hair-ball…</p>

<p>Then I applied the following set of filters:</p>

<p><img src="http://bio4j.com/images/EHECMolFuncFilters.jpg" /></p>

<p>in order to get the GO terms with at least 6 protein annotations plus the proteins annotated with these terms <em>(their neighborhoods)</em>; and this is what it looked like (after applying a <em>Parallel Force Atlas</em> layout algorithm):</p>

<p><img src="http://bio4j.com/images/EHECMolFuncPreliminary.png" /></p>

<p>I then decided to get rid of the protein labels, since there were way too many of them and they were not that useful to see; for that I used the option “Hide nodes/edges labels if not in filtered graph”.
After doing this and applying the black background preview setting, the visualization finally looked pretty decent:</p>

<p><img src="http://bio4j.com/images/EHECMolFuncFinal.png" /></p>

<p>Please go <a href="http://bio4j.com/imgs/EHEC_MolecularFunction_SeaDragon/">here</a> to check the version exported with the <a href="https://gephi.org/plugins/seadragon/">Sea Dragon plugin</a>, where you can zoom and move around!</p>

<p>Well, if you like the result <em>(or you don’t but you want to play with this and get a better viz!)</em>, I just uploaded a new version of <a href="http://gotools.bio4j.com:8080/Bio4jTestServer/Bio4jGoToolsWeb.html">Bio4j GO Tools</a> viewer where you can download the corresponding .gexf file for your GO annotations XML file. 
Just press the button highlighted in the screenshot and enter the URL for your GO annotations XML file:</p>

<p><img src="http://bio4j.com/images/gexfButtonBio4jGOToolsScreenshot.jpg" /></p>

<p>You can use the public EHEC GO annotation results URL I used as a sample for this post: </p>

<ul>
  <li><code>https://s3-eu-west-1.amazonaws.com/pablo-tests/EHECAnnotationVersion2.xml</code></li>
</ul>

<p>So, that’s all for now, please let me know if you play around with this and get some cool visualizations!</p>

<p><a href="https://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>

<h2 id="comments">comments</h2>

<ul>
  <li>
    <p><strong>Amrit</strong> 
Good to know. Does it take expression data as well? I have expression data with gene name and probe ID only. Could you tell me whether it would work for this kind of data? Thank you so much for your help.</p>

    <ul>
      <li><strong>Pablo Pareja</strong> 
Hi Amrit,
There is no restriction on the input data; the only thing is that the tool expects Uniprot accessions as parameters. You would just need to map your gene names to Uniprot accessions using an ID mapping tool such as the one available on the Uniprot website:
http://www.uniprot.org/
(ID mapping tab)
Cheers,
Pablo</li>
    </ul>
  </li>
</ul>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Bio4j Thanksgiving treats!]]></title>
    <link href="http://bio4j.com/blog/2011/11/bio4j-thanksgiving-treats/"/>
    <updated>2011-11-23T17:53:43+01:00</updated>
    <id>http://bio4j.com/blog/2011/11/bio4j-thanksgiving-treats</id>
    <content type="html"><![CDATA[<p>Hi all!</p>

<p>Thanksgiving is almost here and, just in time, we have a lot of special treats for you:</p>

<h2 id="new-github-organization">New github organization</h2>

<p>All bio4j related repositories are now under the organization <a href="https://github.com/bio4j">bio4j</a> in github. </p>

<h2 id="new-wikis">New wiki(s)</h2>

<p>The old wiki hosted at wiki.bio4j.com has been moved to the corresponding <a href="https://github.com/bio4j/Bio4j/wiki">Bio4j repository wiki</a>. 
More information has been added and the existing content has been restructured. Besides that, new wikis are being created for each of the bio4j related tool repositories. </p>

<h2 id="ncbi-taxonomy">NCBI taxonomy</h2>

<p>We’re happy to announce the official incorporation of <a href="http://www.ncbi.nlm.nih.gov/Taxonomy/">NCBI taxonomy</a> data into Bio4j DB, as well as an index for retrieving NCBI taxons from GenInfo identifiers (GI), so there’s no longer any need to parse that huge <a href="ftp://ftp.ncbi.nih.gov/pub/taxonomy/">gi_taxid_nucl NCBI table</a> to do that. No changes have been made to the Uniprot taxonomy, but you can now navigate to the equivalent NCBI taxon using the relationship <a href="http://www.bio4j.com/docs/bio4jmodel/apidocs/com/era7/bioinfo/bio4jmodel/relationships/ncbi/NCBITaxonRel.html">NCBITaxonRel</a>.</p>

<h2 id="reactome-terms">Reactome terms</h2>

<p>We’ve added the Reactome term references found in Uniprot files, so from now on you can retrieve both all the terms associated with a specific protein and all the proteins associated with a specific term.</p>

<h2 id="new-ebs-snapshot-for-this-release">New EBS snapshot for this release</h2>

<p>For those using <a href="http://aws.amazon.com">AWS</a> (or willing to…) there’s a new public EBS snapshot containing the latest version of Bio4j DB.
The snapshot details are the following:</p>

<ul>
  <li>Snapshot id: <code>snap-aa5cd3c2</code></li>
  <li>Snapshot region: EU West (Ireland)</li>
  <li>Snapshot size: 100 GB</li>
</ul>

<p>Bio4j DB is under the folder <code>bio4jdb</code>.
In order to use it, just create a <a href="http://www.bio4j.com/docs/bio4jmodel/apidocs/com/era7/bioinfo/bio4jmodel/util/Bio4jManager.html">Bio4jManager</a> instance and start navigating the graph!</p>

<h2 id="up-2011-bio4j-presentation">UP 2011 Bio4j presentation</h2>

<p><a href="http://up-con.com/"><img src="http://d34uahzum2tefy.cloudfront.net/UP2011_1_LOGOv2.png" title="UP Cloud Computing Conference 2011" /></a></p>

<p>We’re really pleased to announce that we will be presenting at this year’s UP 2011 Cloud Computing Conference. The presentation will be held on day 4, Thursday, December 8, 2011. Check the conference agenda <a href="http://up-con.com/agenda">here</a>.</p>

<p>Enjoy!</p>

<p>and happy Thanksgiving!  ;)</p>

<p><a href="http://twitter.com/pablopareja"><strong>@pablopareja</strong></a></p>

]]></content>
  </entry>
  
</feed>