Tree of Life
The Tree of Life (ToL) Web Project is a collection of information about biodiversity, compiled collaboratively by hundreds of expert and amateur contributors, with the goal of maintaining information on every species and group of organisms on Earth, living or extinct. The Web Project itself is a collection of web pages complete with text and images, but what better way to navigate and browse the hierarchy of Life, whose connections follow phylogenetic branching patterns between groups of organisms, than to model and query it all in a Graph Database?! I chose Neo4j, the leader in Graph Databases, for the job.
Note: This post assumes you have a basic understanding of Neo4j and its query language, Cypher.
The ToL Project exposes two web services: a Group ID service, which takes a single string query parameter, group (the name of the group you want to locate in the ToL database), and a Tree Structure service, which returns the tree structure for a given group, complete with all of its descendants. The response of the Tree Structure service is wrapped in a single XML element, <TREE>. The <TREE> element contains a single <NODE> element, which can in turn contain a <NODES> element holding zero or more child <NODE> elements, and the pattern repeats all the way down. While the current tree structure is by no means complete, it does extend to the actual leaves of the tree (species, subspecies, populations, or strains) in some branches; these are marked by a special LEAF attribute on the <NODE> elements. The Graph can therefore be constructed from the Tree Lineage (the ANCESTORWITHPAGE attribute of <NODE>), from the Tree Branches (<NODE><NODES>…), or from both.
First, download the ToL XML feed from:
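To make the nesting concrete, here is a minimal Python sketch that walks a tiny document of the same shape. The sample XML is hand-written for illustration (the ids and names are made up, and treating LEAF="1" as "is a leaf" is an assumption), so take it as a picture of the structure rather than real service output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample that mimics the <TREE>/<NODE>/<NODES> shape;
# it is NOT real output from the Tree Structure service.
SAMPLE = """
<TREE>
  <NODE ID="1" EXTINCT="0" LEAF="0" ANCESTORWITHPAGE="0">
    <NAME>root group</NAME>
    <NODES>
      <NODE ID="2" EXTINCT="0" LEAF="0" ANCESTORWITHPAGE="1">
        <NAME>child group</NAME>
      </NODE>
      <NODE ID="3" EXTINCT="2" LEAF="1" ANCESTORWITHPAGE="1">
        <NAME>extinct leaf</NAME>
      </NODE>
    </NODES>
  </NODE>
</TREE>
"""

def walk(node, parent_id=None):
    """Yield one dict per <NODE>, depth first, with branch parent and lineage id."""
    yield {
        "id": node.get("ID"),
        "name": (node.findtext("NAME") or "").strip(),
        "leaf": node.get("LEAF") == "1",  # assumption: "1" marks a leaf
        "ancestor_with_page": node.get("ANCESTORWITHPAGE"),
        "parent": parent_id,
    }
    # Children live one level down, inside the <NODES> wrapper element.
    for child in node.findall("NODES/NODE"):
        yield from walk(child, node.get("ID"))

root = ET.fromstring(SAMPLE.strip()).find("NODE")
rows = list(walk(root))
for row in rows:
    print(row)
```

Each row carries both the enclosing node (the branch) and the ANCESTORWITHPAGE value (the lineage), which are the two sources the Graph can be built from.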
http://tolweb.org/onlinecontributors/app?service=external&page=xml/TreeStructureService&node_id=1&optype=0
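If you would rather script the download, a sketch along these lines could work. The helper names are hypothetical, the target file name is simply the one used in the load queries in this post, and the download step assumes the service is reachable from your machine.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://tolweb.org/onlinecontributors/app"

def tree_structure_url(node_id=1):
    """Build the Tree Structure service URL for a given node id."""
    params = {
        "service": "external",
        "page": "xml/TreeStructureService",
        "node_id": node_id,
        "optype": 0,
    }
    # safe="/" keeps the slash in the page parameter unescaped, as in the URL above.
    return BASE_URL + "?" + urlencode(params, safe="/")

def download(node_id=1, target="lifeonearth.xml"):
    """Fetch the feed and save it locally; needs network access and a live service."""
    with urlopen(tree_structure_url(node_id)) as response, open(target, "wb") as out:
        out.write(response.read())
    return target

print(tree_structure_url(1))
```

For a file:/// URL to resolve in apoc.load.xml, the saved file typically needs to sit in Neo4j's import directory.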
Once that’s done, load the XML feed into Neo4j using an APOC utility procedure (apoc.load.xml). I’m going to construct the Graph from the Tree Lineage first. Since all legitimate nodes have a non-empty NAME value, we use an XPath expression to load only those.
CALL apoc.periodic.iterate(
'CALL apoc.load.xml("file:///lifeonearth.xml","//NODE[string(NAME)!=\'\']")
YIELD value
WITH value.ID AS ID,
[x IN value._children WHERE x._type="NAME"|x._text][0] AS name,
[x IN value._children WHERE x._type="DESCRIPTION"|x._text][0] AS desc,
value.EXTINCT AS extinct,
value.ANCESTORWITHPAGE AS ancestorID,
apoc.coll.flatten([x IN value._children WHERE x._type = \'OTHERNAMES\'|[y IN x._children WHERE y._type = \'OTHERNAME\' | [z IN y._children WHERE z._type = \'NAME\' | z._text]]][0]) AS synonyms
RETURN trim(ID) AS ID, toLower(trim(name)) AS name, toLower(trim(desc)) AS desc, trim(extinct) AS extinct, trim(ancestorID) AS ancestorID,
[x IN synonyms WHERE x IS NOT NULL | trim(toLower(x))] AS synonyms',
'MERGE (n:Specie {id:ID})
SET n.name = name,
n.desc = desc,
n.extinct = extinct,
n.synonyms = CASE WHEN size(synonyms) > 0 THEN synonyms ELSE NULL END
MERGE (m:Specie {id:ancestorID})
MERGE (n)-[:BELONGS_TO]->(m)',
{batchSize:10000, parallel:false} // sequential batches: parallel MERGEs on shared ancestor nodes can deadlock or create duplicates
)
We then label the extinct species in the ToL (those loaded with an EXTINCT value of 2).
MATCH (n:Specie)
WHERE n.extinct = '2'
SET n:Extinct
With that, we can start to explore the ToL! Let’s start by examining how deep the Tree is:
MATCH (n:Specie {id: "1"})
CALL apoc.path.expandConfig(n, {
relationshipFilter: "<BELONGS_TO",
uniqueness: "RELATIONSHIP_GLOBAL",
minLevel: 1
})
YIELD path
RETURN length(path) AS level, COUNT(path) AS species
ORDER BY level
╒═══════╤═════════╕
│"level"│"species"│
╞═══════╪═════════╡
│1 │5 │
├───────┼─────────┤
│2 │64 │
├───────┼─────────┤
│3 │714 │
├───────┼─────────┤
│4 │821 │
├───────┼─────────┤
│5 │2373 │
├───────┼─────────┤
│6 │961 │
├───────┼─────────┤
│7 │1072 │
├───────┼─────────┤
│8 │1104 │
├───────┼─────────┤
│9 │1157 │
├───────┼─────────┤
│10 │1722 │
├───────┼─────────┤
│11 │2553 │
├───────┼─────────┤
│12 │2610 │
├───────┼─────────┤
│13 │5509 │
├───────┼─────────┤
│14 │5916 │
├───────┼─────────┤
│15 │9712 │
├───────┼─────────┤
│16 │12596 │
├───────┼─────────┤
│17 │11550 │
├───────┼─────────┤
│18 │13621 │
├───────┼─────────┤
│19 │3607 │
├───────┼─────────┤
│20 │1218 │
├───────┼─────────┤
│21 │1078 │
├───────┼─────────┤
│22 │251 │
├───────┼─────────┤
│23 │996 │
├───────┼─────────┤
│24 │1625 │
├───────┼─────────┤
│25 │1451 │
├───────┼─────────┤
│26 │346 │
├───────┼─────────┤
│27 │879 │
├───────┼─────────┤
│28 │176 │
├───────┼─────────┤
│29 │395 │
└───────┴─────────┘
That’s 29 levels deep. Let’s see what the lineage of ‘us’, the Modern Humans (Homo sapiens), looks like.
Next, we can validate some of the species structures against what the ToL Web Project depicts. I’m going with a couple of ancestors in the ‘Modern Humans’ lineage, and then closing with the group in question.
//first ancestor in the lineage
MATCH (n:Specie {id:'2374'})
RETURN
n.name AS specie,
apoc.text.join([p=(n)-[:BELONGS_TO*]->(m)|[x IN NODES(p)|x.name][-1]],' >> ') AS `containing groups`,
[(n)-[:BELONGS_TO]->(p)<-[:BELONGS_TO]-(s)|s.name] AS others, [(n)<-[:BELONGS_TO]-(m)|m.name] AS subgroups;
//second ancestor in the lineage
MATCH (n:Specie {id:'15963'})
RETURN
n.name AS specie,
apoc.text.join([p=(n)-[:BELONGS_TO*]->(m)|[x IN NODES(p)|x.name][-1]],' >> ') AS `containing groups`,
[(n)-[:BELONGS_TO]->(p)<-[:BELONGS_TO]-(s)|s.name] AS others, [(n)<-[:BELONGS_TO]-(m)|m.name] AS subgroups;
//the group in question
MATCH (n:Specie {id:'16421'})
RETURN
n.name AS specie,
apoc.text.join([p=(n)-[:BELONGS_TO*]->(m)|[x IN NODES(p)|x.name][-1]],' >> ') AS `containing groups`,
[(n)-[:BELONGS_TO]->(p)<-[:BELONGS_TO]-(s)|s.name] AS others, [(n)<-[:BELONGS_TO]-(m)|m.name] AS subgroups;
Observe that ‘Homo sapiens’ is a leaf node in the ToL, with no descendants or subgroups of its own.
Everything in Biology makes more sense in the light of Phylogeny. Broad knowledge of the relationships between species is fundamental to the Discovery of Medicines, the combatting of Diseases, Crop Improvement, Conservation Efforts (both of species and of Biodiversity Hotspots), the Discovery of Cryptic Species, the Response to Climate Change, Forensics, Ecosystem Services, and even Moral Responsibility & Mental Healing. In short, knowledge of relationships matters! That is why it makes sense to model the Phylogeny in the ToL Graph: to understand how species have evolved over time, how closely they’re related, what traits they may have inherited down the chain, and so on. More on the topic here;
For the Phylogenetic Tree, we’re going to load all <NODE> elements and map all of the branches using the <NODES> element.
//create phylogenetic tree
CALL apoc.periodic.iterate(
'CALL apoc.load.xml("file:///lifeonearth.xml","//NODE")
YIELD value
WITH value.ID AS ID,
[x IN value._children WHERE x._type="NAME"|x._text][0] AS name,
[x IN value._children WHERE x._type="DESCRIPTION"|x._text][0] AS desc,
value.EXTINCT AS extinct,
value.ANCESTORWITHPAGE AS ancestorID,
value.CONFIDENCE AS confidence,
value.PHYLESIS AS phylesis,
value.INCOMPLETESUBGROUPS AS incompletesubgroups,
apoc.coll.flatten([x IN value._children WHERE x._type = \'OTHERNAMES\'|[y IN x._children WHERE y._type = \'OTHERNAME\' | [z IN y._children WHERE z._type = \'NAME\' | z._text]]][0]) AS synonyms,
apoc.coll.flatten([x IN value._children WHERE x._type = \'NODES\'|[y IN x._children WHERE y._type = \'NODE\' | y.ID]]) AS children
RETURN trim(ID) AS ID, toLower(trim(name)) AS name, toLower(trim(desc)) AS desc, trim(extinct) AS extinct, trim(ancestorID) AS ancestorID, confidence, phylesis, incompletesubgroups,
[x IN synonyms WHERE x IS NOT NULL | trim(toLower(x))] AS synonyms, children',
'MERGE (n:Specie {id:ID})
SET n.name = name,
n.desc = desc,
n.extinct = extinct,
n.confidence = confidence,
n.phylesis = phylesis,
n.incompletesubgroups = incompletesubgroups,
n.synonyms = CASE WHEN size(synonyms) > 0 THEN synonyms ELSE NULL END,
n.children = CASE WHEN size(children) > 0 THEN children ELSE NULL END',
{batchSize:10000, parallel:true}
)
//create branches
CALL apoc.periodic.iterate(
'MATCH (n:Specie)
WHERE n.children IS NOT NULL
RETURN ID(n) AS ID',
'MATCH (p:Specie) WHERE ID(p) = ID
WITH p
UNWIND p.children AS x
MATCH (c:Specie {id:x})
MERGE (c)-[:OF_BRANCH]->(p)',
{batchSize:1000}
)
//label extinct nodes
MATCH (n:Specie)
WHERE n.extinct = '2'
SET n:Extinct
Next, let’s check that it all lines up by taking an example Branch, say ‘Eutheria’, and comparing it with the ToL Web Project. We’ll use Neo4j Bloom for its hierarchical layout (which makes it easier to visualize levels and map them), with the following search phrase:
MATCH p=(n:Specie)-[:OF_BRANCH*]->(pb:Specie {id:$specie})
WHERE EXISTS((n)-[:BELONGS_TO]->(:Specie {id:$ancestor}))
RETURN p
We’re now ready to explore Phylogeny! Let’s start by looking at how deep the Phylogenetic Tree is:
MATCH (n:Specie {id: "1"})
CALL apoc.path.expandConfig(n, {
relationshipFilter: "<OF_BRANCH",
uniqueness: "RELATIONSHIP_GLOBAL",
minLevel: 1
})
YIELD path
RETURN length(path) AS level, COUNT(path) AS species
ORDER BY level
╒═══════╤═════════╕
│"level"│"species"│
╞═══════╪═════════╡
│1 │4 │
├───────┼─────────┤
│2 │37 │
├───────┼─────────┤
│3 │294 │
├───────┼─────────┤
│4 │289 │
├───────┼─────────┤
│5 │358 │
├───────┼─────────┤
│6 │179 │
├───────┼─────────┤
│7 │111 │
├───────┼─────────┤
│8 │228 │
├───────┼─────────┤
│9 │372 │
├───────┼─────────┤
│10 │307 │
├───────┼─────────┤
│11 │373 │
├───────┼─────────┤
│12 │438 │
├───────┼─────────┤
│13 │608 │
├───────┼─────────┤
│14 │625 │
├───────┼─────────┤
│15 │836 │
├───────┼─────────┤
│16 │538 │
├───────┼─────────┤
│17 │622 │
├───────┼─────────┤
│18 │646 │
├───────┼─────────┤
│19 │1031 │
├───────┼─────────┤
│20 │1200 │
├───────┼─────────┤
│21 │621 │
├───────┼─────────┤
│22 │693 │
├───────┼─────────┤
│23 │719 │
├───────┼─────────┤
│24 │729 │
├───────┼─────────┤
│25 │653 │
├───────┼─────────┤
│26 │1196 │
├───────┼─────────┤
│27 │2582 │
├───────┼─────────┤
│28 │2989 │
├───────┼─────────┤
│29 │2549 │
├───────┼─────────┤
│30 │5397 │
├───────┼─────────┤
│31 │1889 │
├───────┼─────────┤
│32 │732 │
├───────┼─────────┤
│33 │1419 │
├───────┼─────────┤
│34 │1181 │
├───────┼─────────┤
│35 │2595 │
├───────┼─────────┤
│36 │1341 │
├───────┼─────────┤
│37 │1054 │
├───────┼─────────┤
│38 │1779 │
├───────┼─────────┤
│39 │1477 │
├───────┼─────────┤
│40 │1464 │
├───────┼─────────┤
│41 │1696 │
├───────┼─────────┤
│42 │1735 │
├───────┼─────────┤
│43 │1643 │
├───────┼─────────┤
│44 │2047 │
├───────┼─────────┤
│45 │3339 │
├───────┼─────────┤
│46 │4314 │
├───────┼─────────┤
│47 │3145 │
├───────┼─────────┤
│48 │2836 │
├───────┼─────────┤
│49 │1945 │
├───────┼─────────┤
│50 │2918 │
├───────┼─────────┤
│51 │2958 │
├───────┼─────────┤
│52 │2122 │
├───────┼─────────┤
│53 │2135 │
├───────┼─────────┤
│54 │1738 │
├───────┼─────────┤
│55 │1539 │
├───────┼─────────┤
│56 │1419 │
├───────┼─────────┤
│57 │855 │
├───────┼─────────┤
│58 │832 │
├───────┼─────────┤
│59 │934 │
├───────┼─────────┤
│60 │495 │
├───────┼─────────┤
│61 │538 │
├───────┼─────────┤
│62 │271 │
├───────┼─────────┤
│63 │294 │
├───────┼─────────┤
│64 │271 │
├───────┼─────────┤
│65 │242 │
├───────┼─────────┤
│66 │234 │
├───────┼─────────┤
│67 │247 │
├───────┼─────────┤
│68 │287 │
├───────┼─────────┤
│69 │317 │
├───────┼─────────┤
│70 │206 │
├───────┼─────────┤
│71 │260 │
├───────┼─────────┤
│72 │181 │
├───────┼─────────┤
│73 │190 │
├───────┼─────────┤
│74 │204 │
├───────┼─────────┤
│75 │125 │
├───────┼─────────┤
│76 │86 │
├───────┼─────────┤
│77 │86 │
├───────┼─────────┤
│78 │52 │
├───────┼─────────┤
│79 │34 │
├───────┼─────────┤
│80 │19 │
├───────┼─────────┤
│81 │21 │
├───────┼─────────┤
│82 │15 │
├───────┼─────────┤
│83 │11 │
├───────┼─────────┤
│84 │8 │
├───────┼─────────┤
│85 │25 │
├───────┼─────────┤
│86 │34 │
├───────┼─────────┤
│87 │86 │
├───────┼─────────┤
│88 │247 │
├───────┼─────────┤
│89 │477 │
├───────┼─────────┤
│90 │346 │
├───────┼─────────┤
│91 │346 │
├───────┼─────────┤
│92 │420 │
├───────┼─────────┤
│93 │325 │
├───────┼─────────┤
│94 │507 │
├───────┼─────────┤
│95 │516 │
├───────┼─────────┤
│96 │563 │
├───────┼─────────┤
│97 │390 │
├───────┼─────────┤
│98 │448 │
├───────┼─────────┤
│99 │428 │
├───────┼─────────┤
│100 │402 │
├───────┼─────────┤
│101 │429 │
├───────┼─────────┤
│102 │308 │
├───────┼─────────┤
│103 │168 │
├───────┼─────────┤
│104 │141 │
├───────┼─────────┤
│105 │126 │
├───────┼─────────┤
│106 │64 │
├───────┼─────────┤
│107 │89 │
├───────┼─────────┤
│108 │51 │
├───────┼─────────┤
│109 │61 │
├───────┼─────────┤
│110 │20 │
├───────┼─────────┤
│111 │74 │
├───────┼─────────┤
│112 │31 │
├───────┼─────────┤
│113 │47 │
├───────┼─────────┤
│114 │29 │
├───────┼─────────┤
│115 │28 │
├───────┼─────────┤
│116 │30 │
├───────┼─────────┤
│117 │30 │
├───────┼─────────┤
│118 │30 │
├───────┼─────────┤
│119 │16 │
├───────┼─────────┤
│120 │8 │
├───────┼─────────┤
│121 │2 │
├───────┼─────────┤
│122 │4 │
└───────┴─────────┘
That’s 122 levels deep! Imagine figuring that out in a Relational Database. You’d never see the bottom of it (!). Other things to explore include denoting the ‘Phylesis’ (i.e. degree of relationship certainty) of the tree branches and visualizing it in Neo4j Bloom using conditional formatting, counting how many ‘Polytomies’ exist (branches that split into more than two descendants rather than exactly two), how many Labeled branches exist (group names that are noteworthy, even though they carry little documented information of their own), how many extinct branches exist (extinct taxa with extinct descendants), and so on.
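As a rough illustration of two of these ideas outside the database, the Python sketch below counts groups per level and lists polytomies on a toy tree. The child-to-parent map is hypothetical and merely stands in for the OF_BRANCH relationships.

```python
from collections import Counter, defaultdict

# Hypothetical child -> parent links standing in for OF_BRANCH relationships.
PARENT = {"b": "a", "c": "a", "d": "a", "e": "b", "f": "b", "g": "e"}

def children_of(parent_map):
    """Invert child -> parent links into parent -> [children]."""
    kids = defaultdict(list)
    for child, parent in parent_map.items():
        kids[parent].append(child)
    return kids

def level_counts(parent_map, root):
    """Count groups per level below root, mirroring the depth query's output."""
    kids = children_of(parent_map)
    counts, frontier, level = Counter(), [root], 0
    while frontier:
        # Move one level down: gather the children of everything in the frontier.
        frontier = [c for node in frontier for c in kids.get(node, [])]
        level += 1
        if frontier:
            counts[level] = len(frontier)
    return dict(counts)

def polytomies(parent_map):
    """Return the groups that have more than two direct descendants."""
    return sorted(p for p, cs in children_of(parent_map).items() if len(cs) > 2)

print(level_counts(PARENT, "a"))
print(polytomies(PARENT))
```

On the real graph the same questions are better answered in Cypher; the sketch only shows the logic on data small enough to check by hand.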
//fetch extinct species with at least one extinct descendant
MATCH p=(n:Specie:Extinct)<-[:BELONGS_TO*]-(m)
WHERE ANY(x IN NODES(p) WHERE x <> n AND x:Extinct)
RETURN n.id AS ID, n.name AS extinct_specie,
apoc.coll.toSet(apoc.coll.flatten(COLLECT([x IN NODES(p) WHERE x <> n AND x:Extinct | x.name]))) AS extinct_descendants;
//fetch extinct species with at least one extinct descendant and surviving descendants
MATCH p=(n:Specie:Extinct)<-[:BELONGS_TO*]-(m)
WHERE ANY(x IN NODES(p) WHERE x <> n AND x:Extinct)
WITH n.id AS ID, n.name AS extinct_specie,
apoc.coll.toSet(apoc.coll.flatten(COLLECT([x IN NODES(p) WHERE x <> n AND x:Extinct | x.name]))) AS extinct_descendants,
apoc.coll.toSet(apoc.coll.flatten(COLLECT([x IN NODES(p) WHERE x <> n AND NOT x:Extinct | x.name]))) AS non_extinct_descendants
WHERE size(non_extinct_descendants) > 0
RETURN ID, extinct_specie, extinct_descendants, non_extinct_descendants;
//bloom search phrase
MATCH path=(:Specie {id:$specie})<-[:BELONGS_TO*]-()
RETURN path
Organisms have evolved through the ages from ancestral forms into more derived forms. The notion that all of Life is genetically connected via a vast Phylogenetic Tree is fascinating. Take, for example, ‘Bilateria’, the common ancestor of humans and ladybugs, whose descendants at some point diverged into lineages that include the ‘Arthropods’ and the ‘Deuterostomia’ (which include the Vertebrates).
//bloom search phrase
MATCH path=shortestPath((:Specie {id:$specie_one})-[:BELONGS_TO*]-(:Specie {id:$specie_two}))
RETURN path
You could traverse the Tree of Life up or down exploring the diversity of organisms while being reminded of the genetic connectedness of all of Life.
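The idea behind the shortest-path search phrase can be sketched in a few lines of Python: climb from each group to the root, find where the two lineages first meet, and join the halves. The map below is a toy; its names echo the groups mentioned above, but its shape is hypothetical and far shallower than the real ToL lineage.

```python
# Hypothetical child -> parent links; a toy, not the real ToL lineage.
PARENT = {
    "arthropoda": "bilateria",
    "deuterostomia": "bilateria",
    "ladybugs": "arthropoda",
    "humans": "deuterostomia",
}

def lineage(node, parent_map):
    """Return the chain from node up to the root, node first."""
    path = [node]
    while path[-1] in parent_map:
        path.append(parent_map[path[-1]])
    return path

def connecting_path(a, b, parent_map):
    """Path from a up to the first shared ancestor, then down to b."""
    up_a, up_b = lineage(a, parent_map), lineage(b, parent_map)
    seen = set(up_b)
    for i, node in enumerate(up_a):
        if node in seen:
            j = up_b.index(node)
            # Climb a's side up to the meeting point, then descend b's side.
            return up_a[: i + 1] + up_b[:j][::-1]
    return None  # the two groups share no ancestor in this map

print(connecting_path("humans", "ladybugs", PARENT))
```

In a tree there is only one such path between two groups, which is why an undirected shortest-path search over BELONGS_TO passes through their common ancestor.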
More interestingly though, Humans are colonized by multiple microorganisms, with approximately the same order of magnitude of non-human cells as human cells. The Human Microbiome is the aggregate of all Microbiota that reside on or within human tissue and biofluids along with the corresponding anatomical sites in which they reside. Types of Human Microbiota include Bacteria, Archaea, Fungi, Protists and Viruses. Understanding these interactions between organisms in the Tree of Life could provide for more meaningful analysis. I hope to explore the Human Microbiome Project (HMP) and expand on the Tree of Life in a future post.
P.S.
The Encyclopedia of Life (EoL) is a parallel initiative in the online documentation of Life on Earth. Since the EoL’s emphasis is on the development of species pages, the ToL continues to strengthen its focus on Phylogenetic information and the documentation of deeper branches in the Tree of Life. What’s worth mentioning is that the EoL Project is powered by Neo4j! If it interests you, explore their data model, study sample queries and search the EoL database.
Another early ToL project leveraging Neo4j in a big way to construct the Taxonomy (from multiple sources such as NCBI, GBIF etc.) and build and serve the ToL Graph from many Phylogenetic Trees is the opentreeoflife project, available for download here. It’s a truly impressive & powerful application of synthesizing multiple published Phylogenetic Trees along with Taxonomic Data.
Thank you to Tom Geudens@Neo4j for introducing me to the ToL website.