What’s cooking with GDS

Priya Jacob
11 min read · Apr 5, 2021

“Data Science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.” -Wikipedia

Yes, but you probably knew that already. How about Graph Data Science? Did you know we could do all of that on connected data that we call “A Graph”? Graph Data Science exploits relationships and network structure in data to answer complex questions about system dynamics and to power predictions. And Neo4j has a slick GDS library to do just that!

I thought we’d take off from where Mark Needham last left off in the What’s cooking series, to satisfy my passion for Graphs and the Culinary Arts! I reckoned it’d also be a fun way to work with data that’s easy to interpret and relate to. Hungry, anyone? Let’s get started!

We’re using the same BBC Good Food data feed used before by Lju & Mark. We’ve only added Recipe Cuisines, Courses & Skill Levels to the original schema.

Now, unstructured data in the real world often comes with its own set of challenges, and I thought we’d do ourselves a favor by disambiguating Ingredients to start with. For this, we leverage some basic utility functions from the Neo4j APOC library to fuzzy-match Ingredients that could be potential duplicates.

CALL apoc.periodic.iterate(
"MATCH (i:Ingredient) RETURN i",
"MATCH (j:Ingredient)
 WHERE i <> j
 WITH i, j, apoc.text.clean(i.name) AS name1, apoc.text.clean(j.name) AS name2
 WITH i, j,
      apoc.text.sorensenDiceSimilarity(name1, name2) AS sorensen,
      apoc.text.jaroWinklerDistance(name1, name2) AS jaro,
      apoc.text.levenshteinSimilarity(name1, name2) AS levenshtein
 WITH i, j, apoc.coll.avg([sorensen, jaro, levenshtein]) AS avg
 WHERE avg >= 0.75
 MERGE (i)-[r:FUZZY_MATCH]-(j)
 SET r.score = avg",
{batchSize: 10, parallel: false})
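To build some intuition for why averaging several string-similarity metrics is more robust than trusting any single one, here’s a small stand-alone Python sketch. The metric implementations below are simplified approximations written purely for illustration — they are not the APOC code, and Jaro-Winkler is omitted for brevity:

```python
import re

def clean(s):
    # rough analogue of apoc.text.clean: lowercase, keep letters/digits only
    return re.sub(r'[^a-z0-9]', '', s.lower())

def bigrams(s):
    return [s[i:i + 2] for i in range(len(s) - 1)]

def sorensen_dice(a, b):
    # 2 * |overlap| / (|A| + |B|) over character bigrams (with repeats)
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    pool, overlap = list(bb), 0
    for g in ba:
        if g in pool:
            overlap += 1
            pool.remove(g)
    return 2 * overlap / (len(ba) + len(bb))

def levenshtein_similarity(a, b):
    # classic DP edit distance, rescaled to a 0..1 similarity
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return 1 - prev[n] / max(m, n) if max(m, n) else 1.0

def fuzzy_match(name1, name2, threshold=0.75):
    a, b = clean(name1), clean(name2)
    scores = [sorensen_dice(a, b), levenshtein_similarity(a, b)]
    avg = sum(scores) / len(scores)
    return avg, avg >= threshold

print(fuzzy_match('red onion', 'red onions'))   # near-duplicate: matches
print(fuzzy_match('red onion', 'red pepper'))   # different: rejected
```

Each metric has blind spots (pure edit distance punishes pluralization; bigram overlap forgives transpositions), so averaging them before applying the 0.75 cutoff smooths out individual quirks.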

So let’s see what that got us. It looks OK overall (a sample set below, from the 900+ Ingredients for which a match was found), but there could be a few false positives (!) in there, which for now don’t skew what we plan to analyze.

Now, we could do the same for Keywords, but that’s for another day. For now, we proceed.

I thought about some of the analyses we could perform on this data, and here’s what I came up with:

  • create similarity between recipes based on their ingredients, keywords/tags, cuisine, collections, courses, skill level & diet type; run community detection on similar recipes to form clusters of recipes; make inferences
  • create associations between ingredients based on the number of recipes shared amongst them; run community detection on ingredient associations to form clusters of ingredients; make inferences
  • create associations between cuisines based on the number of ingredients shared amongst them; run community detection on cuisine associations to form clusters of cuisines; make inferences
  • from recipe similarity, run PageRank or degree centrality to find popular recipes; check how they differ from the recipe rating data
  • find how two ingredients, two cuisines, two recipes, or two chefs are connected, using shortest paths
  • find similar chefs based on their recipe collections, recipe courses and recipe cuisines; run community detection on similar chefs to form clusters of chefs
  • are there possible missing links between, say, ingredients, that could potentially find their way into a recipe?

Exciting enough? Let’s find out!

We’ll be employing the latest & greatest GDS library for our analysis. It pays to be on the latest release, given the many improvements, additions and tier graduations that arrive with each one.

We start by finding similar Recipes using more than just Ingredients! We exploit Keywords/Tags, Cuisines, Collections, Courses/Diet Types & Skill Levels to enrich the underlying data set, so that what we feed into the similarity computation is essentially a bipartite graph. First we use a native graph projection and run the classic NodeSimilarity algorithm on it; then, as an alternative, we use a Cypher graph projection and run the quintessential Jaccard similarity algorithm.

//create a native bipartite graph projection
CALL gds.graph.create
('recipe_bipartite',['Recipe','Ingredient','Keyword','Cuisine','Collection','DietType','Course','SkillLevel'],
'*')
//using the NodeSimilarity GDS algorithm in stream mode for the one recipe in question
CALL gds.nodeSimilarity.stream('recipe_bipartite',
{degreeCutoff: 20,
similarityCutoff: 0.25,
topK:100})
YIELD node1, node2, similarity
WHERE node1 <> node2 AND gds.util.asNode(node1).id = '5312021'
WITH gds.util.asNode(node1).id AS recipeId1, gds.util.asNode(node1).name AS recipe1, gds.util.asNode(node2).name AS recipe2, gds.util.asNode(node2).id AS recipeId2, similarity
ORDER BY similarity DESC
RETURN recipeId1, recipe1, COLLECT(recipe2) AS similarRecipes, COUNT(recipe2) AS noOfSimilarRecipes
//ORDER BY size(similarRecipes) DESC
//LIMIT 100
//using the classic Jaccard similarity GDS algorithm and comparing results
MATCH (r1:Recipe {id:'5312021'})
WITH r1,
[(r1)-[:INGREDIENT]->(i1:Ingredient)|ID(i1)]+[(r1)-[:INGREDIENT]->(:Ingredient)-[:FUZZY_MATCH]-(i1:Ingredient)|ID(i1)]+
[(r1)-[:INGREDIENT]->(i1:Ingredient)|ID(i1)]+[(r1)-[:INGREDIENT]->(:Ingredient)-[:FUZZY_MATCH]-(i1:Ingredient)|ID(i1)]+
[(r1)-[:INGREDIENT]->(i1:Ingredient)|ID(i1)]+[(r1)-[:INGREDIENT]->(:Ingredient)-[:FUZZY_MATCH]-(i1:Ingredient)|ID(i1)]+
[(r1)-[:INGREDIENT]->(i1:Ingredient)|ID(i1)]+[(r1)-[:INGREDIENT]->(:Ingredient)-[:FUZZY_MATCH]-(i1:Ingredient)|ID(i1)]+
[(r1)-[:KEYWORD]->(k1:Keyword)|ID(k1)]+
[(r1)-[:KEYWORD]->(k1:Keyword)|ID(k1)]+
[(r1)-[:KEYWORD]->(k1:Keyword)|ID(k1)]+
[(r1)-[:CUISINE]->(c1:Cuisine)|ID(c1)]+
[(r1)-[:CUISINE]->(c1:Cuisine)|ID(c1)]+
[(r1)-[:COLLECTION]->(n1:Collection)|ID(n1)]+
[(r1)-[:COLLECTION]->(n1:Collection)|ID(n1)]+
[(r1)-[:DIET_TYPE]->(d1:DietType)|ID(d1)]+
[(r1)-[:COURSE]->(o1:Course)|ID(o1)]+
[(r1)-[:SKILL_LEVEL]->(l1:SkillLevel)|ID(l1)]
AS list1
MATCH (r2:Recipe)
WHERE r1 <> r2
WITH r1, list1, r2,
[(r2)-[:INGREDIENT]->(i2:Ingredient)|ID(i2)]+[(r2)-[:INGREDIENT]->(:Ingredient)-[:FUZZY_MATCH]-(i2:Ingredient)|ID(i2)]+
[(r2)-[:INGREDIENT]->(i2:Ingredient)|ID(i2)]+[(r2)-[:INGREDIENT]->(:Ingredient)-[:FUZZY_MATCH]-(i2:Ingredient)|ID(i2)]+
[(r2)-[:INGREDIENT]->(i2:Ingredient)|ID(i2)]+[(r2)-[:INGREDIENT]->(:Ingredient)-[:FUZZY_MATCH]-(i2:Ingredient)|ID(i2)]+
[(r2)-[:INGREDIENT]->(i2:Ingredient)|ID(i2)]+[(r2)-[:INGREDIENT]->(:Ingredient)-[:FUZZY_MATCH]-(i2:Ingredient)|ID(i2)]+
[(r2)-[:KEYWORD]->(k2:Keyword)|ID(k2)]+
[(r2)-[:KEYWORD]->(k2:Keyword)|ID(k2)]+
[(r2)-[:KEYWORD]->(k2:Keyword)|ID(k2)]+
[(r2)-[:CUISINE]->(c2:Cuisine)|ID(c2)]+
[(r2)-[:CUISINE]->(c2:Cuisine)|ID(c2)]+
[(r2)-[:COLLECTION]->(n2:Collection)|ID(n2)]+
[(r2)-[:COLLECTION]->(n2:Collection)|ID(n2)]+
[(r2)-[:DIET_TYPE]->(d2:DietType)|ID(d2)]+
[(r2)-[:COURSE]->(o2:Course)|ID(o2)]+
[(r2)-[:SKILL_LEVEL]->(l2:SkillLevel)|ID(l2)] AS list2
WHERE apoc.node.degree.out(r2) >= 20 //degreeCutOff
WITH r1.id AS recipeId1, r1.name AS recipe1, r2.name AS recipe2, gds.alpha.similarity.jaccard(list1, list2) AS similarity
WHERE similarity >= 0.25 //similarityCutOff
WITH recipeId1, recipe1, recipe2, similarity
ORDER BY similarity DESC
LIMIT 100 //topK
RETURN recipeId1, recipe1, COLLECT(recipe2) AS similarRecipes, COUNT(recipe2) AS noOfSimilarRecipes

Quite a few things are done differently here:

  • the Jaccard algorithm is still in the alpha tier at the time of writing, available only as a streaming function, which means we need to do most of the work ourselves: tweaking the parameters and writing results back to the graph
  • we do get greater flexibility over the data projection by leveraging Cypher
  • since the Jaccard algorithm is essentially a comparison of two lists of numbers, we can boost the importance of a certain element by amplifying it in the list. Here, we’ve boosted the Ingredient, Keyword, and Cuisine/Collection elements, in that order of priority, to determine similarity between Recipes. It turns out we get a larger result set while maintaining the same degreeCutoff, similarityCutoff and topK parameters.
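To see why amplifying an element shifts the score, here’s a hypothetical Python sketch of Jaccard over multisets — my own illustration of the boosting idea, assuming repeated elements count toward both intersection and union, which is what the amplification above relies on (not the GDS implementation):

```python
from collections import Counter

def multiset_jaccard(list1, list2):
    """Jaccard over multisets: repeated elements count multiple times."""
    c1, c2 = Counter(list1), Counter(list2)
    intersection = sum((c1 & c2).values())   # element-wise min of counts
    union = sum((c1 | c2).values())          # element-wise max of counts
    return intersection / union if union else 0.0

# two recipes sharing one ingredient, each with one unique keyword
plain = multiset_jaccard(['flour', 'basil'], ['flour', 'vegan'])
# same overlap, but the shared ingredient boosted 4x in both lists
boosted = multiset_jaccard(['flour'] * 4 + ['basil'],
                           ['flour'] * 4 + ['vegan'])
print(plain, boosted)   # the shared ingredient now dominates the score
```

The plain score is 1/3, the boosted one 2/3: duplicating the shared element grows the intersection faster than the union, so overlaps on boosted elements weigh more.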

Next, we run Louvain community detection on both to see what Recipe communities we’re looking at. And… to see which one our pickled cucumbers fall into! We’ll do this with our NodeSimilarity run from before (and bonus points to you for doing the same with Jaccard!).

//running NodeSimilarity in mutate mode first; notice the SIMILAR relationship and the similarity score property being specified
CALL gds.nodeSimilarity.mutate('recipe_bipartite',
{degreeCutoff: 20,
similarityCutoff: 0.25,
topK:100,
mutateRelationshipType: 'SIMILAR',
mutateProperty: 'score'})
YIELD nodesCompared, relationshipsWritten

//running Louvain community detection algorithm over the in-memory graph, notice how we filter the appropriate Node and Relationship (to turn it into a monopartite graph) over which we are to run community detection
CALL gds.louvain.stream('recipe_bipartite',
{nodeLabels: ['Recipe'],
relationshipTypes: ['SIMILAR'],
relationshipWeightProperty: 'score',
maxLevels:10, maxIterations:10, includeIntermediateCommunities:false})
YIELD nodeId, communityId, intermediateCommunityIds
WITH communityId, intermediateCommunityIds, COUNT(nodeId) AS cnt, COLLECT(gds.util.asNode(nodeId).name) AS recipes
WHERE 'dill pickled cucumbers' IN recipes
RETURN communityId, intermediateCommunityIds, cnt, recipes
//RETURN COUNT(DISTINCT communityId)
//RETURN communityId, intermediateCommunityIds, COUNT(nodeId) AS cnt, COLLECT(gds.util.asNode(nodeId).name) AS recipes
//ORDER BY cnt DESC LIMIT 100
//WITH communityId, COUNT(nodeId) AS size
//RETURN size, COUNT(communityId)
//ORDER BY size DESC

Turns out our ‘dill pickled cucumbers’ found their way into a community of salads, dressings, dips and relishes!

A quick check on the stats generated by Louvain shows community convergence after 3 iterations; it didn’t even make it to the default 10.

//checking stats
CALL gds.louvain.stats('recipe_bipartite',
{nodeLabels: ['Recipe'],
relationshipTypes: ['SIMILAR'],
relationshipWeightProperty: 'score',
maxLevels:10, maxIterations:10, includeIntermediateCommunities:false})
YIELD communityCount, ranLevels

Some more fun with the communities, a little Christmassy down here;

Of Quiches, Tarts and Tartlets…

Right, hungry enough? Pun intended. Well, it did make me want to abandon my analysis and go bake a pie. But there’s so much more to do before we settle for comfort food, so let’s move on!

Let’s now find associated Ingredients based on the Recipes they’re used in together, and run community detection to see if they all come together! We’ll make this quick & dirty: an in-memory Cypher graph projection followed by Louvain community detection.

//quick data profiling
MATCH (n:Recipe)
WITH COUNT(n) AS totalNoOfRecipes
MATCH (i:Ingredient)<-[:INGREDIENT]-(r:Recipe)-[:INGREDIENT]->(j:Ingredient)
WITH i.name AS ingredient1, j.name AS ingredient2, COUNT(r) AS noOfRecipes, totalNoOfRecipes, apoc.coll.sort(apoc.coll.toSet(COLLECT(r.name)))[..10] AS sampleRecipes
WHERE noOfRecipes > 1 AND toFloat(noOfRecipes)/toFloat(totalNoOfRecipes) > 0.0075
RETURN DISTINCT apoc.coll.sort([ingredient1,ingredient2]) AS ingredients, noOfRecipes, sampleRecipes
ORDER BY noOfRecipes DESC
LIMIT 50
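The support-style filter above (pairs co-occurring in more than one recipe and in more than 0.75% of all recipes) can be prototyped in plain Python before baking it into a projection. A toy sketch with made-up recipe data (the thresholds here are scaled to the tiny sample, not the real feed):

```python
from itertools import combinations
from collections import Counter

# toy recipe -> ingredients data (illustrative only)
recipes = {
    'pesto pasta':  ['basil', 'garlic', 'olive oil', 'pasta'],
    'bruschetta':   ['basil', 'garlic', 'olive oil', 'bread'],
    'garlic bread': ['garlic', 'olive oil', 'bread', 'butter'],
    'sponge cake':  ['flour', 'butter', 'sugar', 'egg'],
}

def ingredient_pairs(recipes, min_recipes=1, min_share=0.25):
    """Count co-occurring ingredient pairs, keep only the frequent ones."""
    total = len(recipes)
    pair_counts = Counter()
    for ingredients in recipes.values():
        # sorted() makes (a, b) and (b, a) collapse into one key
        for pair in combinations(sorted(set(ingredients)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items()
            if n > min_recipes and n / total > min_share}

print(ingredient_pairs(recipes))
```

On this sample, the pantry staples (garlic/olive oil and friends) survive the cut while one-off pairings like flour–sugar drop out — exactly the pruning the Cypher projection performs at scale.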
//creating an in-memory Cypher projection
CALL gds.graph.create.cypher('ingredient_association',
'MATCH (i:Ingredient) RETURN ID(i) AS id',
'MATCH (n:Recipe)
WITH COUNT(n) AS totalNoOfRecipes
MATCH (i:Ingredient)<-[:INGREDIENT]-(r:Recipe)-[:INGREDIENT]->(j:Ingredient)
WITH i, j, COUNT(r) AS noOfRecipes, totalNoOfRecipes
WHERE noOfRecipes > 1 AND toFloat(noOfRecipes)/toFloat(totalNoOfRecipes) > 0.0075
RETURN ID(i) AS source, ID(j) AS target, noOfRecipes AS weight')
//running Louvain community detection over the projected in-memory graph
CALL gds.louvain.stream('ingredient_association',
{relationshipWeightProperty:'weight'})
YIELD nodeId, communityId, intermediateCommunityIds
WITH communityId, intermediateCommunityIds, COUNT(nodeId) AS cnt, COLLECT(gds.util.asNode(nodeId).name) AS ingredients
WHERE cnt > 1
RETURN communityId, cnt, ingredients
ORDER BY cnt DESC
LIMIT 100

…and what do we see?! Can you match the columns (ok, that’s visuals with the rows below but wasn’t matching columns fun back in school exams)?

Let’s see if we can get this working with Cuisines now.

//quick data profiling
MATCH (c1:Cuisine)<--(:Recipe)-->(i:Ingredient)<--(:Recipe)-->(c2:Cuisine)
WHERE c1 <> c2
WITH c1.name AS cuisine1, c2.name AS cuisine2, i, COUNT(*) AS noOfRecipes
ORDER BY noOfRecipes DESC, i.name
WITH cuisine1, cuisine2, COUNT(i) AS noOfIngredients, COLLECT(i.name)[..100] AS ingredients
WHERE noOfIngredients >= 100
RETURN DISTINCT apoc.coll.sort([cuisine1,cuisine2]) AS cuisines, noOfIngredients, ingredients
ORDER BY noOfIngredients DESC

Well, some are obvious;

…and others? Let’s find out!

//creating an in-memory Cypher projection of associated Cuisines based on the ingredients they share
CALL gds.graph.create.cypher('cuisine_association',
'MATCH (c:Cuisine) RETURN ID(c) AS id',
'MATCH (c1:Cuisine)<--(:Recipe)-->(i:Ingredient)<--(:Recipe)-->(c2:Cuisine)
WHERE c1 <> c2
WITH c1, c2, COUNT(DISTINCT i) AS noOfIngredients
WHERE noOfIngredients >= 100
RETURN ID(c1) AS source, ID(c2) AS target, noOfIngredients AS weight')
//running Louvain community detection over the in-memory Cypher projection
CALL gds.louvain.stream('cuisine_association',{relationshipWeightProperty:'weight'})
YIELD nodeId, communityId, intermediateCommunityIds
WITH communityId, intermediateCommunityIds, COUNT(nodeId) AS cnt, COLLECT(gds.util.asNode(nodeId).name) AS cuisines
WHERE cnt > 1
RETURN communityId, cnt, cuisines
ORDER BY cnt DESC
LIMIT 10

…and well? Too symmetrical for one but notwithstanding, the world is the west and the east by far!

Coming back to our Recipes, we may want to find those most popular in our collection. We’ll use a simple PageRank algorithm to find them. Once again, quick & dirty analysis from our existing Recipe similarity monopartite graph;

//using PageRank over the in-memory native projection
CALL gds.pageRank.stream('recipe_bipartite',
{nodeLabels: ['Recipe'],
relationshipTypes: ['SIMILAR'],
relationshipWeightProperty: 'score'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS recipe, score
ORDER BY score DESC LIMIT 10
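For a rough feel of what weighted PageRank does over those SIMILAR relationships, here’s a minimal power-iteration sketch on a tiny weighted graph. This is a toy implementation of the classic formulation, not the GDS one, and it assumes every node has at least one outgoing edge:

```python
def pagerank(edges, nodes, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration.
    edges: dict mapping (src, dst) -> weight."""
    # total outgoing weight per node (assumed non-zero for all nodes)
    out_weight = {n: 0.0 for n in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for (src, dst), w in edges.items():
            # each node passes its rank along edges, proportional to weight
            nxt[dst] += damping * rank[src] * w / out_weight[src]
        rank = nxt
    return rank

# tiny 'similar recipes' graph: node C receives the heaviest links
edges = {('A', 'C'): 0.9, ('B', 'C'): 0.8, ('A', 'B'): 0.1,
         ('B', 'A'): 0.2, ('C', 'A'): 0.15, ('C', 'B'): 0.15}
scores = pagerank(edges, ['A', 'B', 'C'])
print(max(scores, key=scores.get))   # 'C' ranks highest
```

Note that rank flows along the similarity edges, so a recipe heavily linked by other well-connected recipes bubbles up even if its own rating is middling — which is exactly the mismatch observed below.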

And the one that ranks highest isn’t always the one that’s popular by accolades/ratings, as we see. I’m sure you can now gather how that was true about the popular teen back in school. Or that colleague at work. Or… well, you get it (!).

Next, we try finding the shortest path between two Ingredients. Now, that isn’t solving the what-do-we-cook-with-these problem, but it lets us see how two Ingredients might be connected. We’ll employ Yen’s K shortest paths algorithm because we’re looking for more than just one path!

//using Yen’s K shortest path algorithm
MATCH (source:Ingredient {name: 'sichuan peppercorns'}), (target:Ingredient {name: 'kaffir lime leaves'})
CALL gds.beta.shortestPath.yens.stream(
{nodeProjection:'*',
relationshipProjection:{
INGREDIENT : {
type: 'INGREDIENT',
orientation: 'UNDIRECTED'},
CUISINE : {
type: 'CUISINE',
orientation: 'UNDIRECTED'},
FUZZY_MATCH : {
type: 'FUZZY_MATCH',
orientation: 'UNDIRECTED'}},
sourceNode: id(source),
targetNode: id(target),
k: 3,
path: true})
YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
RETURN
//index,
//gds.util.asNode(sourceNode).name AS sourceNodeName,
//gds.util.asNode(targetNode).name AS targetNodeName,
//totalCost,
[nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodeNames,
//costs,
path
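Conceptually, Yen’s algorithm enumerates the k cheapest paths between two nodes. On a tiny graph we can get the same answer with a brute-force best-first search over simple paths; the sketch below (not Yen’s algorithm proper, and entirely made-up toy data) shows the kind of multi-hop connections the query surfaces:

```python
from heapq import heappush, heappop

def k_shortest_paths(graph, source, target, k=3):
    """k shortest simple paths in a small unweighted graph,
    by best-first search over partial paths (brute force, not Yen's)."""
    paths, heap = [], [(0, [source])]
    while heap and len(paths) < k:
        cost, path = heappop(heap)
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:           # keep paths simple (no cycles)
                heappush(heap, (cost + 1, path + [nxt]))
    return paths

# toy graph: ingredients linked through recipes and cuisines (made up)
graph = {
    'sichuan peppercorns': ['mapo tofu', 'chinese'],
    'mapo tofu': ['sichuan peppercorns', 'garlic'],
    'chinese': ['sichuan peppercorns', 'garlic'],
    'garlic': ['mapo tofu', 'chinese', 'thai green curry'],
    'thai green curry': ['garlic', 'kaffir lime leaves'],
    'kaffir lime leaves': ['thai green curry'],
}
for p in k_shortest_paths(graph, 'sichuan peppercorns', 'kaffir lime leaves'):
    print(' -> '.join(p))
```

Both surviving paths route through garlic, a hub ingredient, much as the real query tends to connect exotic ingredients through common ones. Yen’s algorithm achieves the same result far more efficiently on large graphs by deriving each new path from deviations off the previous one.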
//ORDER BY index

How about how two Chefs may be connected?

MATCH (source:Author {name: 'gordon ramsay'}), (target:Author {name: 'sarah cook'})
CALL gds.beta.shortestPath.yens.stream(
{nodeProjection:'*',
relationshipProjection:{
INGREDIENT : {
type: 'INGREDIENT',
orientation: 'UNDIRECTED'},
CREATED : {
type: 'CREATED',
orientation: 'UNDIRECTED'},
CUISINE : {
type: 'CUISINE',
orientation: 'UNDIRECTED'},
COLLECTION : {
type: 'COLLECTION',
orientation: 'UNDIRECTED'},
COURSE : {
type: 'COURSE',
orientation: 'UNDIRECTED'},
FUZZY_MATCH : {
type: 'FUZZY_MATCH',
orientation: 'UNDIRECTED'}},
sourceNode: id(source),
targetNode: id(target),
k: 3,
path: true})
YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
RETURN
//index,
//gds.util.asNode(sourceNode).name AS sourceNodeName,
//gds.util.asNode(targetNode).name AS targetNodeName,
//totalCost,
[nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodeNames,
//costs,
path
//ORDER BY index

Lastly, are there possible missing links between, say, ingredients, that could find their way into a recipe? Let’s find out! For this, we’ll need to persist our Ingredient association monopartite graph, since the Link Prediction algorithms are functions in the alpha tier at the time of writing and cannot operate on an in-memory projection.

MATCH (n:Recipe)
WITH COUNT(n) AS totalNoOfRecipes
MATCH (i:Ingredient)<-[:INGREDIENT]-(r:Recipe)-[:INGREDIENT]->(j:Ingredient)
WHERE i <> j
WITH i, j, COUNT(DISTINCT r) AS noOfRecipes, totalNoOfRecipes
WHERE noOfRecipes>1 AND toFloat(noOfRecipes)/toFloat(totalNoOfRecipes) > 0.0075
WITH i, j, noOfRecipes
MERGE (i)-[r:INGREDIENT_ASSOCIATED_WITH]-(j)
RETURN COUNT(r)
//running Link Prediction algorithms
MATCH (n1:Ingredient{name:'butter'})
MATCH (n2:Ingredient{name:'sugar'})
RETURN
gds.alpha.linkprediction.commonNeighbors(n1, n2, {relationshipQuery:'INGREDIENT_ASSOCIATED_WITH'}) AS cn_score,
gds.alpha.linkprediction.preferentialAttachment(n1, n2, {relationshipQuery:'INGREDIENT_ASSOCIATED_WITH'}) AS pa_score,
gds.alpha.linkprediction.adamicAdar(n1, n2, {relationshipQuery:'INGREDIENT_ASSOCIATED_WITH'}) AS aa_score,
gds.alpha.linkprediction.resourceAllocation(n1, n2, {relationshipQuery:'INGREDIENT_ASSOCIATED_WITH'}) AS ra_score,
gds.alpha.linkprediction.totalNeighbors(n1, n2, {relationshipQuery:'INGREDIENT_ASSOCIATED_WITH'}) AS tn_score
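These five link-prediction scores are all simple neighbourhood heuristics, easy to reproduce by hand. Here’s a Python sketch over a made-up INGREDIENT_ASSOCIATED_WITH adjacency — my own toy code following the standard textbook definitions, not the GDS implementation:

```python
from math import log

# toy INGREDIENT_ASSOCIATED_WITH adjacency (illustrative data only)
neighbors = {
    'butter':  {'sugar', 'flour', 'egg', 'milk'},
    'sugar':   {'butter', 'flour', 'egg', 'vanilla'},
    'flour':   {'butter', 'sugar', 'egg', 'milk', 'yeast'},
    'egg':     {'butter', 'sugar', 'flour'},
    'milk':    {'butter', 'flour'},
    'vanilla': {'sugar'},
    'yeast':   {'flour'},
}

def common_neighbors(a, b):
    return len(neighbors[a] & neighbors[b])

def preferential_attachment(a, b):
    return len(neighbors[a]) * len(neighbors[b])

def adamic_adar(a, b):
    # shared neighbours weighted down by how ubiquitous they are
    return sum(1 / log(len(neighbors[n])) for n in neighbors[a] & neighbors[b])

def resource_allocation(a, b):
    return sum(1 / len(neighbors[n]) for n in neighbors[a] & neighbors[b])

def total_neighbors(a, b):
    return len(neighbors[a] | neighbors[b])

print(common_neighbors('butter', 'sugar'),        # shared: flour, egg
      preferential_attachment('butter', 'sugar'),
      total_neighbors('butter', 'sugar'))
```

The heuristics disagree by design: preferential attachment rewards two well-connected nodes regardless of overlap, while Adamic-Adar and resource allocation reward overlap through rare neighbours, which is why comparing several scores at once (as in the query above) is more informative than any single one.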

…and that was a no-brainer, in the name of testing

However, chilli — ice cream, anyone? Turns out, that has been attempted.

Well, I hope you had as much fun as I did with all the food and analysis. Now, off to baking that pie while I leave you with a treasure trove of resources to get onboard!

Resources

Introduction to Neo4j GDS Offering

https://neo4j.com/product/graph-data-science-library/

Learn from the Neo4j Graph Academy

https://neo4j.com/graphacademy/training-iga-40/enrollment/

Check out the latest & greatest Neo4j GDS Library

https://neo4j.com/docs/graph-data-science/current/

Spin up a Neo4j GDS sandbox

https://medium.com/neo4j/hands-on-with-the-neo4j-graph-data-science-sandbox-7b780be5a44f

Download your free copy of Neo4j GDS for Dummies

https://neo4j.com/books/graph-data-science-for-dummies/

Learn from noted bloggers

https://towardsdatascience.com/how-to-get-started-with-the-new-graph-data-science-library-of-neo4j-3c8fff6107b

https://towardsdatascience.com/part-2-exploring-pathfinding-graph-algorithms-e194ffb7f569

https://towardsdatascience.com/community-detection-of-the-countries-of-the-world-with-neo4j-graph-data-science-4d3a022f8399

https://towardsdatascience.com/node-embeddings-node2vec-with-neo4j-5152d3472d8e

If you’re as passionate about Spark, try Spark with Neo4j GDS!

https://neo4j.com/developer/spark/gds/

https://neo4j.com/graph-algorithms-book/

Learn the latest Neo4j GDS library offerings

https://markhneedham.com/blog/2021/02/03/neo4j-gdsl-hits-algorithm/

https://markhneedham.com/blog/2021/02/08/neo4j-gdsl-overlapping-community-detection-sllpa/

Go further on in your quest with ML; (Neo4j is the only Graph Database with a native Graph supervised-ML offering!)

https://neo4j.com/blog/new-supervised-machine-learning-workflows-in-neo4j/
