BI on Graphs

Photo by Lukas Blazek on Unsplash

Business Intelligence Reporting & Dashboarding tools are an integral part of a typical IT Business infrastructure / solutions framework and the ability to meet data where it resides is an attractive proposition, without the unnecessary hassle of custom ETL and cumbersome workarounds in place. The Neo4j BI Connector was built with the same objective; meeting Customers where they are, by democratizing seamless access to Neo4j data alongside other traditional sources of data, much to the relief & pleasure of Graph enthusiasts & BI practitioners. A win-win for Neo4j and popular BI tooling! With support for JDBC alone earlier, the recently (GA) added ODBC functionality support makes for wider acceptance in BI tooling, covering the likes of Power BI, Tableau, Looker, TIBCO Spotfire Server, MicroStrategy and many more.

The Neo4j Connector for BI is available at no extra charge for Neo4j Enterprise Edition Customers. Functionally, the Product:

  • builds on the Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) standards
  • translates SQL into Neo4j’s native, graph-optimized Cypher query language
  • makes connected data insights accessible in real-time or near real-time as supported by the BI tooling in place
  • is fully supported and ready for Production, Enterprise deployment

Right, with that being a short introduction, this post will focus on leveraging the Neo4j BI ODBC Connector in Power BI. For the purpose of demonstration, I’ve used a sample .pbix based on Opportunity Analysis (This sample is part of a series that shows how you can use Power BI with business-oriented data, reports, and dashboards. It was created by obviEnce with real data, that has been anonymized. The file and associated data is property of obviEnce llc and has been shared solely for the purpose of demonstrating Power BI functionality with industry sample data.). Since native Graph Queries (localized) & Analytics (global span) are typically built to answer Qs you can’t otherwise answer or may have difficulty answering (such as variable path expansions, several complex joins, associations etc.), I’ve resorted to employing a typical Dashboard with Queries that are meant to depict aggregations (i.e. the data warehouse-y kind number crunching operations). I’ve downloaded the data behind the sample .pbix reports and have modeled the same in Neo4j as a Graph. I’ve then used the same data from Neo4j to rebuild the Dashboard, only this time leveraging the Neo4j BI Connector instead. It looks like the round-about way but I didn’t want to spend time designing a Dashboard from scratch (nor talk about the powerful features of Power BI because that’s not what this post is about; the sample Dashboard already speaks for itself) and a legit sample just came handy.

So, we start by doing a Cypher LOAD CSV to load each of the sample .pbix datasets into Neo4j. I’ve spun up a local standalone Neo4j 4.4.3 instance on Windows with APOC 4.4.0.2 (a pre-requisite to using the Neo4j BI Connector).

CREATE CONSTRAINT c_accountID IF NOT EXISTS FOR (n:Account) REQUIRE n.accountID IS UNIQUE;
CREATE CONSTRAINT c_partnerID IF NOT EXISTS FOR (n:Partner) REQUIRE n.partnerID IS UNIQUE;
CREATE CONSTRAINT c_productID IF NOT EXISTS FOR (n:Product) REQUIRE n.productID IS UNIQUE;
CREATE CONSTRAINT c_opportunityID IF NOT EXISTS FOR (n:Opportunity) REQUIRE n.opportunityID IS UNIQUE
:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///account.csv' AS row
WITH row
MERGE (n:Account {accountID: row.`Account ID`})
SET
n.account = row.Account,
n.region = row.Region,
n.segment = row.Segment
:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///partner.csv' AS row
WITH row WHERE row.`Partner Driven` = 'Yes'
MERGE (n:Partner {partnerID: row.`Partner ID`})
SET
n.partner = row.Partner
:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///product.csv' AS row
WITH row
MERGE (n:Product {productID: row.`Product ID`})
SET
n.product = row.`Product Code`
:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///opportunity.csv' AS row
WITH row
MERGE (n:Opportunity {opportunityID: row.`Opportunity ID`})
SET
n.opportunity = row.Name,
n.size = row.`Opportunity Size`
:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///fact.csv' AS row
WITH row
MERGE (n:Opportunity {opportunityID: row.`Opportunity ID`})
SET
n.createdOn = date(apoc.date.convertFormat(row.`Create Date`,'dd-MM-yyyy')),
n.estimatedClosure = date(apoc.date.convertFormat(substring(row.EstimatedCloseDate,0,10),'dd-MM-yyyy')),
n.stage = CASE row.`Sales Stage ID` WHEN '1' THEN 'Lead' WHEN '2' THEN 'Qualify' WHEN '3' THEN 'Solution' WHEN '4' THEN 'Proposal' WHEN '5' THEN 'Finalize' END,
n.opportunityDays = row.`Opportunity Days`,
n.year = row.Year,
n.month = row.Month,
n.monthNumber = apoc.text.lpad(row.`Month_Number`, 2, '0')
WITH *
MATCH (ac:Account {accountID: row.`Account ID`})
MERGE (n)-[:ACCOUNT]->(ac)
WITH *
MATCH (pr:Product {productID: row.`Product ID`})
MERGE (n)-[r:PRODUCT]->(pr)
SET
r.productRevenue = toFloat(apoc.text.regreplace(row.ProductRevenue,'(\$)|(,)','')),
r.factoredProductRevenue = toFloat(apoc.text.regreplace(row.FactoredProductRevenue,'(\$)|(,)',''))
WITH *
MATCH (pa:Partner {partnerID: row.`Partner ID`})
MERGE (n)-[:ALLIANCE]->(pa)
MATCH (o:Opportunity)
SET o.directSales = NOT exists((o)-[:ALLIANCE]->())
RETURN COUNT(o)

Once that’s done, since I’m working with Power BI Desktop, I create a System DSN pointing to my local Neo4j standalone instance. Make sure you install the Simba Neo4j ODBC Driver before that! Also, make sure that you use a connector whose bitness matches the bitness of the client application.

I don’t set any of the ‘Advanced Options’ here. Since my local instance has not been set up with SSL/TLS, I leave the default setting as is (i.e. SSL disabled). I do a ‘Test…’ connection and that succeeds. Moving on, in Power BI, I navigate to the Get Data -> Other -> ODBC menu and select the DSN that was just created. Now, what you got to observe is that the ODBC data source option only works in ‘Import’ mode and not ‘DirectQuery’. And what that means is data is imported into Power BI from Neo4j, for further use. Once the import is complete, data will no longer refresh unless an on-demand/scheduled refresh is performed. The next thing to do is to model relationships by joining the Tables together. I know it’s weird, first I got Tabular relational data modeled as a Graph, then I got to converting it back into siloed Tables and now I’m joining them all over again. But that’s how Nodes and Relationships from Neo4j are represented, Tables for every distinct Node-Label combination and each Relationship Type. Because that is the rationale behind making possible, SQL-like queries, being served by a ‘native’ Graph Database. More on the schema translation in the manual.

With that done, I proceed to rebuild the report visuals from the sample .pbix, with imported data from Neo4j, easy-peasy. (the report visuals are a copyright of obviEnce llc)

Other things tried & tested;

(1) Including / excluding Labels, Relationship Types when it comes to metadata refresh and data sampling, using the ‘Connector Configuration Options’ in the ODBC connection string, in Power BI, as depicted from below. It’s a useful option when you have tens* of Labels & Relationship Types in your underlying Neo4j Database and want to benefit from several connections on a single DSN, each with a purpose of its own. You only choose the Labels and Relationship Types to work with and leave the rest out, it makes for a leaner & cleaner model, saves on resources and overall refresh time.

(2) Connecting to a standalone Neo4j instance and a Neo4j Causal Cluster with SSL/TLS in place. For this, I cleverly employed an AuraDB Free Database connection and an AuraDB Professional Database connection, respectively. You will need to select the ‘Use SSL’ option while creating the System DNS or your connection will fail. Production Environments will have SSL/TLS set up which is to suggest the client will use an SSL encrypted connection to communicate with the Neo4j server. Also to note is that SSL is configured independently of authentication. When authentication and SSL are both enabled, the connector performs the specified authentication method over an SSL connection.

The ‘PEM Certificate File’ (Connector Configuration Option ‘TrustedCerts’) is required when client_auth=REQUIRE is configured on a Neo4j server for encrypted bolt connections. Simply put, the client authorization on the side of the server (Neo4j in this case) requires that the client (BI Connector in this case) has access to the certificate of the server.

Self-Signed Certificates (‘Allow Self-Signed Certificates’ or Connector Configuration Option ‘AllowSelfSignedServerCert’) are disabled by default. When enabled, the connector authenticates the Neo4j server even if the server is using a self-signed certificate.

(3) You can strictly use the Neo4j bolt scheme (Connector Configuration Option ‘StrictlyUseBoltScheme’) to connect to a standalone Neo4j instance or say a Read Replica instance of a Neo4j Causal Cluster. It also works for a Neo4j Causal Cluster, as tested with a AuraDB Professional Database, with its DNS or proxy host entry (e.g. neo4j+s://xxxxxxxx.databases.neo4j.io) in the DSN ‘Host’ configuration option. It is recommended that you use the bolt scheme to connect to a standalone endpoint and the neo4j scheme to connect to a clustered endpoint.

(4) You may sometimes be required to add custom functionality in the form of a derived Table (in the case of Neo4j, say a friend-of-friends query or an association query etc.). Custom queries using a DSN, involve writing SQL like queries, that the Neo4j BI ODBC Connector then translates into equivalent Cypher, for execution in Neo4j. I’d say it’s a bit of a struggle having to know the schema upfront with Neo4j Labels and Relationship Types being converted to Tables. Below are two examples of custom SQL queries written on an adapted version of the famous sample Movie Graph in Neo4j. Well, given a choice, I am not writing SQL queries to do what Cypher can so elegantly do! That said, I hope for Cypher support to come soon!

If you do choose Cypher over SQL, you can instead use the Neo4j HTTP API (that provides a Cypher Transactional HTTP endpoint) to import data into Power BI as explained here.

//actors having worked together the most
MATCH (actor:Person)-[:ACTOR]->(m)<-[:ACTOR]-(coactor:Person)
WHERE ID(actor) < ID(coactor)
WITH actor, coactor, COUNT(m) AS cnt, COLLECT(m.title) AS movies
WHERE cnt > 1
RETURN actor.name, coactor.name, movies, cnt
ORDER BY cnt DESC
LIMIT 5
--SQL equivalent of above Cypher
SELECT TOP 5 a1.name AS actor1, a2.name AS actor2, COUNT(*) AS cnt
FROM Person a1, Person a2, Movie m, Person_ACTOR_Movie am1, Person_ACTOR_Movie am2
WHERE
a1.`_NodeId_` < a2.`_NodeId_` AND
a1.`_NodeId_` = am1.`_SourceId_` AND m.`_NodeId_` = am1.`_TargetId_` AND
a2.`_NodeId_` = am2.`_SourceId_` AND m.`_NodeId_` = am2.`_TargetId_`
GROUP BY a1.name, a2.name
HAVING COUNT(*) > 1
ORDER BY cnt DESC
--SQL query used without the limit clause
SELECT a1.name, a2.name, m.title
FROM
(
SELECT a1.`_NodeId_` AS actor1, a2.`_NodeId_` AS actor2, COUNT(*) AS cnt
FROM Person a1, Person a2, Movie m, Person_ACTOR_Movie am1, Person_ACTOR_Movie am2
WHERE
a1.`_NodeId_` < a2.`_NodeId_` AND
a1.`_NodeId_` = am1.`_SourceId_` AND m.`_NodeId_` = am1.`_TargetId_` AND
a2.`_NodeId_` = am2.`_SourceId_` AND m.`_NodeId_` = am2.`_TargetId_`
GROUP BY a1.`_NodeId_`, a2.`_NodeId_`
HAVING COUNT(*) > 1
) AS ip, Person a1, Person a2, Movie m, Person_ACTOR_Movie am1, Person_ACTOR_Movie am2
WHERE
a1.`_NodeId_` = ip.actor1 AND
a2.`_NodeId_` = ip.actor2 AND
a1.`_NodeId_` = am1.`_SourceId_` AND m.`_NodeId_` = am1.`_TargetId_` AND
a2.`_NodeId_` = am2.`_SourceId_` AND m.`_NodeId_` = am2.`_TargetId_`

(5) Logging can be enabled to help troubleshoot issues. Logging or tracing must be enabled only long enough to capture an issue. Logging or tracing has an impact on overall performance and can consume a large quantity of disk space. There are however options to control the log file size and number. Logging when enabled with level ‘TRACE’ is seen to create a fairly large number of files since ALL log events are then logged. Logging when enabled, is seen to create 1 file per connection. While the connection to Neo4j appears to be single from Power BI, the log files show to be otherwise. It’s best to set logging options (such as ‘LogLevel’, ‘LogPath’) for the connection in question in the connection string than to have DSN or Connector-wide logging enabled. Settings in the connection string take precedence over settings in the DSN, and settings in the DSN take precedence over Connector-wide settings.

So, here are my key takeaways;

  • In a hybrid/polyglot data landscape, where Neo4j lends itself as one of the many disparate data sources that are to be reported upon, offering a consolidated view of data & KPIs using BI tooling such as Power BI and others, the Neo4j BI Connector serves its purpose.
  • The ODBC data source in Power BI currently operates in Import mode only, which means data is either cached locally in Power BI for a Power BI Desktop User, or would reside in the cloud when it comes to publishing the on-premise ODBC data source via the Power BI Gateway to the Power BI Online Service. The data refresh frequency would have to be arrived at from the defaults, as configured. There are other limitations that come with the Power BI Online Service that one needs to be aware of.
  • I haven’t validated whether the Import-only ODBC data source using the Neo4j BI Connector can be combined with say, a DirectQuery data source (live connection) in Power BI. Nor have I combined data from another Import-only data source and modeled the same in Power BI, though I believe that would in most cases, be the practical usage of the Neo4j BI Connector i.e. using Power BI to link data from Neo4j and other supported data sources.
  • For complex queries that are otherwise easily understood and written using Cypher, the Neo4j BI Connector may not serve as the best bet if the reporting is predominantly on Neo4j. The Neo4j BI Connector support for custom Cypher is not known at this point in time. Should the Neo4j BI Connector still be employed in cases where ETL or materialized views are not an option, care must be taken to see only relevant entities (Labels & Relationship Types) from Neo4j be filtered or cached as the case may be, with a view to optimal performance and resource utilization.

lifelong learner

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Minimal Switch in Flutter

Annotating Swift Code

What’s Better? Anaconda or Python Virtualenv

Creating a New Environment in Python

Stateful Model-Based Database Fuzzing

The TAO: A New Framework to Power the Web

Practical Python: Using Code to Find a COVID19 Vaccine Opening at CVS

How to debug code: Best Practices | Tips & Tricks

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Priya Jacob

Priya Jacob

lifelong learner

More from Medium

Visualization of Dynamic Network via a Sampled Massive Sequence View

Computational infrastructure behind Habidatum Location Risk Score

Deploy Bokeh Apps on GCP Cloud Run

Visualizing Gapminder Data Using Bokeh