Graphing the Floodlight Open Project data

Frederico Braga
Neo4j Developer Blog
4 min readDec 10, 2018

A friend of mine mentioned the Roche Floodlight Open Project to me this weekend and the fact that Roche publicly shares the data it collects during this study.

You can read about the Floodlight Project here:

In Roche’s own words: “Through Floodlight Open we hope to create a more holistic view of MS. By engaging with Floodlight on a regular basis, you generate data for yourself, and join thousands of others to build a unique open access dataset.”

That’s cool.

What’s interesting about Floodlight is that it shows us new possibilities of conducting clinical research. The subjects are enrolled online, fully electronically, the tests are conducted using a mobile phone and according to Roche, its results are pretty interesting and show how incredibly accurate the mobile apps developed by its scientists can be. It is the most extreme representation of the term ‘Virtual Trial’ I have seen to date.

I had a morning flight and decided to download the data from their website and start playing with it in Neo4j during my flight while I was offline.

The data is very simple and has ID’s for both the subject and the test which I leveraged to merge and build relationships. I also decided to create a separate label for the Country so we can leverage relationships for our queries

Here’s the code after downloading the csv file to my ‘import’ folder in the Neo4j database instance:

load csv with headers from ‘file:///floodlight_complete_dataset.csv’ as a
with
a.floodlightOpenId as floodlightOpenId,
a.participantIsControl as participantIsControl,
a.testResultMetricCreatedOn as testResultMetricCreatedOn,
a.participantSex as participantSex,
a.participantHeightCms as participantHeightCms,
a.participantBirthYear as participantBirthYear,
a.participantWeightLbs as participantWeightLbs,
a.participantCreatedOn as participantCreatedOn,
a.participantCountryOfResidence as participantCountryOfResidence,
a.testResultMetricTimestamp1 as testResultMetricTimestamp1,
a.testResultMetricTimestamp2 as testResultMetricTimestamp2,
a.testMetricName as testMetricName,
a.testResultMetricId as testResultMetricId,
a.testName as testName,
a.testResultMetricValue as testResultMetricValue
merge (p:participant {id:floodlightOpenId}) set
p.participantIsControl=participantIsControl,
p.participantSex=participantSex,
p.participantHeightCms=participantHeightCms,
p.participantBirthYear=participantBirthYear,
p.participantWeightLbs=participantWeightLbs,
p.participantCreatedOn=participantCreatedOn
merge (c:country {name:participantCountryOfResidence})
merge (p)-[:resides_in]->(c)
merge (t:test {id:testResultMetricId }) SET
t.testResultMetricTimestamp1=testResultMetricTimestamp1,
t.testResultMetricTimestamp2=testResultMetricTimestamp2,
t.testMetricName=testMetricName,
t.testResultMetricValue=testResultMetricValue,
t.testResultMetricCreatedOn=testResultMetricCreatedOn
merge (p)-[:conducted]->(t)

I was curious to learn which tests did the subjects which used the app more frequently where taking:

match (t:test)-[cd:conducted]-(p:participant)
with p, cd, t, t.testMetricName as testname, p.id as participant_id, count(p) as test, t.testResultMetricValue as result
return testname, sum(test) as num_of_tests order by num_of_tests desc

OK, looks like “Life Space Daily”, “Hand Used” and “Mood Response” are the winners.

Then I looked at who is taking these tests the most:

match (t:test)-[cd:conducted]-(p:participant)
with p, cd, t, t.testMetricName as testname, p.id as participant_id, count(p) as test, t.testResultMetricValue as result
return testname, p.id as participant_id, count(cd) as tests order by tests desc

Next, I check who was doing the top 50 tests, which tests and the connection to the Country:

I also wanted to see if there is a relationship between the number of tests conducted by a subject and the results (score) achieved. I decided to look at a test called ‘Successful Pinches’ and look at the 50 subjects that had conducted this test more often:

match (t:test)-[cd:conducted]-(p:participant)-[b]-(c:country)
where t.testMetricName=’Successful Pinches’ with
p.id as participant_id, collect(distinct t.id) as id_test, count(cd) as num_of_tests, t.testResultMetricValue as result order by num_of_tests desc limit 50
return distinct participant_id, avg(num_of_tests) as avg_tests, avg(toint(result)) as avg_result order by avg_tests desc

It does not look like this relationship can be established for this test.

I would have loved to play a bit more with the data but the flight was short :) Roche says they are preparing to share more via API’s. In this era of clinical development, I am super excited about the possibilities created by open collaboration. Congrats to Roche for this really great initiative, I look forward to playing more with the data in the future as it gets enriched and extended.

--

--