## Data and cutdown versions

In this tutorial I am importing a list of flight routes obtained from openflights.org into a Neo4j graph database. I am using three files available from this source

• airlines.dat
• airports.dat
• routes.dat

As the names suggest, these contain lists of airlines, airports and routes in CSV format. Every route is linked to an airline and it has a source airport and destination airport. I have placed these files in a data folder and a simple line count wc -l data/* gives us the number of records in each

 6048 airlines.dat
8107 airports.dat
67663 routes.dat


To make testing while developing quicker I have created a cutdown version of the data using a python 2.7 script. I write a function that takes an input CSV file, an output file to write (actually overwrite), the column names and a predicate which is a lambda expression on a row that returns a boolean. If the predicate is True for a given row we keep the row in the cutdown file. Finally, if the column_to_return is passed in, a python set is returned with all the values for this column.

from __future__ import print_function

import csv
import sys

def cutdown(inputfile, outputfile, fieldnames, predicate, column_to_return=None):
with open(inputfile) as input, open(outputfile, 'w') as output:
writer = csv.DictWriter(output, fieldnames=fieldnames)
s = set()
for row in reader:
if predicate(row):
writer.writerow(row)
if column_to_return is not None:
return s


Next I filter all airports and airlines in United Kingdom. Because the cutdown function will return a set with all the ids I keep, I can use these sets in the predicate for the routes. I keep only the routes for these airports and airlines.


def main():

country = 'United Kingdom'

airline_fieldnames = ['id','name','alias','iata',
'icao','callsign','country','active']
airline_pred = lambda row: row['country'] == country and row['active'] == 'Y'
airline_ids = cutdown('data/airlines.dat', 'data/airlines_cutdown.dat',
airline_fieldnames, airline_pred, 'id')

airport_fieldnames = ['id','name','city','country','iata_faa',
'icao','latitude','longitude','altitude',
'timezone','dst', 'tz_timezone']
airport_pred = lambda row: row['country'] == country
airport_ids = cutdown('data/airports.dat', 'data/airports_cutdown.dat',
airport_fieldnames, airport_pred, 'id')

route_fieldnames = ['airline','airline_id','source_airport','source_airport_id',
'destination_airport','destination_airport_id',
'codeshare','stops','equipment']
route_pred = lambda row: (row['airline_id'] in airline_ids and
row['source_airport_id'] in airport_ids and
row['destination_airport_id'] in airport_ids)
cutdown('data/routes.dat', 'data/routes_cutdown.dat', route_fieldnames, route_pred)

return 0

if __name__ == '__main__':
sys.exit(main())


A line count on the cutdown files wc -l data/*_cutdown.dat gives

 40 data/airlines_cutdown.dat
210 data/airports_cutdown.dat
217 data/routes_cutdown.dat


## Neo4j server installation

You can follow the official documentation, or, if you are on Ubuntu you can use this post.

## Populating the graph database

To extract the data from the CSV files and store them in Neo4j I constructed a make_graph.py script. I have used python 2.7 and py2neo version 2.0.4. I will show the bits one by one and you can put them together in the order they are presented. Alternatively, the code for this post can be found in this github repo.

The imports you’ll need are

from __future__ import print_function

import csv
import sys

from math import radians, cos, sin, asin, sqrt
from py2neo import Graph, Node, Relationship


I use py2neo to talk to Neo4j, and also define some functions in order to calculate the distance between two airports and place this information on the route node.

The logic for creating airports and airlines is similar, so I write a common function that will read the sourcefile which is a CSV file with columns given in fieldnames, and then create nodes with label on the given graph. In our case the label will be either ‘Airline’ or ‘Airport’.

def create_nodes(graph, label, sourcefile, fieldnames):
with open(sourcefile) as csvfile:
nodes = {}
for row in reader:
node_properties = {p: row[p] for p in fieldnames}
node = Node(label, **node_properties)
graph.create(node)
nodes[row['id']] = node
return nodes


Notice that the function expects that there is an ‘id’ column, and it will return a dictionary of all the nodes created with their ‘id’ being the key. I keep the nodes, as I will later on need to connect them to routes.

Now creating airline and airport nodes is trivial:

def create_airline_nodes(graph, sourcefile):
airline_fieldnames = ['id','name','alias','iata','icao','callsign','country','active']
return create_nodes(graph, 'Airline', sourcefile, airline_fieldnames)

def create_airport_nodes(graph, sourcefile):
airport_fieldnames = ['id','name','city','country','iata_faa',
'icao','latitude','longitude','altitude',
'timezone','dst', 'tz_timezone']
return create_nodes(graph, 'Airport', sourcefile, airport_fieldnames)


For a given pair of aiports, there might be several routes connecting them, and actually for each route there must be one in the opposite direction. I keep a dictionary of distances to save me from computing them again.

known_distances = {}


Given two (latitude, longitude) pairs I can calculate the distance between the two points on the map like this (code taken from this stackoverflow answer)

def haversine(lat1, lon1, lat2, lon2):
lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])

dlat = lat2 - lat1
dlon = lon2 - lon1

a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))

r = 6367 # km
return r * c


Next, I write a function that takes two airport nodes, checks if their distance is already computed and present in known_distances dictionary. The key of this dictionary is the pair of airport id’s. If the distance is not found, I extract the latitude and longitude information and calculate it. Before returning the result I store the result with both orderings of the airport id’s.

def get_distance(source_airport_node, destination_airport_node):
s_id, d_id = (source_airport_node.properties['id'], destination_airport_node.properties['id'])
distance = known_distances.get((s_id, d_id))
if distance is None:
lat1 = float(source_airport_node.properties['latitude'])
lon1 = float(source_airport_node.properties['longitude'])
lat2 = float(destination_airport_node.properties['latitude'])
lon2 = float(destination_airport_node.properties['longitude'])
distance = haversine(lat1, lon1, lat2, lon2)
known_distances[(s_id, d_id)] = distance
known_distances[(d_id, s_id)] = distance
return distance


Creating a route node, is slightly more complicated than the airline and airport nodes. First, notice that I pass the dictionaries with the airline and airport nodes. I connect to its source airport with a FROM relationship and a TO relationship to its destination airport. If any of the two airport id’s is missing, I ignore the route. I connect a route node to its airline via OF relationship. If the airline id is missing, I still create the route node but not connect it to an airline.

def create_route_nodes(graph, sourcefile, airline_nodes, airport_nodes):
route_fieldnames = ['airline','airline_id','source_airport','source_airport_id',
'destination_airport','destination_airport_id',
'codeshare','stops','equipment']
with open(sourcefile) as csvfile:
for row in reader:
node_properties = {p: row[p] for p in ['codeshare','stops','equipment']}
source_airport_id = row['source_airport_id']
destination_airport_id = row['destination_airport_id']
if source_airport_id.isdigit() and destination_airport_id.isdigit():
source_airport_node = airport_nodes[source_airport_id]
destination_airport_node = airport_nodes[destination_airport_id]
dist = get_distance(source_airport_node, destination_airport_node)
route_node = Node('Route', distance=dist, **node_properties)
graph.create(route_node)
graph.create(Relationship(route_node, 'FROM', source_airport_node))
graph.create(Relationship(route_node, 'TO', destination_airport_node))
airline_id = row['airline_id']
if airline_id.isdigit():
graph.create(Relationship(route_node, 'OF', airline_nodes[airline_id]))


Putting all the pieces together:

def main():

neo4j_uri = 'http://localhost:7474/db/data/'

graph = Graph(neo4j_uri)

print("Creating airline nodes...")
airline_nodes = create_airline_nodes(graph, 'data/airlines.dat')

print("Creating airport nodes...")
airport_nodes = create_airport_nodes(graph, 'data/airports.dat')

print("Creating route nodes...")
create_route_nodes(graph, 'data/routes.dat', airline_nodes, airport_nodes)

return 0

if __name__ == '__main__':
sys.exit(main())


Make sure you change the neo4j_uri if you are running your server on a different machine or if you are not using the default port. Also change the paths to the .dat files if you have placed them somewhere else, or if you just want to use the cutdown versions to make smaller graph.

Note that this might take long time to run as I am inserting nodes one by one. For more efficient ways of importing CSV files via cypher have a look here.

## Querying the graph

I will run Cypher queries in the Neo4j browser that will be available on the address http://localhost:7474/browser/. If the Neo4j server is running on a different machine replace localhost with the hostname or ip address of the machine that is running the Neo4j server.

I want to fly from Athens, Greece to Narita Airport in Tokyo, Japan. Checking for direct flights

MATCH (athens:Airport {name:'Eleftherios Venizelos Intl'}),
(tokyo:Airport {name:'Narita Intl'}),
(athens)<-[:FROM]-(route:Route)-[:TO]->(tokyo),
(route)-[:OF]->(airline)
RETURN athens, route, tokyo, airline


returns zero results. So next I check for flights with a stopover at some airport. I want to fly with the same airline on each of the two legs. The query is the following

MATCH (athens:Airport {name:'Eleftherios Venizelos Intl'}),
(tokyo:Airport {name:'Narita Intl'}),
(athens)<-[:FROM]-(leg1:Route)-[:TO]->(stopover:Airport),
(stopover)<-[:FROM]-(leg2:Route)-[:TO]->(tokyo),
(leg1)-[:OF]->(airline1:Airline),
(leg2)-[:OF]->(airline2:Airline)
WHERE airline1.name = airline2.name
RETURN athens, leg1, stopover, leg2, tokyo, airline1, airline2


The result of this query is

Airports are purple nodes, routes are green, and airlines are red. I have customised the airport and airline nodes to show their names instead of their id. If you click on a node of a specific label, a window appears with its properties.

In this window there is a style tab with an “eye” symbol, this is the second tab. In there you can customise the caption of the specific type of node, its size and its colour.

The start and end of the journey are the airports in the middle. There are pairs of routes that are linked to the two airports and the stopover airport. Stopover airports form a ring around the start and end airports. Each stopover airport is connected to two legs, and the two legs are connected to a single airline. Therefore, the airline that stops over at the specific airport is drawn close to the stopover airport in the graph.

This is just an example query. To learn more about the Cypher query language you can start here.