Wednesday, September 3, 2014

Getting Started with Titan Graph, Gremlin-Scala, and Play!

Hello Graph


So, I've been trying out different graph databases for my next project. I really like the way they organize things; it seems to match my thinking quite a bit. I wanted to stay open source, so I skipped Neo4j due to its licensing. I first tried OrientDB, which was nice, but then I came across Titan, a distributed graph database. Titan is built on top of your choice of storage backend: BerkeleyDB, Cassandra, or HBase. BerkeleyDB is meant for small to medium graphs, but it performs much faster because it runs locally, whereas Cassandra and HBase can be distributed across a cluster and hold massive graphs. For my use case I went with BerkeleyDB, because it's fast and I don't think I'll approach its ceiling with my next app.


Titan's ties to the Hadoop ecosystem give it some interesting OLAP features. Titan comes packed with Faunus, which allows you to run Hadoop operations on the graph. There is also talk of integrating Apache Spark, which would be amazing. This feature is mainly used for whole-graph analysis. A graph database is incredibly fast when traversing out from a single node, but very slow when traversing the entire graph. Being able to load the graph into Hadoop drastically speeds up full graph traversals, although it is somewhat expensive. Because I plan to use my graph database primarily for analytics, Titan is a good fit for my application.

Finally, I need some of my graph data to be available in real time. While most graph databases offer Lucene indexing, Titan gives you the option of Elasticsearch or Lucene. You can also choose to run Elasticsearch remotely or embedded. Running it embedded gives faster query times, but you cannot access the index from outside the application. I personally chose Elasticsearch over Lucene, but it is really a matter of preference. I like that Titan offers both options, and you can even index into both of them if you want, although that would also get a little expensive.

I like Scala

I have become a huge fan of Scala over the last year. After having coded in Python, PHP, and Java, I find that Scala is clean, fast, and efficient. I love that it mixes right in with Java on the JVM, giving you Java's extensive libraries and the JVM's performance. I also really enjoy Play Framework. It organizes my projects well, speeds up my process, and performs under pressure. Pretty much every app I build now is on Play, and I have had no regrets to date.

Scala works well with Titan, because Titan is written in Java. Moreover, Titan is built with TinkerPop at its core, which means I can use Gremlin-Scala, which is awesome. It allows a nice clean integration of Titan into Play. Now, if I can just get Apache Spark in the mix, all will be well with the world.

Getting Started

As fond as I am of Titan now, I did have some trouble getting it started. The docs are decent and up to date, but everything is written in Groovy, so it can be a bit difficult to translate. Also, I like that TinkerPop is built into Titan, but at times it's difficult to tell where Titan ends and TinkerPop begins. Once you're on your feet, it gets a lot easier, so hopefully I can get you there. All the code I'm sharing is also a complete Play project on my GitHub, if you just want a working version.

First add your Titan dependencies to Play's build.sbt file.
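
A rough sketch of what that dependency block can look like, assuming Titan 0.5.0 with the BerkeleyDB and Elasticsearch modules. The exact coordinates and version numbers here are assumptions on my part, so check them against the current Titan and Gremlin-Scala releases before copying.

```scala
// build.sbt (fragment)
// Versions and the Gremlin-Scala coordinates are assumptions; verify them
// against the releases you actually want to use.
libraryDependencies ++= Seq(
  "com.thinkaurelius.titan" % "titan-core"      % "0.5.0", // Titan itself
  "com.thinkaurelius.titan" % "titan-berkeleyje" % "0.5.0", // BerkeleyDB storage backend
  "com.thinkaurelius.titan" % "titan-es"         % "0.5.0", // Elasticsearch index backend
  "com.michaelpollmeier"    % "gremlin-scala"    % "2.5.0"  // Gremlin-Scala DSL
)
```

If you pick Cassandra or HBase instead of BerkeleyDB, you would swap the storage module for the matching Titan artifact.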


Then, you need a basic object for calling up an instance of your graph.
It is important that you index your desired properties before adding anything to the graph.
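
Here is a minimal sketch of such an object, assuming the Titan 0.5 management API, BerkeleyDB storage, and embedded Elasticsearch. The object name `GraphDb`, the `/tmp/titan-demo` paths, the `name` property, and the `byName` index are all placeholders I made up for illustration.

```scala
import org.apache.commons.configuration.BaseConfiguration
import com.thinkaurelius.titan.core.{TitanFactory, TitanGraph}
import com.tinkerpop.blueprints.Vertex

// Hypothetical holder object; names and paths are illustrative only.
object GraphDb {

  private val conf = new BaseConfiguration()
  // Local BerkeleyDB storage (directory is an assumed example path).
  conf.setProperty("storage.backend", "berkeleyje")
  conf.setProperty("storage.directory", "/tmp/titan-demo/db")
  // Embedded Elasticsearch registered under the index name "search".
  conf.setProperty("index.search.backend", "elasticsearch")
  conf.setProperty("index.search.directory", "/tmp/titan-demo/es")
  conf.setProperty("index.search.elasticsearch.local-mode", "true")
  conf.setProperty("index.search.elasticsearch.client-only", "false")

  // Opened once, on first access, and shared by the whole application.
  lazy val graph: TitanGraph = {
    val g = TitanFactory.open(conf)
    val mgmt = g.getManagementSystem
    // Define the key and its index only once, before any data is written.
    if (!mgmt.containsRelationType("name")) {
      val name = mgmt.makePropertyKey("name").dataType(classOf[String]).make()
      mgmt.buildIndex("byName", classOf[Vertex]).addKey(name).buildCompositeIndex()
    }
    mgmt.commit()
    g
  }
}
```

The composite index covers exact-match lookups on `name`. Properties that need full-text or range search would go into a mixed index backed by the `search` index instead.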


Once you have indexed your properties you can begin adding to the graph.
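
A small sketch of adding data, written against the plain Blueprints interfaces that Titan implements. The `People` object, the `name` property, and the `knows` edge label are invented for the example.

```scala
import com.tinkerpop.blueprints.{TransactionalGraph, Vertex}

// Hypothetical helper; property and label names are illustrative only.
object People {

  // Creates one vertex and sets its (indexed) name property.
  def addPerson(g: TransactionalGraph, name: String): Vertex = {
    val v = g.addVertex(null) // null lets the graph assign the id
    v.setProperty("name", name)
    v
  }

  // Adds two people and an edge between them in a single transaction.
  def seed(g: TransactionalGraph): Unit = {
    try {
      val alice = addPerson(g, "alice")
      val bob = addPerson(g, "bob")
      g.addEdge(null, alice, bob, "knows")
      g.commit()
    } catch {
      case e: Exception =>
        g.rollback() // discard the partial write before rethrowing
        throw e
    }
  }
}
```

Nothing is persisted until `commit()` is called, so the rollback keeps a failed write from leaving half of the data behind.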



Now, you can use Gremlin-Scala to query it.
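
As a rough illustration of the kind of lookup involved, here is a sketch written against the underlying Blueprints API rather than the Gremlin-Scala DSL, so that it does not depend on a particular Gremlin-Scala version. The `Queries` object, the `name` property, and the `knows` label are assumptions carried over for the example.

```scala
import scala.collection.JavaConverters._
import com.tinkerpop.blueprints.{Direction, Graph}

// Hypothetical helper; property and label names are illustrative only.
object Queries {

  // Names of everyone the given person "knows" (outgoing edges only).
  def friendsOf(g: Graph, name: String): List[String] =
    g.getVertices("name", name).asScala.toList          // start from the matching vertices
      .flatMap(_.getVertices(Direction.OUT, "knows").asScala) // step across "knows" edges
      .map(_.getProperty[String]("name"))                // read the name of each neighbour
}
```

Gremlin-Scala expresses the same start, step, and property read as a single chained pipeline, which reads much closer to the Groovy examples in the Titan docs.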

Now you should be set to go with your Titan graph. Just scroll through the Gremlin-Scala and Titan docs for further info. If you have trouble getting it going, feel free to clone my GitHub project.

1 comment:

  1. Can you tell instead of using the BaseConfiguration which you are using how to point to embedded cassandra?
