Very cool. You should look into Hive for querying your non-relational data. It compiles SQL-like queries (HiveQL) into MapReduce jobs, so you can query the data quickly and in a familiar way with (mostly) typical SQL. You can plop Apache Solr in front of it and now you have a kick-ass search engine.
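
For anyone who hasn't seen MapReduce before, here's roughly the shape of the job behind something like a GROUP BY count. This is a toy sketch in plain Python, not Hive's actual code; the sample records and the function names (`map_phase`, `shuffle`, `reduce_phase`) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample records, invented for this sketch.
reports = [
    {"id": 1, "text": "fracture left wrist"},
    {"id": 2, "text": "laceration left hand"},
    {"id": 3, "text": "fracture right ankle"},
]

def map_phase(record):
    # Emit one (term, 1) pair per word, the way a mapper would.
    for term in record["text"].split():
        yield term, 1

def shuffle(pairs):
    # Group values by key, standing in for the framework's shuffle/sort step.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the values for each key, the way a reducer would.
    return {key: sum(values) for key, values in grouped.items()}

pairs = [pair for record in reports for pair in map_phase(record)]
term_counts = reduce_phase(shuffle(pairs))
print(term_counts["fracture"])
```

On a real cluster the map and reduce steps run in parallel across many machines, which is the whole point; here everything runs in one process just to show the data flow.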

Did something similar in a project for a large health care provider working with petabytes of unstructured medical data. We had medical observation reports and lab reports going back to 1980s line-printer feeds. In the end I could not only query quickly on data within the documents, but also run analytics on the result sets using other tools to answer many interesting questions in an anonymized, HIPAA-compliant way.

One such interesting correlation we found was that dildo- and sex-toy-related injuries skyrocketed around the time the Internet really took off. Answering the medical questions of our time.