Hackyo’s Update — Week 4

Haryo Akbarianto Wibowo
Published in Inspire Crawler
3 min read · Mar 13, 2016

Introduction

What’s up? I’m here to tell you what I’ve done this week. Here you go.

What I’ve Done So Far

We’ve entered our 4th week. The sprint has begun, so I’ve started coding the crawler. I’m also writing a tutorial for my friend on how to code the crawler. Here’s the list of what I’ve done:

Code Filterer

There are 3 features that will be implemented in our Inspire Crawler:

  • Crawler
  • Database
  • Filterer

The filterer is without doubt the core of this program. It filters the text that the Crawler has taken from a website and returns a list of quotes to be inserted into the database. The hardest part is deciding how to filter it.

I can think of 2 techniques that could filter the text:

  1. Machine Learning
  2. Improved Regular Expression.

Option 1 (Machine Learning) is not feasible right now because we don’t have any training data, so the only option left is the Improved Regular Expression. Why “improved”? It’s not a plain regex: it runs on text that has first been labeled with the help of NLP (Natural Language Processing). Here’s the Improved Regular Expression that I formulated in Java:

// Requires: java.util.List, java.util.regex.Matcher, java.util.regex.Pattern

/**
 * Filters quotes out of the text taken from a website.
 *
 * @param textDariWebsite raw text from the crawler ("text from the website")
 * @return the quotes found; this draft only prints them and still returns null
 */
public List<Quote> quote(String textDariWebsite) {
    // For Unix line endings (currently unused)
    String pattern = "(\"\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP\"|\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP)(\\n(-|~|)(.*\\\\People)*|\\s*(-|~)(.*\\\\People)*)";
    // For Windows line endings
    String pattern2 = "(\"\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP\"|\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP)(\\r\\n(-|~|)(.*\\\\People)*|\\s*(-|~)(.*\\\\People)*)";
    //Filter these out…
    Pattern p = Pattern.compile(pattern2);
    // System.out.println(tagger(textDariWebsite));
    Matcher m = p.matcher(tagger(textDariWebsite));
    while (m.find()) {
        System.out.println("Found a quote sentence :\n " + m.group(0));
        // Debug: dump every group (groups run from 0 to groupCount() inclusive)
        // System.out.println("Total Group : " + m.groupCount());
        // for (int i = 0; i <= m.groupCount(); i++) {
        //     System.out.println("Group " + i + " : \n" + m.group(i));
        // }
        String quote = m.group(1);
        // The author is in group 4 or group 6, depending on which branch matched;
        // a group that did not take part in the match is null.
        String author = m.group(4) != null ? m.group(4) : m.group(6); //TODO differ for each…
        System.out.println("Quote : " + quote);
        System.out.println("Author : " + author);
        System.out.println(" -- Delete These Tags!! -- ");
        // TODO: still wasteful (one replace call per tag)
        quote = quote.replace("{", "").replace("}", "").replace("\\NP", "").replace("\\VP", "");
        // TODO: also wasteful
        author = (author == null) ? "" : author.replace("\\People", "");
        System.out.println("Clear Quote : " + quote);
        System.out.println("Clear Author : " + author);
    }
    // TODO: collect the matches into a List<Quote> and return it
    return null;
}
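The draft above only prints its matches and returns null. Here is a minimal sketch of how the matches could be collected into a list instead. It is an illustration, not the code in the repository: it assumes the text is already tagged, the `Quote` class is a hypothetical stand-in (the project's real `Quote` type is not shown in this post), `QuoteFilterSketch`, `extract` and `stripTags` are made-up names, and the sample sentence is hand-tagged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteFilterSketch {

    // Hypothetical stand-in for the project's Quote type (not shown in the post).
    static class Quote {
        final String text;
        final String author; // empty when no author was tagged

        Quote(String text, String author) {
            this.text = text;
            this.author = author;
        }
    }

    // The Unix pattern from the post, unchanged (only split across two lines).
    static final Pattern QUOTE_PATTERN = Pattern.compile(
        "(\"\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP\"|\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP)"
        + "(\\n(-|~|)(.*\\\\People)*|\\s*(-|~)(.*\\\\People)*)");

    // Collects every match in already-tagged text into a list.
    static List<Quote> extract(String taggedText) {
        List<Quote> quotes = new ArrayList<>();
        Matcher m = QUOTE_PATTERN.matcher(taggedText);
        while (m.find()) {
            // Group 4 holds the author for the "next line" branch,
            // group 6 for the "same line" branch; either may be null.
            String author = m.group(4) != null ? m.group(4) : m.group(6);
            quotes.add(new Quote(stripTags(m.group(1)),
                                 author == null ? "" : stripTags(author)));
        }
        return quotes;
    }

    // Removes the tagger markup: braces and the \NP, \VP, \People labels.
    static String stripTags(String s) {
        return s.replaceAll("[{}]|\\\\(NP|VP|People)", "").trim();
    }

    public static void main(String[] args) {
        // Hand-tagged sample, made up for illustration.
        String tagged = "{Practice}\\NP {makes perfect}\\VP\n- Some Author\\People";
        for (Quote q : extract(tagged)) {
            System.out.println(q.text + " / " + q.author);
        }
    }
}
```

Stripping all the tags with one `replaceAll` is one way to address the "wasteful" TODOs; whether that is how the project ends up doing it is an open choice.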

This is the regex part:

// For Unix line endings
String pattern = "(\"\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP\"|\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP)(\\n(-|~|)(.*\\\\People)*|\\s*(-|~)(.*\\\\People)*)";
// For Windows line endings
String pattern2 = "(\"\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP\"|\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP)(\\r\\n(-|~|)(.*\\\\People)*|\\s*(-|~)(.*\\\\People)*)";
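The two patterns differ only in the line break they expect (`\n` versus `\r\n`). One possible simplification, offered as a sketch rather than the project's actual code, is a single pattern that uses `\r?\n` so the same expression accepts both line endings. The class and method names are hypothetical, and the sample strings are hand-tagged for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineEndingDemo {

    // Same expression as in the post, except that \r?\n replaces \n / \r\n.
    static final Pattern ANY_EOL = Pattern.compile(
        "(\"\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP\"|\\{.*\\}\\\\NP .*\\{.*\\}\\\\VP)"
        + "(\\r?\\n(-|~|)(.*\\\\People)*|\\s*(-|~)(.*\\\\People)*)");

    // Returns the raw (still tagged) author from group 4,
    // or null if nothing matched or group 4 did not take part in the match.
    static String rawAuthor(String taggedText) {
        Matcher m = ANY_EOL.matcher(taggedText);
        return m.find() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        String unix = "{Practice}\\NP {makes perfect}\\VP\n~ Some Author\\People";
        String windows = "{Practice}\\NP {makes perfect}\\VP\r\n~ Some Author\\People";
        System.out.println(rawAuthor(unix));
        System.out.println(rawAuthor(windows));
    }
}
```

With the Unix-only pattern, the Windows sample still matches, but through the second branch (`\s*` swallows the `\r\n`), so the author ends up in group 6 and group 4 is null. A single pattern keeps the author in the same group for both line endings.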

The regex matches quotes in the text that contain an NP (noun phrase) and a VP (verb phrase), plus the name of the quote’s author. I looked at a number of quotes, and most of them contain both an NP and a VP. The question is: how do we label which parts are NP, VP, and People?

This is the task that Alief will do later. First, my class will call his class, which adds the tags, and then use the tagged text for filtering. Because Alief hasn’t finished his task yet, I created some dummy code for testing my regex.
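Until the real tagger exists, a stub can stand in for it. The sketch below is only a guess at what such dummy code might look like, not the code in the repository: `DummyTagger` and its `tag` method are hypothetical names, and the markup (`{...}\NP`, `{...}\VP`, `...\People`) is inferred from the regex. It wraps phrases that were labeled by hand, so the filter has something to match against.

```java
class DummyTagger {

    // Builds tagged text from phrases that were labeled by hand.
    // A real tagger would have to find these phrases itself.
    static String tag(String nounPhrase, String verbPhrase, String author) {
        return "{" + nounPhrase + "}\\NP {" + verbPhrase + "}\\VP"
            + "\n- " + author + "\\People";
    }

    public static void main(String[] args) {
        // Prints the tagged quote on one line and the tagged author on the next.
        System.out.println(tag("Practice", "makes perfect", "Some Author"));
    }
}
```

The stub uses a Unix line break and a `-` before the author, which is one of the layouts the Unix pattern accepts.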

You can see the code on our GitHub.

Create Tutorial

I’ve written a tutorial on how to code the crawler. You can find it in the #code channel on Slack.

Conclusion

Yeah, these are the only things I’ve done. I won’t go into my contribution to the team here; my Hustler will post about that later. Thank you.

“Two things are infinite: the universe and human stupidity; and I’m not sure about the universe.” - Albert Einstein
