Text Extraction From URL by Scala

haimei
play-hard-work-hard
3 min read · May 13, 2016

In this post, we look at how to extract text from a URL. Note that special pages (e.g. Facebook posts and Facebook comments) are out of scope here. I will cover extracting Facebook posts in a separate post.

A URL can serve many content types, such as text/*, application/xml, application/xhtml+xml, application/pdf and images. This post supports only the types listed here.

The code comes in three parts. Case 1 handles text/*, application/xml and application/xhtml+xml. Case 2 handles application/pdf. Case 3 handles images such as png and jpg.
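Before going through the cases, here is a minimal sketch of how the content type could be mapped to one of them. The names SourceKind and ContentTypes.classify are my own for illustration and are not part of any library. The sketch assumes the raw Content-Type header may be missing or may carry parameters such as a charset.

```scala
import java.util.Locale

// Hypothetical names, used only for this illustration.
sealed trait SourceKind
case object HtmlLike extends SourceKind    // Case 1: text/*, application/xml, application/xhtml+xml
case object Pdf extends SourceKind         // Case 2: application/pdf
case object Image extends SourceKind       // Case 3: image/*
case object Unsupported extends SourceKind // everything else

object ContentTypes {
  /** Maps a raw Content-Type header value to one of the cases above. */
  def classify(rawContentType: String): SourceKind = {
    // The header can be null (missing) and can carry parameters,
    // e.g. "text/html; charset=UTF-8", so keep only the part before ';'.
    val mime = Option(rawContentType)
      .map(_.takeWhile(_ != ';').trim.toLowerCase(Locale.ROOT))
      .getOrElse("")

    mime match {
      case m if m.startsWith("text/")                  => HtmlLike
      case "application/xml" | "application/xhtml+xml" => HtmlLike
      case "application/pdf"                           => Pdf
      case m if m.startsWith("image/")                 => Image
      case _                                           => Unsupported
    }
  }
}
```

For example, ContentTypes.classify("text/html; charset=UTF-8") gives HtmlLike, and a missing header gives Unsupported.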

Case 1:

import org.jsoup.Jsoup
val doc = Jsoup.connect(<your_url>).get()
getTextByDoc(doc)

Jsoup gives us the doc, but the doc contains many elements we do not need, such as the header and footer. We only want the meaningful content, so we apply some filters. We cannot filter out everything, because in general we cannot know which parts are useful and which are not. What we can do is make a best effort to remove the parts that are commonly useless.

import org.jsoup.nodes.Document

private def getTextByDoc(doc: Document): String = {
  // Remove the <head> and common non-content tags.
  doc.head().remove()
  doc.getElementsByTag("header").remove()
  doc.getElementsByTag("footer").remove()
  doc.getElementsByTag("form").remove()
  doc.getElementsByTag("table").remove()
  doc.getElementsByTag("meta").remove()
  doc.getElementsByTag("img").remove()
  doc.getElementsByTag("a").remove()
  doc.getElementsByTag("br").remove()
  // Remove elements whose class is a commonly used non-content name.
  doc.getElementsByClass("tags").remove()
  doc.getElementsByClass("copyright").remove()
  doc.getElementsByClass("widget").remove()
  // Remove elements whose class attribute contains one of these substrings.
  doc.select("div[class*=foot]").remove()
  doc.select("div[class*=tag]").remove()
  doc.select("div[class*=Loading]").remove()
  doc.select("div[class*=Widget]").remove()
  doc.select("div[class*=Head]").remove()
  doc.select("div[class*=menu]").remove()
  doc.select("p[class*=link]").remove()
  // Collect the text of the remaining paragraphs and divs.
  val paragraphs = doc.select("p")
  val divs = doc.select("div")
  paragraphs.text() + divs.text()
}
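
To see what the filters do on a small input, here is a sketch that runs the same idea on a hand-written page. It assumes jsoup is on the classpath. The object ExtractDemo and its sample HTML are made up for illustration. It also shows one thing to watch for: the text of a div includes the text of the paragraphs nested inside it, so concatenating both can repeat content. Taking the text of what is left in the body is one possible alternative, not necessarily the right choice for every page.

```scala
import org.jsoup.Jsoup

// Hypothetical demo object; the HTML below is a made-up sample page.
object ExtractDemo {
  val html: String =
    """<html><head><title>Demo</title></head><body>
      |<header>Site menu</header>
      |<div class="content"><p>Hello world.</p></div>
      |<footer>Copyright notice</footer>
      |</body></html>""".stripMargin

  /** Removes header and footer, then returns (concatenated, bodyOnly). */
  def extract(): (String, String) = {
    val doc = Jsoup.parse(html)
    doc.select("header, footer").remove()

    // Same concatenation as getTextByDoc: the <p> text is counted
    // once on its own and once more as part of its enclosing <div>.
    val concatenated = doc.select("p").text() + doc.select("div").text()

    // Alternative: the text of whatever remains in <body>, counted once.
    val bodyOnly = doc.body().text()

    (concatenated, bodyOnly)
  }
}
```

On this sample, the first value is "Hello world.Hello world." and the second is "Hello world.".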

Case 2:

For a PDF URL, things are a little more complex. First we check the content type to make sure it is "application/pdf". Then we download the PDF to a local temporary file and extract the text from that file. Finally, we delete the temporary file.

import java.io.File
import java.net.URL
import scala.sys.process._
import scala.util.Random

val url = new URL(<your_url>)
val conn = url.openConnection()
val contentType = conn.getContentType
contentType match {
  case "application/pdf" =>
    // Download the PDF into a randomly named local file.
    val fileName = Random.alphanumeric.take(5).mkString + ".pdf"
    val file = new File(fileName)
    try {
      (url #> file).!!
      getTextFromPDF(None, None, fileName)
    } finally {
      // Delete the temporary file even if the download or extraction fails.
      file.delete()
    }
  case _ => None
}
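
The download, use and delete steps can also be wrapped in a small helper so the temporary file is always cleaned up. This is only a sketch using the Java standard library. TempDownload and withTempCopy are names I made up, and the helper takes any InputStream so that it can be tried without a network connection.

```scala
import java.io.InputStream
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper, not part of the original snippet.
object TempDownload {
  /**
   * Copies `in` into a fresh temporary file, runs `use` on that file,
   * and deletes the file afterwards, even if `use` throws.
   */
  def withTempCopy[A](in: InputStream, suffix: String)(use: Path => A): A = {
    // The JVM picks a unique name in the default temp directory.
    val tmp = Files.createTempFile("download-", suffix)
    try {
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
      use(tmp)
    } finally {
      // Close the stream first, then remove the file in any case.
      try in.close()
      finally Files.deleteIfExists(tmp)
    }
  }
}
```

With a real URL, one could pass url.openStream() as the stream and call getTextFromPDF(None, None, path.toString) inside the block.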

The function below extracts text from a local PDF file. Since I do not know the start page and end page in advance, I pass None for both. By default, the stripper then extracts all pages.

import java.io.File
import org.apache.pdfbox.pdmodel.PDDocument
// PDFBox 1.x package; in PDFBox 2.x this class is org.apache.pdfbox.text.PDFTextStripper.
import org.apache.pdfbox.util.PDFTextStripper
import scala.util.control.NonFatal

private def getTextFromPDF(startPage: Option[Int], endPage: Option[Int], fileName: String): Option[String] = {
  try {
    val pdf = PDDocument.load(new File(fileName))
    try {
      val stripper = new PDFTextStripper()
      // Only restrict the page range when a bound is given.
      startPage.foreach(page => stripper.setStartPage(page))
      endPage.foreach(page => stripper.setEndPage(page))
      Some(stripper.getText(pdf))
    } finally {
      // Always release the document.
      pdf.close()
    }
  } catch {
    // Any non-fatal failure (missing file, broken PDF) yields None.
    case NonFatal(_) => None
  }
}

Case 3:

For images, we need a technology called OCR, which recognizes the text content of an image. So we add java-ocr-api to the project.

Step 1: Add the dependency with one line in build.sbt.

libraryDependencies += "com.asprise.ocr" % "java-ocr-api" % "[15,)"

Step 2: Import the library.

import com.asprise.ocr.Ocr

Step 3: The following snippet shows how to use it. Note that <your_file> here is a File. If you only have a file name or file path, wrap it with new File(<file_name>).

try {
  // Image
  Ocr.setUp()
  val ocr = new Ocr
  ocr.startEngine("eng", Ocr.SPEED_FASTEST)
  val files = List(<your_file>)
  val outputString = ocr.recognize(files.toArray, Ocr.RECOGNIZE_TYPE_ALL, Ocr.OUTPUT_FORMAT_PLAINTEXT)
  ocr.stopEngine()
  Some(outputString)
} catch {
  case e: Exception => None // todo: to support multiple file types
}
