Text & Semantic Analysis — Machine Learning with Python

SHAMIT BAGCHI
3 min read · Jan 22, 2017

Algorithms from SaaS machine learning platforms such as Aylien, Algorithmia, and MonkeyLearn make it easy! Image Source: Aylien

In machine learning, semantic analysis of a corpus (a large and structured set of texts) is the task of building structures that approximate concepts from a large set of documents. It generally does not involve prior understanding of the documents.

I have recently been trying out different APIs for text analytics and semantic analysis using machine learning, and I have stuck to coding in Python. To go directly to my code samples, here is the GitHub link: https://github.com/shamitb/text_analytics

Algorithmia — Many text analytics, NLP and entity extraction algorithms are available as part of their cloud based offering. Some algorithms tried out include:

  • Part of speech tagging using OpenNLP: http://opennlp.apache.org/ The part-of-speech (POS) tagger marks each token with its word type, based on the token itself and its context. A token can have multiple POS tags depending on the context in which it appears. The OpenNLP POS tagger uses a probability model to predict the correct POS tag from the tag set. To limit the possible tags for a token, a tag dictionary can be used, which improves both the tagging accuracy and the runtime performance of the tagger. Parts of speech are tagged according to the conventions of the Penn Treebank Project (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). For example, a plural noun is tagged NNS, a singular or mass noun NN, and a determiner (such as a/an, every, no, the, another, any, some, each) DT.
  • Tokenizer: https://algorithmia.com/algorithms/ApacheOpenNLP/TokenizeBySentence
  • Auto tagging of text: Algorithm uses a variant of nlp/LDA to extract tags / keywords — https://algorithmia.com/algorithms/nlp/AutoTag
  • Parsey McParseface: A popular language parser, built with TensorFlow, that Google recently open-sourced. The Parsey McParseface neural model is a highly accurate sentence parser and part-of-speech tagger that can be used for computational linguistics problems such as sentiment analysis and comparative opinions. It can also be used to build intelligent chatbots.
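
The tag-dictionary idea from the part-of-speech bullet above can be sketched in plain Python. This is not OpenNLP's model: the tiny training corpus, the add-one smoothing, and the greedy left-to-right decoding are all assumptions made for illustration. The tags follow the Penn Treebank conventions.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labelled corpus using Penn Treebank tags (illustration only).
TRAINING = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("dogs", "NNS"), ("bark", "VBP")],
    [("a", "DT"), ("bark", "NN"), ("is", "VBZ"), ("loud", "JJ")],
    [("every", "DT"), ("dog", "NN"), ("barks", "VBZ")],
]


def train(sentences):
    """Count word -> tag and previous-tag -> tag occurrences."""
    emission = defaultdict(Counter)    # doubles as the tag dictionary
    transition = defaultdict(Counter)  # crude stand-in for a context model
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            emission[word][tag] += 1
            transition[prev][tag] += 1
            prev = tag
    return emission, transition


def tag(tokens, emission, transition):
    """Greedily tag tokens, restricting candidates via the tag dictionary."""
    all_tags = sorted({t for counts in emission.values() for t in counts})
    prev, tagged = "<s>", []
    for token in tokens:
        counts = emission.get(token.lower(), Counter())
        # Known words may only take tags seen with them; unknown words may take any tag.
        candidates = sorted(counts) or all_tags
        # Add-one smoothed score: word/tag count times previous-tag/tag count.
        best = max(candidates,
                   key=lambda t: (counts[t] + 1) * (transition[prev][t] + 1))
        tagged.append((token, best))
        prev = best
    return tagged


EMISSION, TRANSITION = train(TRAINING)
print(tag(["The", "dogs", "bark"], EMISSION, TRANSITION))
# [('The', 'DT'), ('dogs', 'NNS'), ('bark', 'VBP')]
print(tag(["A", "bark"], EMISSION, TRANSITION))
# [('A', 'DT'), ('bark', 'NN')]
```

Note how "bark" receives a different tag in each sentence: the dictionary narrows it to NN or VBP, and the previous tag decides between them. A real tagger would use a trained statistical model over far richer context.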

Aylien — Classification by Taxonomy: https://developer.aylien.com/

Figure: Approaches used include OCR and entity extraction

Named Entity Recognition (StanfordNLP/NamedEntityRecognition): This algorithm retrieves recognized entities from a body of text using the Stanford NLP library. Currently it identifies named noun-type entities such as PERSON, LOCATION, ORGANIZATION and MISC, as well as the numerical types MONEY, NUMBER, DATE, TIME, DURATION and SET. https://algorithmia.com/algorithms/StanfordNLP/NamedEntityRecognition
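
Calling a hosted algorithm such as this one from Python could look roughly like the sketch below. It assumes the `Algorithmia` client package (`pip install algorithmia`) and a valid API key; the environment variable name and the helper functions are hypothetical, and the exact shape of the returned result is not assumed here, so it is simply printed as-is.

```python
import os

# Algorithm path taken from the URL above.
NER_ALGORITHM = "StanfordNLP/NamedEntityRecognition"


def make_client(api_key):
    """Create an Algorithmia client from an API key."""
    import Algorithmia  # imported lazily so run_algorithm works with any client object

    return Algorithmia.client(api_key)


def run_algorithm(client, algorithm_path, payload):
    """Send `payload` to a hosted algorithm and return its raw result."""
    return client.algo(algorithm_path).pipe(payload).result


if __name__ == "__main__":
    # ALGORITHMIA_API_KEY is a hypothetical variable name used for this sketch.
    api_key = os.environ.get("ALGORITHMIA_API_KEY")
    if api_key:
        text = "Shamit flew from Bangalore to Berlin on Monday and paid $500."
        print(run_algorithm(make_client(api_key), NER_ALGORITHM, text))
```

The same `run_algorithm` helper should work for the other algorithm paths mentioned in this post, for example `nlp/AutoTag`, although each algorithm defines its own input and output format.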

Image Source: Aylien

Concept Extraction: Identify an author’s intent with word sense disambiguation, for example whether "apple" refers to the fruit or the company.
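
To illustrate what word sense disambiguation means in practice, here is a minimal sketch of the simplified Lesk approach: pick the sense whose description shares the most words with the sentence. This is not how Aylien does it; the two sense descriptions and the stopword list are made up for the example.

```python
import re

# Hypothetical sense descriptions ("glosses") for the word "apple".
SENSES = {
    "fruit": "round edible fruit that grows on a tree and is eaten fresh or baked in a pie",
    "company": "technology company that designs and sells phones computers and software",
}

STOPWORDS = {"a", "an", "and", "from", "i", "in", "is", "its", "on", "or",
             "that", "the", "with"}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def disambiguate(sentence, senses=SENSES):
    """Return the best sense and the overlap count for every sense."""
    context = content_words(sentence)
    overlap = {sense: len(context & content_words(gloss))
               for sense, gloss in senses.items()}
    return max(overlap, key=overlap.get), overlap


print(disambiguate("I baked an apple pie with fruit from the tree"))
# ('fruit', {'fruit': 4, 'company': 0})
print(disambiguate("Apple sells phones and computers"))
# ('company', {'fruit': 0, 'company': 3})
```

Word overlap is a crude signal, and commercial APIs rely on much larger knowledge bases, but the principle of using surrounding context to choose a sense is the same.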

Use LDA to Classify Text Documents — LDA is an algorithm that can be used to generate topics to understand a document’s general theme: http://blog.algorithmia.com/lda-algorithm-classify-text-documents/
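
For a feel of how LDA arrives at topics, the sketch below implements a small collapsed Gibbs sampler in plain Python. It is a teaching toy, not the Algorithmia nlp/LDA implementation: the sample documents, hyperparameters (`alpha`, `beta`), and iteration count are assumptions, and results vary with the random seed.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on a list of tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total words per topic
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        assignments.append(zs)
        for word, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment, then resample it.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    # Smoothed topic mixture per document and the top words per topic.
    doc_mix = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [w for w, c in topic_word[k].most_common() if c > 0][:3]
        for k in range(num_topics)
    ]
    return doc_mix, top_words


DOCS = [
    "apple banana fruit juice".split(),
    "banana fruit smoothie apple".split(),
    "python code software bug".split(),
    "software code python compiler".split(),
]

mixtures, topics = lda_gibbs(DOCS)
for k, words in enumerate(topics):
    print("topic", k, words)
for doc, mix in zip(DOCS, mixtures):
    print(" ".join(doc), [round(p, 2) for p in mix])
```

Each document ends up with a mixture over topics, and each topic is summarised by its most frequent words; the theme of a document can then be read off from its dominant topic.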

Image Source: Aylien

MonkeyLearn: Taxonomy Classifier: https://app.monkeylearn.com/main/classifiers/cl_b7qAkDMz/tab/tree-sandbox/

Tesseract OCR in Algorithmia: https://algorithmia.com/algorithms/tesseractocr/OCR

Create PDF using ReportLab PLUS: https://www.reportlab.com/reportlabplus/

Overall, Algorithmia and Aylien are powerful! Let me know if you come across better cloud-based APIs and offerings for machine learning or for semantic and text analytics, or write to me at: shamit dot bagchi at deu dot kyocera dot com

CODE SAMPLES here — let me know and we could collaborate: https://github.com/shamitb/text_analytics
