<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Dx dy on Medium]]></title>
        <description><![CDATA[Stories by Dx dy on Medium]]></description>
        <link>https://medium.com/@dxdy?source=rss-22d505a81939------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*dmbNkD5D-u45r44go_cf0g.png</url>
            <title>Stories by Dx dy on Medium</title>
            <link>https://medium.com/@dxdy?source=rss-22d505a81939------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 24 May 2026 04:26:18 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@dxdy/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Utilizing dependency trees and NER models for relation extraction task]]></title>
            <link>https://medium.com/@dxdy/utilizing-dependency-trees-and-ner-models-for-relation-extraction-task-effa5463cb8?source=rss-22d505a81939------2</link>
            <guid isPermaLink="false">https://medium.com/p/effa5463cb8</guid>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[spacy]]></category>
            <dc:creator><![CDATA[Dx dy]]></dc:creator>
            <pubDate>Fri, 26 Mar 2021 13:03:33 GMT</pubDate>
            <atom:updated>2021-03-26T13:03:33.296Z</atom:updated>
            <content:encoded><![CDATA[<p>In some areas, machine learning solutions rarely consist of a single model, and NLP is one of them. You do a lot of linguistic preprocessing when you prepare data for the model. This article focuses on syntactic and semantic analysis of natural language, using both modern approaches and some “old school” technology for the relation extraction task.</p><p>For the sake of simplicity, imagine we are building a conversational model, chatbot, or intelligent assistant. It is not so hard to create one, but it is definitely hard to make one that stands out. Say you need to develop a module where the user enters text or uses voice (voice is another story entirely: it may raise the “level of cool” from the user’s perspective and make the NLP engineer’s life much harder :)). Usually the pipeline consists of an intent recognition module for understanding what the user wants, a NER component for recognizing entities and their properties, and a dialogue policy for managing the whole conversation. Let’s focus on one problem of the NER component, which may not seem like a problem at all but is very interesting to solve once you get into the details. Call it the “lemon pasta problem”.</p><p><em>“I want an orange juice and lemon pasta”</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7dfNe_ohFU33-ZNWjJAioA.jpeg" /><figcaption>Pasta alla limone. Created using Dx dy</figcaption></figure><p>Let’s import everything we need and create a stub for the NER model.</p><p>The main idea of the NER setup here is to keep target entities (products) separate from the tokens or phrases that describe them, such as taste, color, or material. 
The same scheme extends easily to requests where the user enters parameter names and parameter values.</p><pre>import spacy<br>from spacy import displacy<br></pre><pre>nlp = spacy.load(&quot;en_core_web_lg&quot;)</pre><pre>doc = nlp(&quot;I want an orange juice and lemon pasta&quot;)<br>tags = [&quot;O&quot;, &quot;O&quot;, &quot;O&quot;, &quot;U-product-description&quot;, &quot;U-product&quot;, &quot;O&quot;, &quot;U-product-description&quot;, &quot;U-product&quot;]</pre><pre>for w, t in zip(doc, tags):<br>    print(f&quot;{w}\t\t{t}&quot;)</pre><p>Assuming we have a reasonably well-trained NER model, we expect the following output <strong>for this particular case</strong> (the BILOU tagging scheme is preferable here; BILOU stands for Begin, Inside, Last, Outside, Unit. If the term is unfamiliar, look up the different NER tagging schemes).</p><pre>I		O<br>want		O<br>an		O<br>orange		U-product-description<br>juice		U-product<br>and		O<br>lemon		U-product-description<br>pasta		U-product</pre><p>The question is: how does the ML part understand that the juice needs to be orange and the pasta needs to be lemon, and not vice versa? I dare say some might not notice any difference between lemon and orange pasta if they haven’t tried either, but anyone can tell lemon juice from orange juice.</p><h3>Dependency trees</h3><p>Using a dependency tree seems the most straightforward idea. Such trees can be built using spaCy or NLTK. Consider reading the <a href="https://spacy.io/usage/linguistic-features#dependency-parse">spaCy documentation</a> and/or <a href="https://en.wikipedia.org/wiki/Parse_tree">Wiki</a> if the term does not ring a bell. The visualization below was created with spaCy. Nouns (“juice” and “pasta” in this case) are often connected to adjectives, adverbs, and other nouns that describe them in some way. Those describing words, in turn, have a “head connection” to the nouns. 
The dependency tree below shows that the words “orange” and “lemon” have a “head connection” (against the direction of the arrows) to the nouns “juice” and “pasta”.</p><pre>doc = nlp(&quot;I want an orange juice and lemon pasta&quot;)<br>options = {&quot;bg&quot;: &quot;white&quot;, &quot;distance&quot;: 130,<br>           &quot;color&quot;: &quot;black&quot;, &quot;font&quot;: &quot;Source Sans Pro&quot;}</pre><pre>displacy.render(doc, style=&quot;dep&quot;, options=options)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*alTexO2U0ESxFflBnLd8RA.png" /><figcaption>Dependency tree. The dependency labels between words are shown just below the arrows</figcaption></figure><p>Bearing this in mind, we have to traverse certain nodes of the tree to find the pairs of products (dishes) and the words or phrases that describe them most accurately. There are two options. You can start from a “noun node” like “juice” or “pasta” and run a BFS over all of its child nodes, or you can start from the opposite side and just go up to the parent node. 
In my opinion, the second approach is simpler to implement and therefore leaves less room for error.</p><p>This utility function finds the entities in a sentence along with their start/end indices (token-level, not character-level).</p><pre>def get_entity_indices(document, tags):<br>    result = []<br>    <br>    start_index = -1<br>    end_index = -1<br>    <br>    # Substring checks are deliberate: &quot;U-product&quot; also matches &quot;U-product-description&quot;<br>    for i, (w, t) in enumerate(zip(document, tags)):<br>        if &quot;U-product&quot; in t or &quot;B-product&quot; in t:<br>            start_index = i<br>        if &quot;U-product&quot; in t or &quot;L-product&quot; in t:<br>            end_index = i<br>        <br>        # &gt;= 0 so that an entity at token 0 is not skipped<br>        if start_index &gt;= 0 and end_index &gt;= 0:<br>            result.append({<br>                &quot;value&quot;: &quot; &quot;.join([document[j].text for j in range(start_index, end_index + 1)]),<br>                &quot;entity_type&quot;: t[2:],<br>                &quot;start&quot;: start_index,<br>                &quot;end&quot;: end_index,<br>            })<br>            <br>            start_index = -1<br>            end_index = -1<br>            <br>    return result</pre><p>Here comes the tree traversal part. The key idea: keep moving to the head node until you find what you are looking for or reach the root of the tree.</p><pre>def find_head_relation_index(token, tags):<br>    product_tags = [&quot;B-product&quot;, &quot;I-product&quot;, &quot;L-product&quot;, &quot;U-product&quot;]<br>    <br>    while True:<br>        if tags[token.i] in product_tags:<br>            return token.i<br>        <br>        # The root is its own head: nothing was found on the way up<br>        if token == token.head:<br>            return None<br>        <br>        token = token.head</pre><p>Now comes the final function that wraps it all up. Since ML does not guarantee 100% accuracy, you have to either leave room for errors or handle them. 
Models in an ensemble like this rely heavily on each other, so keep in mind that we may be working with errors from the previous model as well as incorrect output from the current one.</p><p>We iterate through tokens and tags simultaneously, and if the tag is <em>*-product-description</em> we try to “wire it up” to a <em>*-product</em> via “head transitions”. However, we need to keep every entity found by NER even when the tree traversal yields no result for it; that is what the workaround with <em>entity_indices</em> is for.</p><pre>def process_dependency_tree(document, tags):<br>    result = []<br>    <br>    # Will be further used for &quot;not found&quot; entities aka &quot;appendix&quot;<br>    entity_indices = get_entity_indices(document, tags)<br>    <br>    def extract_entity(index):<br>        value = None<br>        deletion_ix = -1<br>        <br>        for i, entity in enumerate(entity_indices):<br>            if entity[&quot;start&quot;] &lt;= index &lt;= entity[&quot;end&quot;]:<br>                value = entity[&quot;value&quot;]<br>                deletion_ix = i<br>                break<br>        <br>        ### Once an entity is successfully found, it is removed from the &quot;appendix&quot;<br>        ### Appendix contains entities which cannot be bound/were tagged by NER-model incorrectly<br>        ### This way information losses will be minimized<br>        ### (&gt;= 0 so that the first entity in the list can be removed too)<br>        if deletion_ix &gt;= 0:<br>            entity_indices.pop(deletion_ix)<br>                <br>        return value            <br>    <br>    for i, (w, t) in enumerate(zip(document, tags)):<br>        # It is easier to traverse from &quot;the end&quot; of the &quot;description&quot;<br>        if t == &quot;U-product-description&quot; or t == &quot;L-product-description&quot;:<br>            product_index = find_head_relation_index(token=w, tags=tags)<br>            <br>            # No product on the way to the root: the description stays in the appendix<br>            if product_index is None:<br>                continue<br>            <br>            product = extract_entity(product_index)<br>            description = extract_entity(i)<br>            <br>            result.append({<br>                &quot;product&quot;: product,<br>                &quot;description&quot;: description<br>            })<br>    <br>    return {<br>        &quot;successfully_processed&quot;: result,<br>        &quot;appendix&quot;: entity_indices<br>    }</pre><p>Let’s try processing the dependency tree with both correct and incorrect NER tagging results, focusing on the errors and how to deal with them.</p><p><strong>Case 1. Everything is correct.</strong></p><pre>print(process_dependency_tree(doc, tags))</pre><p>This yields:</p><pre>{&#39;successfully_processed&#39;: [{&#39;product&#39;: &#39;juice&#39;, &#39;description&#39;: &#39;orange&#39;},<br>  {&#39;product&#39;: &#39;pasta&#39;, &#39;description&#39;: &#39;lemon&#39;}],<br> &#39;appendix&#39;: []}</pre><p>We have managed to establish the dependencies correctly.</p><p><strong>Case 2. NER error</strong></p><p>Let’s intentionally introduce an error into the tagging results and see how the “appendix trick” handles it.</p><pre>tags_with_error = [&quot;O&quot;, &quot;O&quot;, &quot;O&quot;, &quot;U-product-description&quot;, &quot;U-product&quot;, &quot;O&quot;, &quot;B-product&quot;, &quot;L-product&quot;]</pre><p>We have merged the last two tokens into one entity, a rather common type of mistake for a NER model.</p><pre>I		O<br>want		O<br>an		O<br>orange		U-product-description<br>juice		U-product<br>and		O<br>lemon		B-product<br>pasta		L-product</pre><p>Nevertheless, the entity has not been dropped:</p><pre>process_dependency_tree(doc, tags_with_error)</pre><pre>{&#39;successfully_processed&#39;: [{&#39;product&#39;: &#39;juice&#39;, &#39;description&#39;: &#39;orange&#39;}],<br> &#39;appendix&#39;: [{&#39;value&#39;: &#39;lemon pasta&#39;,<br>   &#39;entity_type&#39;: &#39;product&#39;,<br>   &#39;start&#39;: 6,<br>   &#39;end&#39;: 7}]}</pre><p><strong>Case 3. More complex dependency tree</strong></p><p>Alas, head traversal does not handle all the cases. 
More complex trees are omitted here for the sake of simplicity. In such cases you need to check whether there is a path of specific dependency relations between the entity and its description. Try entering examples of your own and examining their dependency trees. Now let’s mention a couple of alternatives that can serve as a “plan B” as well as stand-alone solutions.</p><h3><strong>Alternatives</strong></h3><p><strong>Constituency tree</strong></p><p>Try operating at the noun-phrase/verb-phrase level for information extraction.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VBhuME8jBpqYra4yNYZ2Ig.png" /><figcaption>Constituency tree built using a simple regexp grammar and NLTK. Consider using tools like Stanford NLP for more serious tasks.</figcaption></figure><p><strong>Transformers attention layer</strong></p><p>Extract relations from the last <strong>attention layers</strong> of transformer models like BERT. Tools for visualizing attention layers are a great help here. Take your time selecting an appropriate model and layer. The outputs can be somewhat unpredictable from a human perspective, though :)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/554/1*5FGEULSkitSkWeVmoTjrow.png" /><figcaption>BERT base uncased visualization created using BertViz</figcaption></figure><h3><strong>Wrapping it all up</strong></h3><p>Relation extraction may not be very beginner-friendly, especially when your journey into the world of NLP has just begun. Here are a couple of links that may help:</p><ul><li>Star of the show: <a href="https://en.wikipedia.org/wiki/Parse_tree">parse trees</a></li><li><a href="https://en.wikipedia.org/wiki/Dependency_grammar">Dependency grammar</a></li><li><a href="https://spacy.io/usage/visualizers">spaCy visualization</a></li><li><a href="https://github.com/jessevig/bertviz">BertViz</a></li></ul>]]></content:encoded>
        </item>
    </channel>
</rss>