Smart Investigation: Uncovering Hidden Links in Unstructured Data with IBM Watson and i2

Imagine being able to find someone in a crowd. You have never seen him before and you do not know his real name; you only know that he is probably there. Probably there, but hiding. Hiding behind a fictitious identity and a fake name. All this can be uncovered with the solution described below.

The devil is in the detail. New approaches can reveal valuable information that is hidden in unstructured data. Even with state-of-the-art link analysis or data visualization tools at one’s disposal, the “full” picture lacks important information if it is based on only a portion of the available data. IBM Watson and IBM i2 technologies help close that gap, for example by finding hidden entities in paragraphs of electronic conversation or by identifying fictitious entities.

The amount of data available to law enforcement investigations rises every day. It is almost overwhelming, especially data in unstructured form. DATERA’s unique approach combines structured and unstructured data for a better understanding of, and more thorough insight into, the data that organizations already possess.

Even with a good knowledge of the i2 product family, there are still ways to enrich investigation data for better and more accurate results. Within i2, visualizing entities and links hidden in unstructured data is especially challenging. For unstructured data analysis, IBM Watson technology is used to detect and uncover entities and their relations in plain text or in speech transcribed to text. In addition, fake identities and fictitious entities are uncovered using IBM Identity Insight. One of the components included in IBM Identity Insight is IBM InfoSphere Global Name Management, which helps manage, search, analyze, and compare multicultural name datasets by leveraging culture-specific name data and linguistic rules.

Thanks to that, it is possible to discover non-obvious relationships between entities and identities and to make connections that are otherwise hidden in the noise. This method further enriches existing intelligence findings by highlighting details that an investigator might not have noticed before.
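To make the idea of comparing name variants more concrete, here is a minimal sketch in Python. It is not the algorithm used by IBM InfoSphere Global Name Management, which relies on culture-specific data and linguistic rules; it only normalizes names (case, accents, token order) and scores them with the standard library's `difflib`. The function names, the threshold, and the sample names are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, and sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in without_accents.lower())
    return " ".join(sorted(cleaned.split()))


def name_similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 computed on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_matches(names, threshold=0.85):
    """Return pairs of names whose similarity reaches the (assumed) threshold."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = name_similarity(first, second)
            if score >= threshold:
                pairs.append((first, second, round(score, 2)))
    return pairs


# Hypothetical sample names, not taken from any real dataset.
sample = ["Novák, Jan", "Jan Novak", "Petra Svoboda"]
for first, second, score in candidate_matches(sample):
    print(f"{first!r} and {second!r} may be the same identity (score {score})")
```

A pair flagged this way is only a candidate for review by an analyst; a plain string comparison cannot account for nicknames, transliteration, or cultural naming conventions.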

The whole concept is based on a real LEA customer. Due to the sensitivity of the customer’s data, the concrete use case was modelled on the well-known company Enron. As input, we used the following publicly accessible Enron-related data:

  • structured data and datasets,
  • internal emails,
  • Twitter communication

and applied these steps:

  1. IBM Watson unstructured and semi-structured data analysis.
    Provides entity detection and indexing of all unstructured content. This step also transforms outputs into structured form.
  2. Transforming data and loading it into the repository.
    Provides ETL processes between the outputs and centralized investigation repositories.
  3. Visualizing data from central repositories using IBM i2.
    Enables the user to see hidden patterns through advanced analytics.
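The three steps above can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: a regular expression for email addresses stands in for Watson's entity detection, an in-memory SQLite table stands in for the centralized repository, and a co-occurrence query stands in for the data that a charting tool would visualize. The table name, function names, and sample documents are hypothetical.

```python
import re
import sqlite3

# Stand-in for entity detection: finds only email addresses.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect_entities(doc_id, text):
    """Step 1: turn free text into structured (doc_id, type, value) rows."""
    found = sorted({match.lower() for match in EMAIL_PATTERN.findall(text)})
    return [(doc_id, "email_address", value) for value in found]


def load_rows(conn, rows):
    """Step 2: load the structured rows into the (assumed) central repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mention (doc_id TEXT, entity_type TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO mention VALUES (?, ?, ?)", rows)
    conn.commit()


def shared_documents(conn):
    """Step 3 input: pairs of entities mentioned in the same document."""
    query = """
        SELECT a.value, b.value, COUNT(*) AS shared_docs
        FROM mention AS a
        JOIN mention AS b ON a.doc_id = b.doc_id AND a.value < b.value
        GROUP BY a.value, b.value
        ORDER BY a.value, b.value
    """
    return conn.execute(query).fetchall()


# Hypothetical documents used only to exercise the sketch.
documents = {
    "doc1": "Please forward the report to alice@example.com and bob@example.com today",
    "doc2": "bob@example.com asked carol@example.com about the same report",
}

conn = sqlite3.connect(":memory:")
for doc_id, text in documents.items():
    load_rows(conn, detect_entities(doc_id, text))

for left, right, shared in shared_documents(conn):
    print(f"{left} <-> {right}: mentioned together in {shared} document(s)")
```

Because every document lands in the same table, the link between two addresses is found even when they never appear in the same mailbox, which is the reason for loading everything into one repository.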

In the i2 centralized repository, simple tables and raw data are transformed into a real-life model of entities and connections. A prerequisite for this is a careful design of the entity model and a mapping of the data to that model. All analyzed and transformed data are loaded into this repository. This matters because the volume of communications is very large: an analyst who examines only a few locally opened mailboxes can miss connections and conversations. The overall process is shown in the following schema.

When all analysis, transformations and data loads are done, we can see clearly visualized information like:

  • Who is who?
  • Who knows whom?
  • Who is communicating with whom?
  • Who is related in any way to an email or email attachment?
  • Which detected entities are mentioned in email communication?

An example visualization chart that also includes entities detected by Watson can be seen in the screenshot below.

A chart with connections is not the only way to visualize stored data. All links and communications can also be projected onto a heat matrix. This representation shows in which time periods communication was most active and whether part of a conversation was sent at an inappropriate or even suspicious time. The more messages were sent, the darker the color gets. This lets the analyst spot such periods instantly.
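The counting behind such a heat matrix can be sketched as follows. This is an assumption about how the grid could be computed, not a description of how i2 builds it: messages are bucketed by weekday and hour, and the count in each cell would drive the color intensity. The function names and timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def heat_matrix(timestamps):
    """Return a 7 x 24 grid of message counts: rows are weekdays (0 = Monday), columns are hours."""
    counts = Counter((ts.weekday(), ts.hour) for ts in timestamps)
    return [[counts[(day, hour)] for hour in range(24)] for day in range(7)]


def busiest_cell(matrix):
    """Return (weekday, hour, count) for the most active cell."""
    cells = ((day, hour) for day in range(7) for hour in range(24))
    day, hour = max(cells, key=lambda cell: matrix[cell[0]][cell[1]])
    return day, hour, matrix[day][hour]


# Hypothetical send times used only to exercise the sketch.
sent_at = [
    datetime(2001, 10, 15, 9, 5),    # Monday morning
    datetime(2001, 10, 15, 9, 40),   # Monday morning
    datetime(2001, 10, 22, 9, 10),   # the following Monday morning
    datetime(2001, 10, 20, 2, 30),   # Saturday, in the middle of the night
]

matrix = heat_matrix(sent_at)
print(busiest_cell(matrix))
```

A single message in an otherwise empty night-time or weekend cell, like the Saturday entry here, is the kind of outlier the heat matrix makes visible at a glance.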

Another way to refine the time dimension is to show entities and connections on a timeline. Each entity gets its own line on the chart, and all connections are shown separately, ordered by time.
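As a rough illustration of the data behind such a timeline, the sketch below orders communications by time and gives every entity its own lane. The event format and the function name are assumptions made for this example, and the names are placeholders.

```python
from collections import defaultdict
from datetime import datetime


def build_timeline(events):
    """Order (sent_at, sender, recipient) events by time and group them into one lane per entity."""
    ordered = sorted(events, key=lambda event: event[0])
    lanes = defaultdict(list)
    for sent_at, sender, recipient in ordered:
        lanes[sender].append((sent_at, "to", recipient))
        lanes[recipient].append((sent_at, "from", sender))
    return ordered, dict(lanes)


# Hypothetical events, deliberately given out of order.
events = [
    (datetime(2001, 10, 15, 14, 0), "bob", "carol"),
    (datetime(2001, 10, 15, 9, 0), "alice", "bob"),
]

ordered, lanes = build_timeline(events)
for entity, entries in lanes.items():
    print(entity, [(ts.isoformat(), direction, other) for ts, direction, other in entries])
```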

Each exploration starts with just a few entities, and from there you can gradually extend your focus deeper into the data network. A search limited by any attribute, or simply expanding the context of a specific entity, helps you get the full view in one place.
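Expanding outward from a few starting entities is, in graph terms, a breadth-first traversal with a hop limit. The sketch below shows that idea on a tiny hypothetical network; it illustrates the concept only and is not the i2 expand operation. The function name and the sample links are assumptions.

```python
from collections import deque


def expand(links, seeds, max_hops):
    """Return every entity reachable from the seeds within max_hops, with its hop distance."""
    neighbours = {}
    for a, b in links:
        # Links are treated as undirected for the purpose of exploration.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the requested depth
        for nxt in sorted(neighbours.get(current, ())):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    return distance


# Hypothetical network: a chain A-B-C-D and an unrelated pair E-F.
links = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
print(expand(links, ["A"], max_hops=1))
print(expand(links, ["A"], max_hops=2))
```

Raising `max_hops` one step at a time mirrors how an analyst widens the view gradually instead of loading the whole network at once.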