Exploring CodeProject With Curiosity

Rafael Oliveira
Nov 25, 2020 · 11 min read

I was reading my email this morning when I stumbled upon an interesting question in the daily CodeProject newsletter, to which, like its author, I had no idea of the answer: “How many articles are there on CP?”

Did you know…

Turns out CodeProject is quite a large library after all, with more than 14 million users and 63 thousand articles! We’ve been looking for interesting datasets to demonstrate how to use Curiosity Search, and this seemed like a great one!

You can reproduce this on your own machine: all the code is available in this repository, and you can download a copy of Curiosity to run locally, or deploy it to your favorite container platform using the Docker image.

Time to check those 63,000+ articles

As we’ll be extracting the data from the Site Map, let’s take a look and decide on a data schema for our own CodeProject knowledge graph. First, from the Site Map, we can see that CodeProject is structured in Categories such as Desktop Development, and Subcategories such as Clipboard.

Site map: Categories and Subcategories

If we open one of the subcategory pages, we can see the entire list of articles tagged in that subcategory. These pages might take a while to load, as they contain a link to every single article in the subcategory. The next data type we can derive is the Article. Let’s open one to see what’s inside!

Articles all the way

Finally, if we open an article, we can see a few more things that can be interesting to capture:

Finally some content to read!

If we take a quick look at the article structure, we notice two more data types: Authors and Tags. We can also decide what metadata to import from each article, such as the Title, the Short Description, the article content, and the article’s stats.

What we will be extracting from an article

This gives us the final schema for our CodeProject graph:

Structuring CodeProject in a graph

First, let’s take a look at how to define a schema for this data using the Article type. We start by creating a class and tagging it with the [Node] attribute. Each node schema in a Curiosity application has to have a unique string key that identifies it; we mark that field with the [Key] attribute. For the Article type, we use the URL of the article, as it already provides a nice unique identifier for each article. The other fields are marked with the [Property] attribute, and will contain the rest of the data we extract from each article webpage. For the node timestamp (which Curiosity uses for time filtering), we use a DateTimeOffset field marked with the [Timestamp] attribute. For articles where we fail to parse the date from the HTML, we use the default value DateTimeOffset.UnixEpoch; this is a special value that is ignored when searching.

The final code for the Article type looks like this:

[Node]
public sealed class Article
{
    [Key]       public string Url { get; set; }
    [Timestamp] public DateTimeOffset Timestamp { get; set; }
    [Property]  public string Title { get; set; }
    [Property]  public string Description { get; set; }
    [Property]  public string Text { get; set; }
    [Property]  public string Html { get; set; }
    [Property]  public int Views { get; set; }
    [Property]  public int Bookmarks { get; set; }
    [Property]  public int Downloads { get; set; }
}

You can see the remaining data schemas in the attached GitHub repository. For the edges, we use a simple naming convention (HasAuthor / AuthorOf), and we create a small static class so we don’t have to repeat the names as strings in the code later:

public static class Edges
{
    public const string AuthorOf       = nameof(AuthorOf);
    public const string HasAuthor      = nameof(HasAuthor);
    public const string TagOf          = nameof(TagOf);
    public const string HasTag         = nameof(HasTag);
    public const string CategoryOf     = nameof(CategoryOf);
    public const string HasCategory    = nameof(HasCategory);
    public const string SubcategoryOf  = nameof(SubcategoryOf);
    public const string HasSubcategory = nameof(HasSubcategory);
}

Finally we can start putting our data connector together:

using (var graph = Graph.Connect(server, token, "CodeProject"))
{
    await graph.CreateNodeSchemaAsync<Article>();
    await graph.CreateNodeSchemaAsync<Author>();
    await graph.CreateNodeSchemaAsync<Tag>();
    await graph.CreateNodeSchemaAsync<Category>();
    await graph.CreateNodeSchemaAsync<Subcategory>();
    await graph.CreateEdgeSchemaAsync(Edges.AuthorOf,
                                      Edges.HasAuthor,
                                      Edges.CategoryOf,
                                      Edges.HasCategory,
                                      Edges.SubcategoryOf,
                                      Edges.HasSubcategory,
                                      Edges.TagOf,
                                      Edges.HasTag);
    await IngestCodeProject(graph);
    await graph.CommitPendingAsync();
}

The token variable contains the API token you can generate in the application. The server variable should point to the address where Curiosity is hosted; when testing it locally, that’s “http://localhost:8080/”. Remember that in a real deployment you shouldn’t store the API token directly in the code.
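One common way to keep the token out of source control is to read it from an environment variable, falling back to a command-line argument. A minimal sketch (the variable name CURIOSITY_TOKEN and the Config class are just conventions picked for this example):

```csharp
using System;

public static class Config
{
    // Reads the API token from an environment variable so it never lands in
    // source control; falls back to the second CLI argument, matching the
    // `dotnet run <server> <token>` invocation used later in this article.
    public static string GetToken(string[] args)
    {
        var token = Environment.GetEnvironmentVariable("CURIOSITY_TOKEN");
        if (string.IsNullOrWhiteSpace(token) && args.Length > 1)
        {
            token = args[1];
        }
        if (string.IsNullOrWhiteSpace(token))
            throw new InvalidOperationException("No API token provided.");
        return token;
    }
}
```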

All that is left to do is to implement the IngestCodeProject method to crawl the CodeProject site map and get all articles from it.
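The overall shape of IngestCodeProject follows the site structure we mapped earlier: start from the site map, walk categories and subcategories, then visit each article page. A rough sketch only; ParseCategories, ParseArticleLinks and ParseArticle are hypothetical helpers standing in for the real repository code, and the Name/Url fields on Category and Subcategory are assumed:

```csharp
// Rough skeleton of the crawl loop. The parsing helpers here are placeholders;
// the real implementation lives in the GitHub repository.
static async Task IngestCodeProject(Graph graph)
{
    // GetPage is the helper defined in the crawling section of this article.
    var siteMap = await GetPage("https://www.codeproject.com/script/Content/SiteMap.aspx");
    foreach (var category in ParseCategories(siteMap))
    {
        var categoryNode = graph.AddOrUpdate(new Category() { Name = category.Name });
        foreach (var subcategory in category.Subcategories)
        {
            var subcategoryNode = graph.AddOrUpdate(new Subcategory() { Name = subcategory.Name });
            graph.Link(categoryNode, subcategoryNode, Edges.HasSubcategory, Edges.SubcategoryOf);

            // Each subcategory page links to every article in it.
            var listPage = await GetPage(subcategory.Url);
            foreach (var articleLink in ParseArticleLinks(listPage))
            {
                var articlePage = await GetPage(articleLink);
                // Extracts title, description, text, stats, authors and tags,
                // then creates the Article node and its edges.
                await ParseArticle(graph, articleLink, articlePage, categoryNode, subcategoryNode);
            }
        }
    }
}
```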

Crawling CodeProject

Not a good sign when the API last changed 5 years ago…

So we fall back to a web crawler instead. To implement the crawler in C#, we can use the good-ol’ HtmlAgilityPack parsing library. We start with a few helpful methods to download a page and extract links:

// Reuse a single HttpClient instance instead of creating one per request,
// to avoid socket exhaustion when crawling thousands of pages.
static readonly HttpClient Client = new HttpClient();

static async Task<HtmlDocument> GetPage(string url)
{
    var html = await Client.GetStringAsync(url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    return doc;
}

static IEnumerable<string> GetLinks(string baseUrl, HtmlDocument doc)
{
    // SelectNodes returns null when nothing matches, so guard against
    // pages without any links.
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors is null) return Enumerable.Empty<string>();
    return anchors.Select(n => ToAbsolute(baseUrl, n.Attributes["href"].Value))
                  .Where(u => u is object)
                  .Distinct();
}

static string ToAbsolute(string baseUrl, string url)
{
    if (string.IsNullOrWhiteSpace(url)) return null;
    var uri = new Uri(url, UriKind.RelativeOrAbsolute);
    if (!uri.IsAbsoluteUri)
    {
        uri = new Uri(new Uri(baseUrl), uri);
    }
    return uri.ToString();
}
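ToAbsolute simply delegates relative-link resolution to System.Uri, which resolves relative references against the page they were found on:

```csharp
using System;

class UriDemo
{
    static void Main()
    {
        // A relative link resolves against the base page URL…
        var resolved = new Uri(new Uri("https://www.codeproject.com/"),
                               new Uri("/Articles/1/Some-Article", UriKind.RelativeOrAbsolute));
        Console.WriteLine(resolved); // https://www.codeproject.com/Articles/1/Some-Article

        // …while an absolute link is already usable as-is.
        var absolute = new Uri("https://example.com/x", UriKind.RelativeOrAbsolute);
        Console.WriteLine(absolute.IsAbsoluteUri); // True
    }
}
```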

From navigating around the website’s HTML, we can see that the hierarchy of categories can be easily read from the side menu that appears on any section page:

Structure, structure, structure…

The full code for the crawler won’t fit here, but you can find it on GitHub. Once the crawler code is working, we just need to write a few lines of code to create the right nodes in the knowledge graph. For example, this is how we connect articles, authors and tags:

var articleNode = graph.AddOrUpdate(
    new Article()
    {
        Url         = articleLink,
        Bookmarks   = stats.bookmarked,
        Views       = stats.views,
        Downloads   = stats.downloads,
        Description = content.description,
        Title       = content.title,
        Text        = content.text,
        Html        = content.html,
        Timestamp   = date
    });

foreach (var author in authors)
{
    var authorNode = graph.AddOrUpdate(
        new Author()
        {
            Name = author.name,
            Url  = author.url
        });
    graph.Link(articleNode, authorNode, Edges.HasAuthor, Edges.AuthorOf);
}

foreach (var tag in tags)
{
    var tagNode = graph.AddOrUpdate(
        new Tag()
        {
            Name = tag.tag,
            Url  = tag.url
        });
    graph.Link(articleNode, tagNode, Edges.HasTag, Edges.TagOf);
}

graph.Link(articleNode, subcategoryNode, Edges.HasSubcategory, Edges.SubcategoryOf);
graph.Link(articleNode, categoryNode, Edges.HasCategory, Edges.CategoryOf);

As you can see, we are creating each article in the graph, and linking them via the respective edges to authors and tags.

Armed with an API token, we can now run the crawler from the command line with:

dotnet run http://localhost:8080/api/ {API_TOKEN}

You’ll see it start downloading all the articles and uploading them to your Curiosity application:

The 🕷 is alive!

You can open the Data Hub page in your browser to see the data start to appear as the articles are downloaded:


Now that we have the crawler doing the hard work for us, it’s time to check how we can configure the search experience on Curiosity.

Making CodeProject Searchable

To make the data explorable, we will:

  • Configure search, autocomplete and filtering
  • Create data views for each type with the relevant information

Let’s start with configuring search!

First, we head to the Data Hub again, open the Article data type, click on Text Search, and add the Title, Description and Text fields as searchable. For the other data types, like Author, we configure Name as searchable.

Making things searchable takes a few clicks

This way, we can already search for our favorite authors and for information in the content of the articles. You’ll notice that I’ve also changed the boost for each field, so that text matches in an Article’s Title and Description fields contribute more to its ranking.

For Autocomplete and Filtering, we would like the search box to offer Authors, Tags and Categories/Subcategories to search for.

So let’s configure it:


Finally, we also want our search box to automatically capture Authors, Tags and Categories/Subcategories if the user types them as text directly in the search box — we can configure the required NLP models like this:


This means that if we type something like “windows” in the search box, it will automatically be recognized as a Tag in the knowledge graph, and we’ll get better search results based on it:


You might have noticed that we’re still showing our Key fields (i.e., the URLs) in the filters and card titles; this is because we haven’t yet configured how to render our data. Now that search is configured, it’s time to make our search experience more useful and pretty!

Improving our Search Experience

For example, for the articles, we might want to show not only the content, but also the related Author as a link the user can navigate to and find other Articles. We might also want to configure similarity for articles, so we can suggest other articles users might be interested in; we’ll get to that later in this article!

Starting with the Article type, we head to the Data Hub again, and configure the following:

  • Style: Change the Label to the field Title, and change the icon and color
  • Renderer: Add the related Author, Category and Tags to the footer, add the article URL as a link on the card, use the HTML when opening the card preview, and add a tab showing related Authors.

We will do something similar for the other data types, so that we can see and search in related articles for Categories, Subcategories and Authors:


Searching our data

The legend, Sacha Barber

One can also explore the graph we created while ingesting the data, and take a look “behind the scenes” at how the data is connected:

Graphs, graphs everywhere

Recommendation for Similar Articles

Curiosity can also recommend similar articles based on their content. We train this model by adding an embeddings index in the Settings page:

Adding a similarity index for articles

And configure it as follows:


Note that this model can also use a pre-trained Token Embeddings model to improve the training data by expanding the “Concepts” captured in the text. To do this, we first train the Token Embeddings model once:

First step: Training the token embeddings for word similarity

Once this model is trained, we can then train our Article Similarity model:

Second step: Training the article similarity model

Once the model is trained, we can modify our Article view in the Data Hub to show similar articles. For this, we add one new tab to the Full View, and add a Similar Search component inside of it:


Publishing a recommendation endpoint


The code for this endpoint is quite simple: it receives the URL of the article in the request body, and uses the Curiosity query language to retrieve the 10 most similar articles:

var url = Body.Trim('"');
if (Graph.TryGet(N.Article.Type, url, out var article))
{
    return Q().StartAt(article.UID)
              .Similar(count: 10)
              .EmitWithScores("Similar",
                              fields: new[] { "Url", "UID", "Timestamp" });
}
return "{}";

In order to call the endpoint from outside, you’ll need to generate a token for it (Settings > Tokens > New Token > Endpoint). You can then test it using curl, for example, with the following command:

curl -X POST -H "Accept: application/json" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer ${TOKEN}" \
     --data "ARTICLE URL" \
     http://localhost:8080/api/cce/token/run/recommend-articles
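The same call can be made from C# with HttpClient. A sketch only: the placeholder article URL and the CURIOSITY_ENDPOINT_TOKEN environment variable name are assumptions for this example; substitute a real article URL and your own token.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class RecommendClient
{
    static async Task Main()
    {
        using var client = new HttpClient();
        // The endpoint token is read from the environment rather than hardcoded.
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer",
                Environment.GetEnvironmentVariable("CURIOSITY_ENDPOINT_TOKEN"));

        // The endpoint expects the article URL as the request body
        // (quoted, since the endpoint code calls Body.Trim('"')).
        var body = new StringContent("\"ARTICLE URL\"", Encoding.UTF8, "application/json");

        var response = await client.PostAsync(
            "http://localhost:8080/api/cce/token/run/recommend-articles", body);
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}
```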

Conclusion

With a relatively small amount of code, we crawled 63,000+ CodeProject articles into a knowledge graph, configured search, autocomplete, filtering and data views, trained a similarity model, and published a recommendation endpoint that can be called from anywhere.
