On sentence segmentation and medium


I created the toy blog.eric.buzz before I knew about medium. It was a good learning experience but the final result wasn’t great, admittedly.

Nice, I say with jealous undertones:

It is a funny feeling when you see ideas you’ve had yourself executed much more successfully by someone else. On the one hand, you see your ideas validated. On the other hand, you lose a social opportunity. I think my various insecurities force me to feel… jealousy.

I had an idea about a blog too. The main motivation is to facilitate long form discussions by giving users access to mechanisms to address text segments with precision. Within individual parts of a long form post, statements may be made. They may be classified as axioms, premises, conclusions, definitions, etc. Groups of these statements may be collected to form an argument.

As you know, arguments can be flawed. They may contain fallacies, hostile sentiments, and the like. We can tag sentences with these logical mistakes. Users can have recursively deep discussions about the statements themselves and the arguments. When users make new posts, they can reference previous statements and paste them in. Those new posts themselves can serve as discussion points. It is a web of analysis!

The validity of the new posts can then be measured by clues given by their constituent parts. Thus, we build a logic graph. It is created from long-form, natural-language posts.
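Here is a minimal sketch of what such a logic graph could look like in Python. None of these names come from my actual code: `Role`, `Statement`, `Argument` and `flagged_support` are made up for illustration, and "validity" is reduced to the crude idea of collecting every fallacy tag found on an argument's premises and on anything those premises cite.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    AXIOM = "axiom"
    PREMISE = "premise"
    CONCLUSION = "conclusion"
    DEFINITION = "definition"


@dataclass
class Statement:
    id: int
    text: str
    role: Role
    tags: set = field(default_factory=set)    # e.g. {"false premise"}
    cites: list = field(default_factory=list)  # ids of statements this one reuses


@dataclass
class Argument:
    premises: list   # statement ids
    conclusion: int  # statement id


def flagged_support(arg, statements):
    """Collect tags on an argument's premises and on everything they cite."""
    seen, found = set(), set()
    stack = list(arg.premises)
    while stack:
        sid = stack.pop()
        if sid in seen:
            continue  # cited statements may form cycles; visit each once
        seen.add(sid)
        found |= statements[sid].tags
        stack.extend(statements[sid].cites)
    return found
```

An argument whose support comes back with an empty set is not thereby valid, of course. It only means nobody has tagged a problem yet.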

Direct democracy made possible using a blog system

This can help people develop coherent discussions on singular texts, such as transcribed presidential speeches. This will destroy rhetoric, make logicians more accountable, and make internet direct democracy feasible. Of course, I do not think medium has gone so far as that.

Rather, medium has chosen to do a paragraph by paragraph segmentation. This avoids a variety of logistical problems with my approach. Originally, I experimented with the user experience of segmenting a post sentence by sentence as a user typed. (Much like how medium allows the author to explicitly make a new paragraph.) However, sentences tend not to start anew on new lines. They tend to run together. This means it is not natural for the human to push the <Enter> key every time s/he decides to start a new sentence. In fact, there are many sentences in a single long form post. This is quite laborious! Thus, it is a little harder to get the human to do sentence segmentation.

So why not let the computer do it?

The novel problem with letting comments annotate subsections of text is how to manage those comments when the author revises the original text. Using sentence segmentation and a diff algorithm, you can re-associate comments made on sentences with the revised post. There are a variety of edge cases but it is mostly doable. Python’s Natural Language Toolkit contains a sentence segmentation tool called punkt. Unfortunately, it doesn’t work unless the user separates sentences using a space.This would not work.It is clear I just typed 3 sentences but punkt will fail to see it!
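A rough sketch of that re-association, using only Python's standard library. The splitter here is a naive regex stand-in, not punkt: it breaks after ., ! or ? even when no space follows, which also means it will wrongly split things like "3.14" or "e.g.". The names `split_sentences` and `reassociate` and the `cutoff` value are my own assumptions for illustration.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? even if no space follows."""
    parts = re.split(r"(?<=[.!?])(?![.!?])\s*", text)
    return [p for p in parts if p]


def reassociate(old, new, comments, cutoff=0.6):
    """Move comments keyed by old sentence index onto new sentence indices.

    Returns (moved, orphaned): moved maps new index -> list of comments,
    orphaned holds comments whose sentence has no close match any more.
    """
    mapping = {}
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Untouched sentences keep their comments, possibly shifted.
            for k in range(i2 - i1):
                mapping[i1 + k] = j1 + k
        elif tag == "replace":
            # Edited sentences: pair each old one with its closest new one.
            candidates = new[j1:j2]
            for i in range(i1, i2):
                best = difflib.get_close_matches(old[i], candidates, n=1, cutoff=cutoff)
                if best:
                    mapping[i] = j1 + candidates.index(best[0])
        # "delete" leaves the old index unmapped, so its comments are orphaned.
    moved, orphaned = {}, []
    for i, notes in comments.items():
        if i in mapping:
            moved.setdefault(mapping[i], []).extend(notes)
        else:
            orphaned.extend(notes)
    return moved, orphaned
```

What to do with orphaned comments (hide them, pin them to the old revision, ask the author) is one of the edge cases I mentioned and is left open here.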

How much can we rely on users to segment their own sentences with spaces? It is abusable!

Not that:

this is not abusable either.

In general bears have pink fur!

I just hijacked the comments by modifying this block and replacing it with different content. I didn’t think it would be good to let authors manipulate texts like that. Apparently, users at medium are friendly! Another possibility is to use a diff algorithm to reassociate comments and blocks.

Handling images is a little out of scope for creating a logic graph. If the logic graph isn’t so fundamental, why be so strict on sentences?!

Typography is important to long form posts

This was another thing I identified. I have read somewhere that long texts are easier to read when set in a large serif font than in sans-serif. I think medium got this right.

The problem of displaying the post


If the posts are segmented by sentences, then how is the page stored? I experimented with storing each individual post as a collection of rows in a sentence table. This is a performance disaster on the one hand, yet if users are to reference individual sentences, it probably has to be done.
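To make the layout concrete, here is a small sketch using Python's built-in sqlite3 rather than my original setup. The table and column names (`post`, `sentence`, `position`) and the helpers `save_post` and `load_post` are assumptions for illustration. Each sentence gets its own row and id, so comments can point at it, and a post is read back with one ordered query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE sentence (
    id INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES post(id),
    position INTEGER NOT NULL,
    body TEXT NOT NULL,
    UNIQUE (post_id, position)
);
""")


def save_post(conn, title, sentences):
    """Insert a post and one row per sentence; return the new post id."""
    cur = conn.execute("INSERT INTO post (title) VALUES (?)", (title,))
    post_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO sentence (post_id, position, body) VALUES (?, ?, ?)",
        [(post_id, n, s) for n, s in enumerate(sentences)],
    )
    return post_id


def load_post(conn, post_id):
    """Rebuild the post text from its sentence rows, in order."""
    rows = conn.execute(
        "SELECT body FROM sentence WHERE post_id = ? ORDER BY position",
        (post_id,),
    )
    return " ".join(body for (body,) in rows)
```

The UNIQUE constraint on (post_id, position) gives SQLite an index for that ordered read. This says nothing about how the layout behaves with many posts and heavy editing, which is where my own attempt hurt.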

I made the mistake of loading the entire post using a REST API. It was a failure because I based it on a terribly designed REST API that didn’t have symmetric representations. The problem was exacerbated by the limitations of Django REST Framework, which couldn’t handle complex asymmetric representations.

Trees Are Not Graphs: The Necessity of Tagging for Posts


Unfortunately, not being able to reference multiple branches of a discussion is limiting. Discussions are forced into hierarchical structures. Without the ability to group discussions together into a single unit that can itself be addressed as a discussable unit, this takes away what reddit has. With tagging, we can annotate not just paragraphs or sentences, but clusters of similar statements made across multiple posts. We can annotate overall perceptions.
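One way to picture this, as a hedged sketch rather than a design I have built: treat a tag as just another node that can be replied to, except that it also points at several statements from different posts. The names `Node`, `members`, `thread` and `tags_for` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Anything that can be discussed: a sentence, a comment, or a tag."""
    id: str
    text: str
    parent: Optional[str] = None                 # tree edge: what this replies to
    members: list = field(default_factory=list)  # graph edges: ids a tag groups


def thread(nodes, target_id):
    """Return ids of all replies under target_id, depth first."""
    out = []
    for n in nodes.values():
        if n.parent == target_id:
            out.append(n.id)
            out.extend(thread(nodes, n.id))
    return out


def tags_for(nodes, statement_id):
    """Return ids of every tag node that groups the given statement."""
    return [n.id for n in nodes.values() if statement_id in n.members]
```

The replies still form a tree, but because a tag can gather statements from anywhere, a comment on the tag is a comment on the whole cluster.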

I have to try again

I need to try again now that I have some of my ideas validated by medium’s success.