A “hard return” or “hard wrap” is a line break inserted to keep the text width within a certain limit, usually 80 characters. Hard wraps make text more readable and are a formatting convention in certain text editors. However, in some applications (such as natural language processing and format conversion), they can be annoying.
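To make the idea concrete, here is a minimal Python sketch of adding and removing hard wraps. It assumes paragraphs are separated by blank lines, and the helper names `hard_wrap` and `unwrap` are made up for illustration; real documents (lists, code, tables) need more care than this.

```python
import re
import textwrap


def hard_wrap(text: str, width: int = 80) -> str:
    """Insert hard wraps so that no line exceeds `width` characters."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


def unwrap(text: str) -> str:
    """Undo hard wraps: join the lines of each paragraph with single spaces."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # str.split() with no argument splits on any whitespace, newlines included.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

Note that `unwrap` also collapses repeated spaces, which is usually acceptable for natural language processing but not for preformatted text.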

Since elementary school, you have learned the basic rules for wrapping lines when writing on a piece of paper: you never start a new line with a symbol, you don’t break words unless you really have to, etc. These rules are easily…

Here’s how my document-parsing (i.e. info-extraction) projects are usually set up:

  1. Save all documents-to-parse into a folder called raw_files.
  2. Pre-process raw documents and save processed content into a MongoDB Collection cleaned_files. Each document should contain the following fields:
  • content: A multi-line string of the cleaned content.
  • preprocessing_status: An indicator of whether the parsing succeeded or not. Can be an integer, boolean or string.

4. Create two new Collections: responses_0 and results_0. Later, you can simply create two new Collections like responses_1 and results_1 to iterate on your project.

5. Write parsing methods. By “write”, I mean programming.

6. The parser program…
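Steps 1 and 2 above can be sketched roughly as follows. This is only an illustration under assumptions: the `clean` function is a placeholder for whatever pre-processing a project actually needs, the `filename` field is an extra that is not in the list above, and a boolean is used for preprocessing_status. The sketch only builds the documents; writing them into the cleaned_files Collection (for example with pymongo) is left out so that it runs without a database.

```python
from pathlib import Path


def clean(raw: str) -> str:
    """Placeholder cleaning step: normalize newlines, strip trailing spaces."""
    lines = [line.rstrip() for line in raw.replace("\r\n", "\n").split("\n")]
    return "\n".join(lines).strip()


def build_cleaned_docs(raw_dir):
    """Build one dict per file in `raw_dir`, ready to be stored as a document."""
    docs = []
    for path in sorted(Path(raw_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            content = clean(path.read_text(encoding="utf-8"))
            status = True
        except (UnicodeDecodeError, OSError):
            # Keep a record of the failure instead of silently skipping the file.
            content = ""
            status = False
        docs.append({
            "filename": path.name,           # hypothetical extra field
            "content": content,              # multi-line string of cleaned content
            "preprocessing_status": status,  # boolean here; could be int or str
        })
    return docs
```

Recording failures as documents rather than dropping them makes it easy to query later for files that need another pre-processing pass.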

Suppose you have two Collections (A and B) in a MongoDB Database, and you want to find all Documents in A whose value for a certain field does not appear in any Document in B. In SQL, you would use something like this:

SELECT * FROM A WHERE NOT EXISTS ( SELECT 1 FROM B
WHERE A.field = B.field );

or, by exploiting the nature of outer joins:


Such queries are called left anti joins. It is such a popular recipe that many data-processing…
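For readers who want to try the two SQL formulations, here is a small self-contained sketch using Python’s built-in sqlite3 module. The tables A and B, the column name `field`, and the sample values are invented for illustration, and this demonstrates only the SQL side of the comparison, not a MongoDB query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (field TEXT);
    CREATE TABLE B (field TEXT);
    INSERT INTO A VALUES ('x'), ('y'), ('z');
    INSERT INTO B VALUES ('y');
""")

# Formulation 1: correlated subquery with NOT EXISTS.
not_exists = conn.execute("""
    SELECT A.field FROM A
    WHERE NOT EXISTS (SELECT 1 FROM B WHERE A.field = B.field)
    ORDER BY A.field
""").fetchall()

# Formulation 2: a left outer join leaves B's columns NULL for unmatched rows.
outer_join = conn.execute("""
    SELECT A.field FROM A
    LEFT JOIN B ON A.field = B.field
    WHERE B.field IS NULL
    ORDER BY A.field
""").fetchall()

conn.close()
```

Both queries return the rows of A that have no partner in B, here the rows for 'x' and 'z'.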

No rules, no standards. — Mencius

Not all tables are created equal. Some are born adhering to rigorous standards; some are not. The former are easily parsed by machines, while the latter are made for human eyes only.

Take this excerpt from an annual report of the Imperial Bancorp as an example:

This example perfectly demonstrates some major difficulties in parsing textual tables:

  • No column indicators.
  • The words “Imperial” align so perfectly in the middle of the table that one may confuse these entries of “Imperial Bank” as two separate columns.
  • Multi-line cells exist (e.g. Imperial Municipal Services…

Generally speaking, a CSV table may or may not contain a header, and the header, when it exists, may consist of multiple lines. Question: How can we detect how many of the first rows constitute the header of a CSV file?

I wrote a script, named headsman, that finds the best number of leading rows to cut off from a table in order to maximize the purity of datatypes in each column.

What do I mean?

Notice that, in a column of a table, even if it contains only numerical entries, its header (if any) is often a string. …
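The idea can be illustrated with a rough Python sketch. This is not the headsman script itself: the three-way type classification, the purity score (the average, over columns, of the share of the most common type), and the names `cell_type`, `purity`, and `guess_header_rows` are all assumptions made for this example.

```python
def cell_type(cell: str) -> str:
    """Classify a cell as 'empty', 'number', or 'text' (a deliberately crude scheme)."""
    s = cell.strip()
    if s == "":
        return "empty"
    try:
        float(s.replace(",", ""))  # tolerate thousands separators such as 1,000
        return "number"
    except ValueError:
        return "text"


def purity(rows) -> float:
    """Average over columns of the fraction of cells sharing the dominant type."""
    if not rows:
        return 0.0
    n_cols = max(len(r) for r in rows)
    total = 0.0
    for c in range(n_cols):
        types = [cell_type(r[c]) for r in rows if c < len(r)]
        total += max(types.count(t) for t in set(types)) / len(types)
    return total / n_cols


def guess_header_rows(rows, max_header: int = 5) -> int:
    """Return the number of leading rows whose removal maximizes purity."""
    limit = min(max_header, len(rows) - 1)  # always keep at least one data row
    best_k, best_score = 0, -1.0
    for k in range(limit + 1):
        score = purity(rows[k:])
        if score > best_score:  # strict comparison: ties favor cutting fewer rows
            best_k, best_score = k, score
    return best_k
```

For a table whose first two rows are header text above a numeric column, cutting exactly those two rows yields perfectly pure columns, so the sketch reports a two-line header. A table that is text throughout is already pure, so it reports zero header rows, which shows the limit of relying on datatypes alone.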


At some turn of my breaths
I met you, life,
as my friend


I set off, heeding your word,
holding your hand,
having chosen you as my companion


What destination am I a traveler to?
Which roads have you brought me onto?
I cannot understand you,
nor you me..

And you don’t understand me either

What I cannot accept
is exactly what you accept
I cannot understand you,
nor you me..

And you don’t understand me either

The decision that you had made
I set the sky down upon the earth
Wrapped in shade, I…

Despite privacy concerns, the trending movement known as Quantified Self has really inspired many of us to evaluate our lifestyle in a more accurate and scientific manner.

Keeping records of ourselves doesn’t really mean buying additional wearable gadgets; your current laptop and smartphone will do most of the job. Popular tracking apps include Qbserve, a time-monitoring tool for Mac users, and Moves, a mobile app that keeps track of whereabouts and commuting methods. These applications have accumulated loads of data about me, and I can finally sit down and look back at how my time was spent.

For decades, educated human beings suffered from the limitations imposed by pens and paper. Now, in the very modern era of 2017, it is time for some updates.

Eliminate the Hassle of Bringing Books to Class — Use A Tablet

You are not the Statue of Liberty; you don’t have to take books with you all year round. Go get a big tablet and let everything go digital.

Zaalima | Raees | Shah Rukh Khan & Mahira Khan | Arijit Singh & Harshdeep Kaur | JAM8

Zaalima is a popular song from the 2017 Hindi movie Raees. I translated the lyrics into Chinese, trying my best to keep them rhyming. Please do point out anything I misunderstood; after all, the translation is mostly based on English versions :P

Jo teri khatir tadpe …

Need a witty touch on your next iDevice? Here are some clever messages that will make you laugh:


