How I enrich my bookmarks using AI

Josselin Perrus
Thoughts on Machine Learning
5 min read · Oct 28, 2023

The Problem: I use Notion as a central place to collect all my bookmarks. But the metadata is inconsistent: it varies with the website I’m bookmarking and with whether I’m bookmarking from my desktop or my mobile.

I initially thought I could solve the problem in the acquisition phase by finding the right bookmarking extension or application.

But I realized that the perfect bookmarking tool did not exist, in particular given the new limitations on Twitter’s API (Twitter accounts for 80% of my bookmarks) and the additional capabilities I wanted, such as summarization for long articles.

The setup

Data acquisition

I rely on Save to Notion when I’m on my desktop (it provides richer metadata), and Pocket otherwise. Pocket is synchronized to Notion using Make.

Bookmarks processing

Here is a high-level breakdown:

  • Every night, a Make scenario makes a POST request to my enrichment service endpoint hosted on Replit.
  • The enrichment service first pulls 10 bookmarks from Notion that have not yet been processed (see the sketch after this list).
  • It sorts them by domain: Twitter, Youtube, LinkedIn, and others. Each category receives a differentiated treatment.
  • Twitter bookmarks are sent to my tweet-scraping AWS Lambda function. This function instantiates a Playwright instance (a headless browser) to scrape with JavaScript enabled, and it uses BrightData to circumvent Twitter’s limitations. It returns the HTML and a screenshot saved on S3.
  • The other bookmarks are scraped directly from the enrichment service (no JavaScript required).
  • I use a combination of regular web scraping techniques (Beautiful Soup) and OpenAI API calls to extract metadata from the HTML and generate summaries.
  • I could very well imagine OpenAI being enough to extract the metadata on its own. However, the HTML of an entire webpage often exceeds the token limit, so being more selective about what to pass to the model is necessary.
  • The enrichment service updates the bookmarks on Notion.
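
For illustration, here is a minimal sketch of what the “pull unprocessed bookmarks” step could look like with the official notion-client SDK. This is an assumption, not the actual code: the database ID variable, the “Processed” checkbox and the “URL” property are placeholder names for whatever the real Notion database uses.

# Hypothetical sketch: query Notion for bookmarks that have not been processed yet.
# The "Processed" checkbox and "URL" property names are assumptions about the schema.
import os
from urllib.parse import urlparse

from notion_client import Client

notion = Client(auth=os.environ["NOTION_API_KEY"])

def retrieve_bookmarks_from_notion(limit=10):
    response = notion.databases.query(
        database_id=os.environ["NOTION_DATABASE_ID"],
        filter={"property": "Processed", "checkbox": {"equals": False}},
        page_size=limit,
    )
    bookmarks = []
    for page in response["results"]:
        url = page["properties"]["URL"]["url"]
        bookmarks.append({
            "id": page["id"],
            "url": url,
            # The domain is used later to route each bookmark to its pipeline
            "domain": urlparse(url).netloc if url else None,
        })
    return bookmarks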

In more detail:

The code

Please bear in mind that I’m a beginner developer, so the code may not adhere to best practices in terms of architecture / naming /…

I have 2 code bases:

  • The Flask endpoint hosted on Replit
  • The tweet scraper docker image used by the AWS Lambda

Here is some of the code for the Flask endpoint. I’ll publish a separate article for the tweet scraper.

@app.route('/api/process_new_bookmarks', methods=['POST'])
@auth.login_required
def process_new_bookmarks():

    # Retrieve bookmarks
    bookmarks = notion_io.retrieve_bookmarks_from_notion(nb_bookmark_processed_per_run)

    # Initialize bookmark categories
    twitter_bookmarks = []
    linkedin_bookmarks = []
    youtube_bookmarks = []
    other_bookmarks = []

    enriched_bookmarks = []

    # Categorize bookmarks by domain
    for bkmk in bookmarks:
        domain = bkmk.get("domain")
        logging.debug(f"New bookmark: {bkmk['url']}")
        if domain in ["twitter.com", "x.com"]:
            twitter_bookmarks.append(bkmk)
        elif domain in ["linkedin.com", "www.linkedin.com"]:
            linkedin_bookmarks.append(bkmk)
        elif domain in ["youtube.com", "www.youtube.com"]:
            youtube_bookmarks.append(bkmk)
        elif domain is not None:
            other_bookmarks.append(bkmk)

    # Enrich Twitter bookmarks
    enriched_bookmarks += enrich.enrich_twitter_bookmarks(twitter_bookmarks)

    # Enrich LinkedIn bookmarks
    enriched_bookmarks += enrich.enrich_linkedin_bookmarks(linkedin_bookmarks)

    # Enrich Youtube bookmarks
    enriched_bookmarks += enrich.enrich_youtube_bookmarks(youtube_bookmarks)

    # Enrich Other bookmarks
    enriched_bookmarks += enrich.enrich_other_bookmarks(other_bookmarks)

    # Update bookmarks in Notion
    for bkmk in enriched_bookmarks:
        notion_io.update_bookmark_in_notion(bkmk)

    # Return bookmarks as JSON
    return jsonify(enriched_bookmarks)

My Flask application exposes a POST endpoint that requires authentication. The function separates the different kinds of bookmarks (Twitter, LinkedIn, Youtube,…), each of which receives specific processing.
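
The auth object is not shown in this article. As a rough sketch of how it might be wired up (assuming Flask-HTTPAuth, which may differ from the actual project; the environment variable names are illustrative):

# Hypothetical sketch of the authentication setup, assuming Flask-HTTPAuth.
import os

from flask import Flask
from flask_httpauth import HTTPBasicAuth
from werkzeug.security import generate_password_hash, check_password_hash

app = Flask(__name__)
auth = HTTPBasicAuth()

# A single API user whose credentials live in environment variables (illustrative names)
users = {
    os.environ["API_USER"]: generate_password_hash(os.environ["API_PASSWORD"])
}

@auth.verify_password
def verify_password(username, password):
    # Returning a truthy value marks the request as authenticated
    if username in users and check_password_hash(users[username], password):
        return username

The nightly Make scenario would then include the Basic Auth credentials in its POST request to the endpoint.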

def enrich_other_bookmarks(other_bookmarks):
    for bkmk in other_bookmarks:
        html = retrieve_page_html(bkmk)
        type = article_or_website(html)
        bkmk['type'] = type
        if type == "article":
            bkmk['title'] = get_article_title(html)
            bkmk['author'] = get_author(html)
            bkmk['image_url'] = get_image_url(bkmk['url'], html)
            bkmk['summary'] = get_summary(html)
        else:
            bkmk['title'] = get_website_title(html)
    return other_bookmarks


def article_or_website(html):
    soup = BeautifulSoup(html, 'html.parser')

    meta_tag = soup.find('meta', {'property': 'og:type'})
    article = soup.find('article')
    type = ""

    if meta_tag and meta_tag.has_attr('content'):
        type = meta_tag['content']
    elif article:
        type = "article"
    else:
        type = "website"

    return type

Taking the example of the “other” bookmarks: depending on whether the page is a website or an article, different metadata are harvested.

The title and the author are retrieved by passing the HTML to OpenAI because they can be stored in a variety of <meta> tags. OpenAI should be able to determine which tag, if any, to consider, avoiding the need for numerous brittle scraping rules.

def get_website_title(html):
    head = get_clean_head_tag(html)
    title = get_data(head, "Return the description. Keep only the first sentence. If you can't find a description return ''")

    return title

def get_article_title(html):
    soup = BeautifulSoup(html, 'html.parser')
    title = ""
    if soup.find('h1') is not None:
        title = soup.find('h1').text
    else:
        head = get_clean_head_tag(html)
        title = get_data(head, "Return the title. If you can't find a title return '?'")

    return title

def get_author(html):
    head = get_clean_head_tag(html)
    author = get_data(head, "This is an article head tag. Return the author's name if you can find it in the metadata. Otherwise return 'Not found'. Format : First name Last name.")

    if author == 'Not found':
        author = get_data(get_article_content(html), "This is an article content. Return the author's name if you can find it. Otherwise return 'Not found'. Format : First name Last name.")

    return author

def get_summary(html):
    content = get_article_content(html)
    summary = get_data(content, "This is an article content. Identify the main theses, and organize them as bullet points.")
    return summary
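
get_clean_head_tag and get_article_content are helpers that this article does not show. As a rough illustration of what they might do (an assumption, not the actual code), the idea is to keep only the parts of the page worth spending tokens on:

# Hypothetical sketches of the two helpers used above (not the actual project code).
from bs4 import BeautifulSoup

def get_clean_head_tag(html):
    # Return the <head> tag with scripts, styles and stylesheet links removed,
    # so that only the metadata is passed to the model
    soup = BeautifulSoup(html, 'html.parser')
    head = soup.find('head')
    if head is None:
        return soup
    for tag in head.find_all(['script', 'style', 'link']):
        tag.decompose()
    return head

def get_article_content(html):
    # Return the <article> tag if there is one, otherwise the <body>,
    # stripped of tags that carry no useful content
    soup = BeautifulSoup(html, 'html.parser')
    content = soup.find('article') or soup.find('body') or soup
    for tag in content.find_all(['script', 'style', 'nav', 'footer']):
        tag.decompose()
    return content

Both helpers return BeautifulSoup objects, which is consistent with how get_data (below) calls .get_text() on its input when it needs a text-only version.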

The call to the OpenAI API takes into account the risk that the HTML will not fit into the token window.

The first try passes the HTML as is to the 4K-token model.

If that fails, the second try keeps only the text (removing all tags from the HTML), still with the 4K window.

The last attempt uses the text-only version with the 16K window.

def get_data(html, prompt_variable, mode="FULL", window="4K"):
    # Text of the prompt template
    prompt = f'''You are acting as an HTML parser. {prompt_variable}

HTML:
'''

    # FULL mode passes the HTML as is; SIMPLIFIED mode passes only the text
    if mode == "FULL":
        prompt = f"{prompt}\n{html}"
    else:
        prompt = f"{prompt}\n{html.get_text()}"

    # Pick the 4K or 16K context model
    model_name = "gpt-3.5-turbo"
    if window == "16K":
        model_name = "gpt-3.5-turbo-16k"

    # LLM
    llm = ChatOpenAI(temperature=0, openai_api_key=os.environ["OPENAI_API_KEY"], model_name=model_name)

    # Call the model; on a token-limit error, retry with simpler input, then a larger window
    try:
        data = llm.predict(prompt)
        return data
    except InvalidRequestError:
        if mode == "FULL":
            return get_data(html, prompt_variable, "SIMPLIFIED")
        elif mode == "SIMPLIFIED" and window == "4K":
            return get_data(html, prompt_variable, "SIMPLIFIED", "16K")
        else:
            return "The HTML was too long to pass to the LLM in 1 call."

Source code

You can find the Flask app code on GitHub:

The source code for the Lambda will be published in a separate post. Stay tuned.
