Matching HS Codes in 2024: Traversing the Customs Space
HS (Harmonised System) Codes help customs authorities and businesses around the world classify traded goods, so that imports and exports are controlled appropriately. The system classifies a product recursively, into progressively finer categories.
Could we do this differently in 2024, with linguistic tools like LLMs? This blog examines different approaches to classification.
Methodology of Classification
An HS Code is assigned hierarchically, starting from Sections, then Chapters, then finer granularities. If we consider this as a search problem, the search space keeps shrinking as we continue the “process of selection”: as soon as we classify a product as an electrical component, we no longer need to keep Wines in the search space.
A graph could be a great solution to this problem!
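To make the shrinking search space concrete, here is a minimal sketch on a toy hierarchy. The node names and the `leaves_under` helper are illustrative assumptions, not taken from STCCED or from the actual implementation.

```python
# Toy hierarchy; the node names are illustrative, not real STCCED entries.
TREE = {
    "root": ["Electrical machinery", "Beverages"],
    "Electrical machinery": ["Batteries", "Electric motors"],
    "Beverages": ["Wines", "Beers"],
    "Batteries": [],
    "Electric motors": [],
    "Wines": [],
    "Beers": [],
}


def leaves_under(tree, node):
    """Return every leaf reachable from `node`, i.e. the remaining search space."""
    children = tree[node]
    if not children:
        return [node]
    found = []
    for child in children:
        found.extend(leaves_under(tree, child))
    return found


# Before any decision, every leaf is still a candidate.
print(leaves_under(TREE, "root"))
# After picking "Electrical machinery", Wines is no longer in the search space.
print(leaves_under(TREE, "Electrical machinery"))
```

Each decision made near the root discards whole subtrees, which is why a parent-child graph fits the problem so naturally.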
Resources at our Disposal
We are focusing on classifying according to the definitions set by Singapore Customs.
- Singapore Trade Classification, Customs & Excise Duties (STCCED) 2022
- Customs Ruling Dataset
- Guide for Classifying Products
Constructing a Graph
STCCED, as the name suggests, contains the hierarchical classification of traded goods. It is a PDF file, and the text is divided into Sections, Chapters and “Subchapters”.
Step 1: ⬇️ Download the STCCED 2022 PDF and use PyPDF2 to extract the text content.
Step 2: ✂️ Split the text content recursively, into Sections, Chapters and “Subchapters”.
Step 3: Convert all the subchapters into GraphViz dot files (a compact representation of parent-child relationships). I used gpt-4 to read the sections and construct these subgraphs.
Step 4: Merge the subgraphs hierarchically. The result is a graph of 13K+ nodes, neatly organised!
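Steps 2 and 4 can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the code used for the project: the chapter heading pattern, the quoted-edge DOT format and the helper names `split_chapters` and `merge_dot_edges` are all hypothetical, and the real PDF text will likely need a different regex. Steps 1 and 3 (PyPDF2 extraction and the gpt-4 calls) are left out.

```python
import re
from collections import defaultdict

# Assumed heading format ("Chapter 85" on its own line); adjust for the real text.
CHAPTER_RE = re.compile(r"^Chapter[ \t]+(\d+)[ \t]*$", re.MULTILINE)

# Matches simple quoted DOT edges such as:  "85" -> "8506";
EDGE_RE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')


def split_chapters(text):
    """Split extracted text into {chapter_number: body} at chapter headings."""
    matches = list(CHAPTER_RE.finditer(text))
    chapters = {}
    for i, match in enumerate(matches):
        # A chapter body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters[int(match.group(1))] = text[match.end():end].strip()
    return chapters


def merge_dot_edges(dot_sources):
    """Merge the edges of several DOT subgraphs into one parent -> children map."""
    graph = defaultdict(list)
    for source in dot_sources:
        for parent, child in EDGE_RE.findall(source):
            if child not in graph[parent]:  # skip duplicate edges
                graph[parent].append(child)
    return dict(graph)


sample_text = "Chapter 22\nBeverages, spirits and vinegar\nChapter 85\nElectrical machinery\n"
print(split_chapters(sample_text))

sample_dots = [
    'digraph { "85" -> "8506"; "85" -> "8507"; }',
    'digraph { "8506" -> "8506.10"; }',
]
print(merge_dot_edges(sample_dots))
```

Keeping the merged graph as a plain dictionary of child lists makes the later level-by-level lookup a single dictionary access per step.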
Building a Hierarchical Search
For each granularity level (Section, Chapter, Subchapter, Dash and Double Dash), I sent requests to gpt-4 directly with a list of the child nodes. Perhaps this illustration will be of assistance!
Thus, each level of granularity shrinks the remaining search space further.
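The level-by-level descent might look roughly like the sketch below. The `choose` callable stands in for the gpt-4 request, and `keyword_choose` is a deliberately naive placeholder so that the example runs offline. Both names, and the toy graph, are assumptions for illustration only.

```python
def classify(graph, description, choose, root="root"):
    """Walk from `root` to a leaf, asking `choose` to pick one child per level."""
    path = [root]
    node = root
    while graph.get(node):  # stop once the node has no children
        options = graph[node]
        node = choose(description, options)
        if node not in options:
            raise ValueError(f"choice {node!r} is not one of the offered options")
        path.append(node)
    return path


def keyword_choose(description, options):
    """Placeholder for the LLM call: pick the option sharing the most words."""
    words = set(description.lower().split())
    return max(options, key=lambda option: len(words & set(option.lower().split())))


# Toy graph; the labels are illustrative, not real STCCED entries.
GRAPH = {
    "root": ["Electrical machinery", "Beverages"],
    "Electrical machinery": ["Lithium batteries", "Electric motors"],
    "Beverages": ["Wines", "Beers"],
}

print(classify(GRAPH, "electrical rechargeable lithium batteries", keyword_choose))
```

Only the children of the current node are ever shown to the chooser, so the prompt stays small even though the full graph has thousands of nodes.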
Giving it some LLM and Streamlit
For the first build, I used openai directly and streamlit to create the user interface.
Try it now on Streamlit:
🌏 https://harmony-ai.streamlit.app/
Limitations and Future Work
- The current implementation does not go backwards to course-correct, so a bad classification at a coarser level carries through to every finer level.
- Tokens could be saved by combining vector search with the graph for nodes that have a large number of children.
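One possible way to address the first limitation is a depth-first search that backs up when a branch leads nowhere. The sketch below is not implemented in the current build. It assumes two hypothetical callables: `rank`, which orders the children best-first (an LLM could fill this role), and `accept`, which checks whether a final leaf is plausible. The keyword-overlap versions here are naive placeholders.

```python
def classify_with_backtracking(graph, description, rank, accept, node="root"):
    """Depth-first search: try children best-first, back up when a branch fails."""
    children = graph.get(node, [])
    if not children:
        # Leaf reached: keep it only if it passes the plausibility check.
        return [node] if accept(description, node) else None
    for child in rank(description, children):
        path = classify_with_backtracking(graph, description, rank, accept, child)
        if path is not None:
            return [node] + path
    return None  # every child failed, so the caller tries its next option


def overlap(description, label):
    """Count the words shared by the description and a node label."""
    return len(set(description.lower().split()) & set(label.lower().split()))


def rank(description, options):
    """Placeholder ranking: most shared words first (stable for ties)."""
    return sorted(options, key=lambda option: -overlap(description, option))


def accept(description, leaf):
    """Placeholder check: the leaf must share at least one word."""
    return overlap(description, leaf) > 0


# Toy graph; the labels are illustrative, not real STCCED entries.
GRAPH = {
    "root": ["Beverages", "Fruit preparations"],
    "Beverages": ["Wines", "Beers"],
    "Fruit preparations": ["Fruit juice"],
}

# "Beverages" is tried first, fails at both leaves, and the search backs up.
print(classify_with_backtracking(GRAPH, "fruit juice beverages", rank, accept))
```

The trade-off is cost: every abandoned branch spends extra requests, so a real implementation would probably cap how many alternatives it tries per level.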