Free Public NLP Datasets for the Automotive Domain

Shailendra Jain
2 min read · Nov 7, 2021

Photo by Matt Hudson on Unsplash

There are several free public NLP (Natural Language Processing) datasets available for tweet sentiment analysis, Yelp reviews, question answering, legal case summaries, etc. For the automotive domain, NHTSA (National Highway Traffic Safety Administration) provides multiple datasets that can be used to build NLP use cases.

NHTSA is a federal agency committed to transportation safety in the United States. Automotive OEMs need to report their consumer complaints, recalls and recall status to NHTSA as part of Early Warning Reporting. The general public can also report safety defects directly to NHTSA. NHTSA has made all of this data publicly available on the Office of Defects Investigation (ODI) Flat File Downloads page. The datasets include Complaints, Defect Investigations, Recalls and Manufacturer Communications. These datasets contain rich natural language data (in English) for the automotive domain.

The complaints data catalogs user complaints starting from 1995. In addition to the complaint description, OEM, Make, Model and Model Year, each record includes important Boolean fields such as Crash, Fire, Medical Attention and Police Report. The defect investigation data logs all safety-related defect investigations starting from 1972, although the investigation summaries have richer text for more recent years. Similarly, the recall data includes all recall campaigns since 1967. In the recall campaign data, the Defect Summary, Consequence Summary and Corrective Summary fields can be of special interest to NLP enthusiasts. Technical service bulletin data is available starting from 1995. The summary text of this data can provide some interesting insights about the service bulletins issued by OEMs.
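As a rough illustration of how the Boolean fields can be used to slice the narrative text, here is a minimal Python sketch. It assumes the data arrives as tab-delimited rows without a header line. The column names in `COLUMNS` and the sample rows are made up for illustration; the actual field layout should be taken from the documentation that accompanies the NHTSA download.

```python
import csv
import io

# Hypothetical column names, for illustration only (not the official layout).
COLUMNS = ["MAKE", "MODEL", "YEAR", "CRASH", "FIRE", "DESCRIPTION"]

# Made-up sample rows standing in for a downloaded flat file.
SAMPLE = (
    "ACME\tROADSTER\t2015\tY\tN\tBRAKES FAILED AND THE VEHICLE HIT A GUARDRAIL.\n"
    "ACME\tROADSTER\t2016\tN\tY\tSMOKE AND FLAMES CAME FROM UNDER THE HOOD WHILE PARKED.\n"
    "GLOBEX\tHAULER\t2018\tN\tN\tRADIO RESETS ITSELF EVERY FEW MINUTES.\n"
)


def load_complaints(handle, columns=COLUMNS):
    """Read tab-delimited rows (no header) and convert Y/N flags to booleans."""
    reader = csv.DictReader(
        handle, fieldnames=columns, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    records = []
    for row in reader:
        for flag in ("CRASH", "FIRE"):
            row[flag] = row[flag].strip().upper() == "Y"
        records.append(row)
    return records


complaints = load_complaints(io.StringIO(SAMPLE))

# Keep only the narratives of complaints where the FIRE flag is set.
fire_narratives = [r["DESCRIPTION"] for r in complaints if r["FIRE"]]
print(len(complaints), "records,", len(fire_narratives), "with FIRE set")
```

To run this against a real file, the `io.StringIO(SAMPLE)` handle would be replaced with an opened file, and `COLUMNS` with the documented field list.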

Since the NHTSA data is sourced from multiple OEMs, it is suitable for building generalized text analytics systems. Of all the available NHTSA datasets, the complaints data is the most interesting and lends itself to many useful text analytics applications. For example, because it covers a broad range of equipment failures, it can be used to detect safety events from complaint narratives.

Like any other data source, NHTSA data has some data quality issues. For example, the complaint text might not imply a crash, yet the record might be marked with CRASH=Y. In such cases, manual re-labeling will be required. Notwithstanding these shortcomings, NHTSA text data is a comprehensive automotive industry reference that can be used to draw invaluable insights using natural language processing.
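One way to narrow down the records that need manual re-labeling is to flag complaints where the crash flag is set but the narrative contains no crash-related wording. The sketch below is a simple keyword heuristic, not a validated method. The keyword list, the field names (`ID`, `CRASH`, `DESCRIPTION`) and the sample records are all assumptions made for illustration.

```python
import re

# Illustrative keyword list; an assumption, not a validated lexicon.
CRASH_PATTERN = re.compile(
    r"\b(crash\w*|collision|collided|accident|hit|struck|rear[- ]ended)\b",
    re.IGNORECASE,
)


def needs_review(record):
    """Return True when CRASH is 'Y' but the text has no crash-related keyword."""
    flagged = record["CRASH"].strip().upper() == "Y"
    mentions_crash = bool(CRASH_PATTERN.search(record["DESCRIPTION"]))
    return flagged and not mentions_crash


# Made-up records with hypothetical field names.
records = [
    {"ID": 1, "CRASH": "Y",
     "DESCRIPTION": "The vehicle lost power and collided with a parked car."},
    {"ID": 2, "CRASH": "Y",
     "DESCRIPTION": "Check engine light stays on after refueling."},
    {"ID": 3, "CRASH": "N",
     "DESCRIPTION": "Wipers stopped working in heavy rain."},
]

# IDs of records whose flag and narrative disagree, queued for a human check.
review_queue = [r["ID"] for r in records if needs_review(r)]
print(review_queue)
```

A heuristic like this only shortlists candidates. A person still has to read each flagged narrative, since a crash can be described without any of the listed keywords.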

Shailendra Jain

Management professional with experience in Artificial Intelligence, Deep Learning, Natural Language Processing and Conversational AI