ChatGPT and web scraping: how we classified 50,000 lawyers from more than 4,000 websites

Who we are?

We are a company called DataOx and we are providing web scraping services: our experts can grab, collect and classify everything you see in your browser. Also, for more than 2 years, AI is actively used in our tasks.

What is the task?

A customer requested us to create a database of all lawyers in the USA. Data needs to be scraped from more than 4,000 resources (only public ones due to privacy laws).

Example of such websites:

https://bulkley.com/professionals/santucci-jennifer/https://frblaw.com/professionals/elizabeth-g-conklin/
https://carmodymacdonald.com/people/brad-crandall/

The information in red squares needs to be collected

Then, collected lawyers' information needs to be classified by practice area, specialty and marked by tags. The result should be saved in database and be accessible through the web application.

The customer gave us the task due to complexity of scraping. Then we suggested to use ChatGPT for classification purposes.

Note: we are scraping more than 4,000 websites each 2 weeks right and the database is updating on a regular basis.

How we scraped lawyers data

Firstly, we collected a list of law resources.

The customer provider us with the most of the websites. Then, we searched for similar resources which includes: lawyer name, bio and contact data. This information is needed for the further classification. In addition, we searched for education and reviews.

Secondly, we started to develop web scrapers.

There are two types of resources:

  • Unprotected: if a resource does not have DDoS protection, we just get information from the website’s API. It’s a cheap and fast approach.
    However, we are trying to avoid huge amount of requests in short time to avoid websites overloading and harming.
  • Protected: if a resource has protection (usually Cloudflare), we use Selenium WebDriver with Zyte pool of the USA IP addresses.
    It costs more and scraping is slower, but actually it’s also not a problem.
The main sponsor of Selenium :)

All the scrapers deployed to AWS (Amazon Web Services). We launch them manually 2 times per month to update the database. The data is stored is Elasticsearch.

Now, the database has about 50,000 lawyers.

Training of the AI model

After scraping, we are received a huge amount of unstructured data: names, biographics, some contact data. This gives us no ability to search by practice area or specialty (as was requested).

Therefore, we need to somehow classify lawyers and split them into categories. To do this, we fine-tuned the ChatGPT model.

Very well-known logo for the past year

The process of AI training:

  1. We extracted and sorted information of some lawyers manually.
  2. Then our team classified and set appropriate tags depended on bio and further information.
  3. Finally, ChatGPT has been trained with these examples.

For fine-tuning we used regular approach from Open AI β€” view information.

The fine-tuned model has been deployed to AWS and now works in API format: we send text and receive tags in response.

The full process

Now, the entire system works in the following way:

  1. We collect new data with the help of web scraper.
  2. The data is passed to the AI API.
  3. API gives us tags and summary.
  4. Entire information with tags and contact data is added to DB.
  5. Lawyers data is accessible by web UI.
Workflow of the data classification

Examples of extracting the practice area and specialties:

  • β€œJennifer Santucci is a member of the Real Estate department where she works with clients on drafting and negotiating purchase and sale agreements; reviewing and analyzing sales contracts, LLC/corporate documents, trust documentation and title commitments; and preparing for and conducting closings. Her experience also includes real estate financing, including representing various lenders in commercial real estate transactions, and preparation of loan agreements and other loan documents on behalf of lenders.” -> Practice Areas: Real Estate, Specialties: Finance, P&S.
  • β€œBrad Crandall is a principal attorney in the firm’s transactional group and focuses his practice on mergers and acquisitions, corporate finance, commercial real estate, and corporate law. He also counsels and represents companies and entrepreneurs in a wide variety of industries in the areas of business formation, funding, and strategy, corporate governance, business succession, and private equity and venture capital. Prior to joining Carmody MacDonald, Brad was a founding member of a St. Louis-based law firm offering services in business law, M&A, commercial finance, employment law, real estate, estate planning, and probate and trust administration. Brad received his B.S. in Economics, cum laude, from Southwest Missouri State University, and his J.D., Order of the Coif, from Washington University School of Law in St. Louis. Brad is licensed to practice in Missouri and Illinois.” -> Practice Areas: Corporate, Specialties: Finance, Private Equity, M&A, Venture Capital.

Conclusion

It was an example of our regular tasks. Therefore, if you need to scrap some large amount of data, to make AI classification or to develop some software β€” ask us!

--

--