Illustration by Alysheia Shaw-Dansby

A Deep Dive into Course Descriptions: Using Quanteda to Identify Work-Based Learning Opportunities

Published in Data@Urban, Urban Institute
Oct 19, 2023

Community colleges, with their career-focused curricula and diverse student populations, offer an ideal setting for work-based learning (WBL), which consists of opportunities such as internships, apprenticeships, and practicums focused on career preparation and training in supervised, work-like settings. In the first post of our two-part series, we detailed our process for compiling a comprehensive dataset of Florida community college course descriptions using web scraping techniques to begin to measure WBL opportunities in community colleges. With that dataset, we now turn to our question for analysis: What is the prevalence and nature of WBL opportunities in Florida’s community colleges, as reflected in their course descriptions?

To answer this question, we turned to text analysis, an approach that allows us to extract meaningful information from large volumes of text data. Text analysis is particularly suited to our research question because it enables us to identify and quantify specific keywords related to work-based learning within our data frame of course descriptions. In this post, we’ll delve into our methodology for analyzing these data using the quanteda package in R, a tool for quantitative text analysis. All of the code can be viewed in our public GitHub repository.

Getting started with quanteda

Quanteda, short for Quantitative Analysis of Textual Data, is an R package for managing and analyzing text data. It offers a suite of functions for working with a corpus (a collection of texts), such as creating document-feature matrices and analyzing keywords. These functions are efficient and provide a consistent interface with support for multiple languages. While quanteda operates as a standalone package, it also integrates with extensions such as readtext, spacyr, and quanteda.textstats.

To install and load the required packages, we used the librarian package for convenience.
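A minimal sketch of that setup step follows. The package list is an assumption based on the tools named in this post, not the authors' exact list, which lives in their repository.

```r
# Install librarian itself if it is not already available
if (!requireNamespace("librarian", quietly = TRUE)) {
  install.packages("librarian", repos = "https://cloud.r-project.org")
}

# shelf() installs any missing packages and then attaches all of them.
# The packages listed here are assumed for illustration.
librarian::shelf(quanteda, readtext, dplyr, tidyr, stringr)
```

Using `shelf()` replaces a series of separate `install.packages()` and `library()` calls with a single line.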

Next, we loaded our text data containing course descriptions and document-level metadata using the readtext::readtext() function. We then performed some additional cleaning steps to standardize variable names and focus our analysis on active courses for credit. We included five columns in our courses data frame:

  • doc_id: A unique identifier for each course, constructed by combining the school abbreviation, the course prefix, and the course number. For example, ‘BC-THE-2300’ refers to a “Survey Of Dramatic Literature” course in the theatre arts department at Broward College.
  • discipline: The subject area of the course, represented as a code and a description, such as ‘080 — THEATRE ARTS’.
  • course_title: The title of the course.
  • course_credits: The number of credit hours that a student receives upon completion of the course.
  • text: The course description text that was scraped from the Florida Department of Education’s course catalog website. This column is the main body of text that we analyzed using text mining techniques.
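The cleaning steps described above might look roughly like the sketch below. The raw column names, the `Status` field, and every row of data are hypothetical stand-ins; the real input came from the scraped catalog and may be structured differently.

```r
library(dplyr)

# Hypothetical raw data standing in for the scraped catalog
raw <- data.frame(
  doc_id = c("AAA-NUR-1021", "AAA-THE-2300", "BBB-EDF-2085"),
  Discipline = c("105 - NURSING", "080 - THEATRE ARTS", "013 - EDUCATION"),
  `Course Title` = c("Nursing Practice I", "Dramatic Literature", "Teaching Seminar"),
  `Course Credits` = c(4, 3, 0),
  Status = c("Active", "Inactive", "Active"),
  text = c(
    "Students complete a supervised clinical experience in a hospital setting.",
    "A survey of plays from several historical periods.",
    "A noncredit seminar that includes a field experience in local schools."
  ),
  check.names = FALSE
)

courses <- raw %>%
  # Standardize variable names to lower snake case
  rename_with(~ tolower(gsub("\\s+", "_", .x))) %>%
  # Keep only active courses that carry credit
  filter(status == "Active", course_credits > 0) %>%
  # Keep the five columns used in the analysis
  select(doc_id, discipline, course_title, course_credits, text)

courses
```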

First, we created a corpus of our course descriptions. We did so using the corpus() function in quanteda.
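As a sketch of this step, assuming a `courses` data frame with `doc_id` and `text` columns (the rows below are made up for illustration):

```r
library(quanteda)

# Hypothetical course data; the real data frame has five columns
courses <- data.frame(
  doc_id = c("AAA-NUR-1021", "AAA-BUS-2940", "BBB-EDF-2085"),
  discipline = c("Nursing", "Business", "Education"),
  text = c(
    "Students complete a supervised clinical experience in a hospital setting.",
    "An internship with a local employer, requiring 120 hours on site.",
    "Includes a field experience in local schools."
  )
)

# doc_id becomes the document name; the remaining columns become
# document-level variables (docvars) attached to each document
corp <- corpus(courses, docid_field = "doc_id", text_field = "text")

# Show the first few documents, truncating each text for display
print(corp, max_ndoc = 3, max_nchar = 60)
docvars(corp)
```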

This code produced a corpus object consisting of the course descriptions from our data frame. The corpus also contains metadata about the documents, each of which corresponds to a course description. For instance, the first few documents in the corpus look something like this:

Each line represents a document, with the document ID followed by a snippet of the course description. This corpus will serve as the basis for our subsequent text analysis.

Next, we extracted the tokens in the corpus. Tokens are usually words, but they can also be n-grams (sequences of successive tokens) or multiword expressions. The tokens() function allows us to define what we consider to be a token and to ignore elements such as punctuation and digits. Our list of tokens for each document included the individual words from each course description, stripped of punctuation and numbers.
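A minimal sketch of tokenization on a single made-up course description; the document name and text are hypothetical.

```r
library(quanteda)

# One hypothetical course description, named by its document ID
corp <- corpus(c(
  "AAA-BUS-2940" = "An internship with a local employer, requiring 120 hours on site."
))

# Split into word tokens, dropping punctuation and numbers
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
print(toks)

# Bigrams (n-grams with n = 2) can be built from the same tokens
bigrams <- tokens_ngrams(toks, n = 2)
```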

Key term searches with a dictionary

We wanted to identify courses related to different types of WBL, such as internships, apprenticeships, or practicums. To do so, we developed a dictionary of key terms in collaboration with subject matter experts from Urban’s Income and Benefits Policy Center who specialize in workforce development research. The resulting dictionary used in this analysis was a product of careful curation based on the literature on WBL.

For each type of WBL experience, we created a list of terms that we wanted to treat equivalently. For instance, our dictionary specifies that a course description refers to a clinical WBL experience if either “clinicals” or “clinical experience” appears. This approach allows us to capture the various ways a type of WBL might be referred to in a course description.
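A dictionary of this kind can be sketched as follows. Only the `clinical` entry comes from the example above; the other keys and their terms are hypothetical placeholders, not the curated list used in the analysis.

```r
library(quanteda)

wbl_dict <- dictionary(list(
  # From the post: either term signals a clinical experience
  clinical = c("clinicals", "clinical experience"),
  # Hypothetical entries, shown only to illustrate the structure;
  # the trailing * is a glob wildcard matching any word ending
  internship = c("internship*"),
  apprenticeship = c("apprentice*"),
  practicum = c("practicum*")
))

wbl_dict
```

Each key names a type of WBL, and each value lists the terms treated as equivalent evidence of that type.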

Equipped with our dictionary, we searched for the terms using the kwic() (keyword in context) function. This function takes in a corpus and a dictionary as inputs, along with a window parameter specifying the number of tokens before and after a keyword that we want to see for context. The function outputs a data frame that provides the matching keyword along with its surrounding context in the pre and post columns, allowing us to understand the use of our key terms within the course descriptions.
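The search can be sketched as below, using made-up documents and a shortened dictionary. One assumption to flag: recent versions of quanteda expect `kwic()` to receive a tokens object rather than a corpus, so this sketch tokenizes first.

```r
library(quanteda)

# Hypothetical course descriptions
corp <- corpus(c(
  "AAA-NUR-1021" = "Students complete a supervised clinical experience in a hospital setting.",
  "AAA-BUS-2940" = "An internship with a local employer, requiring 120 hours on site.",
  "BBB-ART-1000" = "An introduction to drawing and composition."
))
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)

# Shortened, partly hypothetical dictionary
wbl_dict <- dictionary(list(
  clinical = c("clinicals", "clinical experience"),
  internship = "internship*"
))

# Find each dictionary match with 5 tokens of context on either side
kw <- kwic(toks, pattern = wbl_dict, window = 5)

# Convert to a plain data frame for downstream joins
kw_df <- as.data.frame(kw)
kw_df[, c("docname", "pre", "keyword", "post", "pattern")]
```

The `pattern` column holds the dictionary key that matched, which is what lets each hit be labeled with a WBL type.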

Finally, we integrated the results from the dictionary-based keyword in-context search back into our course-level data. This process involved merging the keyword data with the original course data and performing a series of data transformations. Specifically, we split the docname document identifier into separate school and course columns, created a sentence column that encapsulates the matching keywords along with their context, and performed string manipulations on the discipline and pattern columns to enhance readability.
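These transformations might look something like the sketch below. It starts from a hand-built data frame with the same columns as a keyword-in-context result; all values, and the exact string manipulations, are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical keyword-in-context results
kw_df <- data.frame(
  docname = c("AAA-NUR-1021", "BBB-EDF-2085"),
  pre = c("Students complete a supervised", "Includes a"),
  keyword = c("clinical experience", "field experience"),
  post = c("in a hospital setting", "in local schools"),
  pattern = c("clinical", "field_experience")
)

# Hypothetical course-level data
courses <- data.frame(
  doc_id = c("AAA-NUR-1021", "BBB-EDF-2085"),
  discipline = c("105 - NURSING", "013 - EDUCATION"),
  course_title = c("Nursing Practice I", "Introduction to Teaching"),
  course_credits = c(4, 3)
)

wbl_courses <- kw_df %>%
  # Attach course metadata to each keyword match
  left_join(courses, by = c("docname" = "doc_id")) %>%
  # Split "AAA-NUR-1021" into school "AAA" and course "NUR-1021"
  separate(docname, into = c("school", "course"), sep = "-", extra = "merge") %>%
  mutate(
    # Rebuild the matched keyword together with its context
    sentence = str_squish(paste(pre, keyword, post)),
    # Drop the numeric code and separator, then use title case
    discipline = str_to_title(str_remove(discipline, "^\\d+\\W+")),
    # Turn dictionary keys into readable labels
    pattern = str_to_sentence(str_replace_all(pattern, "_", " "))
  )

wbl_courses
```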

Having identified the keywords in the course descriptions, we turned our attention to quantifying the prevalence of different types of WBL opportunities across schools. To do so, we grouped our data by school and pattern (which represents the type of WBL opportunity), counted the number of occurrences, and reshaped our data so each type of WBL opportunity was a separate column. We also added a ‘total’ row that summed the counts across all schools for each type of opportunity.
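The counting and reshaping can be sketched with dplyr and tidyr as follows. The input rows are hypothetical, and the authors' code may build the total row differently.

```r
library(dplyr)
library(tidyr)

# Hypothetical matches: one row per course and WBL type
wbl_courses <- data.frame(
  school = c("AAA", "AAA", "BBB", "BBB", "BBB"),
  pattern = c("clinical", "internship", "clinical", "clinical", "practicum")
)

# Count matches by school and type, then spread types into columns
wbl_counts <- wbl_courses %>%
  count(school, pattern) %>%
  pivot_wider(names_from = pattern, values_from = n, values_fill = 0)

# Sum every count column across schools to form the total row
totals <- wbl_counts %>%
  summarise(across(where(is.numeric), sum)) %>%
  mutate(school = "Total")

wbl_table <- bind_rows(wbl_counts, totals)
wbl_table
```

Setting `values_fill = 0` records a zero, rather than a missing value, for any school with no courses of a given type.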

Findings and Limitations

In our analysis of course descriptions from Florida community colleges, we have identified key findings and limitations that shed light on the prevalence and nature of work-based learning (WBL) opportunities.

Key Findings:

1. Medical and Education Sectors Dominate: Given their licensure requirements, it is not surprising that the medical and education sectors offer the majority of WBL programs. Notably, clinical experiences are predominantly found in medical and related disciplines, constituting 75% of all WBL courses. Additionally, the education sector hosts 78% of courses involving field experiences. The other major sectors with WBL programs are Engineering (8.7% of all WBL courses) and Business (4.5% of all WBL courses). This concentration suggests potential for expansion into other domains to provide students with diversified work experiences, skill development, and connections with employers.

2. Apprenticeships are Scarce: Despite their substantial value in career preparation and advancement, apprenticeships are among the least offered forms of WBL, with only 11 courses identified in our dataset. By contrast, field experiences (307), clinicals (237), co-ops (214), and practicums (204) are the most common. This finding points to an area with untapped potential for growth in WBL programs.

Limitations:

1. Quality and Equity Information Lacking: While our dataset provides valuable insights into the prevalence of WBL courses, it falls short in providing information that would help assess the quality of WBL experiences and dimensions of equity. Notably, there is no information on whether these programs are paid. Given that 81% of part-time and 47% of full-time students at two-year public colleges work while enrolled in their courses, the absence of paid opportunities could hinder their ability to participate in WBL alongside their studies and work commitments. This could negatively impact successful completion of studies.

2. Limited Scope: Our analysis focuses solely on a single state, which constrains the generalizability of our findings. While valuable, the WBL data for Florida offer only a snapshot of the landscape of WBL opportunities and may not be indicative of national trends or cross-state variations. Developing a nationally representative dataset would permit more comprehensive comparisons and more robust insights into the disparities and nuances of WBL programs across states.

In conclusion, while the approach used here has limitations, it may be useful for establishing metrics related to goals for increasing the number of work-based learning opportunities for students, and identifying opportunities for growth (e.g., in certain sectors).

Looking Forward

Quanteda stands out for its intuitive interface and ability to efficiently handle large volumes of text data. It provides researchers with the flexibility to shape their analysis according to their specific needs, such as identifying key terms, comparing text documents, or investigating text patterns. Given that our study was exploratory, this flexibility was particularly useful. It allowed us to quickly and efficiently examine novel text data. The functions used in this analysis only scratch the surface of the full functionality of quanteda. To learn more, you can read quanteda’s excellent official tutorial and documentation.

Although our analysis focused on Florida, our use of quanteda to analyze course descriptions and identify WBL opportunities in community colleges could be applied to other states or regions. Extending the analysis to other states would provide a broader understanding of WBL opportunities in community colleges across the US, informing policy and identifying areas for expansion. This work could help establish goals, support programs, and highlight best practices nationwide.

-Manuel Alcalá Kovalski

-Judah Axelrod
