DATA STORIES | LEARNING ANALYTICS | KNIME ANALYTICS PLATFORM

Pipelines towards understanding students’ learning behavior with KNIME

A dashboard about distance learning students

Rozita Tsoni

Published in

Low Code for Data Science

Apr 5, 2023

Co-authors: Georgia Garani and Vassilios Verykios

This article is an adapted version of the scientific paper “Data pipelines for educational data mining in distance education” (2023), published in the journal Interactive Learning Environments.

While teaching, tutors constantly gather information from learners and use it to interact efficiently and adapt their teaching to students’ needs. The ability to observe and reflect on this information is a key asset for their effectiveness. Unfortunately, in Distance Learning valuable cues, such as body language and facial expressions, are to a large extent unavailable due to the temporal and spatial distance that separates teaching and learning. So, what can be done to bridge this gap? Fortunately, Learning Analytics (LA) and Educational Data Mining (EDM) can help overcome those barriers and transform accumulated transactional data into valuable knowledge. What is needed, therefore, is a holistic process, designed and implemented end to end, that improves our understanding of the learners.

1. Learning Analytics can bring distance learners “closer” and facilitate the learning process

The impact of LA and EDM is widely recognized across multiple sectors of education, since they can improve educational design, learning material, teaching methods, and institutional policies. Although automating the EDM and LA process might be thought to sideline human-driven concerns, the flexibility that automated processes offer allows us to adjust to real-life problems and address them efficiently. Therefore, we propose combining an education-oriented Data Warehouse (DW) with data pipelines as a complete EDM process. The main benefit we expect from this approach is repeatability and adjustability to the specific needs of educational research. By reducing the time and effort required to modify the EDM process, we can continually produce updated and relevant results. Once the technical details of an EDM project are resolved, there is more room for pedagogical reflection and evidence-based decision making.

We aim to provide the design and infrastructure that loads raw educational data and turns it into actionable knowledge. To test the functionality of our project, we focus on producing dashboards that inform tutors about students’:

a. Online activity

b. Academic performance

c. Social interaction

2. Organizing data through a custom-made DW

We developed a DW based on the constellation-schema approach. A constellation schema can be regarded as a collection of several star schemas that share a number of dimension tables. The data processed are retrieved from Moodle, an e-learning platform. Moodle contains microlevel data (i.e., mainly clickstream data), mesolevel data (i.e., mostly students’ written artifacts), and macrolevel data (i.e., information collected at the institutional level, such as admission data).

The constellation schema designed for this work follows a similar approach to the 5W1H framework (Garani & Adam, 2020). It contains two fact tables: the Log fact table and the Performance fact table. The Log fact table includes data about the logged actions of Moodle users. The data describe who (user), when (date and time), where (IP address), and what (event). An event can be any action the user can take, such as read, add, update, or delete. Each event takes place on a specific element, for example, a quiz, an assignment, a resource, or a forum. Therefore, users may read a forum message, add an assignment, update a quiz, delete a resource, etc. Users are clustered into groups according to their role in the educational process, i.e., students, teachers, coordinators, etc. The second fact table is the Performance fact table, where data about students’ and teachers’ performances are stored. Specifically, the data show who (user), when (date and time), what (event and type), whom (affected user), and in which course a particular grade is achieved. As the description above indicates, four dimensions are shared between the two fact tables: the User, Date, Time, and Event dimension tables. Figure 1 shows the constellation schema of the DW storing the main data components of the Moodle platform.

Figure 1. Constellation schema of the Moodle DW.

The Time and Date dimensions are denormalized, whereas the Event, User, and Module dimensions are normalized, with their dimension hierarchies expressed in separate tables. The User, Event, Time, and Date dimensions are shared by both fact tables. The Course and Type tables are linked to the Performance fact table. The Module table is shared by the Performance fact table and the Log fact table through its hierarchical relationship with the Event dimension table.
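To make the idea of two fact tables sharing dimensions concrete, the schema described above can be sketched as a small in-memory SQLite database. This is only an illustration under stated assumptions: every table and column name below (dim_user, fact_log, ip_address, and so on) is hypothetical and is not taken from the authors' DW. Only the two fact tables, the shared User, Date, Time, and Event dimensions, and the Event-to-Module hierarchy come from the text.

```python
import sqlite3

# Hypothetical DDL: all names and columns are illustrative assumptions,
# not the authors' actual schema.
DDL = """
CREATE TABLE dim_user   (user_id   INTEGER PRIMARY KEY, role  TEXT);
CREATE TABLE dim_date   (date_id   INTEGER PRIMARY KEY, day   TEXT);
CREATE TABLE dim_time   (time_id   INTEGER PRIMARY KEY, hhmm  TEXT);
CREATE TABLE dim_module (module_id INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE dim_event  (event_id  INTEGER PRIMARY KEY, action TEXT,
                         module_id INTEGER REFERENCES dim_module(module_id));
CREATE TABLE dim_course (course_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE dim_type   (type_id   INTEGER PRIMARY KEY, label TEXT);

CREATE TABLE fact_log (
    user_id    INTEGER REFERENCES dim_user(user_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    event_id   INTEGER REFERENCES dim_event(event_id),
    ip_address TEXT
);
CREATE TABLE fact_performance (
    user_id          INTEGER REFERENCES dim_user(user_id),
    affected_user_id INTEGER REFERENCES dim_user(user_id),
    date_id          INTEGER REFERENCES dim_date(date_id),
    time_id          INTEGER REFERENCES dim_time(time_id),
    event_id         INTEGER REFERENCES dim_event(event_id),
    course_id        INTEGER REFERENCES dim_course(course_id),
    type_id          INTEGER REFERENCES dim_type(type_id),
    grade            REAL
);
"""


def foreign_tables(conn, table):
    """Return the set of tables that `table` references via foreign keys."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    return {row[2] for row in rows}  # column 2 holds the referenced table


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# Dimensions referenced by both fact tables form the shared part
# of the constellation.
shared = foreign_tables(conn, "fact_log") & foreign_tables(conn, "fact_performance")
print(sorted(shared))
```

In this sketch the Module dimension is not referenced directly by either fact table; it is reached through the Event dimension, which mirrors the hierarchical relationship described in the text.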

3. A visual programming tool for data science

We chose KNIME Analytics Platform to support our attempt to provide a holistic, end-to-end solution for LA. Its graphical interface for visual programming does not require advanced programming skills. Reducing the syntactic overhead of data science frees time and mental capacity that can be spent on concepts instead (Delen et al., 2021). KNIME allows the creation of data pipelines by assembling nodes, which are blocks that perform a given operation. Sub-workflows can be bundled into more elaborate units of functionality called components. Components can be shared, modified, and reused. Once a pipeline is created and configured, it can be executed. Upon execution, the output ports of the nodes become active so that results can be inspected. Moreover, pipelines can be stored and exported in several formats. This way, the process can be repeated accurately, allowing others to test its validity and make any necessary adjustments.

4. The pipeline

The increased complexity, modality, volume, and variability of operational data encountered in educational settings in recent years call for an encompassing and unifying framework to deal with the different steps of the data science process. We propose the concept of a data pipeline as an end-to-end set of processes that run in sequence.

Our pipeline consists of three major parts (Figure 2). The first part concerns the import of data, the second part implements the pre-processing steps, and the third part includes the analysis, visualization, and reporting tasks.

The data import area (Figure 3) includes the nodes that enable communication with the DW. The first node is the SQLite Connector node. The configuration window of the second node, the DB Query Reader node, allows forming the statements for querying the many tables of the DW. The node returns the requested results in a new data table.
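For readers who think in code rather than nodes, the connect-then-query pattern of the import area can be approximated in a few lines of Python. This is a rough analogue and not the KNIME workflow itself: the toy in-memory database, the table names fact_log and dim_event, and the query are all assumptions made for illustration.

```python
import sqlite3


def load_event_counts(conn):
    """Run one query against the DW and return its rows as a new table.

    Returns (user_id, action, n_events) tuples, ordered by user and action.
    """
    query = """
        SELECT l.user_id, e.action, COUNT(*) AS n_events
        FROM fact_log AS l
        JOIN dim_event AS e ON e.event_id = l.event_id
        GROUP BY l.user_id, e.action
        ORDER BY l.user_id, e.action
    """
    return conn.execute(query).fetchall()


# Step 1 (connector): open the database. A toy in-memory database with
# hypothetical tables and rows stands in for the real DW file here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_event (event_id INTEGER PRIMARY KEY, action TEXT);
    CREATE TABLE fact_log  (user_id INTEGER, event_id INTEGER);
    INSERT INTO dim_event VALUES (1, 'read'), (2, 'add');
    INSERT INTO fact_log  VALUES (10, 1), (10, 1), (10, 2), (11, 1);
""")

# Step 2 (query reader): execute the statement and collect the result table.
rows = load_event_counts(conn)
for row in rows:
    print(row)
```

Because the join and the aggregation happen inside the query, part of the selection and filtering work is done before any pre-processing step runs.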

The DW has already incorporated the data integration. Data integration or data blending is a key process for enriching the dataset and augmenting its dimensionality (Silipo & Rudnitckaia, 2021). In educational research, multidimensionality is necessary in order to encompass the complexity of our research problems. The architecture of the DW itself, along with the data query, is already carrying out a part of the data filtering and data selection tasks, significantly reducing the data pre-processing steps.

The second part of the pipeline is driven by the specific requirements of the LA process, since some of the data preparation steps are motivated by analysis and modelling itself (Berthold et al., 2020a). Not all pre-processing tasks are appropriate for every LA process (Romero et al., 2014). Here, the goal is to offer educators an overview of the students’ learning progress individually and within the context of their group, as well as provide them with insights into the students’ social behavior by visualizing their forum interaction network.

Recommendations drawn from the literature (Rabbany et al., 2014; Bakhshinategh et al., 2018; Procaci et al., 2018; ElAtia et al., 2020; Ifenthaler et al., 2021) and previous relevant research (Tsoni et al., 2019; 2021; 2022a; 2022b) defined our pre-processing and analysis options. As a result, there are two different sequences of tasks in the pre-processing part of the pipeline. The upper sequence of nodes (Figure 4) creates a data table for Social Network Analysis (SNA).

The yellow border rectangle (Figure 5) includes the process of attribute selection, data filtering and transformation to prepare the data for the analysis performed in the component that follows.

Figure 5. Attribute selection, data filtering, and data transformation.

The third part of the pipeline is also divided into two sets of processes. The SNA process (Figure 6) includes the network creation and visualization. The analysis and reporting process, which is shown in the lower dark blue border rectangle in Figure 2, is a more complex process of analyzing students’ data in order to produce graphs that summarize students’ activity and reveal important features of their learning behavior.

The results are presented in an interactive dashboard that allows for customization. Interactive dashboards are considered the most intuitive deployment options (Berthold et al., 2020b).

This last part of the pipeline makes it easy to accommodate changes in requirements over time, as educational settings continue to evolve. Instead of redesigning and applying a whole new process, modifications are made by replacing only the necessary nodes or simply by changing the settings in the configuration windows of certain nodes.

5. What information do we get?

Multiple graphs are provided in the dashboard, either for individual students or for their group as a whole. One example from our pipeline execution is the combination of multiple sunburst charts. A sunburst chart is a multidimensional, hierarchical graph that represents nominal data in a radial layout. In this case (Figure 7), moving outwards, the rings denote the group, the user, and the event. Moving the cursor over the chart selects a segment, and the corresponding event count appears in the middle of the circle.

Figure 7. One of the sunburst charts for group activity comparison.

Presenting the sunburst charts for every group side by side facilitates comparison and allows important conclusions to be drawn. For example, the charts show how students of each group interacted in the fora, how they accessed the online learning material, and the level of the tutor’s intervention.
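The numbers behind such a chart are simple hierarchical counts. The following minimal sketch, assuming the log has already been reduced to (group, user, event) records, shows how the value of each ring segment could be derived. The record layout, the function name sunburst_counts, and the sample data are hypothetical and only illustrate the aggregation, not the KNIME component used in the pipeline.

```python
from collections import defaultdict


def sunburst_counts(events):
    """Aggregate (group, user, event) records into per-ring segment counts.

    Each segment is keyed by its path from the centre outwards, so the
    inner ring uses (group,), the middle ring (group, user), and the
    outer ring (group, user, event).
    """
    rings = {
        "group": defaultdict(int),
        "user": defaultdict(int),
        "event": defaultdict(int),
    }
    for group, user, event in events:
        rings["group"][(group,)] += 1
        rings["user"][(group, user)] += 1
        rings["event"][(group, user, event)] += 1
    return {ring: dict(counts) for ring, counts in rings.items()}


# Hypothetical sample records.
events = [
    ("G1", "s01", "read"),
    ("G1", "s01", "read"),
    ("G1", "s02", "add"),
    ("G2", "s03", "read"),
]
counts = sunburst_counts(events)
print(counts["group"])
print(counts["user"])
```

Each parent segment equals the sum of its children, which is what lets the chart show a meaningful count at any level that the cursor selects.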

At the Hellenic Open University (HOU), written assignments are used to assess students’ progress and constitute an important learning tool. Students complete written assignments individually and in an unsupervised setting, submit them online, and declare them as their own work. Tutors rely on students’ integrity, since the only check on their side concerns plagiarism (i.e., presenting another person’s work as one’s own) and does not cover other ways of cheating. Previous work (Gkontzis et al., 2018) showed that clustering methods along with predictive models can indicate “potential cheaters”.

Thus, it is important to group students into four categories that are meaningful for the summative assessment of the course:

(1) Students who successfully completed the course.

(2) Students who were excluded from the final exams because they did not manage to have the pre-requisite average grade in the written assignments.

(3) Students who participated in the final exams but failed.

(4) Potential cheaters.

This labeling can help tutors reflect on the final results of their course and focus on the students who failed to complete it. Tutors can also identify difficulties that were never expressed and that led students to seek external help and resort to academic dishonesty. The “potential cheaters” category is not intended to stigmatize students. Rather, the goal is to alert tutors to the need to channel their efforts into further motivating these students’ participation and involvement, since there is a discrepancy between their written assignments and their final grade. In the following visualization (Figure 8), the activity of each student is monitored via the total count of events in the online learning environment. Students from each of the aforementioned categories are shown in a different color. The low activity of “potential cheaters” supports our hypothesis. This is consistent with previous studies that have shown a positive correlation between students’ activity and their academic performance (Tsoni et al., 2022b).
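To illustrate how the four labels could be attached to student records, here is a deliberately simple rule-based sketch. It is not the method used in the paper or in Gkontzis et al. (2018), which relied on clustering and predictive models. The pass mark, the gap threshold, the assumption that flagged students are those who failed the exam despite much higher assignment grades, and the function name label_student are all hypothetical.

```python
def label_student(assignment_avg, exam_grade, pass_mark=5.0, gap=4.0):
    """Assign one of four illustrative categories to a student.

    assignment_avg: average grade of the written assignments.
    exam_grade: final exam grade, or None if the student was excluded.
    pass_mark, gap: hypothetical thresholds chosen for this sketch only.
    """
    if assignment_avg < pass_mark:
        # Did not reach the prerequisite average in the written assignments.
        return "excluded"
    if exam_grade is None:
        raise ValueError("eligible students are assumed to have an exam grade")
    if exam_grade >= pass_mark:
        return "completed"
    if assignment_avg - exam_grade >= gap:
        # Large discrepancy between assignments and exam: a prompt for the
        # tutor to look closer, not a verdict about the student.
        return "potential cheater"
    return "failed"


# Hypothetical students: (assignment average, exam grade).
students = {
    "s01": (8.0, 7.0),
    "s02": (4.0, None),
    "s03": (6.0, 4.0),
    "s04": (9.5, 3.0),
}
labels = {sid: label_student(avg, exam) for sid, (avg, exam) in students.items()}
print(labels)
```

A threshold rule like this is easy to inspect and adjust, but the thresholds would have to be calibrated against real course data before the labels carry any meaning.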

Figure 8. The activity of students by category.

In online learning environments, social behavior is mainly expressed in discussion fora. At the HOU, students are encouraged to use them to interact with peers and tutors, to ask questions about their courses, or simply to chat. Regardless of the specific purpose of the posts, visualizing forum activity as a network graph can provide rich information about students’ behavior. These benefits are widely recognized in the educational field (Sergis & Sampson, 2017). A bimodal network was created to express students’ interaction in the communal forum for all classes. Each node represents either a discussion thread or a participant (student or tutor) who posted. A participant node and a thread node are connected with an edge when that participant has posted in that thread. Figure 9 shows the network of forum interaction.

One of the most obvious observations is that the tutors have a central role in the network. Additionally, there is an inner structure indicating a tendency toward the formation of sub-groups within the forum. There is a completely disconnected area (in the upper-left part of Figure 9), signifying the preference of some students to interact only with people from their own group.
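The two observations above, central tutors and a disconnected sub-group, correspond to two basic network measures: node degree and connected components. The sketch below builds a two-mode (participant and thread) graph from hypothetical (participant, thread) post records using only the Python standard library. It assumes the bimodal network links participants to the threads they posted in; the names and data are invented, and the pipeline itself performs this step with KNIME nodes rather than with this code.

```python
from collections import defaultdict


def build_bimodal(posts):
    """Build the adjacency of a two-mode graph: participants <-> threads."""
    adj = defaultdict(set)
    for participant, thread in posts:
        p, t = ("participant", participant), ("thread", thread)
        adj[p].add(t)
        adj[t].add(p)
    return adj


def components(adj):
    """Find connected components with an iterative depth-first search."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.add(node)
            stack.extend(adj[node] - seen)
        result.append(comp)
    return result


# Hypothetical forum posts: (participant, thread).
posts = [
    ("tutor_A", "t1"), ("s01", "t1"), ("s02", "t1"),
    ("tutor_A", "t2"), ("s03", "t2"),
    ("s04", "t3"), ("s05", "t3"),  # a sub-group that only talks to itself
]
adj = build_bimodal(posts)
comps = components(adj)

# Degree of a participant node = number of threads they posted in.
threads_per_participant = {
    name: len(neighbours)
    for (kind, name), neighbours in adj.items()
    if kind == "participant"
}
print(len(comps), threads_per_participant)
```

In this toy data the tutor has the highest degree, and the thread t3 with its two students forms its own component, which is the code-level counterpart of a disconnected area in the network plot.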

6. From raw data to insights

Educational research and practice demand clear and valid solutions that are based on thorough technical design and explicit methodological steps. Poor preparation and misuse of the available affordances will lead the learning community to blame distance learning as a process rather than the failure of its implementation (Naidu, 2020). Our approach addresses the issues of gathering, arranging, and manipulating data in an automated yet adjustable way. The process begins at the data source, by organizing a DW for educational data, and ends with deploying the results as meaningful and interpretable graphs presented in an LA dashboard. We showed that data pipelines, executed using KNIME Analytics Platform, can offer a holistic solution for EDM.

References

Bakhshinategh, B., Zaiane, O. R., ElAtia, S., & Ipperciel, D. (2018). Educational data mining applications and tasks: A survey of the last 10 years. Education and Information Technologies, 23(1), 537–553.

Berthold, M. R., Borgelt, C., Höppner, F., Klawonn, F., & Silipo, R. (2020a). Data preparation. In Guide to Intelligent Data Science (pp. 127–156). Springer, Cham.

Berthold, M. R., Borgelt, C., Höppner, F., Klawonn, F., & Silipo, R. (2020b). Deployment and Model Management. In Guide to Intelligent Data Science (pp. 319–328). Springer, Cham.

Delen, D., Helfrich, S., & Silipo, R. (2021). KNIME Analytics Platform for Visual Data Science and Business Analytics Teaching. In Proceedings of the 52nd ACM Technical Symposium on Computer Science Education (pp. 1373–1373).

ElAtia, S., Ipperciel, D., Zaiane, O., Bakhshinategh, B., & Thibaudeau, P. (2020). Graduate attributes assessment program. The International Journal of Information and Learning Technology.

Garani, G., & Adam, G. K. (2020). A Semantic Trajectory Data Warehouse for Improving Nursing Productivity. Health Information Science and Systems Journal, 8(25), 1–13.

Gkontzis, A., Kotsiantis, S., Tsoni, R., & Verykios, V. S. (2018). An Effective LA Approach to Predict Student Achievement. In Proceedings of the 22nd Pan-Hellenic Conference on Informatics. ACM.

Ifenthaler, D., Gibson, D., Prasse, D., Shimada, A., & Yamada, M. (2021). Putting learning back into learning analytics: actions for policy makers, researchers, and practitioners. Educational Technology Research and Development, 69(4), 2131–2150.

Naidu, S. (2020). It is the worst — and the best — of times! Distance Education, 41(4), 425–428.

Procaci, T. B., Siqueira, S. W., & Nunes, B. P. (2018, July). Learning in communities: How do outstanding users differ from other users?. In 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT) (pp. 173–177). IEEE.

Rabbany, R., Elatia, S., Takaffoli, M., & Zaïane, O. R. (2014). Collaborative learning of students in online discussion forums: A social network analysis perspective. In Educational data mining (pp. 441–466). Springer, Cham.

Romero, C., Romero, J. R., & Ventura, S. (2014). A survey on pre-processing educational data. In Educational data mining (pp. 29–64). Springer, Cham.

Sergis, S., & Sampson, D. G. (2017). Teaching and learning analytics to support teacher inquiry: A systematic literature review. Learning analytics: Fundaments, applications, and trends, 25–63.

Tsoni, R., Sakkopoulos, E., & Verykios, V. S. (2022a). Revealing Latent Student Traits in Distance Learning through SNA and PCA. Handbook on Intelligence Techniques in the Educational Process. Springer.

Tsoni, R., Sakkopoulos, E., Panagiotakopoulos, C. T., & Verykios, V. S. (2021). On the Equivalence Between Bimodal and Unimodal Students’ Collaboration Networks in Distance Learning. Journal of Intelligent Decision Technologies, 305–319.

Tsoni, R., Samaras, C., Paxinou, E., Panagiotakopoulos, C., & Verykios, V. S. (2019). From Analytics to Cognition: Expanding the Reach of Data in Learning. In Proc. of CSEDU (2) (pp. 458–465).

Tsoni, R., Panagiotakopoulos, C. T., & Verykios, V. S. (2022b). Revealing latent traits in the social behavior of distance learning students. Education and Information Technologies, 1–37.

Tsoni, R., Zorkadis, V., & Verykios, V. S. (2021). A data pipeline to preserve privacy in educational settings. In Proceedings of the 25th Pan-Hellenic Conference on Informatics. ACM.