Crunching Covid-19 news articles with your own local data lake (step by step)
In this article, I will show how to run the basic components of a Data Lake on your personal computer with minimal effort, thanks to Docker container technology (https://www.docker.com/). If you don’t have Docker installed yet, this is a good time to do it: a previous post on this blog (https://abxda.medium.com/geo-big-data-con-datos-censales-2de6250772a5) has instructions for installing it on a computer with Windows 10 Professional. Thanks to this technology, a single command in your terminal, plus a little patience, will get all the basic components of a mini Data Lake running on your personal computer. With that in place, you can systematically collect thousands of news articles using Apache Airflow (https://airflow.apache.org), building a semi-structured database. All the collected news will be stored in an object repository compatible with Amazon’s S3 technology, but running locally in your own Data Lake on free-software technology (https://min.io). We will then analyze that collected Big Data with PySpark from Jupyter Lab notebooks (https://jupyter.org), producing summaries and giving the data enough structure to load it into the PostgreSQL relational database (https://www.postgresql.org). Finally, we will connect a state-of-the-art Business Intelligence tool (https://superset.apache.org) to explore the results visually. Although it will not be covered in this tutorial, the Data Lake also includes a technology for building Data APIs, which will let us create custom data products; that, however, will be material for another tutorial 😃
What is a Data Lake?
Before continuing, it is worth establishing what is understood by a Data Lake (the references behind this definition are listed at the end of the article). The basic purpose of a Data Lake is to store all the data an organization produces, allowing its incorporation with the least possible friction: it accepts unmodeled, semi-structured, or even unstructured data such as text files and images, so the data is accessible for analysis as soon as it is ingested. The storage system used is therefore highly relevant: it must adapt dynamically to demands for space and be flexible about the types of objects it stores. There are several ways to implement these requirements for the Storage component; for this article I opted for MinIO (https://min.io/), an open-source, high-performance object store compatible with Amazon S3 that can be installed and operated locally. If, at some point in the evolution of your Data Lake, part of the storage needs to move to the cloud, you will not have to rewrite the code that consumes objects from the distributed storage servers: you simply replace the endpoint address, and the rest of the communication protocol remains unchanged.
In addition to the storage system, a Data Lake requires a systematic data collection strategy, known as Data Ingestion. In this article it is implemented with Apache Airflow (https://airflow.apache.org/), a Python-based workflow management platform that lets you schedule, launch, and monitor continuous processes for data collection, analysis, and management in general. For systems experts, it is like the Linux tool Cron (https://es.wikipedia.org/wiki/Cron_(Unix)) re-imagined for the 21st century. In the concrete implementation of our Data Lake, we additionally use an extended version of Googler (https://github.com/jarun/googler), a tool that can query Google and several of its services (Web Search, News, Videos) from the operating-system console, that is, without explicitly using a web browser. By combining Airflow and Googler, we automate both the querying and the storage of Google search results. The author of this article modified Googler (https://github.com/abxda/googler) to also download the full content of each news article, not just the summary, so the ingestion volume is much higher and there is more text to analyze.
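The combination is simple in concept: each scheduled Airflow task just shells out to the googler CLI and captures its JSON output. A minimal sketch in Python (the flags follow upstream googler’s documented options; the function names, and the exact flags accepted by the modified fork, are my own assumptions, not code from the repository):

```python
import subprocess


def build_googler_command(query: str, num_results: int = 10) -> list:
    """Build the argument list for a googler news search.

    -N (news), --json, and -n (result count) are documented upstream
    googler options; the modified fork may accept additional flags.
    """
    return ["googler", "-N", "--json", "-n", str(num_results), query]


def fetch_news(query: str) -> str:
    """Run googler and return its raw JSON output.

    Requires the googler executable on PATH; in the Data Lake this call
    would be wrapped in a scheduled Airflow task.
    """
    cmd = build_googler_command(query)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

The JSON that googler prints is what ends up stored, one file per day, in the object store described below.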
Next, Data Science techniques must be applied to generate value from the data collected in the storage system of our Data Lake. This leads to the next component, Processing. Here we use Apache Spark (https://spark.apache.org/) through its Python API (PySpark), complemented with Spark SQL to simplify the processing of the semi-structured information collected in JSON format. Most of the Big Data processing happens in this component, as does the application of Machine Learning techniques to extract value from the collected data. It is here that summaries are generated and the data is given structure, so that it can be placed in more traditional structured storage, such as the SQL database that is the next component of our Data Lake.
The results of the different analyses are transferred to the SQL database. We chose PostgreSQL for its robustness and the efficiency improvements of version 13. It is to this database engine that we connect the Data Delivery and Self-Service Business Intelligence components, the last pieces of our interpretation of a Data Lake.
Self-Service Business Intelligence is one of the outputs that allows exploration of the results of the Big Data analysis, as well as the creation of dashboards and attractive visualizations that support decision-making across the organization. Here we chose Apache Superset (https://superset.apache.org/), which enables interactive and visually appealing data exploration.
We also include a component for Data Delivery through REST APIs, using FastAPI (https://fastapi.tiangolo.com/). Although it is not used in this article, the infrastructure is ready for building custom data products based on the findings of our Data Lake; that will be covered in another article.
Finally, our Data Lake does not incorporate the Governance, Data Security, and Metadata Administration components. These components depend on the rules and technologies adopted by each organization, so in this version, we leave them out. However, they are part of the complete definition of a Data Lake.
The following image shows the components of the current implementation:
Collection of news articles about Covid-19 in Mexico
At the time of writing, it is March 2021 and we are living through a health emergency of global proportions, still with an uncertain outcome. It is therefore relevant to monitor the entire stream of news generated day by day, and even to collect news from past months to put current events in context. In this exercise, we will collect thousands of articles from the Mexico region, from January 1, 2020, up to the day you run your own collection. The search pattern used is “Covid Mexico”, and we will analyze the content of the news to generate summary views that let us explore the universe of information being generated. The collector can, of course, be adjusted to new dates and to any search pattern; for more details on the search parameters, see the Googler user manual: https://github.com/jarun/googler. This article ships a modified version that also downloads the full text of each news article to be analyzed. The original URLs are preserved so that findings can be properly cited.
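As a reference for the collection window, splitting the period into one search run per day can be sketched like this (illustrative Python, not code from the repository):

```python
from datetime import date, timedelta


def daily_slices(start: date, end: date):
    """Yield every date in [start, end]; each one becomes a collection run."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# Example: the first ten days of the collection window used in this article.
days = list(daily_slices(date(2020, 1, 1), date(2020, 1, 10)))
```

In the actual Data Lake, Airflow handles this scheduling; the sketch only shows why the collection naturally produces one file per day.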
Running our Data Lake
The implementation of this Data Lake was uploaded to GitHub, so you can access the Docker definitions for every component of this proposal. If you are a curious, self-directed learner, you can even modify it, adding or removing components as you like. To download the code you need Git installed on your computer; get the installer from the following link: https://git-scm.com/download/win by clicking on the link shown:
Proceed with the installation using all the default options. To confirm that Git installed correctly, open a PowerShell window and execute the following command:
Output similar to the following is expected:
Now that Git is installed, you can download all the definitions needed to run your Data Lake. First, change to the directory where the code will be downloaded; for example, on Windows it could be D:\
Finally, you can download the code as follows:
The next step requires Docker to be running on your machine. The previous article gave the instructions for installing Docker; this is the last chance to install it if you want to continue.
To verify that it is running, you should see the little whale in your computer’s icon bar. In my case, for example, it is not running yet!
The Docker whale is missing
In that case, you have to activate it: find it in the taskbar and open it:
It will take a while, but once it is finished, you will be able to identify that Docker is running and you will see its icon in the notification bar:
Now comes the command that builds and starts all the servers of our Data Lake. From the directory where we downloaded the code, execute the following instruction in PowerShell:
docker-compose up --build -d
The first time, it will take approximately an hour, depending on your Internet connection and the speed of your computer, because everything has to be downloaded and configured from scratch. It is a good time to have a coffee or watch an episode of your favorite series.
When you have the following output, you will know that it is finished:
Afterward, the following single line will be enough to start everything, and it will take only a few seconds, since everything is already installed on your computer.
docker-compose up -d
The previous line will be necessary the next time you want to start your Data Lake. At the moment everything is running. You can verify this by opening the Docker Dashboard:
There you will see the group of services running and ready to be used:
The next step is to use Apache Airflow to download news according to specific search criteria. Open the following link: http://localhost:7777/admin/ where you will find this web application:
The pre-programmed DAG that collects news needs its basic parameters set from a variables file that was downloaded with the rest of the code, so it must be imported as follows:
Before starting, we can visit the object storage server at the following address: http://localhost:9000/ and see that it is running, with no files at this time:
User: minio-access-key
Password: minio-secret-key
Upon entering, we confirm that it is empty. Airflow will fill it with the documents it will download in a few moments, as soon as we activate the collector.
Let’s go back to Airflow at http://localhost:7777/ and activate the collection:
After approximately 25 minutes, the tree view of the DAG we just activated will show the following:
Now we can visit our object store, where we will see all the files in a new bucket created by Airflow under the path /news/covid_mexico. This path contains the JSON files with the information of all the news that could be collected. To identify each file, the pattern YEAR-MONTH-DAY.json was used, indicating the day whose downloaded news each file contains. We will now explore the result of our collection.
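Since each object key ends in YEAR-MONTH-DAY.json, the collection date can be recovered directly from the key. A small illustrative helper (the example path mirrors the bucket layout described above; the function name is my own):

```python
import re
from datetime import date

# Matches keys ending in YEAR-MONTH-DAY.json, e.g. "2020-03-15.json".
KEY_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})\.json$")


def key_to_date(object_key: str):
    """Extract the collection date from an object key such as
    'news/covid_mexico/2020-03-15.json'; return None if it doesn't match."""
    m = KEY_PATTERN.search(object_key)
    if not m:
        return None
    year, month, day = (int(g) for g in m.groups())
    return date(year, month, day)
```

This naming convention is what later lets us build per-day summaries without opening every file.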
Now we are going to process the data with a pre-programmed Jupyter notebook. Open the following address on your computer: http://127.0.0.1:8888/lab?token=m1n1lake
and we go to the notebook as indicated in the following image:
The notebook includes a programming pattern that connects to the object storage and reads all the downloaded files; the following line shows the connection path:
We can explore the number of news articles downloaded, and even the schema with which they were saved in the JSON files:
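For reference, the connection boils down to pointing Spark’s s3a connector at the local MinIO service. A sketch of the idea, assuming the MinIO container is reachable as minio inside the Docker network (check the notebook for the exact endpoint, bucket, and credentials):

```python
from pyspark.sql import SparkSession

# s3a settings point Spark at the local MinIO service instead of AWS;
# the endpoint host name and bucket path below are assumptions.
spark = (
    SparkSession.builder
    .appName("covid-news")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio-access-key")
    .config("spark.hadoop.fs.s3a.secret.key", "minio-secret-key")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read every daily JSON file from the bucket created by Airflow.
df = spark.read.json("s3a://news/covid_mexico/*.json")
df.printSchema()
print(df.count())
```

Because s3a speaks the S3 protocol, this same code would work against Amazon S3 by changing only the endpoint, as discussed earlier.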
The notebook carries out some basic cleaning (the reader can, of course, adjust the process to taste). Here is a sample of the information with a new site field, which lets us see the origin of each news article:
Now that we have an adjusted and cleaned table, we can save the DataFrame to PostgreSQL and prepare to connect the data to a business intelligence tool like Apache Superset. The connection parameters are as follows:
It is important to clarify that the shared database and the shared user were created during the initialization process, which is why we can connect. The password is the well-known changeme1234, which was also saved in the environment variable SHARED_PASSWORD.
In the next line, we save the PySpark DataFrame to the PostgreSQL database, overwriting any existing table called tb_news_covid_mexico_date_text. This will be the first table loaded into our SQL engine.
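The write itself uses Spark’s JDBC data source. A sketch with the connection values quoted above (the host name postgres is an assumption based on Docker service naming; this is not the notebook’s verbatim code):

```python
# df is the cleaned PySpark DataFrame from the previous steps.
(
    df.write.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/shared")
    .option("dbtable", "tb_news_covid_mexico_date_text")
    .option("user", "shared")
    .option("password", "changeme1234")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")  # replaces the table if it already exists
    .save()
)
```

mode("overwrite") is what makes the notebook safe to re-run: each execution rebuilds the table from the latest collection.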
In the rest of the notebook, I selected the 15 most significant words from each article using the TF-IDF criterion, producing a table like the following, which is also uploaded to PostgreSQL. Inspiration for the TF-IDF calculation came from the following articles: https://sigdelta.com/blog/word-count-in-spark-with-a-pinch-of-tf-idf/ & https://sigdelta.com/blog/word-count-in-spark-with-a-pinch-of-tf-idf-continued/.
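To make the criterion concrete, here is the TF-IDF idea in plain Python over tokenized documents (a didactic sketch only; the notebook performs the equivalent computation with PySpark, following the articles linked above):

```python
import math
from collections import Counter


def top_tfidf_words(documents, k=15):
    """Return the k highest-TF-IDF words of each document.

    documents: a list of token lists (one per article, assumed non-empty).
    TF-IDF rewards words frequent in one article but rare across the corpus.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    results = []
    for doc in documents:
        term_freq = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
        results.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return results
```

Words that appear in every article (like “covid” in this corpus) get an IDF of zero, which is exactly why TF-IDF surfaces the distinctive vocabulary of each piece.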
Finally, we save the result in the PostgreSQL database, in the table called tb_news_covid_mexico_palabras_top_tfidf.
We can verify that the data was added correctly by opening the PostgreSQL manager, pgAdmin 4, at the following address: http://localhost:5050/. To connect, the user is firstname.lastname@example.org and the password is admin:
Once connected, we can create a connection to the database where we just loaded the tables with the basic analysis of the collected news. As mentioned above, the connection parameters are Database: shared; User: shared; Password: changeme1234; Host: postgres. Let’s see how to connect:
Now we are going to create visualizations of the data in Apache Superset. The tool is located at http://localhost:8088/login/ and the access credentials are user: admin, password: changeme1234.
To create visualizations, it is first necessary to register the database containing the information we want to explore; in our case, it is the well-known shared 😉
Now it will be necessary to register the tables with which we will be working:
The two tables generated with PySpark must be registered:
The two tables are now available for charting and dashboards.
First, we will make a filter, the control that lets us adjust the data being viewed. It is interesting to note that we always start from the table view, whether creating controls or charts. We will also create a dashboard called COVID-19 News.
Next, we will make a word cloud of the most significant words, which responds to whatever filter settings the user chooses while interacting with the dashboard.
Now we will do a count of the number of articles per news site.
Finally, we will add a daily news histogram, according to the analysis period:
With this, we have built a dashboard that lets us explore and analyze the news coverage of COVID over the last fourteen months.
In this article-tutorial, we built a Data Lake from scratch and collected thousands of news items from January 1, 2020, to March 2021. We analyzed the collected data by connecting to the S3 object repository of our implementation, and finally generated an interactive visualization of the results.
It only remains for you, the reader, to replicate the exercise and explore all the technologies applied here on your own. Share your screenshots and tag me on Twitter: @abxda
Thanks for reading me!
Inspired by the repository: https://github.com/kkiaune/emails-classification
Ref: Miloslavskaya, N., & Tolstoy, A. (2016). Big Data, Fast Data, and Data Lake Concepts. Procedia Computer Science, 88, 300–305. https://doi.org/10.1016/j.procs.2016.07.439
Ref: Mathis, C. (2017). Data Lakes. Datenbank-Spektrum, 17 (3), 289–293. https://doi.org/10.1007/s13222-017-0272-7
Ref: Khine, P. P., & Wang, Z. S. (2018). Data lake: a new ideology in big data era. ITM Web of Conferences, 17, 3025. https://doi.org/10.1051/itmconf/20181703025
Ref: Ravat, F., & Zhao, Y. (2019). Data Lakes: Trends and Perspectives. In Lecture Notes in Computer Science (pp. 304–313). Springer International Publishing. https://doi.org/10.1007/978-3-030-27615-7_23
The Airflow and Superset services may fail to start because of an end-of-line (CRLF) problem in the initialization files of the PostgreSQL database:
The fix is to change the line endings (from Windows CRLF to Unix LF); one way is with the Sublime Text editor (https://www.sublimetext.com/), as follows:
For our small Data Lake to pick up the change, run the following instructions in PowerShell:
# accept the removal of the containers by answering: (y)
docker volume rm mini-data-lake_postgres_volume
docker-compose up --build -d
With the indicated settings, our cluster should be functional.