Reflection on Hands-on Lab #1
In Hands-on Lab #1, we parse the raw data of tweets gathered over a three-month period, selecting the #1ReasonWhy tweets and re-formatting them into a specific JSON schema. This produces the clean data that we need for further data mining and analysis. This step is also the beginning stage of our COM601 semester project. The raw data was provided by our professor, purchased from GNIP and collected by TwitterGoggles.
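To make that re-formatting step concrete, here is a minimal sketch of the kind of filtering and re-mapping such a script performs. The raw field names ("body", "postedTime", "actor") follow the GNIP Activity Streams layout, and the output attributes are illustrative assumptions rather than our exact schema:

    import json

    def curate(raw_path, clean_path):
        """Keep only #1ReasonWhy tweets and re-map them to a simple clean schema."""
        clean = []
        with open(raw_path) as f:
            for line in f:                      # one raw JSON record per line
                tweet = json.loads(line)
                text = tweet.get("body", "")
                if "#1reasonwhy" in text.lower():
                    clean.append({
                        "id": tweet.get("id"),
                        "user": tweet.get("actor", {}).get("preferredUsername"),
                        "posted": tweet.get("postedTime"),
                        "text": text,
                    })
        with open(clean_path, "w") as out:
            json.dump(clean, out, indent=2)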
When I first heard about this topic, I was completely puzzled by what my team members were discussing, since I had no idea about tweets or tweet data at that time. I kept asking them questions such as what the tweet data looked like, what we could do with it, and why we chose this topic as our semester project. In the past I rarely used Twitter as a daily communication tool, so it was difficult for me to understand the structure and components of the raw tweet data at the very beginning. Thankfully, as the project progressed and with patient explanations from my teammates, I gradually grasped the significance and key points of our research project. I came to understand what we could curate from the raw data, how we could re-organize it, and how we could generate meaningful results from it by writing Python scripts.
My first job was to read the raw tweet data carefully and understand the meaning of every data attribute. Through reading and checking the raw data, I learned that the objectType value inside a tweet's object field differs depending on whether the record is an original tweet or a re-tweet. I also came to understand why a second object was nested inside the first object within the data of a single tweet, why these two objects had totally different sets of attribute values, and why there were several different ids in the data of one tweet. Understanding the components and structure of the tweet data clearly was crucial for the curation that followed, because it determined which attributes of the raw data we needed to include in our JSON schema for the clean data.
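As an illustration of that structure, the following sketch distinguishes the two cases, assuming the GNIP Activity Streams layout in which an original tweet's nested object has objectType "note" while a re-tweet's object is itself a full activity wrapping the original tweet; the exact attribute names in our data may differ:

    def is_retweet(tweet):
        """A re-tweet's nested object is itself an activity, not a plain note."""
        return tweet.get("object", {}).get("objectType") == "activity"

    def original_id(tweet):
        """Return the id of the underlying original tweet."""
        if is_retweet(tweet):
            return tweet["object"].get("id")    # the inner activity's own id
        return tweet.get("id")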
Apart from understanding the raw data in the JSON files, another challenge I encountered was coding in Python. As a Python novice, I spent a lot of time practicing and reviewing tutorials, but I was not yet proficient enough to write complex scripts by myself, so most of the Python code was written by other members of our team. My job was to understand and check the scripts they wrote and to offer my thoughts and suggestions for improving them. Along the way, I found some omissions and bugs, which helped me understand both our scripts and the project itself more deeply.
Here I want to express my great gratitude to my team members. We cooperate with each other very well; everyone wants to take on more tasks and is eager to learn more from this project. Because of their help, I acquired new knowledge and skills quickly and solved technical problems smoothly. Mike proposed our research topic, #1ReasonWhy, and explained his thinking about it at length. He also wrote the main part of the fetchdata function in the Python script. Xi, who has programming experience, provided the technical support that carried us through the whole process. She got our Python coding started, wrote the script that walks through all the raw data, and solved most of the bugs. Jodi is good at coordinating our meetings and discussions, and she wrote the part of the script that fetches data from the usermentions dictionary. Xi and Mike did most of the coding, while Jodi and I helped label the parsing code. All of us discussed the JSON schema format and the Python script for data curation. The script we wrote may not be perfect yet, but it will improve gradually as the project advances.
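For instance, the user-mention step Jodi worked on might look roughly like the sketch below; the twitter_entities / user_mentions path is an assumption based on how GNIP records usually embed Twitter entities, not a quotation of her actual code:

    def user_mentions(tweet):
        """Collect the screen names mentioned in one raw tweet record."""
        entities = tweet.get("twitter_entities", {})
        return [m.get("screen_name") for m in entities.get("user_mentions", [])]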
Through this collaborative data curation, I have learned a useful programming language, Python; how to re-organize the data structure of tweet records using the JSON format; and how to capture and curate data. I also became familiar with several useful data analysis tools and websites, especially Datahero, which can generate various kinds of infographics and present clear, meaningful visualizations of data. That will be beneficial for our next stage of research, once we have the clean data and statistics at hand.
In the next stage, I will analyze the "text" field of the clean tweet data curated in Hands-on Lab #1. I found Text Analytics 101, the recommended reading, very useful for figuring out the most frequently appearing themes in #1ReasonWhy tweets. I will also take more time to improve my Python and write more parts of the scripts for our project. Practice makes perfect, and it is very satisfying to write my own script and watch it return useful results.
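As a first pass at that theme analysis, a simple word-frequency count over the "text" field is probably where I will start. The stop-word list and the clean-data layout (a JSON array of objects with a "text" key, as sketched earlier) are assumptions:

    import json
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is",
                 "it", "i", "rt", "#1reasonwhy"}

    def top_themes(clean_path, n=20):
        """Count the most frequent words across all clean tweets."""
        counts = Counter()
        with open(clean_path) as f:
            tweets = json.load(f)
        for tweet in tweets:
            words = re.findall(r"[#@\w']+", tweet["text"].lower())
            counts.update(w for w in words if w not in STOPWORDS)
        return counts.most_common(n)

    print(top_themes("clean_tweets.json"))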
If I do another similar project, I will make a better plan and start learning the necessary programming languages and tools earlier. As Hands-on Lab #1 taught me, if I had learned and practiced Python earlier, I would have been more experienced in writing scripts and would have contributed more to this project. The coding and bug-fixing process has also made me more confident in using Python. Therefore, I should start as early as possible on the next Hands-on Lab and our remaining project tasks.