How can web crawling/scraping and data analysis help grow your business?
You must have heard the news about how data analysis is impacting our lives. From the point of view of a business owner or an enterprise, analytical ability can even decide whether you are able to retain customers and sustain strong revenue growth. Think about Walmart: it built its latest search engine, Polaris, on statistical analysis, semantic analysis, and more. “We have improved our online purchasing system and successfully motivated more purchasing behaviors by 10% to 15%,” Walmart said proudly. In fact, not only Walmart but more and more start-ups and medium-sized enterprises are putting weight on user data to uncover the wealth of underlying information, which can help them make wiser business decisions.
Turning to the data analysis itself, many factors can influence customers’ desire to shop. Out of my own interest, I decided to analyze whether there is a correlation between users’ login frequency (Frequency) and their purchase quantity (Goods).
In what follows, I’ll explain how I collected the user data records and how I analyzed them.
1. Collect user data records
Your online user management system probably holds plenty of user records. However, we need to export them as a more structured data set and store them locally for further analysis. For most businesses and enterprises, crawling data from websites by writing code can be costly. Here, I’ll share how I crawl data from my online management system. I normally use Octoparse, an automatic web scraper/crawler designed for non-programmers: you can collect the target data with simple drag-and-click actions. For privacy reasons, I can’t show you directly how I crawl my own user management site, so I’ll use Rakuten.com as an example of how to crawl target data with this free web scraping tool. The operation interface is shown below. Now, let’s see how to crawl online data with Octoparse.
First, enter the target URL and let the web page load completely in the built-in browser. Then, build a loop list that includes all the blocks containing the target data fields, like the green dotted box shown above. Next, start capturing the data fields you need, such as Desc, Price, and Click frequency in this example; in my own case, I need the login frequency, the number of purchased goods, and the user ID. Last, make sure you have set up the pagination action correctly so the task flips to the next page automatically and returns a complete data set.
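For readers who do prefer code over a visual tool, the same idea of extracting fields from repeated page blocks can be sketched in plain Python. The markup below (classes `item`, `desc`, `price`) is hypothetical and stands in for one listing page; a real crawler would also fetch pages over HTTP and handle pagination.

```python
# Minimal sketch of block-by-block field extraction, using only the
# standard library. The class names "desc" and "price" are assumed
# markup for illustration, not Rakuten's real page structure.
from html.parser import HTMLParser

class ItemParser(HTMLParser):
    """Collects (desc, price) pairs from a simple product listing."""
    def __init__(self):
        super().__init__()
        self.items = []      # extracted (desc, price) tuples
        self._field = None   # field currently being read
        self._current = {}   # fields captured for the current block

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("desc", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 2:  # both fields captured
                self.items.append((self._current["desc"],
                                   self._current["price"]))
                self._current = {}

# Illustrative fragment standing in for one loaded listing page:
page = """
<div class="item"><span class="desc">USB cable</span><span class="price">$3.99</span></div>
<div class="item"><span class="desc">Mouse pad</span><span class="price">$5.49</span></div>
"""

parser = ItemParser()
parser.feed(page)
print(parser.items)  # [('USB cable', '$3.99'), ('Mouse pad', '$5.49')]
```

This is exactly the "loop list plus data fields" pattern Octoparse configures visually, which is why the click-based setup is usually faster for non-programmers.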
Once you have finished configuring the task, follow the instructions to the next step and select “Local Extraction”. You will then see the data extracted smoothly in the data extraction pane within a short time, as in the demo shown below.
Octoparse can export the extracted data in various formats: databases, Excel, TXT, HTML, and so on. Pick whichever format suits your needs.
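Whichever format you choose, the next step is loading the exported file into your analysis environment. Here is a minimal sketch, assuming a CSV export with the hypothetical columns `user_id`, `login_frequency`, and `goods_num`; an in-memory string stands in for the exported file.

```python
# Load an exported CSV into a list of typed records.
# The file contents and column names are assumptions for illustration.
import csv
import io

exported = io.StringIO(
    "user_id,login_frequency,goods_num\n"
    "1001,3,4\n"
    "1002,5,6\n"
)

reader = csv.DictReader(exported)
records = [(row["user_id"], int(row["login_frequency"]), int(row["goods_num"]))
           for row in reader]
print(records)  # [('1001', 3, 4), ('1002', 5, 6)]
```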
2. Just start your data analysis
Back to my experimental case: I exported the whole data set to Excel. Now I will dig into whether these two factors (login frequency and goods number) really intertwine. The data collected in Step 1 is shown in the table below (note: the table shows only part of the crawled data).
1 ) Scatter diagram analysis
With the crawled data, we can plot a scatter diagram to see whether the (login frequency, purchase number) points are distributed in any regular way. The final scatter diagram is shown below. From the distribution of purchase numbers, most points cluster at login frequencies between about 2 and 5; we could tentatively define these as high-quality users.
This suggests a scenario in which people whose login frequency falls between 2 and 5 show a higher inclination to purchase. Additionally, the red trend line suggests that, within this range, the higher the login frequency, the more products customers are willing to buy. However, this is just a subjective guess; we need to go further and test the hypothesis.
2 ) Statistical hypothesis testing analysis (P-value Approach)
Now, let’s test the presumption that there is an underlying correlation between users’ login frequency and their purchase quantity.
First, I assumed that the relevant login frequencies fall within [2, 5].
Next, by filtering out the featured login frequencies 2, 3, and 5, I can carry out the statistical hypothesis test.
To start, I drew a random sample of 22 records from the whole data set for the experiment, as shown in the Experiment table below.
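In code, the draw of 22 records might look like the sketch below; the full record list here is generated at random purely as a stand-in for the real data set.

```python
# Random sampling without replacement from the crawled records.
# all_records is fabricated stand-in data: (user_id, login_freq, goods_num).
import random

random.seed(42)  # fixed seed so the draw is repeatable
all_records = [(1000 + i, random.randint(1, 8), random.randint(0, 7))
               for i in range(200)]

sample = random.sample(all_records, 22)  # draw 22 distinct records
print(len(sample))
```

Sampling without replacement (as `random.sample` does) is what we want here, since each user record should appear at most once in the experiment table.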
Next, you can use MATLAB or any other available data analysis tool to run a one-way (single-factor) analysis of variance. The null hypothesis is that the three groups share the same mean purchase number, i.e. that any observed difference is due to sampling error alone. We set the significance level α, the probability of making a Type I error, to 0.05.
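To show what the tool computes under the hood, the one-way ANOVA can be sketched in plain Python: the F statistic is the ratio of between-group variance to within-group variance. The three groups below are made-up purchase counts for login frequencies 2, 3, and 5 (22 records in total); in practice MATLAB's anova1 or scipy.stats.f_oneway also returns the p-value to compare against α.

```python
# One-way ANOVA F statistic by hand, standard library only.
# Group values are fabricated for illustration, not the real sample.
from statistics import mean

groups = {
    2: [1, 2, 1, 3, 2, 2, 1],     # purchase counts at login frequency 2
    3: [3, 4, 3, 5, 4, 3, 4, 5],  # login frequency 3
    5: [5, 6, 5, 7, 6, 5, 6],     # login frequency 5
}

grand = mean(x for g in groups.values() for x in g)  # grand mean
k = len(groups)                                      # number of groups
n = sum(len(g) for g in groups.values())             # total observations

# Sum of squares between groups and within groups:
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
ss_within  = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

# F = (between-group mean square) / (within-group mean square)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_stat, 2))
```

A large F means the group averages differ far more than the scatter within each group would explain; the p-value is then the tail probability of that F under the F(k-1, n-k) distribution.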
The final result is shown below. The analysis of variance shows that the three groups differ considerably in their averages.
Since the p-value is less than α, we reject the null hypothesis (that the group differences are caused by sampling error alone) in favor of the alternative: the three groups genuinely differ. This supports the conclusion that users’ purchase quantity is associated with their login frequency, though strictly speaking the test establishes association rather than causation.
Thanks to the analysis above, I can pay more attention to target users with the identified login frequencies, be clearer about my goals and budget plan, and serve these high-quality users better.
If you’d like to learn more about getting started with business analysis, I’ve put together a list of tutorials for your reference:
- See more at the Octoparse Blog