It Is Not by Big Data, but Back to the Fundamentals
Nowadays, when almost every aspect of our lives is digitized and electronic data is readily available, Big Data is commonly believed to be versatile and powerful for analysis and forecasting. However, it may NOT be applicable in financial forensics. Big Data offers analysts the 3Vs: a large volume of information, a wide variety of data types, and a high velocity of data flow, but all of these data can be manipulated or even faked. A recent case illustrates how financial forensic scientists adopted a fundamental, even primitive, data collection method to investigate the reliability and accuracy of a company's data.
Badkar & Henderson of the Financial Times (2020) reported that ‘a US-listed Chinese education company is facing an investigation by the US securities regulator following allegations from short-sellers that it faked sales.’ The allegation stemmed mainly from a report prepared by a research company in May 2020. This is the second such case, following an earlier one in which a coffee chain was delisted by the Nasdaq stock exchange after its ‘disclosure that RMB2.2bn ($322m) worth of sales last year had been “fabricated” and some costs and expenses were “substantially inflated”’.
How did they discover the alleged “fabricated data”? NOT by means of any Big Data methods or Machine Learning technologies, but by the most fundamental approach to data collection. According to an Anonymous (2019) report, they ‘mobilized 92 full-time and 1,418 part-time staff on the ground to run surveillance and successfully recorded store traffic for 981 store-days covering 100% of the operating hours of 620 stores.’ (p.4) Such an approach not only involves a lot of manpower to cover stores located in 38 cities over a period of time, but also relies on the most primitive method of data collection: directly counting the foot traffic of each store, with video recorded from store open to store close, 11.5 hours per day on average.
Their counting model is as follows:
No. of orders per store per day = No. of customers picking up products + No. of paper bags picked up by delivery personnel … (1)
No. of items per store per day = No. of orders per store per day × No. of items per order … (2)
The No. of customers picking up products and the No. of paper bags picked up by delivery personnel can both be counted directly from the videos. The No. of items per order is assumed to be 1.14. (p.15) The derivation of this figure is somewhat tedious:
They ‘gathered 25,843 customer receipts from 10,119 customers in 2,213 stores in 45 cities. The 25,843 receipts indicate 1.08 and 1.75 items per order for pick-ups and delivery orders respectively or blended 1.14.’ (p.20)
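The report does not state the split between pick-up and delivery orders, but the blended 1.14 figure implies it, and the counting model in equations (1) and (2) is then a one-line calculation. The sketch below, with illustrative store counts of my own (the 200 pick-ups and 30 delivery bags are hypothetical), shows both:

```python
# Known from the receipts sample: 1.08 items per pick-up order,
# 1.75 items per delivery order, blended 1.14 (pp. 15, 20).
pickup_items = 1.08
delivery_items = 1.75
blended = 1.14

# The blended average implies the pick-up share w of all orders:
# w * 1.08 + (1 - w) * 1.75 = 1.14  =>  w ≈ 91%
w = (delivery_items - blended) / (delivery_items - pickup_items)
print(f"implied pick-up share of orders: {w:.1%}")  # 91.0%

def items_per_store_day(pickups, delivery_bags, items_per_order=1.14):
    """Eq. (1): orders = customer pick-ups + delivery paper bags.
    Eq. (2): items = orders * items per order."""
    orders = pickups + delivery_bags
    return orders, orders * items_per_order

# Hypothetical counts for one store-day:
orders, items = items_per_store_day(pickups=200, delivery_bags=30)
print(orders, round(items, 1))  # 230 262.2
```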
Yet the trickiest technique is tracking the No. of orders per store per day. ‘As all orders are placed and paid online and picked up offline, when an order is placed, a three-digit pick-up number and a QR code will be generated to facilitate the in store pick up. … the three-digit pick-up number appears sequential within each store in a day and shared by both pick-up and delivery orders’ (p.16)
But how could the investigators know the three-digit numbers of each store on each day?
This is one of the most critical methodologies in this study — [they] ‘placed one order each at the beginning and the end of a store’s operating hour to get the online order count for the day.’ (p.17)
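Because the pick-up numbers run sequentially within a store-day, the two numbers drawn by the investigators' own opening and closing orders bound the day's order count. A minimal sketch of this trick (assuming no counter reset during the day, which the report's method implies):

```python
def orders_between(first_pickup_no: int, last_pickup_no: int) -> int:
    """Orders placed from the opening order through the closing order,
    inclusive, given sequential pick-up numbers within a store-day."""
    return last_pickup_no - first_pickup_no + 1

# e.g. the morning order draws number 3 and the closing order draws 231:
print(orders_between(3, 231))  # 229 orders that day
```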
Besides tracking the number of orders, they also compared the actual and the claimed average selling price per item. The 25,843 receipts they gathered record the selling price of each item, so they serve as a sample for estimating the average selling price per item. The report claimed that the ‘25,843 receipts indicate … 12.3% inflation versus the reported case.’ (p.24) However, it does not explain in detail how the receipts were collected or how selection bias was avoided.
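The arithmetic behind the inflation claim is a simple ratio of the reported average selling price to the receipt-based estimate. The prices below are hypothetical (the report's underlying averages are not reproduced here) and are chosen only to illustrate how a 12.3% figure would arise:

```python
# Hypothetical average selling prices per item:
receipt_asp = 10.00    # estimated from the receipts sample (assumed)
reported_asp = 11.23   # company-reported figure (assumed)

# "Inflation" of the reported price over the observed price:
inflation = reported_asp / receipt_asp - 1
print(f"reported vs. observed price inflation: {inflation:.1%}")  # 12.3%
```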
The above discussion aims to show the importance of a good research method for data forensics. It does not necessarily require artificial intelligence or Big Data techniques. Sometimes, primitive and down-to-earth methods are the most powerful, especially when the secondary data is not trustworthy. No matter how intelligent the AI may be, the results of Big Data analysis are subject to the reliability of the data. Garbage-in, garbage-out; fake-in, fake-out!
However, it also indicates how expensive a seemingly simple research project can be. For example, just one small part of the above study, the video recording of stores, required 981 store-days × 11.5 hours per day of manpower, i.e. 11,281.5 store-hours. Even at an hourly rate of US$5 for part-time researchers, that comes to US$56,407.50!
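The back-of-envelope cost above can be checked in two lines (the US$5 hourly rate is the text's own assumption):

```python
store_days = 981        # store-days of video recorded (p.4)
hours_per_day = 11.5    # average operating hours per store per day
hourly_rate_usd = 5.0   # assumed part-time researcher rate

store_hours = store_days * hours_per_day
labour_cost = store_hours * hourly_rate_usd
print(store_hours, labour_cost)  # 11281.5 56407.5
```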
This primitive method of direct observation may sound low-tech and labor-intensive, but a well-designed use of direct observation can be powerful in scientific research. I tried a similar method ten years ago in my study of the buyers-to-shoppers ratio of shopping malls (Yiu & Ng, 2010). Of course, I did not have the resources to conduct such large-scale research, but the basic approach is more or less the same.
The buyers-to-shoppers ratio has long been used to assess the retail performance of shopping malls, but it is normally identified by surveys and questionnaires, which depend very much on the trustworthiness of the responses. To find out the actual buyers-to-shoppers ratio, I employed a team of part-time students to carry out direct counting of the numbers of buyers and shoppers in shopping malls.
Altogether, we collected 810 observations (shoppers), 540 of them during weekdays and the remaining 270 during a weekend. In other words, our 3 × 3 × 3 data are actual observations of the buyers-to-shoppers ratio at 3 types of chain shop, in 3 shopping malls, on 3 days. The information recorded in each observation includes (1) whether the shopper buys; (2) sex and age group of the shopper (estimated by the enumerators); (3) how long the shopper stays in the shop; (4) type of shop; (5) time (weekday or weekend); and (6) place (mall).
Table 1 shows the buyers-to-shoppers ratios directly observed at the malls. First, they are much smaller than those obtained from questionnaires in the same malls. For example, over 53% and 43% of respondents on average said they had bought some apparel and AV/electrical products, but only about 26% and 16% of shoppers, respectively, were observed to be buyers. These results raise concerns over the use of questionnaires and interviews in buyers-to-shoppers ratio studies, and in retail studies in general.
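The observed ratio is nothing more than counted buyers over counted shoppers. The counts below are hypothetical (Table 1 reports only the resulting percentages); they are sized so the output echoes the ~26% apparel figure:

```python
def buyers_to_shoppers(buyers: int, shoppers: int) -> float:
    """Directly observed ratio: counted buyers / counted shoppers."""
    return buyers / shoppers

# Hypothetical counts for one shop type (e.g. apparel) across the malls:
ratio = buyers_to_shoppers(buyers=70, shoppers=270)
print(f"{ratio:.0%}")  # 26%
```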
Similar to the study by Anonymous (2019), our students encountered some difficulties in the direct counting process. For example, they were stopped by some management staff, even though they were just standing there and observing. We are grateful for the students’ persistence in finishing the data collection process. The situation in the Anonymous (2019) case seems to have been even worse; the report notes ‘the 851 store-days that we visited but failed to record an entire day’s video, reasons including execution failure — asked out by Luckin staff, equipment crash etc. or quality control failure, mostly due to more than 10 minutes of footage missing for an entire day. The failed store-days are not included in the data analysis.’
References

Badkar, M. & Henderson, R. (2020). SEC probes US-listed Chinese education company GSX Techedu, Financial Times, Sep 3. https://www.ft.com/content/42ce7af3-73fc-43a5-827e-a362beb9bce0

Anonymous (2019). Luckin Coffee: Fraud + Fundamentally Broken Business, Muddy Waters Research, Twitter, Feb 1. https://twitter.com/muddywatersre/status/1223274746017722371?lang=en, and the report at https://drive.google.com/file/d/1LKOYMpXVo1ssbWQx8j4G3-strg6mpQ7F/view

Yiu, C.Y. & Ng, H.C. (2010). Buyers-to-shoppers ratio of shopping malls: A probit study in Hong Kong, Journal of Retailing and Consumer Services, 17(5), 349–354. https://doi.org/10.1016/j.jretconser.2010.03.016