ZEB — Project
WebDev: Pascal Schiffers
AI: Nico Großkreuz
Data Science: Daniel Overesch, Carolin Bölling
Our project was carried out in cooperation with zeb, an international management consultancy for the financial services sector based in Münster. zeb provided the starting point: a rough idea for an application that gives users information on used car prices on the go. With help from our mentors, the project team, which at that point consisted of five members, worked out what such an application could look like and where each field of technology (AI, Data Science and WebDev) would come into play.
After consulting with the zeb team, we set a project scope and agreed on several milestones, which were to be reviewed in upcoming meetings.
The goal of the Web Development part was to connect the functionality of the different fields and guide the user through the process, from taking a picture of a car to getting a rough estimate of how they can finance it.
While working through the introduction to React in the Udemy course, a first rough sketch of the user interface was made by hand. It provided a good foundation for discussing where the interfaces to the AI and Data Science parts were and how the process could be made faster and easier for the user to understand.
After that, an interactive prototype was created using Adobe XD. It gave the team an idea of how the application would look in the end and made it possible to discuss the concrete content of the individual screens and the interaction concepts.
The biggest challenge was switching from completing small programming challenges in the Udemy course to actually creating an application from scratch. Luckily, there are enough examples and tutorials that, after a fair amount of research, leave almost no questions unanswered.
The AI part of this project has two tasks: on the one hand, recognizing the brand and model of a car from a picture; on the other hand, identifying stats/features from a picture of a used car's data sheet.
Recognizing the brand and model from a picture of a car
First, pictures of different cars are needed for training. Because this project is a proof of concept, we limit the model to recognizing four different car models: “Audi A3”, “BMW 1 Series”, “Renault Clio” and “VW Golf 7 Variant”. With the Firefox add-on “Google Image Downloader”, 200–400 pictures of each car were downloaded. They were then cleaned manually (e.g. pictures of the inside of the car were removed).
Next, we can start creating a neural network. The FastAI library is used (https://github.com/fastai/fastai). Because our curriculum is built on FastAI 0.7, we use this version. With the resulting model, we are able to predict the brand and model with an accuracy of more than 90%. It is worth mentioning that the pictures used for training and testing the model were not as easy to recognize as you might think. Try it yourself: Google a car and scroll to the bottom. You will realize how impressive this 90% accuracy actually is.
Identifying features from a data sheet of a used car
The first challenge is getting training data. Because a Google search does not provide us with enough data, we drove to used car dealers and took some pictures ourselves.
With the optical character recognition (OCR) library “pytesseract” we can read text from these images. The output is either a string or a data frame showing the size and position of the words. We limit extraction to six features: brand, model, year of construction, type of fuel, gearbox (automatic or manual) and mileage.
The last four features are easy to identify. Using the data frame, we can search for the feature names or synonyms of them. The value in the next row of the data frame is the value of that feature. We can identify these features with fairly good accuracy. The most challenging features to extract are brand and model. They often appear just as large words on the picture, without the feature names (“brand” and “model”) in front of them. Because the data sheets are so individual, we cannot derive rules for identifying the brand and model with satisfying accuracy.
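The lookup described above can be sketched roughly as follows. This is a simplified illustration, not our actual script: it works on a plain list of recognized words in reading order (such as the "text" column of the pytesseract data frame with empty entries removed), and the label table `FEATURE_LABELS` and the sample words are made-up assumptions.

```python
# Hypothetical label table: each feature maps to words that may precede its
# value on a data sheet. The labels are illustrative assumptions.
FEATURE_LABELS = {
    "year": {"year", "registration"},
    "fuel": {"fuel"},
    "gearbox": {"gearbox", "transmission"},
    "mileage": {"mileage", "odometer"},
}


def extract_features(words):
    """Return {feature: value}, taking the word that follows a known label.

    `words` is the OCR output in reading order with empty entries removed.
    """
    found = {}
    # The last word cannot be a label with a value after it, so skip it.
    for i, word in enumerate(words[:-1]):
        label = word.strip(":").lower()
        for feature, labels in FEATURE_LABELS.items():
            if label in labels and feature not in found:
                found[feature] = words[i + 1]
    return found


# Made-up OCR output of a data sheet.
ocr_words = ["GOLF", "Year:", "2016", "Fuel:", "Diesel",
             "Gearbox:", "Manual", "Mileage:", "45.000"]
features = extract_features(ocr_words)
print(features)
```

Note that "GOLF" is not picked up at all: without a label in front of it, this rule has nothing to anchor on, which is exactly the problem with brand and model.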
Data Science (Python)
The goal of my Data Science part was to output an average price for a desired car. The values of the six features (brand, model, …) of the desired car were already provided by the AI part. To calculate an average price, we needed enough data and training material. The idea was to build a regression model to predict the average price. We agreed to calculate the average price based on the cars offered on “AutoScout24”.
To extract the data from the “AutoScout24” site for the four models “Audi A3”, “BMW 1 Series”, “Renault Clio” and “VW Golf 7 Variant”, we first programmed a web crawler. To do so, we examined the individual elements of the “AutoScout24” page and searched the classes for the six required features and the price. The biggest challenge was cleaning unusable information out of the individual elements so that they could be converted into integers. To limit the amount of data, we focused on the first 21 pages of “AutoScout24” results for each car model. The crawler's extraction routine is repeated for each of the 21 pages. In this way, data from more than 1580 cars could be collected.
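The cleaning step could look roughly like the sketch below. The raw strings are made-up examples of what a crawler might find in a listing, not actual “AutoScout24” markup, and the helper names `to_int` and `parse_year` are hypothetical.

```python
import re


def to_int(raw):
    """Drop every non-digit character and return an int, or None if no digit is left.

    Assumes the value has no decimal places, e.g. "€ 12.490,-" or "45.000 km".
    """
    digits = re.sub(r"\D", "", raw)
    return int(digits) if digits else None


def parse_year(raw):
    """Return the first four-digit year (19xx or 20xx) found in the string, else None."""
    match = re.search(r"\b(?:19|20)\d{2}\b", raw)
    return int(match.group()) if match else None


# Made-up raw strings as they might appear in a listing.
print(to_int("€ 12.490,-"))
print(to_int("45.000 km"))
print(to_int("- km"))
print(parse_year("03/2016"))
```

The year gets its own helper because simply stripping non-digits from a date like "03/2016" would merge month and year into one number.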
Regression Training and Price Prediction
To forecast a price, we first had to determine the quantitative influence of the six independent variables on the variable “price”. After one-hot encoding the categorical variables (brand, model, gearbox and fuel), we used the “train_test_split” function to split the data into training and test data (70/30). We then trained a regression model on the training data and evaluated it on the test data. For this we used the RandomForestRegressor from scikit-learn. The trained model can then be used for the next step, price prediction.
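A minimal sketch of these steps, assuming pandas and scikit-learn are installed. The ten cars below are invented for illustration (the real data came from the crawler), and the hyperparameters are assumptions rather than the values we actually used.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Invented sample data; column names are illustrative.
cars = pd.DataFrame({
    "brand": ["Audi", "BMW", "Renault", "VW", "Audi",
              "BMW", "Renault", "VW", "Audi", "VW"],
    "model": ["A3", "1 Series", "Clio", "Golf 7 Variant", "A3",
              "1 Series", "Clio", "Golf 7 Variant", "A3", "Golf 7 Variant"],
    "fuel": ["Diesel", "Petrol", "Petrol", "Diesel", "Petrol",
             "Diesel", "Petrol", "Petrol", "Diesel", "Diesel"],
    "gearbox": ["Manual", "Automatic", "Manual", "Manual", "Automatic",
                "Manual", "Manual", "Automatic", "Manual", "Automatic"],
    "year": [2016, 2017, 2015, 2018, 2019, 2014, 2018, 2017, 2015, 2019],
    "mileage": [60000, 45000, 80000, 30000, 15000,
                110000, 25000, 50000, 95000, 20000],
    "price": [15500, 18900, 7900, 19500, 24900,
              11500, 11900, 17900, 12900, 22500],
})

# One-hot encode the categorical variables; year and mileage stay numeric.
X = pd.get_dummies(cars.drop(columns="price"),
                   columns=["brand", "model", "fuel", "gearbox"])
y = cars["price"]

# 70/30 split into training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))

# A new car must be encoded with exactly the same columns as the training data.
new_car = pd.DataFrame([{
    "brand": "VW", "model": "Golf 7 Variant", "fuel": "Diesel",
    "gearbox": "Manual", "year": 2017, "mileage": 55000,
}])
new_X = pd.get_dummies(new_car).reindex(columns=X.columns, fill_value=0)
predicted_price = model.predict(new_X)[0]
print("Predicted price:", round(predicted_price))
```

With so few rows the score itself is meaningless; the point is only the order of the steps and the `reindex` call, which fills in the one-hot columns that a single new car does not produce on its own.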
My Data Science part covers the analysis of the account transaction data of a dummy customer. Its purpose is to calculate the customer’s ability to repay a loan, so that the customer can be offered a financing option or a savings plan for the desired car.
We received the account transaction data via the Ahoi sandbox, where different account transactions can be created for demo customers. The virtual account data can be accessed via an API. One option is to call the API with the requests library, requesting the various tokens from the sandbox that allow access to the virtual bank. Since this access has to be built like a real bank interface and is therefore very complex, there is an alternative solution via an explorer. The transactions were output as a JSON file. The JSON file contains the individual customer transactions, i.e. all information that can also be seen on an account statement (value date, booking date, ID, account number, recipient, amount, currency, purpose, etc.).
Using the json library, it is possible to read a JSON file, which is structured like a dictionary, and save it as a DataFrame. Since some parts of the JSON file are nested objects, they must first be normalized. The nested objects included amount and currency, so without normalization they are displayed in the DataFrame cell as a dictionary. After normalization, the amount and the currency each get their own column in the DataFrame.
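The effect of normalization can be illustrated with a small sketch, assuming pandas 1.0 or newer for `pd.json_normalize`. The two transactions and their field names are invented for illustration and are not the sandbox's actual schema.

```python
import json

import pandas as pd

# Invented transactions; the field names are assumptions for illustration.
raw = """
[
  {"id": 1, "bookingDate": "2019-01-25T00:00:00Z", "purpose": "Salary",
   "amount": {"value": 2500, "currency": "EUR"}},
  {"id": 2, "bookingDate": "2019-01-28T00:00:00Z", "purpose": "Rent",
   "amount": {"value": -800, "currency": "EUR"}}
]
"""
transactions = json.loads(raw)

# Without normalization, the nested object ends up as a dict inside one cell.
nested = pd.DataFrame(transactions)
print(type(nested["amount"].iloc[0]))

# With normalization, each nested key becomes its own column.
flat = pd.json_normalize(transactions)
print(flat.columns.tolist())
```

By default the new column names join the outer and inner key with a dot, so the amount lands in `amount.value` and the currency in `amount.currency`.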
Since only the booking date and the amount are required for a simple calculation, these are selected with .iloc. The booking date is still in the Java SimpleDateFormat pattern (yyyy-MM-dd'T'HH:mm:ssZ) and is converted into a regular date by pandas. It is then possible to group the entries by month and calculate the monthly sum. So that only complete months are included, the first and last months are deleted, because there is a risk that, for example, the salary of the dummy customer has not yet been received, which would introduce a bias. In the next step, the moving average is determined using a user-defined function. The moving average can also be plotted against the monthly amount. With the moving average, the user's input and the average price of the car, we are able to calculate the annuity loan.
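The calculation chain can be sketched as follows. To keep the example free of dependencies it uses plain Python instead of pandas, and the transactions, window size, interest rate and term are all invented assumptions. The annuity function uses the standard formula, payment = P · r / (1 − (1 + r)^−n), with monthly rate r and n monthly payments.

```python
from collections import defaultdict
from datetime import datetime


def monthly_sums(transactions):
    """Sum amounts per (year, month) and drop the first and last month."""
    totals = defaultdict(float)
    for booking_date, amount in transactions:
        d = datetime.strptime(booking_date, "%Y-%m-%dT%H:%M:%S%z")
        totals[(d.year, d.month)] += amount
    months = sorted(totals)
    return [totals[m] for m in months[1:-1]]


def moving_average(values, window):
    """Simple moving average over the last `window` values."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def annuity_payment(principal, annual_rate, months):
    """Constant monthly payment of an annuity loan."""
    r = annual_rate / 12
    if r == 0:
        return principal / months
    return principal * r / (1 - (1 + r) ** -months)


# Invented (booking date, amount) pairs; January and May are incomplete months.
transactions = [
    ("2019-01-28T00:00:00+0100", -300),
    ("2019-02-01T00:00:00+0100", 2500), ("2019-02-15T00:00:00+0100", -1900),
    ("2019-03-01T00:00:00+0100", 2500), ("2019-03-15T00:00:00+0100", -2200),
    ("2019-04-01T00:00:00+0200", 2500), ("2019-04-15T00:00:00+0200", -1600),
    ("2019-05-02T00:00:00+0200", 2500),
]

sums = monthly_sums(transactions)
averages = moving_average(sums, window=2)
print(sums)      # [600.0, 300.0, 900.0]
print(averages)  # [450.0, 600.0]

# Assumed loan terms: 10,000 borrowed at 6% per year over 12 months.
payment = annuity_payment(10_000, annual_rate=0.06, months=12)
print(round(payment, 2))
```

Comparing the monthly payment with the latest moving average gives a first indication of whether the customer could carry the loan.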
The biggest challenge in this project was the coordination between the individual technology areas. As ours was the only project that covered all three areas (AI, WebDev and Data Science), it was particularly challenging to divide the overall project into individual task areas. Thanks to the frequent meetings with the zeb mentors, we were able to stay up to date and brief each other on the latest problems.
For the time being, we set ourselves the goal that each technology part should generate the output needed for the overall project. We managed to do this, so at this point the result consists of a web application that handles its own data between the screens but does not yet communicate with the scripts from the Data Science and AI parts. We’ll try to solve this during the next month with help from the TechLabs mentors, so that we’ll have a fully working application that sends input data to the Data Science and AI parts and receives and shows their results in the UI.