D.I.Y. Fraud Analysis in Elections

Adi Pradana Yuda Purnomo
6 min read · Mar 27, 2024


In this article, the fraud analysis will be implemented using Python, based on the description of fraud, the data learning method, and the fraud composition in the previous article: https://medium.com/@adi_pradana14/another-way-of-doing-fraud-analysis-in-elections-02801e8f0085

Fraud Analysis

Before doing the fraud analysis yourself, a specification for the implementation is needed. Python is chosen because of its learning library, scikit-learn. Both Python and scikit-learn are open source and free.

scikit-learn

For the learning case, a framework is needed to structure the fraud analysis. It defines, step by step, how to understand the business, explore the data, build a model for learning, and evaluate the result. The chosen framework is CRISP-DM.

Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000).

The deployment will be reliable when it is grounded in the business and in data whose patterns were found through learning with the model.

Business Understanding

The data can be loaded from a manageable data source, as in the previous article: https://medium.com/@adi_pradana14/data-marketing-management-960532517b6f .

The 2020 United States presidential election, the 59th quadrennial presidential election, was held on Tuesday, November 3, 2020. To win the election, a candidate needs 270 out of 538 electoral votes. A good sign that a candidate is doing well is winning states that aren’t expected to go their way. This dataset contains county-level data from the 2020 US election.

Every region in the US has many registered voters. When the election is over, the electoral commission receives the reported votes from all regions. These two figures can be chosen as the parameters for the fraud analysis.

Data Understanding

Digging deeper into the data is needed to uncover the structure of the problem in the business and in the available data, and then to match them to the data mining process. In this case, the data consists of county data and presidential candidate data in CSV format.

There are two datasets used for the fraud analysis: president_county.csv and president_county_candidate.csv.

president_county.csv contains general information about reported votes for the presidential race by county:

  • state : The name of the state.
  • county : The name of the county.
  • registered_voters : The number of voters registered on the list of 2020 US election voters.

president_county_candidate.csv contains detailed information about candidate votes for the presidential race by county:

  • state : The name of the state.
  • county : The name of the county.
  • candidate : The name of the presidential candidate.
  • party : The party supporting the presidential candidate.
  • reporting_votes : The number of votes reported to the US Electoral Commission.

The two datasets will be joined on the state and county fields.

Joined dataset
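
A minimal sketch of the join in pandas; the file names come from the Kaggle dataset and the column names from the descriptions above:

```python
import pandas as pd

# Load the two datasets described above.
county = pd.read_csv("president_county.csv")
candidate = pd.read_csv("president_county_candidate.csv")

# Join the datasets on the state and county fields.
joined = county.merge(candidate, on=["state", "county"], how="inner")
print(joined.head())
```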

Based on this understanding, registered_voters and reporting_votes can be compared to obtain a suspicion scale. Supervised learning can be implemented in this case.

Registered voters vs Reporting Votes

There are 7 voting places among the 11 suspicious cases.
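
A minimal sketch of how this comparison could be plotted with matplotlib, reusing the joined dataset from above; points above the red y = x line report more votes than registered voters, which is the suspicious pattern:

```python
import matplotlib.pyplot as plt

# Sum candidate votes per county so each point is one voting place.
per_county = (
    joined.groupby(["state", "county"])
    .agg(registered_voters=("registered_voters", "first"),
         reporting_votes=("reporting_votes", "sum"))
    .reset_index()
)

# Compare registered voters with reported votes for each voting place.
plt.scatter(per_county["registered_voters"], per_county["reporting_votes"], s=8)
plt.axline((0, 0), slope=1, color="red", linestyle="--")
plt.xlabel("registered_voters")
plt.ylabel("reporting_votes")
plt.show()
```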

Data Preparation / Data Preprocessing

Usually the data is still invalid and incomplete, so the data preparation phase consists of data cleansing and, if necessary, converting data to different types. Sometimes numeric values must be normalized.

For data mining, the dataset requires preprocessing through data cleaning and data transformation: evaluating data quality, cleaning the raw data, and repairing lost data. The following problems usually occur in databases:

  • Data is too old or redundant, so it is invalid;
  • Missing data values;
  • Data does not fit the data mining model;
  • Data values are consistently irregular or do not make sense.

These are the steps of data preprocessing (a pandas sketch follows the list):

  • Deal with missing data. Replace missing data with constants: each missing numeric value is replaced with 0, and each missing text value with the label that matches its data category. Alternatively, missing numeric values can be replaced with the average value;
  • Identify misclassifications. For example, a debtor’s home city might be classified as Jakarta, Palangkaraya, DKI Jakarta, Central Kalimantan, or Padang. Jakarta and DKI Jakarta mean the same thing, so they can be summarized into one classification, DKI Jakarta, and their values summed. Palangkaraya is part of Central Kalimantan, so it can be summarized as Central Kalimantan. Identifying misclassifications means combining classifications that mean the same thing and adding their values;
  • Eliminate duplicate data records. In a database, records sometimes contain copied data, which makes them duplicates and inflates the data values in some records. If a record is duplicated, only one copy must be kept. For example, if a contract ID is duplicated, the rows sharing that contract ID are deleted until one remains.
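
A minimal pandas sketch of these three steps; the column names are taken from the dataset descriptions above, and the city mapping mirrors the Jakarta example on a hypothetical column:

```python
import pandas as pd

df = pd.read_csv("president_county_candidate.csv")

# Deal with missing data: a constant 0 for numeric columns and a
# constant label for text columns (the mean is an alternative).
df["reporting_votes"] = df["reporting_votes"].fillna(0)
df["party"] = df["party"].fillna("UNKNOWN")

# Identify misclassifications: merge labels that mean the same thing,
# as in the Jakarta / DKI Jakarta example (hypothetical column).
# df["city"] = df["city"].replace({"Jakarta": "DKI Jakarta"})

# Eliminate duplicate records, keeping only one copy of each.
df = df.drop_duplicates()
```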

The dataset will be adapted to the business needs and the chosen learning method. Because supervised learning is used, registered_voters and reporting_votes must be compared to produce a flag based on the suspicion scale. The suspicious flag will be defined based on voter_turnout_rate.

Suspicious rate

The threshold is defined as 1.0, so the suspicious flag (suspicious_counties) takes the value 1 when the suspicious rate (voter_turnout_rate) is more than 1.0.

Suspicious flag
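
A minimal sketch of the suspicious rate and flag, reusing the per-county aggregation from the scatter sketch above:

```python
# Suspicious rate: reported votes divided by registered voters.
per_county["voter_turnout_rate"] = (
    per_county["reporting_votes"] / per_county["registered_voters"]
)

# Suspicious flag: 1 when the rate exceeds the 1.0 threshold.
per_county["suspicious_counties"] = (
    per_county["voter_turnout_rate"] > 1.0
).astype(int)
```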

Modeling

Modeling produces a model: a pattern that captures order in the data.

X and y axis

Because supervised learning is used, the model will be either a classification or a regression model. In this case, classification is used to implement the supervised learning. The chosen model is RandomForestClassifier().

Modeling
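
A minimal sketch of the modeling step with scikit-learn, using the two compared vote columns as features and the suspicious flag as the label; the split ratio and random_state are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X: the compared vote counts; y: the suspicious flag defined above.
X = per_county[["registered_voters", "reporting_votes"]]
y = per_county["suspicious_counties"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```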

A model is chosen when it achieves the higher accuracy value.

Evaluate the model

The accuracy value is 1.0, which means the predictions on the testing data match the labels at a rate of 100%. An accuracy of 0 means the classifier always predicts the wrong label, whereas an accuracy of 1, or 100%, means that it always predicts the correct label.

Evaluation

The purpose of the evaluation phase is to carefully assess and test the validity of the data mining results. The evaluation phase also helps ensure that the model meets the business objectives. Indeed, the main purpose of data science for business is to support decision making, and the process starts by focusing on the problem you want to solve. Because classification modeling is used, it can be evaluated using a confusion matrix.
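
A minimal sketch of that evaluation, reusing the fitted model and test split from the modeling step:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Rows are actual labels, columns are predicted labels.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)

# Visualize the matrix; most counts should land in the
# not-suspicious / not-suspicious cell.
ConfusionMatrixDisplay(cm, display_labels=["not suspicious", "suspicious"]).plot()
plt.show()
```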

Deployment

In the deployment phase, the results of data mining are actually used to support decision making. For example, for online advertising needs, a system is used that automatically builds and tests the model when the ad is displayed.

Based on the majority of cases falling in the true negative cell of the confusion matrix (actual and predicted both not suspicious), it is concluded that the majority of the 2020 US election is not suspicious. Based on that finding, the statement can be announced:

The 2020 US election is, in the majority, not suspicious.

Dataset:

https://www.kaggle.com/datasets/unanimad/us-election-2020

Source:

Han, J., Pei, J., & Tong, H. (2022). Data Mining: Concepts and Techniques. Morgan Kaufmann.

Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media.

Provost, F., & Fawcett, T. (2013). Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking. O’Reilly Media.

Shearer, C. (2000). The CRISP-DM Model: The New Blueprint for Data Mining. Journal of Data Warehousing, 5(4), 13–22.
