PPP Loan Fraud Detection Using Python
PPP Loan Fraud Detection: The aim of this project is to explore loan data from the Paycheck Protection Program (PPP), administered by the Small Business Administration (SBA), which provided relief to small and medium-sized businesses during the COVID-19 pandemic.
The main goal of this project will be to develop graphical visualizations of this data and apply anomaly detection methods to determine which loans are likely fraudulent.
The following are ideas on how to start with this project:
(a) Review background on fraud in the PPP loan program by using your preferred internet search tool, e.g. query “PPP loan fraud” in Google and read a few of the top news stories to develop an understanding of fraud issues.
(b) Download the full PPP loan dataset and data dictionary from the URL: https://data.sba.gov/dataset/ppp-foia/resource/aab… and review the fields described in the data dictionary. Since this is a very large dataset, you can use only one part of it (i.e. the public_150k_plus file, which covers loans of $150,000 and above).
(c) Summarize the content of this data by both providing tabular summaries and graphical visualizations. For example, using pandas, write a Python script to read all .csv files containing loan data and plot a histogram of loan amounts.
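As a starting point for step (c), the sketch below reads every CSV file matching a pattern and saves a histogram of loan amounts. It is only one possible approach. The column name CurrentApprovalAmount is an assumption that should be checked against the data dictionary, and the helper names load_loans and plot_amount_histogram are made up for illustration. Logarithmic bins are used because loan amounts are likely to span several orders of magnitude.

```python
import glob

import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumed column name; verify it against the data dictionary.
AMOUNT_COL = "CurrentApprovalAmount"


def load_loans(pattern):
    """Read every CSV file matching the glob pattern into one DataFrame."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no CSV files match {pattern!r}")
    frames = [pd.read_csv(path, low_memory=False) for path in paths]
    return pd.concat(frames, ignore_index=True)


def plot_amount_histogram(df, out_path, amount_col=AMOUNT_COL, bins=50):
    """Save a histogram of positive loan amounts using log-spaced bins.

    Returns the number of loans that were plotted.
    """
    amounts = pd.to_numeric(df[amount_col], errors="coerce").dropna()
    amounts = amounts[amounts > 0]
    edges = np.logspace(np.log10(amounts.min()), np.log10(amounts.max()), bins + 1)

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(amounts, bins=edges)
    ax.set_xscale("log")
    ax.set_xlabel("Loan amount (USD, log scale)")
    ax.set_ylabel("Number of loans")
    ax.set_title("Distribution of PPP loan amounts")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return len(amounts)
```

A typical call would be loans = load_loans("data/*.csv") followed by plot_amount_histogram(loans, "loan_amounts.png"), where the data/ folder is a hypothetical location for the downloaded files.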
(d) Some simple questions that you can quickly examine include:
the distribution of loan amounts approved under the PPP program
the states and industries that have received the most funding under the program
the top loan originators and their loan approval rate
the average loan amount and approval rate for different sectors and demographic groups
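The questions above map naturally onto pandas group-by summaries. The sketch below is a minimal illustration on a tiny made-up table, not real PPP data. The column names (BorrowerState, NAICSCode, OriginatingLender, CurrentApprovalAmount) are assumptions to verify against the data dictionary, and summarize_loans is a hypothetical helper. It reports loan counts and totals only: an approval rate would also require data on rejected applications, which this sketch assumes is not part of the downloaded file.

```python
import pandas as pd


def summarize_loans(df, amount_col="CurrentApprovalAmount"):
    """Return a dict of summary tables for state, industry sector and lender."""
    # Total funding per state, largest first.
    by_state = (
        df.groupby("BorrowerState")[amount_col].sum().sort_values(ascending=False)
    )

    # The first two digits of a six-digit NAICS code identify the sector.
    naics = pd.to_numeric(df["NAICSCode"], errors="coerce")
    sector = (naics // 10000).astype("Int64")
    by_sector = (
        df.assign(sector=sector)
        .groupby("sector")[amount_col]
        .agg(["count", "sum", "mean"])
        .sort_values("sum", ascending=False)
    )

    # Lenders ranked by number of loans originated.
    top_lenders = (
        df.groupby("OriginatingLender")[amount_col]
        .agg(loans="count", total="sum")
        .sort_values("loans", ascending=False)
        .head(10)
    )
    return {"by_state": by_state, "by_sector": by_sector, "top_lenders": top_lenders}


# Tiny made-up example, for illustration only.
demo = pd.DataFrame(
    {
        "BorrowerState": ["CA", "CA", "TX", "NY"],
        "NAICSCode": [722511, 722511, 236220, 541110],
        "OriginatingLender": ["Bank A", "Bank A", "Bank B", "Bank A"],
        "CurrentApprovalAmount": [200_000.0, 400_000.0, 300_000.0, 150_000.0],
    }
)
tables = summarize_loans(demo)
for name, table in tables.items():
    print(f"--- {name} ---")
    print(table)
```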
(e) Explore the use of more traditional unsupervised learning techniques, such as the anomaly detection methods documented at:
https://pyod.readthedocs.io/en/latest/
which can be supplemented with additional academic literature.
Longer-term questions and modeling topics that you would like to investigate include:
use unsupervised learning techniques, such as clustering or anomaly detection, to identify potentially fraudulent loans.
build a predictive model that can accurately identify fraudulent loans based on historical data.
(f) In addition, identify loans with a high potential for fraud by grouping loans into common categories and flagging outliers within each group. For example, for a fixed NAICS code, group together loans from a similar geographical region and identify outlier loans.
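One simple way to try this grouping idea is a robust z-score computed within each (NAICS code, state) group, based on the median and the median absolute deviation (MAD) of the log loan amount. This is a sketch under stated assumptions rather than a prescribed method: state is used as a stand-in for "geographical region", the column names are assumed, the 3.5 threshold and minimum group size are conventional but arbitrary choices, and the example table is made up.

```python
import numpy as np
import pandas as pd


def flag_group_outliers(
    df,
    group_cols=("NAICSCode", "BorrowerState"),  # assumed column names
    amount_col="CurrentApprovalAmount",         # assumed column name
    threshold=3.5,
    min_group_size=5,
):
    """Flag loans whose log amount is far from the median of their group."""
    out = df[df[amount_col] > 0].copy()
    out["log_amount"] = np.log10(out[amount_col])

    grouped = out.groupby(list(group_cols))["log_amount"]
    median = grouped.transform("median")
    mad = grouped.transform(lambda s: (s - s.median()).abs().median())
    size = grouped.transform("count")

    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    # for normally distributed data; groups with MAD == 0 get NaN (never flagged).
    out["robust_z"] = 0.6745 * (out["log_amount"] - median) / mad.replace(0, np.nan)
    out["is_outlier"] = (out["robust_z"].abs() > threshold) & (size >= min_group_size)
    return out


# Made-up example: one Texas restaurant loan is far larger than its peers.
demo = pd.DataFrame(
    {
        "NAICSCode": [722511] * 11,
        "BorrowerState": ["TX"] * 6 + ["CA"] * 5,
        "CurrentApprovalAmount": [
            150_000, 160_000, 170_000, 180_000, 190_000, 5_000_000,
            200_000, 210_000, 220_000, 230_000, 240_000,
        ],
    }
)
flagged = flag_group_outliers(demo)
print(flagged.loc[flagged["is_outlier"], ["NAICSCode", "BorrowerState", "CurrentApprovalAmount", "robust_z"]])
```

The median and MAD are used instead of the mean and standard deviation because a single extreme loan would otherwise inflate the very statistics used to detect it.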
Write Python code that follows the steps above, with proper documentation, and a detailed report covering all the steps taken to achieve the goal.