Your group has decided to launch a startup and pursue entrepreneurship in peer-to-peer lending. You have identified your business platform: Lending Club (https://www.lendingclub.com/), the largest peer-to-peer online marketplace bringing together borrowers and lenders. You will build a startup institutional investment company on this platform (see Lending Club's description of an investment company: https://www.lendingclub.com/investing/institutional/team) and use it to identify a segment of borrowers. However, you do not have enough funding of your own, so you plan to attract venture funding (NOT your mortgage, it is too risky :)) to provide your startup funds.
Here is the list of documents you need to submit for grading: an improved IPython notebook (your Python code and its output) that processes the historical lending data and predicts whether a loan is in default or in good standing. (with the following due date)
Please note that you start with a Python notebook that partially works and requires you to fill in the missing Python statements. The data and the Python notebook are available on the Google shared drive (https://drive.google.com/drive/u/0/folders/11sC-Jq7YNUfL3E5c7KxNe9gDz0We6qdX). The example notebook is in the Note-Books subdirectory (Note-Books/Project3-Notebook.ipynb), and the input data file is input/accepted-200000.csv. The directory also contains more (larger) data. You do not need to reimplement the notebook. Instead, you will add machine learning algorithms (such as neural networks, support vector machines, or a naive Bayes model), modify the data preprocessing procedures, and tune the parameters. Your technical objective is to improve the performance metric; the notebook uses the area under the receiver operating characteristic curve (AUROC). The more you improve the AUROC in your customized notebook, the higher your score on the technical component of this project. For reference, the test-set AUROC of the baseline method is around 0.72.
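To make the metric concrete, here is a minimal sketch of what AUROC measures: the probability that a randomly chosen positive example (say, a defaulted loan) receives a higher predicted score than a randomly chosen negative one, with ties counted as one half. The function name `auroc` and the toy `labels` and `scores` are made up for illustration and are not taken from the project notebook.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels (1 = positive class).
    scores: iterable of predicted scores, higher = more likely positive.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two non-defaults (0) and two defaults (1).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75: three of the four pairs are ranked correctly
```

This pairwise loop is quadratic in the number of examples, so it is only meant to explain the number. On the real data you would normally use a library routine such as `sklearn.metrics.roc_auc_score`, passing it predicted probabilities or decision scores rather than hard class labels. A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the baseline of about 0.72 in context.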
A note on the data: this project really does involve big data, with more than 1 million records. I created two datasets: one is the complete dataset, and the other is a partial dataset containing the first 20% of the records. The data might still be relatively large, so you can sample it!
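If the file is too large to load comfortably, one possible way to sample it is reservoir sampling, which reads the CSV once and keeps a uniform random sample of `k` rows without holding the whole file in memory. The sketch below uses only the standard library. The helper name `sample_csv_rows` and the toy two-column CSV are assumptions for illustration; the real file has many more columns.

```python
import csv
import io
import random


def sample_csv_rows(lines, k, seed=0):
    """Return (header, rows) with up to k rows sampled uniformly in one pass.

    lines: an open text file or any iterable of CSV lines.
    Implements reservoir sampling (Algorithm R): the first k rows fill the
    reservoir, then row i (0-based) replaces a random slot with probability k/(i+1).
    """
    reader = csv.reader(lines)
    header = next(reader)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return header, reservoir


# Toy stand-in for the real file (hypothetical columns and values).
toy_csv = "loan_amnt,loan_status\n" + "".join(
    f"{1000 * i},{'Default' if i % 4 == 0 else 'Fully Paid'}\n" for i in range(1, 11)
)
header, sample = sample_csv_rows(io.StringIO(toy_csv), k=3)
print(header)
print(sample)

# With the project data, the call would look something like:
# with open("input/accepted-200000.csv", newline="") as f:
#     header, sample = sample_csv_rows(f, k=20000)
```

If the data already fits in a pandas DataFrame, `DataFrame.sample(frac=..., random_state=...)` is a simpler alternative. Either way, defaults are usually the minority class, so it is worth checking that the sample still contains enough of them before training.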