Getting Started with EDA

Before you develop a score card model, getting a well understanding of the data you are using plays an import role. In this tuorial, we will discuss exploratory data analysis module included in yasc.

Imports

First we import yasc and check its version.

# imports
import yasc
from yasc.data import german_data
from yasc.eda import (
    missing_stat,
    numeric_stat,
    categorical_stat,
    corr_analysis,
    describe
)
import pandas as pd
import numpy as np

# show version
yasc.__version__
'0.0.1'

Load data

Here we use german_data() to load german credit data and show its first five rows.

# load german credit data
data = german_data()
data.head()
StatusOfExistingCheckingAccount DurationInMonth CreditHistory Purpose CreditAmount SavingsAccountAndBonds PresentEmploymentSince InstallmentRateInPercentageOfDisposableIncome PersonalStatusAndSex OtherDebtorsOrGuarantors ... Property AgeInYears OtherInstallmentPlans Housing NumberOfExistingCreditsAtThisBank Job NumberOfPeopleBeingLiableToProvideMaintenanceFor Telephone ForeignWorker Creditability
0 ... < 0 DM 6 critical account/ other credits existing (not ... radio/television 1169 unknown/ no savings account ... >= 7 years 4 male : divorced/separated none ... real estate 67 none own 2 skilled employee / official 1 yes, registered under the customers name yes good
1 0 <= ... < 200 DM 48 existing credits paid back duly till now radio/television 5951 ... < 100 DM 1 <= ... < 4 years 2 male : divorced/separated none ... real estate 22 none own 1 skilled employee / official 1 none yes bad
2 no checking account 12 critical account/ other credits existing (not ... education 2096 ... < 100 DM 4 <= ... < 7 years 2 male : divorced/separated none ... real estate 49 none own 1 unskilled - resident 2 none yes good
3 ... < 0 DM 42 existing credits paid back duly till now furniture/equipment 7882 ... < 100 DM 4 <= ... < 7 years 2 male : divorced/separated guarantor ... building society savings agreement/ life insur... 45 none for free 1 skilled employee / official 2 none yes good
4 ... < 0 DM 24 delay in paying off in the past car (new) 4870 ... < 100 DM 1 <= ... < 4 years 3 male : divorced/separated none ... unknown / no property 53 none for free 2 skilled employee / official 2 none yes bad

5 rows × 21 columns

Check missing values I

missing_stat() function can help us find out columns with missing values and output missing rates.

missing_stat(data)
No missing values
column #missing missing_rate
0 StatusOfExistingCheckingAccount 0 0.0
18 Telephone 0 0.0
17 NumberOfPeopleBeingLiableToProvideMaintenanceFor 0 0.0
16 Job 0 0.0
15 NumberOfExistingCreditsAtThisBank 0 0.0
14 Housing 0 0.0
13 OtherInstallmentPlans 0 0.0
12 AgeInYears 0 0.0
11 Property 0 0.0
19 ForeignWorker 0 0.0
10 PresentResidenceSince 0 0.0
8 PersonalStatusAndSex 0 0.0
7 InstallmentRateInPercentageOfDisposableIncome 0 0.0
6 PresentEmploymentSince 0 0.0
5 SavingsAccountAndBonds 0 0.0
4 CreditAmount 0 0.0
3 Purpose 0 0.0
2 CreditHistory 0 0.0
1 DurationInMonth 0 0.0
9 OtherDebtorsOrGuarantors 0 0.0
20 Creditability 0 0.0
# only include columns with missing values
missing_stat(data, only_missing_columns=True)
No missing values
column #missing missing_rate

Check missing values II

The german data happens to have no missing values, let’s create a data frame with missing values to test missing_stat() function.

# create a data frame with missing values
df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 5, 6, 7], 'c': [8, 9, np.nan, 10]})
missing_stat(df1)
3 columns, of which 2 columns with missing values
column #missing missing_rate
0 a 0 0.00
1 b 1 0.25
2 c 1 0.25
# check missing statistics of single column
missing_stat(df1, "b")
Column b of dtype float64, 1 missing(s) (0.25)

Describe numeric columns

# check statistics of numeric columns
numeric_stat(data)
DurationInMonth CreditAmount InstallmentRateInPercentageOfDisposableIncome PresentResidenceSince AgeInYears NumberOfExistingCreditsAtThisBank NumberOfPeopleBeingLiableToProvideMaintenanceFor
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 20.903000 3271.258000 2.973000 2.845000 35.546000 1.407000 1.155000
std 12.058814 2822.736876 1.118715 1.103718 11.375469 0.577654 0.362086
min 4.000000 250.000000 1.000000 1.000000 19.000000 1.000000 1.000000
25% 12.000000 1365.500000 2.000000 2.000000 27.000000 1.000000 1.000000
50% 18.000000 2319.500000 3.000000 3.000000 33.000000 1.000000 1.000000
75% 24.000000 3972.250000 4.000000 4.000000 42.000000 2.000000 1.000000
max 72.000000 18424.000000 4.000000 4.000000 75.000000 4.000000 2.000000

Describe categorial columns

# check statistics of categorical columns
categorical_stat(data)
StatusOfExistingCheckingAccount CreditHistory Purpose SavingsAccountAndBonds PresentEmploymentSince PersonalStatusAndSex OtherDebtorsOrGuarantors Property OtherInstallmentPlans Housing Job Telephone ForeignWorker Creditability
count 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
unique 4 5 10 5 5 4 3 4 3 3 4 2 2 2
top no checking account existing credits paid back duly till now radio/television ... < 100 DM 1 <= ... < 4 years male : single none car or other, not in attribute Savings account... none own skilled employee / official none yes good
freq 394 530 280 603 339 548 907 332 814 713 630 596 963 700

Generate descriptive statistics

describe() function is used to generate descriptive statistics of observed data. Beyond what we get from pandas.core.frame.DataFrame.describe(), from describe() we can check missing values in columns and type of column (numeric or categorical).

describe(df1)
a b c
dtype int64 float64 float64
type numeric numeric numeric
#missing 0 1 1
missing_rate 0 0.25 0.25
count 4 3 3
mean 2.5 6 9
std 1.29099 1 1
min 1 5 8
25% 1.75 5.5 8.5
50% 2.5 6 9
75% 3.25 6.5 9.5
max 4 7 10
# get a descriptive statistics
describe(data)
StatusOfExistingCheckingAccount DurationInMonth CreditHistory Purpose CreditAmount SavingsAccountAndBonds PresentEmploymentSince InstallmentRateInPercentageOfDisposableIncome PersonalStatusAndSex OtherDebtorsOrGuarantors ... Property AgeInYears OtherInstallmentPlans Housing NumberOfExistingCreditsAtThisBank Job NumberOfPeopleBeingLiableToProvideMaintenanceFor Telephone ForeignWorker Creditability
dtype object int64 object object int64 object object int64 object object ... object int64 object object int64 object int64 object object object
type categorical numeric categorical categorical numeric categorical categorical numeric categorical categorical ... categorical numeric categorical categorical numeric categorical numeric categorical categorical categorical
#missing 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
missing_rate 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
count 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ... 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
unique 4 NaN 5 10 NaN 5 5 NaN 4 3 ... 4 NaN 3 3 NaN 4 NaN 2 2 2
top no checking account NaN existing credits paid back duly till now radio/television NaN ... < 100 DM 1 <= ... < 4 years NaN male : single none ... car or other, not in attribute Savings account... NaN none own NaN skilled employee / official NaN none yes good
freq 394 NaN 530 280 NaN 603 339 NaN 548 907 ... 332 NaN 814 713 NaN 630 NaN 596 963 700
mean NaN 20.903 NaN NaN 3271.26 NaN NaN 2.973 NaN NaN ... NaN 35.546 NaN NaN 1.407 NaN 1.155 NaN NaN NaN
std NaN 12.0588 NaN NaN 2822.74 NaN NaN 1.11871 NaN NaN ... NaN 11.3755 NaN NaN 0.577654 NaN 0.362086 NaN NaN NaN
min NaN 4 NaN NaN 250 NaN NaN 1 NaN NaN ... NaN 19 NaN NaN 1 NaN 1 NaN NaN NaN
25% NaN 12 NaN NaN 1365.5 NaN NaN 2 NaN NaN ... NaN 27 NaN NaN 1 NaN 1 NaN NaN NaN
50% NaN 18 NaN NaN 2319.5 NaN NaN 3 NaN NaN ... NaN 33 NaN NaN 1 NaN 1 NaN NaN NaN
75% NaN 24 NaN NaN 3972.25 NaN NaN 4 NaN NaN ... NaN 42 NaN NaN 2 NaN 1 NaN NaN NaN
max NaN 72 NaN NaN 18424 NaN NaN 4 NaN NaN ... NaN 75 NaN NaN 4 NaN 2 NaN NaN NaN

15 rows × 21 columns

Correlation analysis

corr, ax = corr_analysis(data, show_plot=True)
../_images/get_started_with_eda_20_0.png