Before you develop a score card model, getting a well understanding of
the data you are using plays an import role. In this tuorial, we will
discuss exploratory data analysis module included in yasc
.
First we import yasc
and check its version.
# imports
import yasc
from yasc.data import german_data
from yasc.eda import (
missing_stat,
numeric_stat,
categorical_stat,
corr_analysis,
describe
)
import pandas as pd
import numpy as np
# show version
yasc.__version__
'0.0.1'
Here we use german_data()
to load german credit data and show its
first five rows.
# load german credit data
data = german_data()
data.head()
StatusOfExistingCheckingAccount | DurationInMonth | CreditHistory | Purpose | CreditAmount | SavingsAccountAndBonds | PresentEmploymentSince | InstallmentRateInPercentageOfDisposableIncome | PersonalStatusAndSex | OtherDebtorsOrGuarantors | ... | Property | AgeInYears | OtherInstallmentPlans | Housing | NumberOfExistingCreditsAtThisBank | Job | NumberOfPeopleBeingLiableToProvideMaintenanceFor | Telephone | ForeignWorker | Creditability | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ... < 0 DM | 6 | critical account/ other credits existing (not ... | radio/television | 1169 | unknown/ no savings account | ... >= 7 years | 4 | male : divorced/separated | none | ... | real estate | 67 | none | own | 2 | skilled employee / official | 1 | yes, registered under the customers name | yes | good |
1 | 0 <= ... < 200 DM | 48 | existing credits paid back duly till now | radio/television | 5951 | ... < 100 DM | 1 <= ... < 4 years | 2 | male : divorced/separated | none | ... | real estate | 22 | none | own | 1 | skilled employee / official | 1 | none | yes | bad |
2 | no checking account | 12 | critical account/ other credits existing (not ... | education | 2096 | ... < 100 DM | 4 <= ... < 7 years | 2 | male : divorced/separated | none | ... | real estate | 49 | none | own | 1 | unskilled - resident | 2 | none | yes | good |
3 | ... < 0 DM | 42 | existing credits paid back duly till now | furniture/equipment | 7882 | ... < 100 DM | 4 <= ... < 7 years | 2 | male : divorced/separated | guarantor | ... | building society savings agreement/ life insur... | 45 | none | for free | 1 | skilled employee / official | 2 | none | yes | good |
4 | ... < 0 DM | 24 | delay in paying off in the past | car (new) | 4870 | ... < 100 DM | 1 <= ... < 4 years | 3 | male : divorced/separated | none | ... | unknown / no property | 53 | none | for free | 2 | skilled employee / official | 2 | none | yes | bad |
5 rows × 21 columns
missing_stat()
function can help us find out columns with missing
values and output missing rates.
missing_stat(data)
No missing values
column | #missing | missing_rate | |
---|---|---|---|
0 | StatusOfExistingCheckingAccount | 0 | 0.0 |
18 | Telephone | 0 | 0.0 |
17 | NumberOfPeopleBeingLiableToProvideMaintenanceFor | 0 | 0.0 |
16 | Job | 0 | 0.0 |
15 | NumberOfExistingCreditsAtThisBank | 0 | 0.0 |
14 | Housing | 0 | 0.0 |
13 | OtherInstallmentPlans | 0 | 0.0 |
12 | AgeInYears | 0 | 0.0 |
11 | Property | 0 | 0.0 |
19 | ForeignWorker | 0 | 0.0 |
10 | PresentResidenceSince | 0 | 0.0 |
8 | PersonalStatusAndSex | 0 | 0.0 |
7 | InstallmentRateInPercentageOfDisposableIncome | 0 | 0.0 |
6 | PresentEmploymentSince | 0 | 0.0 |
5 | SavingsAccountAndBonds | 0 | 0.0 |
4 | CreditAmount | 0 | 0.0 |
3 | Purpose | 0 | 0.0 |
2 | CreditHistory | 0 | 0.0 |
1 | DurationInMonth | 0 | 0.0 |
9 | OtherDebtorsOrGuarantors | 0 | 0.0 |
20 | Creditability | 0 | 0.0 |
# only include columns with missing values
missing_stat(data, only_missing_columns=True)
No missing values
column | #missing | missing_rate |
---|
The german data happens to have no missing values, let’s create a data
frame with missing values to test missing_stat()
function.
# create a data frame with missing values
df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 5, 6, 7], 'c': [8, 9, np.nan, 10]})
missing_stat(df1)
3 columns, of which 2 columns with missing values
column | #missing | missing_rate | |
---|---|---|---|
0 | a | 0 | 0.00 |
1 | b | 1 | 0.25 |
2 | c | 1 | 0.25 |
# check missing statistics of single column
missing_stat(df1, "b")
Column b of dtype float64, 1 missing(s) (0.25)
# check statistics of numeric columns
numeric_stat(data)
DurationInMonth | CreditAmount | InstallmentRateInPercentageOfDisposableIncome | PresentResidenceSince | AgeInYears | NumberOfExistingCreditsAtThisBank | NumberOfPeopleBeingLiableToProvideMaintenanceFor | |
---|---|---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 20.903000 | 3271.258000 | 2.973000 | 2.845000 | 35.546000 | 1.407000 | 1.155000 |
std | 12.058814 | 2822.736876 | 1.118715 | 1.103718 | 11.375469 | 0.577654 | 0.362086 |
min | 4.000000 | 250.000000 | 1.000000 | 1.000000 | 19.000000 | 1.000000 | 1.000000 |
25% | 12.000000 | 1365.500000 | 2.000000 | 2.000000 | 27.000000 | 1.000000 | 1.000000 |
50% | 18.000000 | 2319.500000 | 3.000000 | 3.000000 | 33.000000 | 1.000000 | 1.000000 |
75% | 24.000000 | 3972.250000 | 4.000000 | 4.000000 | 42.000000 | 2.000000 | 1.000000 |
max | 72.000000 | 18424.000000 | 4.000000 | 4.000000 | 75.000000 | 4.000000 | 2.000000 |
# check statistics of categorical columns
categorical_stat(data)
StatusOfExistingCheckingAccount | CreditHistory | Purpose | SavingsAccountAndBonds | PresentEmploymentSince | PersonalStatusAndSex | OtherDebtorsOrGuarantors | Property | OtherInstallmentPlans | Housing | Job | Telephone | ForeignWorker | Creditability | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
unique | 4 | 5 | 10 | 5 | 5 | 4 | 3 | 4 | 3 | 3 | 4 | 2 | 2 | 2 |
top | no checking account | existing credits paid back duly till now | radio/television | ... < 100 DM | 1 <= ... < 4 years | male : single | none | car or other, not in attribute Savings account... | none | own | skilled employee / official | none | yes | good |
freq | 394 | 530 | 280 | 603 | 339 | 548 | 907 | 332 | 814 | 713 | 630 | 596 | 963 | 700 |
describe()
function is used to generate descriptive statistics of
observed data. Beyond what we get from
pandas.core.frame.DataFrame.describe()
, from describe()
we can
check missing values in columns and type of column (numeric or
categorical).
describe(df1)
a | b | c | |
---|---|---|---|
dtype | int64 | float64 | float64 |
type | numeric | numeric | numeric |
#missing | 0 | 1 | 1 |
missing_rate | 0 | 0.25 | 0.25 |
count | 4 | 3 | 3 |
mean | 2.5 | 6 | 9 |
std | 1.29099 | 1 | 1 |
min | 1 | 5 | 8 |
25% | 1.75 | 5.5 | 8.5 |
50% | 2.5 | 6 | 9 |
75% | 3.25 | 6.5 | 9.5 |
max | 4 | 7 | 10 |
# get a descriptive statistics
describe(data)
StatusOfExistingCheckingAccount | DurationInMonth | CreditHistory | Purpose | CreditAmount | SavingsAccountAndBonds | PresentEmploymentSince | InstallmentRateInPercentageOfDisposableIncome | PersonalStatusAndSex | OtherDebtorsOrGuarantors | ... | Property | AgeInYears | OtherInstallmentPlans | Housing | NumberOfExistingCreditsAtThisBank | Job | NumberOfPeopleBeingLiableToProvideMaintenanceFor | Telephone | ForeignWorker | Creditability | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dtype | object | int64 | object | object | int64 | object | object | int64 | object | object | ... | object | int64 | object | object | int64 | object | int64 | object | object | object |
type | categorical | numeric | categorical | categorical | numeric | categorical | categorical | numeric | categorical | categorical | ... | categorical | numeric | categorical | categorical | numeric | categorical | numeric | categorical | categorical | categorical |
#missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
missing_rate | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
unique | 4 | NaN | 5 | 10 | NaN | 5 | 5 | NaN | 4 | 3 | ... | 4 | NaN | 3 | 3 | NaN | 4 | NaN | 2 | 2 | 2 |
top | no checking account | NaN | existing credits paid back duly till now | radio/television | NaN | ... < 100 DM | 1 <= ... < 4 years | NaN | male : single | none | ... | car or other, not in attribute Savings account... | NaN | none | own | NaN | skilled employee / official | NaN | none | yes | good |
freq | 394 | NaN | 530 | 280 | NaN | 603 | 339 | NaN | 548 | 907 | ... | 332 | NaN | 814 | 713 | NaN | 630 | NaN | 596 | 963 | 700 |
mean | NaN | 20.903 | NaN | NaN | 3271.26 | NaN | NaN | 2.973 | NaN | NaN | ... | NaN | 35.546 | NaN | NaN | 1.407 | NaN | 1.155 | NaN | NaN | NaN |
std | NaN | 12.0588 | NaN | NaN | 2822.74 | NaN | NaN | 1.11871 | NaN | NaN | ... | NaN | 11.3755 | NaN | NaN | 0.577654 | NaN | 0.362086 | NaN | NaN | NaN |
min | NaN | 4 | NaN | NaN | 250 | NaN | NaN | 1 | NaN | NaN | ... | NaN | 19 | NaN | NaN | 1 | NaN | 1 | NaN | NaN | NaN |
25% | NaN | 12 | NaN | NaN | 1365.5 | NaN | NaN | 2 | NaN | NaN | ... | NaN | 27 | NaN | NaN | 1 | NaN | 1 | NaN | NaN | NaN |
50% | NaN | 18 | NaN | NaN | 2319.5 | NaN | NaN | 3 | NaN | NaN | ... | NaN | 33 | NaN | NaN | 1 | NaN | 1 | NaN | NaN | NaN |
75% | NaN | 24 | NaN | NaN | 3972.25 | NaN | NaN | 4 | NaN | NaN | ... | NaN | 42 | NaN | NaN | 2 | NaN | 1 | NaN | NaN | NaN |
max | NaN | 72 | NaN | NaN | 18424 | NaN | NaN | 4 | NaN | NaN | ... | NaN | 75 | NaN | NaN | 4 | NaN | 2 | NaN | NaN | NaN |
15 rows × 21 columns
corr, ax = corr_analysis(data, show_plot=True)