Before you develop a score card model, getting a well understanding of
the data you are using plays an import role. In this tuorial, we will
discuss exploratory data analysis module included in yasc.
First we import yasc and check its version.
# imports
import yasc
from yasc.data import german_data
from yasc.eda import (
missing_stat,
numeric_stat,
categorical_stat,
corr_analysis,
describe
)
import pandas as pd
import numpy as np
# show version
yasc.__version__
'0.0.1'
Here we use german_data() to load german credit data and show its
first five rows.
# load german credit data
data = german_data()
data.head()
| StatusOfExistingCheckingAccount | DurationInMonth | CreditHistory | Purpose | CreditAmount | SavingsAccountAndBonds | PresentEmploymentSince | InstallmentRateInPercentageOfDisposableIncome | PersonalStatusAndSex | OtherDebtorsOrGuarantors | ... | Property | AgeInYears | OtherInstallmentPlans | Housing | NumberOfExistingCreditsAtThisBank | Job | NumberOfPeopleBeingLiableToProvideMaintenanceFor | Telephone | ForeignWorker | Creditability | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ... < 0 DM | 6 | critical account/ other credits existing (not ... | radio/television | 1169 | unknown/ no savings account | ... >= 7 years | 4 | male : divorced/separated | none | ... | real estate | 67 | none | own | 2 | skilled employee / official | 1 | yes, registered under the customers name | yes | good |
| 1 | 0 <= ... < 200 DM | 48 | existing credits paid back duly till now | radio/television | 5951 | ... < 100 DM | 1 <= ... < 4 years | 2 | male : divorced/separated | none | ... | real estate | 22 | none | own | 1 | skilled employee / official | 1 | none | yes | bad |
| 2 | no checking account | 12 | critical account/ other credits existing (not ... | education | 2096 | ... < 100 DM | 4 <= ... < 7 years | 2 | male : divorced/separated | none | ... | real estate | 49 | none | own | 1 | unskilled - resident | 2 | none | yes | good |
| 3 | ... < 0 DM | 42 | existing credits paid back duly till now | furniture/equipment | 7882 | ... < 100 DM | 4 <= ... < 7 years | 2 | male : divorced/separated | guarantor | ... | building society savings agreement/ life insur... | 45 | none | for free | 1 | skilled employee / official | 2 | none | yes | good |
| 4 | ... < 0 DM | 24 | delay in paying off in the past | car (new) | 4870 | ... < 100 DM | 1 <= ... < 4 years | 3 | male : divorced/separated | none | ... | unknown / no property | 53 | none | for free | 2 | skilled employee / official | 2 | none | yes | bad |
5 rows × 21 columns
missing_stat() function can help us find out columns with missing
values and output missing rates.
missing_stat(data)
No missing values
| column | #missing | missing_rate | |
|---|---|---|---|
| 0 | StatusOfExistingCheckingAccount | 0 | 0.0 |
| 18 | Telephone | 0 | 0.0 |
| 17 | NumberOfPeopleBeingLiableToProvideMaintenanceFor | 0 | 0.0 |
| 16 | Job | 0 | 0.0 |
| 15 | NumberOfExistingCreditsAtThisBank | 0 | 0.0 |
| 14 | Housing | 0 | 0.0 |
| 13 | OtherInstallmentPlans | 0 | 0.0 |
| 12 | AgeInYears | 0 | 0.0 |
| 11 | Property | 0 | 0.0 |
| 19 | ForeignWorker | 0 | 0.0 |
| 10 | PresentResidenceSince | 0 | 0.0 |
| 8 | PersonalStatusAndSex | 0 | 0.0 |
| 7 | InstallmentRateInPercentageOfDisposableIncome | 0 | 0.0 |
| 6 | PresentEmploymentSince | 0 | 0.0 |
| 5 | SavingsAccountAndBonds | 0 | 0.0 |
| 4 | CreditAmount | 0 | 0.0 |
| 3 | Purpose | 0 | 0.0 |
| 2 | CreditHistory | 0 | 0.0 |
| 1 | DurationInMonth | 0 | 0.0 |
| 9 | OtherDebtorsOrGuarantors | 0 | 0.0 |
| 20 | Creditability | 0 | 0.0 |
# only include columns with missing values
missing_stat(data, only_missing_columns=True)
No missing values
| column | #missing | missing_rate |
|---|
The german data happens to have no missing values, let’s create a data
frame with missing values to test missing_stat() function.
# create a data frame with missing values
df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 5, 6, 7], 'c': [8, 9, np.nan, 10]})
missing_stat(df1)
3 columns, of which 2 columns with missing values
| column | #missing | missing_rate | |
|---|---|---|---|
| 0 | a | 0 | 0.00 |
| 1 | b | 1 | 0.25 |
| 2 | c | 1 | 0.25 |
# check missing statistics of single column
missing_stat(df1, "b")
Column b of dtype float64, 1 missing(s) (0.25)
# check statistics of numeric columns
numeric_stat(data)
| DurationInMonth | CreditAmount | InstallmentRateInPercentageOfDisposableIncome | PresentResidenceSince | AgeInYears | NumberOfExistingCreditsAtThisBank | NumberOfPeopleBeingLiableToProvideMaintenanceFor | |
|---|---|---|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | 20.903000 | 3271.258000 | 2.973000 | 2.845000 | 35.546000 | 1.407000 | 1.155000 |
| std | 12.058814 | 2822.736876 | 1.118715 | 1.103718 | 11.375469 | 0.577654 | 0.362086 |
| min | 4.000000 | 250.000000 | 1.000000 | 1.000000 | 19.000000 | 1.000000 | 1.000000 |
| 25% | 12.000000 | 1365.500000 | 2.000000 | 2.000000 | 27.000000 | 1.000000 | 1.000000 |
| 50% | 18.000000 | 2319.500000 | 3.000000 | 3.000000 | 33.000000 | 1.000000 | 1.000000 |
| 75% | 24.000000 | 3972.250000 | 4.000000 | 4.000000 | 42.000000 | 2.000000 | 1.000000 |
| max | 72.000000 | 18424.000000 | 4.000000 | 4.000000 | 75.000000 | 4.000000 | 2.000000 |
# check statistics of categorical columns
categorical_stat(data)
| StatusOfExistingCheckingAccount | CreditHistory | Purpose | SavingsAccountAndBonds | PresentEmploymentSince | PersonalStatusAndSex | OtherDebtorsOrGuarantors | Property | OtherInstallmentPlans | Housing | Job | Telephone | ForeignWorker | Creditability | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| unique | 4 | 5 | 10 | 5 | 5 | 4 | 3 | 4 | 3 | 3 | 4 | 2 | 2 | 2 |
| top | no checking account | existing credits paid back duly till now | radio/television | ... < 100 DM | 1 <= ... < 4 years | male : single | none | car or other, not in attribute Savings account... | none | own | skilled employee / official | none | yes | good |
| freq | 394 | 530 | 280 | 603 | 339 | 548 | 907 | 332 | 814 | 713 | 630 | 596 | 963 | 700 |
describe() function is used to generate descriptive statistics of
observed data. Beyond what we get from
pandas.core.frame.DataFrame.describe(), from describe() we can
check missing values in columns and type of column (numeric or
categorical).
describe(df1)
| a | b | c | |
|---|---|---|---|
| dtype | int64 | float64 | float64 |
| type | numeric | numeric | numeric |
| #missing | 0 | 1 | 1 |
| missing_rate | 0 | 0.25 | 0.25 |
| count | 4 | 3 | 3 |
| mean | 2.5 | 6 | 9 |
| std | 1.29099 | 1 | 1 |
| min | 1 | 5 | 8 |
| 25% | 1.75 | 5.5 | 8.5 |
| 50% | 2.5 | 6 | 9 |
| 75% | 3.25 | 6.5 | 9.5 |
| max | 4 | 7 | 10 |
# get a descriptive statistics
describe(data)
| StatusOfExistingCheckingAccount | DurationInMonth | CreditHistory | Purpose | CreditAmount | SavingsAccountAndBonds | PresentEmploymentSince | InstallmentRateInPercentageOfDisposableIncome | PersonalStatusAndSex | OtherDebtorsOrGuarantors | ... | Property | AgeInYears | OtherInstallmentPlans | Housing | NumberOfExistingCreditsAtThisBank | Job | NumberOfPeopleBeingLiableToProvideMaintenanceFor | Telephone | ForeignWorker | Creditability | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dtype | object | int64 | object | object | int64 | object | object | int64 | object | object | ... | object | int64 | object | object | int64 | object | int64 | object | object | object |
| type | categorical | numeric | categorical | categorical | numeric | categorical | categorical | numeric | categorical | categorical | ... | categorical | numeric | categorical | categorical | numeric | categorical | numeric | categorical | categorical | categorical |
| #missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| missing_rate | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | ... | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| unique | 4 | NaN | 5 | 10 | NaN | 5 | 5 | NaN | 4 | 3 | ... | 4 | NaN | 3 | 3 | NaN | 4 | NaN | 2 | 2 | 2 |
| top | no checking account | NaN | existing credits paid back duly till now | radio/television | NaN | ... < 100 DM | 1 <= ... < 4 years | NaN | male : single | none | ... | car or other, not in attribute Savings account... | NaN | none | own | NaN | skilled employee / official | NaN | none | yes | good |
| freq | 394 | NaN | 530 | 280 | NaN | 603 | 339 | NaN | 548 | 907 | ... | 332 | NaN | 814 | 713 | NaN | 630 | NaN | 596 | 963 | 700 |
| mean | NaN | 20.903 | NaN | NaN | 3271.26 | NaN | NaN | 2.973 | NaN | NaN | ... | NaN | 35.546 | NaN | NaN | 1.407 | NaN | 1.155 | NaN | NaN | NaN |
| std | NaN | 12.0588 | NaN | NaN | 2822.74 | NaN | NaN | 1.11871 | NaN | NaN | ... | NaN | 11.3755 | NaN | NaN | 0.577654 | NaN | 0.362086 | NaN | NaN | NaN |
| min | NaN | 4 | NaN | NaN | 250 | NaN | NaN | 1 | NaN | NaN | ... | NaN | 19 | NaN | NaN | 1 | NaN | 1 | NaN | NaN | NaN |
| 25% | NaN | 12 | NaN | NaN | 1365.5 | NaN | NaN | 2 | NaN | NaN | ... | NaN | 27 | NaN | NaN | 1 | NaN | 1 | NaN | NaN | NaN |
| 50% | NaN | 18 | NaN | NaN | 2319.5 | NaN | NaN | 3 | NaN | NaN | ... | NaN | 33 | NaN | NaN | 1 | NaN | 1 | NaN | NaN | NaN |
| 75% | NaN | 24 | NaN | NaN | 3972.25 | NaN | NaN | 4 | NaN | NaN | ... | NaN | 42 | NaN | NaN | 2 | NaN | 1 | NaN | NaN | NaN |
| max | NaN | 72 | NaN | NaN | 18424 | NaN | NaN | 4 | NaN | NaN | ... | NaN | 75 | NaN | NaN | 4 | NaN | 2 | NaN | NaN | NaN |
15 rows × 21 columns