===========================

What I learned

pandas

A Python library for working with structured data.

The Excel of the Python world.

Apparently the data is also referred to as "values".

Let's go over the basics.

Series

Series: an object holding the data for a single column of a DataFrame. It seems to simply mean one column, including its header.

DataFrame: an object holding the entire data table.

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

list_data = [1,2,3,4,5]

list_name = ["a", "b", "c", "d", "e"]

example_obj = Series(data = list_data, index = list_name)

dict_data = {"a":1, "b":2, "c":3, "d":4, "e":5}

example_obj = Series(dict_data, dtype=np.float32, name="example_data")

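It accepts many other parameters as well.

It seems to be used for turning NumPy arrays and similar data into a Series, which adds an index for accessing the existing data. A Series is not a subclass of numpy.ndarray in current pandas (very old versions were); it wraps an array, which is exposed through values.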

example_obj["a"]

example_obj["a"] = 3.2

example_obj = example_obj.astype(int) # type conversion

values

index

name

index.name

The index is what everything is keyed on: entries are added or dropped based on the index.
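
A small sketch of that index-driven behaviour, assuming pandas is installed and using made-up labels: when a dict-backed Series is given an explicit index, only the requested labels are kept and labels without data become NaN.

```python
import numpy as np
import pandas as pd

dict_data = {"a": 1, "b": 2, "c": 3}

# Labels come from the index argument: "a" is dropped, "z" has no data.
s = pd.Series(dict_data, index=["b", "c", "z"], name="example_data")
s.index.name = "label"

print(s.values)               # the underlying data as a NumPy array
print(s.index)                # the labels
print(s.name, s.index.name)   # name of the Series and of its index
```

The values, index, name and index.name attributes listed above are exactly what the three print calls show.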

dataframe

A data table made by collecting Series; two-dimensional by default.

raw_data = {
    "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "last_name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
    "age": [42, 52, 36, 24, 73],
    "city": ["San Francisco", "Baltimore", "Miami", "Douglas", "Boston"],
}
df = pd.DataFrame(raw_data, columns=["first_name", "last_name", "age", "city"])

In practice you rarely type the data out by hand; you usually load CSV files. Still, a DataFrame can also be created directly like this.

DataFrame(raw_data, columns=["age", "city"])

DataFrame(raw_data, columns=["first_name", "last_name", "age", "city", "debt"])

df.first_name

df["first_name"]

df.loc[:, ["last_name"]]

df["age"].iloc[1:]

df["debt"] = df.age > 40  # use [] here: df.debt = ... does not create a new column

values = Series(data=["M", "F", "F"], index=[0, 1, 3])

df["sex"] = values

df.head(3).T

df.to_csv()

selection & drop

df["account"].head()  # returns a Series

df[["account", "street", "state"]].head()  # returns a DataFrame

df[:3]

returns only the rows up to, but not including, position 3.

df["name"][:3]

account_series = df["account"]
account_series[account_series < 250000]

del df["account"]

df[["name", "street"]][:2]

df.loc[[211829, 320563], ["name", "street"]]

df.iloc[:10, :3]

Use whichever of these is most convenient.
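
To make the difference between the label-based and position-based styles concrete, here is a minimal sketch on a hypothetical toy frame (the names, streets and the third account number are made up). loc selects by index label, iloc by integer position, and here both pick the same block.

```python
import pandas as pd

# Hypothetical toy frame; account numbers are used as the row index.
df = pd.DataFrame(
    {"name": ["A", "B", "C"], "street": ["x st", "y st", "z st"]},
    index=[211829, 320563, 648336],
)

by_label = df.loc[[211829, 320563], ["name", "street"]]  # index labels
by_position = df.iloc[:2, :2]                             # integer positions

print(by_label.equals(by_position))
```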

df.index = list(range(0, 15)) # df.reset_index(inplace=True) modifies df itself

# df.reset_index(inplace=True, drop=True) modifies df itself and discards the old index

df.drop(1)
# df.drop(1, inplace=True) modifies df itself

or

df = df.drop(1)

The row whose index is 1 gets dropped.
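
A short sketch of that behaviour on a made-up frame: drop returns a new DataFrame and leaves the original untouched unless the result is assigned back (or inplace=True is passed).

```python
import pandas as pd

df = pd.DataFrame({"city": ["Seoul", "Busan", "Daegu"], "age": [30, 40, 50]})

dropped = df.drop(1)               # new frame without the row labelled 1
no_city = df.drop("city", axis=1)  # new frame without the "city" column

print(len(df), len(dropped))       # the original still has all rows
print(list(no_city.columns))
```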

df.drop([0, 1, 2, 3])

df.drop("city", axis=1)

matrix = df.values

dataframe operations

s1 = Series(range(1, 6), index=list("abced"))

a    1
b    2
c    3
e    4
d    5
dtype: int64

s2 = Series(range(5, 11), index=list("bcedef"))

b     5
c     6
e     7
d     8
e     9
f    10
dtype: int64

s1 + s2 or s1.add(s2)

a     NaN
b     7.0
c     9.0
d    13.0
e    11.0
e    13.0
f     NaN
dtype: float64

df1 = DataFrame(np.arange(9).reshape(3, 3), columns=list("abc"))

	a	b	c
0	0	1	2
1	3	4	5
2	6	7	8

df2 = DataFrame(np.arange(16).reshape(4, 4), columns=list("abcd"))

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15

df1 + df2

	a	b	c	d
0	0.0	2.0	4.0	NaN
1	7.0	9.0	11.0	NaN
2	14.0	16.0	18.0	NaN
3	NaN	NaN	NaN	NaN

df1.add(df2, fill_value=0)

	a	b	c	d
0	0.0	2.0	4.0	3.0
1	7.0	9.0	11.0	7.0
2	14.0	16.0	18.0	11.0
3	12.0	13.0	14.0	15.0

df1.mul(df2, fill_value=1)

	a	b	c	d
0	0.0	1.0	4.0	3.0
1	12.0	20.0	30.0	7.0
2	48.0	63.0	80.0	11.0
3	12.0	13.0	14.0	15.0

df = DataFrame(np.arange(16).reshape(4, 4), columns=list("abcd"))

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15

s = Series(np.arange(10, 14), index=list("abcd"))

a    10
b    11
c    12
d    13
dtype: int32

df + s

	a	b	c	d
0	10	12	14	16
1	14	16	18	20
2	18	20	22	24
3	22	24	26	28

s2 = Series(np.arange(10, 14))

0    10
1    11
2    12
3    13
dtype: int32

df + s2

	a	b	c	d	0	1	2	3
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

df.add(s2, axis=0) # if the Series should line up with the row index rather than the column names, pass axis explicitly

It broadcasts automatically and adds.

	a	b	c	d
0	10	11	12	13
1	15	16	17	18
2	20	21	22	23
3	25	26	27	28
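
The two alignment modes can be put side by side in a small sketch (same shapes as above, variable names are my own): without axis the Series labels are matched against the column names, with axis=0 they are matched against the row index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list("abcd"))
s_cols = pd.Series(np.arange(10, 14), index=list("abcd"))  # labels match columns
s_rows = pd.Series(np.arange(10, 14))                      # labels match rows 0..3

by_col = df + s_cols              # s_cols[c] is added to every row of column c
by_row = df.add(s_rows, axis=0)   # s_rows[i] is added to every column of row i

print(by_col.iloc[0].tolist())
print(by_row.iloc[1].tolist())
```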

lambda, map, apply

s1 = Series(np.arange(10))
s1.head(5)

0    0
1    1
2    2
3    3
4    4
dtype: int32

# f = lambda x: x**2
def f(x):
    return x + 5


s1.map(f)

0     5
1     6
2     7
3     8
4     9
5    10
6    11
7    12
8    13
9    14
dtype: int64

z = {1: "A", 2: "B", 3: "C"}
s1.map(z)

0    NaN
1      A
2      B
3      C
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
dtype: object

s2 = Series(np.arange(10, 30))
s1.map(s2)

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int32

df = pd.read_csv("./data/wages.csv")
df.head()

	earn	height	sex	race	ed	age
0	79571.299011	73.89	male	white	16	49
1	96396.988643	66.23	female	white	16	62
2	48710.666947	63.77	female	white	16	33
3	80478.096153	63.22	female	other	16	95
4	82089.345498	63.08	female	white	17	43

df.sex.unique()

array(['male', 'female'], dtype=object)

def change_sex(x):
    return 0 if x == "male" else 1


df.sex.map(change_sex)

0       0
1       1
2       1
3       1
4       1
       ..
1374    0
1375    1
1376    1
1377    0
1378    0
Name: sex, Length: 1379, dtype: int64

df["sex_code"] = df.sex.map({"male": 0, "female": 1})
df.head(5)

	earn	height	sex	race	ed	age	sex_code
0	79571.299011	73.89	male	white	16	49	0
1	96396.988643	66.23	female	white	16	62	1
2	48710.666947	63.77	female	white	16	33	1
3	80478.096153	63.22	female	other	16	95	1
4	82089.345498	63.08	female	white	17	43	1

Alternatively, replace can be used for this kind of substitution.

df.sex.replace({"male": 0, "female": 1})

0       0
1       1
2       1
3       1
4       1
       ..
1374    0
1375    1
1376    1
1377    0
1378    0
Name: sex, Length: 1379, dtype: int64

df["sex"] = df["sex"].replace(["male", "female"], [0, 1])  # assign back; chained inplace=True does not update df under Copy-on-Write

	earn	height	sex	race	ed	age	sex_code
0	79571.299011	73.89	0	white	16	49	0
1	96396.988643	66.23	1	white	16	62	1
2	48710.666947	63.77	1	white	16	33	1
3	80478.096153	63.22	1	other	16	95	1
4	82089.345498	63.08	1	white	17	43	1
...	...	...	...	...	...	...	...
1374	30173.380363	71.68	0	white	12	33	0
1375	24853.519514	61.31	1	white	18	86	1
1376	13710.671312	63.64	1	white	12	37	1
1377	95426.014410	71.65	0	white	12	54	0
1378	9575.461857	68.22	0	white	12	31	0

df = pd.read_csv("wages.csv")
df_info = df[["earn", "height", "age"]]
df_info.head()

	earn	height	age
0	79571.299011	73.89	49
1	96396.988643	66.23	62
2	48710.666947	63.77	33
3	80478.096153	63.22	95
4	82089.345498	63.08	43

f = lambda x: np.mean(x)
df_info.apply(f)

earn      32446.292622
height       66.592640
age          45.328499
dtype: float64

df_info.apply(np.mean)  # same result
df_info.mean()  # same result

def f(x):
    return Series(
        [x.min(), x.max(), x.mean(), sum(x.isnull())],
        index=["min", "max", "mean", "null"],
    )


df_info.apply(f)

	earn	height	age
min	-98.580489	57.34000	22.000000
max	317949.127955	77.21000	95.000000
mean	32446.292622	66.59264	45.328499
null	0.000000	0.00000	0.000000

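applymap applies the function to every element (since pandas 2.1 it is called DataFrame.map and applymap is deprecated).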

f = lambda x: x // 2
df_info.applymap(f).head(5)

	earn	height	age
0	39785.0	36.0	24
1	48198.0	33.0	31
2	24355.0	31.0	16
3	40239.0	31.0	47
4	41044.0	31.0	21

f = lambda x: x ** 2
df_info["earn"].apply(f)

0       6.331592e+09
1       9.292379e+09
2       2.372729e+09
3       6.476724e+09
4       6.738661e+09
            ...     
1374    9.104329e+08
1375    6.176974e+08
1376    1.879825e+08
1377    9.106124e+09
1378    9.168947e+07
Name: earn, Length: 1379, dtype: float64

pandas built-in functions

No need to memorize these; you end up looking them up whenever you need them.

df.describe()

key = df.race.unique()

dict(enumerate(sorted(df["race"].unique())))

df.sum(axis=1) # axis=1 sums across the columns (one result per row); axis=0 sums down the rows (one result per column)
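
Since the axis argument is easy to mix up, here is a tiny sketch on a made-up numeric frame showing both directions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20], "c": [100, 200]})

col_totals = df.sum(axis=0)  # collapses the rows: one total per column
row_totals = df.sum(axis=1)  # collapses the columns: one total per row

print(col_totals.tolist())
print(row_totals.tolist())
```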

df.isnull()

df.isnull().sum() / len(df)

df.sort_values(["age", "earn"], ascending=True)

df.sort_values("age", ascending=False).head(10)

Functions for the correlation coefficient and covariance

df.age.corr(df.earn)

df.age[(df.age < 45) & (df.age > 15)].corr(df.earn)

df.age.cov(df.earn)

df["sex_code"] = df["sex"].replace({"male": 1, "female": 0})

df.corr(numeric_only=True)  # numeric_only skips the string columns (needed in pandas 2.0+)

	earn	height	ed	age	sex_code
earn	1.000000	0.291600	0.350374	0.074003	0.337328
height	0.291600	1.000000	0.114047	-0.133727	0.703672
ed	0.350374	0.114047	1.000000	-0.129802	0.061747
age	0.074003	-0.133727	-0.129802	1.000000	-0.070036
sex_code	0.337328	0.703672	0.061747	-0.070036	1.000000

df.corrwith(df.earn, numeric_only=True)

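[AI Math Lecture 5] Understanding how deep learning learns

The linear models from the previous lecture are useful for interpreting simple data, but they do poorly on complex problems. Neural networks, which are nonlinear models, are introduced to improve on this.

The structure of a neural network and the pieces used inside it: softmax, activation functions, and the backpropagation algorithm.

The linear model from last time is too limited to use as a real-world model.

So we consider neural network models instead. These are nonlinear models. Taken apart, they are built from linear models: a neural network is a combination of linear models and nonlinear functions.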

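Each row vector o_i is assumed to be the matrix product of the data x_i and the weight matrix W, plus the bias vector b.

Up to last time we only solved for beta, which was one-dimensional; this extends it to multiple dimensions. The bias is just an additive term. Sometimes several outputs are needed, for example predicting both the price and the size of a house from the same data.

What stands out is that with n data points there are n outputs in O, and each o is p-dimensional, matching the p dimensions of W.

In the diagram the arrows are W. There are d inputs x, so the arrows start from d nodes and end at p nodes. Adding the bias gives o.

X1 is d-dimensional, with components x1, x2, ..., xd, and the nodes in the diagram depict just this one X1. Its output O1 is likewise p-dimensional, with components o1, o2, ..., op.

The same goes for X2, which is also x1, x2, ..., xd, and so on.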
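
The shapes described above can be checked with a few lines of NumPy. This is only a sketch with made-up sizes (n=5, d=3, p=2) and random values; the names X, W, b and O follow the notes.

```python
import numpy as np

n, d, p = 5, 3, 2                 # hypothetical sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))       # one row per data point
W = rng.normal(size=(d, p))       # d inputs -> p outputs
b = rng.normal(size=(p,))         # one bias per output, broadcast over the rows

O = X @ W + b
print(O.shape)                    # n rows, p columns

# Row i of O depends only on row i of X.
o_0 = X[0] @ W + b
print(np.allclose(O[0], o_0))
```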

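Composing the softmax function with the output vector o gives something like a probability vector, so each entry can be read as the probability of belonging to a particular class k. In other words, softmax is the operation that converts the model's output into something interpretable as probabilities. Strictly speaking these are not true probabilities, but during training it is useful that each class gets a small nonzero value rather than exactly 0, because even a class with a very small value can occasionally be the correct one. Roughly: for data like this, it is this answer to about this degree and that answer to about that degree.

For classification problems, prediction combines a linear model with the softmax function.

e.g. [1,2,0] -> [0.24,0.67,0.09]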

import numpy as np

def softmax(vec):
    # subtract the row-wise max before exp for numerical stability (avoids overflow)
    numerator = np.exp(vec - np.max(vec, axis=-1, keepdims=True))
    denominator = np.sum(numerator, axis=-1, keepdims=True)
    print("numerator")
    print(numerator)
    print("denominator")
    print(denominator)
    val = numerator / denominator
    return val

vec = np.array([[1,2,0],[-1,0,1],[-10,0,10]])
print(softmax(vec))

numerator
[[3.67879441e-01 1.00000000e+00 1.35335283e-01]
 [1.35335283e-01 3.67879441e-01 1.00000000e+00]
 [2.06115362e-09 4.53999298e-05 1.00000000e+00]]
denominator
[[1.50321472]
 [1.50321472]
 [1.0000454 ]]
[[2.44728471e-01 6.65240956e-01 9.00305732e-02]
 [9.00305732e-02 2.44728471e-01 6.65240956e-01]
 [2.06106005e-09 4.53978686e-05 9.99954600e-01]]

At inference time, as opposed to training, there is no need for softmax; a one-hot vector is used instead. Only the largest entry is set to 1 and the rest to 0.
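
A minimal sketch of that one-hot step, assuming NumPy; the helper name one_hot is my own. Because softmax preserves the ordering of the entries, taking the argmax of the raw outputs picks the same class as taking it after softmax.

```python
import numpy as np

def one_hot(vec):
    """Return 1 at the position of each row's maximum and 0 elsewhere."""
    vec = np.asarray(vec)
    idx = np.argmax(vec, axis=-1)
    return np.eye(vec.shape[-1], dtype=int)[idx]

logits = np.array([[1, 2, 0], [-1, 0, 1], [-10, 0, 10]])
print(one_hot(logits))
```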

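Sigma (σ) is the activation function. Applying it to each element is what gives H.

So the activation function is computed last, on the linear output. The result is called the latent vector or hidden vector, and its elements are the parts we call neurons.

Activation functions are nonlinear functions.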

Without an activation function, deep learning is no different from a linear model.
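
This can be verified numerically. The sketch below (random weights, made-up sizes) stacks two linear layers with no activation in between and shows that they are equivalent to a single linear layer with merged weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=(5,))
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=(2,))

# Two linear layers with no activation in between...
two_layers = (X @ W1 + b1) @ W2 + b2

# ...equal one linear layer with merged weights and bias.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = X @ W + b

print(np.allclose(two_layers, one_layer))
```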

These days ReLU is used more often than sigmoid or tanh.
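
For reference, the three activation functions written out in NumPy as a minimal sketch (the function names are my own; tanh comes straight from NumPy).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))   # squashes into (0, 1)
print(np.tanh(x))   # squashes into (-1, 1)
print(relu(x))      # zero for negative inputs, identity otherwise
```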

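Layers are stacked and used like this.

A neural network is a function composed of linear models and activation functions.

A multi-layer perceptron (MLP) is a function composed of several such layers.

The O = XW + b from above now appears several times. To tell them apart we write W(1), W(2), ...

The layers pile up this way: the result after the activation function is fed back in as input, multiplied by the next weights, passed through the activation function again, and that result is fed in once more, and so on.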
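
A minimal forward-pass sketch of such a stack. The choices here are assumptions, not something the lecture notes fix: ReLU between layers, softmax only on the final output, random weights, and made-up layer sizes.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def mlp_forward(X, layers):
    """layers is a list of (W, b) pairs; ReLU is applied between layers."""
    H = X
    for W, b in layers[:-1]:
        H = relu(H @ W + b)        # hidden vector of this layer
    W, b = layers[-1]
    return softmax(H @ W + b)      # softmax only on the final output

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]               # hypothetical: 4 inputs, 3 classes
layers = [(rng.normal(size=(m, k)), np.zeros(k)) for m, k in zip(sizes, sizes[1:])]

probs = mlp_forward(rng.normal(size=(5, 4)), layers)
print(probs.shape, probs.sum(axis=-1))
```

Each (W, b) pair plays the role of W(1), W(2), ... and every row of the result sums to 1.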

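Training then uses backpropagation. As in the previous lecture, this is gradient descent: subtract the derivative. Gradient descent is applied to each W in every layer.

Starting from the output, the chain rule is used to work back to the starting point so that all the weights are learned. This seems to be the heart of training.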
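
To see the chain rule at work, here is a small sketch under my own assumptions (two layers, tanh activation, squared-error loss, no bias, random data). It computes the gradients by hand from the output backwards, checks one of them against a finite-difference estimate, and takes one gradient-descent step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Y = rng.normal(size=(6, 2))
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))

def loss(W1, W2):
    H = np.tanh(X @ W1)
    O = H @ W2
    return 0.5 * np.sum((O - Y) ** 2)

# Forward pass, keeping intermediate values for the backward pass.
H = np.tanh(X @ W1)
O = H @ W2

# Backward pass: chain rule from the output back toward the input.
dO = O - Y                      # dL/dO
dW2 = H.T @ dO                  # dL/dW2
dH = dO @ W2.T                  # dL/dH
dZ = dH * (1.0 - H ** 2)        # tanh'(z) = 1 - tanh(z)^2
dW1 = X.T @ dZ                  # dL/dW1

# Check one entry of dW1 against a central finite difference.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(W1_plus, W2) - loss(W1_minus, W2)) / (2 * eps)
print(dW1[0, 0], numeric)

# One gradient-descent step on every layer's weights.
lr = 1e-3
W1_new, W2_new = W1 - lr * dW1, W2 - lr * dW2
print(loss(W1, W2), loss(W1_new, W2_new))
```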

Reading this carefully is enough:

http://sanghyukchun.github.io/74/

==========================

Quiz

import sympy as sym
from sympy.abc import x, y

k_f = sym.poly((x + y) ** 2)
z_f = sym.poly((k_f + 3) ** 3)
diff_zx = sym.diff(z_f,x)

print(diff_zx)
print(diff_zx.subs(x,0).subs(y,0))
print(diff_zx.subs(x,1).subs(y,1))

Poly(6*x**5 + 30*x**4*y + 60*x**3*y**2 + 36*x**3 + 60*x**2*y**3 + 108*x**2*y + 30*x*y**4 + 108*x*y**2 + 54*x + 6*y**5 + 36*y**3 + 54*y, x, y, domain='ZZ')
0
588
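
The same two numbers can be reproduced by hand with the chain rule, as a quick sanity check. The helper name dz_dx is my own: with k = (x + y)^2 and z = (k + 3)^3, dz/dx = 3(k + 3)^2 * 2(x + y).

```python
def dz_dx(x, y):
    k = (x + y) ** 2          # inner function k(x, y)
    dz_dk = 3 * (k + 3) ** 2  # derivative of (k + 3)^3 with respect to k
    dk_dx = 2 * (x + y)       # derivative of (x + y)^2 with respect to x
    return dz_dk * dk_dx      # chain rule

print(dz_dx(0, 0))
print(dz_dx(1, 1))
```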

============================

Peer session

Mini-batches. With a batch size of 100, it seems you feed in 100 of the 10,000 samples, then the next 100 excluding the ones already used, and so on.
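
A sketch of that batching scheme, assuming (as guessed above) that samples are drawn without replacement within one pass over the data. The helper name minibatches is hypothetical.

```python
import numpy as np

def minibatches(n_samples, batch_size, rng):
    """Yield index arrays that together cover every sample exactly once."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatches(10_000, 100, rng))
print(len(batches), len(batches[0]))
```

Each index array would then be used to slice the data, e.g. X[batch], for one gradient step.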

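SGD. Efficient for training. The gradient descent we learned yesterday is what implements neural networks and deep learning.

Gradient descent aims to move toward lower error. It makes training possible regardless of the number of dimensions.

It can be obtained quickly by squaring.

The problem can also be recast and solved with Dijkstra. Disjoint set.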

===========================

Reflections

It connected with what we learned the day before, so things made sense, and that felt good.

빛의 블로그

Wednesday, January 27, 2021

AITech study notes - [DAY 8] Pandas I / Understanding how deep learning learns