Logistic Regression Model
Logistic regression: parameter estimation and code implementation
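1. The Logistic regression model
Logistic regression was proposed by the statistician David Cox (1958). In essence, it fits the data to a logistic model in order to predict the probability that an event occurs. Because the dependent variable is binary (multi-category extensions also exist), the model can represent the probability that a given event does or does not occur.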
Let the dependent variable $y$ take values in $\{0,1\}$, and let $x_1,x_2,\dots,x_p$ be the explanatory variables for $y$. Logistic regression studies how $X=(x_1,x_2,\dots,x_p)$ affects $y$. Write
$$p = P(y=1\mid X), \qquad 1-p = P(y=0\mid X).$$
The ratio $p/(1-p)$ is called the odds. The expectation of the dependent variable is
$$E(y\mid X) = 1\cdot p+0\cdot(1-p)=p.$$
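Following the usual linear-model approach, one would write
$$y = \beta_0+\beta_1x_1+\dots+\beta_px_p+\varepsilon,$$
where $\varepsilon$ is the disturbance term. Estimating this equation by OLS gives the linear probability model (LPM). However, because $y$ takes only the values 0 and 1, the disturbance $\varepsilon$ can only equal $1-X'\beta$ or $-X'\beta$ for a given $X$, so its distribution depends on $X$. This leads to heteroskedasticity (and, if the linear form is misspecified, to endogeneity). In addition, at extreme values of $X$ the linear model can produce fitted values with $\hat{y}<0$ or $\hat{y}>1$, which cannot be interpreted as probabilities. We therefore introduce a link function such that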
$$\left\{\begin{array}{l} P(y=1\mid X) =\Lambda(X,\beta)\\ P(y=0\mid X) =1-\Lambda(X,\beta) \end{array}\right.$$
where $\Lambda(\cdot)$ denotes the link function and $\beta$ is the parameter vector. The link function can be either the standard normal cumulative distribution function or the logistic distribution function: the former gives the Probit model, the latter the Logit model. Because the standard normal CDF has no closed-form expression, the logistic distribution function is generally used, that is,
$$\begin{aligned} P(y=1\mid X) &= p = \frac{\exp(X'\beta)}{1+\exp(X'\beta)}\\ &=\frac{\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)}{1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)} \end{aligned}$$
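The density of the logistic distribution is symmetric about the origin, with mean 0 and variance $\pi^2/3$, and it has heavier tails than the standard normal. From the expression above we obtain the log odds
$$\ln\left(\frac{p}{1-p}\right)=\beta_0+\beta_1x_1+\dots+\beta_px_p$$
This shows that, holding everything else constant, a one-unit change in $x_i$ changes the log odds by $\beta_i$ units; it does not change the dependent variable itself by $\beta_i$ units.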
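To make the log-odds interpretation concrete, here is a minimal Python sketch (Python is used purely for illustration; the helper names `logistic` and `log_odds` are hypothetical). It checks that raising a regressor by one unit shifts the log odds by $\beta_i$ and multiplies the odds by $\exp(\beta_i)$, while the change in the probability itself depends on the starting point. The age coefficient is taken from the logit output reported later in this article; `base_index` is an arbitrary assumed value of $X'\beta$.

```python
import math

def logistic(z):
    """Logistic CDF: maps a linear index z = X'beta to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Inverse of logistic(): ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

beta_age = 0.0579303   # logit coefficient on age from the Stata output in this article
base_index = 0.5       # arbitrary illustrative value of X'beta (an assumption)

p0 = logistic(base_index)             # probability before the change
p1 = logistic(base_index + beta_age)  # probability after age rises by one year

odds0 = p0 / (1.0 - p0)
odds1 = p1 / (1.0 - p1)

print(f"log-odds change:    {log_odds(p1) - log_odds(p0):.7f}")  # equals beta_age
print(f"odds multiplier:    {odds1 / odds0:.6f}")                # equals exp(beta_age)
print(f"probability change: {p1 - p0:.6f}")                      # varies with base_index
```

The odds multiplier printed here, $\exp(0.0579303)\approx 1.0596$, is the quantity Stata reports in the "Odds Ratio" column when `logit` is run with the `or` option.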
2. Parameter estimation
Since $y$ follows a Bernoulli (0-1) distribution, its probability function can be written as
$$P(y) = p^y(1-p)^{1-y},\quad y=0,1$$
The likelihood function, with the product taken over the observations in the sample, is
$$L= \prod P(y) = \prod p^y(1-p)^{1-y}$$
Taking logarithms gives
$$\begin{aligned} \ln L &= \sum[y\ln p+(1-y)\ln (1-p)]\\ &=\sum\left[y\ln \frac{p}{1-p}+\ln(1-p)\right] \end{aligned}$$
Substituting the expression for $p$, and noting that $1-p = 1/[1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)]$ so that $\ln(1-p) = -\ln[1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)]$, gives
$$\begin{aligned} \ln L = &\sum\{y(\beta_0+\beta_1x_1+\dots+\beta_px_p)\\ &-\ln[1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)]\} \end{aligned}$$
The first-order conditions are
$$\frac{\partial\ln L}{\partial\beta_j} =0,\quad j=0,1,\dots,p$$
Solving these equations yields the maximum likelihood estimates $\hat{\beta}_j$ $(j=0,1,\dots,p)$. Substituting $\hat{\beta}_j$ back into $P(y=1\mid X)$ gives
$$\hat{P}(y=1\mid X) =\frac{\exp(\hat{\beta}_0+\hat{\beta}_1x_1+\dots+\hat{\beta}_px_p)}{1+\exp(\hat{\beta}_0+\hat{\beta}_1x_1+\dots+\hat{\beta}_px_p)}$$
and, correspondingly,
$$\hat{P}(y=0\mid X) =\frac{1}{1+\exp(\hat{\beta}_0+\hat{\beta}_1x_1+\dots+\hat{\beta}_px_p)}$$
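The first-order conditions have no closed-form solution, so in practice they are solved numerically. Differentiating the log-likelihood gives the score $\partial \ln L/\partial\beta_j=\sum (y-p)\,x_j$ (with $x_0=1$), and the second derivatives are $-\sum p(1-p)\,x_jx_k$. The following is a minimal sketch of a Newton-Raphson iteration for the special case of an intercept plus a single regressor, written in plain Python. The function name `fit_logit_1d`, the simulated data and the "true" coefficients are all illustrative assumptions; this is not a description of how any particular package implements its estimator.

```python
import math
import random

def fit_logit_1d(x, y, tol=1e-10, max_iter=50):
    """Newton-Raphson for a logit model with an intercept and one regressor.

    x: list of floats; y: list of 0/1 values. Returns (b0, b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Score vector (g0, g1) and the elements of minus the Hessian (h00, h01, h11)
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # fitted P(y=1|x)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system (-H) * step = score by Cramer's rule
        det = h00 * h11 - h01 * h01
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + s0, b1 + s1          # Newton update
        if max(abs(s0), abs(s1)) < tol:    # stop once the step is negligible
            break
    return b0, b1

# Simulated sample with known coefficients (illustrative values, not from the article)
random.seed(0)
true_b0, true_b1 = -0.5, 1.0
x = [random.gauss(0.0, 1.0) for _ in range(5000)]
y = [1 if random.random() < 1.0 / (1.0 + math.exp(-(true_b0 + true_b1 * xi))) else 0
     for xi in x]

b0_hat, b1_hat = fit_logit_1d(x, y)
print(b0_hat, b1_hat)  # should be close to (-0.5, 1.0), up to sampling error
```

Because the logit log-likelihood is globally concave, the iteration typically converges in a handful of steps from a zero starting value, provided the data are not perfectly separated.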
3. Software implementation
Taking the dataset womenwk as an example, we build the following model:
$$\text{work}_i=\beta_0+\beta_1\,\text{age}_i+\beta_2\,\text{married}_i+\beta_3\,\text{children}_i+\beta_4\,\text{education}_i+\varepsilon_i$$
where work indicates whether the woman is employed, age is her age, married is marital status, children is the number of children, and education is years of schooling.
The Stata code is as follows:
*------------------------ Logistic regression --------------------
cd "D:\master\笔记\markdown笔记\计量经济学\二值选择模型"
use womenwk.dta,clear
* Variable definitions:
* dataset: womenwk
* work: whether employed
* age: age
* married: marital status
* children: number of children
* education: years of schooling
*---------------------------LPM estimation----------------------
reg work age married children education,r
/*
Linear regression Number of obs = 2,000
F(4, 1995) = 192.58
Prob > F = 0.0000
R-squared = 0.2026
Root MSE = .41992
------------------------------------------------------------------------------
| Robust
work | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0102552 .0012236 8.38 0.000 .0078556 .0126548
married | .1111116 .0226719 4.90 0.000 .0666485 .1555748
children | .1153084 .0056978 20.24 0.000 .1041342 .1264827
education | .0186011 .0033006 5.64 0.000 .0121282 .025074
_cons | -.2073227 .0534581 -3.88 0.000 -.3121622 -.1024832
------------------------------------------------------------------------------
*/
*-----------------------------logit regression-----------------------------------
logit work age married children education,nolog
/*
Logistic regression Number of obs = 2,000
LR chi2(4) = 476.62
Prob > chi2 = 0.0000
Log likelihood = -1027.9144 Pseudo R2 = 0.1882
------------------------------------------------------------------------------
work | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0579303 .007221 8.02 0.000 .0437773 .0720833
married | .7417775 .1264705 5.87 0.000 .4938998 .9896552
children | .7644882 .0515289 14.84 0.000 .6634935 .865483
education | .0982513 .0186522 5.27 0.000 .0616936 .134809
_cons | -4.159247 .3320401 -12.53 0.000 -4.810034 -3.508461
------------------------------------------------------------------------------
*/
* logit with robust standard errors
logit work age married children education,nolog r
/*
Logistic regression Number of obs = 2,000
Wald chi2(4) = 344.54
Prob > chi2 = 0.0000
Log pseudolikelihood = -1027.9144 Pseudo R2 = 0.1882
------------------------------------------------------------------------------
| Robust
work | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0579303 .0072054 8.04 0.000 .0438079 .0720527
married | .7417775 .1272191 5.83 0.000 .4924326 .9911224
children | .7644882 .0497584 15.36 0.000 .6669635 .8620129
education | .0982513 .019011 5.17 0.000 .0609904 .1355121
_cons | -4.159247 .327398 -12.70 0.000 -4.800936 -3.517559
------------------------------------------------------------------------------
*/
* report odds ratios instead of coefficients
logit work age married children education,nolog or
/*
Logistic regression Number of obs = 2,000
LR chi2(4) = 476.62
Prob > chi2 = 0.0000
Log likelihood = -1027.9144 Pseudo R2 = 0.1882
------------------------------------------------------------------------------
work | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | 1.059641 .0076517 8.02 0.000 1.04475 1.074745
married | 2.099664 .2655457 5.87 0.000 1.638694 2.690307
children | 2.147895 .1106786 14.84 0.000 1.941563 2.376153
education | 1.10324 .0205779 5.27 0.000 1.063636 1.144318
_cons | .0156193 .0051862 -12.53 0.000 .0081476 .029943
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
*/
*---------------------marginal effects-----------------
* marginal effects evaluated at the sample means
margins,dydx(*) atmeans
/*
Conditional marginal effects Number of obs = 2,000
Model VCE : OIM
Expression : Pr(work), predict()
dy/dx w.r.t. : age married children education
at : age = 36.208 (mean)
married = .6705 (mean)
children = 1.6445 (mean)
education = 13.084 (mean)
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0115031 .0014236 8.08 0.000 .0087129 .0142934
married | .1472934 .0248209 5.93 0.000 .0986453 .1959415
children | .151803 .0093768 16.19 0.000 .1334249 .1701812
education | .0195096 .0036991 5.27 0.000 .0122596 .0267596
------------------------------------------------------------------------------
*/
*---------------------marginal effects at a specified covariate value-------------------
margins,dydx(*) at(age =30)
/*
Average marginal effects Number of obs = 2,000
Model VCE : OIM
Expression : Pr(work), predict()
dy/dx w.r.t. : age married children education
at : age = 30
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .011179 .0014719 7.59 0.000 .008294 .0140639
married | .1431427 .0232525 6.16 0.000 .0975687 .1887167
children | .1475253 .0074033 19.93 0.000 .1330151 .1620355
education | .0189598 .0034727 5.46 0.000 .0121534 .0257662
------------------------------------------------------------------------------
*/
*------------------share of correct predictions------------------
estat clas
/*
Logistic model for work
-------- True --------
Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 1177 361 | 1538
- | 166 296 | 462
-----------+--------------------------+-----------
Total | 1343 657 | 2000
Classified + if predicted Pr(D) >= .5
True D defined as work != 0
--------------------------------------------------
Sensitivity Pr( +| D) 87.64%
Specificity Pr( -|~D) 45.05%
Positive predictive value Pr( D| +) 76.53%
Negative predictive value Pr(~D| -) 64.07%
--------------------------------------------------
False + rate for true ~D Pr( +|~D) 54.95%
False - rate for true D Pr( -| D) 12.36%
False + rate for classified + Pr(~D| +) 23.47%
False - rate for classified - Pr( D| -) 35.93%
--------------------------------------------------
Correctly classified 73.65%
--------------------------------------------------
*/
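Two of the post-estimation results above can be re-derived by hand, which helps clarify what Stata is computing. For the logit model, the marginal effect of a continuous regressor is $\partial P/\partial x_j=\beta_j\,p(1-p)$; with `atmeans`, $p$ is evaluated at the sample means. The classification rates follow directly from the four cells of the `estat clas` table. The sketch below (plain Python, used only for illustration) copies the coefficients, means and cell counts from the output above; the variable names are my own.

```python
import math

# Logit coefficients and sample means copied from the Stata output above
coef = {"age": 0.0579303, "married": 0.7417775,
        "children": 0.7644882, "education": 0.0982513}
cons = -4.159247
means = {"age": 36.208, "married": 0.6705,
         "children": 1.6445, "education": 13.084}

# Marginal effect at the means: dP/dx_j = beta_j * p * (1 - p), p evaluated at x-bar
index = cons + sum(coef[k] * means[k] for k in coef)
p_bar = 1.0 / (1.0 + math.exp(-index))
marginal = {k: coef[k] * p_bar * (1.0 - p_bar) for k in coef}
for k, v in marginal.items():
    print(f"{k:<10} dy/dx at means = {v:.5f}")

# Classification table from `estat clas` (rows: classified +/-, columns: true D / ~D)
tp, fp = 1177, 361   # classified +
fn, tn = 166, 296    # classified -
sensitivity = tp / (tp + fn)                 # Pr(+ | D)
specificity = tn / (tn + fp)                 # Pr(- | ~D)
accuracy = (tp + tn) / (tp + fp + fn + tn)   # share correctly classified
print(f"sensitivity = {sensitivity:.2%}, specificity = {specificity:.2%}, "
      f"correctly classified = {accuracy:.2%}")
```

The computed marginal effects agree with the `margins, dydx(*) atmeans` table to about four decimal places (the means in the output are rounded). Note that married is treated here as a continuous regressor, which matches how it was entered in the `logit` command. The high sensitivity (87.64%) alongside low specificity (45.05%) shows that the 73.65% overall accuracy is driven mainly by correctly predicting those who work.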
References:
Chen, Qiang (2014). Advanced Econometrics and Stata Applications (高级计量经济学及Stata应用), 2nd edition.