KNN,英文全称为K-nearst neighbor,中文名称为K近邻算法,它是由Cover和Hart在1968年提出来的。
  KNN算法流程:
  输入:训练数据集  

T=(x1,y1),(x2,y2),...,(xN,yN)<script type="math/tex" id="MathJax-Element-210">T={(x_1,y_1),(x_2,y_2),...,(x_N,y_N)}</script>

  其中,xiXRn<script type="math/tex" id="MathJax-Element-211">x_i \in \mathcal{X} \subseteq R^n</script>为实例的特征向量,yiY={c1,c2,...,ck}<script type="math/tex" id="MathJax-Element-212">y_i \in \mathcal{Y}=\{c_1,c_2,...,c_k\}</script>为实例的类别,i=1,2,...,N<script type="math/tex" id="MathJax-Element-213">i=1,2,...,N</script>;实例特征向量x<script type="math/tex" id="MathJax-Element-214">x</script>;
  输出: 实例x<script type="math/tex" id="MathJax-Element-215">x</script>所属的类y<script type="math/tex" id="MathJax-Element-216">y</script>
  (1) 根据给点的距离度量,在训练集T<script type="math/tex" id="MathJax-Element-217">T</script>中找出与x<script type="math/tex" id="MathJax-Element-218">x</script>最近邻的k<script type="math/tex" id="MathJax-Element-219">k</script>个点,涵盖着k<script type="math/tex" id="MathJax-Element-220">k</script>个点的领域,记为Nk(x)<script type="math/tex" id="MathJax-Element-221">N_k(x)</script>;
  (2) 在Nk(x)<script type="math/tex" id="MathJax-Element-222">N_k(x)</script>中根据分类决策规则(如多数表决),决定x<script type="math/tex" id="MathJax-Element-223">x</script>的类别y<script type="math/tex" id="MathJax-Element-224">y</script>:
  
y=argmaxcjxiNk(x)I(yi=cj),i=1,2,...,N;<script type="math/tex" id="MathJax-Element-225"> y=arg \mathop{\; max}_{c_j} {\sum}_{x_i \in N_k(x)} \; I(y_i=c_j) , i=1,2,...,N;</script>

  
  在上式中,I<script type="math/tex" id="MathJax-Element-226">I</script>为指示函数,即当yi=cj<script type="math/tex" id="MathJax-Element-227">y_i=c_j</script>时,I<script type="math/tex" id="MathJax-Element-228">I</script>为1,否则I<script type="math/tex" id="MathJax-Element-229">I</script>为0。
  KNN特殊情况是k=1的情形,称为最近邻算法。对于输入的实例点(特征向量)x<script type="math/tex" id="MathJax-Element-230">x</script>,最近邻算法将训练数据集中与x<script type="math/tex" id="MathJax-Element-231">x</script>最近邻点的类作为x<script type="math/tex" id="MathJax-Element-232">x</script>的类。
  在KNN算法中,常用的距离有三种,分别为曼哈顿距离、欧式距离和闵可夫斯基距离。
  设特征空间X<script type="math/tex" id="MathJax-Element-233">\mathcal{X}</script>是n维实数向量空间Rn<script type="math/tex" id="MathJax-Element-234">R_n</script>, xi,xjX,xi=(x(1)i,x(2)i,...,x(n)i)T<script type="math/tex" id="MathJax-Element-235">x_i,x_j \in \mathcal{X}, \quad x_i=(x_i^{(1)},x_i^{(2)},...,x_i^{(n)})^T </script>, xj=(x(1)j,x(2)j,...,x(n)j)T<script type="math/tex" id="MathJax-Element-236">x_j=(x_j^{(1)},x_j^{(2)},...,x_j^{(n)})^T</script>, xi,xj<script type="math/tex" id="MathJax-Element-237">x_i,x_j</script>的Lp<script type="math/tex" id="MathJax-Element-238">L_p</script>距离定义为:
  
Lp(xi,xj)=(nl=1|x(l)ix(l)j|p)1p<script type="math/tex" id="MathJax-Element-239">L_p(x_i,x_j) = (\sum_{l=1}^{n} \; |x_i^{(l)}-x_j^{(l)}|^p)^{\frac{1}{p}} </script>

  这里 p1<script type="math/tex" id="MathJax-Element-240">p \ge 1</script>
  当p=1<script type="math/tex" id="MathJax-Element-241">p=1</script>时,称为曼哈顿距离(Manhattan distance), 公式为:
  
L1(xi,xj)=nl=1|x(l)ix(l)j|<script type="math/tex" id="MathJax-Element-242">L_1(x_i,x_j)= \sum_{l=1}^{n} |x_i^{(l)}-x_j^{(l)}| </script>

  当p=2<script type="math/tex" id="MathJax-Element-243">p=2</script>时,称为欧式距离(Euclidean distance),即
  
L2(xi,xj)=(nl=1|x(l)ix(l)j|2)12<script type="math/tex" id="MathJax-Element-244">L_2(x_i,x_j) = (\sum_{l=1}^{n} \; |x_i^{(l)}-x_j^{(l)}|^{2})^{\frac{1}{2}} </script>

  当p=<script type="math/tex" id="MathJax-Element-245">p=\infty</script>时,它是各个坐标距离的最大值,计算公式为:
  
L(xi,xj)=maxl|x(l)ix(l)j|<script type="math/tex" id="MathJax-Element-246"> L_{\infty}(x_i,x_j)= \mathop{max}_l \; |x_i^{(l)}-x_j^{(l)}|</script>

  案例1,已知二维空间的3个点 x1=(1,1)T<script type="math/tex" id="MathJax-Element-247">x_1=(1,1)^T</script>, x2=(5,1)T<script type="math/tex" id="MathJax-Element-248">x_2=(5,1)^T</script>, x3=(4,4)T<script type="math/tex" id="MathJax-Element-249">x_3=(4,4)^T</script>, 试求在p<script type="math/tex" id="MathJax-Element-250">p</script>取不同值时,Lp<script type="math/tex" id="MathJax-Element-251">L_p</script>距离下x1<script type="math/tex" id="MathJax-Element-252">x_1</script>的最近邻点。
  解析:对于x1<script type="math/tex" id="MathJax-Element-253">x_1</script>与x2,<script type="math/tex" id="MathJax-Element-254">x_2,</script>由于x1<script type="math/tex" id="MathJax-Element-255">x_1</script>与x2115<script type="math/tex" id="MathJax-Element-256">x_2在第1维上的数字分别为1、5,</script>在第2维上 数字都是1,所以计算x1<script type="math/tex" id="MathJax-Element-257">x_1</script>与x2<script type="math/tex" id="MathJax-Element-258">x_2</script>的距离时只需计算x(1)1<script type="math/tex" id="MathJax-Element-259">x_1^{(1)}</script>和x(1)2<script type="math/tex" id="MathJax-Element-260">x_2^{(1)}</script>即可,Lp(x1,x2)=4<script type="math/tex" id="MathJax-Element-261">L_p(x_1,x_2)=4</script>.
  对于x1<script type="math/tex" id="MathJax-Element-262">x_1</script>与x3<script type="math/tex" id="MathJax-Element-263">x_3</script>, 由于x1<script type="math/tex" id="MathJax-Element-264">x_1</script>与x3<script type="math/tex" id="MathJax-Element-265">x_3</script>在第1维上的数字不相同,在第2维上的数字也不相同,则x1<script type="math/tex" id="MathJax-Element-266">x_1</script>与x3<script type="math/tex" id="MathJax-Element-267">x_3</script>的曼哈顿距离为:
  
L1(x1,x3)=nl=1|x(l)ix(l)j|=2l=1|x(l)ix(l)j|=3+3=6<script type="math/tex" id="MathJax-Element-268">L_1(x_1,x_3) =\sum_{l=1}^{n} |x_i^{(l)}-x_j^{(l)}| = \sum_{l=1}^{2} |x_i^{(l)}-x_j^{(l)}| =3+3=6 </script>

  则x1<script type="math/tex" id="MathJax-Element-269">x_1</script>与x3<script type="math/tex" id="MathJax-Element-270">x_3</script>的欧式距离为:
  
L2(xi,xj)=(nl=1|x(l)ix(l)j|2)12=(2l=1|x(l)ix(l)j|2)12=32=42.4<script type="math/tex" id="MathJax-Element-271">L_2(x_i,x_j) = (\sum_{l=1}^{n} \; |x_i^{(l)}-x_j^{(l)}|^{2})^{\frac{1}{2}}=(\sum_{l=1}^{2} \; |x_i^{(l)}-x_j^{(l)}|^{2})^{\frac{1}{2}} = 3\sqrt{2}=42.4 </script>

  则x1<script type="math/tex" id="MathJax-Element-272">x_1</script>与x3<script type="math/tex" id="MathJax-Element-273">x_3</script>的L3<script type="math/tex" id="MathJax-Element-274">L_3</script>距离为:
  
L3(xi,xj)=(nl=1|x(l)ix(l)j|3)13=3.78<script type="math/tex" id="MathJax-Element-275">L_3(x_i,x_j) = (\sum_{l=1}^{n} \; |x_i^{(l)}-x_j^{(l)}|^3)^{\frac{1}{3}} = 3.78</script>

  在Matlab,可以直接求两个向量之间的距离。
  设xa=(1,1)<script type="math/tex" id="MathJax-Element-276">x_a=(1,1)</script>, xa=(4,4)<script type="math/tex" id="MathJax-Element-277">x_a=(4,4)</script>,向量xa,xb<script type="math/tex" id="MathJax-Element-278">x_a, \; x_b</script>组成矩阵D =[1 1; 4 4]
  (a)求向量(1,1)、(5,1)的曼哈顿距离

D = [1 1; 4 4];
%%求曼哈顿距离
res = pdist(D, 'cityblock')

  如图(1)所示:

这里写图片描述
图(1) 使用pdist( XXX , ‘cityblock’)求曼哈顿距离

  (b)求向量(1,1)、(5,1)的欧式距离
  在Minkowski distance公式中,当p=2时,就是欧式距离,而Minikowski的函数为 pdist(XXX, ‘minkowski’,2),代码如下:   

D = [1 1; 4 4]
%%求欧式距离
res = pdist(D, 'minkowski',2)

  如图(2)所示:

这里写图片描述
图(2) 使用pdist(XXX, ‘minkowski’,2)求曼哈顿距离

  (c)求向量(1,1)、(5,1)的L3<script type="math/tex" id="MathJax-Element-70">L_3</script>距离
  调用pdist(XXX, ‘minkowski’,3),代码如下:  

D = [1 1; 4 4];
%%求L3类型的距离
res = pdist(D, 'minkowski',3)

  如图(3)所示:

这里写图片描述
图(3) 求L3<script type="math/tex" id="MathJax-Element-71">L_3</script>类型的距离

Logo

有“AI”的1024 = 2048,欢迎大家加入2048 AI社区

更多推荐