[Python] Reading and Processing the CiteSeer Dataset

智慧的旋风 · Published 2020-10-31 19:49:15

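To my surprise, the papers report 3327 nodes for the CiteSeer dataset, yet every copy of the dataset I could find contains only 3312 nodes. The missing nodes also caused quite a few errors in my program.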

CiteSeer for Document Classification
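- The CiteSeer dataset consists of 3312 scientific publications, each classified into one of six classes. The citation network consists of 4732 links. Each publication is described by a 0/1-valued word vector indicating whether the corresponding dictionary word is present. The dictionary consists of 3703 unique words. The README file in the dataset provides more details.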
- Download link:
  - https://linqs-data.soe.ucsc.edu/public/lbc/citeseer.tgz
- The papers are classified into one of the following six classes:
  - Agents
  - AI
  - DB
  - IR
  - ML
  - HCI
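
To make the file layout concrete, here is a minimal sketch on made-up data. It assumes the layout implied by the shapes reported later in this post: each tab-separated line of `citeseer.content` holds a paper ID, then the 0/1 word flags, then the class label, and each line of `citeseer.cites` holds two paper IDs. The toy IDs and the three-word vocabulary are invented for illustration only.

```python
import io

import pandas as pd

# Toy stand-in for citeseer.content: <paper_id> \t <0/1 word flags> \t <class_label>
# (IDs and the 3-word vocabulary are made up; the real file has 3703 word columns.)
toy_content = (
    "100157\t0\t1\t0\tAgents\n"
    "zamir99grouper\t1\t0\t0\tIR\n"
    "364207\t0\t0\t1\tDB\n"
)
# Toy stand-in for citeseer.cites: two paper IDs per line.
toy_cites = "100157\tzamir99grouper\n364207\t100157\n"

# dtype={0: str} keeps the ID column as strings even when some IDs look numeric.
content = pd.read_csv(io.StringIO(toy_content), sep="\t", header=None, dtype={0: str})
cites = pd.read_csv(io.StringIO(toy_cites), sep="\t", header=None, dtype=str)

ids = content.iloc[:, 0]       # first column: paper IDs
words = content.iloc[:, 1:-1]  # middle columns: binary word vector
classes = content.iloc[:, -1]  # last column: class label

print(content.shape, cites.shape)
```

With the real files, the same three slices give the ID column, the 3703-column feature block, and the label column.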

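import numpy as np
import pandas as pd

# citeseer.content: one row per paper (paper ID, 3703 word flags, class label).
# dtype={0: str} reads the ID column as strings, since not every ID is numeric.
cs_content = pd.read_csv('./data/citeseer/citeseer.content', sep='\t', header=None, dtype={0: str})
cs_content.shape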

(3312, 3705)

# citeseer.cites: one row per citation link (two paper IDs), read as strings
cs_cite = pd.read_csv('./data/citeseer/citeseer.cites', sep='\t', header=None, dtype=str)
cs_cite.shape

(4732, 2)
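
One way to look into the missing-node problem mentioned at the top is to compare the IDs referenced in `citeseer.cites` with the IDs that have a row in `citeseer.content`. The sketch below does this on made-up IDs; `find_missing_ids` is a hypothetical helper written for this post, not part of the dataset tooling.

```python
def find_missing_ids(content_ids, cite_pairs):
    """Return the sorted IDs that occur in cite_pairs but not in content_ids."""
    known = {str(i) for i in content_ids}
    referenced = {str(i) for pair in cite_pairs for i in pair}
    return sorted(referenced - known)


# Made-up example: 'ghost01' is cited but has no row in the content file.
content_ids = [100157, "zamir99grouper", 364207]
cite_pairs = [(100157, "zamir99grouper"), ("ghost01", 364207)]

missing = find_missing_ids(content_ids, cite_pairs)
print(missing)
```

On the real data, the call would be `find_missing_ids(cs_content.iloc[:, 0], zip(cs_cite[0], cs_cite[1]))`.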

ct_idx = list(cs_content.index)
paper_id = list(cs_content.iloc[:,0])
paper_id = [str(i) for i in paper_id] # convert every paper ID to a string: not all paper_ids are integers!
mp = dict(zip(paper_id,ct_idx)) # paper ID -> row index
mp['zamir99grouper']
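
The reason for the string conversion can be shown with a small sketch on made-up IDs: if some IDs are parsed as integers and others as strings, a mapping keyed by the raw values will not find `'100157'` when it is looked up as a string.

```python
# Made-up IDs: some parse as integers, some are strings.
raw_ids = [100157, "zamir99grouper", 364207]

# Normalising every ID to str gives one consistent key type.
id_to_index = {str(pid): idx for idx, pid in enumerate(raw_ids)}
index_to_id = {idx: pid for pid, idx in id_to_index.items()}

# A naive mapping keyed by the raw values misses string lookups of numeric IDs.
naive = {pid: idx for idx, pid in enumerate(raw_ids)}

print(id_to_index["100157"], "100157" in naive)
```

The reverse mapping `index_to_id` is handy for turning a row index back into a paper ID when debugging.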

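label = cs_content.iloc[:,-1] # last column: class label
label = pd.get_dummies(label, dtype=int) # one-hot encode; dtype=int gives 0/1 instead of booleans on recent pandas
label.shape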

(3312, 6)
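
Some training code expects integer class indices rather than one-hot rows. As a sketch that is not part of the original workflow, the conversion in both directions can be done with NumPy alone. The labels below are made up, and the column order is the sorted order returned by `np.unique`, which may need to be checked against the column order of the one-hot frame above.

```python
import numpy as np

# Made-up labels using the six class names listed above.
labels = np.array(["IR", "Agents", "DB", "IR", "ML", "AI", "HCI"])

# np.unique returns the sorted class names and, for each label, its class index.
classes, class_idx = np.unique(labels, return_inverse=True)

# One-hot matrix: one row per paper, one column per class.
one_hot = np.eye(len(classes), dtype=int)[class_idx]

# argmax along each row recovers the integer class index.
recovered = one_hot.argmax(axis=1)

print(classes)
print(one_hot.shape)
```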

feature = cs_content.iloc[:,1:-1] # columns between the ID and the label: 0/1 word flags
feature.shape

(3312, 3703)

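mlen = cs_content.shape[0]
adj = np.zeros((mlen,mlen))

for i,j in zip(cs_cite[0],cs_cite[1]):
    # The dataset is inconsistent: citeseer.cites contains paper_ids that never appear in citeseer.content, so skip those links
    if str(i) in mp and str(j) in mp:
        x = mp[str(i)]
        y = mp[str(j)]
        adj[x, y] = adj[y, x] = 1 # treat each citation as an undirected edge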

adj

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
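
The loop above silently drops links whose endpoints are unknown. A small variation, sketched here on made-up data, also records which links were skipped so they can be inspected afterwards. `build_adjacency` is a hypothetical helper name. Note that a self-citation such as `("p0", "p0")` puts a 1 on the diagonal, which is one way an entry like the top-left 1 in the output above can arise.

```python
import numpy as np


def build_adjacency(node_ids, edges):
    """Build a symmetric 0/1 adjacency matrix; return it with the skipped edges."""
    index = {str(n): i for i, n in enumerate(node_ids)}
    adj = np.zeros((len(index), len(index)))
    skipped = []
    for a, b in edges:
        a, b = str(a), str(b)
        if a in index and b in index:
            i, j = index[a], index[b]
            adj[i, j] = adj[j, i] = 1
        else:
            skipped.append((a, b))  # at least one endpoint has no feature row
    return adj, skipped


# Made-up graph: one self-citation and one link to an unknown paper.
node_ids = ["p0", "p1", "p2"]
edges = [("p0", "p1"), ("p1", "p2"), ("p0", "p0"), ("p2", "ghost")]

adj, skipped = build_adjacency(node_ids, edges)
print(adj)
print(skipped)
```

If self-loops are not wanted, `np.fill_diagonal(adj, 0)` clears the diagonal in place.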

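# Convert everything to plain NumPy arrays for downstream use
feature = np.array(feature)
label = np.array(label)
adj = np.array(adj) # adj is already an ndarray, so this only makes a copy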
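
A dense 3312 x 3312 float64 matrix takes roughly 88 MB (3312 * 3312 * 8 bytes), while the graph has only a few thousand links. As a sketch under that observation, the dense matrix can be turned into a compact edge list, and nodes without any link can be listed, using NumPy only. The 4-node matrix below is made up and stands in for `adj`.

```python
import numpy as np

# Small made-up symmetric adjacency matrix standing in for adj.
adj = np.array([
    [0., 1., 0., 0.],
    [1., 0., 1., 0.],
    [0., 1., 0., 0.],
    [0., 0., 0., 0.],  # isolated node
])

# Each undirected edge appears twice in a symmetric matrix; keep the upper triangle.
rows, cols = np.nonzero(np.triu(adj))
edge_list = np.stack([rows, cols], axis=1)

degree = adj.sum(axis=1)                # number of neighbours per node
isolated = np.flatnonzero(degree == 0)  # nodes with no edges at all

print(edge_list)
print(isolated)
```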
