Loading word vector files with gensim: gensim's own format and the GloVe format

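I have several kinds of word2vec files on hand and have been using them a lot lately, but they differ in details such as whether the file is binary, so I am writing down how to load each one.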
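Since the main thing that differs between these files is whether they are binary, here is a rough sketch of a check that can be run before deciding on `binary=True` or `binary=False`. The function name `looks_like_binary_word2vec` and the `probe_bytes` parameter are my own (hypothetical, not part of gensim), and the check is only a heuristic: it assumes the file starts with the usual `<vocab_size> <dim>` header line and then tests whether the next line parses as one word followed by `dim` floats.

```python
def looks_like_binary_word2vec(path, probe_bytes=65536):
    """Heuristically guess whether a word2vec-format file is binary.

    Assumption: the file starts with a "<vocab_size> <dim>" header line.
    """
    with open(path, 'rb') as f:
        header = f.readline().split()
        if len(header) != 2:
            raise ValueError("no '<vocab_size> <dim>' header found")
        dim = int(header[1])
        chunk = f.read(probe_bytes)
    # Only inspect the first line after the header.
    first_line = chunk.split(b'\n', 1)[0]
    try:
        tokens = first_line.decode('utf-8').split()
        # Text format: one word followed by `dim` decimal numbers.
        for token in tokens[1:]:
            float(token)
        return len(tokens) != dim + 1
    except ValueError:
        # Undecodable bytes or non-numeric fields suggest packed floats.
        # (UnicodeDecodeError is a subclass of ValueError.)
        return True
```

The result can be passed straight through, for example `load_word2vec_format(path, binary=looks_like_binary_word2vec(path))`. A GloVe file straight from the official site has no header line, so this check does not apply to it until it has been converted.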

My word vector files

Google News, 300 dimensions, English. Usage:

model = gensim.models.KeyedVectors.load_word2vec_format('/Volumes/Public/word_embedding/GoogleNews-vectors-negative300.bin', binary=True)

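GloVe, English, from the official site

The files downloaded from the official site cannot be loaded by gensim directly. They first need to be converted with the glove2gensim script below.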

import gensim
import shutil
from sys import platform


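# Count the lines in the file, which equals the number of words
def getFileLineNums(filename):
    # GloVe files are UTF-8; the with-block closes the file when done
    with open(filename, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)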


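# Write the header line to outfile, then copy the whole word vector file after it
# (bulk copy; load() uses this on Linux)
def prepend_line(infile, outfile, line):
    with open(infile, 'r', encoding='utf-8') as old:
        with open(outfile, 'w', encoding='utf-8') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)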


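# Same result as prepend_line, but copies line by line (used on other platforms)
def prepend_slow(infile, outfile, line):
    with open(infile, 'r', encoding='utf-8') as fin:
        with open(outfile, 'w', encoding='utf-8') as fout:
            fout.write(str(line) + "\n")
            # Use a separate loop variable so the header argument is not shadowed
            for row in fin:
                fout.write(row)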


def load(filename):
    num_lines = getFileLineNums(filename)
    # Despite the .bin extension, this output is a plain-text word2vec file
    gensim_file = '/Volumes/Public/word_embedding/glove.model.6B.300d.bin'
    # word2vec text header: "<number of words> <vector dimension>"
    gensim_first_line = "{} {}".format(num_lines, 300)
    # Prepend the header line
    if platform.startswith("linux"):
        prepend_line(filename, gensim_file, gensim_first_line)
    else:
        prepend_slow(filename, gensim_file, gensim_first_line)

    # binary=False is the default, so this reads the text file written above
    model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file)
    return model


load('/Volumes/Public/word_embedding/glove.6B.300d.txt')

This produces the file /Volumes/Public/word_embedding/glove.model.6B.300d.bin (a text file despite the .bin extension, hence binary=False). Usage:

model = gensim.models.KeyedVectors.load_word2vec_format('/Volumes/Public/word_embedding/glove.model.6B.300d.bin', binary=False)
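The conversion script hardcodes the dimension 300 and the output path, so it only fits glove.6B.300d.txt. As a minimal sketch of a more general variant, assuming every line of the GloVe file has the form `word v1 v2 ... vn` separated by single spaces, the count and the dimension can both be read from the file itself. The function name `glove_to_word2vec_text` is hypothetical and it uses only the standard library. Newer gensim releases (4.0 and later) also accept `no_header=True` in `load_word2vec_format`, which would make the conversion unnecessary; I have not tried that against these files.

```python
def glove_to_word2vec_text(glove_path, out_path):
    """Copy a GloVe text file to out_path with a word2vec header prepended.

    Returns (num_vectors, dim), both taken from the file itself.
    """
    num_vectors = 0
    dim = None
    # First pass: count vectors and read the dimension off the first line.
    with open(glove_path, 'r', encoding='utf-8') as src:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            if dim is None:
                # One word followed by `dim` numbers.
                dim = len(line.rstrip().split(' ')) - 1
            num_vectors += 1
    if dim is None:
        raise ValueError('no vectors found in {}'.format(glove_path))
    # Second pass: write the header, then the vector lines unchanged.
    with open(glove_path, 'r', encoding='utf-8') as src, \
            open(out_path, 'w', encoding='utf-8', newline='\n') as dst:
        dst.write('{} {}\n'.format(num_vectors, dim))
        for line in src:
            if line.strip():
                dst.write(line)
    return num_vectors, dim
```

The output file is loaded the same way as before, with `binary=False`.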

A Chinese word vector model that I trained myself on a Chinese Wikipedia corpus. The quality is decent, but loading is slow.

Location: /Volumes/Public/word_embedding/ec_wiki_w2v_vector.bin

model = gensim.models.KeyedVectors.load_word2vec_format('/Volumes/Public/word_embedding/ec_wiki_w2v_vector.bin', binary=False)

Some other valuable models, Chinese

model = gensim.models.KeyedVectors.load_word2vec_format('/Volumes/Public/word_embedding/baidu_word_embedding.txt', binary=False)
