Here you will find FAQs and other use cases that are not covered in the user guide.
How to get the top-N recommendations for each user
Here is an example where we retrieve the top-10 items with the highest rating prediction for each user in the MovieLens-100k dataset. We first train an SVD algorithm on the whole dataset, then predict a rating for every pair (user, item) that does not appear in the training set. We finally retrieve the top-10 prediction list for each user. The code is given below:
from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    '''Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])
How to compute precision@k and recall@k
Precision and recall at k can be computed for each user as follows:

Precision@k: |{recommended items that are relevant}| / |{recommended items}|
Recall@k: |{relevant items that are recommended}| / |{relevant items}|
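Here is a sketch in the spirit of the example from the Surprise FAQ: an item is considered relevant if its true rating is above a given threshold, and recommended if its estimated rating is above that threshold and it is among the k highest estimates:

from collections import defaultdict

from surprise import Dataset
from surprise import SVD
from surprise.model_selection import KFold


def precision_recall_at_k(predictions, k=10, threshold=3.5):
    '''Return precision and recall at k for each user.'''

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value.
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items.
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k.
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k.
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@k: proportion of recommended items that are relevant.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

        # Recall@k: proportion of relevant items that are recommended.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls


data = Dataset.load_builtin('ml-100k')
kf = KFold(n_splits=5)
algo = SVD()

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users.
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print(sum(rec for rec in recalls.values()) / len(recalls))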
How to get the k nearest neighbors of a user (or item)
You can use the get_neighbors() method of the algorithm object. This is only relevant for algorithms that rely on a similarity measure, such as the k-NN algorithms.
Here is an example where we retrieve the 10 nearest neighbors of the movie Toy Story from the MovieLens-100k dataset. The output is:

The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)
There is a lot of boilerplate because of the conversions between movie names and their raw/inner ids, but it all boils down to using get_neighbors():

import io  # needed because of weird encoding of u.item file
from surprise import KNNBaseline
from surprise import Dataset
from surprise import get_dataset_dir


def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid


# First, train the algorithm to compute the similarities between items
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# Read the mappings raw id <-> movie name
rid_to_name, name_to_rid = read_item_names()

# Retrieve inner id of the movie Toy Story
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

# Retrieve inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

# Convert inner ids of the neighbors into names.
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)

print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)
Naturally, the same can be done for users, with minor modifications.
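For instance, a minimal sketch (an assumption, not part of the original example) that reuses the trainset above and retrieves the 10 nearest neighbors of a user; '196' is one of the raw user ids in ml-100k:

# Switch to a user-user similarity measure.
sim_options = {'name': 'pearson_baseline', 'user_based': True}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# Convert the raw user id into an inner id, query the neighbors, and
# convert the neighbors' inner ids back into raw ids.
user_inner_id = trainset.to_inner_uid('196')
user_neighbors = algo.get_neighbors(user_inner_id, k=10)
print([trainset.to_raw_uid(inner_id) for inner_id in user_neighbors])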
How to serialize an algorithm
Prediction algorithms can be serialized and loaded back using the dump() and load() functions. Here is a small example where the SVD algorithm is trained on a dataset and serialized. It is then reloaded and used again for making predictions:

import os
from surprise import SVD
from surprise import Dataset
from surprise import dump
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)
# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())
# Dump algorithm and reload it.
file_name = os.path.expanduser('~/dump_file')
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)
# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print('Predictions are the same')
Algorithms can be serialized along with their predictions, so that they can be further analyzed or compared with other algorithms, using pandas dataframes. Two examples are given in the following notebooks:

Dumping and analysis of the KNNBasic algorithm
Comparison of two algorithms
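A minimal sketch of the idea, where the dump file name is arbitrary and predictions/algo come from the example above:

import pandas as pd

from surprise import dump

# Serialize the predictions along with the algorithm, then load them back.
dump.dump('./dump_SVD', predictions=predictions, algo=algo)
predictions, algo = dump.load('./dump_SVD')

# Each prediction is a (uid, iid, r_ui, est, details) tuple, which maps
# naturally onto a dataframe for further analysis.
df = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
print(df.head())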
How to build your own prediction algorithm
There's a whole guide here.
What are raw and inner ids
Users and items have a raw id and an inner id. Some methods will use or return a raw id (e.g. the predict() method), while others will use or return an inner id.

Raw ids are ids as defined in a rating file or in a pandas dataframe. They can be strings or numbers. Note though that if the ratings were read from a standard rating file, they are represented as strings. This is important to keep in mind when you're using e.g. predict() or other methods that accept raw ids as parameters.

On trainset creation, each raw id is mapped to a unique integer called inner id, which is a lot more suitable for Surprise to manipulate. Conversions between raw and inner ids can be done using the to_inner_uid(), to_inner_iid(), to_raw_uid() and to_raw_iid() methods of the trainset.
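A minimal sketch of such a conversion, assuming the built-in ml-100k dataset (where raw user ids are strings such as '196'):

from surprise import Dataset

data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

# Raw ids read from the rating file are strings; inner ids are integers.
inner_uid = trainset.to_inner_uid('196')
assert trainset.to_raw_uid(inner_uid) == '196'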
Can I use my own dataset with Surprise, and can it be a pandas dataframe?
Yes, and yes. See the user guide section here.
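For instance, a small sketch loading ratings from a hand-built dataframe; the column names and the rating scale are arbitrary assumptions:

import pandas as pd

from surprise import Dataset
from surprise import Reader

# A toy dataframe with one rating per row (hypothetical data).
ratings_dict = {'userID': [9, 32, 2],
                'itemID': [1, 1, 1],
                'rating': [3, 2, 4]}
df = pd.DataFrame(ratings_dict)

# The columns must correspond to user id, item id and ratings, in that order.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)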
How to tune an algorithm's parameters
You can tune the parameters of an algorithm with the GridSearchCV class, as described here. After the tuning, you may want to have an unbiased estimate of your algorithm's performance (see the section on unbiased accuracy estimation below).
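A minimal sketch of such a grid search over SVD, where the grid values are arbitrary assumptions:

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin('ml-100k')

# Try every combination of these (arbitrary) parameter values with 3-fold CV.
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])   # best RMSE over the grid
print(gs.best_params['rmse'])  # parameter combination that achieved it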
How to get accuracy measures on the training set
You can use the build_testset() method of the Trainset object to build a testset that can then be used with the test() method:
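For instance, a minimal sketch assuming the built-in ml-100k dataset:

from surprise import Dataset
from surprise import SVD
from surprise import accuracy

data = Dataset.load_builtin('ml-100k')

algo = SVD()
trainset = data.build_full_trainset()
algo.fit(trainset)

# Build a testset from the trainset itself and evaluate on it.
testset = trainset.build_testset()
predictions = algo.test(testset)

# The accuracy is biased, since the algorithm was trained on these ratings.
accuracy.rmse(predictions, verbose=True)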
Check out the example files for more usage examples.
How to save some data for unbiased accuracy estimation
If your goal is to tune the parameters of an algorithm, you may want to spare some data to have an unbiased estimate of its performance. For instance, you may want to split your data into two sets A and B: A is used for parameter tuning with grid search, and B is used for unbiased estimation:

import random
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import GridSearchCV
# Load the full dataset.
data = Dataset.load_builtin('ml-100k')
raw_ratings = data.raw_ratings
# shuffle ratings if you want
random.shuffle(raw_ratings)
# A = 90% of the data, B = 10% of the data
threshold = int(.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]
data.raw_ratings = A_raw_ratings # data is now the set A
# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
grid_search.fit(data)
algo = grid_search.best_estimator['rmse']
# retrain on the whole set A
trainset = data.build_full_trainset()
algo.fit(trainset)
# Compute biased accuracy on A
predictions = algo.test(trainset.build_testset())
print('Biased accuracy on A,', end=' ')
accuracy.rmse(predictions)
# Compute unbiased accuracy on B
testset = data.construct_testset(B_raw_ratings) # testset is now the set B
predictions = algo.test(testset)
print('Unbiased accuracy on B,', end=' ')
accuracy.rmse(predictions)
How to have reproducible experiments
Some algorithms randomly initialize their parameters (sometimes with numpy), and the cross-validation folds are also randomly generated. If you need to reproduce your experiments multiple times, you just have to set the seed of the RNG at the beginning of your program:

import random
import numpy as np
my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)
Where are datasets stored, and how can I change that?
By default, datasets downloaded by Surprise are saved in the '.surprise_data' folder in your home directory (this is also where dump files are stored). You can change the default location by setting the 'SURPRISE_DATA_FOLDER' environment variable.
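For instance, a small sketch, where '/tmp/surprise_data' is a hypothetical location; set the variable before Surprise downloads anything:

import os

# Hypothetical location; set before surprise is imported/used.
os.environ['SURPRISE_DATA_FOLDER'] = '/tmp/surprise_data'

from surprise import Dataset

data = Dataset.load_builtin('ml-100k')  # now stored under /tmp/surprise_data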