Datasets

These datasets are part of a research effort to identify potential relationships in productive networks. The main goal is to find users who share similar interests but have no direct connection to each other.

Common structure

Each dataset is provided through an SQLite database. The following diagram shows the database design, including associative tables.

[Figure: database design diagram, including associative tables]

The following shows the structure of the SQLite file:

sqlite> .tables
items tags tags_items users users_items users_tags

sqlite> .schema items
CREATE TABLE items (
id INTEGER NOT NULL CONSTRAINT pk_item_id PRIMARY KEY
);

sqlite> .schema tags
CREATE TABLE tags (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name VARCHAR(120) NOT NULL,
type VARCHAR(80) NOT NULL
);

CREATE TRIGGER insert_tag BEFORE INSERT ON tags
BEGIN
SELECT CASE WHEN COUNT(*) > 0
THEN RAISE(ABORT, "Tags must have a unique 'name' and 'type' combination.") END
FROM tags WHERE name = new.name AND type = new.type;
END;

sqlite> .schema users
CREATE TABLE users (
id INTEGER NOT NULL CONSTRAINT pk_user_id PRIMARY KEY
);

sqlite> .schema tags_items
CREATE TABLE tags_items (
tag_id INTEGER NOT NULL,
item_id INTEGER NOT NULL,
PRIMARY KEY(tag_id, item_id),
FOREIGN KEY(tag_id) REFERENCES tags(id),
FOREIGN KEY(item_id) REFERENCES items(id)
);

sqlite> .schema users_items
CREATE TABLE users_items (
user_id INTEGER NOT NULL,
item_id INTEGER NOT NULL,
PRIMARY KEY(user_id, item_id),
FOREIGN KEY(user_id) REFERENCES users(id),
FOREIGN KEY(item_id) REFERENCES items(id)
);

sqlite> .schema users_tags
CREATE TABLE users_tags (
user_id INTEGER NOT NULL,
tag_id INTEGER NOT NULL,
PRIMARY KEY(user_id, tag_id),
FOREIGN KEY(user_id) REFERENCES users(id),
FOREIGN KEY(tag_id) REFERENCES tags(id)
);
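As a minimal sketch of how this schema can be queried from Python's sqlite3 module, the following builds a small in-memory database with the users, tags and users_tags tables and lists the pairs of users that share at least one tag. The sample rows and the function name shared_tag_pairs are illustrative assumptions, not part of the datasets; for a real dataset file, connect to its path instead of ":memory:".

```python
import sqlite3


def shared_tag_pairs(conn):
    """Return (user_a, user_b, n_shared_tags) for each pair sharing a tag."""
    query = """
        SELECT a.user_id, b.user_id, COUNT(*) AS shared
        FROM users_tags AS a
        JOIN users_tags AS b
          ON a.tag_id = b.tag_id AND a.user_id < b.user_id
        GROUP BY a.user_id, b.user_id
        ORDER BY shared DESC, a.user_id, b.user_id
    """
    return conn.execute(query).fetchall()


# Illustrative in-memory database (sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER NOT NULL CONSTRAINT pk_user_id PRIMARY KEY
    );
    CREATE TABLE tags (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name VARCHAR(120) NOT NULL,
        type VARCHAR(80) NOT NULL
    );
    CREATE TABLE users_tags (
        user_id INTEGER NOT NULL,
        tag_id INTEGER NOT NULL,
        PRIMARY KEY(user_id, tag_id),
        FOREIGN KEY(user_id) REFERENCES users(id),
        FOREIGN KEY(tag_id) REFERENCES tags(id)
    );
    INSERT INTO users (id) VALUES (1), (2), (3);
    INSERT INTO tags (name, type)
        VALUES ('sunset', 'keyword'), ('beach', 'keyword');
    INSERT INTO users_tags (user_id, tag_id)
        VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 2);
""")

pairs = shared_tag_pairs(conn)
for user_a, user_b, shared in pairs:
    print(user_a, user_b, shared)
```

The self-join on tag_id with a.user_id < b.user_id reports each pair once and never pairs a user with itself.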

For convenience, a set of Python structures populated from the database is also made available. Each structure is either an array (a = [...]) or a dictionary (b = {key : value}). Finally, a dictionary that aggregates all the structures is distributed in its pickled form. The structures are as follows:

AU = {artefact_id : user_id }
K = [keyword_id]
KA = {keyword_id : [artefact_id]}
KU = {keyword_id : [user_id]}
KW = {keyword_id : weight}
U = [user_id]
data.dict = {all the above}
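The sketch below shows one way the pickled dictionary might be loaded and used. It assumes the file is named data.dict and that its keys are the structure names listed above ("AU", "K", "KA", "KU", "KW", "U"); neither is confirmed here. So that the example runs without the dataset, it first writes a small stand-in file with made-up values. The helper names load_structures and users_sharing_keywords are hypothetical.

```python
import os
import pickle
import tempfile
from collections import defaultdict
from itertools import combinations

# Stand-in for the distributed file; all values are made up.
sample = {
    "AU": {10: 1, 11: 2, 12: 3},
    "K": [100, 101],
    "KA": {100: [10, 11], 101: [11, 12]},
    "KU": {100: [1, 2], 101: [2, 3]},
    "KW": {100: 0.5, 101: 1.0},
    "U": [1, 2, 3],
}
path = os.path.join(tempfile.mkdtemp(), "data.dict")
with open(path, "wb") as fh:
    pickle.dump(sample, fh)


def load_structures(path):
    """Load the aggregated dictionary from a pickle file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def users_sharing_keywords(ku):
    """Map each user pair to the keywords both users are associated with."""
    shared = defaultdict(list)
    for keyword_id, user_ids in ku.items():
        for pair in combinations(sorted(set(user_ids)), 2):
            shared[pair].append(keyword_id)
    return dict(shared)


data = load_structures(path)
shared = users_sharing_keywords(data["KU"])
print(shared)
```

Unpickling can execute arbitrary code, so only load pickle files obtained from a trusted source.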

Flickr

Flickr provides an API for querying its content; the documentation is available at <www.flickr.com/services/api>. Through the API, it is straightforward to obtain a user characterization from the user name or id. It is also possible to obtain a user’s list of photos and a photo’s list of keywords. The API also supports searching for a particular keyword, returning the list of photos associated with it.
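As a rough illustration of these queries, the sketch below only builds request URLs for the Flickr REST endpoint and performs no network calls. The method names used (flickr.people.getInfo, flickr.people.getPublicPhotos, flickr.tags.getListPhoto, flickr.photos.search) are taken from the public Flickr API documentation rather than from this dataset's collection code, so this is an assumption about how the data could be gathered, not a description of how it was. The API key, user id and photo id are placeholders, and flickr_url is a hypothetical helper.

```python
from urllib.parse import urlencode

REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def flickr_url(method, api_key, **params):
    """Build a Flickr REST request URL that asks for a plain JSON response."""
    query = {
        "method": method,
        "api_key": api_key,
        "format": "json",
        "nojsoncallback": 1,
    }
    query.update(params)
    return REST_ENDPOINT + "?" + urlencode(query)


API_KEY = "YOUR_API_KEY"  # placeholder, obtain a real key from Flickr

urls = {
    # User characterization from a user id.
    "user_info": flickr_url("flickr.people.getInfo", API_KEY, user_id="USER_ID"),
    # A user's public photos.
    "user_photos": flickr_url(
        "flickr.people.getPublicPhotos", API_KEY, user_id="USER_ID"
    ),
    # Keywords (tags) attached to one photo.
    "photo_tags": flickr_url(
        "flickr.tags.getListPhoto", API_KEY, photo_id="PHOTO_ID"
    ),
    # Photos associated with a keyword.
    "search": flickr_url("flickr.photos.search", API_KEY, tags="sunset"),
}

for name, url in urls.items():
    print(name, url)
```

Each URL can then be fetched with any HTTP client; the response format and rate limits are described in the Flickr documentation.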

Download

Requirements

The Python packages required to use the dataset are:

  • pickle
  • pandas
  • sqlite3
  • pylab
  • numpy

The classification analysis is done with the scikit-learn Python library.