*** Introduction to MuMiN ***

The MuMiN dataset is a challenging benchmark for automatic misinformation detection models. The dataset is structured as a heterogeneous graph and features 21,565,018 tweets and 1,986,354 users, belonging to 26,048 Twitter threads, discussing 12,914 fact-checked claims from 115 fact-checking organisations in 41 different languages, spanning a decade. The paper describing the dataset is available at the following URL: https://arxiv.org/abs/2202.11684

The dataset comes in three different sizes and features two graph classification tasks:

- Claim classification: Given a claim and its surrounding subgraph extracted from social media, predict whether the verdict of the claim is misinformation or factual.
- Tweet classification: Given a source tweet (i.e., not a reply, quote tweet or retweet) to be fact-checked, predict whether the tweet discusses a claim whose verdict is misinformation or factual.

*** Downloading and compiling MuMiN ***

The dataset is built using our Python package, mumin. To install it, write `pip install mumin[all]` in your terminal. With the package installed, you can download and compile the dataset as follows:

>>> from mumin import MuminDataset
>>> dataset = MuminDataset(bearer_token, size='small')
>>> dataset.compile()

Compiling the dataset requires downloading data from Twitter, which in turn requires a Twitter API key. You can get one for free at https://developer.twitter.com/en/portal/dashboard. You will need the Bearer Token (`bearer_token`).

Note that the compiled dataset does not contain all the nodes and relations in MuMiN-small, as including them would make compilation take considerably longer. The data left out are timelines, profile pictures and article images. These can be included by specifying `include_extra_images=True` and/or `include_timelines=True` in the constructor of MuminDataset.

*** Working with MuMiN ***

With a compiled dataset, you can work directly with the individual nodes and relations using the `dataset.nodes` and `dataset.rels` dictionaries. For instance, you can get a dataframe with all the claims as follows:

>>> claim_df = dataset.nodes['claim']

All the relations are dataframes with two columns, `src` and `tgt`, corresponding to the source and target of the relation. For instance, if we are interested in the relation (:Tweet)-[:DISCUSSES]->(:Claim), then we can extract it as follows (see the labelling sketch at the end of this section for an example of joining these dataframes):

>>> discusses_df = dataset.rels[('tweet', 'discusses', 'claim')]

If you are interested in computing transformer embeddings of the tweets and images, then run the following:

>>> dataset.add_embeddings()

From a compiled dataset, with or without embeddings, you can export the dataset to a Deep Graph Library heterogeneous graph object, which allows you to use graph machine learning algorithms on the dataset. To export it, you simply run:

>>> dgl_graph = dataset.to_dgl()

We have created a tutorial which takes you through the dataset and shows how one could create several kinds of misinformation classifiers on it. The tutorial can be found here: https://colab.research.google.com/drive/1Kz0EQtySYQTo1ui8F2KZ6ERneZVf5TIN
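As promised above, here is a minimal sketch of joining the node and relation dataframes, which labels each source tweet with the verdict of the claim it discusses. This is only a sketch: it assumes that the `src` column of the relation dataframe holds tweet IDs matching the `tweet_id` column of the tweet dataframe, and that `tgt` holds claim IDs matching the `id` column of the claim dataframe.

>>> tweet_df = dataset.nodes['tweet']
>>> # Join tweets to the claims they discuss, keeping only the claim verdicts
>>> labelled_df = (discusses_df
...                .merge(tweet_df, left_on='src', right_on='tweet_id')
...                .merge(claim_df[['id', 'label']],
...                       left_on='tgt', right_on='id'))

The resulting `labelled_df` then pairs each tweet with a "misinformation"/"factual" label, which is the starting point for the tweet classification task.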
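Similarly, after exporting with `to_dgl`, you can sanity-check the resulting graph using DGL's standard heterograph API (the exact node and edge types present depend on the size you chose and on whether you included the extra nodes):

>>> dgl_graph.ntypes             # node types, e.g. ['claim', 'tweet', ...]
>>> dgl_graph.canonical_etypes   # typed edges, e.g. ('tweet', 'discusses', 'claim')
>>> dgl_graph.num_nodes('claim') # number of claim nodes in the chosen size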
*** Documentation of the raw dataset files ***

While we do not recommend using the raw dataset files directly, they are available for download. The raw dataset is compressed into a single zip file, `mumin.zip`, which contains the following files:

- `claim`
- `tweet`
- `reply`
- `article`
- `user`
- `tweet_discusses_claim`
- `article_discusses_claim`
- `user_retweeted_tweet`
- `reply_reply_to_tweet`
- `reply_quote_of_tweet`
- `user_follows_user`

All of these files are xz-compressed pickle files using protocol 4, which is compatible with Python 3.4 and above. The files can be opened with, e.g., the Python package `pandas`, as follows:

>>> import pandas as pd
>>> claim_df = pd.read_pickle('claim', compression='xz')

The `claim` dataframe contains data on all the claims in the dataset and has the following columns:

- `id` (str): The ID of the claim.
- `keywords` (str): Keywords associated with the claim, separated by spaces.
- `cluster_keywords` (str): Keywords associated with the cluster to which the claim belongs, separated by spaces.
- `cluster` (int): The cluster to which the claim belongs. If the claim does not belong to a cluster, this is -1.
- `date` (datetime64[ns]): The date and time of the claim, formatted as "YYYY-MM-DD HH:mm:ss". If no time is available, this is set to 00:00:00.
- `language` (categorical with 41 str categories): The language of the claim, given as the two-letter BCP-47 code of the language.
- `embedding` (NumPy array): The embedding of the claim, which is a vector of 768 dimensions.
- `label` (categorical with 2 str categories): The label of the claim, which is either "misinformation" or "factual".
- `reviewers` (list of str): The list of URLs of the fact-checking organisations that reviewed the claim.
- `small_train_mask` (bool): Whether the claim is in the small training set.
- `small_val_mask` (bool): Whether the claim is in the small validation set.
- `small_test_mask` (bool): Whether the claim is in the small test set.
- `medium_train_mask` (bool): Whether the claim is in the medium training set.
- `medium_val_mask` (bool): Whether the claim is in the medium validation set.
- `medium_test_mask` (bool): Whether the claim is in the medium test set.
- `large_train_mask` (bool): Whether the claim is in the large training set.
- `large_val_mask` (bool): Whether the claim is in the large validation set.
- `large_test_mask` (bool): Whether the claim is in the large test set.

The `tweet`, `reply` and `user` dataframes all have two columns. The first column, named `tweet_id` or `user_id`, is the official Twitter ID of the tweet/reply/user, and the second column, named `relevance`, is the approximate maximal cosine similarity between the tweet/reply/user and a claim in the dataset (ranging from 0 to 1).

The `article` dataframe contains three columns: `id`, `url` and `relevance`. The `id` column is an ID for the article, the `url` column is the URL of the article, and the `relevance` column is the same as for the tweets/replies/users.

The rest of the dataframes are relations in the dataset, with filenames of the form `<src>_<rel>_<tgt>`, where `<src>` and `<tgt>` are the names of the source and target nodes, and `<rel>` is the name of the relation (note that `<rel>` can contain underscores as well, but `<src>` and `<tgt>` cannot). These relation dataframes all contain three columns:

- `src` (str): The ID of the source node.
- `tgt` (str): The ID of the target node.
- `relevance` (float): The maximal relevance among the source and target nodes.

Of all this data, only the claims and articles are potentially usable as-is. The Twitter data needs to be "rehydrated", which is the procedure of converting the IDs back into actual tweets/replies/users. This rehydration is performed automatically when the dataset is compiled with the mumin package, as described above.
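Even without rehydration, the ID-level files can be combined using the columns documented above. For instance, the following sketch pairs each tweet ID with the verdict of the claim it discusses, restricted to the small training split (it assumes that the `tgt` column of the relation file matches the `id` column of the `claim` file, as the target nodes here are claims):

>>> import pandas as pd
>>> claim_df = pd.read_pickle('claim', compression='xz')
>>> rel_df = pd.read_pickle('tweet_discusses_claim', compression='xz')
>>> # Keep only the claims belonging to the small training set
>>> train_claims = claim_df[claim_df['small_train_mask']]
>>> # Join tweet IDs (src) to the labels of the claims (tgt) they discuss
>>> labelled_ids = rel_df.merge(train_claims[['id', 'label']],
...                             left_on='tgt', right_on='id')

The `src` column of `labelled_ids` then contains the tweet IDs to be rehydrated, each paired with a "misinformation"/"factual" label.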
*** License ***

The source code is available under the MIT license, and the dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0), in accordance with the Twitter policy, which can be found at https://developer.twitter.com/en/developer-terms/agreement-and-policy.