*** Introduction to MuMiN ***

The MuMiN dataset is a challenging benchmark for automatic misinformation
detection models. The dataset is structured as a heterogeneous
graph and features 21,565,018 tweets and 1,986,354 users, belonging to 26,048
Twitter threads, discussing 12,914 fact-checked claims from 115 fact-checking
organisations in 41 different languages, spanning a decade. The paper
describing the dataset is available at the following URL:

https://arxiv.org/abs/2202.11684

The dataset has three different sizes and features two graph classification
tasks:

- Claim classification: Given a claim and its surrounding subgraph extracted
  from social media, predict whether the verdict of the claim is misinformation
  or factual.
- Tweet classification: Given a source tweet (i.e., not a reply, quote tweet or
  retweet) to be fact-checked, predict whether the tweet discusses a claim
  whose verdict is misinformation or factual.


*** Downloading and compiling MuMiN ***

The dataset is built using our Python package, `mumin`. To install this, write
`pip install mumin[all]` in your terminal. With the package installed, you can
download and compile the dataset by writing the following:

>>> from mumin import MuminDataset
>>> dataset = MuminDataset(bearer_token, size='small')
>>> dataset.compile()

To be able to compile the dataset, data from Twitter needs to be downloaded,
which requires a Twitter API key. You can get one for free at
https://developer.twitter.com/en/portal/dashboard. You will need the Bearer
Token (`bearer_token`).
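
Rather than hard-coding the token, you might load it from an environment
variable before constructing the dataset; this is just a sketch, and the
variable name `TWITTER_BEARER_TOKEN` below is purely an example:

>>> import os
>>> bearer_token = os.environ['TWITTER_BEARER_TOKEN']  # hypothetical variable name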

Note that, by default, the compiled dataset does not contain all the nodes and
relations in MuMiN-small, as including them would take considerably longer to
compile. The data left out are timelines, profile pictures and article images.
These can be included by specifying `include_extra_images=True` and/or
`include_timelines=True` in the constructor of `MuminDataset`.
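
For example, compiling MuMiN-small with both of these extra node types included
would look roughly as follows (the flags are the ones named above; everything
else matches the earlier example):

>>> dataset = MuminDataset(bearer_token,
...                        size='small',
...                        include_extra_images=True,
...                        include_timelines=True)
>>> dataset.compile()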


*** Working with MuMiN ***

With a compiled dataset, you can now work directly with the individual nodes
and relations using the `dataset.nodes` and `dataset.rels` dictionaries. For
instance, you can get a dataframe with all the claims as follows:

>>> claim_df = dataset.nodes['claim']

All the relations are dataframes with two columns, `src` and `tgt`, corresponding
to the source and target of the relation. For instance, if we’re interested in
the relation (:Tweet)-[:DISCUSSES]->(:Claim) then we can extract this as
follows:

>>> discusses_df = dataset.rels[('tweet', 'discusses', 'claim')]
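
As a sketch of how nodes and relations combine, the following attaches the
claim labels to the discussing tweets. It assumes that `src` and `tgt` hold row
indices into the corresponding node dataframes and that the compiled claim
dataframe keeps the `label` column documented for the raw files below:

>>> labelled = discusses_df.merge(dataset.nodes['claim'][['label']],
...                               left_on='tgt', right_index=True)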

If you are interested in computing transformer embeddings of the tweets and
images then run the following:

>>> dataset.add_embeddings()
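
Assuming the embeddings end up as new columns on the node dataframes (their
exact names are not documented here) and that the call updates `dataset.nodes`
in place, one way to see what was added is to snapshot the columns before the
call and compare afterwards:

>>> before = set(dataset.nodes['tweet'].columns)
>>> dataset.add_embeddings()
>>> sorted(set(dataset.nodes['tweet'].columns) - before)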

From a compiled dataset, with or without embeddings, you can export the dataset
to a Deep Graph Library heterogeneous graph object, which allows you to use
graph machine learning algorithms on the dataset. To export it, you simply run:

>>> dgl_graph = dataset.to_dgl()
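
A few standard DGL calls are useful for sanity-checking the exported graph, for
example listing the node and edge types and counting nodes of a given type:

>>> dgl_graph.ntypes               # node types, e.g. 'claim', 'tweet', ...
>>> dgl_graph.canonical_etypes     # e.g. ('tweet', 'discusses', 'claim'), ...
>>> dgl_graph.num_nodes('claim')   # number of claim nodes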

We have created a tutorial which takes you through the dataset as well as shows
how one could create several kinds of misinformation classifiers on the
dataset. The tutorial can be found here:

https://colab.research.google.com/drive/1Kz0EQtySYQTo1ui8F2KZ6ERneZVf5TIN


*** Documentation of the raw dataset files ***

While we do not recommend using the raw dataset files directly, they are
available for download. The raw dataset is compressed in a single zip file,
`mumin.zip`, which contains the following files:

- `claim`
- `tweet`
- `reply`
- `article`
- `user`
- `tweet_discusses_claim`
- `article_discusses_claim`
- `user_retweeted_tweet`
- `reply_reply_to_tweet`
- `reply_quote_of_tweet`
- `user_follows_user`

All of these files are xz-compressed pickle files with protocol 4, which is
compatible with Python 3.4 and above. The files can be opened with, e.g., the
Python package `pandas`, as follows:

>>> import pandas as pd
>>> claim_df = pd.read_pickle('claim', compression='xz')
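
Since all the files follow the same format, they can be loaded in one go. The
sketch below assumes the zip archive has already been extracted into the
current working directory and uses the file names listed above:

>>> node_files = ['claim', 'tweet', 'reply', 'article', 'user']
>>> rel_files = ['tweet_discusses_claim', 'article_discusses_claim',
...              'user_retweeted_tweet', 'reply_reply_to_tweet',
...              'reply_quote_of_tweet', 'user_follows_user']
>>> nodes = {name: pd.read_pickle(name, compression='xz') for name in node_files}
>>> rels = {name: pd.read_pickle(name, compression='xz') for name in rel_files}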

The `claim` dataframe contains data on all the claims in the dataset and has the
following columns (a small example of using the split masks follows the list):
- `id` (str): The ID of the claim.
- `keywords` (str): Keywords associated with the claim, separated by spaces.
- `cluster_keywords` (str): Keywords associated with the cluster to which the
  claim belongs, separated by spaces.
- `cluster` (int): The cluster to which the claim belongs. If the claim does
  not belong to a cluster, this is -1.
- `date` (datetime64[ns]): The date and time of the claim, formatted as
  "YYYY-MM-DD HH:mm:ss". If there is no time available then this is set to
  00:00:00.
- `language` (categorical with 41 str categories): The language of the claim,
  which is the two-letter BCP-47 code of the language.
- `embedding` (NumPy array): The embedding of the claim, which is a vector of
  768 dimensions.
- `label` (categorical with 2 str categories): The label of the claim, which is
  either "misinformation" or "factual".
- `reviewers` (list of str): The list of URLs of fact-checking organisations
  that reviewed the claim.
- `small_train_mask` (bool): Whether the claim is in the small training set.
- `small_val_mask` (bool): Whether the claim is in the small validation set.
- `small_test_mask` (bool): Whether the claim is in the small test set.
- `medium_train_mask` (bool): Whether the claim is in the medium training set.
- `medium_val_mask` (bool): Whether the claim is in the medium validation set.
- `medium_test_mask` (bool): Whether the claim is in the medium test set.
- `large_train_mask` (bool): Whether the claim is in the large training set.
- `large_val_mask` (bool): Whether the claim is in the large validation set.
- `large_test_mask` (bool): Whether the claim is in the large test set.
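
Here is the small example referred to above: the mask columns can be used to
reproduce the train/validation/test splits for a given dataset size, e.g. the
small one, using only the columns documented in the list:

>>> train_df = claim_df[claim_df.small_train_mask]
>>> val_df = claim_df[claim_df.small_val_mask]
>>> test_df = claim_df[claim_df.small_test_mask]
>>> train_df.label.value_counts()  # counts of "misinformation" vs "factual"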

The `tweet`, `reply` and `user` dataframes all have two columns. The first
column, named `tweet_id` or `user_id`, is the official Twitter ID of the
tweet/reply/user, and the second column, named `relevance`, is the approximate
maximal cosine similarity of the tweet/reply/user to a claim in the dataset
(ranges from 0 to 1).

The `article` dataframe contains three columns: `id`, `url` and `relevance`.
The `id` column is an ID for the article, the `url` column is the URL of the
article, and the `relevance` column has the same meaning as for the
tweets/replies/users.
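
For instance, the relevance score can be used to restrict the raw Twitter data
to the rows most similar to a claim before rehydrating; the 0.8 threshold below
is purely illustrative:

>>> tweet_df = pd.read_pickle('tweet', compression='xz')
>>> relevant_ids = tweet_df.loc[tweet_df.relevance > 0.8, 'tweet_id']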

The rest of the dataframes are relations in the dataset, with filenames of the
form `<src>_<rel>_<tgt>`, where `<src>` and `<tgt>` are the names of the source
and target nodes, and `<rel>` is the name of the relation (note that `<rel>`
can contain underscores as well, but `<src>` and `<tgt>` cannot). These
relation dataframes all contain three columns:
- `src` (str): The ID of the source node.
- `tgt` (str): The ID of the target node.
- `relevance` (float): The maximal relevance among the source and target nodes.
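
As a small example of joining the raw node and relation files, the following
counts how many (tweet, claim) pairs involve misinformation versus factual
claims, reusing the `nodes` and `rels` dictionaries from the loading sketch
above and joining the relation's `tgt` column against the claim `id` column:

>>> discusses = rels['tweet_discusses_claim']
>>> merged = discusses.merge(nodes['claim'][['id', 'label']],
...                          left_on='tgt', right_on='id')
>>> merged.label.value_counts()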

Of all this data, only the claims and articles are potentially usable as-is. The
Twitter data needs to be "rehydrated", which is the procedure of converting the
IDs into actual tweets/replies/users. This is done automatically when compiling
the dataset with the `mumin` package, as described above.


*** License ***

The source code is available under the MIT license, and the dataset is released
under the Creative Commons Attribution-NonCommercial 4.0 International License
(CC BY-NC 4.0), in accordance with the Twitter policy, which can be found at
https://developer.twitter.com/en/developer-terms/agreement-and-policy.