# SearchPipeline

`SearchPipeline` is a class that creates a pipeline for searching and retrieving data from the web based on a given topic.

#### Inputs

- topic (`str`): The topic to search for.
- summarize (`bool`, optional): Whether to summarize the retrieved pages. Defaults to `False`.

#### Components

- `PageAggregator`: Aggregates web pages related to the given topic.
- `PageSummarizer`: Summarizes the aggregated pages if `summarize` is set to `True`.

### Example

Create a search pipeline for the topic "Entropy-based sampling" with summarization enabled.

```
import asyncio
import json
import os

from dria.client import Dria
from dria.factory import SearchPipeline
from dria.pipelines import PipelineConfig

# The RPC token is read from the DRIA_RPC_TOKEN environment variable.
dria = Dria(rpc_token=os.environ["DRIA_RPC_TOKEN"])


async def evaluate():
    await dria.initialize()
    # Build the pipeline for the topic, with summarization enabled.
    pipeline = SearchPipeline(dria, PipelineConfig(pipeline_timeout=80)).build(
        topic="Entropy-based sampling", summarize=True
    )
    res = await pipeline.execute(return_output=True)
    # Save the results as pretty-printed JSON.
    with open("search_results.json", "w") as f:
        f.write(json.dumps(res, indent=2))


if __name__ == "__main__":
    asyncio.run(evaluate())
```
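Once the script has written `search_results.json`, the saved results can be inspected with plain Python. The following is a minimal sketch, not part of the Dria API: the helper name `brief` is hypothetical, and it assumes each record is a JSON object with the `url` and `summary` keys that appear in the sample output on this page.

```python
import json
from pathlib import Path


def brief(results):
    """Return (url, summary) pairs from a list of result records.

    Hypothetical helper: the record layout is assumed from the sample
    output on this page, not from a documented schema.
    """
    pairs = []
    for record in results:
        # .get() tolerates records that lack a key (assumption: the keys
        # may differ, for example when summarization is disabled).
        pairs.append((record.get("url", ""), record.get("summary", "")))
    return pairs


if __name__ == "__main__":
    path = Path("search_results.json")
    # Only read the file if the example script has already produced it.
    if path.exists():
        records = json.loads(path.read_text(encoding="utf-8"))
        for url, summary in brief(records):
            print(f"{url}\n  {summary}\n")
```

Printing only the URL and the short summary keeps the overview readable, since the `content` field holds the full scraped page and can be very long.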

Expected output, as written to `search_results.json`:

```
[
{
"content": "Title: Entropy-based Training Methods for Scalable Neural Implicit Sampler\n\nURL Source: https://arxiv.org/abs/2306.04952\n\nMarkdown Content:\n\\[2306.04952\\] Entropy-based Training Methods for Scalable Neural Implicit Sampler\n===============\n \n\n[Skip to main content](https://arxiv.org/abs/2306.04952#content)\n\n[![Cornell University](/static/browse/0.3.4/images/icons/cu/cornell-reduced-white-SMALL.svg)](https://www.cornell.edu/)\n\nWe gratefully acknowledge support from the Simons Foundation, [member institutions](https://info.arxiv.org/about/ourmembers.html), and all contributors. [Donate](https://info.arxiv.org/about/donate.html)\n\n[](https://arxiv.org/abs/%7Burl_path/('ignore_me'/)%7D)\n\n[![arxiv logo](/static/browse/0.3.4/images/arxiv-logo-one-color-white.svg)](https://arxiv.org/) \\> [stat](https://arxiv.org/list/stat/recent) \\> arXiv:2306.04952\n\n[Help](https://info.arxiv.org/help) | [Advanced Search](https://arxiv.org/search/advanced)\n\n Search\n\n[![arXiv logo](/static/browse/0.3.4/images/arxiv-logomark-small-white.svg)](https://arxiv.org/)\n\n [![Cornell University Logo](/static/browse/0.3.4/images/icons/cu/cornell-reduced-white-SMALL.svg)](https://www.cornell.edu/)\n\n GO\n\nquick links\n-----------\n\n* [Login](https://arxiv.org/login)\n* [Help Pages](https://info.arxiv.org/help)\n* [About](https://info.arxiv.org/about)\n\nStatistics \\> Machine Learning\n==============================\n\n**arXiv:2306.04952** (stat)\n\n\\[Submitted on 8 Jun 2023\\]\n\nTitle:Entropy-based Training Methods for Scalable Neural Implicit Sampler\n=========================================================================\n\nAuthors:[Weijian Luo](https://arxiv.org/search/stat?searchtype=author&query=Luo,+W), [Boya Zhang](https://arxiv.org/search/stat?searchtype=author&query=Zhang,+B), [Zhihua Zhang](https://arxiv.org/search/stat?searchtype=author&query=Zhang,+Z)\n\nView a PDF of the paper titled Entropy-based Training Methods for Scalable Neural Implicit 
Sampler, by Weijian Luo and Boya Zhang and Zhihua Zhang\n\n[View PDF](https://arxiv.org/pdf/2306.04952)\n\n> Abstract:Efficiently sampling from un-normalized target distributions is a fundamental problem in scientific computing and machine learning. Traditional approaches like Markov Chain Monte Carlo (MCMC) guarantee asymptotically unbiased samples from such distributions but suffer from computational inefficiency, particularly when dealing with high-dimensional targets, as they require numerous iterations to generate a batch of samples. In this paper, we propose an efficient and scalable neural implicit sampler that overcomes these limitations. Our sampler can generate large batches of samples with low computational costs by leveraging a neural transformation that directly maps easily sampled latent vectors to target samples without the need for iterative procedures. To train the neural implicit sampler, we introduce two novel methods: the KL training method and the Fisher training method. The former minimizes the Kullback-Leibler divergence, while the latter minimizes the Fisher divergence. By employing these training methods, we effectively optimize the neural implicit sampler to capture the desired target distribution. To demonstrate the effectiveness, efficiency, and scalability of our proposed samplers, we evaluate them on three sampling benchmarks with different scales. These benchmarks include sampling from 2D targets, Bayesian inference, and sampling from high-dimensional energy-based models (EBMs). Notably, in the experiment involving high-dimensional EBMs, our sampler produces samples that are comparable to those generated by MCMC-based methods while being more than 100 times more efficient, showcasing the efficiency of our neural sampler. 
We believe that the theoretical and empirical contributions presented in this work will stimulate further research on developing efficient samplers for various applications beyond the ones explored in this study.\n\nSubjects:\n\nMachine Learning (stat.ML); Machine Learning (cs.LG)\n\nCite as:\n\n[arXiv:2306.04952](https://arxiv.org/abs/2306.04952) \\[stat.ML\\]\n\n\u00a0\n\n(or [arXiv:2306.04952v1](https://arxiv.org/abs/2306.04952v1) \\[stat.ML\\] for this version)\n\n\u00a0\n\n[https://doi.org/10.48550/arXiv.2306.04952](https://doi.org/10.48550/arXiv.2306.04952)\n\nFocus to learn more\n\narXiv-issued DOI via DataCite\n\nSubmission history\n------------------\n\nFrom: Weijian Luo \\[[view email](https://arxiv.org/show-email/ed0de722/2306.04952)\\] \n**\\[v1\\]** Thu, 8 Jun 2023 05:56:05 UTC (2,556 KB) \n\nFull-text links:\n\nAccess Paper:\n-------------\n\nView a PDF of the paper titled Entropy-based Training Methods for Scalable Neural Implicit Sampler, by Weijian Luo and Boya Zhang and Zhihua Zhang\n\n* [View PDF](https://arxiv.org/pdf/2306.04952)\n* [TeX Source](https://arxiv.org/src/2306.04952)\n* [Other Formats](https://arxiv.org/format/2306.04952)\n\n [![license icon](https://arxiv.org/icons/licenses/by-4.0.png) view license](http://creativecommons.org/licenses/by/4.0/ \"Rights to this article\")\n\nCurrent browse context:\n\nstat.ML\n\n[< prev](https://arxiv.org/prevnext?id=2306.04952&function=prev&context=stat.ML \"previous in stat.ML (accesskey p)\") \u00a0 | \u00a0 [next \\>](https://arxiv.org/prevnext?id=2306.04952&function=next&context=stat.ML \"next in stat.ML (accesskey n)\") \n\n[new](https://arxiv.org/list/stat.ML/new) | [recent](https://arxiv.org/list/stat.ML/recent) | [2023-06](https://arxiv.org/list/stat.ML/2023-06)\n\nChange to browse by:\n\n[cs](https://arxiv.org/abs/2306.04952?context=cs) \n[cs.LG](https://arxiv.org/abs/2306.04952?context=cs.LG) \n[stat](https://arxiv.org/abs/2306.04952?context=stat) \n\n### References & Citations\n\n* [NASA 
ADS](https://ui.adsabs.harvard.edu/abs/arXiv:2306.04952)\n* [Google Scholar](https://scholar.google.com/scholar_lookup?arxiv_id=2306.04952)\n* [Semantic Scholar](https://api.semanticscholar.org/arXiv:2306.04952)\n\n[a](https://arxiv.org/static/browse/0.3.4/css/cite.css) export BibTeX citation Loading...\n\nBibTeX formatted citation\n-------------------------\n\n\u00d7\n\nData provided by:\n\n### Bookmark\n\n [![BibSonomy logo](/static/browse/0.3.4/images/icons/social/bibsonomy.png)](http://www.bibsonomy.org/BibtexHandler?requTask=upload&url=https://arxiv.org/abs/2306.04952&description=Entropy-basedTrainingMethodsforScalableNeuralImplicitSampler \"Bookmark on BibSonomy\")[![Reddit logo](/static/browse/0.3.4/images/icons/social/reddit.png)](https://reddit.com/submit?url=https://arxiv.org/abs/2306.04952&title=Entropy-basedTrainingMethodsforScalableNeuralImplicitSampler \"Bookmark on Reddit\")\n\n Bibliographic Tools\n\nBibliographic and Citation Tools\n================================\n\n Bibliographic Explorer Toggle\n\nBibliographic Explorer _([What is the Explorer?](https://info.arxiv.org/labs/showcase.html#arxiv-bibliographic-explorer))_\n\n Litmaps Toggle\n\nLitmaps _([What is Litmaps?](https://www.litmaps.co/))_\n\n scite.ai Toggle\n\nscite Smart Citations _([What are Smart Citations?](https://www.scite.ai/))_\n\n Code, Data, Media\n\nCode, Data and Media Associated with this Article\n=================================================\n\n Links to Code Toggle\n\nCatalyzeX Code Finder for Papers _([What is CatalyzeX?](https://www.catalyzex.com/))_\n\n DagsHub Toggle\n\nDagsHub _([What is DagsHub?](https://dagshub.com/))_\n\n GotitPub Toggle\n\nGotit.pub _([What is GotitPub?](http://gotit.pub/faq))_\n\n Links to Code Toggle\n\nPapers with Code _([What is Papers with Code?](https://paperswithcode.com/))_\n\n ScienceCast Toggle\n\nScienceCast _([What is ScienceCast?](https://sciencecast.org/welcome))_\n\n Demos\n\nDemos\n=====\n\n Replicate Toggle\n\nReplicate 
_([What is Replicate?](https://replicate.com/docs/arxiv/about))_\n\n Spaces Toggle\n\nHugging Face Spaces _([What is Spaces?](https://huggingface.co/docs/hub/spaces))_\n\n Spaces Toggle\n\nTXYZ.AI _([What is TXYZ.AI?](https://txyz.ai/))_\n\n Related Papers\n\nRecommenders and Search Tools\n=============================\n\n Link to Influence Flower\n\nInfluence Flower _([What are Influence Flowers?](https://influencemap.cmlab.dev/))_\n\n Connected Papers Toggle\n\nConnected Papers _([What is Connected Papers?](https://www.connectedpapers.com/about))_\n\n Core recommender toggle\n\nCORE Recommender _([What is CORE?](https://core.ac.uk/services/recommender))_\n\n* Author\n* Venue\n* Institution\n* Topic\n\n About arXivLabs\n\narXivLabs: experimental projects with community collaborators\n=============================================================\n\narXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.\n\nBoth individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.\n\nHave an idea for a project that will add value for arXiv's community? 
[**Learn more about arXivLabs**](https://info.arxiv.org/labs/index.html).\n\n[Which authors of this paper are endorsers?](https://arxiv.org/auth/show-endorsers/2306.04952) | [Disable MathJax](javascript:setMathjaxCookie\\(\\)) ([What is MathJax?](https://info.arxiv.org/help/mathjax.html))\n\n* [About](https://info.arxiv.org/about)\n* [Help](https://info.arxiv.org/help)\n\n* [Contact](https://info.arxiv.org/help/contact.html)\n* [Subscribe](https://info.arxiv.org/help/subscribe)\n\n* [Copyright](https://info.arxiv.org/help/license/index.html)\n* [Privacy Policy](https://info.arxiv.org/help/policies/privacy_policy.html)\n\n* [Web Accessibility Assistance](https://info.arxiv.org/help/web_accessibility.html)\n* [arXiv Operational Status](https://status.arxiv.org/) \n Get status notifications via [email](https://subscribe.sorryapp.com/24846f03/email/new) or [slack](https://subscribe.sorryapp.com/24846f03/slack/new)\n",
"llm_summary": "The article discusses Entropy-based Training Methods for Scalable Neural Implicit Samplers, a paper on scalable neural implicit samplers that uses entropy-based training methods to improve sampling efficiency. The paper provides BibTeX citation information and links to various tools for bibliographic exploration, such as Litmaps and scite.ai. Additionally, the article highlights related papers and recommenders, including Influence Flower and Connected Papers.\n\nThe article also mentions arXivLabs, an experimental project that allows collaborators to develop and share new arXiv features directly on the website. It invites readers to learn more about arXivLabs and collaborate on projects that add value for arXiv's community.",
"url": "https://arxiv.org/abs/2306.04952",
"summary": "These benchmarks include sampling from 2D targets, Bayesian inference, and sampling from high-dimensional energy-based models (EBMs)."
},
{
"content": "Title: Just a moment...\n\nURL Source: https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/ell2.13308\n\nMarkdown Content:\nietresearch.onlinelibrary.wiley.com\n-----------------------------------\n\nVerifying you are human. This may take a few seconds.\n-----------------------------------------------------\n\nietresearch.onlinelibrary.wiley.com needs to review the security of your connection before proceeding.\n",
"llm_summary": "The content appears to be an attempt to access an academic article titled \"Just a moment...\" from the IET Research website, which is hosted on Wiley Online Library. However, before accessing the article, there are security checks being performed to ensure the connection is secure. The user will have to wait for a few seconds while these checks are processed.",
"url": "https://www.reddit.com/r/LocalLLaMA/comments/1fy9k8x/entropy_based_sampling_and_parallel_cot_decoding/",
"summary": "Does anyone have some good references in the literature this is inspired by? I found CoT decoding paper from deepmind but that's all I got."
},
{
"content": "Title: Entropy-based Sampling for Abstractive Multi-document Summarization in Low-resource Settings\n\nURL Source: https://aclanthology.org/2023.inlg-main.9.pdf\n\nMarkdown Content:\n> Proceedings of the 16th International Natural Language Generation Conference , pages 123\u2013133 September 11\u201315, 2023. \u00a92023 Association for Computational Linguistics\n\n123 Entropy-based Sampling for Abstractive Multi-document Summarization in Low-resource Settings \n\n## Laura Mascarell and Ribin Chalumattu and Julien Heitmann \n\n## ETH Zurich \n\n## {lmascarell,cribin,julien.heitmann}@inf.ethz.ch \n\n## Abstract \n\nResearch in Multi-document Summarization (MDS) mostly focuses on the English language and depends on large MDS datasets that are not available for other languages. Some of these approaches concatenate the source documents, resulting in overlong model inputs. Existing transformer architectures are unable to process such long inputs entirely, omitting documents in the summarization process. Other solutions address this issue by implementing multi-stage approaches that also require changes in the model architecture. In this paper, we introduce various sampling approaches based on infor-mation entropy that allow us to perform MDS in a single stage. These approaches also con-sider all source documents without using MDS training data nor changing the model\u2019s archi-tecture. Besides, we build a MDS test set of German news articles to assess the performance of our methods on abstractive multi-document summaries. Experimental results show that our entropy-based approaches outperform previous state-of-the-art on German MDS, while still re-maining primarily abstractive. We release our code 1 and MDS test set 2 to encourage further research in German abstractive MDS. \n\n## 1 Introduction \n\nIn light of the ever-growing volume of available information, it becomes essential to be able to au-tomatically summarize information from several sources. 
Multi-document Summarization (MDS) aims at condensing the most important information from different documents. Despite the advances in single-document summarization (Zhang et al., 2020), summarizing multiple related documents re-mains a greater challenge due to its input length and the presence of redundant information (Fan et al., 2019; Song et al., 2022). Therefore, some research focuses on implementing multi-stage approaches \n\n> 1Link to GitHub repository.\n> 2Link to Multi-GeNews repository.\n\nthat first identify the relevant information to then feed it into a summarization model (Lebanoff et al., 2018; Liu and Lapata, 2019a). More recent works utilize pre-trained language models (Lewis et al., 2020; Raffel et al., 2020; Xiao et al., 2022) fine-tuned for the summarization task and feed them with the source documents concatenated (Johner et al., 2021; Xiao et al., 2022). However, these approaches pose two major issues. First, concate-nated inputs exceeding the length limit of the model are truncated, which might lead to the omission of entire documents. Second, they rely on multi-document datasets that are scarce or unavailable in languages other than English. Hokamp et al. (2020) introduce a decoding strat-egy that adapts single- to multi-document summa-rization without using additional training data nor applying changes to the single-input model archi-tecture. At every decoding timestep, it averages the output probabilities of a single-document sum-marization model for each individual document, combining them into a single output. Instead of av-eraging all log-probabilities, which favours highly frequent tokens, we propose to make a more in-formed decision. In particular, we leverage entropy to measure the model confidence in the next to-ken prediction and thus select the most informative output. We implement different entropy-based ap-proaches and evaluate their performance on MDS of German text. 
Our main contributions are: \u2022 We present different entropy-based sampling approaches for the MDS task. These are spe-cially well-suited for languages like German that have limited or unavailable MDS data. \u2022 We build and release a new German MDS test set in the news domain that is more suitable for evaluating abstractive summarization than the existing MDS German dataset auto-hMDS (Zopf, 2018). We expect our dataset to foster research on German abstractive MDS. 124 \u2022 The experimental results demonstrate that our method achieves the state-of-the-art perfor-mance in German abstractive MDS in terms of ROUGE scores and manual evaluation. \n\n## 2 Related Work \n\nMulti-document Summarization Some prior work approaches MDS as a multi-stage process (Liu et al., 2018; Zhu et al., 2021) that first ex-tracts salient sentences from the source documents to then distill them into a summary using different methods such as graph-based modeling (Li et al., 2020; Chen et al., 2021) or modifying the attention mechanism (Perez-Beltrachini and Lapata, 2021). In contrast, Lebanoff et al. (2018) highlights the importance of adapting Single-document Summa-rization (SDS) models to summarize multiple doc-uments and propose an approach that adapts their attention weights. Similarly, other works propose various changes in the model architecture (Liu and Lapata, 2019a; Elsahar et al., 2021). The main disadvantage of these approaches is that they are tailored to specific model architectures. More recently, Xiao et al. (2022) introduce PRIMERA, a pre-trained model for MDS that can be applied in zero- or few-shot settings. The source documents are concatenated and fed into the Long-former Transformer model, which can handle long inputs up to 4,096 or even 16k tokens with current GPUs. Nevertheless, PRIMERA is only available for English and there is no alternative for other lan-guages. Similarly, Johner et al. 
(2021) performs MDS on German text using the pre-trained lan-guage model BART (Lewis et al., 2020) and con-catenating the source documents as input. 3 How-ever, BART input length is restricted to 1,024 to-kens, which may end up excluding entire docu-ments from the summarization process. Overall, our entropy-based approaches present the following advantages over prior work: (a) they do not require a pre-step to extract salient informa-tion (b) nor changes in the SDS model architecture with (c) no need for additional MDS training data, and (d) still considering all source documents in the summarization process. This work is built upon the dynamic ensemble approach from Hokamp et al. (2020), improving the decoding strategy by sam-pling on more informative predictions. \n\n> 3To the best of our knowledge, Johner et al. (2021) is the only work that tackles MDS in German besides Zopf (2018) with the auto-hMDS dataset.\n\nEntropy in Summarization Xu et al. (2020) leverage entropy to analyze the performance of Transformer-based models in the SDS task. Later, van der Poel et al. (2022) use entropy to determine when the model is uncertain about the next token prediction and apply Pointwise Mutual Information (PMI) instead to alleviate hallucination. Similarly, we apply the conditional PMI approach to MDS. Instead of finding the conditional entropy thresh-old through hyperparameter search as in van der Poel et al. (2022), we apply maximum probabilistic information entropy (Li et al., 2021). This novel entropy definition has been successfully used to reduce the size of image datasets by selecting the most informative samples. \n\n## 3 Entropy Background \n\nIn information theory, the entropy of a random vari-able denotes the amount of information, or lack thereof (i.e. uncertainty), associated with its possi-ble outcomes. Thus, given a probability distribution \n\np over all possible outcomes x1, . . . 
, x n of a ran-dom variable X, we quantify the entropy of X\n\nusing the standard Shannon entropy equation: \n\nH(X) = \u2212\n\n> n\n\nX\n\n> i=1\n\np(xi) log p(xi) (1) The entropy is then maximum for uniform dis-tributions, where all outcomes are equally likely, indicating high uncertainty. In the context of automatic text generation, we can leverage entropy to quantify the confidence of probabilistic models in their predictions (Xu et al., 2020; van der Poel et al., 2022). More specifically, summarization models aim at generating a sum-mary string y\u2217 of a given source document x that maximizes the scoring function: \n\ny\u2217 = argmax \n\n> y\u2208Y\n\nlog p(y | x), (2) where y is the sequence of tokens y0, . . . , y T\n\nfrom the model vocabulary V generated at ev-ery timestep t, 0 < t < T . During decod-ing, that is, the prediction of each sequence token \n\nyt \u2208 V , the model provides a probability distribu-tion p(\u00b7 | y<t , x) over V that also takes into account the context of the previous tokens. According to Equation 1, we can then use such distribution to measure the model\u2019s confidence in the prediction: 125 H(p(\u00b7 | y<t , x)) = \u2212 X\n\n> y\u2208V\n\n\u0010\n\np(y | y<t , x)\n\n\u00d7 log p(y | y<t , x)\n\n\u0011 (3) \n\n## 4 Entropy-based MDS \n\nGiven a set of documents X = { x1, . . . , xn}, the dynamic-ensemble approach ( DynE ) described in Hokamp et al. 
(2020) adapts single- to multi-document summarization as follows: at every de-coding timestep t, it computes the output prob-abilites for each individual source document us-ing a single-document summarization model; next, it averages these outputs to obtain a single log-probability distribution assigned to the token y:\n\np(y | X ) = 1\n\n|X | \n\nX\n\n> x\u2208X\n\np(y | y<t , x) (4) We leverage entropy information to adapt the \n\nDynE approach and implement various sampling strategies that select the most informative output at each decoding timestep t.\n\nMinimum Entropy ( Hmin ) Based on the hy-pothesis that low entropy indicates a higher con-fidence in the prediction, this approach picks the token prediction of the model instance with the lowest entropy min 1\u2264i\u2264|X | H(p(\u00b7 | y<t , xi)) . Note that this approach does not guarantee certainty in all token predictions. In those cases where all model instances exhibit high uncertainty, the se-lected instance could still have high entropy and thus provide an arbitrary prediction. \n\nMax-predicted Probability Threshold ( Hth ) Li et al. (2021) focus on the maximum-predicted probability pmax to determine the model\u2019s confi-dence and reduce redundancy in datasets of im-ages. Specifically, the authors state that a low max-imum probability indicates high entropy and con-sequently, low confidence in the prediction. There-fore, they propose to measure entropy as: \n\nH(X) = \u2212pmax log pmax (5) where pmax = max x\u2208X p(x). Figure 1 plots Equa-tion 5, showing the correlation between informa-tion entropy and pmax . Note that the entropy is highest when pmax is 0.35, with a positive corre-lation for probabilities below this threshold and a negative correlation for probabilities above it. 
\n\n0 0.2 0.4 0.6 0.8 1\n\n0\n\n0.2\n\n0.4\n\n0.6 pmax = 0.35 \n\npmax \n\n> H\n> Figure 1: Plot of the maximum probabilistic information entropy (Equation 5), which illustrates the correlation between the maximum-predicted probability pmax and information entropy, when 0\u2264pmax \u22641.\n\nInspired by Li et al. (2021) approach, we ap-ply maximum probabilistic information entropy in MDS, assuming that values of pmax below the threshold 0.35 indicate that the model is essentially guessing. At each decoding step, we obtain the maximum-predicted probability for each input doc-ument in X and proceed as follows: a) we choose the prediction with the highest \n\npmax among those above the threshold. The higher the probability, the lower the entropy. b) if all probabilities are below the threshold, we conclude that there is not enough information for the current prediction and we average their log-probabilities as in Equation 4. \n\nMutual Information Decoding ( Hpmi ) Several works apply mutual information approaches dur-ing decoding to favor more specific and informa-tive outputs (Li et al., 2016; Takayama and Arase, 2019). Later, van der Poel et al. (2022) observe that highly frequent tokens often indicate halluci-nated content and implement a decoding strategy that integrates mutual information to mitigate hal-lucination in single-document summarization. In particular, their approach optimizes for Pointwise Mutual Information (PMI) when the model is un-certain about its prediction, generating summaries that are more faithful to the source document: \n\np(y|x) = log p(y|y<t , x) \u2212 \u03bb log p(y|y<t ), (6) where 0 < \u03bb < 1 to avoid excessively penal-izing high-frequent tokens, which could lead to 126 ungrammatical outputs (Li et al., 2016). Based on these findings, we propose an additional varia-tion of our Hth approach, which applies PMI when there is no certainty in any of the predictions, that is, all probabilities are below the 0.35 threshold. 
4\n\n## 5 Datasets \n\nThis section describes the datasets used to train and evaluate our MDS approaches. Specifically, we consider three pre-existing German datasets that are suitable for single-document\u2014GeWiki (Frefel, 2020) and 20m (Rios et al., 2021)\u2014and multi-document summarization\u2014auto-hMDS (Zopf, 2018). Moreover, we build Multi-GeNews, a MDS test set in the news domain that is specifically tai-lored for abstractive MDS. \n\n5.1 Single-document Summarization GeWiki This is the largest dataset available for single-document abstractive summarization in Ger-man, consisting of 240k summary-article pairs. Here, the lead text of Wikipedia articles are ex-tracted as summaries of the rest of the article. \n\n20m A single-document summarization dataset with 18 ,305 news articles and their correspond-ing manually-written summaries collected from the Swiss newspaper 20 Minuten (\u201820 Minutes\u2019). \n\n5.2 Multi-document Summarization auto-hMDS This multi-document summariza-tion dataset consists of 2 ,210 summaries from Wikipedia leads as in GeWiki and 10 ,454 source documents. The documents were obtained by au-tomatically querying the internet with summary sentences, resulting in a highly extractive dataset. Nonetheless, we consider it in our experiments for comparison with the related work. Despite being the largest MDS dataset in German, auto-hMDS is significantly smaller than its English counterpart Multi-News (Fabbri et al., 2019). 5\n\nMulti-GeNews Due to the lack of abstractive MDS datasets in German, we built a MDS test set to assess the performance of the proposed approaches. The data comes from the news portal of the Swiss media company SRF 6 and consists of news articles published between January and March 2020. 
\n\n> 4We use a \u03bbof 0.25 in our experiments, which we manually selected based on the impact of various values on the output.\n> 5over 56k summaries and 250k source documents.\n> 6https://www.srf.ch/news\n\nThe articles published on the SRF website are of-ten followed by a Mehr zum Thema (\u2018More on the topic\u2019) section with related articles on the subject. To build our test set, we first utilize this section to obtain clusters of related articles. Specifically, we collect the related article suggestions and fil-ter those published within one day of each other to ensure that they cover the same news. Next, we gen-erate the reference summaries, which will be used to compute the automatic scores, concatenating the lead paragraphs of the articles in each cluster. 7\n\nHence, the reference summaries are a combination of lead texts. We finally filter salient sentences and remove duplicated information from the refer-ence summaries using a pretrained extractive sum-marization model for German text. To build this model, we adapted the BertExt architecture (Liu and Lapata, 2019b) for the German language. 8 The adaption involved initializing the Bert component of the BertExt architecture using a German Bert checkpoint 9 and subsequently fine-tuning the entire model on the newswire 20m dataset. The resulting dataset consists of 754 unique ar-ticles grouped into 402 clusters. Each cluster con-tains two to six articles with a median of four ar-ticles and the corresponding generated reference summary. 10 The average length of the articles and summaries are 593 and 61 tokens, respectively. \n\n## 6 Experiments \n\nWe evaluate the performance of the entropy-based sampling approaches on our Multi-GeNews and the auto-hMDS datasets in terms of automatic ROUGE scores (Lin, 2004) and extractive fragment density \n\n\u03c1 (Grusky et al., 2018). 
Since we focus on abstrac-tive summarization, the latter allows us to measure the degree of extractiveness of the summaries, and in turn, abstractiveness\u2014higher \u03c1 values indicate that the summary is more extractive and contains larger text chunks from the source article. Further-more, we collect human annotations on a subset of the Multi-GeNews to assess the faithfulness of the generated summaries and get a deeper understand-ing on their quality (Section 7). \n\n> 7Similarly to the GeWiki dataset, we consider the lead paragraph of an article as its summary.\n> 8https://github.com/nlpyang/BertSum\n> 9https://huggingface.co/dbmdz/ bert-base-german-uncased\n> 10 Although an article can belong to different clusters, there are no identical clusters with the same articles.\n\n127 100 words 200 words \n\nMethod R1 \u2191 R2 \u2191 RL \u2191 \u03c1\u2193 R1 \u2191 R2 \u2191 RL \u2191 \u03c1\u2193\n\nmBART concat 18.4 6.2 12.5 27.9 24.5 7.7 15.0 35.6 mBART + DynE 23.4 6.9 15.1 2.2 26.8 7.0 16.0 1.9 mBART + Hmin 20.7 8.6 14.7 17.9 26.9 10.4 17.4 16.6 mBART + Hth 21.5 9.0 15.3 16.1 27.8 10.8 18.0 14.7 mBART + Hpmi 16.5 6.9 12.3 12.5 21.0 7.9 14.5 10.0 \n\nTable 1: Performance of the entropy-based approaches and the baseline models on the auto-hMDS dataset in terms of ROUGE scores and extractive fragment density \u03c1. The mBART model is fine-tuned on the auto-hMDS dataset by concatenating the source articles into a single input. Similarly, the mBART baseline is fed with the concatenated source articles. Overall, Hth achives the highest performance among the various methods evaluated. \n\n100 words \n\nMethod R1 \u2191 R2 \u2191 RL \u2191 \u03c1\u2193\n\nmBART concat 23.0 6.0 14.8 9.23 mBART + DynE 22.2 4.8 14.9 1.5 mBART + Hmin 23.4 5.6 15.0 2.46 mBART + Hth 24.5 6.2 15.6 2.72 mBART + Hpmi 23.9 7.2 16.1 2.78 \n\nTable 2: Performance of the entropy-based approaches and the baselines on our Multi-GeNews test set. 
The mBART model is fine-tuned on the 20m dataset as de-scribed in Section 6.1. The mBART baseline receives as input the source articles concatenated. \n\n6.1 Models \n\nThis section describes the implementation details to build the models used in our experiments. Namely, the two summarization models, individually fine-tuned on the newswire 20m and the auto-hMDS datasets, and the language model used by the point-wise mutual information decoding approach Hpmi .\n\nSummarization Models We evaluate the perfor-mance of our MDS approaches using two sum-marization models fine-tuned on the news domain dataset 20m 11 and the MDS dataset auto-hMDS, re-spectively. The latter allows us to compare the per-formance of our approaches against prior work on German MDS. The models are based on mBART, a multilingual sequence-to-sequence transformer-based model that effectively handles multiple lan-\n\n> 11 Since the GeWiki dataset is significantly larger than the in-domain 20m, we also considered to build a model using both datasets through behavioral fine-tuning. However, the performance on the single-document summarization task was inferior than simply fine-tuning on 20m. Several factors could contribute to this results such as a domain shift or a discrep-ancy in summary length distribution.\n\nguages including German (Liu et al., 2020) and initialized with the facebook/mbart-large-cc25 \n\ncheckpoint available at the Hugging Face Hub. 12 \n\nIn particular, we fine-tune the model on the 20m dataset for 10 epochs and batch size of 2 using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 3e \u2212 5. The gradient ac-cumulation steps is 16, resulting in a total effec-tive batch size of 32. To fine-tune the single-input mBART model with the multi-document summa-rization dataset auto-hMDS, we follow the work in Johner et al. (2021) and concatenate the source articles in a single input. 
We train the model with a learning rate of 3e\u22125 and a batch size of 2 for 5 epochs. \n\nLanguage Model We build a language model to apply the mutual information decoding approach. Specifically, we use the GPT-2 (Radford et al., 2019) checkpoint from Hugging Face 13 and fine-tune it on the same in-domain data as the corresponding mBART summarization model. To ensure that both mBART and GPT-2 models share the same vocabulary, we train GPT-2 using the same tokenizer as mBART. We then fine-tune it for 3 epochs using the AdamW optimizer with learning rate of 5e\u22124 and batch size of 16. We set the maximum context length of the model to 256 tokens, since we do not generate longer summaries than that. The gradient accumulation steps is set to 8, resulting in a total effective batch size of 128. \n\n> 12 https://huggingface.co/facebook/mbart-large-cc25\n> 13 https://huggingface.co/gpt2\n\n6.2 Results \n\nTable 2 compares the performance of our entropy-based methods against the DynE (Hokamp et al., 2020) and mBART baselines on our Multi-GeWiki test set. We use the single-document summarization mBART model fine-tuned on the in-domain dataset 20min (see details in Section 6.1). To feed the mBART baseline with multiple documents, we concatenate them in a single input as in Johner et al. (2021). Overall, the automatic ROUGE scores indicate that Hth and Hpmi achieve the highest performance. Similarly, Hth outperforms the other approaches on the auto-hMDS dataset (see Table 1). \n\n| Rank | Approach | Summary |\n| --- | --- | --- |\n| Ref. | | Das Coronavirus beherrscht die Medien \u2013 doch das sei derzeit kaum angebracht, sagt Wissenschaftsredaktor H\u00e4usler. Innerhalb eines Tages kletterte die Zahl der best\u00e4tigten Infektionen in China auf 2744. In den umliegenden L\u00e4ndern gibt es bereits Erkrankte. [The coronavirus dominates the media - but that is hardly appropriate at the moment, says science editor H\u00e4usler. Within one day, the number of confirmed infections in China climbed to 2744, and there are already people ill in surrounding countries.] |\n| 1 | Hth | Die Zahl der Corona-Toten in China ist auf 80 gestiegen. Die Zahl der Touristen, die sich mit dem Virus infiziert haben, ist auf \u00fcberschaubar. Die Tourismusbranche rechnet nicht damit, dass das Virus sich auf der ganzen Welt ausbreitet. SRF-Wissenschaftler Thomas H\u00e4usler und Daniel Sulzer von Interlaken erkl\u00e4ren, wie gef\u00e4hrlich die Situation f\u00fcr die chinesische Bev\u00f6lkerung ist. [The number of Covid deaths in China has risen to 80. The number of tourists infected with the virus is on manageable. The tourism industry does not expect the virus to spread around the world. SRF scientist Thomas H\u00e4usler and Daniel Sulzer from Interlaken explain how dangerous the situation is for the Chinese population.] |\n| 2 | Hpmi | In China ist die Zahl der Corona-F\u00e4lle stark angestiegen. In der Volksrepublik gibt es bereits \u00fcber 700 F\u00e4lle. Die Zahl der Touristen, die sich mit dem Virus infiziert haben, ist doppelt so hoch wie bei der saisonalen Grippe. Ein Tourismusexperte erkl\u00e4rt, wie gef\u00e4hrlich die Situation f\u00fcr die chinesische Bev\u00f6lkerung ist. [In China, the number of Covid cases has risen sharply. There are already over 700 cases in the People\u2019s Republic. The number of tourists infected with the virus is twice as high as the seasonal flu. A tourism expert explains how dangerous the situation is for the Chinese population.] |\n| 3 | DynE | Die Zahl der Corona-F\u00e4lle in der Schweiz steigt, die Zahl der Infizierten nimmt zu. Die Gefahr, sich mit dem Virus anzustecken, ist noch nicht bedroht. Trotzdem steigen die Zahlen wieder an. Laut Experten ist die Gefahr gr\u00f6sser, dass sich das Virus noch weiter ausbreitet. [The number of Covid cases in Switzerland is rising, the number of infected persons is increasing. The danger of contracting the virus is not yet threatened. Nevertheless, the numbers are rising again. According to experts, there is a greater risk that the virus will spread even further.] |\n| 4 | Hmin | Der Tourismusverband Chinas rechnet mit einem R\u00fcckgang der Touristenzahlen. In Interlaken und Luzern gibt es nur noch wenige Berichte \u00fcber das Coronavirus. In der Schweiz gibt es aber Hoffnung: Vermehrt Japaner und Chinesen berichten von Infektionen mit dem Virus. Ein Tourismusdirektor glaubt, dass der Tourismus in der Volksrepublik eine globale Pandemie ausl\u00f6sen k\u00f6nnte. [The Tourism Association of China expects a decline in tourist numbers. In Interlaken and Lucerne there are only few reports of the coronavirus. In Switzerland, however, there is hope: Increasing numbers of Japanese and Chinese report infections with the virus. One tourism director believes that tourism in the People\u2019s Republic could trigger a global pandemic.] |\n\nTable 3: Example of the summary ranking task for the input articles 18126230, 18127577, and 18130289, where at least two annotators agreed on the ranking position for each summary. In contrast to the entropy-based approaches, DynE is susceptible to generate overly general summaries. \n\nAbstractiveness of the Summaries Table 1 and Table 2 reveal that DynE summaries are the most abstractive (lowest \u03c1 scores). In contrast, concatenating the source articles as input results in highly extractive summaries, 14 and the gap is even more significant with mBART fine-tuned on auto-hMDS, since the dataset is highly extractive (Table 1). 
Although we aim at generating abstractive summaries, the DynE approach is prone to generate highly frequent tokens, 15 resulting in general summaries that fail to consider relevant and specific information from the source articles (see example in Table 3). Instead, our entropy-based approaches generate summaries with a moderate level of abstractiveness that also include concrete information. \n\n> 14 The results on the extractiveness of mBART summaries are also supported in Johner et al. (2021).\n> 15 Since the DynE approach averages the log-probability outputs at each decoding step, common tokens obtain higher probabilities and are more likely to be predicted.\n\n## 7 Human Evaluation \n\nWe recruited three native German speakers to perform a manual evaluation on the Multi-GeNews test set. 16 This evaluation task is twofold: (1) assess the relative quality of the summaries, ranking them accordingly (Goyal et al., 2022; Ermakova et al., 2019) and (2) the faithfulness of the generated summaries to the source articles (Krishna et al., 2023), that is, whether the information presented in the summaries is supported by the articles. \n\nFigure 2: Evaluation of the quality of the summaries among the different approaches (DynE, Hmin, Hth, Hpmi). We only consider those instances where the majority of the annotators agreed on the (a) relative or (b) absolute ranking position.\n\n> (a) Heatmap illustrating the distribution of relative preference among the approaches. The x-axis indicates the preferred approach over the y-axis. Hmin and Hth are the most favoured approaches, whereas Hpmi ranks as the least preferred.\n> (b) Percentage of instances (frequency, %) where each approach ranked at the top and the bottom positions, according to the annotators. While Hth summaries were consistently rated among the top positions, the annotators rated Hpmi summaries low.\n\n
Since the task requires reading a considerable amount of text, we ask the participants to annotate a sample of the MDS test set. This sample comprises 20 randomly selected instances that meet the following criteria: (a) each instance consists of three source articles, (b) the generated summaries end with a punctuation mark to avoid incomplete sentences, and (c) the token-level edit distance among the summaries is above five to ensure lexical differences. For each participant, we randomly shuffle the evaluation instances and the required annotations to avoid any biases. Additionally, we do not provide them with any information about which specific approach generated each summary. \n\n> 16 The participants received a voucher worth CHF 75.- as compensation for their participation.\n\nSummary Ranking Task The objective of this task is to gain insights into human preferences of the generated summaries. For each instance (i.e. a set of related articles), we ask the participants to rank the generated summaries according to the informativeness of the summaries and their preference. That is, they must evaluate how effective the summaries are at capturing the essential information from the three source articles. Note that we are not evaluating other linguistic aspects such as cohesion or fluency, since we do not implement any specific methods to improve those. Table 3 provides an example of the annotation task. This task is especially challenging when multiple summaries either contain similar information or suffer from hallucinations, or both. In fact, the annotators reported that it was hard to decide the rank of a summary between two consecutive positions. This negatively impacted the inter-annotator agreement, resulting in a final Kendall\u2019s tau coefficient of 0.22. In the analysis of this task, we concentrate on the relative performance of the approaches and only consider instances with a majority agreement among annotators. 
Figure 2a illustrates the relative preference among the different approaches in this ranking task. The results demonstrate that the summaries from the approaches Hmin and Hth are consistently rated higher than the others. In contrast, the Hpmi summaries are the least preferred. These results are also supported in Figure 2b, where we compare the frequency with which each approach was ranked within the top two and the bottom two positions. \n\n| Approach | Factual | Text span |\n| --- | --- | --- |\n| Hth | F | Donald Trump hielt sich in der Nacht auf Mittwoch in den beiden Kammern des US-Kongresses seine dritte Rede ab. [Donald Trump delivered his third speech to both chambers of the U.S. Congress on Wednesday night.] |\n| Hth | F | Die Rede ist von einem Triumphgehabe gegen die Demokraten. [There is talk of triumphant action against the Democrats.] |\n| Hth | F | Das Verfahren gegen Trump ist nach wie vor im Gange. [The case against Trump is still ongoing.] |\n| Hpmi | F | Donald Trump hielt sich in den USA nicht an die Corona-Regeln. [Donald Trump did not follow the Covid rules in the USA.] |\n| Hpmi | F | Die demokratische Mehrheit im Kongress hielt sich dagegen und sprach Trump ab. [The Democratic majority in Congress held against this and absolved Trump.] |\n| Hpmi | T | Die Rede ist von einem Triumph f\u00fcr Trump. [The talk is of a triumph for Trump.] |\n\nTable 4: Example of the faithfulness annotation task. The boolean in the second column represents whether the text span is factual (T=True) or not (F=False) based on the majority agreement among the annotators. Here, the Hth summary was ranked at the top positions of the ranking and Hpmi at the bottom positions, even though the latter has a text span annotated as factual. The highlighted text indicates the common tokens between Hth and Hpmi until Hpmi applies PMI, hallucinating on the Covid virus, although Covid is not even mentioned in the source articles. \n\nFurthermore, Figure 2b shows that while the baseline summaries DynE receive mixed ratings, the Hth summaries are consistently ranked in the top positions. 
This indicates a consistent preference for the Hth summaries over the baseline. \n\nFaithfulness Annotation Task van der Poel et al. (2022) leverage PMI to improve faithfulness and evaluate it in terms of automatic metrics (Section 4). The goal of this annotation task is to manually evaluate the faithfulness of our Hpmi approach, which applies PMI to MDS, and compare it to the other proposed approaches that do not specifically address faithfulness. Specifically, we follow the guidelines described in Krishna et al. (2023) and split the summaries into text spans to ensure lower inter-annotator variance. 17 We then ask the annotators to judge whether each span is faithful to the source articles, that is, the statements can be verified against the articles. The final Fleiss\u2019 \u03ba (Fleiss, 1971) inter-annotator agreement is 0.62. \n\n> 17 Although the guidelines mainly refer to long summaries of at least 150 words, we found them also useful in our setting.\n\nOverall, the annotations indicate that hallucination is a general issue in all generated summaries. To evaluate the impact of Hpmi on hallucination, we only consider those annotations where at least two annotators agree on the factuality label. The results show that Hmin and Hth obtain a factuality rate of 36% and 33.3%, respectively, while Hpmi achieves a slightly higher factuality rate of 36.2%. Given the small size of the evaluation sample, we conclude that there is no significant improvement of factuality with the Hpmi approach on this task. \n\nSince Hpmi is an enhanced version of Hth, and Hth is consistently preferred over Hpmi (Figure 2), we delve deeper into cases where Hpmi shows an improvement in factuality, yet it receives a lower rating than Hth. The results indicate that Hpmi indeed redirects the prediction of the rest of the summary, especially when applied early on as stated in Li et al. (2016). However, it does not necessarily address the issue of hallucination. 
For example, the first text span of Hth in Table 4 hallucinates the moment when the speech occurs, Nacht auf Mittwoch (\u2018Wednesday night\u2019), and it is therefore annotated as not factual. In contrast, Hpmi generates a sentence about the Covid rules. However, none of the source articles refer to this topic, 18 which results in a more severe hallucination. \n\n> 18 Source article IDs: 18163721, 18160037, and 18160205.\n\n## 8 Conclusion \n\nIn this work, we tackle Multi-document Summarization (MDS) in low-resource settings where there is a lack of MDS training data. We therefore present various sampling approaches built upon prior works that use single-document summarization models for the MDS task. Specifically, we leverage information entropy as a metric to measure the model certainty in each token prediction. The experimental results on German MDS show that our Hth approach, which specifically applies maximum probabilistic information entropy, achieves the state of the art in German abstractive MDS. In our experiments, we also assessed an extended version of the Hth approach that applies Pointwise Mutual Information (PMI) when all predictions exhibit uncertainty. Although PMI has been used in prior work to address hallucination, we observe in the manual evaluation that PMI changes the prediction of the rest of the summary, but it does not inherently tackle hallucination. Future work should focus on addressing the issue of hallucination in automatic summarization, including further research on the efficacy of PMI to mitigate hallucinations. Additionally, it would be interesting to explore alternative approaches to enhance the Hth approach when there is uncertainty in the prediction. Finally, we built an MDS test set of German news articles that will help the research community to evaluate abstractive MDS on German text. 
\n\n## Ethics Statement \n\nHuman Annotation We recruited the annotators for the manual evaluation task on a voluntary basis and provided them with information about the goals and scope of the task. The data was collected anonymously such that no conclusion can be drawn about any particular annotator. This human evaluation obtained the corresponding ethical approval from the Ethics Commission of ETH Zurich university (EK-2023-N-37). \n\nText Generation Models Ethical considerations documented for natural language generation systems (Smiley et al., 2017; Kreps et al., 2022) also apply to our work. We do not anticipate any additional concerns. \n\nSupplementary Materials Availability Statement: Source code for the presented entropy-based sampling approaches 19 in Section 4 and the Multi-GeNews dataset 20 described in Section 5.2 are available from GitHub. \n\n## Acknowledgements \n\nThis project is supported by Ringier, TX Group, NZZ, SRG, VSM, viscom, and the ETH Zurich Foundation. \n\n> 19 Link to GitHub repository.\n> 20 Link to Multi-GeNews repository.\n\n## References \n\nMoye Chen, Wei Li, Jiachen Liu, Xinyan Xiao, Hua Wu, and Haifeng Wang. 2021. SgSum: transforming multi-document summarization into sub-graph selection. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4063\u20134074, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.\n\nHady Elsahar, Maximin Coavoux, Jos Rozen, and Matthias Gall\u00e9. 2021. Self-supervised and controlled multi-document opinion summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1646\u20131662, Online. Association for Computational Linguistics.\n\nLiana Ermakova, Jean Val\u00e8re Cossu, and Josiane Mothe. 2019. A survey on evaluation of summarization methods. Information processing & management, 56(5):1794\u20131814.\n\nAlexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074\u20131084, Florence, Italy. Association for Computational Linguistics.\n\nAngela Fan, Claire Gardent, Chlo\u00e9 Braud, and Antoine Bordes. 2019. Using local knowledge graph construction to scale Seq2Seq models to multi-document inputs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4186\u20134196, Hong Kong, China. Association for Computational Linguistics.\n\nJ.L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378\u2013382.\n\nDominik Frefel. 2020. Summarization corpora of Wikipedia articles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6651\u20136655, Marseille, France. European Language Resources Association.\n\nTanya Goyal, Junyi Jessy Li, and Greg Durrett. 2022. News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356.\n\nMax Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708\u2013719, New Orleans, Louisiana. Association for Computational Linguistics.\n\nChris Hokamp, Demian Gholipour Ghalandari, Nghia The Pham, and John Glover. 2020. DynE: Dynamic ensemble decoding for multi-document summarization. arXiv preprint arXiv:2006.08748.\n\nTimo Johner, Abhik Jana, and Chris Biemann. 2021. Error analysis of using BART for multi-document summarization: A study for English and German language. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 391\u2013397, Reykjavik, Iceland (Online). Link\u00f6ping University Electronic Press, Sweden.\n\nSarah Kreps, R. Miles McCain, and Miles Brundage. 2022. All the news that\u2019s fit to fabricate: AI-generated text as a tool of media misinformation. Journal of Experimental Political Science, 9(1):104\u2013117.\n\nKalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep Dasigi, Arman Cohan, and Kyle Lo. 2023. LongEval: Guidelines for human evaluation of faithfulness in long-form summarization. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1650\u20131669, Dubrovnik, Croatia. Association for Computational Linguistics.\n\nLogan Lebanoff, Kaiqiang Song, and Fei Liu. 2018. Adapting the neural encoder-decoder framework from single to multi-document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4131\u20134141, Brussels, Belgium. Association for Computational Linguistics.\n\nMike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871\u20137880, Online. Association for Computational Linguistics.\n\nJiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110\u2013119, San Diego, California. Association for Computational Linguistics.\n\nWei Li, Xinyan Xiao, Jiachen Liu, Hua Wu, Haifeng Wang, and Junping Du. 2020. Leveraging graph to improve abstractive multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6232\u20136243, Online. Association for Computational Linguistics.\n\nYang Li, Jiachen Yang, and Jiabao Wen. 2021. Entropy-based redundancy analysis and information screening. Digital Communications and Networks.\n\nChin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74\u201381, Barcelona, Spain. Association for Computational Linguistics.\n\nPeter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations.\n\nYang Liu and Mirella Lapata. 2019a. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070\u20135081, Florence, Italy. Association for Computational Linguistics.\n\nYang Liu and Mirella Lapata. 2019b. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730\u20133740, Hong Kong, China. Association for Computational Linguistics.\n\nYinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726\u2013742.\n\nIlya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.\n\nLaura Perez-Beltrachini and Mirella Lapata. 2021. Multi-document summarization with determinantal point process attention. Journal of Artificial Intelligence Research, 71:371\u2013399.\n\nAlec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.\n\nColin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485\u20135551.\n\nAnnette Rios, Nicolas Spring, Tannon Kew, Marek Kostrzewa, Andreas S\u00e4uberli, Mathias M\u00fcller, and Sarah Ebling. 2021. A new dataset and efficient baselines for document-level text simplification in German. In Proceedings of the Third Workshop on New Frontiers in Summarization, pages 152\u2013161, Online and in Dominican Republic. Association for Computational Linguistics.\n\nCharese Smiley, Frank Schilder, Vassilis Plachouras, and Jochen L. Leidner. 2017. Say the right thing right: Ethics issues in natural language generation systems. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 103\u2013108, Valencia, Spain. Association for Computational Linguistics.\n\nYun-Zhu Song, Yi-Syuan Chen, and Hong-Han Shuai. 2022. Improving multi-document summarization through referenced flexible extraction with credit-awareness. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1667\u20131681, Seattle, United States. Association for Computational Linguistics.\n\nJunya Takayama and Yuki Arase. 2019. Relevant and informative response generation using pointwise mutual information. In Proceedings of the First Workshop on NLP for Conversational AI, pages 133\u2013138, Florence, Italy. Association for Computational Linguistics.\n\nLiam van der Poel, Ryan Cotterell, and Clara Meister. 2022. Mutual information alleviates hallucinations in abstractive summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5956\u20135965, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.\n\nWen Xiao, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. 2022. PRIMERA: Pyramid-based masked sentence pre-training for multi-document summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5245\u20135263, Dublin, Ireland. Association for Computational Linguistics.\n\nJiacheng Xu, Shrey Desai, and Greg Durrett. 2020. Understanding neural abstractive summarization models via uncertainty. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6275\u20136281, Online. Association for Computational Linguistics.\n\nJingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 11328\u201311339. Proceedings of Machine Learning Research (PMLR).\n\nFangwei Zhu, Shangqing Tu, Jiaxin Shi, Juanzi Li, Lei Hou, and Tong Cui. 2021. TWAG: A topic-guided Wikipedia abstract generator. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4623\u20134635, Online. Association for Computational Linguistics.\n\nMarkus Zopf. 2018. Auto-hMDS: Automatic construction of a large heterogeneous multilingual multi-document summarization corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).\n",
"llm_summary": "The provided text appears to be a list of references for research papers related to natural language processing (NLP), specifically summarization, text generation, and other subfields. The papers cover various topics such as abstractive summarization, multi-document summarization, and document-level text simplification.\n\nSome notable mentions include papers on improving multi-document summarization through referenced flexible extraction with credit-awareness, alleviating hallucinations in abstractive summarization using mutual information, and pre-training models for abstractive summarization. Other papers explore the use of determinantal point process attention, topic-guided Wikipedia abstract generation, and automatic construction of large heterogeneous multilingual multi-document summarization corpora.",
"url": "https://www.reddit.com/r/singularity/comments/1fyacda/engineers_are_evaluating_a_new_sampling_method/",
"summary": "Entropy Based Sampling and Parallel CoT Decoding. The goal is to use entropy to make context aware sampling. This should allow us to simulate ..."
},
{
"content": "Title: Entropy Sampling \u2014 How to balance \u201crelevance\u201d and \u201csurprise\u201d in recommendation\n\nURL Source: https://medium.com/@reika.k.fujimura/entropy-sampling-how-to-balance-relevance-and-surprise-in-recommendation-2223417a38ce\n\nPublished Time: 2023-02-23T22:27:54.733Z\n\nMarkdown Content:\nImagine opening your laptop and wandering around the internet for a while. You will probably find a lot of recommendations coming up in front of you suggesting \u201cthese are the best items for you to go next!\u201d. Reflecting upon these moments, how many times did you find them irrelevant, too biased, or terrible for you?\n\nActually, you might even find them boorish or equal to a misplaced targeted advertisement, customized for someone else. Otherwise, just after you bought a new air fryer you might find so many other ones filling up the page that look almost the same. Then you might wish you could tell them that you have just bought a brand new nice one.\n\nIt is often difficult to balance relevance and diversity in real-world applications \u2014 After all, how can we get just the right extent of relevance, that is not too filtered at the same time so you won\u2019t miss the chance to find new products, contents or items?\n\nIndeed, there is plenty of research on how to introduce diversity on recommendations. For example, some papers suggest introducing maximal marginal relevance or intra-list diversity to reduce redundancy of result while keeping its relevance in the series of phrases. Thompson sampling tries to indirectly add diversity by adding some exploration to the pure recommendation results.\n\nHowever, these approaches tend to be highly technical and heuristic, meaning they are heavily dependent on each business problem. 
It seems there is still no simple, established, and generalizable method to solve this problem, and that is why so many recommendations fail to be \u201cnatural\u201d to us.\n\nHere, the Customization and Machine Learning (CaML) team at CBC tackled this problem with an original method, which turns out to be not only robust but also generalizable to different machine learning models. In this blog post, I will share our algorithm which borrows the notion of \u201centropy\u201d from statistical physics to balance \u201crelevance\u201d and \u201csurprise\u201d in our recommendations and how to understand it within this new context.\n\nIdea of Entropy\n---------------\n\nIn the context of data science, entropy is frequently used as a measurement of impurity, disorder or randomness of the information processed. For example, the decision tree algorithm learns to minimize the impurity of each leaf node by maximizing the homogeneity of information at every node, using entropy as the loss function. Entropy is also the basis of (as suggested by their names) relative entropy and cross entropy, which can be found in many places such as some dimensional reduction algorithms including t-SNE and UMAP.\n\nIn a broader perspective of information theory, entropy, or \u201cShannon entropy\u2019\u2019 refers to the \u201cinformational value\u201d present in a variable. For example, if you find a news article that reports a typical event such as a traffic jam on a snowy day, you are probably not surprised by the news. In other words, the \u201cinformational value\u201d of the message is not so significant. Alternatively, if an article pops up in front of you telling that some aliens just landed on the earth from a thousand light years away, it\u2019s quite a huge breaking news story. 
This implies that the rarity of the event is in proportion with the information it gives.\n\nThe mathematical representation of Shannon entropy is described by the following function as the normalized log of inverse probability;\n\nwhere _p_ is an array of _values = { p\\_0, p\\_1, \u2026 p\\_i, .. p\\_{n-1} }_ known as the probability mass function (PMF) which represents the probability _p\\_i_ occurring for event _i_.\n\nIf there is plenty of randomness, in other words if the probability distribution of a random variable is closer to even, its entropy gets higher and becomes closer to the maximum value log(_n_). Inversely, if there is little surprise or impurity, which means the probability distribution is more skewed, the entropy gets smaller to approach zero. For example, the entropy of a random variable following the Bernoulli distribution can be solved exactly as\n\nand when plotted displays this inverse relationship\n\nBinary entropy function\n\nAs the graph shows, the more skewed the probability distribution is, the smaller the entropy of a random variable becomes. And this is the same for multivariable random variables.\n\nConsider the multivariable case of rolling a die. We can no longer plot the entropy function as a line in two or even three dimensional space \u2014 we would need a six dimensional space! Given this limitation we need another way of understanding the entropy, one way of which is visualizing the shapes of the PMF and cumulative mass function (CMF) themselves which allude to an approximate value of entropy.\n\nIf you are playing with a fair die, the probabilities of getting each number is evenly \u2159, so the PMF and CMF look like the followings.\n\nPMF and CMF of playing a normal die\n\nAs shown above, when the probability distribution is totally even, PMF is flat and so its CMF is a linear function. 
Here, the entropy can be calculated as\n\nwhich is the maximum entropy.\n\nThis result of getting the maximum entropy suggests that we know nothing about which number we will get before playing the die. The \u201cinformation value\u201d is maximized when we play a purely random die.\n\nNow consider a very unfair die, the information value is minimized as we know that it will almost always return the same value, say, 1. In this case, the PMF is skewed to its limit and as so its CMF becomes a step function as shown below.\n\nPMF and CMF of playing a skewed die which only shows 1.\n\nThe entropy is calculated as\n\nOpposite from the previous case, getting zero entropy means that there is no additional information value we obtain from rolling the die, obviously because we know for sure that we will get 1 beforehand.\n\nWell, it sounds like both of these cases are too extreme \u2014 what happens if the entropy is in the middle of 1 and 0, say, 0.5?\n\nOne example of such a case is when we play a skewed die with the following probability distribution: _{1: 40%, 2: 25%, 3: 16%, 4: 10%, 5: 5%, 6: 3%}_\n\nNow its PMF and CMF are shown in the graphs below.\n\nComparison of the PMF and CMF for rolling a skewed die, with examples of getting numbers at each x.\n\nThe more PMF is skewed, the lower its entropy becomes. The lower the entropy becomes, the more the elbow of its CMF get close to the left edge. To put it the other way around, when we decrease or increase the entropy, the expectation of getting each event becomes more or less predictable.\n\nJust like the case of playing a die as discussed above, we defined \u201centropy\u201d from a list of scores obtained from a machine learning model, which suggests how likely each user will like each new item. Let\u2019s talk on a deeper level about how we implemented this in the next section.\n\nBalancing the relevance and diversity in the recommendation for a person who liked a coral pink cat. 
With too low \u201cexpected surprise\u201d, the recommendation looks boring, whereas with too high \u201cexpected surprise\u201d it looks irrelevant.\n\nAlgorithms of Entropy Sampling\n------------------------------\n\nOften in recommendation systems, the output of an algorithm is a group of lists which score items for each user. In our case of an item-based collaborative filtering, we obtain a user-item matrix of similarity scores as the raw output from the machine learning model. A higher score as an element in the user-item matrix means the item is more relevant to the user.\n\nItem-based collaborative filtering\n\nThere are several ways to create recommendations from the raw similarity score matrix. First, let\u2019s consider the simple and widely applicable case of recommending five CBC podcast shows to a user.\n\nThe recommendation matrix _R_, which is a collection of the lists of recommendations for all users, is obtained by multiplying the user-item matrix U to the similarity score matrix _S_. The user-item matrix _U_ stores the histories of each user\u2019s consumption / ratings in each row. So, the recommendation matrix _R_ can be thought of (considering the single user case) as the summation of cosine similarity values for item-to-item relations to be ranked. More precisely this is a weighted sum where the coefficients of the weights are unique to each user\u2019s listening history.\n\nIn such a way, this recommender works like this: user A listened to show X, and show X tends to be liked by the same people who like show Y. Therefore, given that user A has not listened to show Y they should be given it as a recommendation. Let\u2019s see the bare result first and see what this collaborative filtering model recommends.\n\nExample recommendation for user A using the CBC Listen app with Top 5 sampling. 
A person who liked _Metro Morning_, which is a news show, gets a recommendation of _Here and Now Toronto_, _Fresh Air_, _The Current_, _Ontario Today_, and _The Sunday Magazine_, all of which are news shows.\n\nThese shows are reasonable recommendations, but might be too obvious for the user\u2019s tastes. Furthermore, even if they are of interest to a user, if they open the CBC Listen App again the next day and see the same recommendations it would get awfully unhelpful fast!\n\nSo, let\u2019s think about adding some randomness here to avoid the too obvious, too fixed recommendations we saw above. The most straightforward solution would be to do random sampling from the similarity score distribution. This means, from the user side, this similarity score distribution can be transformed into a probability distribution or probability mass function of getting each item in recommendations. Now, applying the random sampling technique recommendations for user A look like:\n\nExample recommendation for user A sampled from the similarity score distribution. A person who liked _Metro Morning_, which is a news show, gets a recommendation of _The Loop_ (Edmonton local news), _Evil by Design_ (society), _The Debaters_ (comedy show), _World Report_ (news show), and _Party Lines_ (politics).\n\nNow another problem arises. This user liked _Metro Morning_, which is a news show, but got a lot of shows that don\u2019t sound like related ones. 
User A might be confused and wonder where _The Loop_ or _Party Lines_ came from.\n\nThings are becoming clear: if we can adjust the \u201centropy\u201d from the \u201cPMF\u201d \u2014 which in this case is the cosine similarity score matrix \u2014 we might be able to balance the randomness to achieve the best of the two cases.\n\nTo adjust the PMF to achieve the optimal entropy, we defined a function to get an approximate solution for inverting the procedure of calculating entropy: we input the entropy we want and get an adjusted PMF that approximately achieves that specified entropy.\n\nIf a function _y = f(x)_ is non-negative and monotone, the exponential function _y = {f(x)}^\u03b3_ where _\u03b3_ is a positive real number is also monotone.\n\nIf _\u03b3_ is zero all values transform to unity creating a uniform distribution of maximal entropy, as _\u03b3_ increases the entropy is lowered and the distribution becomes more skewed. The idea is that we try to find the optimal _\u03b3_ which allows us to sample from our recommendation vector to the user\u2019s taste. In the implementation, we search over the possible values of normalized entropy to quickly find the desired _\u03b3_ for the input PMF.\n\nTuning _\u03b3 to get optimal PMF._\n\nHow does it work?\n-----------------\n\nAs we have discussed above, we adjusted the entropy of PMF, the similarity score distribution in this case, to get the balance of surprise and relevance in our recommendation. After entropy sampling, the recommendation for User A looks like this.\n\nExample recommendation for user A sampled with entropy sampling. A person who liked _Metro Morning_, which is a news show, gets a recommendation of _Podcast Playlist_ (Producer\u2019s Pick), _Fresh Air_ (news), _Here and Now Toronto_ (news), _Evil by Design_ (society), and _Ontario Today_ (news).\n\nNow we have much more relevant recommendations like _Fresh Air_, _Here and Now Toronto_ and _Ontario Today_. 
Moreover, there are also some new genres of shows. This recommendation is not as boring as the top 5 recommendation, and is much more relevant than the random sampling recommendation.\n\nSince we adjust the PMFs for each user, each user\u2019s recommendation is optimized separately and the effect of entropy sampling would be robust among all users.\n\nConclusion\n----------\n\nIn this blog post, I showed how we balance \u201crelevance\u201d and \u201csurprise\u201d in our recommendation. Although there is plenty of research about how to introduce diversity in recommendation, we took a simple approach by introducing the concept of entropy.\n\nAs discussed in the result section, we could see an improvement in our recommendation, with more fine-tuned relevance and surprise in our delivery endpoint! Not only that, we find our approach to be robust and applicable to other recommendation models because it is a simple post-processing method.\n\nIn the field of classic thermodynamics where entropy originally came from, entropy is a measure of the molecular disorder, or randomness, of a system. Our entropy sampling method enables us to tune surprise just as raising the temperature and see molecules mixing up together.\n",
"llm_summary": "The blog post discusses how to balance \"relevance\" and \"surprise\" in recommendations using an entropy sampling method within a collaborative filtering model for a CBC Listen app. Initially, the model provided either overly obvious or completely unrelated show recommendations. By adjusting the probability mass function (PMF) derived from cosine similarity scores through tuning a parameter \\( \\gamma \\), the method optimizes the recommendation's entropy to achieve a balance between familiar and surprising content. The optimized approach delivers more relevant shows while still introducing variety, enhancing user experience without overwhelming them with predictable or unrelated suggestions. This technique is robust and applicable across different recommendation models as it serves as a simple post-processing step.",
"url": "https://arxiv.org/abs/2410.02268",
"summary": "We employ structural entropy to quantify global information and losslessly decompose it from the whole graph to individual nodes using the Shapley value."
}
]
```
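
To skim the saved results without scrolling through the full page text, a short script along these lines can help. It is a minimal sketch rather than part of the Dria SDK: the helper names `load_results` and `brief` are hypothetical, and the key names (`url`, `summary`, `llm_summary`) are taken from the sample output above. It assumes `llm_summary` may be absent (presumably when `summarize=False`), so it falls back to `summary`.

```python
import json
from pathlib import Path


def load_results(path):
    """Load the JSON list written by the example above."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def brief(results, max_chars=80):
    """Return (url, short_text) pairs, tolerating missing keys."""
    pairs = []
    for entry in results:
        url = entry.get("url", "<no url>")
        # Prefer the LLM summary when present; otherwise use the search snippet.
        text = entry.get("llm_summary") or entry.get("summary") or ""
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."
        pairs.append((url, text))
    return pairs


if __name__ == "__main__":
    path = Path("search_results.json")
    if path.exists():
        for url, text in brief(load_results(path)):
            print(f"{url}\n  {text}")
```

Using `dict.get` keeps the script working even if an entry lacks one of the fields; the raw `content` field is skipped on purpose because it holds the full scraped page.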