Using CrowdFlower to understand storytelling in science

Doing good science is often not enough on its own. For the results of that work to have an impact, they need to be disseminated effectively, and a big part of that comes down to writing — clear, engaging prose is more likely to be read and understood. Sadly, approachable writing isn’t many researchers’ strong suit.

Most people will find little to disagree with in the above paragraph, but the question remains how much of an effect writing style actually has on the impact of scientific publications. A recent paper in the journal PLOS One seeks to answer this question, and its authors made use of CrowdFlower as part of their investigation. The paper, titled “Narrative Style Influences Citation Frequency in Climate Change Science,” takes the abstracts of almost 750 scientific articles from a wide range of journals and looks at how elements of writing style might help explain how widely they get cited.


Image source: PLOS blog

I really like the way this research is structured, because it’s a great example of drawing upon multiple disciplines to come up with better answers to important questions. Here, the authors made use of work in both psychology and literary theory to understand how findings get shared in yet a third discipline. To quantify the way language was used in the scientific articles, the authors of the study identified six key features of the prose. These included both stylistic features like narrative perspective, and content features like whether the writing makes an explicit recommendation.

But how do you accurately measure these things across hundreds and hundreds of scientific articles? That’s where CrowdFlower comes in. The authors of the study had contributors rate each abstract according to the six features they identified, and because they recognized that “individual readers can perceive narrativity somewhat differently”, they had each abstract judged by seven independent contributors. Collecting redundant judgments is one of the key ways that CrowdFlower helps ensure quality labeled data. The authors also made careful decisions about which contributors to give access to the job, so we can view these findings with a good degree of confidence.
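To make the idea of redundant judgments concrete, here is a minimal Python sketch of how seven ratings for a single abstract might be collapsed into one label plus an agreement score. The votes, the function names, and the choice of a simple majority vote are all my own assumptions for illustration; this is not the paper's aggregation procedure, nor CrowdFlower's.

```python
from collections import Counter


def majority_label(judgments):
    """Return the label chosen by the most contributors."""
    label, _count = Counter(judgments).most_common(1)[0]
    return label


def agreement(judgments):
    """Return the fraction of contributors who chose the majority label."""
    _label, count = Counter(judgments).most_common(1)[0]
    return count / len(judgments)


# Hypothetical data: seven contributors judge whether one abstract
# makes an explicit recommendation.
recommendation_votes = ["yes", "yes", "no", "yes", "yes", "no", "yes"]

label = majority_label(recommendation_votes)
score = agreement(recommendation_votes)
print(f"label={label}, agreement={score:.2f}")
```

With an odd number of contributors and a two-way choice, a majority always exists, and the agreement score gives a rough sense of how contested a given abstract was.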

The authors took these six measures of language use and combined them into one “Narrative Index”. They then compared this index to measures of how frequently each article was cited in the literature. What they found was a significant positive correlation: the more narrative the article, the more widely cited it tended to be:
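As a rough sketch of this kind of analysis, the Python below builds an index from six per-abstract feature scores and correlates it with log citation counts. I'm assuming the index is a plain sum of the six scores and that the logs are base 10, since this post doesn't spell out either detail, and the four abstracts are invented purely for illustration.

```python
import math


def narrative_index(feature_scores):
    """Combine six feature scores into one number (assumed here: a plain sum)."""
    return sum(feature_scores)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical abstracts: (six feature scores, citation count).
abstracts = [
    ([0, 0, 1, 0, 0, 0], 5),
    ([1, 0, 1, 0, 1, 0], 12),
    ([1, 1, 1, 0, 1, 0], 9),
    ([1, 1, 1, 1, 1, 1], 40),
]

indices = [narrative_index(scores) for scores, _ in abstracts]
log_citations = [math.log10(count) for _, count in abstracts]

r = pearson(indices, log_citations)
print(f"r = {r:.2f}")
```

A real analysis would also need a significance test and some care with zero-citation articles, whose log is undefined.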



Image source: PLOS One

At first glance, it looks like a not-very-strong relationship (and there is, in fact, an ongoing debate in the literature on statistical significance vs. effect size), but one thing to note is that we’re not looking at the raw citation counts here, but log counts. Log transforms are a standard practice when working with frequency information, because they help prevent extreme values from completely dominating the analysis. In fact, this regression line predicts about a 3.5x increase in citations between the far left and far right of the graph. And that’s even with outliers retained in the analysis.
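The arithmetic behind reading a log-scale axis is worth a quick sketch. Assuming base-10 logs (the base is my assumption; the post above doesn't state it), a vertical gap on the log axis converts to a multiplicative change in raw citations, and the same transform shrinks the pull of a single extreme value. The citation counts below are made up.

```python
import math


def fold_change(log10_low, log10_high):
    """Convert a gap between two log10 values into a multiplicative factor."""
    return 10 ** (log10_high - log10_low)


# A 3.5x increase corresponds to a gap of log10(3.5) on a base-10 log axis.
gap = math.log10(3.5)
print(f"log10 gap for a 3.5x increase: {gap:.3f}")
print(f"fold change recovered from that gap: {fold_change(1.0, 1.0 + gap):.1f}")

# Hypothetical citation counts with one extreme value.
raw_counts = [3, 8, 15, 2000]
raw_mean = sum(raw_counts) / len(raw_counts)
log_mean = sum(math.log10(c) for c in raw_counts) / len(raw_counts)

# The raw mean is dragged far above the typical article; converting the
# mean of the logs back to a count (the geometric mean) is much less affected.
geometric_mean = 10 ** log_mean
print(f"raw mean: {raw_mean:.1f}, geometric mean: {geometric_mean:.1f}")
```

So a slope that looks shallow on a log axis can still describe a several-fold difference in raw citation counts.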

The authors of the study dig deeper as well: narrative style is a property not just of articles, but of publication venues too. Some venues are known as sources of dry, lengthy technical reports, while others are made up of short, engaging summaries of recent important findings. There’s a big current debate on the role of high-profile journals in scientific publishing that I won’t comment on here, but that is worth thinking about. Nonetheless, when one looks at different journals’ impact factors (a measure based on how heavily cited their articles tend to be), a clear narrativity trend emerges there too:


Image source: PLOS One

The trend captured here at the journal level is much starker than at the individual article level. There are a few things I don’t love about this comparison: points off for the truncated Y-axis, I don’t know why narrativity is on the Y-axis here when it was on the X-axis before, and I’d like to see the range of article-level narrativity values behind each journal’s average measure. Overall, though, this supports the idea that narrative style is tied to impact factor as well.

Accessibility of scientific findings is a big issue today. Making findings understandable is important, and it’s often hard to strike a balance between readability and detail. I was quite pleased in preparing this writeup to discover not just the original article, but also a blog post (from the journal’s website) with a good, high-level summary of its findings. This type of hybrid approach is something I’d like to see more of.

Readability of scientific findings is important, but making them available in the first place is even more critical. With for-profit publishers facing increasing backlash for charging hefty sums to read articles (particularly those reporting on research conducted with federal funds), an increasing amount of attention is being paid to so-called “open-access” venues, which make findings available to the public for free. PLOS One, the journal this study was published in, is one such open-access publication. For our part, CrowdFlower believes in making scientific data more available as well; the research conducted in the article discussed here was done using our Data for Everyone plan, making our platform available without fees to researchers who agree to make the data they collect publicly available. The narrativity dataset can be downloaded from the article’s page on PLOS One, and is linked from our Data for Everyone library as well.


Nick Gaylord

Nick is CrowdFlower's Senior Data Scientist, where he works primarily to help build their new machine learning offering, CrowdFlower AI. Prior to CrowdFlower, he was a data scientist at SF text analytics startup Idibon. He has a PhD from the University of Texas at Austin, where his research focused on human language comprehension and the construction of datasets for NLP applications. In his spare time he fixes bikes and collaborates on work applying cognitive science principles to the public health domain. You can follow him on Twitter at @texastacos.