becoming irrelevant

why

The post below, on Mastodon, by goat farmer & astronomer Prof. Sam Lawler got me thinking.

Post by @sundogplanets@mastodon.social


It made me wonder: how far back does the collective memory, or willingness to search, stretch? Roughly.

rough plan

- get a big dump of articles
  - from a single journal, perhaps?
- for each article grab:
  - year it was published
  - year of every reference referenced
- plot the distribution of publication year for references referenced for each published year
- have a look

hypothesis

My gut wants to say^[1] that, despite increased accessibility^[2] to humankind's research efforts^[3]^[4], there is perhaps a tendency to take for granted things that have been known for more than a few decades^[5], and that researchers therefore don't feel the need to look further back in time. Or perhaps they do, but are compelled to cite only the recent stuff to demonstrate that they are up to date^[6].

I would be inclined to point towards the rapid increase in the number of papers published each year^[7] (Fig 0) as a key driver of this forgetting-the-old-stuff.

Fig 0: artist's impression of an exponential curve showing number of articles published each year

data

right. let's get some.

First, I went to Nature. I presumed that, being a big deal in the whole scientific publishing game, they would have an easily accessible database/API for grabbing lots of metadata. I found Springer Nature's Developer Portal, signed up and then didn't log in.

A few searches later I happened upon Open Citations, which looked very promising, and I started throwing some code together. However, the first DOI I threw at it returned a different paper.

Being fidgety, I kept looking. CrossRef looked promising, and they even have a little notebook tutorial for how to get citations.

Then OpenAlex came into view. Perfect, that'll do. And there's even a handy Python package - pyalex - that saves me the bother of writing my own requests. And it has a .sample() function. Yes please.
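For a rough idea of what a sampled query looks like underneath, here is a minimal sketch that only builds the request URL (no network call). It is not the code used for this post: the function name `sample_url`, the seed, and the exact filter names (`type`, `type_crossref`, `primary_location.source.issn`) are assumptions based on a reading of the OpenAlex API docs, and the ISSN is used as a stand-in for "published in Nature".

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def sample_url(year, n=200, seed=42, issn="0028-0836"):
    """Build a works query asking OpenAlex for a random sample of articles.

    Assumed filter names; check them against the OpenAlex docs before use.
    """
    filters = ",".join([
        "type:article",
        "type_crossref:journal-article",
        f"publication_year:{year}",
        f"primary_location.source.issn:{issn}",
    ])
    params = {
        "filter": filters,
        "sample": n,               # size of the random sample
        "seed": seed,              # fixed seed so paging stays consistent
        "per-page": min(n, 200),   # 200 is the largest page size
    }
    # Keep ':' and ',' readable instead of percent-encoding them.
    return f"{OPENALEX_WORKS}?{urlencode(params, safe=':,')}"


print(sample_url(1999))
```

pyalex wraps the same idea in chained calls (a filter followed by a sample), so a sample larger than one page would need paging on top of this.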

method

I pretty much followed the rough plan.

I'll spare you the details of what I originally did. It was fine, but also, not quite right.

I made good use of filter() to search only for records with type article, and that are in CrossRef as journal-article. I also filtered just on articles published in Nature.

Functionally inelegant code handled: the API's limits of submitting 50 DOIs at a time; appending lists to other lists; and gracefully keeping track of records returned that didn't have any references listed^[8]. Follow this up with a bit of wrangling^[9] to get the results into a nice dataframe with one row for each referenced paper, and columns: pub_year & ref_year. Next step: get the distribution of referenced publication dates, grouped by pub_year.
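The bookkeeping described above can be sketched in plain Python. This is an illustration under assumptions, not the original code: `batched` and `to_rows` are hypothetical helper names, the input is assumed to be OpenAlex-style work dicts with `publication_year` and `referenced_works`, and `year_of` is an assumed lookup from a referenced work's ID to its publication year (the thing the batches of 50 would be used to fetch).

```python
from itertools import islice


def batched(items, size=50):
    """Yield successive lists of at most `size` items (50 per request)."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def to_rows(works, year_of):
    """Return one (pub_year, ref_year) row per referenced work.

    Also returns how many works had no references listed at all.
    """
    rows, no_refs = [], 0
    for work in works:
        refs = work.get("referenced_works") or []
        if not refs:
            no_refs += 1
            continue
        for ref_id in refs:
            ref_year = year_of.get(ref_id)
            if ref_year is not None:  # skip references with unknown year
                rows.append((work["publication_year"], ref_year))
    return rows, no_refs


works = [
    {"publication_year": 1999, "referenced_works": ["W1", "W2"]},
    {"publication_year": 1999, "referenced_works": []},
]
rows, no_refs = to_rows(works, {"W1": 1619, "W2": 1986})
print(rows, no_refs)
print([len(chunk) for chunk in batched(range(120))])
```

The list of tuples can be handed straight to a dataframe constructor with columns pub_year and ref_year.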

From 1950 - 2024 (inc) I aimed to sample 2000 articles each year. Not all of these 2000 had references (Table 0). Rather than stratifying / accounting for this difference between each year, I just used everything I had for the rest of this analysis^[10]. This amounted to: 74,033 papers and 1,167,446 referenced works^[11].

statistic	number of articles with references (of 2000)	%
mean	987	49.3
median	980	49.0
min	536	26.8
max	1325	66.2

Table 0: Summary statistics for the number of articles that had references. Sampling 2000 articles for each of 75 years (1950-2024)
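As a quick sanity check on those totals, the arithmetic can be worked through directly. Only the figures quoted above are used; the variable names are just for illustration.

```python
years = 2024 - 1950 + 1      # 1950 to 2024 inclusive
sampled = years * 2000       # articles requested in total
with_refs = 74_033           # papers that had references
references = 1_167_446       # referenced works, duplicates included

per_year = with_refs / years             # should sit near the Table 0 mean
share = 100 * with_refs / sampled        # overall % with references
refs_per_paper = references / with_refs  # average references per paper

print(years, sampled)
print(f"{per_year:.1f} papers with references per year")
print(f"{share:.1f}% of sampled articles had references")
print(f"{refs_per_paper:.2f} references per paper, on average")
```

The per-year figure comes out at about 987, which matches the mean in Table 0, and the overall average is a little under 16 references per paper.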

results

Headline: I was wrong.

In recent years, it seems, the median age of a reference has increased from 3 years in the 1990s to 6 years in the late 2010s (Fig 1). And the number of references included has swollen, following a post-millennium dip^[12].

[Figure: two time series, 1950-2025. Top shows the mean, median and IQR reference age (years); bottom shows the mean, median and IQR number of references per publication.]

Fig 1: Mean (brown dashed), median and IQR (purple / shaded) of: (top) reference age; and (bottom) number of references per publication.

As for the tail, in the 1950s 99% of references cited were ~55 years old, or younger. This shrank to a minimum of ~35 years in 1980, and since 2010 it has been steady at ~55 years. Using the age of a reference (pub_year - ref_year), a quick scipy.stats.mannwhitneyu() suggests that the 1950 distribution is ✨significantly✨ different from the 2020 distribution (Fig 2). But I reckon you'd eyeballed that already.

[Figure: ridgeline plot showing 15 distributions of referenced paper published dates, one every 5 years from 1950-2020. The ridges steadily move to the right as publication date increases. Each distribution is abruptly cut at its rightmost edge, and has a long tail to the left. Over time the distribution becomes less skewed, and flatter (but still skewed).]

Fig 2: Distribution of referenced paper publication date (x-axis) by parent-publication date 1950-2020. Align the right-edges of these to compare distribution of 'age' of reference.
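The reference-age summaries quoted in this section (median, and the age below which 99% of references fall) can be sketched with the standard library alone. This is an illustration rather than the code behind the figures: `percentile` and `age_summary` are hypothetical helpers, the rows are made-up toy data, and nearest-rank is just one of several percentile definitions.

```python
from collections import defaultdict
from math import ceil
from statistics import median


def percentile(values, p):
    """Nearest-rank percentile (p between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def age_summary(rows):
    """Map each pub_year to (median age, 99th percentile age) of its references."""
    ages = defaultdict(list)
    for pub_year, ref_year in rows:
        ages[pub_year].append(pub_year - ref_year)  # age of the reference
    return {
        year: (median(a), percentile(a, 99))
        for year, a in sorted(ages.items())
    }


# Toy rows of (pub_year, ref_year), not real data.
rows = [(2000, 1999), (2000, 1995), (2000, 1950), (2020, 2014)]
print(age_summary(rows))
```

For comparing two years' age lists, the two lists of ages built here are the kind of input that scipy.stats.mannwhitneyu takes; that step needs SciPy and is not shown.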

earliest works

The earliest work referenced was from 1400^[13]. An oversight meant that I didn't think to capture the DOIs of anything. Rummaging through the list of lists to find where 1400 came up, I was able to ascertain it was this record in OpenAlex, which points to this DOI. For those readers who chose not to click on either of those: the OpenAlex record is a whoopsie, and it points to a DOI that has been removed. From the title, I think it is supposed to be this record (DOI), which is: Density Estimation for Statistics and Data Analysis by Bernard W. Silverman, first published in 1986^[14].

This whoopsie is unfortunate but also helpful. OpenAlex contains, by virtue of its gargantuan size^[15] and the infeasibility of putting eyes-on everything, a few mistakes. Hopefully this doesn't detract too much from whatever conclusion I've drawn here.

The next oldest work is from 1619. Kepler's Harmonice Mundi V. This beauty includes a "long digression on astrology"^[16], which is duly followed by his third law of planetary motion. Well done. The 1999 Nature article that references Kepler is Systematic enumeration of crystalline networks by Olaf Delgado Friedrichs et al.:

In two dimensions, there are 11 different topological types of uninodal tilings, a result already known to Kepler

discussion(ish)

I'm going to stick my neck out and boldly assert that I think some of the above results might be a function of what is and is not in OpenAlex.

The explosion (Fig 0) makes staying up-to-date and having a good grasp of the original canonical works hard.

Looking at it from the original author's perspective^[17]: the vast majority (95%) of works cited are no more than ~25 years old. Which perhaps means most work is ultimately destined to be ignored, forgotten, and under-appreciated. There's a good chance a career scientist will be conducting research for long enough to forget their own beginnings. Maybe, however, your research lives on and you just don't get the credit you deserve. All the while your hard-won results trickle down through the generations in a game of broken telephone^[18] we call ✨science✨.

footnotes

1. look at me, just ✨bursting✨ with confidence
2. citation needed, probably
3. thanks to the web, big databases, inter-library loans,
4. no thanks to paywalls
5. see this, most excellent, post by John Kennedy, esp. footnotes 10 & 32
6. or hamstrung by journals imposing limits on number of references
7. for example. or Fig 1 from Hanson et al., 2024.
8. more thoughtful filtering on my part could have caught these ones earlier in the process
9. writing slightly opaque and needlessly grubby one-liners with pandas is my specialty
10. lol. sorry
11. no, the same referenced work might be counted multiple times, and that's ok
12. Nature's policy, or fashion?
13. from my sample there were a total of 6 references to this one item.
14. the guilt from not having read it made me change the way I was visualising the distributions from a KDE to just a plain old regular vanilla histogram
15. decompresses to 1.6 TB
16. so says the wikipedia entry
17. I was about to repeat the analysis but from this perspective (because it would be fairer having all source articles come from the same journal), but OpenAlex only keeps counts_by_year for the last ten years. oh well. phew
18. a.k.a. Chinese whispers
