File:Lotka law for the 15 most populated categories on arXiv (2023-07).svg

This is a file from the Wikimedia Commons.

Original file (SVG file, nominally 1,800 × 1,440 pixels, file size: 351 KB)

Summary

Description
English: Lotka's law for the 15 most populated categories on arXiv, as of 2023-07, with one series of points per category.

It is a log-log plot. The x-axis is the number of publications, and the y-axis is the number of authors with at least that many publications.
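
As a small illustration of the y-axis quantity, the sketch below computes, for each observed publication count k, the number of authors with at least k publications. The per-author counts are invented toy values, and the function name `authors_with_at_least` is hypothetical; it is not taken from the script used to generate the file.

```python
from collections import Counter


def authors_with_at_least(pub_counts):
    """Map each observed publication count k to the number of authors with >= k papers.

    `pub_counts` is an iterable of per-author publication counts.
    """
    histogram = Counter(pub_counts)  # k -> number of authors with exactly k papers
    result = {}
    running = 0
    # Walk from the largest count down, accumulating authors as we go
    for k in sorted(histogram, reverse=True):
        running += histogram[k]
        result[k] = running
    # Return the pairs in ascending order of k
    return dict(sorted(result.items()))


# Toy data (invented for illustration): six authors and their paper counts
toy_counts = [1, 1, 1, 2, 2, 5]
print(authors_with_at_least(toy_counts))  # prints {1: 6, 2: 3, 5: 1}
```

When such pairs are drawn on log-log axes, a power-law distribution shows up as an approximately straight line.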

Data from https://www.kaggle.com/datasets/Cornell-University/arxiv, processed with the following script:

```python
import json
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Define empty lists to store data
categories_list = []
authors_parsed_list = []

# Open the file and iterate over each line (one JSON record per line)
with open('arxiv-metadata-oai-snapshot.json', 'r') as file:
    for line in file:
        # Parse the JSON string
        paper = json.loads(line)

        # Extract the "categories" and "authors_parsed" fields
        categories = paper.get("categories", "")
        authors_parsed = paper.get("authors_parsed", [])

        # Split categories string into list and store
        categories_list.append(categories.split())

        # Store the authors_parsed data
        authors_parsed_list.append(authors_parsed)

# Create a DataFrame from the extracted data
df = pd.DataFrame({"categories": categories_list, "authors_parsed": authors_parsed_list})


def count_authors(df, category_list):
    """Return {author name: number of papers} for papers in any of the given categories."""
    counter = {}
    # Filter rows that match the specified categories
    mask = df['categories'].apply(lambda x: any(category in x for category in category_list))
    filtered_df = df[mask]
    # Flatten the authors_parsed column
    flattened_authors = [author for authors in filtered_df['authors_parsed'] for author in authors]
    # Count the occurrences of each author
    for author in flattened_authors:
        # author[0] is the last name, author[1] the first name(s)
        author_name = author[1] + ' ' + author[0]
        counter[author_name] = counter.get(author_name, 0) + 1
    return counter


# Flatten the categories column into a single list
categories_list = [category for categories in df['categories'] for category in categories]

# Count the occurrence of each category
category_counts = Counter(categories_list)

# Sort categories by count in descending order
sorted_categories = sorted(category_counts.items(), key=lambda x: x[1], reverse=True)

# Print the sorted categories and their counts
for category, count in sorted_categories:
    print(f"{category}: {count}")

# Plot
plt.rcParams.update({'font.size': 20})

fig, axs = plt.subplots(figsize=(10, 8))

n_categories = 15
for category, _ in sorted_categories[:n_categories]:
    result = count_authors(df, [category])
    # Natural log of each author's publication count (x-axis); the y-axis below uses log10
    log_publications = pd.Series(np.log(list(result.values())))
    print(category)
    # Reverse cumulative count: number of authors with at least that many publications
    cumulative_counts = log_publications.value_counts().sort_index(ascending=False).cumsum().sort_index()
    axs.scatter(cumulative_counts.index, np.log10(cumulative_counts.values), label=category, s=3)

axs.legend()
axs.grid()
axs.set_title(f"Lotka law for the {n_categories} most populated categories on arXiv")
axs.set_xlabel("Log(publications)")
axs.set_ylabel("Log(authors)")
# Save before showing: after plt.show() returns, the figure may already be closed
fig.savefig("out.svg")
plt.show()
```
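
To check the author-counting step without downloading the full snapshot, here is a minimal sketch that runs on three invented records shaped like the dataset's `categories` and `authors_parsed` fields. The records, the author names, and the helper `count_authors_in` are hypothetical, and the sketch uses `collections.Counter` in place of the manual dictionary in the script.

```python
from collections import Counter

# Invented records mimicking the two fields read from the arXiv snapshot
toy_papers = [
    {"categories": "hep-ph hep-th",
     "authors_parsed": [["Doe", "Jane", ""], ["Roe", "Richard", ""]]},
    {"categories": "hep-ph",
     "authors_parsed": [["Doe", "Jane", ""]]},
    {"categories": "cs.LG",
     "authors_parsed": [["Roe", "Richard", ""]]},
]


def count_authors_in(papers, category):
    """Count papers per author among the records listed under `category`."""
    counts = Counter()
    for paper in papers:
        # "categories" is a space-separated string; split it so that
        # membership is tested per category rather than per substring
        if category in paper["categories"].split():
            for last, first, *_ in paper["authors_parsed"]:
                counts[f"{first} {last}"] += 1
    return counts


print(count_authors_in(toy_papers, "hep-ph"))
```

Authors are keyed by name string alone, so different people who share a name are merged and one person who publishes under several name variants is split into several entries.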
Date
Source: Own work
Author: Cosmia Nebula

Licensing

I, the copyright holder of this work, hereby publish it under the following license:
This file is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.
You are free:
  • to share – to copy, distribute and transmit the work
  • to remix – to adapt the work
Under the following conditions:
  • attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.

Captions

Lotka law for the 15 most populated categories on arXiv (2023-07).

Items portrayed in this file

depicts

17 July 2023

image/svg+xml

File history


Date/Time: 01:14, 18 July 2023 (current)
Dimensions: 1,800 × 1,440 (351 KB)
User: Cosmia Nebula
Comment: Uploaded own work with UploadWizard