File:Lotka law for the 15 most populated categories on arXiv (2023-07).svg

Original file (SVG file, nominally 1,800 × 1,440 pixels, file size: 351 KB)

Summary

Description	English: Lotka's law for the 15 most populated categories on arXiv, as of 2023-07, shown as a log-log plot. The x-axis is the number of publications, and the y-axis is the number of authors with at least that many publications. Data from https://www.kaggle.com/datasets/Cornell-University/arxiv

```python
import json
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Lists that collect the categories and parsed authors of every paper
categories_list = []
authors_parsed_list = []

# The snapshot is read line by line; each line is one JSON record
with open('arxiv-metadata-oai-snapshot.json', 'r') as file:
    for line in file:
        paper = json.loads(line)
        # Extract the "categories" and "authors_parsed" fields
        categories = paper.get("categories", "")
        authors_parsed = paper.get("authors_parsed", [])
        # Split the space-separated categories string into a list
        categories_list.append(categories.split())
        authors_parsed_list.append(authors_parsed)

# One row per paper
df = pd.DataFrame({"categories": categories_list, "authors_parsed": authors_parsed_list})

# Flatten the categories column, count each category, and sort by count (descending)
all_categories = [category for categories in df['categories'] for category in categories]
category_counts = Counter(all_categories)
sorted_categories = sorted(category_counts.items(), key=lambda x: x[1], reverse=True)

# Print the sorted categories and their counts
for category, count in sorted_categories:
    print(f"{category}: {count}")


def count_authors(df, category_list):
    """Return {author name: paper count} over papers in any of the given categories."""
    counter = {}
    # Keep rows whose category list contains at least one requested category
    mask = df['categories'].apply(lambda x: any(category in x for category in category_list))
    filtered_df = df[mask]
    # Flatten the authors_parsed column into one list of parsed names
    flattened_authors = [author for authors in filtered_df['authors_parsed'] for author in authors]
    for author in flattened_authors:
        # Build a name key from the first two fields of each parsed entry
        author_name = author[1] + ' ' + author[0]
        counter[author_name] = counter.get(author_name, 0) + 1
    return counter


# Plot
plt.rcParams.update({'font.size': 20})
fig, axs = plt.subplots(figsize=(10, 8))
n_categories = 15
for category, _ in sorted_categories[:n_categories]:
    print(category)
    result = count_authors(df, [category])
    # Natural log of each author's publication count
    log_pubs = pd.Series(np.log(list(result.values())))
    # Reverse cumulative count: number of authors with at least that many publications
    cumulative_counts = log_pubs.value_counts().sort_index(ascending=False).cumsum().sort_index()
    # Note: the x values use the natural log, the y values use log base 10
    axs.scatter(cumulative_counts.index, np.log10(cumulative_counts.values), label=category, s=3)

axs.legend()
axs.grid()
axs.set_title(f"Lotka law for the {n_categories} most populated categories on arXiv")
axs.set_xlabel("Log(publications)")
axs.set_ylabel("Log(authors)")
# Save before show(), since the figure may be cleared once the window is closed
plt.savefig("out.svg")
plt.show()
```
Date	17 July 2023
Source	Own work
Author	Cosmia Nebula
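
The y-axis quantity in the description, "the number of authors with at least that many publications", is a reverse cumulative count. The following is a minimal sketch of that one step on made-up data, written in plain Python rather than pandas; the function name `authors_with_at_least` and the toy author labels are hypothetical and are not part of the script that produced the figure.

```python
from collections import Counter


def authors_with_at_least(pub_counts):
    """Map each observed publication count n to the number of authors with >= n papers.

    pub_counts is a dict {author: number of publications} (hypothetical toy input).
    """
    # How many authors have exactly n publications
    freq = Counter(pub_counts.values())
    running = 0
    result = {}
    # Walk from the largest count down, accumulating authors as we go
    for n in sorted(freq, reverse=True):
        running += freq[n]
        result[n] = running
    # Return the mapping ordered by increasing publication count
    return dict(sorted(result.items()))


# Toy data: three authors with 1 paper, two with 2 papers, one with 5 papers
toy = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 5}
ccdf = authors_with_at_least(toy)
print(ccdf)
```

Plotting the logarithm of the keys against the logarithm of the values gives a log-log curve of the kind shown in the figure; under Lotka's law the points would be expected to fall roughly on a straight line.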

Licensing

I, the copyright holder of this work, hereby publish it under the following license:

This file is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.

You are free:

to share – to copy, distribute and transmit the work
to remix – to adapt the work

Under the following conditions:

attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.

File history

	Date/Time	Thumbnail	Dimensions	User	Comment
current	01:14, 18 July 2023		1,800 × 1,440 (351 KB)	Cosmia Nebula	Uploaded own work with UploadWizard

File usage

The following pages on the English Wikipedia use this file (pages on other projects are not listed):

Lotka's law

Metadata

Width	1440pt
Height	1152pt

Items portrayed in this file

depicts	Lotka's law; scientometrics; power law
creator	some value
copyright status	copyrighted
copyright license	Creative Commons Attribution-ShareAlike 4.0 International
source of file	original creation by uploader
inception	17 July 2023
media type	image/svg+xml