Wikipedia:GLAM/Museum of New Zealand Te Papa Tongarewa/What We've Done/Myosotis Pilot



Te Papa's Myosotis pilot project wanted to find out how we could effectively and sustainably contribute to Wiki projects using our collection images, metadata, and curatorial knowledge. We used OpenRefine to load 355 images of Myosotis specimens native to Aotearoa New Zealand, creating a reusable process that involves adding well-described content, improving and creating articles, and connecting with structured metadata.

The project:

  • Loaded 355 images of Myosotis specimens
  • Added the images to articles created by one of our Botany Curators, Stitchbird2
  • Added and updated Wikidata items for many species and people related to the set
  • Created a new Commons template, Template:TePapaColl
  • Created processes to support the selection, export, transformation, and upload for our images and data

If you've got any questions, suggestions, or just want to talk about the project, get in touch with Avocadobabygirl.

This page describes the project goals, how to publish a set like this to Wikimedia Commons, and specifics of how we made it happen.

What’s Wikimedia Commons? How do we use it?

Te Papa wants the collections and knowledge we hold to be accessible and impactful for anyone who wants them. We’re building an ongoing programme of digital outreach that works out the best (most effective, sustainable, enriching…) platforms to publish on, and then makes it happen.

By loading images and metadata to Wikimedia Commons, as well as including the pictures in Wikipedia articles and connecting into Wikidata, we put valuable and up-to-date material right where people go looking for it. Contributing to a scientifically sound article on native forget-me-nots makes Wikipedia itself more complete, but also helps people who see that information when it goes elsewhere on the internet, like iNaturalist or Google search results.

We load not just high-quality and high-resolution images, but also detailed metadata and a link back to the record on our Collections Online site. This helps the image travel with its context: detailed and useful information that makes the images easier to find, use, and interpret in all sorts of ways.

On the front end, we display descriptive metadata that makes it really clear what you’re looking at, as well as extra info that’s useful to Wikipedians and researchers. Behind the scenes, we also hook in several structured data statements using Wikidata properties and items, making it easier to computationally interpret the image.

Loading all this material to Wikimedia Commons is done with OpenRefine. We use it to prepare our data, hook into Wikidata, and upload the images in bulk.

Read on to see how we select material, process images and data, and load it to the platform.

Selection criteria

Setting selection criteria and making your actual selections helps keep the size of the following work down.

Establish your criteria, using the following as a basis.

Criteria | Reasoning
Set size is between 300 and 1000 images | Safely within OpenRefine’s bulk upload capacity. Manageable amount to do some manual processing.
Set prioritises New Zealand/Pacific material | Supports Te Papa’s strategic goals. More likely to fill gaps on Wikimedia. Plays to our strengths.
Images are new to Wikimedia Commons | Avoid duplication of effort
Images are public domain or have a CC BY licence | Open licences are required for Wikimedia Commons/Wikipedia
Images and data are high quality | Small/unclear images and incorrect/inconsistent data don’t support positive audience impact or reflect well on Te Papa

Preferably, images will also have a use case ready to go, like inclusion on specific Wikipedia articles.

It will also be easier to prepare and upload material if the records are all the same type (eg Specimen vs Object), but this isn’t required.

Image selection

Because we wanted to restrict our set to a small number of relevant and high quality images, we did a review of all images attached to the records we’d chosen.

Preparing the data for OpenRefine

Create a general list of the kinds of images you want to include. It’s good to do this as a spreadsheet including columns like:

  • record numbers
  • titles
  • species
  • locations.

Make sure that there is one row in your spreadsheet for each image.

You can now open it in OpenRefine as a new project.
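
If your export has one row per record, with several image filenames packed into one cell, something like the following could reshape it to one row per image before you load it. This is only a sketch: the column names, the semicolon separator, and the sample record numbers are all made up for illustration, so adjust them to match your own export.

```python
import csv
import io

def one_row_per_image(rows, image_column="image_files", separator=";"):
    """Yield a copy of each row for every image filename it lists."""
    for row in rows:
        filenames = [name.strip() for name in row.get(image_column, "").split(separator)]
        for filename in filenames:
            if filename:
                expanded = dict(row)
                expanded[image_column] = filename
                yield expanded

# Hypothetical export: one row per record, images separated by semicolons
sample_export = io.StringIO(
    "record_number,species,image_files\n"
    "SP000001,Myosotis pansa,sheet_a.tif; detail_a.tif\n"
    "SP000002,Myosotis glabrescens,sheet_b.tif\n"
)

expanded_rows = list(one_row_per_image(csv.DictReader(sample_export)))
for row in expanded_rows:
    print(row["record_number"], row["image_files"])
```

Each output row keeps the record-level values (record number, species) alongside a single image filename, which is the shape the later steps expect.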

Filtering and faceting in OpenRefine

Use OpenRefine’s faceting and filtering tools to remove records you don’t want to include. Each record should relate to a single image. Some useful methods are:

  • Facet by species
  • Facet by specimen or catalogue record. Only keep those with multiple images.
  • Facet on empty fields
  • Facet on image metadata. For example: minimum longest edge, file type (tif, jpg), file size, creation date, filename (the filename may point to the type of image it is – specimen sheet, field image etc)
  • Facet on image creator

When you have filtered down to the records you don’t want to include, you can flag them using the All dropdown menu on the first column, then Edit rows, then Flag rows. When you’re done, you can then remove all flagged records from your project by selecting Remove all matching rows – it’s better to do this at the end, in case you change your mind.
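
If you want to sanity-check your selections outside OpenRefine, a text facet is essentially a count of rows per value. The sketch below imitates that in Python; the column names and sample rows are hypothetical, and it isn't a replacement for OpenRefine's own facets.

```python
from collections import Counter

def facet(rows, column):
    """Count how many rows share each value in a column, like a text facet."""
    return Counter(row.get(column) or "(blank)" for row in rows)

# Hypothetical rows, one per image
rows = [
    {"species": "Myosotis pansa", "file_type": "tif"},
    {"species": "Myosotis pansa", "file_type": "jpg"},
    {"species": "", "file_type": "tif"},
]

species_facet = facet(rows, "species")
print(species_facet.most_common())

# Keep only rows that have a species and are tif files
kept = [row for row in rows if row["species"] and row["file_type"] == "tif"]
print(len(kept), "row(s) kept")
```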

Review your data

After narrowing down to a subset of records, it’s a good time to review your data.

Look out for things like:

  • Values showing in the correct fields
  • Consistency – dates, spelling of names, formatting
  • Missing or additional data that should be added, for example Wikidata QIDs for associated people and taxa
  • Sensitive information – cultural, personal, location and financial data that shouldn’t be published

Ensure that data supporting image use is correct. For example:

  • Individual rights statements are consistently applied and meet the requirements of the external platform. For example, Wikimedia Commons requires images to be freely licensed or in the public domain.
  • Images are already (or queued to be) published on your own platform. This ensures users can verify that an image has in fact been officially published and is reusable.
  • Images are published at their highest resolution
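
Some of these checks can be roughed out in code before a manual review. The sketch below flags blank fields, dates that aren't in a consistent format, and rights statements outside an allowed list. The field names, the date format, and the allowed rights values are assumptions for illustration, and a script like this won't catch sensitive information, which still needs a person to look at it.

```python
import re

# Assumed date convention: YYYY, YYYY-MM or YYYY-MM-DD
ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")
# Hypothetical rights values that count as open in this sketch
ALLOWED_RIGHTS = {"CC BY 4.0", "Public domain"}

def review_row(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    for field in ("title", "species", "rights"):
        if not (row.get(field) or "").strip():
            problems.append(f"{field} is blank")
    date = (row.get("date_collected") or "").strip()
    if date and not ISO_DATE.match(date):
        problems.append(f"date_collected is not in the expected format: {date}")
    rights = (row.get("rights") or "").strip()
    if rights and rights not in ALLOWED_RIGHTS:
        problems.append(f"rights statement is not open: {rights}")
    return problems

good_row = {"title": "Myosotis pansa specimen", "species": "Myosotis pansa",
            "rights": "CC BY 4.0", "date_collected": "1890-02"}
bad_row = {"title": "", "species": "Myosotis pansa",
           "rights": "All rights reserved", "date_collected": "Feb 1890"}

print(review_row(good_row))
print(review_row(bad_row))
```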

Wikidata prep

OpenRefine lets you reconcile columns of values against Wikidata items, thereby connecting each upload to structured data in all sorts of useful ways.

Reconciliation using OpenRefine

Linking up things like creators, species, what’s depicted in the image, and significant locations covers most of the things people want to know. You might also consider:

  • type status (both whether the specimen is a type, and what kind of type)
  • collection/institution it's held in
  • people involved in collecting or identifying it.

The easiest way to get a definite match is to include Wikidata identifiers – QIDs – in your source data.

Wikidata:Identifiers
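
If you're adding QIDs to your source data by hand, a quick format check can catch typos before reconciliation. This sketch only tests the shape of the value (a capital Q followed by a number with no leading zero). It can't tell you whether the item exists or is the right match.

```python
import re

QID_PATTERN = re.compile(r"Q[1-9]\d*")

def is_qid(value):
    """True if the value is shaped like a Wikidata QID, such as Q42."""
    return bool(QID_PATTERN.fullmatch((value or "").strip()))

for candidate in ["Q42", " Q42 ", "q42", "Q042", "P31", ""]:
    print(repr(candidate), is_qid(candidate))
```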

Finding a QID on Wikidata

A lot of things are already on Wikidata, so there’s a good chance of finding a QID for the entity you’re working with. Sometimes, the difficult part is finding the right one.

Wikidata items are supposed to be one-to-one with a specific thing, so finding something that’s close isn’t going to be helpful. Alexander von Humboldt (Prussian naturalist) is not Alexander von Humboldt (boat), and a specimen of Myosotis antarctica subsp. traillii isn’t a specimen of Myosotis antarctica subsp. antarctica.

Partially-filled in Wikidata search box, showing results for "Myosotis antarc".

Start by searching from the box in the top right of Wikidata’s homepage. If the item you want doesn’t show up in the dropdown, hit enter to get a full search results page.

When looking for the right item, think about how you would be sure you’re looking at the right one:

  • Is the name at the right level of specificity?
  • Do birth/death dates, locations, associated institutions line up?
  • Has the name of the entity changed over time, with different ones being used in your data and on Wikidata?

You may find you need to do more research. If available information is scant and you can’t make a confirmed match, it may be safest to leave it out, and just use the entity’s name string instead.
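
For a long list of names, the same search can be run through Wikidata's wbsearchentities API action, which returns candidate items with their labels and descriptions. The following is a minimal sketch rather than part of our process: the function names are our own invention, and every candidate still needs the human checks described above before you treat it as a match.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def build_search_url(term, language="en", limit=5):
    """Build a wbsearchentities query URL for a search term."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)

def summarise_candidates(response):
    """Reduce an API response to (QID, label, description) tuples."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in response.get("search", [])
    ]

def search_wikidata(term, user_agent):
    """Run the search over the network and return candidate matches."""
    request = Request(build_search_url(term), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as reply:
        return summarise_candidates(json.load(reply))

print(build_search_url("Myosotis antarc"))
```

Calling search_wikidata("Myosotis antarc", user_agent="your contact details") needs a network connection. Comparing the returned descriptions side by side is a quick way to spot near-misses like the wrong subspecies.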

Adding a new item to Wikidata

If there isn’t an item you can match, you can add your own one.

Help:Items tells you how to do that.

Create statements for the item to help make it clear what it is.

Statements on a Wikidata item for a nonbinary person, showing instance of 'human', and sex or gender of 'takatāpui' and 'non-binary'.

For example, a person’s record should include:

  • Instance of: human
  • Given name
  • Family name
  • Occupation
    • If you don’t have more definite information, add a contextually appropriate role here, like ‘botanical collector’
  • If it’s available, the identifier from your system. For us, this is Te Papa agent ID

See Heidi Meudt’s Wikidata page for a more filled-in example.

Wikimedia Commons prep

Categories in Wikimedia Commons (and Wikipedia) group content together and help make it findable.

When applied to uploads, it’s best to use the most specific applicable category. For example, this specimen upload is a Myosotis, but only has the Myosotis pansa category.

Commons:How to create new categories or subcategories

Data mapping and transformation

The data actually required to load images to Wikimedia Commons is very simple – a filename and a license statement. But it’s possible to provide a lot more data.

If including more complex data, you’ll want to use a template. Templates for some object types are much more mature than others.

Naturalis have created a more comprehensive specimen template called Biohist.

Harvesting data

With your selections and data mapping in place, you can now re-export your data in a format that’s easy to process and upload in OpenRefine.

Processing in OpenRefine

Load the fresh export of data to OpenRefine as a new project, and do a final review of your data.

  • Ensure the filenames and filepath are correct
  • Remember that some things may appear to be doubled up, as they’re covering both descriptive and structured metadata

Wikitext

Generate Wikitext for each item by transforming the Wikitext column with the following value (adjust as needed, of course):

"== {{int:filedesc}} ==\n" +
"{{TePapaColl\n" +
if(isBlank(cells.BasisOfRecord.value), "", "|BasisOfRecord=" + cells.BasisOfRecord.value + "\n") +
if(isBlank(cells.QualifiedName.value), "", "|QualifiedName=" + cells.QualifiedName.value + "\n") +
if(isBlank(cells.CommonName.value), "", "|MāoriCommonName=" + cells.CommonName.value + "\n") +
if(isBlank(cells.GenusCommonName.value), "", "|GenusCommonName=" + cells.GenusCommonName.value + "\n") +
if(isBlank(cells.MātaurangaMāori.value), "", "|MātaurangaMāori=" + cells.MātaurangaMāori.value + "\n") +
if(isBlank(cells.Family.value), "", "|Family=" + cells.Family.value + "\n") +
if(isBlank(cells.RegistrationNumber.value), "", "|RegistrationNumber=" + cells.RegistrationNumber.value + "\n") +
if(isBlank(cells.InstitutionCode.value), "", "|HerbariumCode=" + cells.InstitutionCode.value + "\n") +
if(isBlank(cells.TypeStatus.value), "", "|TypeStatus=" + cells.TypeStatus.value + "\n") +
if(isBlank(cells.TypeOf.value), "", "|TypeOf=" + cells.TypeOf.value + "\n") +
if(isBlank(cells.Institution.value), "", "|Institution=" + cells.Institution.value + "\n") +
if(isBlank(cells.DateCollected.value), "", "|CollectionDate=" + cells.DateCollected.value + "\n") +
if(isBlank(cells.CollectedBy.value), "", "|CollectedBy=" + cells.CollectedBy.value + "\n") +
if(isBlank(cells.IdentifiedBy.value), "", "|IdentifiedBy=" + cells.IdentifiedBy.value + "\n") +
if(isBlank(cells.Country.value), "", "|Country=" + cells.Country.value + "\n") +
if(isBlank(cells.StateProvince.value), "", "|StateProvince=" + cells.StateProvince.value + "\n") +
if(isBlank(cells.CatalogueRestrictions.value), if(isBlank(cells.PreciseLocality.value), "", "|PreciseLocality=" + cells.PreciseLocality.value + "\n"), "") +
if(isBlank(cells.ElevationMetresFromTo.value), "", "|Elevation=" + cells.ElevationMetresFromTo.value + "\n") +
if(isBlank(cells.DepthMetresFromTo.value), "", "|Depth=" + cells.DepthMetresFromTo.value + "\n") +
if(isBlank(cells.SourceUrl.value), "", "|SourceURL=" + cells.SourceUrl.value + "\n") +
if(isBlank(cells.CreditLine.value), "", "|CreditLine=" + cells.CreditLine.value + "\n") +
"}}\n" +
"=={{int:license-header}}==\n" +
"{{cc-by-4.0}}\n" +
"[[Category:Botany in Te Papa Tongarewa]]\n" +
"[[Category:Uploaded by Te Papa staff]]\n" +
"[[Category:Herbarium specimens]]\n" +
if(isBlank(cells.CategoryScientificName.value), "", "[[Category:" + cells.CategoryScientificName.value + "]]\n") +
if(isBlank(cells.TypeStatus.value), "", "[[Category:Museum of New Zealand Te Papa Tongarewa type specimens]]\n")
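
The expression repeats one pattern: include a template parameter only when its cell has a value. If you want to preview or test the output outside OpenRefine, that pattern can be sketched in Python as below. The helper name and the sample values are hypothetical, and the sketch doesn't cover the special handling of PreciseLocality, which is only included when CatalogueRestrictions is blank.

```python
def template_wikitext(template_name, fields):
    """Build a template call, leaving out any parameter whose value is blank."""
    lines = ["{{" + template_name]
    for parameter, value in fields:
        if value is not None and str(value).strip():
            lines.append(f"|{parameter}={value}")
    lines.append("}}")
    return "\n".join(lines) + "\n"

# Hypothetical values for one row; TypeStatus is blank so it is skipped
sample = template_wikitext("TePapaColl", [
    ("BasisOfRecord", "Specimen"),
    ("Family", "Boraginaceae"),
    ("TypeStatus", ""),
])
print(sample)
```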

Schema

Property | Example item | Qualifier property | Example qualifier item
depicts | Myosotis glabrescens | |
main subject | Myosotis glabrescens | |
source of file | file available on the internet | described at URL | https://collections.tepapa.govt.nz/object/470141
 | | retrieved | 10 October 2022
significant event | plant collection | point in time | February 1890
significant person | Donald Petrie | subject has role | botanical collector
country of origin | New Zealand | |
location | Otago Region | |
taxon name | Myosotis glabrescens L.B.Moore | taxon author | Lucy Beatrice Moore
 | | taxon author citation | L.B.Moore
 | Boraginaceae | |
instance of | type specimen | |
subject has role | holotype | of | Myosotis glabrescens
collection | Museum of New Zealand Te Papa Tongarewa Herbarium | |
 | Museum of New Zealand Te Papa Tongarewa | |
copyright status | copyrighted | |
copyright license | Creative Commons Attribution 4.0 International | |

Reporting and analytics

There are several tools that help gather analytics data about use of Wikipedia articles, Commons images, and more. They tend to provide a quantitative overview, so it’s good to supplement that with qualitative measures as well.

Using Wikimedia’s API to get pageviews

Wikimedia REST API documentation

This API gives you access to pretty much whatever you want to pull from Wikimedia, but what’s useful here is the pageviews data endpoint. This lets you send queries about how much use a given page is getting, customised with several parameters.

We run the following Python script monthly, creating a simple report from a couple of text files that list the URLs of the images and articles we want to keep track of.

from requests import get
from urllib.parse import quote, unquote
import csv

headers = {"Accept": "application/json", "User-Agent": "[PUT YOUR LOGIN EMAIL HERE]"}

# Queries the API for each url, called by Report.get_views()
class WikiAPI():
	def __init__(self):
		self.pageviews_base_url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

	def pageviews(self, project, access, agent, article, granularity, start, end):
		# The title goes into the URL path, so percent-encode it
		# (decoding first avoids double-encoding titles copied from a browser)
		article = quote(unquote(article), safe="")
		slugs = [self.pageviews_base_url, project, access, agent, article, granularity, start, end]
		query_url = "/".join(slugs)

		# Parse the JSON body of the reply into a dict
		response = get(query_url, headers=headers, timeout=30).json()

		return response

# Takes a list of urls and query parameters, creates API queries, and writes the results to a csv
class Report():
	def __init__(self, mode=None, articles=None, start=None, end=None, granularity=None, project=None, access=None, agent=None):
		self.mode = mode
		self.articles = articles
		self.start = start
		self.end = end
		self.granularity = granularity
		self.project = project
		self.access = access
		self.agent = agent

		self.API = WikiAPI()

		if self.mode == "articles":
			self.report_file = "{start} - {end} wikipedia article views.csv".format(start=self.start, end=self.end)
		elif self.mode == "images":
			self.report_file = "{start} - {end} wikimedia image views.csv".format(start=self.start, end=self.end)

		self.open_file = open(self.report_file, "w", newline="", encoding="utf-8")

		self.write_report()

	def write_report(self):
		self.reportwriter = csv.writer(self.open_file, delimiter=",")
		self.reportwriter.writerow(["wikiUrl", "pageViews"])

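		with open(self.articles, 'r', encoding="utf-8") as f:
			for line in f:
				# The page title is the last path segment of each url
				wiki_url = line.split("/")[-1].strip()
				# Skip blank lines rather than querying the API with an empty title
				if not wiki_url:
					continue
				view_count = self.get_views(wiki_url)
				self.reportwriter.writerow([wiki_url, view_count])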

		self.open_file.close()

	def get_views(self, article):
		view_count = 0
		response = self.API.pageviews(project=self.project, access=self.access, agent=self.agent, article=article, granularity=self.granularity, start=self.start, end=self.end)

		if "items" in response:
			for day in response["items"]:
				view_count += day["views"]

		return view_count

# Use to set parameters for the report
def run_report(mode=None):
	# Can be daily or monthly
	granularity = "daily"
	# YYYYMMDD or YYYYMMDDHH
	start = "20221001"
	# YYYYMMDD or YYYYMMDDHH
	end = "20221031"

	# Can be all-access, desktop, mobile-app, or mobile-web
	access = "all-access"
	# Can be all-agents, user, automated, or spider
	agent = "user"

	if mode == "articles":
		project = "en.wikipedia.org"
		articles = "tracked_articles.txt"

	elif mode == "images":
		project = "commons.wikimedia.org"
		articles = "tracked_uploads.txt"

	else:
		# Without this, an unknown mode fails later with a confusing NameError
		raise ValueError('mode must be "articles" or "images"')

	Report(mode=mode, articles=articles, start=start, end=end, granularity=granularity, access=access, agent=agent, project=project)

# mode can be "articles" or "images"
run_report(mode="images")
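
To make the moving parts easier to see, here is a small offline sketch of what happens to a single tracked URL: the page title is taken from the end of the URL, slotted into a pageviews query, and the daily counts in the reply are added up. The helper names are ours, the article URL is only an example, and the sample reply is made up to show the shape of the data (a list of items, each with a views count).

```python
from urllib.parse import quote, unquote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def title_from_url(url):
    """Take the page title from the last path segment of a wiki URL."""
    return unquote(url.strip().split("/")[-1])

def query_url(project, title, start, end,
              access="all-access", agent="user", granularity="daily"):
    """Assemble the per-article pageviews query for one page."""
    parts = [BASE, project, access, agent, quote(title, safe=""),
             granularity, start, end]
    return "/".join(parts)

def total_views(response):
    """Add up the views across every item in a reply."""
    return sum(item["views"] for item in response.get("items", []))

# Example line from a tracking file (illustrative URL)
title = title_from_url("https://en.wikipedia.org/wiki/Myosotis_pansa\n")
url = query_url("en.wikipedia.org", title, "20221001", "20221031")
print(url)

# Made-up reply with two days of data
sample_reply = {"items": [{"views": 3}, {"views": 4}]}
print(total_views(sample_reply))
```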

Use of images on Wiki project pages

Other tools let you see how categories of Commons images are used across the Wiki ecosystem, giving you a broad view of how a set of images is being used and also letting you drill down.

We use Glamorous to check the usage of all images under Category:Collections of Te Papa.

Filtering to a date span shows a chart of views by project (such as English-language Wikipedia, Spanish-language Wikipedia, Wikidata) on the Daily views tab.

Chart of pageviews for pages including images from a specified Wikimedia Commons category. The chart is broken down by Wiki project.

Usage is also charted on the Global file usage tab.

List of Wiki projects, with numbers of distinct files used, pages using files, total file usages, and page views for the selected period.

And the File usage details tab provides a complete breakdown of every image in the category, showing for each one:

  • Number of uses
  • Page views across projects
  • Which pages it’s linked on

List of images in the selected categories, along with a count of uses and page views, and pages that the images are included on.

Tracking contributions

It can be useful to see how interest by contributors is building, based on how active they are after significant releases or other work.

The Programs and Events Dashboard provides a combined view of multiple users' contributions. Users can be added to the overall campaign or individual events.

We’re using ours to see how staff interest is (hopefully) building as we release more material and publicise the work internally. Staff who are interested in contributing as part of their work are added to the board, and we then look at our collective impact.

Another tool we may use is Herding Sheep - the idea is to ask participants at public edit-a-thons we hold to share their usernames, so we can get an idea of what kind of session or topic inspires the most ongoing activity as an editor.

Qualitative data

Although the available tools mainly focus on raw numbers, the wider Wiki ecosystem does provide good ways to collate qualitative data, which may tell you things like:

  • What questions people are trying to answer when they go to Wikipedia
  • What sort of problems you’ve helped them solve
  • What they think is still missing

We’re keeping an eye on our user Talk pages, as well as those for articles we’ve edited and images we’ve uploaded.

Other existing channels, including our website pop-up survey and high-resolution image download questionnaire, are also being watched for relevant comments. We are currently receiving feedback through emails to individual staff, and may set up a digital outreach address to publicise as an easy point of contact.

The main trick is to actually record these comments as they’re received. Even adding them to our simple monthly reporting spreadsheet is enough to get that information aggregated, analysed, and shared with the right people.

In the future, we’re considering running observational user testing to get qualitative feedback on the specifics of how we’re using these platforms, particularly regarding user experience and content decisions.