Wikipedia:Reference desk/Archives/Computing/2024 January 16

From Wikipedia, the free encyclopedia
Computing desk
< January 15 << Dec | January | Feb >> January 17 >
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


January 16[edit]

Algorithm to match U.S. state names to USPS state codes[edit]

Re this mini project, I wish to match a list of USPS state codes (set A) with a list of U.S. state and territory names which has spelling variations (set B) in any order. Set B may have missing items, in which case set A matches [null]. This is an example successful match:

Set A CO, CT, DC, VA, WA, WV
Set B Conn., Wash., Washington (District of Columbia), W. Virginia

yields

Set A Set B
CO [null]
CT Conn.
DC Washington (District of Columbia)
VA [null]
WA Wash.
WV W. Virginia

Can someone please suggest a robust, performant yet simple to program algorithm? The target language is JavaScript.

Thanks,
cmɢʟeeτaʟκ 09:03, 16 January 2024 (UTC)[reply]

@Cmglee: I've re-read this multiple times and can't understand what this program is supposed to do. Are you trying to convert a natural language name to a USPS code? One which could have many variations, such as eg DC being "District of Columbia", "Washington DC", "Washington (District of Columbia)" for example. Is that what you're trying to do? —Panamitsu (talk) 10:28, 16 January 2024 (UTC)[reply]
Example map
Thanks for your reply, @Panamitsu: Sorry if I didn't make it clear. Yes, I effectively wish to convert a natural-language name to a USPS code. It is helped that the names are presented in one file, and each state or territory occurs at most once, so if one name somewhat matches one code, no other name can match that code. Here's my use-case:
I have created a template SVG map showing the states of the USA (and Washington DC). In the SVG code, I refer to the states' shape and labels with the USPS codes AK, AL, AR, AZ etc.
My collaborator, Timeshifter, however, receives data in the following format:
Alabama 41.4
Alaska
Arizona 31.4
Arkansas 43.5
The order of the states or their exact spelling cannot be guaranteed. Some states may be missing data: in the above, I have "Alaska" without a number, but that line could be left out entirely.
The JavaScript can have a lookup table with their nominal spellings, but may need to make matches e.g. using Levenshtein distances etc to find the best mapping. In my contrived example, the lookup table might have "CO → Colorado, CT → Connecticut", so when the data has "Conn.", it figures that "CT" is more likely than "CO".
Does it make sense now? It sounds that this is already a common solved problem in computer science. Thanks, cmɢʟeeτaʟκ 11:38, 16 January 2024 (UTC)[reply]
Your lookup table (which should be a map, not a table) is backwards. Let's assume it is called "map" in your code. If you get "Missouri", you set state=map["Missouri"] because map["Missouri"]="MO". It is likely that many people have made this map for their code. You can make it for yours. You have 51 states (including Washington DC), so you type it in. If you get one you haven't seen before, add it. Because of the way this map is made, you can have map["Colorado"] = "CO", map["Col."] = "CO", map["colarodo"] = "CO", etc... 12.116.29.106 (talk) 16:45, 16 January 2024 (UTC)[reply]
I would not use Levenshtein distances; I would just put common spellings into the object. I would restrict the case. If a match is not found, then add the failing string to the map.
/**
 * Table that maps lowercase state names to USPS abbreviations
 * @type {Object.<string, string>}
 */
const mapStateToUSPS = { "al" : "AL", "alabama" : "AL", "ak" : "AK", "alaska" : "AK", "ca" : "CA", "calif" : "CA", "california" : "CA" }; // etc...

/**
 * Convert a state name to a USPS abbreviation
 * @param name {string} name of the state
 * @returns {?string} the USPS abbreviation
 */
function lookupUSPS(name) {
  // normalize the name
  // could also remove punctuation: "N. Carolina" to "N Carolina"
  var key = name.toLowerCase();

  if (mapStateToUSPS.hasOwnProperty(key))
    return mapStateToUSPS[key];
  else
    return null;
}
Glrx (talk) 18:56, 16 January 2024 (UTC)[reply]
For the example given, you cannot use Levenshtein distances. The example is CO and CT with the abbreviation Conn. For that, the distance between CO and Conn is 2. The distance between CT and Conn is 3. CO is less distant than CT. So, it will not claim that CT is a better match. It will claim CO is a better match. If you really want to go down this route, you have to do a Levenshtein check between the abbreviation Conn and all full state names. But, you have a string length weakness. Conn to Colorado has a distance of 6. Conn to Conneticut has a distance of 6. If you divide the distance by the max possible distance, Conn to Colorado is 6/8=0.75 and Conn to Conneticut is 6/10=0.60. Now, 0.60 is less than 0.75. But, you will certainly hit more problems. You are much better off making a very complete map of all abbreviations. 12.116.29.106 (talk) 19:53, 16 January 2024 (UTC)[reply]
Note that "Washington (District of Columbia)" and "W. Virginia" from your set B do not occur as such in our article List of U.S. state and territory abbreviations, so you may need to add more name variations by hand. You can normalize the key by ignoring case and removing spaces and periods, so that "W. Virg." is replaced by "wvirg"; this will reduce the number of entries. While edit distance should not be the primary criterion, some well-crafted version of least edit distance can perhaps be used as a fallback if the key is not found in the lookup table.  --Lambiam 22:08, 16 January 2024 (UTC)[reply]
Another option to finding these abbreviations is to get a list of redirects to each state/territory which have the redirect templates {{R from alternate name}} and {{R from misspellings}}. This isn't going to get them all of course, but it should be a good start. Manually mapping typos isn't too much work. Heck, I have a database of over 1,000 flags which I manually type out descriptions for and it only takes a couple of minutes to map each typo. —Panamitsu (talk) 03:02, 17 January 2024 (UTC)[reply]
Thank you very much, everyone, for your feedback. Though I originally wanted a generic fuzzy-matching function, I've scaled back my ambitions and adapted Glrx's idea by first converting the name to lowercase and removing any non-alphanumeric-or-underscore characters.
If there is an exact match, it returns it, otherwise it finds the one with the lowest Levenshtein distance. This doesn't make use of the fact that each state appears at most once, but is hopefully sufficient for foreseeable data. I'll add a note to add common abbreviations e.g. "conn" or "wvirg" should the need arise.
Cheers,
cmɢʟeeτaʟκ 10:48, 17 January 2024 (UTC)[reply]