Jump to content

Wikipedia:Reference desk/Archives/Computing/2019 August 30

From Wikipedia, the free encyclopedia
Computing desk
< August 29 << Jul | August | Sep >> August 31 >
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


August 30

[edit]

How to create a Boolean search in a database

[edit]

Can someone please help me figure out how to create a Boolean search in a database? The "search page" for the database is here: [1]. I believe that you have to hit the tab for "Advanced Search", not the one for "Basic Search". I want to create a "search" where a particular film has won two types of awards. Let's just say that I want a list of all films that won "Best Picture" and also won "Best Actress in a Leading Role". I have created a Boolean search that I think should work (namely, "Best Picture" AND "Best Actress"). But, what I get with that search is all of the Best Picture winners and all of the Best Actress winners, but not the "intersection" or "overlapping" of the two categories. Or, I get an error message that says "Your search yielded no results". Can anyone figure this out? Just so you know, you probably have to set the "Display Results by" -- at the bottom, left corner -- to "Film (alpha)". Thanks. Joseph A. Spadaro (talk) 19:35, 30 August 2019 (UTC)[reply]

I had the same problem, presumably because it's not able to find the intersection of data from the film table ("Best picture") with that from the actor table ("Best actress"), which would require a table join. I do have a workaround, though. On the first line, I picked "Award category" and then I checked both "Best Picture" and "Best Actress in a Leading Role". I didn't use anything on the 2nd or 3rd line, and picked "Film (chron)" for the display order. This gave me a list of one or two films (in each recent year) with the winning categories listed under each. From that list it's easy enough to manually find those where the two awards are under the same film name. One complication, though, is that the award categories have changed over time. They seem to map "Best Picture" to several different terms used over the years. The more problematic issue is that the "Best Actress" category didn't always specify whether it was a leading or suppporting role, and they awarded more than one per year for some reason. They just seem to return all Best Actress's from that era, so you would have to determine for yourself whether they were in a leading role or not. I count 11 that match your criteria:
1934 - It Happened One Night - Claudette Colbert
1936 - The Great Ziegfeld - Luise Rainer
1939 - Gone with the Wind - Vivien Leigh
1942 - Mrs. Miniver - Greer Garson
1975 - One Flew over the Cuckoo's Nest - Louise Fletcher
1977 - Annie Hall - Diane Keaton
1983 - Terms of Endearment - Shirley MacLaine
1989 - Driving Miss Daisy - Jessica Tandy
1991 - The Silence of the Lambs - Jodie Foster
1998 - Shakespeare in Love - Gwyneth Paltrow
2004 - Million Dollar Baby - Hilary Swank

The "Best Actress in a Leading Role" category was added in 1976, so those before that are just for "Best Actress". There weren't any ties in the years that matched. SinisterLefty (talk) 20:52, 30 August 2019 (UTC)[reply]

@SinisterLefty: Thanks. Yes, I was able to do it manually, also. I used the same exact method that you used ... and just manually looked for the years/films that had both awards. But, my question was really about making the Boolean search work. I have other (similar) searches that I want to do. And I want the computer to do the work; I don't want to manually do the work myself. Isn't my type of question the precise reason for performing a Boolean search such as this? That is, to get at the "intersection" (the "AND") of the two sets of data? And, presumably, it's the precise reason for this search function being made available ... and being one that shouldn't need a "work-around" ... no? Thanks. Joseph A. Spadaro (talk) 00:26, 31 August 2019 (UTC)[reply]
You say "should". Since you're using the interface that the Academy provides to the data, you are limited to the query types that it provides. "Shouldn't" that be obvious? You might think it sensible for them to accept the sort of query you want, but they don't. So unless you can convince them to offer the raw data, a workaround is all you can do. --76.69.116.4 (talk) 03:20, 31 August 2019 (UTC)[reply]
You are missing my point. A Boolean search -- of the precise type that the Academy here offers -- precisely should do what I want it to do. That is, perform the Boolean "AND" (intersection) function upon two sets of data. That's the entire point of the Boolean search function. A user should not need a "work-around" to get it to work as intended. Update: I did call the Academy. They acknowledged that this search function is "broken" and needs to be fixed. They will call or email me next week, to update me. Joseph A. Spadaro (talk) 03:33, 31 August 2019 (UTC)[reply]
Yes, I agree that it should handle your request. But I do see why an intersection of the film and actor tables is more complex than an intersection wholey within one table, like a film that won both Best Picture and Best Soundtrack. They may not be willing to put in the time and talent to support the more complex queries. Let's hope they are. SinisterLefty (talk) 03:54, 31 August 2019 (UTC)[reply]
@SinisterLefty: Thanks. Why do you say that a Best Picture / Best Soundtrack combo would be easier (less complex) than a Best Picture / Best Actress combo? They are simply "overlapping" or "intersecting" (i.e., performing a Boolean function on) two sets of data. Why would it matter if the second set of data is Best Soundtrack or Best Actress? Thanks. Joseph A. Spadaro (talk) 04:19, 31 August 2019 (UTC)[reply]
Well, the easy query would look something like:
SELECT TITLE,
       YEAR
  FROM FILM
 WHERE BEST_PICTURE_STATUS      = "W"
   AND BEST_SOUNDTRACK_STATUS   = "W";
While the hard one would likely need to join 3 tables, like so:
SELECT F.TITLE,
       F.YEAR,
       A.FIRST_NAME,
       A.LAST_NAME,
       I.ROLE_FIRST_NAME
       I.ROLE_LAST_NAME
  FROM FILM          F
       ACTOR_IN_FILM I
       ACTOR         A
 WHERE F.BEST_PICTURE_STATUS  = "W"
   AND I.FILM_INDEX_NUM       = F.INDEX_NUM
   AND I.ACTOR_SAG_NUM        = A.SAG_NUM
   AND I.ROLE_TYPE            = "Leading"
   AND I.AWARD_STATUS         = "W"
   AND A.GENDER               = "F";
So, the 2nd is considerably more difficult to generate automatically than the first. SinisterLefty (talk) 11:38, 31 August 2019 (UTC)[reply]
Yes, perhaps that is the issue. Thanks. Well, hopefully, they will fix this. Thanks. Joseph A. Spadaro (talk) 19:18, 31 August 2019 (UTC)[reply]

Thanks, all. Joseph A. Spadaro (talk) 19:19, 31 August 2019 (UTC)[reply]

Resolved

How would you fix this code?

[edit]

Hopefully this question is OK to ask here. Basically, I have to write a function that takes a list as its input and that eliminates all of the string items from this list. This is what I have so far:

def filter_list(l):

i = 0

while i < len(l):

if type(l[i]) is str:

del l[i]

i = i + 1

print(l)

What exactly am I doing wrong here? Futurist110 (talk) 22:28, 30 August 2019 (UTC)[reply]

First of all, if this is Python, you need some whitespace on the left. Note what Wikimedia does with lines that begin with space. —Tamfang (talk) 21:21, 1 September 2019 (UTC)[reply]
Well the modern fashion is for functional programming. In which you wouldn't delete anything, but you'd return (from the function) a list which only contained the good stuff - i.e. a copy of the partial list, not trying to modify the existing list.
If you try to modify an array like this (this is an array not a list) then you're using an index into that array (arrays are indexable lists). But when you delete a member, it shuffles them all along. Much better is to use a list and to use an iterator, rather than using an array index as a control variable. This iterator guarantees you a sequence of all the members, one by one, even if the underlying list changes. Andy Dingley (talk) 22:36, 30 August 2019 (UTC)[reply]
Agreed. But sticking with what's written, the obvious fix is to only iterate when we don't delete a string, or to go through the list in reverse order, so this won't be an issue. The loop structure isn't obvious either, so I can't tell it that's correct. I suggest using tabs to show nested loops.SinisterLefty (talk) 22:41, 30 August 2019 (UTC)[reply]
Yes, I don't know what language that is, but
if type(l[i]) is str:
del l[i]
i = i + 1

have to be in the scope of the While. Bubba73 You talkin' to me? 01:04, 31 August 2019 (UTC)[reply]

First, you can write formatted code with the <source> tag, like this:

def filter_list(l):
    i = 0
    while i < len(l):
        if type(l[i]) is str:
            del l[i]
        i = i + 1
    print(l)

Second, notice what happens to the list when you delete an element. Try working through, by hand, what happens if you run the function on the 2-element list ["a","b"]. What are the values of i and l the first time through the loop? What about the top of the loop, the second time? It is important practice to work through some small examples like this by hand (in your head, or with pen and paper), but for bigger ones you should also learn to use a debugger. The one Python comes with is called pdb (python debugger) and it is primitive, but better than nothing. There are some fancier ones out there too. d67.164.113.165 (talk) 10:17, 31 August 2019 (UTC)[reply]

" What about the top of the loop, the second time?" - This is important if the language moves the rest of the list up when you delete an item. Bubba73 You talkin' to me? 16:10, 31 August 2019 (UTC)[reply]
  • BTW - once you've got a few versions of this working, then try giving them each the same long list and then timing their performance. Any loop like this which has an array delete in it is likely to run very slowly. Maybe as badly as time. Andy Dingley (talk) 16:19, 31 August 2019 (UTC)[reply]
That's true. A fast way would be to go through the list and copy the ones you don't want want to keep to a new list. Bubba73 You talkin' to me? 17:10, 31 August 2019 (UTC)[reply]
You can also delete them in situ and move them down. But what you shouldn't do is to move them down repeatedly. Andy Dingley (talk) 19:03, 31 August 2019 (UTC)[reply]
There is not really any way to "delete them in situ", other than perhaps using a separate list to remember which ones to delete. But, I think that's a separate issue from understanding why the original code doesn't work. SinisterLefty, do you need more explanation about this? It is fine to try these different approaches in order to learn their strengths and drawbacks. If you can fix the code and post a working version, we can talk about its effiency afterwards, and also discuss a bit about "philosophy" (i.e. what other approaches you might use). 67.164.113.165 (talk) 20:33, 31 August 2019 (UTC)[reply]
"There is not really any way to "delete them in situ", other than perhaps using a separate list to remember which ones to delete."
Course there is. One index variable, and two state variables to record the start of the non-hole piece to be kept, and the new index for the last of the moved items. When you hit a hole, just keep incrementing the index and ignoring what's there. When you hit a non-hole, record the start index. When you next hit a hole, shift downwards the previous block (between the stored index and the current index). But only shift each old value once. Andy Dingley (talk) 21:36, 31 August 2019 (UTC)[reply]
I wasn't looking at efficiency, just trying to get it to work. And we don't always need to care about efficiency, for example if we know there will never be more than 10 items in the list, and we only need to sift out the character strings once, say if the list was generated by a form the user filled out and should only contain numeric data. If we will have thousands or millions of items to sift through, then yes, we need to be careful about CPU time, then. SinisterLefty (talk) 02:40, 1 September 2019 (UTC)[reply]
Do you understand yet what was wrong? That "i" was being incremented without taking into account that the list shrinks after the del operation. So the 3 issues are: a) being able to debug a problem like that, b) efficiency (less important than getting the code working), and c) a vague stylistic issue I'll describe next. c) is that modifying any value in place (in this case the list) tends to be error-prone, so there's a programming style ("functional programming") that among other things tries to avoid such modifications. The FP approach would be to make a new list that is a copy of the old list, minus the elements you want to remove:
def filter_list(l):
  return [x for x in l if type(x) is not str]
That is conceptually similar to a SQL-ish "select x from l where type(x) != str;". It incurs both speed and memory overhead (since it copies the list) and that triggers some old-school programmers, but IME it leads to getting code working sooner, and you can identify and optimize the performance hot spots afterwards. 67.164.113.165 (talk) 17:47, 1 September 2019 (UTC)[reply]
I think you've confused me with the OP, who is Futurist. I certainly understand point a, as I offered two solutions for it. (BTW, you have two point c's.) SinisterLefty (talk) 21:33, 1 September 2019 (UTC)[reply]
Sorry, yes, post was intended for OP. I only have one point c, the stylistic point about functional programming. I see the confusion so I should have written the sentence a bit differently. 67.164.113.165 (talk) 21:37, 1 September 2019 (UTC)[reply]
IMHO, often better without the list comprehension. Return the generator function instead (you can always comprehend the list outside the call). If it's a big list, this saves you the memory consumption. Andy Dingley (talk) 22:19, 1 September 2019 (UTC)[reply]
Yes that's another possibility, though it gets a little bit fancy since a Python generator is another bug-inducing mutating object. Most of the time (assuming you're going to access all the elements anyway), creating a temporary copy of even a large (not enormous) list is not too bad, since you're likely to release the old one immediately after creating the new one. I have Python 2 code break in Python 3 all the time because of that. Python generators are ok if you know what you are doing and are careful with them, but it's easy to make mistakes. 67.164.113.165 (talk) 22:50, 1 September 2019 (UTC)[reply]
Python Lists/Arrays are weird data structures, and I don't know how exactly they are implemented. However, I suspect that making a copy of the required items is less costly than repeatedly modifying the existing list. For classical arrays, removing an element while maintaining order is, on average, O(n), so doing this for up to n elements will give you O(n^2). For lists, removing an element can be O(1), but accessing it via the index is O(n), so this again leads to O(n^2). Making copies as in the list comprehension results in worst case O(n) performance. --Stephan Schulz (talk) 20:30, 3 September 2019 (UTC)[reply]
And having looked at the FAQ and the code, Python (CPython 2/3) "lists" are indeed variable length arrays, which makes deletion of elements (except at the end) costly. So unless the list is very short (in which case it does not matter), the list comprehension method will be faster than the deletion method. Also, the deletion method would be vaguely faster (but not in a Big-O-sense) if the iterator works back to front (because already deleted items in the back don't need to be moved). Of course, in reality this might break the cache prefetching. --Stephan Schulz (talk) 08:56, 4 September 2019 (UTC)[reply]