• Query for duplicate names where at least one record’s YesNo field is False

    Home » Forums » AskWoody support » Productivity software by function » MS Access and database help » Query for duplicate names where at least one record’s YesNo field is False

    Author
    Topic
    #484002

    I have a Find Duplicates query that identifies people with duplicate names in a table. When the initial set of data are imported to the table, this enables the user to check whether the same person appears twice; however, in some cases the duplicate names do in fact relate to different people, and these legitimate duplicates are flagged as True/Yes in a YesNo field.

    The table has new data appended from another source every month, and I want to identify whether any of the imported records equate directly to existing records in the database. The Find Duplicates query will pick out all duplicates, but as the list grows I would prefer not to show pairs of names from the original dataset that have already been flagged.

    However, a new record from the appended data might relate to an already-flagged original. I would therfore like to modify the Find Duplicates SQL to add the condition that at least one of the duplicate records should have the YesNo field set to False/No (if all records in a duplicate grouping are set to True/Yes, it would mean the records have already been checked and don’t need to be checked again).

    Of course, some duplications will be entirely new – ie both will have their YesNo fields set to False/No.

    This is an area of SQL I haven’t previously explored – any suggestions?

    (I know I can create a function to do this, but feel that an SQL solution would work faster.)

    Thanks and regards

    Alison C

    Viewing 4 reply threads
    Author
    Replies
    • #1337967

      Don’t you have access to the original SQL query? It would seem to me that it would be rather easy to use that query as a starting point and add the required changes to it.

      • #1338247

        Yes, here it is – sorry if it would have helped to include it in the original post. The problem for me is knowing what I need to add to the SQL!

        SELECT
        itblLGAClaimantsUnique.strLast,
        itblLGAClaimantsUnique.strFirst,
        itblLGAClaimantsUnique.dtmDoB,
        itblLGAClaimantsUnique.strLGAID,
        itblLGAClaimantsUnique.ysnLastFirstDupOK
        FROM
        itblLGAClaimantsUnique
        WHERE
        (((itblLGAClaimantsUnique.strLast)
        In (SELECT [strLast] FROM [itblLGAClaimantsUnique]
        As Tmp GROUP BY [strLast],[strFirst]
        HAVING Count(*)>1 And [strFirst] = [itblLGAClaimantsUnique].[strFirst])))
        ORDER BY
        itblLGAClaimantsUnique.strLast;

        I thought I could use a Domain Aggregate function to count how many False records there are with any particular last/first combination, although I know from experience that they can really slow down a query. Having said that, I suppose that’s what the Find Duplicates query is using already, so perhaps it doesn’t matter. I tried adding the following field:

        DCount(“[strLGAID]”,”itblLGAClaimantsUnique”,”[strLast] = ‘” & [strLast] & “‘ AND [strFirst] = ‘” & [strFirst] & “‘ AND [ysnLastFirstDupOK]=False”)

        This changes the WHERE clause to:

        WHERE
        (((itblLGAClaimantsUnique.strLast)
        In (SELECT [strLast] FROM [itblLGAClaimantsUnique]
        As Tmp GROUP BY [strLast],[strFirst]
        HAVING Count(*)>1 And [strFirst] = [itblLGAClaimantsUnique].[strFirst]))
        AND ((DCount(“[strLGAID]”,”itblLGAClaimantsUnique”,
        “[strLast] = ‘” & [strLast] & “‘
        AND [strFirst] = ‘” & [strFirst] & “‘
        AND [ysnLastFirstDupOK]=False”))>0))

        It correctly returns 0, 1 or whatever the number of False fields adds up to. But when I add >0 into the Criteria section, it spits the dummy with two successive error messages:

        Syntax error (missing operator) in query expression ‘[strLast] = ‘Nurk’ AND [strFirst] = ‘Fred’ AND [ysnLastFirstDupOK]=False”

        and

        Unknown

        Now, Fred Nurk is not one of the people who is duplicated, so I’m assuming the problem is because the record is being eliminated from the dynaset by the first part of the WHERE clause, and is therefore not available by the time it’s working on the DCOUNT bit. I thought I could fix it by leaving the first query as is, and creating a second query, based on the Find Duplicates query, with the new DCOUNT field and criteria of >0, but I get the same error messages.

        I could probably make it work (assuming the problem isn’t with something in my syntax) if I made a table from the first dynaset and then applied a query to the table; but I’m trying to change the data dynamically, and it seems to be an awful lot of steps to then use the resulting table’s changes to update the original data.

        So if all else fails, I can use a VBA function to do the same thing; I just thought it would be neater in the original SQL.

        Here is the guts of the VBA code for the function “Checked”, which works fine with the original Find Duplicates query:

        Dim dbs As Database
        Dim rst As Recordset
        Dim strSQL As String

        Set dbs = CurrentDb()
        strSQL = “SELECT itblLGAClaimantsUnique.strLast, ” & _
        “itblLGAClaimantsUnique.strFirst, ” & _
        “itblLGAClaimantsUnique.ysnLastFirstDupOK ” & _
        “FROM itblLGAClaimantsUnique ” & _
        “WHERE (((itblLGAClaimantsUnique.strLast)=””” & strLast & “””) ” & _
        “AND ((itblLGAClaimantsUnique.strFirst)=””” & strFirst & “””) ” & _
        “AND ((itblLGAClaimantsUnique.ysnLastFirstDupOK)=False));”
        Set rst = dbs.OpenRecordset(strSQL, dbOpenSnapshot)
        Checked = rst.RecordCount

        (I’m a lot more comfortable with VBA than SQL; my knowledge of the latter comes from pulling apart the code after I’ve used the query window to create it!)

        Finally, so you don’t think I’m a complete nong, [strLGAID] is not a substitute for an autonumber PKF; this is an interim table between importing some data, filtering for unique entries, and appending to the main table where the autonumber PKF is allocated!

        Thanks for your interest. 🙂

        Cheers

        Alison C

    • #1338249

      Maybe I didn’t exactly get what you wanted, but if I got it right, you can do it like this:

      SELECT * FROM
      (

      SELECT
      itblLGAClaimantsUnique.strLast,
      itblLGAClaimantsUnique.strFirst,
      itblLGAClaimantsUnique.dtmDoB,
      itblLGAClaimantsUnique.strLGAID,
      itblLGAClaimantsUnique.ysnLastFirstDupOK
      FROM
      itblLGAClaimantsUnique
      WHERE
      (((itblLGAClaimantsUnique.strLast)
      In (SELECT [strLast] FROM [itblLGAClaimantsUnique]
      As Tmp GROUP BY [strLast],[strFirst]
      HAVING Count(*)>1 And [strFirst] = [itblLGAClaimantsUnique].[strFirst])))
      ORDER BY
      itblLGAClaimantsUnique.strLast
      ) As T1

      WHERE T1.

      ysnLastFirstDupOK=False

      P.S.: For some reason, the editor is not helping with line breaking, so please make sure those last two lines are actually just one.

    • #1338250

      Thanks – that doesn’t give any errors, but only calls up records that are unchecked, which isn’t quite what I’m after.

      What I’m trying to do is identify if there are any unchecked records and, if so, pull out all records with matching names (including those that have previously been marked as checked). This is so that, each month, if a new record is imported with the same name as an existing one, the old and new can be displayed to identify if they are the same person or not; but if there are no new records, I don’t want to bother showing old records that have already been identified as two separate people.

      Sorry, I know it’s a bit convoluted and therefore hard to explain clearly; so I’m not surprised if you’ve had trouble working out what I’m after!

      I’ll take a closer look at your code and see if I can adapt it. Meanwhile, the VBA function is working, albeit slowly, so I do have a fall-back position.

      Cheers

      Alison C

      • #1338466

        Thanks – that doesn’t give any errors, but only calls up records that are unchecked, which isn’t quite what I’m after.

        What I’m trying to do is identify if there are any unchecked records and, if so, pull out all records with matching names (including those that have previously been marked as checked). This is so that, each month, if a new record is imported with the same name as an existing one, the old and new can be displayed to identify if they are the same person or not; but if there are no new records, I don’t want to bother showing old records that have already been identified as two separate people.

        Sorry, I know it’s a bit convoluted and therefore hard to explain clearly; so I’m not surprised if you’ve had trouble working out what I’m after!

        I’ll take a closer look at your code and see if I can adapt it. Meanwhile, the VBA function is working, albeit slowly, so I do have a fall-back position.

        Cheers

        Alison C

        Hi Alison,
        Is it possible for you to post a sample database, with some dummy names, but with enough combinations of your Yes/No field checked and unchecked? Also, post the results you expect the query to deliver. It may very well be that you need to do this using VBA code.

        Regarding the slowness of your procedure, how many records is your query + code working against? Are fields used for query criteria properly indexed? Is this a split application (FE/BE) with a network wire that separates you from the data? If “yes”, have you established a persistant connection to the BE database? Here is an article I wrote that provides many suggestions for getting better performance:

        Implementing a Successful Multiuser Access/JET Application
        http://www.accessmvp.com/TWickerath/articles/multiuser.htm

        You can likely speed up your VBA procedure by declaring the database and recordset variables as global variables, and set the database variable one time, instead of setting it for each iteration. Also, I highly recommend that you “disambiguate” your recordset declaration:

        Dim rst As Recordset —-> Dim rst As DAO.Recordset

        This will help prevent the possiblity of a situation where your code compiles fine, yet you experience run-time error 13: Type Mismatch. More information here:

        ADO and DAO Library References in Access Databases
        http://www.accessmvp.com/TWickerath/articles/adodao.htm

        • #1338692

          Hi Tom –

          I was beginning to think it might be a good idea to create some dummy data, but Ruirib’s response seems to have done what I need.

          However, I really appreciate your comments about speeding up databases – it’s one of my biggest problems, as I never seem to get the simple ones to deal with!

          At this stage it’s all in the same file, as I usually do the main development that way so I can fine tune table designs on the fly; I’ll separate the FE and BE when I’m happy that I have the data structure right.

          I have used persistent connections on other databases, and anticipated doing so on this one, but will be very interested to look at your article, as I might not be doing it as well as I could. We’re currently looking at a database that will grow every month. It starts at about 150 records, the first update will take it to about 2,500 records (first merge of two currently completely separate historical datasets) and will then settle down to increasing at a rate of about 10 per month (but with a lot of updated data in the existing records).

          I didn’t realise setting the database as a global variable would improve speed – I just assumed it would be better to clean it out of memory when I finished running the subroutine. Your recommendation makes a lot of sense in light of the number of iterations, though.

          Disambiguation looks as though it might help the problems I sometimes have when clients install a copy on a new PC that doesn’t have the correct references set up.

          As for indexing – I live in constant fear of being found to be a complete fraud if anyone else looks at my indexing. I try to keep it tight, but as I also try to offer a lot of flexibility in the user interface, I find myself saying “They might want to sort on this field” or “They might want to select on that one” or even “They might want to sort/select on Last Name only on some occasions, but on others it will be based on Last/First” and ending up falling down the other side of the optimisation bell curve with far too many indexed fields.

          A quick look at your links tells me two things immediately: I’m not doing everything wrong; and I’m in for some good reading when I get away from work tonight…

          Many thanks for your helpfull suggestions.

          Cheers

          Alison

          • #1338698

            I have used persistent connections on other databases, and anticipated doing so on this one, but will be very interested to look at your article, as I might not be doing it as well as I could.

            I usually just open a form in hidden mode that is bound to a table with one record in the BE database. The user never knows this form is open. You will need to close it first, in order to compact.

            I didn’t realise setting the database as a global variable would improve speed – I just assumed it would be better to clean it out of memory when I finished running the subroutine. Your recommendation makes a lot of sense in light of the number of iterations, though.

            The statement Set db = CurrentDB() is fairly costly, as far as time goes. If you do this only once, when a procedure runs, no big deal. But, if this line of code is being executed for each record in a query that calls your function, the penalty can become quite noticeable, especially if you are dealing with hundreds or thousands of records.

            Disambiguation looks as though it might help the problems I sometimes have when clients install a copy on a new PC that doesn’t have the correct references set up.

            Disambiguation only helps with reference priority issues. It does not help at all if a required reference is not selected.

            As for indexing – I live in constant fear of being found to be a complete fraud if anyone else looks at my indexing. I try to keep it tight, but as I also try to offer a lot of flexibility in the user interface, I find myself saying “They might want to sort on this field” or “They might want to select on that one” or even “They might want to sort/select on Last Name only on some occasions, but on others it will be based on Last/First” and ending up falling down the other side of the optimisation bell curve with far too many indexed fields.

            Isn’t this a question that you can ask your users? If you provide them with some type of Query-by-Form (QBF), then you can easily restrict which fields they can choose as criteria or for sorting. It sounds like (?) you may be providing your users with direct access to the database window? That’s something I don’t do.

            A quick look at your links tells me two things immediately: I’m not doing everything wrong; and I’m in for some good reading when I get away from work tonight…

            Many thanks for your helpfull suggestions.

            Great! And you’re very welcome.

            • #1338705

              Clearly, the next thing I need to tackle is how to split a message into multiple quotes – thought it might have something to do with “Multi-quote this message” but if it does, I’m too tired at the end of a long day to work out how…

              Meanwhile, though, here are my responses to each of your points:

              – Persistent connections: That’s how I’ve been doing them, right down to closing for the purpose of compacting – good to know I got that right!
              – Global variable for Db: Makes sense; I have then used AutoExec to run a short function using Set Db when the database is opened – is that the best way?
              – Disambiguation: That should fix about a third of the problems I encounter with references on other people’s computers, at least.
              – Indexing: Don’t worry, I don’t give users access to the database window – don’t want fiddling around by the people who know just enough to be dangerous! No, it’s more that the dialog boxes I provide for initiating reports or data searching often have a number of options – mainly because I did ask the users, and they kept thinking of more options they wanted. And I do admit to empathising with the users a bit too much sometimes – other developers I know tend just to say “Can’t do that” to keep it uncomplicated, whereas I look at all the ways they can get something from the database then try to give them the tools. There’s a happy medium there somewhere, I just have to find it…

              Cheers

              Alison

            • #1338730

              Hi Alison,

              The multiple quotes thing is simple…just use the HTML quote and end quote tags:

              [*Quote*] and [*/Quote*]

              Don’t include the 4 asterisks I added, before and after the square brackets. I did this so that I could show how to use this tag, without it actually taking effect.

              Using an Autoexec macro to run a function to establish a global DB variable is certainly a good method. The best method, of course, is entirely dependent on your situation. For example, if you only need the global DB variable for code that only one person in 20 users is ever going to run, then you might want to delay creating such a global variable. I do recommend using a naming convention, such as gdb (or gDB), where the “g” indicates global. Likewise, use a lowercase “m” for module-level declarations, in code behind forms and reports, where once the form or report closes, the variable goes out of scope.

              [Quote]
              “… – other developers I know tend just to say “Can’t do that” to keep it uncomplicated…”
              [/Quote]

              It all depends on the situation. When users are having to pay for each request, they tend to self-regulate, and turn down the requests. On the other hand, when users do not have to pay the costs, it is not unusual to hear how a feature is a “must-have”, and, after adding it, you find out that very few people actually use it. But, I will say that that kind of attitude, in general, will get a developer in trouble. Someone else, like yourself, will be able to come in and ‘eat their lunch’.

    • #1338275

      I hadn’t really realized that was what you wanted. I will propose a different query in a few hours.

    • #1338435

      See if this does it:

      SELECT [/FONT][/LEFT][/COLOR]
      itblLGAClaimantsUnique.strLast, [/FONT][/LEFT][/COLOR]
      itblLGAClaimantsUnique.strFirst, [/FONT][/LEFT][/COLOR]
      itblLGAClaimantsUnique.dtmDoB, [/FONT][/LEFT][/COLOR]
      itblLGAClaimantsUnique.strLGAID, [/FONT][/LEFT][/COLOR]
      itblLGAClaimantsUnique.ysnLastFirstDupOK[/FONT][/LEFT][/COLOR]
      FROM [/FONT][/LEFT][/COLOR]
      itblLGAClaimantsUnique[/FONT][/LEFT][/COLOR]
      WHERE [/FONT][/LEFT][/COLOR]
      (((itblLGAClaimantsUnique.strLast) [/FONT][/LEFT][/COLOR]
      In (SELECT [strLast] FROM [itblLGAClaimantsUnique] [/FONT][/LEFT][/COLOR]
      As Tmp
      WHERE [strLast] IN (SELECT [strLast] FROM [itblLGAClaimantsUnique] itb WHERE ysnLastFirstDupOK=False )
      GROUP BY [strLast],[strFirst]
      [/FONT]
      [/LEFT][/COLOR]
      HAVING Count(*)>1 And [strFirst] = [itblLGAClaimantsUnique].[strFirst])))[/FONT][/LEFT][/COLOR]
      ORDER BY [/FONT][/LEFT][/COLOR]
      itblLGAClaimantsUnique.strLast[/FONT][/LEFT][/COLOR]

      • #1338690

        Hi Ruirib –

        That last solution seems to give me what I want, and faster than the VBA function I had written.

        I need to apply this in a couple of other contexts (checking Last/DoB and First/DoB as well as the same 3 options on another table) so would like to get my head around the key bit of the code – ie the WHERE section – that you provided.

        WHERE
        (((itblLGAClaimantsUnique.strLast)
        [/FONT]

        In (SELECT [strLast] FROM [itblLGAClaimantsUnique] [/FONT][/LEFT][/COLOR]As Tmp
        WHERE [strLast] IN (SELECT [strLast] FROM [itblLGAClaimantsUnique] itb WHERE ysnLastFirstDupOK=False )
        GROUP BY [strLast],[strFirst]
        [/FONT]
        [/LEFT][/COLOR]HAVING Count(*)>1 And [strFirst] = [itblLGAClaimantsUnique].[strFirst])))[/FONT][/LEFT][/COLOR]
        I think I get most of it – it’s just a little bit in the second WHERE line that I’m not sure of: what does “itb” do?

        Many thanks for the time you have spent on this – I really appreciate your persistence!

        Cheers

        Alison

        • #1338694

          Aaarrrggghhh! Deficient testing on my part – I tested with neither records’ fields checked and one record’s field checked, but for some reason completely omitted to test with both records’ fields checked – and unfortunately have now found that the query returns both checked records (haven’t even started to play with three matching records). As the reason I’m trying to do this is to eliminate the display of matching records where they have all been OK’d already, I’m back to the drawing board!

          I thought I’d made sense of the SQL, so can’t really figure out what it doesn’t like, but will fiddle around and see where it takes me.

          Thanks again

          Cheers

          Alison

          • #1338695

            OK! I’ve got it – at least, I know why it’s throwing up the apparent error. There is somebody else with the same surname, but different first name; because there is no duplicate First/Last combination for this person, the check box is false – I wasn’t bothering with the check boxes of people who didn’t have both the first name and last name matching. So yes, the “WHERE [strLast] IN (SELECT [strLast] FROM [itblLGAClaimantsUnique] itb WHERE ysnLastFirstDupOK=False )” is working the way we’ve asked it to.

            I suppose it would work if I had a compound field that concatenated the First and Last names, but will now go and do some more testing to see if I can avoid having to go back and create this field in other objects, or if I can just use it dynamically.

            Cheers

            Alison

            • #1338700

              Hi Alison,

              Here’s a thought….can you append all names to the table and then use the Soundex function, with the appropriate constant, to help locate possible duplicates? You can display the possible duplicates in a form which includes a checkbox where you, or another human being [but not a computer] decides on actual duplicates. You may have to experiment with the constant to find the best value for the majority of names in your database. If memory serves me correctly, a value of 4 works pretty good for most names.

              More information on Soundex, and an alternate method (Levenshtein Distance) here:

              April 2005: “Close” only counts in horseshoes…and databases (467 KB)
              A couple of techniques to help determine when entries are “close enough” to be considered the same

              available on this page, by Access MVP Doug Steele:
              http://www.accessmvp.com/djsteele/SmartAccess.html

            • #1338706

              Not familiar with Soundex, but it looks worth a try. Last one in the office so heading home before the lights go out, but will check it out tomorrow and see if it provides a better way of doing things.

              Thanks again.

              Alison

            • #1338709

              Just a quick comment on speed, regarding the use of SQL vs. VBA. I have yet to find a situation where VBA beats a SQL based alternative. Of course, in Access, your SQL options are limited to a single statement, even if it can be complex, but if what you want done can be done in SQL, it will give you the fastest option, for sure.

              If you need further help with this specific issue, I am happy to keep helping, so just let me know.

            • #1338731

              Just a quick comment on speed, regarding the use of SQL vs. VBA. I have yet to find a situation where VBA beats a SQL based alternative. Of course, in Access, your SQL options are limited to a single statement, even if it can be complex, but if what you want done can be done in SQL, it will give you the fastest option, for sure.

              I agree. Running set operations with SQL is always faster versus running equivalent code in VBA.
              To get around the single statement limitation, I often times create a procedure in VBA code that simply calls one query after another. This is a technique I use at work, where I’m allowed read-only access to a huge Oracle database. I have a procedure that starts by running delete queries to clear local tables out, then a series of append queries to add data back in, followed by various update queries to massage the data as required. And, in the case of this data, I also have to run a non-SQL based procedure to replace non-printing characters that come with the data.

        • #1338708

          Hi Ruirib –

          That last solution seems to give me what I want, and faster than the VBA function I had written.

          I need to apply this in a couple of other contexts (checking Last/DoB and First/DoB as well as the same 3 options on another table) so would like to get my head around the key bit of the code – ie the WHERE section – that you provided.

          WHERE
          (((itblLGAClaimantsUnique.strLast)
          [/FONT]

          In (SELECT [strLast] FROM [itblLGAClaimantsUnique] [/FONT][/LEFT][/COLOR]As Tmp
          WHERE [strLast] IN (SELECT [strLast] FROM [itblLGAClaimantsUnique] itb WHERE ysnLastFirstDupOK=False )
          GROUP BY [strLast],[strFirst]
          [/FONT]
          [/LEFT][/COLOR]HAVING Count(*)>1 And [strFirst] = [itblLGAClaimantsUnique].[strFirst])))[/FONT][/LEFT][/COLOR]
          I think I get most of it – it’s just a little bit in the second WHERE line that I’m not sure of: what does “itb” do?

          Many thanks for the time you have spent on this – I really appreciate your persistence!

          Cheers

          Alison

          Hi Alison,

          The itb bit is an alias. Basically it assigns a shorter, easier to write name to a table, in this case itblLGAClaimantsUniquou

          [/B]

          , that you can then use when referring to that specific table. It wasn’t really needed, in this case. I use alias systematically when writing SQL, as it makes it easier to read and write. You could do without it, in this case.

          [/LEFT]

    Viewing 4 reply threads
    Reply To: Query for duplicate names where at least one record’s YesNo field is False

    You can use BBCodes to format your content.
    Your account can't use all available BBCodes, they will be stripped before saving.

    Your information: