• A simpl(ish) way of getting the nth word

    Home » Forums » AskWoody support » Productivity software by function » MS Excel and spreadsheet help » A simpl(ish) way of getting the nth word

    Author
    Topic
    #500934

    The Problem

    I recently needed to fetch a value from a (comma-separated) text list, which is an equivalent problem to fetching a word from a (space-separated) sentence.

    Google turned up some interesting but convoluted results (including the possibility that it couldn’t be done without VBA). Inspired by the solution at SourceDaddy ( http://sourcedaddy.com/ms-excel/get-nth-word.html ), I came up with what I think is a simpler approach.

    The Setup

    There are four inputs:
    A2 contains the list or sentence.
    B2 contains a numeric index indicating which item list or word we want to fetch.
    C2 contains the delimiting character (either a comma or space).
    D2 contains the marker character (we’ll use the caret ^, but you can choose any character or sequence that won’t otherwise appear in the list).

    For clarity, let’s name each of these cells: A2 as List, B2 as Index, C2 as Delimit, D2 as Mark

    The Solution

    Now the formula is:
    =MID(LEFT(List, FIND(Mark, SUBSTITUTE(List&Delimit, Delimit, Mark, Index))-1), FIND(Mark, SUBSTITUTE(Delimit&List, Delimit, Mark, Index)), 99)

    Notice that there are only three references to the original List, and only two references to the Index.
    There are four references to the Mark and six to the Delimit character, but these could be literals.

    The Overview

    Here’s an overview of what’s going on:

    The second FIND(SUBSTITUTE()) gives us the start of the item we want; the first FIND(SUBSTITUTE()) gives us the end of the item we want; LEFT() returns the string up to the end of the item we want; MID() starts from the start of the item we want and runs to the end of the remaining string (up to 99 characters long — this can easily be increased).

    The Full Analysis

    Let’s break it right down.

    Both SUBSTITUTE() functions replace the nth occurrence of the delimiting character in the text (where n=Index) with the Mark character. SUBSTITUTE() is used because it can replace just the nth occurrence (something that FIND() cannot do directly). That’s fine for finding the delimiter *after* the entry we want (since the nth comma will follow the nth item in the list, for example), but we should be looking for the n-1th delimiter to get the delimiter *before* our entry. In fact, both cases can fail if we don’t add the Delimit character to the start or end of the list first.

    It’s easier to think about this if you add delimiters to *both* the start and end of the list: Delimit & List & Delimit. Now every item in the list (or word in the sentence) is surrounded by the delimiting character. For example: “21,22,23,24,25” becomes “,21,22,23,24,25,”. If we replace the nth comma with a caret ^, that caret will appear just before the item we want in the list: n=4 gives us “,21,22,23^24,25,”. Similarly, we can replace the n+1th comma with a double caret ^^: n=4 gives us “,21,22,23,24^^25,”. This would still work if we wanted the first or last word (which is where it would break if we didn’t modify the list, regardless of whether we searched for the n-1th, nth or n+1th delimiter).

    But note that we’re not trying to mark the start *and* end of our word at the same time. We only need to mark one or the other. That means we can use the same Mark character in both lists (no need for a *double* caret for the end-of-word marker). It also means that if we want the delimiter *before* our word, we add an extra Delimit character to the *start* of the list and replace the nth occurrence. But if we want the delimiter *after* our word, we add an extra Delimit character to the *end* of the list and still replace the nth occurrence. This is kind of elegant, to me. And it doesn’t require any special treatment for the first and last entries in the list.

    Using our examples, we get the following two lists (when n=4): “,21,22,23^24,25” and “21,22,23,24^25,”

    We can now use the FIND() function to get the character index of the Mark character in each list. In our examples: 10 and 12.

    The final bit of cleverness (to my mind) is to use these numbers in the right way. If we go back to the original list, we know our word starts at character 10 and ends at character 11 (=12-1). We can certainly use the first number with the MID() function (and we do). But MID() wants a *length*, not an “end character”. Instead, we can use the LEFT() function to first grab the full list just to the end of our entry, chopping off everything to the right (including the delimiter — that’s what the -1 is for). That gives us: “21,22,23,24” (which is 11 characters long). Now we can finish it off using MID(), from character 10 (in our example) to the end of the string (or at least the first 99 characters, if it’s really long).

    The Out-of-Range Errors

    The finishing touch is to put the whole thing inside an IFERROR() function, in case someone asks for an item that’s not in the list (eg: if the list/sentence has 10 items/words, it doesn’t make sense to ask for the 0th word, the -5th item or the 21st word). If you want to trap the two types of error (Index too small, Index too big) separately, put an IFERROR() around the LEFT() function and another around the whole thing — the first catches “too big”; the second, “too small”.

    The Discussion

    The thing I most like about this approach is that we never calculate the same thing twice. The usual approach would be: MID(List, FIND(start index), FIND(end index) – FIND(start index)), which means calculating the start index twice. I hate that. I’d love to be able to do the whole thing with just one reference to List and one reference to Index, but I don’t think this is possible. (Let me know if it is!)

    The other great thing is the use of SUBSTITUTE() to mark the nth occurrence of the delimiter. Ideally, FIND() would be able to find the nth occurrence directly, but it can’t — and SUBSTITUTE() adds just a single extra step. (Credit goes to Source Daddy for this.)

    The Terms

    I’d love to hear people’s thoughts on my approach. If you like it, feel free to use it. You don’t need to credit me, but please don’t claim it as your own.

    If you have an approach that you think is better, I’d also love to hear about it. Especially if you can get it down to just a single reference to List!

    I hope my analysis is also useful for those learning how to build complex functions in Excel.

    (I apologise if this is a known solution to the problem. That would be a case of parallel evolution. Apart from the inspiration from Source Daddy, I’ve developed this on my own.)

    Viewing 2 reply threads
    Author
    Replies
    • #1515190

      You might also be interested in: http://www.mrexcel.com/forum/excel-questions/867602-find-split-numerical-value-string.html#post4211716. In that post, the numbers in the ‘{99,297,495}’ array represent the 1st, 3rd & 5th elements, respectively (i.e. 1*99, 3*99 & 5*99).

      Cheers,
      Paul Edstein
      [Fmr MS MVP - Word]

    • #1527818

      At Paul’s link, we have:

      =SUM(MID(SUBSTITUTE(SUBSTITUTE(A1,”(“,REPT(” “,99)),”)”,REPT(” “,99)),{99,297,495},99)+0)

      As I understand this, the delimiters are replaced with 99 spaces, the MID() command is used three times simultaneously, and SUM() is applied to the three results. Dropping the latter two functions, I think the following will extract just the nth word:

      =TRIM(MID(SUBSTITUTE(List, Delimit, REPT(” “,99)), 99*(Index-1)+1, 99))

      This avoids the use of a specific marker character, but assumes each list element has no leading or trailing spaces. Otherwise, try:

      =SUBSTITUTE(MID(SUBSTITUTE(List, Delimit, REPT(Mark,99)), 99*(Index-1)+1, 99), Mark, “”)

      I’ve checked both of these against my original formula, and they seem to work nicely.

      I believe the other requirement for any of these to work is that the total length of all items in the list is less than 99. (More simply, checking that LEN(List)<=99 should provide a buffer.)

      Conclusion

      Snakehips' approach (that Paul Edstein has linked to) appears to be beautifully adaptable to the problem at hand. The latter two formulae are shorter that my original and contain fewer references to the variables (just one each, in the first case). What's more, avoiding the FIND() function means an enclosing IFERROR() function is not longer needed.

      The only two issues I can see are: 1) If the total length of all elements in the list are longer than 99, the wrong results will be returned for later elements in the list (effectively a buffer overflow); and 2) that lists with many elements will temporarily result in very long strings, which might overflow the data type, or may slow down the calculation (probably only noticeable if the formula is used in thousands of cells). Both issues can be addressed by changing the buffer length from 99 (increasing or decreasing, respectively), as well as checking the length of the list.

      Excellent result! Thank you, Paul (and Snakehips).

    • #1527820

      Interesting. Thanks for sharing!

      Muad

      • #1527826

        Maud

        ..it’s Maud not Muad

        (..and how come I could reply to a SOLVED thread??????)

        • #1528196

          ..it’s Maud not Muad

          Perhaps it was one of those air-kiss noises?

    Viewing 2 reply threads
    Reply To: A simpl(ish) way of getting the nth word

    You can use BBCodes to format your content.
    Your account can't use all available BBCodes, they will be stripped before saving.

    Your information: