• MSHTML, VBA and HttpResponse code (IE 6)

    #399370

    I’m scraping a website using VBA (from Access) and MSHTML (via IE 6), similar to what is covered in the thread starting with post 292415.
    My question is: how do I tell what response code I received? How do I tell whether I got a 200 (OK) or 404 (Not found)? The actual page returned for those codes varies depending on the web server, but I should be able to get the HttpResponse code. My basic code is:

    Dim objMSHTML As New MSHTML.HTMLDocument
    Dim objDocument As MSHTML.HTMLDocument
    
    'This function is only available with Internet Explorer 5 and later
    Set objDocument = objMSHTML.createDocumentFromUrl(sURL, vbNullString)
      
    'Tricky, to make the function wait for the document to complete, usually the
    'transfer is asynchronous. Note that this string might be different if you have
    'another language than English for Internet Explorer on the machine where the code is
    'executed.
    While objDocument.readyState <> "complete"
    	DoEvents
    Wend
      
    'OK, now we've got the page
      
    If objDocument.Title = "404 Not Found" Then
    	'This is not a robust solution
    	'Need to get "objDocument.HttpResponseCode" or similar
    	'...
    
    • #770954

      Interesting and tough problem. It appears that you need to do quite a bit of low-level API work to get this information, using several functions before you can request the status line with HttpQueryInfo. Some resources for you:

      Win32 Internet HTTP Functions in Visual Basic MSDN, Sept. 1996
      FIX: Internet Transfer Control 5.0 Has Bug with “HEAD” Request MSKB #171271
      HTTP Status Codes (Platform SDK: Windows Internet) MSDN

      I look forward to seeing the solution.
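The sequence of WinINet calls mentioned above can be sketched roughly as follows. This is a minimal sketch, not the code from the linked MSDN article: it assumes 32-bit VBA (as with Access and IE 6), and the function name `GetHttpStatus` and the agent string are made up for illustration. It opens the URL with InternetOpenUrl and asks HttpQueryInfo for the numeric status code.

```vba
Option Explicit

' WinINet declarations for 32-bit VBA (64-bit Office would need PtrSafe/LongPtr).
Private Declare Function InternetOpen Lib "wininet.dll" Alias "InternetOpenA" ( _
    ByVal sAgent As String, ByVal lAccessType As Long, _
    ByVal sProxyName As String, ByVal sProxyBypass As String, _
    ByVal lFlags As Long) As Long
Private Declare Function InternetOpenUrl Lib "wininet.dll" Alias "InternetOpenUrlA" ( _
    ByVal hInternet As Long, ByVal sUrl As String, _
    ByVal sHeaders As String, ByVal lHeadersLength As Long, _
    ByVal lFlags As Long, ByVal lContext As Long) As Long
Private Declare Function HttpQueryInfo Lib "wininet.dll" Alias "HttpQueryInfoA" ( _
    ByVal hRequest As Long, ByVal lInfoLevel As Long, _
    ByRef lpBuffer As Any, ByRef lBufferLength As Long, _
    ByRef lIndex As Long) As Long
Private Declare Function InternetCloseHandle Lib "wininet.dll" ( _
    ByVal hInternet As Long) As Long

Private Const INTERNET_OPEN_TYPE_PRECONFIG As Long = 0
Private Const INTERNET_FLAG_RELOAD As Long = &H80000000
Private Const HTTP_QUERY_STATUS_CODE As Long = 19
Private Const HTTP_QUERY_FLAG_NUMBER As Long = &H20000000

' Hypothetical helper: returns the HTTP status code (200, 404, ...)
' for sURL, or -1 if the request could not be made at all.
Public Function GetHttpStatus(ByVal sURL As String) As Long
    Dim hSession As Long
    Dim hRequest As Long
    Dim lStatus As Long
    Dim lSize As Long
    Dim lIndex As Long

    GetHttpStatus = -1

    hSession = InternetOpen("VBA status check", INTERNET_OPEN_TYPE_PRECONFIG, _
                            vbNullString, vbNullString, 0)
    If hSession = 0 Then Exit Function

    ' INTERNET_FLAG_RELOAD asks for the page from the server, not the cache
    hRequest = InternetOpenUrl(hSession, sURL, vbNullString, 0, _
                               INTERNET_FLAG_RELOAD, 0)
    If hRequest <> 0 Then
        lSize = 4   ' the status code is written into a 4-byte Long
        If HttpQueryInfo(hRequest, _
                         HTTP_QUERY_STATUS_CODE Or HTTP_QUERY_FLAG_NUMBER, _
                         lStatus, lSize, lIndex) <> 0 Then
            GetHttpStatus = lStatus
        End If
        InternetCloseHandle hRequest
    End If

    InternetCloseHandle hSession
End Function
```

A caller could then test `If GetHttpStatus(sURL) = 200 Then` before calling createDocumentFromUrl. Note that this issues its own request, so each page is requested twice.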

      • #771032

        Oh, that I were using Java or Perl!
        I’ll dive into the problem with these leads – thank you very much.
        I’ll post the solution when I get it, but it might not be today!
        Peter

        • #771985

          The code in the first of your links works, although the MS site is missing the sample application.
          The second link applies to the Internet Transfer Control, which the code doesn’t actually need.
          Unfortunately, the InternetReadFile function retrieves the page as text, not as an MSHTML object. So I’d need to either rewrite my system to parse the document itself (too hard), create an MSHTML object from the text (too likely to bomb with bad HTML, and I don’t know whether this is possible), or make multiple requests (too much traffic).
          So my problem is now refined to: how do I get the HTTP status code for an MSHTML request?
          I’ve redefined part of my code so that the effect of a 404 is minimised, so I don’t need the answer to this problem now.
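On the open question of whether an MSHTML object can be created from text that has already been downloaded: one commonly used approach, offered here as an untested assumption rather than something confirmed in this thread, is to create an empty HTMLDocument and assign the markup to its body. The function name `DocumentFromHtml` is hypothetical, and the sketch assumes the same reference to the Microsoft HTML Object Library that the original code uses.

```vba
Option Explicit

' Hypothetical helper: builds an MSHTML document from HTML text that was
' fetched some other way (for example with InternetReadFile).
' Assumes a project reference to the Microsoft HTML Object Library.
Public Function DocumentFromHtml(ByVal sHtml As String) As MSHTML.HTMLDocument
    Dim objDoc As MSHTML.HTMLDocument

    Set objDoc = New MSHTML.HTMLDocument

    ' MSHTML parses the markup into DOM nodes under <body>;
    ' no network request is made by this assignment itself.
    objDoc.body.innerHTML = sHtml

    Set DocumentFromHtml = objDoc
End Function
```

Because the text is assigned to the body, content from the page's head (such as the title) is probably not preserved, so a check like `objDocument.Title` would not be reliable on the result. How well malformed HTML survives this route would need testing.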

          [philosophy]
          I’m still interested from a curiosity point of view. Why would MS hide (so effectively) this standard piece of information? On the web, headers are almost as important as content (from a program’s point of view).
          [/philosophy]

          Thanks for your help.

          • #772013

            I realize it’s terribly cumbersome, but I thought you could use the API calls for the sole purpose of obtaining the header information, but continue to use the rest of your code “as is.” As for why it isn’t part of the MSHTML document object model, good question!!
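If the API calls are used only for the header information, one way to limit the extra traffic might be a HEAD request, which asks the server for the status line and headers without the page body. The following is a rough sketch under several assumptions: 32-bit VBA, plain HTTP on port 80, a server that answers HEAD the same way it answers GET, and a caller that already has the host and path as separate strings. The name `GetHeadStatus` is made up for illustration.

```vba
Option Explicit

' WinINet declarations for 32-bit VBA (64-bit Office would need PtrSafe/LongPtr).
Private Declare Function InternetOpen Lib "wininet.dll" Alias "InternetOpenA" ( _
    ByVal sAgent As String, ByVal lAccessType As Long, _
    ByVal sProxyName As String, ByVal sProxyBypass As String, _
    ByVal lFlags As Long) As Long
Private Declare Function InternetConnect Lib "wininet.dll" Alias "InternetConnectA" ( _
    ByVal hInternet As Long, ByVal sServerName As String, _
    ByVal nServerPort As Integer, ByVal sUserName As String, _
    ByVal sPassword As String, ByVal lService As Long, _
    ByVal lFlags As Long, ByVal lContext As Long) As Long
Private Declare Function HttpOpenRequest Lib "wininet.dll" Alias "HttpOpenRequestA" ( _
    ByVal hConnect As Long, ByVal sVerb As String, _
    ByVal sObjectName As String, ByVal sVersion As String, _
    ByVal sReferrer As String, ByVal lAcceptTypes As Long, _
    ByVal lFlags As Long, ByVal lContext As Long) As Long
Private Declare Function HttpSendRequest Lib "wininet.dll" Alias "HttpSendRequestA" ( _
    ByVal hRequest As Long, ByVal sHeaders As String, _
    ByVal lHeadersLength As Long, ByVal lOptional As Long, _
    ByVal lOptionalLength As Long) As Long
Private Declare Function HttpQueryInfo Lib "wininet.dll" Alias "HttpQueryInfoA" ( _
    ByVal hRequest As Long, ByVal lInfoLevel As Long, _
    ByRef lpBuffer As Any, ByRef lBufferLength As Long, _
    ByRef lIndex As Long) As Long
Private Declare Function InternetCloseHandle Lib "wininet.dll" ( _
    ByVal hInternet As Long) As Long

Private Const INTERNET_OPEN_TYPE_PRECONFIG As Long = 0
Private Const INTERNET_DEFAULT_HTTP_PORT As Integer = 80
Private Const INTERNET_SERVICE_HTTP As Long = 3
Private Const INTERNET_FLAG_RELOAD As Long = &H80000000
Private Const HTTP_QUERY_STATUS_CODE As Long = 19
Private Const HTTP_QUERY_FLAG_NUMBER As Long = &H20000000

' Hypothetical helper: sends HEAD for http://sHost/sPath and returns the
' status code, or -1 if the request could not be completed.
Public Function GetHeadStatus(ByVal sHost As String, ByVal sPath As String) As Long
    Dim hSession As Long, hConnect As Long, hRequest As Long
    Dim lStatus As Long, lSize As Long, lIndex As Long

    GetHeadStatus = -1

    hSession = InternetOpen("VBA status check", INTERNET_OPEN_TYPE_PRECONFIG, _
                            vbNullString, vbNullString, 0)
    If hSession = 0 Then Exit Function

    hConnect = InternetConnect(hSession, sHost, INTERNET_DEFAULT_HTTP_PORT, _
                               vbNullString, vbNullString, _
                               INTERNET_SERVICE_HTTP, 0, 0)
    If hConnect <> 0 Then
        ' vbNullString for the version lets WinINet pick its default
        hRequest = HttpOpenRequest(hConnect, "HEAD", sPath, vbNullString, _
                                   vbNullString, 0, INTERNET_FLAG_RELOAD, 0)
        If hRequest <> 0 Then
            If HttpSendRequest(hRequest, vbNullString, 0, 0, 0) <> 0 Then
                lSize = 4   ' the status code is written into a 4-byte Long
                If HttpQueryInfo(hRequest, _
                                 HTTP_QUERY_STATUS_CODE Or HTTP_QUERY_FLAG_NUMBER, _
                                 lStatus, lSize, lIndex) <> 0 Then
                    GetHeadStatus = lStatus
                End If
            End If
            InternetCloseHandle hRequest
        End If
        InternetCloseHandle hConnect
    End If

    InternetCloseHandle hSession
End Function
```

The existing createDocumentFromUrl code could then stay as is and run only when the returned status is 200, for example `If GetHeadStatus("www.example.com", "/page.htm") = 200 Then` (a placeholder host and path).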
