Make encoding uniform before comparing strings in PHP

I'm working on a feature which requires me to get the contents of a webpage, then check to see if certain text is present in that page. It's a backlink checking tool.

The problem is this: the function runs perfectly most of the time, but occasionally it flags a page for not having a link when the link is clearly there. I've tracked it down to the point of visually comparing the strings in the output, and they match just fine, but when I compare them with the == operator, PHP tells me they don't match.

Recognizing that this is probably some sort of encoding issue, I decided to see what would happen if I used base64_encode() on them, so I could see if doing so produced different results between the two strings (which appear to be exactly the same).

My suspicions were confirmed - using base64_encode on the strings to be compared yielded a different string from each. Problem found! The problem is, I don't have any idea how to solve it.

Is there some way I can make these strings uniform based on the output text (which matches), so that when I compare them in PHP, they match?

13.10.2009 20:22:55
can you reliably check the character encoding for the websites you're comparing?
dnagirl 13.10.2009 20:28:19
@Peter: thanks, +1 for manning up and admitting it. :)
Kip 14.10.2009 01:24:44
6 ANSWERS
SOLUTION

I'm not entirely sold on your belief that it is the encoding. PHP stores all of its strings internally in the same format (a plain sequence of bytes). Could you try this code? It compares the byte value of each character in both strings and reports the first position where they differ, which might reveal something you're not seeing by visually comparing the strings.

$str1 = ...;
$str2 = ...;

if(strlen($str1) != strlen($str2)) {
  // Different byte counts: the strings cannot be equal.
  echo "Lengths are different!";
} else {
  // Same length: walk both strings and stop at the first differing byte.
  for($i=0; $i < strlen($str1); $i++) {
    if(ord($str1[$i]) != ord($str2[$i])) {
      echo "Character $i is different! str1: " . ord($str1[$i]) . ", str2: " . ord($str2[$i]);
      break;
    }
  }
}
2
13.10.2009 21:01:19
Well, this is me, hanging my head in shame. One of the strings I was comparing had two spaces between two words. Of course, when HTML renders, it won't show two spaces in a row, so until I looked at the source, the strings appeared to match perfectly (and did match perfectly, according to the Firefox search tool). Thanks for the good answers everyone, sorry the real one had to be so simple...
Peter 13.10.2009 21:55:15
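
For anyone who lands here with the same double-space problem: one way to guard against it is to collapse whitespace runs on both strings before comparing, roughly the way a browser does when it renders text. This is only a sketch; normalize_whitespace() is a made-up helper name, and it only handles ASCII whitespace (spaces, tabs, newlines), not things like non-breaking spaces.

```php
<?php
// Hypothetical helper: collapse runs of ASCII whitespace to a single space
// and strip whitespace from both ends.
function normalize_whitespace(string $text): string
{
    // \s+ matches one or more spaces, tabs, or line breaks.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$fromPage = "Visit  my\n site"; // two spaces and a newline, as found in the page source
$expected = "Visit my site";

var_dump($fromPage == $expected); // bool(false)
var_dump(normalize_whitespace($fromPage) == normalize_whitespace($expected)); // bool(true)
```

Normalizing both sides rather than just the fetched page means the stored link text can be sloppy too without causing a false negative.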

Without application code it's difficult to say what's happening.

Try using trim() on the strings to remove leading and trailing whitespace, which is invisible to the naked eye.

You may find strcmp gives better results as well.
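
A small sketch of what that looks like, with made-up sample strings. One reason strcmp() (or ===) can behave differently from == is that == compares two numeric-looking strings as numbers rather than byte by byte; the second half of the example shows that case.

```php
<?php
$a = "  backlink\n"; // sample value with stray whitespace around it
$b = "backlink";

// strcmp() returns 0 when the two strings are byte-for-byte identical.
var_dump(strcmp(trim($a), trim($b)) === 0); // bool(true)

// == treats numeric strings as numbers; strcmp() and === do not.
var_dump("1e3" == "1000");             // bool(true)
var_dump("1e3" === "1000");            // bool(false)
var_dump(strcmp("1e3", "1000") === 0); // bool(false)
```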

1
13.11.2011 14:17:57
I'm using trim and strtolower to ensure the strings match. strcmp returns -1. I'd post the source code, but I'm not sure it would help: the comparison bit is very normal, and to show the rest of the code (where the page is fetched and parsed), I'd be pasting a thousand lines.
Peter 13.10.2009 20:36:23
You'd have to check the strings byte by byte to see why they are different. Something like iconv might be the best way to get a uniform encoding.
David Snabel-Caunt 13.10.2009 20:40:17
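
To make that concrete, here is a rough sketch of both ideas with made-up sample data: bin2hex() dumps the raw bytes so an invisible difference shows up, and iconv() converts a string whose encoding you already know (ISO-8859-1 is just an assumption for the example) to UTF-8.

```php
<?php
// Two strings that look alike on screen; the second contains a UTF-8
// non-breaking space (bytes C2 A0) instead of a regular space (byte 20).
$a = "my site";
$b = "my\xC2\xA0site";

echo bin2hex($a), "\n"; // 6d792073697465
echo bin2hex($b), "\n"; // 6d79c2a073697465

// Converting from a known source encoding to UTF-8 with iconv().
$latin1 = "caf\xE9"; // "café" encoded as ISO-8859-1
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
var_dump($utf8 === "caf\xC3\xA9"); // bool(true)
```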

What about running both through a sanitizing filter (if you have PHP >= 5.2.0)? I don't know that it will do anything, but it may.

http://www.phpro.org/tutorials/Filtering-Data-with-PHP.html#12

0
13.10.2009 20:27:49

Try mb_strstr() and trim(), as pointed out by dcaunt.

0
13.10.2009 20:38:08

You could try using the DOM extension to PHP. When creating a new DOMDocument you can specify the encoding of the underlying document / webpage. According to this website, internally everything is done in UTF-8. You could then find the DOM nodes you are interested in and compare the text content of each node.
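
A minimal sketch of the DOM approach for a backlink check, assuming the goal is to find an <a> element whose href equals a given URL. has_backlink() and the sample markup are made up for illustration, and the example does not deal with how the page's encoding is declared or detected.

```php
<?php
// Hypothetical helper: does $html contain an <a> whose href equals $targetUrl?
function has_backlink(string $html, string $targetUrl): bool
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if (trim($anchor->getAttribute('href')) === $targetUrl) {
            return true;
        }
    }
    return false;
}

$html = '<html><body><p>See <a href="http://example.com/">my  site</a></p></body></html>';
var_dump(has_backlink($html, 'http://example.com/')); // bool(true)
```

Comparing the href attribute rather than the rendered link text sidesteps whitespace differences in the text altogether.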

If you were not working with webpages that declare a character encoding, I would suggest using the multibyte functions, in particular mb_detect_encoding and mb_convert_encoding.

0
13.10.2009 20:53:45

If you can't reliably get the encoding, you can use mb_convert_encoding.

$string1 = mb_convert_encoding($string1, 'utf-8', 'auto');
$string2 = mb_convert_encoding($string2, 'utf-8', 'auto');

If you can determine the encoding (from the HTTP headers or meta tags), you should specify the encoding instead of using "auto".

$string1 = mb_convert_encoding($string1, 'utf-8', $encoding1);
$string2 = mb_convert_encoding($string2, 'utf-8', $encoding2);
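
As a rough sketch of the "from the HTTP headers" case: charset_from_content_type() below is a made-up helper that pulls the charset out of a Content-Type header value, and the header and body are sample data. It assumes the header names an encoding that mbstring recognises; an unknown name makes mb_convert_encoding() fail.

```php
<?php
// Hypothetical helper: extract the charset from a Content-Type header value,
// or return null when none is declared.
function charset_from_content_type(string $header): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:\-]+)"?/i', $header, $m)) {
        return $m[1];
    }
    return null;
}

$header = 'text/html; charset=ISO-8859-1'; // sample header value
$body   = "caf\xE9";                        // "café" encoded as ISO-8859-1

$encoding = charset_from_content_type($header);

// Use the declared encoding when there is one; otherwise fall back to 'auto'.
$utf8 = $encoding !== null
    ? mb_convert_encoding($body, 'UTF-8', $encoding)
    : mb_convert_encoding($body, 'UTF-8', 'auto');

var_dump($utf8 === "caf\xC3\xA9"); // bool(true)
```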
0
13.10.2009 21:10:03