That would have been ideal. But evidently the problematic wiki data was not part of the source wiki in your last test of the import. You can either test again with this new source data, or try it live.
Personally, the only problem I noticed with the edits above involved trailing zeros, and I gave the fix for that. I have tested with a few articles that contain non-ASCII characters, and I did not see any problems; the characters that were considered valid were left untouched. But I cannot verify that it works for all 1,112,064 characters that UTF-8 can encode. That is almost beyond anyone's testing capacity, even yours if you had another test import to try. I do know that it works for the invalid string cited above, "\xB1-meth...", correctly changing it to a valid string, "?-meth...", and that the same regex used by this cleaner is used by other applications without reported issue.
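The exact regex is not quoted in this thread, but cleaners of this kind are commonly built on the well-known valid-UTF-8 byte-sequence pattern (the one published by the W3C, which rejects overlong encodings, surrogates, and code points above U+10FFFF). A minimal Python sketch, assuming the cleaner keeps each valid sequence and replaces each stray invalid byte with "?" (the pattern and function names here are illustrative, not the actual code):

```python
import re

# One valid UTF-8 sequence, or (as a fallback) any single invalid byte.
UTF8_SEQ = re.compile(
    rb"""(
        [\x00-\x7F]                          # ASCII
      | [\xC2-\xDF][\x80-\xBF]               # 2-byte, no overlongs
      | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, no overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, no surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, no overlongs
      | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, <= U+10FFFF
    )|(.)""",
    re.DOTALL | re.VERBOSE,
)

def clean_utf8(data: bytes) -> bytes:
    # Valid sequences pass through unchanged; each invalid byte becomes b"?".
    return UTF8_SEQ.sub(lambda m: m.group(1) if m.group(1) else b"?", data)
```

With this sketch, `clean_utf8(b"\xb1-methyl")` yields `b"?-methyl"`, while a valid multi-byte character such as `"é".encode("utf-8")` (`b"\xc3\xa9"`) passes through untouched, matching the behavior described above.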