Issue: First part of Unicode is removed by importer

April 29, 2017 1:52 PM

Alfa1

Distinguished Member

It seems that in many (but not all) cases, when the importer found a character with a Unicode value above a certain range, it just cut off the upper part of the value. So this:

U+067E : ARABIC LETTER PEH

got turned into this:

U+007E : TILDE

...it's like the upper part of the number (06) got cut off, and turned into 00. You can see it happen again and again:

U+0646 : ARABIC LETTER NOON
U+0046 : LATIN CAPITAL LETTER F
^ missing 06

U+062F : ARABIC LETTER DAL
U+002F : SOLIDUS {slash, virgule}
^ missing 06

U+015A : LATIN CAPITAL LETTER S WITH ACUTE
U+005A : LATIN CAPITAL LETTER Z
^ missing 01

...sort of like it translated all the numbers in a chart into two-digit numbers, even where the original had four-digit numbers, so in those cases it just chopped off the first two digits. But that metaphor doesn't neatly explain all the cases. My guess is that the importer assumed it only had to deal with a limited Unicode character set, so when it hit a character outside that set, it gave the closest result it could: either a chopped-off value or just garbage.
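The pattern in the examples above can be reproduced by keeping only the low byte of each code point. The following is just a sketch of the symptom, not the importer's actual code; the helper name `keep_low_byte` is made up for illustration, and as noted, this does not explain every damaged character.

```python
import unicodedata


def keep_low_byte(ch: str) -> str:
    """Hypothetical helper: drop everything above the lowest 8 bits of the code point."""
    return chr(ord(ch) & 0xFF)


# The four characters reported in this issue.
reported = ["\u067E", "\u0646", "\u062F", "\u015A"]

for original in reported:
    damaged = keep_low_byte(original)
    print(
        f"U+{ord(original):04X} : {unicodedata.name(original)}"
        f"  ->  U+{ord(damaged):04X} : {unicodedata.name(damaged)}"
    )
```

Each output line pairs an original character with the same replacement listed in the report, which supports the "upper part got cut off" reading for these four cases.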

Issue Details

Issue Number 5034

Issue Type Bug

Project VaultWiki 4.x Series

Category Importing

Status Fixed

Priority 1 - Security / Login / Data Loss

Affected Version 4.0.17

Fixed Version 4.0.18

Milestone (none)

Software Dependency XenForo 1.x

License Type Paid

Users able to reproduce bug 0

Users unable to reproduce bug 0

Attachments 0

Assigned Users (none)

Tags (none)

May 7, 2017 11:48 AM

pegasus

VaultWiki Team

Fixed in the next release. The conversion of some multiple-byte HTML entities back into their UTF-8 codepoints was adding/subtracting extra bits or using invalid ranges for those codepoints. The function that did this was based on a vBulletin function that does the same thing; the only explanation is that this bug existed in vBulletin already. Switching to the XenForo version of the same function causes these entities to convert correctly.
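For readers unfamiliar with this kind of conversion, here is a rough sketch of turning numeric HTML entities into UTF-8 with the bit ranges spelled out. It is not the vBulletin or XenForo function and not VaultWiki's code; the names `codepoint_to_utf8` and `decode_numeric_entities` are invented for this illustration, and only numeric entities (not named ones) are handled.

```python
import re


def codepoint_to_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 bytes (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp:#x}")
    if cp < 0x80:  # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:  # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


ENTITY = re.compile(r"&#([xX][0-9a-fA-F]+|[0-9]+);")


def decode_numeric_entities(text: str) -> str:
    """Replace &#NNNN; and &#xHHHH; with the characters they name."""
    def replace(match: re.Match) -> str:
        body = match.group(1)
        cp = int(body[1:], 16) if body[0] in "xX" else int(body)
        try:
            return codepoint_to_utf8(cp).decode("utf-8")
        except ValueError:
            return match.group(0)  # leave invalid entities untouched
    return ENTITY.sub(replace, text)
```

A mistake in any of the shifts, masks, or range boundaries above would produce the kind of wrong or truncated characters described in this issue, which is why reusing a known-good implementation is the safer route.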

June 10, 2017 7:23 PM

Alfa1

Distinguished Member

I will try it with the new version.
Should the importer charset be defined as latin1?

June 11, 2017 10:43 AM

pegasus

VaultWiki Team

It would not hurt.

All times are GMT -4.