
    Issue: 100K Auto-Link Problem

    1. May 12, 2012 8:19 AM
      pegasus
      VaultWiki Team
      100K Auto-Link Problem

      Imagine a wiki using VaultWiki that has 100,000 or more articles (and yes, they already exist). Now imagine that such a wiki has auto-links turned on. Finding matches against a 100,000 article database is incredibly slow.

      Surely there are some improvements that can be made.
      - We already ignore articles whose titles are shorter than a specified length (default: 4 characters).

      We can also ignore articles that are above a specified length if we already know the article doesn't contain any words that are that long. If the longest word in this post is "VaultWiki", don't go looking for "scraunched" or "antimicrobial". But this method breaks down quickly when wiki articles have multiple words in the title. Two metrics would be needed for titles: total_length and max_word_length.
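This length-based pruning can be sketched as follows (illustrative Python, not VaultWiki's actual implementation; the function name and signature are hypothetical). It applies the two metrics named above, total length and max word length, to skip titles that cannot possibly match:

```python
def prune_titles(titles, body, min_length=4):
    """Return only the titles that could possibly auto-link in body."""
    # Longest word present in the body text.
    body_max_word = max((len(w) for w in body.split()), default=0)
    viable = []
    for title in titles:
        # Longest word in this title (handles multi-word titles).
        max_word_length = max(len(w) for w in title.split())
        if len(title) < min_length:
            continue  # shorter than the minimum auto-link length
        if max_word_length > body_max_word:
            continue  # no word in the body is long enough to contain this title's longest word
        viable.append(title)
    return viable
```

With the example from the text, a body whose longest word is "VaultWiki" (9 characters) prunes "scraunched" (10 characters) without any matching work.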

      We can use caches of auto-link-viable titles in smaller groups (circa 10K), sorted by title length, and iterate those caches as necessary until we run out of body text. However, this is very similar to the method we currently employ, differing only in using a bit less memory and smaller regexes.

      The other way is a method we have avoided because it potentially leads to the same problem faster (but in reverse) - to look up article titles based on a list of possible titles found in the text. This requires compounding every word into multiple phrases with its adjacent words, up to 125 characters in length (the maximum length before matching becomes ambiguous). At an average word length of 5, each word in the article would generate 25 lookups, so a 10,000-character article (2,000 words) would generate 50,000 title lookups. Since we figured articles were more frequently going to reach 10,000 characters than wikis would reach 50,000 articles, we chose the current matching system.
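The phrase-compounding step can be sketched like this (illustrative Python, not the actual code; the 125-character cap is the ambiguity limit named above). Each word starts a phrase that is extended with adjacent words until the cap is reached:

```python
def candidate_phrases(text, max_length=125):
    """Generate every phrase that could be a page title lookup."""
    words = text.split()
    phrases = set()
    for i in range(len(words)):
        phrase = words[i]
        j = i + 1
        # Extend the phrase word by word until it exceeds the cap.
        while len(phrase) <= max_length:
            phrases.add(phrase)
            if j >= len(words):
                break
            phrase = phrase + " " + words[j]
            j += 1
    return phrases
```

For a three-word text this yields six candidates (each word plus each run of adjacent words), which is where the multiplicative lookup cost described above comes from.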

      Perhaps a good solution is a combination: when character length exceeds article count, do title matching, else do reverse lookup.
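A minimal sketch of that combined heuristic (hypothetical code; the threshold is exactly the character-count-vs-article-count comparison proposed above, not a tuned value):

```python
def choose_strategy(num_titles, content_length):
    """Pick the cheaper auto-link strategy for this content."""
    # When the content is short relative to the wiki's size, generating
    # candidate phrases from the text (reverse lookup) touches fewer items
    # than scanning every title; otherwise match titles directly.
    if content_length < num_titles:
        return "reverse_lookup"
    return "direct_match"
```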

      There must be other ways or alternative sort methods that could yield a no-autolink-possible conclusion before running out of text or article titles. I'm open to suggestions.
    Issue Details
    Issue Number 2696
    Issue Type Bug
    Project VaultWiki 4.x Series
    Category BB-Code Parsing
    Status Fixed
    Priority 2 - Fatal / Database Errors
    Affected Version 4.0.0 Alpha 1
    Fixed Version 4.0.0 Gamma 3
    Milestone VaultWiki 4 Gamma X
    Software Dependency Any
    License Type Paid
    Users able to reproduce bug 0
    Users unable to reproduce bug 0
    Attachments 0
    Assigned Users (none)
    Tags (none)




    1. February 13, 2014 4:19 PM
      pegasus
      VaultWiki Team
      By implementing the max_word_length vs total_length optimizations, the time taken to process large auto-link sets is reduced by roughly 10%. By storing these values in the page list itself, rather than compiling them at runtime, the time taken for auto-links is reduced dramatically (about 70% overall).

      In a test with a list of 14,000 possible titles, the original process took between 0.25 and 0.40 seconds on an average-length test page (varying with cache hits/misses and server load). Scaling linearly, this suggests that, even after the recent memory optimization to the page list, a 100K list would take 1.8-2.8 seconds. After a number of further optimizations (removing unnecessary function calls, reordering conditionals, and calculating max_word_length of the page, max_word_length of each title, and max_length of each title, then storing those values in the page/page list), the same test took between 0.06 and 0.08 seconds. I have not tested on a full 100,000-title list, but this result suggests 100K would now take only 0.5-0.65 seconds. Since these tests were performed on a single page, it is unclear what the results would be for a forum thread with 20 posts per page (the time would likely be nearly 20x, though it depends heavily on the number of characters in the content processed).

      Of course, auto-link output is saved in the post cache, so this only matters for an unprimed cache.

      I will begin looking at the reverse lookup variant now to see if that actually leads to improvement, and if so, that the improvement exceeds the max_word optimization for at least a subset of list sizes vs content length. max_word only applies when doing direct matching of every item in the page list.
    2. February 13, 2014 6:13 PM
      pegasus
      VaultWiki Team
      The reverse lookup variant is obscenely fast. In test cases, it took a ratio of 1:10 (number of pages : number of characters) before forming page lookups from sequential characters became slower than simply matching all pages against the characters present.

      What this means for most cases, especially the case of regular forum posts, is ridiculously fast auto-link discovery and resolution. Only once the number of characters in a post exceeds the number of pages in the wiki by 10x will the method of discovery need to change (and it detects this automatically).

      On the same example page used above, which took 0.06-0.08 seconds with the max_word_length optimizations, the reverse lookup optimization reduced auto-link time to 0.01-0.02 seconds.

      Because the reduction is so dramatic, most wikis should notice an improvement in load times - whether you have 10 pages, 10,000 pages, or 100,000 pages.
    3. February 13, 2014 8:10 PM
      pegasus
      VaultWiki Team
      Reopening.

      By combining the philosophies of the reverse lookup optimization and the max_word_length optimization, we can create a reverse max_word_length optimization. In theory this can reduce both memory consumption and process time during auto-link discovery by up to 90% or more, compounded over the previous optimizations (depending on wiki size and titling conventions, it could also be a 0% optimization on some wikis). This in turn delays the inefficiency of the reverse lookup optimization across long content, allowing a sliding ratio of up to 1:90 before direct matching becomes more efficient.
    4. February 13, 2014 9:08 PM
      pegasus
      VaultWiki Team
      Marking as Fixed. All optimizations are now implemented:
      1. Get list_max_length and list_max_word_length for the page list.
      2. If list_max_length exceeds 10 x (content_max_length / list_max_word_length), reverse lookup is probably the fastest method.
      3. Compile search list. Use each word less than list_max_word_length, and append all subsequent words until one is greater than list_max_word_length or until the whole search term exceeds list_max_length.
      4. Replace the page list with the search list for the current process.

      OR
      1. Get list_max_length and list_max_word_length for the page list.
      2. If content_max_word_length is not cached and the page list exceeds 1000 items, split the content and find the content_max_word_length.
      3. Use the smaller of content_max_word_length or list_max_word_length (sanity).


      Then
      1. Try each item in the page list whose max_length exceeds min_word_length and doesn't exceed content_max_word_length. If the item is multiple words, try only if the item's max_word_length exceeds min_word_length.
      2. If reverse lookup was not used, test that the item exists in content.
      3. Compile content replacement rule (wrap auto-link tags) based on the result set.
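The steps above can be condensed into a single sketch (illustrative Python; all helper logic is inlined and the names list_max_length, content_max_word_length, etc. follow the steps, but the caching details of step 2 in the second branch are simplified away):

```python
def autolink_matches(page_titles, content, min_word_length=4, reverse_ratio=10):
    """Return the page titles that appear in content, choosing the cheaper strategy."""
    # Step 1: gather page-list statistics.
    list_max_length = max(len(t) for t in page_titles)
    list_max_word_length = max(len(w) for t in page_titles for w in t.split())
    words = content.split()

    # Step 2: the threshold test from the first branch.
    use_reverse = list_max_length > reverse_ratio * (len(content) / list_max_word_length)

    if use_reverse:
        # Steps 3-4: compile a search list of phrases from the content and
        # use it in place of the page list for this process.
        phrases = set()
        for i in range(len(words)):
            if len(words[i]) > list_max_word_length:
                continue  # word too long to start any title
            phrase = words[i]
            j = i + 1
            while len(phrase) <= list_max_length:
                phrases.add(phrase)
                if j >= len(words) or len(words[j]) > list_max_word_length:
                    break
                phrase = phrase + " " + words[j]
                j += 1
        return [t for t in page_titles if t in phrases]

    # Second branch: direct matching bounded by the content's longest word
    # (here computed unconditionally; the real steps cache it and only split
    # the content when the page list exceeds 1000 items).
    content_max_word_length = max(len(w) for w in words)
    limit = min(content_max_word_length, list_max_word_length)  # sanity bound
    matches = []
    for title in page_titles:
        title_max_word = max(len(w) for w in title.split())
        if len(title) < min_word_length or title_max_word > limit:
            continue  # cannot possibly match this content
        if title in content:  # reverse lookup was not used, so test presence
            matches.append(title)
    return matches
```

The first call below exercises the direct-matching branch; the second, with long titles against short content, crosses the threshold and exercises the reverse lookup branch.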