Miscellaneous Q's about forum stats

Discussion in 'Feedback and Suggestions' started by WBahn, Mar 12, 2013.

  1. WBahn

    Thread Starter Moderator

    Mar 31, 2012
    17,743
    4,795
    So, for no clear and good reason other than I'm approaching my one year anniversary here and got curious, I am trying to get a better handle on what is included in some of the various numbers reported on the forum. There are numerous discrepancies and I know where some of them arise from, but I'm curious if one of the mods can explain things a bit better.

    We are approaching 200,000 registered members. As of right now, there are 198,392 registered users. However, the most recent user's ID is 198536. The User IDs appear to start at 1, and so that means that, for some combination of reasons, there are 144 assigned User IDs that are not counted in the number of registered users. My guesses for reasons include people that have unregistered themselves (can people even do this?), people that have been kicked off, and people that have, for some reason, been re-registered with a new ID and their original ID is not being counted (perhaps someone that registered with an e-mail name and then the mods helped them get a different name?). Does that number (144) make sense to you mods and can you shed some light on the discrepancy? Does the total shown include people, like spammers, that have been banned?

    Next, if I total up all of the post counts in the forums I get 551248. The main page claims 561907, a difference of 10,659. Now, I can easily believe that there are that many posts in admin/mod forums. But I could also believe that those forums have collected more than that many posts in the last ten years. So, could someone shed some light on the kinds of posts that are included in the main page totals that are not shown in the forum totals? Some ideas include admin/mod posts and deleted posts (spam and such).

    Somewhat along the same lines, I know that posts to the Off-Topic forum do not count in a person's post totals. Are there any other types of posts, such as to those admin/mod forums, that aren't tallied, either?

    I'm kinda curious about this last one because I got a wild hair that got me wondering how many users account for 50% of all posts. But I need to compare apples to apples. Based on what I currently understand, there are presently just shy of half a million tallied posts and the top page of posters (i.e., top 30) account for over 35% of them and the top 2 pages account for 46%. I estimate (haven't actually totalled them) that the top 100 posters account for right at 50% of all the tallied posts.

    Now, this doesn't tell us anything that isn't common knowledge -- like most online forums a small fraction of members account for the lion's share of the activity. But it does underscore, particularly in the more Q&A-type forums, that a very large number of people with questions are being serviced by a quite small number of people. My rough guess is that, at any given time, probably about 20 to 30 members (the exact make-up, of course, ebbs and flows over time) account for half of the posts in those forums with the other half spread out among two or three thousand members (say over the course of a given month).
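The "how many users account for 50% of all posts" question can be sketched in a few lines of Python once per-user post counts are available. This is only an illustration: the function name `posters_for_share` and the sample counts are made up, not real forum data.

```python
def posters_for_share(post_counts, share=0.5):
    """Return the smallest number of top posters whose combined
    post count reaches `share` of all tallied posts."""
    ordered = sorted(post_counts, reverse=True)  # biggest posters first
    total = sum(ordered)
    if total == 0:
        return 0
    running = 0
    for n, count in enumerate(ordered, start=1):
        running += count
        if running >= share * total:
            return n
    return len(ordered)


# Hypothetical per-user post counts (NOT real forum numbers).
sample_counts = [400, 300, 100, 100, 50, 50]
print(posters_for_share(sample_counts))  # top 2 users cover half of 1000 posts
```

For an apples-to-apples result, the counts fed in should all be tallied the same way (for instance, all excluding Off-Topic posts).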
     
  2. thatoneguy

    AAC Fanatic!

    Feb 19, 2009
    6,357
    718
    Member stats broken down by posts

    I think that page is all that is available for stats, from this site software.

    Alexa shows demographics. The demographics are based on information users entered on other sites, so usage patterns can be tracked.

    There is a private mod forum that counts posts, but that forum and its posts aren't visible.

    There may be other areas, such as for e-book, but I've never seen it mentioned.
     
  3. Georacer

    Moderator

    Nov 25, 2009
    5,142
    1,266
    My guess is that when the admins were setting up the site, they hard-removed a number of test accounts.
    As of now, the Moderator forum has 10309 posts and the E-book Developers forum 335. That's the number you're missing.
    Based on experience, I'll say that only Offtopic doesn't count. Deleted or moderated (invisible) posts, naturally, don't count either.
     
    WBahn likes this.
  4. WBahn

    Thread Starter Moderator

    Mar 31, 2012
    17,743
    4,795
    Great! Thanks a bunch.

    Since I'm assuming you snagged the Moderator and E-book forum counts after I snagged my numbers, it would appear that there are probably a dozen or two posts somewhere else that aren't being tallied. If I were an admin I would be tempted to track down that discrepancy because it might reveal a bug in the database or the vBulletin code, but it's doubtful that it is a major issue.
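For reference, the arithmetic behind that "dozen or two" can be laid out using only the figures quoted in this thread. The counts were taken at different times, so the leftover is a rough lower bound rather than an exact figure; the variable names are just labels.

```python
# Figures quoted earlier in the thread.
main_page_total = 561907    # total posts claimed on the main page
forum_totals_sum = 551248   # sum of the visible per-forum post counts
moderator_forum = 10309     # Moderator forum posts (sampled later)
ebook_dev_forum = 335       # E-book Developers forum posts (sampled later)

gap = main_page_total - forum_totals_sum      # posts missing from the visible forums
explained = moderator_forum + ebook_dev_forum  # posts in the two hidden forums
unexplained = gap - explained                  # posts still unaccounted for

print(gap, explained, unexplained)  # 10659 10644 15
```

Because the hidden-forum counts were sampled after the main-page total, any posts made there in between would make the true leftover slightly larger than 15.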

    One of these days I'm going to write a data-miner to go explore some of these questions. But probably not any time soon.
     
  5. Georacer

    Moderator

    Nov 25, 2009
    5,142
    1,266
    Are you going to make something to automate page parsing and data storage in csv or tables?

    I'm trying to find someone to parse 30x25 pages of data and store them in a format which I can then use to perform statistical analysis.

    If you go down that lane and are interested, I'd appreciate some help.
     
  6. thatoneguy

    AAC Fanatic!

    Feb 19, 2009
    6,357
    718
    sed and awk are your Best Friends when it comes to processing text formatted data. :D
     
  7. WBahn

    Thread Starter Moderator

    Mar 31, 2012
    17,743
    4,795
    I'm not sure what you mean by "30x25" pages of data.

    What I have in mind is probably going to be unnecessarily brute force from your standpoint because I have to work with teasing information out one user at a time with the access capabilities available to registered members. If I had the capabilities of a mod then I'm sure I could do things much more elegantly and if I was an admin then everything I have in mind would probably be pretty trivial. Though, in my case, each level would pose significant learning curves -- but that's the main reason I'm interested in doing anything at all.

    If I ever find the time to start on this little project, I'll be sure to let you know. There may be some things you are interested in that I can help out with. Realistically, it will be at least several months before I can justify the time.
     
  8. Georacer

    Moderator

    Nov 25, 2009
    5,142
    1,266
    Make that 84x25. I have 84 pages that each contains 25 entries of this sort:
    http://euw.leagueoflegends.com/tribunal/en/case/718715/

    Each such entry has some global information and info in up to 5 tabs.

    I'm not really into programming of this sort (web design and scripts) but I love statistics.
    I want to capture that data pool, but I don't want to go through learning a programming language. I had asked a couple of friends who are more involved, but came up empty.
     
  9. thatoneguy

    AAC Fanatic!

    Feb 19, 2009
    6,357
    718
    Do you have the source for that page without the HTML formatting?

    I'm sure either tschuk or myself could write a script (regular expressions + sed and/or awk) to put the data into something like CSV format for import into a database.

    If the only source you have is those HTML pages, then it's still possible, just a bit more of a pain, since stripping the HTML formatting sometimes also messes up the data, especially when the sections are different lengths/entries, and the author didn't name their div tags.
     
  10. Georacer

    Moderator

    Nov 25, 2009
    5,142
    1,266
    It's all I got.

    But can't you check the source of the page through the browser? On Firefox it's Right Click -> view page source, and Chrome has the same feature too.

    If you're talking about me providing the page source in a text file, then wouldn't that be pretty much the same as parsing all of the 2242 pages?
     
  11. WBahn

    Thread Starter Moderator

    Mar 31, 2012
    17,743
    4,795
    I'm gonna let thatoneguy take it from here, but I'm following along with interest.
     
  12. tshuck

    Well-Known Member

    Oct 18, 2012
    3,531
    675
    ....my experience with sed is a bit limited, but it doesn't seem like it's out of my league.

    I usually go for a C(insert modifier of choice) program, but sed would shortcut a lot of the programming required.

    @Geo, wow! I had no idea you sifted through all of those pages for your monthly statistics!:eek: Kudos to you!
     
  13. Georacer

    Moderator

    Nov 25, 2009
    5,142
    1,266
    (Hint: I didn't do it manually)

    And if you open the link you will see that it has nothing to do with AAC.

    edit: Tracking the AAC stats however requires taking at least four samples daily, which I had done in the past manually, but then I had a friend write a script for me. Additionally, another script samples the site's online user number every half hour, every day, for one week per month.
     
    Last edited: Mar 13, 2013
  14. thatoneguy

    AAC Fanatic!

    Feb 19, 2009
    6,357
    718
    It amounts to gathering that information, if you don't already have it, using a utility such as wget to "scrape" each page.

    Then it's only a matter of writing match/replace/delete Regular Expressions to remove the parts you don't want to bother with, leaving you with an intermediate file of the data which has extra formatting. Then another regular expression and an awk script to combine the data, such as counting instances between files, averaging, etc, all done by pulling fields out with regular expressions.

    sed is the "Stream EDitor". Its commands take the form <addr>command, for example s/RegEx/replacement/ to substitute, where the optional <addr> can be a line number or a regular expression.

    awk is a more complete language, with loops and variables. It is also based on Regular Expression "addresses" in each file, but it breaks the text into "fields" like a database, with each field delimited by an HTML tag, a comma, or even a space. Once you have the fields you want, they can be combined into a list, averaged, summed, etc., then printed out.
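The same "split into fields, then summarize" idea can be sketched in Python instead of awk. This is only a rough illustration: the HTML rows are invented, and using a regular expression to treat tags as separators is fragile on real pages, for the reasons mentioned earlier in the thread.

```python
import re

# Any HTML tag is treated as a field separator (a crude assumption).
TAG = re.compile(r"<[^>]+>")


def fields_from_html_row(row):
    """Replace tags with commas, split on commas, drop empty fields."""
    text = TAG.sub(",", row)
    return [field.strip() for field in text.split(",") if field.strip()]


# Hypothetical rows: a user name followed by a post count.
rows = [
    "<tr><td>alice</td><td>12</td></tr>",
    "<tr><td>bob</td><td>30</td></tr>",
]

records = [fields_from_html_row(row) for row in rows]
total = sum(int(rec[1]) for rec in records)
average = total / len(records)
print(records, total, average)
```

In awk terms, `TAG` plays the role of the field separator and the `sum(...)` line is what an `END` block would normally print.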

    awk was written to parse system logs and other text data files into reports prior to databases taking the logs in syslog servers.

    There's the history lesson. :)

    If you fully understand regular expressions, there isn't a limit on what you can do, they come in VERY handy when changing a bunch of Verilog (or other language) source files to reflect a global change, for example.
     
    Georacer likes this.
  15. Georacer

    Moderator

    Nov 25, 2009
    5,142
    1,266
    I don't see a problem with separating the data, once I have the source code. It's getting the code to begin with.

    The mentioned friend tried with a Java (I think) program to access the code fields directly, but he got timeouts that broke the execution.
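One common way to keep timeouts from breaking a long run is to retry each page a few times with a growing delay. The sketch below is generic and hypothetical: `fetch` stands for whatever function actually downloads a page, and it is assumed to raise Python's built-in `TimeoutError`; a real HTTP or browser library may raise its own exception type instead.

```python
import time


def fetch_with_retry(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on TimeoutError wait 1x, 2x, 4x... base_delay and retry."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller log and skip this page
            sleep(base_delay * 2 ** attempt)


# Demo with a fake fetcher that times out twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"


delays = []  # record the waits instead of actually sleeping
page = fetch_with_retry(flaky_fetch, "http://example.invalid/case/1", sleep=delays.append)
print(page, delays)
```

Catching the final failure per page, rather than letting it end the whole run, also makes it possible to list which cases were skipped and retry only those later.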

    wget might be a solution, but I don't see myself installing Linux again for some time in the future. I guess I'll just procrastinate a bit once more.

    Thanks for the info!
     
  16. Filox

    New Member

    Oct 11, 2011
    7
    4
    Hello, I am the 'mentioned friend' and I will add some details about the problem we are trying to solve.

    First of all, the pages contain elements that are Javascript-generated which means that I can't just GET the html. The Javascript has to be rendered first and then downloaded.

    For this purpose I used selenium for Python, which is used for testing web pages. This library contains webdrivers that open a browser window and you can control it via code. I ended up with a fully functioning script but some cases/results are missing.

    Do you know any way other than using a browser-engine to render the javascript? Or a totally different approach to the problem?

    EDIT: Here's part of the output (http://eune.leagueoflegends.com/tribunal/en/case/238344/):
    Code:
    238344,Overwhelming Majority,Punish,0,1,Loss,9, 14, 11,41:20,Classic,56|0,1,Loss,12, 10, 0,29:07,Classic,23|0,0,,,,,0|0,0,,,,,0|0,0,,,,,0
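A line in this shape can be taken apart with plain string splitting. The sketch below assumes, from the sample alone, that `|` separates up to five per-game records and that the first three comma-separated values are case-level fields. What each field means is not stated in the thread, so the code keeps the fields positional; the name `parse_case` and the `n_global` parameter are made up for illustration.

```python
SAMPLE = (
    "238344,Overwhelming Majority,Punish,0,1,Loss,9, 14, 11,41:20,Classic,56"
    "|0,1,Loss,12, 10, 0,29:07,Classic,23|0,0,,,,,0|0,0,,,,,0|0,0,,,,,0"
)


def parse_case(line, n_global=3):
    """Split one output line into (case-level fields, list of per-game field lists)."""
    segments = line.strip().split("|")
    first = [field.strip() for field in segments[0].split(",")]
    global_fields = first[:n_global]  # assumed: case id plus two case-level values
    games = [first[n_global:]]        # rest of the first segment is the first game
    for segment in segments[1:]:
        games.append([field.strip() for field in segment.split(",")])
    return global_fields, games


global_fields, games = parse_case(SAMPLE)
print(global_fields)
for game in games:
    print(game)
```

Note that the empty game slots in the sample carry fewer fields than the filled ones, so any later analysis should check the field count before indexing into a game record.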
     
    Last edited: Mar 14, 2013
  17. thatoneguy

    AAC Fanatic!

    Feb 19, 2009
    6,357
    718
    If that is a sample line, and all lines are like that, it would be quite easy to rip it into a mySQL database, at which point you can have your way with it in a million different fashions.
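As a rough sketch of the loading step, the example below uses Python's built-in `sqlite3` module as a stand-in for MySQL, so it runs without a database server. The table name `game_fields` and its columns are invented for the example, and fields are stored by position because their meaning is not given in the thread.

```python
import sqlite3

SAMPLE = (
    "238344,Overwhelming Majority,Punish,0,1,Loss,9, 14, 11,41:20,Classic,56"
    "|0,1,Loss,12, 10, 0,29:07,Classic,23|0,0,,,,,0|0,0,,,,,0|0,0,,,,,0"
)


def load_cases(conn, lines, n_global=3):
    """Store every per-game field as one row: (case_id, game_idx, field_idx, value)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS game_fields "
        "(case_id TEXT, game_idx INTEGER, field_idx INTEGER, value TEXT)"
    )
    for line in lines:
        segments = line.strip().split("|")
        first = [f.strip() for f in segments[0].split(",")]
        case_id = first[0]
        # Assumed layout: n_global case-level fields, then the first game's fields.
        games = [first[n_global:]]
        games += [[f.strip() for f in seg.split(",")] for seg in segments[1:]]
        rows = [
            (case_id, game_idx, field_idx, value)
            for game_idx, fields in enumerate(games)
            for field_idx, value in enumerate(fields)
        ]
        conn.executemany("INSERT INTO game_fields VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_cases(conn, [SAMPLE])

# Example query: in how many games does the third field read "Loss"?
losses = conn.execute(
    "SELECT COUNT(*) FROM game_fields WHERE field_idx = 2 AND value = 'Loss'"
).fetchone()[0]
print(losses)
```

Once the rows are in a table like this, counting, grouping, and averaging become ordinary SQL queries, and the same statements carry over to MySQL with little change.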

    Online MMORPGs have anti-page-scraping code to prevent people from getting "an edge". You usually need some sort of cookie/login, session ID, etc. to traverse pages, which dumb scrapers can't do. In addition, many geeks who have written their own data miners have annoyed the admins by winning, so the admins put a limit on the number of pages per second/minute/hour/day a certain login can look at.

    In the latter case, we used a client program, every member of the alliance would install it (about 600 of us), and each would use up their page views populating the database. The small app simply scraped the page and sent all the data to one central database. From that point, we essentially had a complete copy of the actual game database, with timestamps. That allows trends to be found over time, growth rates, whatever.

    In one game, we took it a step further and created an AI that was a hybrid of the scraping app, but it also played the game, based on info from the database. It won several rounds in a tournament. Actually, copies of it with different usernames won all of the top 5 places, since it was online all the time. That was a fun time, back when MMORPGs were new, about 2005 or 2006.

    I haven't done the app writing for a while now, I'm a database guy, but the methods are the same. Just don't create an AI player, you stick out too much and end up getting banned from games. :D
     