Topic: "Double items in wordlists" (page 1 of 1)

moose
Hi,

I've just noticed that some words appear twice in the wordlist all.txt (http://www.bright-shadows.net/download/wordlists/all.txt)
Examples: disney, cisco
I guess if you checked it with a script you would find many more. It would be good if this were corrected.
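
A minimal sketch of such a check, assuming the file has one word per line. The name find_duplicates and the sample data are made up for illustration:

```python
from collections import Counter


def find_duplicates(lines):
    """Return {word: count} for every word that occurs more than once."""
    # Strip surrounding whitespace so "cisco\n" and "cisco" count as the same word.
    counts = Counter(line.strip() for line in lines)
    # Blank lines are ignored.
    return {word: n for word, n in counts.items() if word and n > 1}


# Made-up sample; for the real list, pass an open file object instead.
sample = ["disney\n", "cisco\n", "apple\n", "disney\n", "cisco\n", "cisco\n"]
for word, n in find_duplicates(sample).items():
    print("%s appears %i times" % (word, n))
```

For the real file, something like find_duplicates(open('all.txt')) should work the same way, since iterating over a file yields its lines.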

moose
quangntenemy
I think that's normal. I still remember reducing the Argon word list from 2 GB to like 500 MB once just by removing duplicates :D
moose
Well, if the order isn't important, I could create new wordlists and give them to the admins.
(This task is so easy in Python :D)
#!/usr/bin/python
# -*- coding: utf-8 -*-

import re


def natural_sort(items):
    """Sort case-insensitively, comparing runs of ASCII digits as numbers."""
    def alphanum_key(key):
        # re.split with a capturing group alternates text and digit runs,
        # so every odd-indexed part is a digit run.
        parts = re.split('([0-9]+)', key)
        return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]
    return sorted(items, key=alphanum_key)


print("Start reading file.")
with open('tbswordlist2.txt', 'r') as infile:
    # Strip the newline so a last line without one is not treated as a different word.
    words = [line.rstrip('\n') for line in infile]
print("Finished reading file.")

print("%i Lines" % len(words))
words = set(words)  # a set keeps only one copy of each word
words = natural_sort(words)  # sorted just to make it easier to see that no word appears twice
print("Finished deduplication and sorting. %i Lines." % len(words))

with open('wordlist.txt', 'w') as outfile:
    outfile.writelines(word + '\n' for word in words)
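
If the original order did matter, the sorting step could be dropped. This is only a sketch and not part of the script above. It assumes Python 3.7 or newer, where dict keeps insertion order, and dedupe_keep_order is a hypothetical helper name:

```python
def dedupe_keep_order(lines):
    """Drop repeated words, keeping each first occurrence in its original position."""
    words = (line.rstrip("\r\n") for line in lines)
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    # Blank lines are skipped.
    return list(dict.fromkeys(w for w in words if w))


# Made-up sample; the last-but-one line deliberately has no trailing newline.
sample = ["disney\n", "cisco\n", "disney\n", "apple", "cisco\n"]
print(dedupe_keep_order(sample))
```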

tbswordlist1.rar: before: 1,450,251 words and 4.32 MB. After: 1,450,184 words and 3.8 MB as tar.gz
tbswordlist2.rar: before: 1,301,376 words and 2.40 MB. After: 650,688 words and 1.7 MB as tar.gz
all-word.txt: 53,091 words both before and after

If any admin wants to upload these, I can send them to you.
Edited by moose on 26.09.2011 15:16:07
Erik
Hello,

I removed the duplicates in all.txt and tbswordlist2.

Best wishes,
Erik
