compare file line by line found the longest parts… in python

Postby boy157 » Thu Mar 14, 2013 7:21 pm

How can I compare a text file line by line and find the longest repeated parts and their frequencies?

Example:
Code: Select all
A B C D
A B C D
A B C E
A B C F
A B C


The result would be a list like this:

[['2','A B C D'],['3','A B C']]

This is what I have done so far: http://pastebin.ca/2332147. However, it converts the text into an array and searches for the most repeated chain in the whole text, whereas I need to find the most repeated chain between the lines.

Can someone help?
boy157
 
Posts: 2
Joined: Thu Mar 14, 2013 6:27 pm

Re: compare file line by line found the longest parts… in py

Postby tnknepp » Thu Mar 14, 2013 8:58 pm

I know some will frown on using numpy, but this is what I use most often.
You could build a dictionary using numpy. e.g.

Code: Select all
import numpy as np

a = np.array(['a b c', 'a b c', 'a b c', 'e f g'])

counts = {}

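# Loop until every element has been counted and removed.
# (np.shape(a) returns a tuple, so comparing it to 0 never ends the loop.)
while a.size != 0:
    counts[str(a[0])] = int(np.sum(a == a[0]))
    a = a[a != a[0]]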


This will give you a dictionary like: counts = {'a b c':3, 'e f g':1}

You could also stick with plain lists, which is likely preferable for most users:

Code: Select all
# First build a list of the unique lines.
# Using the same data as above:
a = ['a b c', 'a b c', 'a b c', 'e f g']
b = list(set(a))

counts = {}

# Count how often each unique line occurs in the original list.
for r in b:
    counts[r] = a.count(r)


This should give the same dictionary as using numpy.
Python: 2.7 via Anaconda
Numpy: 1.7
Pandas: 0.11
OS: Windows 7
IDE: Spyder/IPython
tnknepp
 
Posts: 114
Joined: Mon Mar 11, 2013 7:41 pm

Re: compare file line by line found the longest parts… in py

Postby Yoriz » Thu Mar 14, 2013 11:33 pm

Python has something in the standard library to do this for you.
Code: Select all
from collections import Counter

lines = ('A B C D', 'A B C D', 'A B C E', 'A B C F', 'A B C')

counter = Counter(lines)
# Sort by line length, longest first. The parenthesized print also runs on Python 3.
print(sorted(counter.most_common(), key=lambda item: len(item[0]),
             reverse=True))

[('A B C D', 2), ('A B C F', 1), ('A B C E', 1), ('A B C', 1)]
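
Since the input in the original question is a text file, the same Counter approach can be fed directly from a file object. This is only a minimal sketch: `count_lines` is a hypothetical helper name, and `io.StringIO` stands in for a real file opened with `open()`. Like the snippet above, it only counts lines that are exactly identical.

```python
import io
from collections import Counter


def count_lines(fileobj):
    """Count identical lines, longest line first (hypothetical helper)."""
    # Strip surrounding whitespace (including the newline) and skip blank lines.
    stripped = (line.strip() for line in fileobj)
    counter = Counter(line for line in stripped if line)
    # Sort by line length, then by count, both descending.
    return sorted(counter.items(),
                  key=lambda item: (len(item[0]), item[1]),
                  reverse=True)


# io.StringIO stands in for a real text file here.
sample = io.StringIO(u"A B C D\nA B C D\nA B C\n")
print(count_lines(sample))
```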
Windows7, Python 2.7.4., WxPython 2.9.5.0., some Python 3.3
Yoriz
 
Posts: 570
Joined: Fri Feb 08, 2013 1:35 am
Location: UK

Re: compare file line by line found the longest parts… in py

Postby boy157 » Fri Mar 15, 2013 6:11 pm

Thanks guys

However, your solutions only count lines that are exactly the same. For example, when I have these lines:

A B C D
A B C F

the output is [(A B C D,1),(A B C F,1)]

but I need this output:

[(A B C,2)]

Can you help me?

Re: compare file line by line found the longest parts… in py

Postby tnknepp » Mon Mar 18, 2013 4:11 pm

Are you always limiting yourself to the first three letters in the list, or will you eventually want to limit yourself to two, or expand to more?
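
If the prefix length is not fixed, one possible reading of the two examples in this thread is a greedy grouping: find the longest space-separated prefix shared by at least two of the remaining lines, record it with the number of lines that start with it, drop those lines, and repeat. The sketch below works under that assumption and is not a confirmed answer to the original question. The function name `longest_repeated_prefixes` and the `min_count` parameter are made up for illustration.

```python
from collections import Counter


def longest_repeated_prefixes(lines, min_count=2):
    """Greedily group lines by their longest shared token prefix."""
    # Work on tuples of tokens so prefixes can be used as dictionary keys.
    remaining = [tuple(line.split()) for line in lines if line.strip()]
    result = []
    while True:
        # Count every token prefix of every line that is still ungrouped.
        prefix_counts = Counter(
            tokens[:n]
            for tokens in remaining
            for n in range(1, len(tokens) + 1)
        )
        repeated = [p for p, c in prefix_counts.items() if c >= min_count]
        if not repeated:
            break
        # The longest prefix wins; ties are broken by the higher count.
        best = max(repeated, key=lambda p: (len(p), prefix_counts[p]))
        result.append((' '.join(best), prefix_counts[best]))
        # Drop the lines covered by this prefix before the next pass.
        remaining = [t for t in remaining if t[:len(best)] != best]
    return result


lines = ['A B C D', 'A B C D', 'A B C E', 'A B C F', 'A B C']
print(longest_repeated_prefixes(lines))
# [('A B C D', 2), ('A B C', 3)]
```

With only the two lines 'A B C D' and 'A B C F' as input, the same function returns [('A B C', 2)]. Lines that share no prefix with any other line are left out of the result.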

