After the final import, any hope for duplicate photo deletes?

soaringeagle
@soaringeagle
9 years ago
3,304 posts
this is a big issue as I'm constantly clearing the image cache to keep the drive from getting too full
I'm at 88%, which I don't know why, it gained 3% in a week without a whole lot added (possibly the new search database?)
and with the image cache it gets up to 92%
I'm looking at having to add another drive before the site really gains any traffic, and that's gonna hurt bad


so is there any way to delete duplicates without having to manually delete 67,000 files


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities

updated by @soaringeagle: 03/03/15 05:42:46PM
brian
@brian
9 years ago
10,148 posts
Are you running the latest Ning Import? Paul has added code in there to watch for duplicates in the Ning Archives.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
paul
@paul
9 years ago
4,326 posts
brian:
Are you running the latest Ning Import? Paul has added code in there to watch for duplicates in the Ning Archives.

I don't think that is what SE is talking about. The latest Ning Import release will trap duplicate items in the Ning archives, which is yet another recently found Ning anomaly.

SE - are you talking about those photos left over in the archive after deleting all the files that have been copied to JR? If so, those left should all be referenced in imported comments etc. I'm not sure what any others might be. If Ning has put a load there that are not needed for the import, there's not much JR can do about it.
Or am I misunderstanding?


--
Paul Asher - JR Developer and System Import Specialist
soaringeagle
@soaringeagle
9 years ago
3,304 posts
there are 2 of nearly all gallery images, not all but most
and I've already done the final import, so the fixed importer to trap them is no help as they are already there

because it appears they are all in a secondary numbered folder, is there any ssh trickery that can delete all the directories in the extra numbered directories, something like find gallery/*numbers* delete /*numbers

anything that will delete all those extras

if that's even where those search errors are coming in; in my sitemap crawler I'm also getting thousands of duplicate-content errors. I can post those error logs or send them to your email to help you find where all the doubles are
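
One cautious way to approach that over ssh is a dry run that only lists directories whose contents are byte-identical, before deleting anything. This is only a sketch: the ./gallery path and the one-numbered-folder-per-item layout are assumptions, and it expects GNU findutils/coreutils with no newlines in filenames.

```shell
#!/bin/sh
# Dry-run sketch: report gallery item directories with identical contents.
# Assumptions: numbered folders live directly under ./gallery (hypothetical
# path), filenames contain no newlines, GNU findutils/coreutils available.
for d in gallery/*/; do
  # Hash the sorted list of per-file md5 sums, so a directory's fingerprint
  # depends only on its contents, not on file order.
  sum=$(find "$d" -type f -exec md5sum {} + | awk '{print $1}' | sort | md5sum | awk '{print $1}')
  printf '%s %s\n' "$sum" "$d"
done | sort | awk 'seen[$1]++ { print "duplicate dir:", $2 }'
```

Nothing is deleted; once the reported list looks right, the same loop could feed the duplicate directories to rm -r.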


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
paul
@paul
9 years ago
4,326 posts
So to be clear, you have run the Delete Media tool and you are looking at what's left?


--
Paul Asher - JR Developer and System Import Specialist
soaringeagle
@soaringeagle
9 years ago
3,304 posts
actually, inspect element revealed that those weird urls aren't coming from the photos; they are sequential, so no way to bulk delete, argh


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
soaringeagle
@soaringeagle
9 years ago
3,304 posts
paul:
So to be clear, you have run the Delete Media tool and you are looking at what's left?

yes
it's like here, duplicates:

https://www.dreadlockssite.com/graciela-valderrama/gallery/80859/1-photo-by-mauricio-gomez-amoretti-12-08-14#gallery_img


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
soaringeagle
@soaringeagle
9 years ago
3,304 posts
same names, different numbers: 80859, 80860


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
9 years ago
10,148 posts
soaringeagle:
same names, different numbers: 80859, 80860

That means the importer was given the image twice. I think the new Dupe detection code in the latest Ning Import module should take care of this... is that right Paul?


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net

updated by @brian: 01/30/15 01:10:28PM
soaringeagle
@soaringeagle
9 years ago
3,304 posts
if I was to do a delete and re-import it would
but once there ...

I'm running

find -not -empty -type f -printf "%s\n" \
  | sort -rn | uniq -d \
  | xargs -I{} -n1 find -type f -size {}c -print0 \
  | xargs -0 md5sum \
  | sort | uniq -w32 --all-repeated=separate

to find a list of duplicates


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
Strumelia
@strumelia
9 years ago
3,603 posts
Just to confirm: this is something that is in the Ning archive to begin with.
See the screenshot from my ning archive "Groups" folder. I look in my ning archives (not the ones uploaded or imported to JR, but the original ones, on my computer)... and there are thousands of duplicate images, with only differing file numbers. They are not only in the Photos folder, but in the Groups folder, Discussions, etc.
I always thought they were needed because someone might have used them or REFERENCED them in more than one place on my ning network. But I notice the duplicates are always in pairs, not 3 or 4 or 5... just 2. Odd. And some of them I know are not likely referenced in more than one place on my network... like a group avatar I used, or a boring pic of someone's lawn.
These pairs of identical photos are everywhere in our ning archives, and they take up a LOT of space. I can only imagine on SE's network. =8-o
dupes.jpg  •  425KB




--
...just another satisfied Jamroom customer.
Migrated from Ning to Jamroom June 2015

updated by @strumelia: 01/30/15 02:21:37PM
brian
@brian
9 years ago
10,148 posts
Honestly the Ning Archive "format" is pretty much a mess. Paul has done a great job getting code in place to "clean things up", but we run into new "gotchas" fairly frequently. I get a feeling that Ning doesn't really want you to take your data, so they "sabotage" it a bit.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
9 years ago
3,304 posts
it's a nightmare. I'm paying like 300 a month for the server and thinking I'll have to add another 100+ for another drive before there's any significant traffic

speaking of that, can you please take a look at my titles and meta suggestion for sitebuilder and up that priority a notch, just because I really desperately need to take care of the seo asap or I'll be broke very soon

though removing the excess wasted space is also up there on the priorities list

if I could remove 100 gb of excess photos that would be huge


f'ing ning
from now on gonna call em fing
I know if I call em and explain the issue they will say "we will look into it" but never get a thing done


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
soaringeagle
@soaringeagle
9 years ago
3,304 posts
oh, I was gonna ask and forgot
if I did do a re-upload re-import with delete, doesn't that just truncate the database, not remove the files?

so that won't help either, huh


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
Strumelia
@strumelia
9 years ago
3,603 posts
Adding to Soaring's question:
since these paired images have different file number IDs in our ning folders (see my previous screenshot), will Paul's delete duplicates Tool recognize and delete the dupes even though they have diff # ids?


--
...just another satisfied Jamroom customer.
Migrated from Ning to Jamroom June 2015
brian
@brian
9 years ago
10,148 posts
Strumelia:
Adding to Soaring's question:
since these paired images have different file number IDs in our ning folders (see my previous screenshot), will Paul's delete duplicates Tool recognize and delete the dupes even though they have diff # ids?

That's a good question. Looking at the code it looks like we are checking for duplicate ID's, which would not help here. Do the images have anything in common that we can key on? I don't have a Ning Archive to check it out.

Thanks!


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
Strumelia
@strumelia
9 years ago
3,603 posts
I did give you a copy of my ning archive about 3 months ago- maybe you still have it?


--
...just another satisfied Jamroom customer.
Migrated from Ning to Jamroom June 2015
brian
@brian
9 years ago
10,148 posts
Strumelia:
I did give you a copy of my ning archive about 3 months ago- maybe you still have it?

I likely don't - I will look for it though. Space is at a premium here for me on my macbook, so I don't hang on to large files unless I'm actively using them. Paul might have it though - I'll get a ticket open so we can come up with a solution for this.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
Clay Gordon
@claygordon
9 years ago
733 posts
BTW -

Just to muddy the waters. I don't have any problems with duplicate images.
Strumelia
@strumelia
9 years ago
3,603 posts
WTH Clay- you lucky dog!


--
...just another satisfied Jamroom customer.
Migrated from Ning to Jamroom June 2015
soaringeagle
@soaringeagle
9 years ago
3,304 posts
brian:
Strumelia:
Adding to Soaring's question:
since these paired images have different file number IDs in our ning folders (see my previous screenshot), will Paul's delete duplicates Tool recognize and delete the dupes even though they have diff # ids?

That's a good question. Looking at the code it looks like we are checking for duplicate ID's, which would not help here. Do the images have anything in common that we can key on? I don't have a Ning Archive to check it out.

Thanks!

I'm running a file size and hash compare; it's taking hours

but that's what I think will be needed: a file size check, then an md5 hash test

I'm using the 1st code from here
http://www.commandlinefu.com/commands/view/3555/find-duplicate-files-based-on-size-first-then-md5-hash

but that is basically what needs to be run, and I think it should be run as a maintenance cycle in the queue as it's kinda cpu intensive (not severely)

but I would recommend that before deletion it should also check for a custom form designer field *_featured

if I have to re-feature em all, so be it
but if it specifically chose the version that's not featured when 2 versions are found, that would be best
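
The size-first, then md5 idea can be sketched as one pass over the tree. This is only a sketch, assuming GNU find/coreutils and no newlines in filenames; files with a unique byte size are never hashed, which is where the CPU savings come from:

```shell
#!/bin/sh
# Sketch of "compare by size first, then md5" (assumes GNU findutils and
# coreutils, and that no filename contains a newline).
find . -type f -printf '%s\t%p\n' \
  | awk -F'\t' '{ n[$1]++; f[$1] = f[$1] $2 "\n" }
                END { for (s in n) if (n[s] > 1) printf "%s", f[s] }' \
  | xargs -r -d '\n' md5sum \
  | sort | uniq -w32 --all-repeated=separate
```

Only the files whose size is shared get piped to md5sum, and uniq then groups identical hashes into blank-line-separated blocks.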


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
soaringeagle
@soaringeagle
9 years ago
3,304 posts
ps: when this finally finishes I will put the results in a file and attach, but what file types can I attach?


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
9 years ago
10,148 posts
soaringeagle:
I'm running a file size and hash compare; it's taking hours

This is because you are piping the output of find to xargs, and then piping THAT to xargs - it may not even finish in 24 hours. We don't need to do that to detect the same file twice.
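
A single-pass sketch of the same duplicate check (assuming GNU findutils/coreutils are available) hashes every file exactly once instead of re-running find for each repeated size:

```shell
# Single-pass sketch (assumes GNU findutils/coreutils): every file is
# hashed exactly once, then identical hashes are grouped at the end.
find . -type f -exec md5sum {} + \
  | sort | uniq -w32 --all-repeated=separate
```

It spends a little extra hashing files with unique sizes, but avoids the nested find, so the total work is linear in the number of files.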


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
9 years ago
3,304 posts
ok, so can you come up with a good way to do this? I'll cancel this


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
9 years ago
10,148 posts
soaringeagle:
ok, so can you come up with a good way to do this? I'll cancel this

Paul is checking it out - I think we can do something.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
9 years ago
3,304 posts
god, you guys are the best!
that would make me very happy and really be saving me, as I can't afford another drive now

it's really an honour to get to share all you've done for us over on ning creators


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
Strumelia
@strumelia
9 years ago
3,603 posts
I have not run a 'delete duplicates' since Paul first made that tool, and I think he's improved it since then; also the import tool is different now. Thus, I still have my Ning Archive folder with certain files that are needed. I see there are many duplicate images (all with sequential file number names though). And I see the same thing in my JR Media folders: many pairs of files with the same size and name, but a different ID # at the start of their name. I'm thinking there are thousands of them.
Plus, I saw dupes of many .flv videos as well in my JR folders: one with the name, and the twin with the same name plus "original" after the name... but same vid, same file size... many MBs each. :(

Gling. The Glam monster that ate Ning.


--
...just another satisfied Jamroom customer.
Migrated from Ning to Jamroom June 2015
soaringeagle
@soaringeagle
9 years ago
3,304 posts
some of the vids may be re-uploads though, not necessarily ning archive, unless you see a lot

but yeah, I'm thinking I must have 40-90 gigs of wasted space at least

flam f'ing glam
like the flimflam plan


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
Strumelia
@strumelia
9 years ago
3,603 posts
soaringeagle:
flam f'ing glam
like the flimflam plan

For a glimpse at the Ning Engineering team hard at work, fast forward to 1:35 in this undercover Ning film:
On the Ning Nang Nong (Good Quality)
lololol....


--
...just another satisfied Jamroom customer.
Migrated from Ning to Jamroom June 2015
soaringeagle
@soaringeagle
9 years ago
3,304 posts
wtf did you just make me watch
i lost at least 40 iq points in 20 seconds

are they trying to cause kids to be braindead?


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities