Website Copying/Dumping

MAJID
Naik
Posts: 90
Joined: Thu Oct 16, 2003 10:23 pm

Post by MAJID »

Hello,
Is there any software for Linux that can copy or dump websites?
I am going to use it on Fedora Core 2.
Please include a link to download the software.
Waiting for a quick reply.
01101101 01100001 01101010 01101001 01100100
jargon
Lieutenant Colonel
Posts: 691
Joined: Mon Oct 13, 2003 9:40 am

Post by jargon »

I hope whatever you are doing is legal.
Check out 'wget'; it should suit your needs.
Read the manpage with 'man wget'.
Search for the keyword "mirror", which is wget's -m option.
jargon
LinuxFreaK
Site Admin
Posts: 5132
Joined: Fri May 02, 2003 10:24 am
Location: Karachi

Post by LinuxFreaK »

Dear MAJID,
Salam,

Use HTTrack

Or

# wget -m www.example.com
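
A slightly fuller form of the mirror command above might look like the sketch below. The URL is a placeholder and the `run` helper is hypothetical: it only prints the command unless RUN=1 is set, so the example can be tried without downloading anything. The extra options are standard wget flags, but whether you need them depends on the site.

```shell
#!/bin/sh
# Placeholder URL: replace it with the site you are allowed to copy.
URL="http://www.example.com/course/index.html"

# Hypothetical helper: print the command by default,
# execute it only when the script is called with RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# --mirror           recursive download with timestamping (long form of -m)
# --convert-links    rewrite links so the local copy can be browsed offline
# --page-requisites  also fetch the images and stylesheets the pages use
# --no-parent        never climb above the starting directory
# --wait=1           pause one second between requests
run wget --mirror --convert-links --page-requisites --no-parent --wait=1 "$URL"
```

Saved as, say, mirror.sh, it would be started with `RUN=1 sh mirror.sh` once the printed command looks right.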

Best Regards.
Farrukh Ahmed
MAJID
Naik
Posts: 90
Joined: Thu Oct 16, 2003 10:23 pm

Post by MAJID »

My friend, what I am doing is perfectly legal. I am trying to copy the lectures of a course, which are in HTML format. The website lets you read them online or save them one page at a time using Save As or similar, but that is time-consuming, so I decided to copy them with some software.
01101101 01100001 01101010 01101001 01100100
LinuxFreaK
Site Admin
Posts: 5132
Joined: Fri May 02, 2003 10:24 am
Location: Karachi

Post by LinuxFreaK »

Dear MAJID,
Salam,

No Problem, take a look at this http://www.linuxpakistan.net/forum2x/vi ... 3413#17360

Best Regards.
Farrukh Ahmed
MAJID
Naik
Posts: 90
Joined: Thu Oct 16, 2003 10:23 pm

Post by MAJID »

Dear All,
I tried WebHTTrack Copier.
[root@imrant root]# rpm -q httrack
httrack-3.32.03-FC2

It placed two entries in the Internet menu:
Internet > Browse Mirrored Websites
and
Internet > WebHTTrack Website Copier

The problem is that when I open the second one, it opens the Mozilla browser with
the URL http://imrant.server1:8080/
already filled in. I then tried to copy the website I want,
which is
http://www.cs.sfu.ca/CourseCentral/365/li/index.html

but it simply opens the page, with no option for copying. PLEASE HELP

I also tried wget -m, without success:
[root@imrant root]# wget -m http://www.cs.sfu.ca/CourseCentral/365/li/index.html
--19:11:13-- http://www.cs.sfu.ca/CourseCentral/365/li/index.html
=> `www.cs.sfu.ca/CourseCentral/365/li/index.html'
Resolving www.cs.sfu.ca... 142.58.111.29
Connecting to www.cs.sfu.ca[142.58.111.29]:80... failed: Connection timed out.
Retrying.

--19:14:34-- http://www.cs.sfu.ca/CourseCentral/365/li/index.html
(try: 2) => `www.cs.sfu.ca/CourseCentral/365/li/index.html'
Connecting to www.cs.sfu.ca[142.58.111.29]:80...

PLEASE HELP
01101101 01100001 01101010 01101001 01100100
LinuxFreaK
Site Admin
Posts: 5132
Joined: Fri May 02, 2003 10:24 am
Location: Karachi

Post by LinuxFreaK »

Dear MAJID,
Salam,

Probably bad connection :)

Best Regards.
Farrukh Ahmed
Faraz.Fazil
Major General
Posts: 1024
Joined: Thu Jul 04, 2002 5:31 pm
Location: Karachi/Pakistan/Earth/Universe

Post by Faraz.Fazil »

This is not a case of bad connection.

Majid, as you told me on MSN, the problem is that your cable internet provider is using NTLM authentication. No problem, here is how to make it work.

In an earlier post I helped you set up the ntlmaps proxy, which converts NTLM to basic authentication.

Now all you need to do to get wget and other tools working is to use that ntlmaps proxy. To do that:

1. Open a console/terminal

and issue this command:

export http_proxy=http://username:password@127.0.0.1:5865/

Make sure you replace username and password with your own user name and password.

2. Use wget with this parameter:

--proxy=on

wget -m --proxy=on http://www.cs.sfu.ca/CourseCentral/365/li/index.html


Specify the other parameters, such as the web address, as usual.

wget reads the proxy location from the http_proxy environment variable.

Make sure you do this while the ntlmaps .py script is running and the proxy is active (./main.py runs the script and activates the APS proxy).

It should work 1000% if you follow the steps correctly.
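
The proxy step above can be sketched as a small script. PROXY_USER and PROXY_PASS are placeholder names introduced only for this example; the address 127.0.0.1:5865 is the local ntlmaps proxy from the steps above. Note that the shell does not allow spaces around the "=" in an assignment.

```shell
#!/bin/sh
# Placeholders (assumptions for this sketch): use your real account details.
PROXY_USER="username"
PROXY_PASS="password"

# No spaces around "=", otherwise the shell rejects the assignment.
# 127.0.0.1:5865 is the local ntlmaps proxy described in this post.
export http_proxy="http://${PROXY_USER}:${PROXY_PASS}@127.0.0.1:5865/"

# Show what wget will pick up from the environment.
echo "http_proxy is set to: $http_proxy"

# Then, in the same terminal and with ntlmaps running:
# wget -m --proxy=on http://www.cs.sfu.ca/CourseCentral/365/li/index.html
```

The variable only lives in the terminal where it was exported, so wget has to be started from that same terminal.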
Linux for Life!
MAJID
Naik
Posts: 90
Joined: Thu Oct 16, 2003 10:23 pm

Post by MAJID »

GREAT GREAT GREAT, it worked, my friend. Great tips, Faraz.
I am currently dumping the website using wget as you suggested.

Now please help me solve the WebHTTrack problem as well.
01101101 01100001 01101010 01101001 01100100
Dr-Munir
Naik
Posts: 80
Joined: Sun Oct 31, 2004 11:48 pm

Post by Dr-Munir »

Well, HTTrack is easy to configure once you know what you are going to do. It is very flexible; the only trouble you may have is applying a filter for a particular website. I don't know whether wget has a filtering option, but if you want only HTML files, for example, and want to exclude everything else, you can do that in HTTrack. I love its flexibility.

It's my favorite on both Windows and Linux.

For further information, please follow the links:

Httrack FAQs

Httrack Documentation

Httrack Forum

and finally

How To Use Httrack
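
On the question of whether wget can filter: it has an --accept option (short form -A) that takes a comma-separated list of file name suffixes to keep. A minimal sketch, assuming a placeholder URL and a hypothetical `run` helper that only prints the command unless RUN=1 is set:

```shell
#!/bin/sh
# Placeholder URL and suffix list (assumptions for this sketch).
URL="http://www.example.com/course/"
ACCEPT="html,htm"

# Hypothetical helper: print the command by default,
# execute it only when the script is called with RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# --accept keeps only files whose names end in the listed suffixes;
# --no-parent keeps the crawl below the starting directory.
run wget --mirror --no-parent --accept "$ACCEPT" "$URL"
```

The opposite filter is --reject (-R), which skips the listed suffixes instead.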
When EveryThing Is Meant To Be Broken , I Just Want To Know Who I Am
MAJID
Naik
Posts: 90
Joined: Thu Oct 16, 2003 10:23 pm

Post by MAJID »

Thanks for the help, Dr. Sahab, but could you please tell me how to start the WebHTTrack software? I have installed it successfully, but when I click on the icon it opens the Mozilla browser. I do not know where to get the fancy GUI where I can create a NEW PROJECT or whatever is described in the tutorial you provided. I am unable to get the WebHTTrack GUI; there are only icons that happily launch the Mozilla browser instead of WebHTTrack :).

I cannot understand this:
[root@imrant root]# webhttrack
/usr/bin/webhttrack(3489): launching /usr/bin/mozilla
Error: No running window found.
/usr/bin/webhttrack(3489): spawning browser..

PLEASE HELP

Is there any command to start the WebHTTrack GUI directly, so that the Mozilla browser stops opening and reopening?

PLEASE HELP
01101101 01100001 01101010 01101001 01100100
Dr-Munir
Naik
Posts: 80
Joined: Sun Oct 31, 2004 11:48 pm

Post by Dr-Munir »

Well Majid,

I am afraid you haven't followed the links all the way through.

HTTrack for Linux actually opens in a browser window; the fancy GUI you are talking about lives inside that browser window.

If everything went fine during installation and you had no trouble, you can start it with the command webhttrack from the Run menu,

or you can type webhttrack in any terminal, as root or not.

But make sure it installed without any trouble.

I don't know what went wrong.

For any proxy trouble you might face using webhttrack, use the same suggestions as for wget.
When EveryThing Is Meant To Be Broken , I Just Want To Know Who I Am
MAJID
Naik
Posts: 90
Joined: Thu Oct 16, 2003 10:23 pm

Post by MAJID »

Thanks, Dr. Sahab, for replying in such a short time. It is working now.

Actually, in the meantime I have installed a front end, ghttrack, and it is easy. Do not be so afraid :)
01101101 01100001 01101010 01101001 01100100