Hondo's Cabin
http://www.hondosackett.com/yabb/YaBB.pl
http://www.hondosackett.com/yabb/YaBB.pl?num=1482906874

Message started by Fernando on Dec 28th, 2016, 1:34am

Title: How to build a Search Engine WebServer
Post by Fernando on Dec 28th, 2016, 1:34am

It's been almost 15 years since my "so-called" friends and I created the BizInfo Plus search engine. It was to be a business search engine for the B2B model. The hardware and software I created and modified worked perfectly. It ran perfectly for almost 10 years, then greed began to set in and the friends became stupid idiots.

Back in the 90s and the Y2K era, there were many search engine companies, some of them still in business: Ask Jeeves, Ask.com, AOL.com, Lycos, and many others. Many of them, including Google and Yahoo, used the free search engine software that was out at the time. In the simplest of terms, if you have a computer or server and it runs Perl or some other web-accessible language, it can run the search engine software.

Search engine software comes in two parts - the Spider (or web crawler) with the Indexer, and the Query Engine with the Interface. The Spider visits the website links programmed into it and sends what it finds to the Indexer, which builds the database. The Query Engine searches the database for the information requested through the Interface - usually a web page like Google's front page.
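To make the two halves concrete, here is a minimal sketch in Python (not the Perl packages discussed in this thread). It assumes the spider has already fetched the pages. The PAGES dictionary and its example.com URLs are made up for illustration, and the index is a plain in-memory word-to-URL map, not the on-disk database a real package would build.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Query engine: return the URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)

# Hypothetical stand-in for what a spider would have fetched.
PAGES = {
    "http://example.com/widgets": "Wholesale widgets for business buyers",
    "http://example.org/gears": "Industrial gears and widgets supplier",
    "http://example.net/news": "Business news and market reports",
}

INDEX = build_index(PAGES)
print(search(INDEX, "business widgets"))
```

The Interface would be whatever web page passes the visitor's text to search() and prints the list it returns.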

Like I stated, many search engines used the same free search engine software, and some still do to this day! Though there were many search engine packages, the top ones I used, because the others used them, were Ht:Dig, Juggernaut Search Engine, Fluid Dynamics Search Engine, Swish-E and WebGlimpse. Except for Ht:Dig, they all ran on Perl. Ht:Dig was compiled into machine language, but there was a version called Ht://Dig which was made in Perl. There were many others but I can't find the links for them. I did find:
http://www.searchtools.com/analysis/free-search-engine-comparison.html

First off, you need a list of links to crawl through. DMOZ has a list: their volunteers search the web and put what they find into a huge 8GB text file, and you have to select and comb out what you want from it. The file is so big that no editor can read it. You need various file tools to strip the URLs out of it, cutting the multi-gigabyte file down to an 800MB file. That is still too big for any text editor to read, but you can keep filtering it and divide it into separate files for the .com, .net, .org, .edu, and .mil websites. Then you can cut off everything trailing the .com (or .net, etc.) ending of each URL, and the files become manageable. With these separate link lists, you can begin spidering the URLs and indexing the results into your database.
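The filtering steps above can be sketched in Python as follows. This is only an illustration, not the file tools I used back then. It reads one line at a time so the whole file never has to fit in an editor or in memory. The function names, the URL pattern, and the sample lines are all assumptions made for the example.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed pattern: anything that looks like an http(s) URL inside a line.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")
WANTED_TLDS = ("com", "net", "org", "edu", "mil")

def split_site(url):
    """Return (site_root, tld) for a URL, or None if it cannot be parsed."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
    except ValueError:
        return None
    if "." not in host:
        return None
    # Keep only scheme and host, dropping the trailing path and query.
    return f"{parts.scheme}://{host}/", host.rsplit(".", 1)[-1]

def bucket_urls(lines):
    """Group unique site roots by top-level domain, one input line at a time."""
    buckets = defaultdict(set)
    for line in lines:
        for match in URL_PATTERN.findall(line):
            site = split_site(match)
            if site is not None and site[1] in WANTED_TLDS:
                buckets[site[1]].add(site[0])
    return buckets

# Made-up sample lines; a real dump would be read with open(...) instead.
SAMPLE = [
    '<ExternalPage about="http://www.example.com/products/list.html">',
    '<link r:resource="http://example.org/about?id=7"/>',
    '<ExternalPage about="http://www.example.com/contact.html">',
    "a line with no link in it",
]

BUCKETS = bucket_urls(SAMPLE)
for tld in sorted(BUCKETS):
    print(tld, sorted(BUCKETS[tld]))
```

For a real file, pass the open file object itself to bucket_urls() so Python streams it line by line, then write each bucket out to its own list file.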

But this can take days, if not weeks, to do, especially the .com list. Back in '02, it took 6 weeks to do all the .com websites. Now, I would guess it takes longer. You would need to break the list down and spread it across several machines at once. Then you can merge the results together.
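Breaking the list down and merging the results could look something like this Python sketch. It assumes each machine produces a simple word-to-URLs index like a Python dictionary. The real packages each have their own database formats and merge tools, so the function names here are hypothetical.

```python
def split_into_shards(urls, shard_count):
    """Deal the URLs round-robin into shard_count lists, one per machine."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    shards = [[] for _ in range(shard_count)]
    for position, url in enumerate(urls):
        shards[position % shard_count].append(url)
    return shards

def merge_indexes(partial_indexes):
    """Combine per-machine word -> set-of-URLs indexes into one index."""
    merged = {}
    for partial in partial_indexes:
        for word, urls in partial.items():
            merged.setdefault(word, set()).update(urls)
    return merged

# Five made-up URLs split across two machines.
shards = split_into_shards(["a.com", "b.com", "c.com", "d.com", "e.com"], 2)
print(shards)

# Two made-up partial indexes, as if each machine had indexed its own shard.
merged = merge_indexes([
    {"widgets": {"a.com"}},
    {"widgets": {"b.com"}, "gears": {"d.com"}},
])
print(merged)
```

Round-robin dealing keeps the shards about the same size, so no single machine ends up with most of the list.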

But then 2 problems come up. First, the indexed database becomes huge. In '02, the indexed database I created for BIP was 32GB in size. OK, so it is huge. Then comes the second problem: try accessing it. Using a typical hard drive, a simple search on such a database took 20 minutes. No one was going to go to your search engine and wait 20 minutes for a response. So how were the online search engines doing it?

A RAM Drive. Take RAM from the computer and make it into a virtual drive, and accessing the file only takes microseconds! But not all systems can support large RAM spaces. The second best option is to use a RAM Drive connected to the hard drive port. Hyper OS has one, similar to the one I built long ago:
http://www.hyperossystems.com/

A third option is using a flash drive based on a CF or SSD setup. This third setup is the slowest of the three but still many times faster than a hard drive.
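For the first option, a rough Python sketch of staging the database on a RAM drive is below. It assumes the RAM drive is already mounted as a directory (on many Linux systems /dev/shm is RAM-backed, but check your own setup). The function name and paths are hypothetical, and this is not how the BIP server was actually scripted.

```python
import shutil
from pathlib import Path

def stage_in_ram(index_path, ram_dir):
    """Copy the index file into the RAM drive directory and return its new path."""
    source = Path(index_path)
    target_dir = Path(ram_dir)
    if not target_dir.is_dir():
        raise FileNotFoundError(f"RAM drive directory not found: {target_dir}")
    # Refuse to copy if the index would not fit in the space that is left.
    if source.stat().st_size > shutil.disk_usage(target_dir).free:
        raise OSError("index file is larger than the free space on the RAM drive")
    return Path(shutil.copy2(source, target_dir / source.name))
```

Point the query engine at the returned path instead of the hard drive copy. A RAM drive is wiped on power loss, so keep the original on disk and re-stage it at boot.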

With the database on the RAM or SSD drive, one can access the information within microseconds. All one has to do is see if their system can take a million hits a month or a week. There are test sites out there you can pay to test your web server. With the database on the RAM Drive, it can handle over a million hits a week.
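To put "a million hits" in perspective, here is the arithmetic as a small Python sketch. It only works out the average rate, assuming the hits are spread evenly and a month is 30 days. Real traffic comes in bursts, so the peak rate a server has to survive will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_hits_per_second(hits, days):
    """Average request rate if the hits were spread evenly over the period."""
    return hits / (days * SECONDS_PER_DAY)

weekly_rate = average_hits_per_second(1_000_000, 7)
monthly_rate = average_hits_per_second(1_000_000, 30)

print(f"1M hits/week  = {weekly_rate:.2f} hits per second")   # about 1.65
print(f"1M hits/month = {monthly_rate:.2f} hits per second")  # about 0.39
```

So a search that takes 20 minutes is hopeless, but one that comes back in well under a second keeps up with a million hits a week on average.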

Once you have this much done, then you can put your search engine online on a high-speed connection.

There is a lot more to do like securing the server, but as for the search engine itself, this is it. You're done. At least once a month you should update your database and check your logs.

This can work on any system. The original server for the BIP search engine was on my PowerBook G3 laptop and on a G4 Mac Tower co-located at a data warehouse. We moved up to Xserve Mac servers running in the same space as Google and Yahoo on an OSD23 line.

The funny thing is, if done right, a very basic and limited search engine can run on a Raspberry / Banana / Orange / Nano Pi board or any other small board system. In fact, I would like to see if it can, since in theory it should. It would be limited by its database size.

The demise of BIP came from my "so-called" friends. They got offers from Google, Yahoo, IBM and others to sell the search engine and its code. They were offered several million for it, as I would find out. But I held onto the code I modified, and I was offered nothing. So I took the code and left. They were never able to recover from that.

Title: Re: How to build a Search Engine WebServer
Post by Fernando on Jan 1st, 2017, 10:01pm

History as I remember it: AltaVista (bought out by Google in '98) used Fluid Dynamics Search Engine, as did Ask Jeeves. Yahoo and Ask Jeeves used a modified version of Ht:Dig. Google experimented with several search engines before making one of their own based on the code of Ht:Dig and Juggernaut.

There were many other search engine sites, and they used modified code from the software that was out there. I forget the link but there was a large site that had all the old search engine software. I need to find that website again. A lot of valuable software is (or was) on there.

Title: Re: How to build a Search Engine WebServer
Post by Hondo I. Sackett on Jan 2nd, 2017, 10:22am

Good to know!

Title: Re: How to build a Search Engine WebServer
Post by Fernando on Jan 2nd, 2017, 1:08pm

Fernando wrote:

History as I remember it: AltaVista (bought out by Google in '98) used Fluid Dynamics Search Engine, as did Ask Jeeves. Yahoo and Ask Jeeves used a modified version of Ht:Dig. Google experimented with several search engines before making one of their own based on the code of Ht:Dig and Juggernaut.

There were many other search engine sites, and they used modified code from the software that was out there. I forget the link but there was a large site that had all the old search engine software. I need to find that website again. A lot of valuable software is (or was) on there.

Correction: that second Ask Jeeves should be Ask.com, not Ask Jeeves.