Why would you build a search engine anyway? There is already one search engine to rule them all. You can use Google to find anything on the Internet, and I doubt you’ll ever have the same computing and storage capacity as the big G. So why build your own? To earn money, of course, and to become famous as the creator of the next big search engine. Or because you are a programmer who likes engineering challenges: building a search engine for the whole Internet is tricky, and if you’re like me, you love solving difficult problems. The third use is a custom, high-speed site search for your own large site with thousands of pages. An indexed search engine is much faster than a full-text search function, and if Google’s site search is not flexible enough for your site, you can build your own search feature.

THE BASICS OF SEARCH

The basis of any search engine is a BIG index of words, essentially a long list of words and how they relate to different web pages. To build a search engine you have to do four things:

1. Decide which pages to crawl, and fetch them.
2. Parse the words, phrases and links in those pages.
3. Assign a score to each word or phrase indicating how well it relates to each page, and store the scores in the index.
4. Provide a way for users to query the index and get back a list of relevant web pages.

This is not difficult for an experienced programmer. It can be done in one day if you know regular expressions and have some experience with HTML and databases. Now that you have a working search engine, simply add a large number of computers and hard drives and you’ll soon have the whole Internet indexed. If you’re not ready to go that far, a one-terabyte disk will hold an index of about 50 million pages.

HOW TO SCORE PAGES

Once you have a basic search engine, there is still a lot of work to do before anyone will want to use it. An index is not enough. The challenge is how to score pages so that the user gets the search results most relevant to what he or she is looking for.
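To make the index and the scoring idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the field names, the weights in `FIELD_WEIGHTS` and the function names are made up for this example, and a real engine would keep the index in a database rather than in memory.

```python
import re
from collections import defaultdict

# Hypothetical weights: how much a word counts depending on where it appears.
FIELD_WEIGHTS = {"title": 5.0, "description": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to {url: score}, adding the field weight per occurrence."""
    index = defaultdict(lambda: defaultdict(float))
    for url, fields in pages.items():
        for field, text in fields.items():
            weight = FIELD_WEIGHTS.get(field, 1.0)
            for word in tokenize(text):
                index[word][url] += weight
    return index


def search(index, query):
    """Return the URLs that contain every query word, best total score first."""
    words = tokenize(query)
    if not words:
        return []
    totals = defaultdict(float)
    hits = None
    for word in words:
        postings = index.get(word, {})
        # Keep only pages that matched all the words seen so far.
        hits = set(postings) if hits is None else hits & set(postings)
        for url, score in postings.items():
            totals[url] += score
    # Sort by descending score, then by URL to keep the order stable.
    return sorted(hits, key=lambda url: (-totals[url], url))
```

You would call `build_index` with a dictionary that maps each URL to its extracted fields, for example `{"example.com/a": {"title": "...", "body": "..."}}`, and then pass the result to `search` together with the user’s query. Changing the numbers in `FIELD_WEIGHTS` is the simplest way to experiment with scoring.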
You’ll need to decide how much weight to put on keywords in the title tag, the description and the main content of the page. To get a good score, you’ll also want to boost keywords that appear in the URL of the page and check the anchor text of incoming links. Keeping track of incoming links is the most useful and the most demanding of these: you’ll need a separate database table with information on all the links between your indexed pages.

WHAT TO INDEX AND WHAT NOT TO INDEX

Another obstacle you’ll hit when you start indexing real content on the Internet is the enormous amount of junk floating around everywhere, and eventually your index will fill up with it: affiliate pages, parked domains, “work in progress” home pages without content, link farms used by search engine optimizers, mirror sites using data feeds to create thousands of pages with product listings or other duplicated content, and so on. While indexing the Internet, you will need to find ways to separate this unwanted content from what people actually read and search for. For starters, you could limit how deep into subdirectories you crawl, how many link hops from a domain’s index page you follow, and how many links per page you allow.

PARSING WEB PAGES

There are a million ways, both right and wrong, to write HTML, and when you index the Internet you’ll need to handle every one of them. When extracting keywords from pages, you need to handle not only fully standard HTML but also all the non-standard constructs that browsers informally support. To be able to read every page you will also need to parse client-side JavaScript and handle frames, CSS and iframes. For a general search engine, being able to read all kinds of content is a great deal of work.

WHY SO MANY URLS?

Finally, you’ll need to face the fact that many websites have many links pointing to the same page. Just look at this example:

dmoz.org
www.dmoz.org
dmoz.org/index.html
www.dmoz.org/index.html

All these URLs point to the same web page. If you don’t have special code to handle this, you will soon have four results in your search engine (one for each URL), all for the same page. Users will not like that.

There are also query strings, where a session ID after the question mark in the URL creates an almost infinite number of URLs for the same page:

google.com?SID=4434324325325
google.com?SID=4387483748377
google.com?SID=7654565644466

To the search engine, this looks like a very large number of pages all containing the same content. The quick solution is of course not to index pages that include a query string, or to strip the query string from the URLs. This works, but it also removes a lot of legitimate content (think forums) from your index.

You now have all the information you need to build a site search engine. If you are going for a general Internet search engine, there are many more details to handle, such as robots.txt, sitemaps, redirects, proxies, content-type detection, advanced ranking algorithms and processing terabytes of data. I’ll cover these in more detail in a forthcoming article. Good luck with your next search engine project.
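As a rough illustration of the special code mentioned in the section on URLs, here is a small Python sketch that reduces the example URLs to a single form before they are indexed. The names `canonical_url`, `SESSION_PARAMS` and `INDEX_FILES` are my own, and the rules are assumptions that do not hold for every site: it treats `www.` as optional, treats `index.html` as the directory index, and drops only the parameters it believes are session IDs, so that forum-style query strings survive.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumptions for this sketch: parameter names that carry session IDs,
# and file names that servers usually serve as the directory index.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}
INDEX_FILES = {"index.html", "index.htm"}


def canonical_url(url):
    """Return one normalized form for URLs that probably show the same page."""
    if "://" not in url:
        url = "http://" + url  # add a scheme so urlsplit can find the host
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # assumes www and the bare host serve the same site
    if parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in INDEX_FILES:
        path = head + "/"  # /index.html becomes /
    # Keep every query parameter except the ones that look like session IDs.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))
```

Storing pages under `canonical_url(url)` instead of the raw URL means the four dmoz.org variants end up as one entry, and the three session-ID variants of google.com do too. A URL with a meaningful parameter, such as a forum thread number, keeps that parameter and loses only the session ID.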