gpuse search engine

#DRAFT - document is a bit unstructured. also i need to make a html version.
#By Rene Tegel (nanobit) 2004
#This document is licensed as part of the GPU project documentation.

* frontend statistics explained
* mysql installation
** setting up mysql
** create database and user
* crawler startup
* installing the web server frontend
* setting up a php-based 'forwarding' script
* setting up a (dsl) router port forwarding

==frontend statistics explained==

HTTP code: the response code of the server:

  200: ok
  404: not found
  30x: redirected
  40x: some other client error
  50x: internal server error

All requests are preceded by an HTTP HEAD, so it is possible you see binary types (.exe, .rpm etc.) tagged as '200'. This is perfectly normal: the crawler now knows the url exists, and also that its mimetype is not text based (text/plain or text/html). In that case, the document will not be fetched.

Database time: the total time the thread was interacting with the database. Should be as low as possible. If you see excessive database times all the time, consider lowering the number of crawlers; your computer can't handle that many anyhow.

Sync time: the time a thread waits for other threads for the database to become accessible. Should be low. Note that database time _includes_ the sync time. An occasionally higher sync time is not a problem; if it is always high, consider lowering the number of crawlers.

HTTP time: the time needed to GET (retrieve) the document.

Doc size: the size (in bytes) of the retrieved document.

Word count: the number of unique words in this document.

Url count: the number of unique urls found in this document.

Other params: indicate (no more than that) how much time was spent on specific transactions. URL time, for example, is (about) the time needed to insert the new urls.

Q. Why is the crawler slow after I restarted GPU?
A. It needs to re-cache the id's of all words and common urls.

Q. Why does the amount of memory increase?
A. The crawler caches words and urls. Consider restarting it now and then.

==mysql installation==

Download and install mysql. After installing, create a my.cnf in c:\ (c:\my.cnf). It should look like this:

[mysqld]
basedir=N:/mysql/
datadir=N:/mysql/data/

Warning: if you use mysql 4.1, there may be connection incompatibility issues; see here for what to do: http://dev.mysql.com/doc/mysql/en/Old_client.html
The gpuse search engine currently comes with libmysql 4.0 (client).

mysql 4.1 issues: currently, start mysql 4.1 with the --old-password parameter:

n:\mysql\bin\mysqld-nt.exe --old-password

mysql 4.0 issues: none (at the moment).

mysql 3.2x: might work, but not recommended.

==setting up mysql==

After you have made the mysql config (or are satisfied with the default params), install and launch the mysql server:

c:\> mysql\bin\mysqld-nt install
c:\> net start mysql

You should now get a message like "mysql server successfully started". Log on to mysql:

c:\> cd mysql\bin
c:\> mysql

mysql prompts you if all is fine. The first thing you must do is set up a user (and read the mysql manual on how to protect mysql). Let's say we create a user account for ourselves:

mysql> grant all on *.* to 'me'@'localhost' identified by 'mysecret';
mysql> flush privileges;

This sets up a 'power' user with all privileges, with user name 'me'.
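If you want to double-check the account before continuing, you can ask mysql to list the privileges it just granted (the exact output differs per mysql version):

mysql> show grants for 'me'@'localhost';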
Log out of mysql and test:

mysql> quit;
c:\> mysql -u me -p
Enter password: *******
mysql>

==create database and user==

Connect with the mysql client and perform the following queries:

mysql> create database gpuse;
mysql> grant all on gpuse.* to 'gpuse'@'localhost' identified by 'gpuse';
mysql> flush privileges;

The database name, username and password are currently hardcoded in the crawler; this may change in the future. The search plugin will create the tables itself.

==crawler startup==

After this, launch the gpu engine with the new crawler interface. Also launch the frontend. In the frontend, select 'local status - logbook' and verify that the crawler is connected to the mysql database and has started.

Now it is time to submit some urls. Use the searchfrontend for that, and insert a few of your favourite urls.

==installing the web server frontend==

The searchfrontend has a built-in webserver. For convenience, the url is named 'search.php'. This is not for nothing: it is there to allow php forwarding, which I discuss later. At the moment the templates are hardcoded; I'll change that later. However, there is room for additional content. Currently, the search page links to three external files: the image bandeau-gpu.jpg, faq.htm and about.htm.

To serve those, create a directory 'web' as a subfolder of where searchengine.exe is located (it will map the folder .\web, if it exists). You can put web content in there. So, normally, the location would be (for example):

c:\program files\gpu\web

Any other content you place here can be served as well.

==setting up a php-based 'forwarding' script==

Now the forwarding. In my situation, I have some other webserver running on a freebsd box. I cannot disable it, nor give up port 80; I want that to stay the default. The bsd box is set up to serve virtual directories; one of them is http://nanobit.is-a-geek.net

On the bsd server I placed a tiny php script that 'forwards' the request to my windows (or wined linux) box. Example: I put the searchfrontend webserver to listen on port 81. My windows box has ip 10.0.0.5. Both my windows server / workstation, my programming windows workstation and my bsd box are behind a firewall that forwards port 80 to the bsd box.

A sketch of the php script is given at the end of this document. On loading, it first checks the parameters 'q' and 'i'. If they exist, they are added to the url (don't add an empty 'q': searchfrontend sees that as a search request and reports an invalid/empty query). After the url is built, the script fetches the search.php document from my windows box running gpu. The htm is just included; you could fetch and echo it if you like.

Then I also made a symbolic link to index.php, called search.php:

# ln -s index.php search.php

(Note: for searchfrontend, search.php _is_ the default document.)

Last but not least, I put the logo and the about and faq html's on my webserver as well.

==setting up a (dsl) router port forwarding==

In case you don't run any existing webservers, you can simply set up your firewall to forward incoming connections directly to searchfrontend.

Note: I am not responsible for security issues. Please review the source of visual synapse server (http and serverbase) and the source of searchfrontend (part of the gpu project) if you are concerned about this issue. Visual synapse server is in alpha state, so it probably has some issues.

Happy crawling.
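As mentioned in the forwarding section, here is a minimal sketch of the php forwarding script. Treat it as an illustration rather than the exact script: it assumes the example setup above (windows box at 10.0.0.5, searchfrontend listening on port 81) and uses a plain include() over http, which requires allow_url_fopen to be enabled in php.ini:

<?php
// forward the search request to the box running searchfrontend.
// 10.0.0.5:81 is the example setup from this document; adjust to your own.
$url = 'http://10.0.0.5:81/search.php';
$params = array();
if (isset($_GET['q']) && $_GET['q'] !== '') {
    // don't forward an empty 'q': searchfrontend reports an invalid/empty query
    $params[] = 'q=' . urlencode($_GET['q']);
}
if (isset($_GET['i'])) {
    $params[] = 'i=' . urlencode($_GET['i']);
}
if (count($params) > 0) {
    $url .= '?' . implode('&', $params);
}
// the returned htm is just included; fetching and echoing it works as well
include($url);
?>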