Senga Home Page: Information retrieval software

January 15, 2001

webbase-5.16 is available.
  • -version option shows version number
  • Fix an allocation error when updating the full text index and optimize the handling of name server timeout conditions.
  • Use /etc/my.cnf, ~/.my.cnf and datadir/my.cnf instead of ~/.my.cnf alone

January 14, 2001

unac-1.4.0, Text-Unaccent-1.04 and phpUnac-1.4.0 are available.

  • When unac_string finds an illegal sequence while converting, it replaces it with a space. For instance the ¼ ISO-8859-1 character is converted to 1 4 (one, space, four) because the fraction slash character does not exist in ISO-8859-1.
  • The new unac_version function returns the version number.
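
The replacement rule can be pictured with a short Python sketch. It is not unac's code: Python's unicodedata decomposition stands in for unac's own tables, and the helper name unaccent_latin1 is hypothetical. The ¼ character decomposes into 1, a fraction slash (U+2044) and 4; the fraction slash has no ISO-8859-1 encoding, so it is replaced with a space.

```python
import unicodedata


def unaccent_latin1(text: str) -> str:
    """Strip accents, keeping only what ISO-8859-1 can represent."""
    out = []
    # NFKD splits a character into its base character(s) and combining marks.
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue  # drop the accent itself
        try:
            ch.encode("iso-8859-1")
            out.append(ch)
        except UnicodeEncodeError:
            out.append(" ")  # illegal in the target charset: use a space
    return "".join(out)


print(repr(unaccent_latin1("\u00bc")))  # the one quarter character
```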

January 12, 2001

The Debian distribution is now available for unac-1.3, thanks to Rémi Perrot (remi_perrot@users.sourceforge.net).

January 06, 2001

unac-1.3.0, Text-Unaccent-1.03 and phpUnac-1.3.0 are available.

  • Add support for systems that do not define UTF-16BE but only UTF-16, which is then implicitly big endian. This means that unac works with both glibc-2.1.3 and glibc-2.1.94.
  • Fix occasional allocation bug
  • Allocate returned buffer even if an empty string is given in input.
  • Add more regression tests
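
Why the UTF-16 byte order matters can be seen with a few lines of Python. This only illustrates the encodings themselves, using Python's codecs rather than iconv; how a given glibc version interprets the plain UTF-16 name is as described in the note above, not something this sketch verifies.

```python
text = "\u00e9"  # the character é

big = text.encode("utf-16-be")     # explicit big endian, no byte order mark
little = text.encode("utf-16-le")  # explicit little endian, no byte order mark
generic = text.encode("utf-16")    # Python adds a byte order mark, native order

# The same character yields different byte sequences depending on the order.
print(big.hex(), little.hex(), generic.hex())
```

A library that walks UTF-16 data two bytes at a time has to know which of these layouts it was handed, which is why a fixed, explicit byte order is the safer choice.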

December 28, 2000

Wonderful day: sourceforge.net is so slow that I spent half an hour releasing the files for webbase. sourceforge.net and slashdot.org cannot be reached. They had a broken T1, which I guess explains all these troubles.

webbase-5.15 is available.

  • Implement dynamic updating of the fulltext index.
  • Fix last modified time update bug.
  • Fix mysql-3.23.19a-gamma namespace conflict.
  • Fix bug that artificially left the start point in virgin state.

December 27, 2000

The sourceforge.net files have been updated, at last. The dmoz.org mirror had not been updated since December 17 on the main repository. It was updated today and will be automatically synchronized tonight.

Text-Query-SQL-0.09 is available.

Basic support for mifluz (at alpha stage) and portability changes for perl-5.6.

December 23, 2000

I've not been able to upload the distributions to sourceforge.net for two days now. First they were under a DoS attack and the web site was down, and today the publishing process hangs forever.

webbase-5.14.0 is available.

  • -crawlers option added to run simultaneous crawlers.
  • Signal handling function for graceful interruption of the crawlers.
  • Enable the url, url_complete and url_content tables to grow over 4Gb.
  • The hook library is now dynamically loadable with the -hook option so that specific full indexing strategies can be implemented as plugins.
  • The -where_url option is taken into account when rebuilding the full text index with -rebuild.
  • Extensions and mime types have been added to the list of known mime types.
  • The auth field of the start table was removed because it was not used.

mifluz-0.21 is available.

  • Fix accent handling bug (unac-1.2.0).
  • Add wordlist_locale attribute
  • Better mifluzsearch error handling
  • Improve search traces readability
  • Fix various compilation bugs and warnings
  • mifluzsearch produces better XML output

uri-2.11 is available.

A boundary bug prevented new schemes from being added properly to SCHEMES. Reported by Guillaume Pernot (gpernot@free.fr).

phpUnac-1.2.0 is available.

  • Synchronize with unac-1.2.0.
  • The CVS tree is properly populated and the README was updated.

December 22, 2000

The migration of www.senga.org is finished. The dmoz.org mirror is now updated daily.

unac-1.2.0 is available.

  • Fix endianness problem that shows on RedHat-7.0. UTF-16 strings are always handled as big endian strings using UTF-16BE.
  • Fix prototype mistake (int used where size_t required).

Text-Unaccent-1.02 is available.

  • Synchronize with unac-1.2.0 and some code cleanup.

December 12, 2000

Senga was relocated and is now hosted by Lolix. The migration is not over yet and the search and dmoz demos are broken. I'm working to rebuild them.

November 22, 2000

uri-2.10 is available.

Fix bugs that prevented adding new schemes to the SCHEMES list.

Fix documentation bug about uri_all_path.

October 27, 2000

All these new releases have RPM source and binary packages and were checked to install properly.

webbase-5.13.0 is available.

  • The crawler manual page was completely reviewed for correctness.
  • Bug fixes in the mifluz interface.
  • Implemented the -agent option
  • Added the -show option family to display all URLs information from an exploration starting point.
  • Improved configuration script.
  • Fixed major leaks and concurrency problems in the langrec interface.
  • Widen the scope of allow/disallow comparison to include cgi parameters.
  • Restore code to use .my.cnf files if any.

mifluz-0.20 is available.

  • Full documentation for the WordType class, including attributes wordlist_allow_numbers, wordlist_minimum_word_length, wordlist_maximum_word_length, wordlist_truncate, wordlist_lowercase, wordlist_valid_punctuation
  • Added the mifluzsearch application/cgi-bin
  • Fix cache estimation bug that inhibited the cache_max parameter
  • Fix important entry deletion bug
  • Minor documentation enhancements
  • Minor fixes for FreeBSD-4.1 and redhat-7.0 ports
  • rpm package generation scripts

uri-2.9 is available.

This minor release contains only packaging enhancements.

October 18, 2000

unac debian packages upgraded.

Rémi Perrot (remi_perrot@users.sourceforge.net) provided the final touch in the Debian packaging. A separate branch was created and a README.debian, included in the sources, briefly explains the methodology and the conventions.

unac php3 and php4 interface released.

With help from Andreas Hochsteger (e9625392@student.tuwien.ac.at), a distribution of unac integrated with php3 or php4 is now available. It only works if the installed php allows dynamic loading of extensions. In exchange for this minor drawback the distribution is clean: you do not need to recompile php or apache.

September 30, 2000

unac debian packages are available.

Rémi Perrot (remi_perrot@users.sourceforge.net) kindly provided debian packages for unac. Information to generate them has been included in the CVS tree and will appear in the next version.

September 29, 2000

Text-Unaccent-1.01 is available.

Text-Unaccent is a perl module that provides access to the functions of the unac library.

September 28, 2000

unac-1.1.0 is available.

A bug in error handling was fixed and the configure script has better support for iconv. It has been tested under Solaris-2.6.

September 23, 2000

The creation of the unac project on sourceforge is almost complete. I'm only waiting for the crontab to be kind enough to create the root of the CVS directory :-) While killing time I discovered that senga.net was registered early this year by a restaurant in Japan. Amazing when you know that Agnes loves Japanese food. I guess I'll have to take her to Japan some day and try it. Since I don't read Japanese, maybe someone can help and tell me where it is located? For those wondering what Agnes has to do with senga.org, try reverse(agnes). Senga is also a strawberry variety. I never got a chance to buy some but a friend of mine bought some Senga strawberry jam and it looked good. I did not dare to eat the jam, superstition maybe.

September 22, 2000

unac-1.0.0 is available.

unac is a C library and command that removes accents from a string. For instance the string été will become ete. It provides a command line interface (the unaccent command) that removes accents from an input stream or from a string given as an argument. In both the library function and the command, the charset of the input is specified as an argument. The input is converted to UTF-16 using iconv(3), accents are stripped and the result is converted back to the original charset. The iconv --list command on GNU/Linux shows all supported charsets.
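
The convert, strip, convert back pipeline can be sketched in Python as follows. This is only an illustration of the idea, not the unac API: the function name unaccent and its signature are hypothetical, Python's str plays the role of the UTF-16 intermediate form, and Unicode decomposition is used in place of unac's own tables.

```python
import unicodedata


def unaccent(charset: str, data: bytes) -> bytes:
    """Remove accents from bytes encoded in the given charset."""
    text = data.decode(charset)  # input charset -> Unicode
    decomposed = unicodedata.normalize("NFD", text)
    # After decomposition the accents are separate combining characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode(charset)  # Unicode -> original charset


print(unaccent("iso-8859-1", "été".encode("iso-8859-1")))
```

Taking the charset as an argument mirrors the design described above: the same routine serves ISO-8859-1, UTF-8 or any other charset the conversion layer knows about.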

You will find that the CVS tree and other services that rely on sourceforge are not yet available. The unac project was submitted on Wed, Sept 20 and is still sitting in the queue.

September 13, 2000

langrec-1.1.0 is available.

Langrec is a language recognition library based on public domain dictionaries. Originally written by Marc Leguistin two years ago, it was packaged and ported to Linux by Benoit Orihuela (borihuela@idealx.com). Given a string or a file, the langrec library will return the main language of the text, choosing among Spanish, German, English, Italian and French.

The langrec library is not maintained anymore. However, the full CVS tree is available in the webbase CVS repository. The webbase program has been modified to use langrec, should you choose to (configure time option).

September 12, 2000

webbase-5.12 is available.

An interface with the langrec library has been added to allow language recognition of the crawled documents. The --with-langrec configure flag includes the library. By default the library is not used. The ISO code of the language is stored in the database record describing the URL.

The documentation has been enriched with many drawings depicting the internal architecture of the code.

The interface to the full text library mifluz has been upgraded to mifluz-0.19.

Manual pages for consistentc have been added and the man page of crawler was reviewed.

All known leaks have been fixed.

The installation scheme was redesigned to match the GNU/Linux standards

The configure script was reviewed and enhanced.

Some extensions (.tap, .rar ...) were added to the list of known mime types.

A cache for hostnames lookup has been added to reduce the DNS traffic.
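
The idea behind such a cache can be sketched in a few lines of Python. This is an assumption-laden illustration, not webbase's implementation: the HostCache class is hypothetical, and it ignores expiry of cached entries, which a long-running crawler would need to consider.

```python
import socket
from typing import Callable, Dict


class HostCache:
    """Remember the address of each host name after the first lookup."""

    def __init__(self, lookup: Callable[[str], str] = socket.gethostbyname):
        self._lookup = lookup
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of real DNS queries performed

    def resolve(self, hostname: str) -> str:
        key = hostname.lower()  # host names are case-insensitive
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._lookup(key)
        return self._cache[key]
```

Since a crawl typically fetches many URLs from the same site, every URL after the first one on a host is served from the dictionary instead of the name server.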

A usage option was added to the crawler program to replace the terse usage.

A -where_url option was added to control the scope of URLs given to the indexer when rebuilding the full text index (mifluz) from scratch.

Filtering of URLs during the crawl can now be made using regular expressions.

August 24, 2000

uri-2.8 is available.

In some places isalnum or isspace were called with char instead of unsigned char, so negative lookups in tables occurred. The type was changed to prevent this problem.
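
The problem is specific to C, but the arithmetic behind it can be shown in Python with ctypes. The sketch assumes a platform where plain char is signed (this is implementation-defined in C); the table_index helper is hypothetical and only mimics a 256-entry classification table.

```python
import ctypes

byte = 0xE9  # "é" in ISO-8859-1

as_signed = ctypes.c_byte(byte).value     # value seen through a signed char
as_unsigned = ctypes.c_ubyte(byte).value  # value seen through an unsigned char


def table_index(value: int) -> int:
    """Check an index into a 256-entry table the way C would need to."""
    if not 0 <= value < 256:
        raise IndexError(f"index {value} is outside the table")
    return value


print(as_signed, as_unsigned)
```

Any byte with the high bit set, which includes every accented ISO-8859-1 letter, becomes a negative index when read through a signed char. Converting to unsigned char before the lookup keeps the index inside the table.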

The WLROOT global variable was permanently replaced by the uri_set_root and uri_get_root functions.

July 21, 2000

mifluz-0.19 is available.

  • Add SWIG friendly defines
  • Fix the buggy WordList::Prefix implementation.
  • Add -p option to mifluzdict to dump dictionary entries matching a given prefix.
  • The config.h and db.h headers were missing in installation.
  • The mifluz.h had a reference to the old htconfig.h header that was replaced by the config.h header
  • The configuration process now bombs if zlib was required by the user and not found.

July 20, 2000

Search-Mifluz-0.08 and DBD-Mifluz-0.04 are available.

Both have been modified to match the new API of mifluz-0.18. I've not encountered major problems in the process. The new class WordDict has an interface in Search-Mifluz but I did not feel the need to give access to it from DBD-Mifluz. I had to fix two bugs in mifluz-0.18, therefore it's better to use these modules with the version from the CVS tree, or to wait for mifluz-0.19.

July 17, 2000

mifluz-0.18 is available.

The internal re-architecture of mifluz is finished. A lot of testing is yet to be done, but at this point I decided to keep the architecture of the index for at least one year and focus on polishing the API, testing and interfacing with foreign languages.

The index is now self contained (no more extra files to hold compression information) and uses Berkeley DB sub-databases facilities to separate the various logical parts. This choice will make it easy to write additional modules, simply by adding functions and a new sub-database to the index.

Here is, briefly, the list of modifications since 0.17:

  • Upgrade to Berkeley DB 3.1.14
  • weakcmpr file integrated in inverted index.
  • inverted index now contains many logical files (dictionary, meta information, inverted index, list of temporary files)
  • merge configure.in files from top level source directory and db directory.
  • New class WordDict assigns unique numbers to words and keeps statistical information.
  • New class WordMeta handles serial numbers and locks.
  • New class WordDead holds the list of deleted documents for deferred deletion.
  • WordKey format changed, now holds only numbers.
  • WordRecord format changed, can hold a single integer or a string.

June 21, 2000

mifluz-0.17 is available.

Mifluz is being reworked deeply, learning lessons from large indexing attempts (around 12 Gb). The index is now a single file, even if compressed. Internally it contains logical files (the index and the dictionary) as well as some meta information such as the current values of serial numbers.

A major speedup for bulk loading has been implemented using temporary files merged in the manner of the sort command. A typical bulk insertion that took 9h to run now takes around 1h.
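
The general technique, sorted runs written to temporary files and then merged, can be sketched in Python. This shows the principle only and is not taken from the mifluz sources: the function names and the run_size parameter are hypothetical, and entries are assumed to be single-line strings.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def write_run(run: List[str]) -> str:
    """Sort one batch in memory and store it in a temporary file."""
    fd, path = tempfile.mkstemp(suffix=".run", text=True)
    with os.fdopen(fd, "w") as handle:
        # Sort with the newline attached so the merge compares the same strings.
        handle.writelines(sorted(entry + "\n" for entry in run))
    return path


def bulk_load(entries: Iterable[str], run_size: int = 1000) -> Iterator[str]:
    """Yield all entries in sorted order using bounded memory."""
    paths, run = [], []
    for entry in entries:
        run.append(entry)
        if len(run) >= run_size:
            paths.append(write_run(run))
            run = []
    if run:
        paths.append(write_run(run))
    files = [open(path) for path in paths]
    try:
        # heapq.merge reads the sorted runs lazily, like "sort -m".
        for line in heapq.merge(*files):
            yield line.rstrip("\n")
    finally:
        for handle in files:
            handle.close()
        for path in paths:
            os.remove(path)
```

Feeding the index in sorted order turns random insertions into sequential ones, which is the usual reason a merge-based bulk load is so much faster.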

The rearchitecture work is not finished and the next step is to change the key structure from word/number/number to number/number/number, getting rid of actual words in the inverted index since they are now stored and assigned a serial number in the dictionary. This version is a step before this major change.
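
The planned key change can be pictured with a small Python sketch. Everything here is illustrative: the WordNumbers class is a hypothetical stand-in for the dictionary, and the meaning given to the two numeric fields (a document number and a location) is an assumption, since the note above does not say what they hold.

```python
from typing import Dict, Tuple


class WordNumbers:
    """Hypothetical dictionary that assigns a serial number to each word."""

    def __init__(self) -> None:
        self._serial: Dict[str, int] = {}

    def serial(self, word: str) -> int:
        # The first sighting of a word assigns the next serial number.
        return self._serial.setdefault(word, len(self._serial) + 1)


def to_numeric_key(words: WordNumbers, key: Tuple[str, int, int]) -> Tuple[int, int, int]:
    """Turn a word/number/number key into a number/number/number key."""
    word, document, location = key
    return (words.serial(word), document, location)


words = WordNumbers()
print(to_numeric_key(words, ("senga", 7, 3)))
```

With the word stored once in the dictionary, every key in the inverted index has the same fixed numeric shape, however long the word is.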

The documentation has been proofread by the GNU volunteers and the reference manual was included in the texinfo document to fit the GNU conventions. Makeinfo is now used to generate the HTML documentation.

A synchronization point was done with the Ht://Dig group at the end of May. It introduced in Ht://Dig the new Berkeley DB 3.0.55. A new synchronization point will be done when the new architecture is ready. The synchronization scripts have been modified to only show a diff; all patches must be applied by hand. This takes longer but prevents mistakes. Applying patches requires thinking and review and should not be done blindly.

New utilities to manipulate the index have been added (mifluzdump, mifluzload, mifluzdict). The htdb_* utilities should now be used keeping in mind that sub-databases are used (typically with the -s and -l options).

The compression has been re-implemented from scratch using the same design ideas. It is more robust and much less hairy. The WordBitCompress class could be used in various contexts with success. All the half implemented *Vector* classes have been removed.

As of yesterday the regression tests ran 100% purify clean. Using quantify allowed us to improve performance by 20% by using the WordList::Override method instead of the WordList::Insert method where possible.

The interface was changed in a major way to help thread safety. The WordContext object is now central and all other objects are created from it. All objects have a pointer to their environment so that they can access global information. This adds a small space overhead but completely prevents the use of static variables.

The search example (test/search.cc) was enhanced with an estimation of the total number of matches and the ability to ask for the last valid interval of matches, even if the search offset is way beyond the last possible match.
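
One way to compute such a "last valid interval" is sketched below in Python. It is a guess at the behaviour described, not a transcription of test/search.cc: the match_window helper is hypothetical and it assumes result pages aligned on multiples of the page size.

```python
from typing import Tuple


def match_window(total: int, offset: int, count: int) -> Tuple[int, int]:
    """Return the [start, end) range of matches to display.

    total  -- number of matches (or its estimate)
    offset -- index of the first match requested
    count  -- number of matches per page, must be positive
    """
    if total <= 0:
        return (0, 0)
    if offset >= total:
        # Fall back to the last page that still contains matches.
        offset = ((total - 1) // count) * count
    return (offset, min(offset + count, total))


print(match_window(25, 1000, 10))
```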

The structure of a record (WordRecord) was completely changed. It can now be either nothing (NONE), a single integer value (DATA) or a string (STR). The previous format was much too restrictive for general use.

The WordList class is now abstract and the original implementation may be found in the WordListOne class. The WordListMulti class is not implemented yet and will allow manipulating many indexes as one.

May 3, 2000

We are glad to announce that mifluz was adopted by the GNU project this week. It will be referenced as a GNU product in the www.gnu.org pages. The review process by the FSF took around three weeks. RMS asked to fix some legal issues and suggested that mifluz provides a C interface as soon as possible.

mifluz-0.16 is available.

This is a maintenance release that contains updates of the documentation and the copyright notices.

A manual page was added for the Configuration class. All the manual pages are integrated in the texinfo documentation because RMS asked for it. We keep the manual pages since it's often more convenient. Both are generated using the ad-hoc man/man_generate perl script.

The LICENSE file was missing for the Berkeley DB files and a copyright section was added to the README file.

April 21, 2000

Text-Query-SQL-0.07 is available.

A driver was added for Postgres by Benjamin Drieu (bdrieu@april.org) and the mifluz support is complete. Some documentation was added to describe the internal structure of the syntax tree.

April 20, 2000

Search::Mifluz-0.07 and DBD::Mifluz-0.03 are available.

They both follow the mifluz-0.15 release and integrate the new version of the search algorithms. A full set of options and parameters has been added to control every aspect of the query mechanism. A manual page is also provided for both modules.

Search::Mifluz has been checked with purify and does not contain memory leaks when running the full test suite.

mifluz-0.15 is available.

Most of the work was dedicated to the search algorithm implementation. It is still in the test directory (test/search.cc) but has evolved to a stable state. It is able to resolve structured queries (boolean) and simple queries using semantics similar to the simple AltaVista syntax. When resolving a query, the search process uses classes that are derived from the WordCursor class and that have similar semantics. A few changes have been made to the WordCursor class to support derivation.

A major performance enhancement was done by using the WordList::Override method instead of the WordList::Insert method. Berkeley DB does a lot of work when the DB_NOOVERWRITE flag is set. Always use Override instead of Insert if possible, it saves around 30% of CPU cycles on an insertion intensive process.

Some compilation/linking problems were fixed with the help of Orion (FreeBSD) and Peter Marelas (Solaris). The official (www.sleepycat.com) patch for Berkeley DB 3.0.55 was applied.

bert@senga.org found that the combination of RedHat-6.2 + linux-2.3.51 provides transparent support for large files (>2Gb) and is reasonably stable. Although recent tests show that failure occurs when creating an inverted index larger than 4Gb, there is hope.

Replacement functions (memmove, etc.) were added in the new clib directory and are used only if needed with a simple scheme. The functions and methodology were taken from Berkeley DB. The htconfig.h header is now included by every source file so that replacements are activated only when needed.

March 22, 2000

mifluz-0.14 is available.

The main changes in this version are distribution architecture, index sharing between processes, documentation, benchmarks, regression tests and bug fixes.

Upgrade to Berkeley DB 3.0.55. The db directory is now a flat directory. This is no longer a traditional Berkeley DB distribution. Renamed all external symbols of Berkeley DB to avoid conflicts with installed versions. This problem was a pain in the neck for programs linked (statically or dynamically) with an original Berkeley DB distribution, such as Perl. There is now only one library to link (-lmifluz instead of -lmifluz and -lhtdb). To prevent linkage errors with C programs (static and dynamic), mifluz does not use iostream anymore, only stdio.

The description of the key has been greatly simplified and is now properly documented. The internal structure of the key has also been simplified and mifluz-0.14 indexes are not compatible with mifluz-0.13 indexes. No conversion program is provided.

The zlib library is no longer required to compile mifluz.

The WordSearchDescription was renamed WordCursor for clarity.

Support was added for resource sharing among multiple processes. Two distinct processes may use the same inverted index at the same time and share the same cache. This is especially important when running cgi-bin for query (either standalone or with fast-cgi or mod_perl).

The API is now fully documented in manual pages, the entry point is the mifluz manual page. This page contains an explanation of all the possible parameters to tune mifluz. They are also repeated in the manual pages of the classes that implement each specific feature. The texinfo guide has been reviewed completely for accuracy, to remove redundancy with manual pages and a chapter on cache tuning was added.

The monitoring feature has been re-implemented and is now used at the Berkeley DB level. The benchmarks can use the monitoring class (make MONITOR=-m dobench) and the output is automatically fed to the new benchmark-report utility to build graphical benchmark reports. Four benchmark reports are distributed with the package in the test/benchmark directory. We encourage everyone to send their results on various architectures/machines.

Regression tests were added for the htdb* utilities, the example, shared index files, readonly index files. They are far from complete but it's progressing.

A few very tricky bugs in the compression were fixed, refer to the ChangeLog for details on this subject and other bug fixes.

Last but not least, a CONTRIBUTORS section was added to the README file to list people who significantly helped mifluz to progress.

A new release was built for Search-Mifluz and webbase so that they compile with this new version.

February 24, 2000

Search-Mifluz-0.05 and DBD-Mifluz-0.01 are available.

Search-Mifluz-0.05 matches mifluz-0.13.0. The major addition is search capabilities (see t/search.t) whose implementation is based on the example in mifluz-0.13.0 (test/search.cc).

The new DBD-Mifluz-0.01 is a DBI driver for Search-Mifluz, mainly useful to allow persistent connections using Apache::DBI when running cgi-bin scripts under mod_perl.

Those two modules do not provide all the functionalities one may dream of. However, they allow integration with Catalog and other cgi-bin scripts to start as needed. The building blocks are here, at last.

Text-Query-SQL-0.06 is available.

A driver was added for the mifluz inverted index. It simply returns the syntax tree built from the query: the Search::Mifluz module knows its structure and is able to resolve a query using it.

February 23, 2000

webbase-5.9 is available.

Maintenance release to keep in sync with mifluz-0.13.0. The URL arguments given to crawler are now canonicalized so that it's not possible to forget a trailing slash or provide an ill-formed URL.
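
What canonicalizing a start URL involves can be sketched with Python's urllib.parse. The rules chosen here (lower-case scheme and host, an empty path becoming /) are common conventions and an assumption on my part; the note above does not list the exact rules applied by crawler, and the canonicalize function is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Return a normalized URL or raise ValueError if it is ill-formed."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"ill-formed URL: {url!r}")
    scheme = parts.scheme.lower()
    # Simplification: this lower-cases the whole authority, user info included.
    netloc = parts.netloc.lower()
    path = parts.path or "/"  # supply the forgotten trailing slash
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(canonicalize("http://www.senga.org"))
```

Normalizing before storage means that http://www.senga.org and http://www.senga.org/ end up as one starting point instead of two.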

February 22, 2000

mifluz-0.13 is available.

An elaborated example of search algorithm taking advantage of the inverted index structure is now provided in the test/search.cc file. It uses few resources and even provides relevance ranking.

The Walk function has now reached a mature form. It has been split into functions that look like iterator methods (WalkInit, WalkNext, WalkFinish). The underlying mechanism has been reviewed thoroughly and partially re-implemented to overcome limitations or inefficiencies.

The WordDB class was dramatically simplified by removing the hideous mixture of return code conventions.

Some important bugs were fixed. When using the WordList::Override method the reference count was not updated correctly. ASCII number representations in WordKey were parsed with atoi instead of strtoul, forbidding values with the high bit set. The compression and WordKey packing were not endian clean.

The monitoring (WordMonitor) is now turned off by default and not even allocated. This isolates the rest of the code from potential bugs in the monitoring and allows WordMonitor to evolve in a more independent way.

The allocation method of the Berkeley DB compression scheme was clarified. It is no longer allocated by WordDB but by WordDBCompress which makes a lot more sense.

Some obsolete debugging variables and classes were removed.

The headers are SWIG friendly. For those who don't know SWIG yet, it's a wonderful tool that makes it easier to provide scripting language interfaces.

The distribution has been changed completely. Headers are now hidden in a mifluz subdirectory to prevent pollution. The former libhtdb and libhtword have been merged in libmifluz. In the sources, the CVS tree no longer depends on htdig and a set of scripts was written to keep in sync. This requires more discipline from the developer's point of view but makes mifluz less dependent on the current htdig state. As a side effect, the ChangeLog of mifluz now contains all the information related to modifications made in the former htlib and htword directories. It's not necessary to navigate from ChangeLog to ChangeLog.htdig anymore.

January 30, 2000

webbase-5.8 is available.

Lots of warnings and a memory leak were fixed. A nasty configuration problem related to the socklen_t type was fixed with the implementation of the AC_PROTOTYPE macro.

January 27, 2000

Catalog-1.02 is available.

The dmoz loading process has been dramatically simplified. It is now only available as a command. No more fancy web interface that confuses everyone. In addition the convert_dmoz script now generates text files that can be directly loaded into Catalog instead of the intermediate XML file. The whole loading process now takes from one to two hours depending on your machine. It took around 10 hours with the previous version.

The -exclude option was added to convert_dmoz to get rid of a whole branch of the catalog at load time. Typical usage would be convert_dmoz -exclude '^/Adult' -what content content.rdf.gz.
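
The effect of such a pattern can be illustrated in Python. How convert_dmoz applies the expression internally is not described here, so the sketch assumes it is matched against the category path; the exclude_branch helper and the sample paths are hypothetical.

```python
import re
from typing import Iterable, Iterator


def exclude_branch(paths: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield only the category paths that do not match the pattern."""
    rejected = re.compile(pattern)
    for path in paths:
        if not rejected.search(path):
            yield path


categories = ["/Adult/Arts", "/Computers/Software", "/Kids/Adult_Education"]
print(list(exclude_branch(categories, "^/Adult")))
```

The leading ^ anchors the pattern at the root of the tree, so only the top level branch is dropped and categories that merely contain the word elsewhere in their path are kept.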

A lot more sanity checks and repair have been added to deal with duplicates, category id conflicts and the like.

Hopefully this new method will also be more understandable and generate less traffic on the mailing list. There is room for improvements and contributors are welcome.

January 23, 2000

The bug tracking lists are now hosted on SourceForge. Existing entries have been moved. Since SourceForge also provides a Task Management database, the bug lists have been split between real bugs and tasks. Each product has a link to the Task Manager and Bug Tracker in the left menu.

January 22, 2000

For the benefit of everyone (but mainly because it's painful to maintain :-) we've moved all the CVS trees to SourceForge. The immediate benefit is that you can get anonymous access to the CVS tree. You will also be able to browse it on the web. I can't believe this has not been done on Senga already. But that's the whole story: maintaining the technical tools behind Senga is a time consuming job. During next week we will be moving the bug tracking and downloads to SourceForge and at last the home page of each product.

A link to the CVS page has been added in the left menu of every product, feel free to try it.

January 17, 2000

mifluz-0.12, webbase-5.7 and uri-2.7 are available.

This set of versions must be used together. See each product page for more information on the modifications.


January 12, 2000

uri-2.7 is available.

Renamed the uri struct member to _uri because some compilers do not like the original name and treat it as a name clash.

January 3, 2000

mifluz-0.11 is available.

Several bug fixes, speedups, and code cleanups. Added possibility to monitor what's going on inside the indexing. Preparing for full scale, real-world tests.

December 16, 1999

mifluz-0.10, webbase-5.6 and uri-2.6 are available.

This set of versions must be used together. See each product page for more information on the modifications. We've fixed memory leaks, configuration errors and bugs.

December 09, 1999

mifluz-0.9 is available.

A new compression algorithm was implemented. It reduces the index size by a factor of 8 compared to an uncompressed index. It works in the same context as the previously implemented compression (it compresses/uncompresses pages within Berkeley DB when they are written/read to the db file), but the compression algorithm is specifically designed for compressing DB pages (the previous compression used zlib). Since pages are generally full of redundant data this can achieve good compression ratios.
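
Why a page of sorted keys is so redundant can be shown with a toy Python example. To be clear, this is not the algorithm used in mifluz, which is not described here; it is a generic delta plus variable-length encoding, chosen only to show how much space neighbouring numeric keys can share. All function names are hypothetical.

```python
from typing import List


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer using 7 bits per byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_page(keys: List[int]) -> bytes:
    """Store each key as the difference from the previous one."""
    out = bytearray()
    previous = 0
    for key in keys:  # keys must be sorted in increasing order
        out += encode_varint(key - previous)
        previous = key
    return bytes(out)


def decompress_page(data: bytes) -> List[int]:
    """Rebuild the original keys from the encoded differences."""
    keys, current, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            keys.append(current)
            value, shift = 0, 0
    return keys
```

Because consecutive keys in a sorted page differ by small amounts, most differences fit in a single byte instead of a full machine word.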

December 8, 1999

Search-Mifluz-0.01 is available.

This is the pre-release version of the Perl interface to mifluz. It was generated using SWIG. We had to patch SWIG in order to achieve proper package encapsulation. The patches will be integrated in the next SWIG version but at present they are included in the Search-Mifluz distribution.

The release of Search-Mifluz was also the opportunity to use SourceForge as a repository for the project. SourceForge provides all facilities available on Senga for OpenSource projects. If we're satisfied with SourceForge for Search-Mifluz, we will consider moving all the products to SourceForge. It's much easier to contribute to a shared source distribution environment than dealing with it on our own :-)

December 7, 1999

webbase-5.5 is available.

In this minor maintenance release we've fixed a few leaks and a memory overrun. It has been tested on a set of 150 000 URLs, some of them containing really weird data.

November 29, 1999

webbase-5.4 is available.

The most important thing is that many memory leaks have been removed. The crawler has been extensively tested (around 2 million URLs crawled on 150 000 different web sites). The mifluz full text indexing library is now integrated. It generates very big indexes at present but will improve dramatically next week thanks to Marcel Bosc. For more information on this subject refer to the mifluz mailing list and the htdig3-dev mailing list (on htdig). The hook to the full text indexing library is located in the new hooks library.

In order to definitely fix the problems related to long URLs, the url field is now a text field. To resolve the indexing issue, a field was added to the url and start table: url_md5. Following the same idea, the directory tree that contains the temporary copies of the pages (WLROOT) now contains cryptic MD5 based file names. It's activated by default with the version 2.4 of the uri library.
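
The reasoning behind url_md5 is that a digest has a fixed, short length whatever the length of the URL, which makes it suitable as an index key or a file name. A minimal Python sketch follows; whether webbase stores the digest in hexadecimal, and how the files are laid out under WLROOT, is not stated above, so the hex form and the flat cache_path layout are assumptions.

```python
import hashlib
import os


def url_md5(url: str) -> str:
    """Return a fixed-length key for a URL of any length."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def cache_path(root: str, url: str) -> str:
    """Hypothetical layout: one file per URL, named after its digest."""
    return os.path.join(root, url_md5(url))


long_url = "http://www.senga.org/search?" + "&".join(f"p{i}=v{i}" for i in range(200))
print(len(long_url), len(url_md5(long_url)))
```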

The MySQL connection functions have been upgraded so that they take into account a ~/.my.cnf file. Always using -user, -password etc. is not mandatory anymore.

The -schema option was added to crawler and displays the builtin database schema. It's useful if you want to add fields of your own in the start table.

Thanks to Bertrand Demiddelaer who fixed a timeout problem. Many other small bugs were fixed while testing, refer to ChangeLog for detailed information.

November 05, 1999

mifluz-0.8 is available.

Version 0.7.0 forgot to include the examples subdirectory... Some portability and bug fixes. The docs on the API were extended, some examples were added to help starting up with mifluz.

The storage key (WordKey) class has evolved a bit: accessors for getting numerical fields were added. Input operators for streaming were added to WordKey, WordList, WordReference...

A speed-up that skips useless sequential walking when using partially defined search keys was added, as well as tests.

The use of the (important) WordList::Walk method was simplified.

October 12, 1999

mifluz-0.6 is available.

After two months of maturation and coding, the first working version of mifluz-0.6 is finally available. It is in alpha stage but we strongly believe that the architectural choices are appropriate and will allow mifluz to reach maturity rapidly. It provides very few functionalities and is merely an inverted index manipulation library. It knows nothing about parsing documents or displaying search results.

We worked very closely with the Ht://dig Group and Berkeley DB staff. mifluz-0.6 is used in the 3.2 version of Ht://dig (or mifluz-0.6 is a packaging of the Ht://dig indexing library, depending on your point of view :-). We implemented a transparent compression layer in Berkeley DB 2.2.7 that will (maybe) be included in future releases of Berkeley DB.

A new developer, Marcel Bosc (bosc@senga.org), joined Senga two days ago. He will eventually take over on mifluz. The work required is huge and having someone working full time on this subject is great news. The immediate future is to integrate mifluz with the crawler and Catalog.

September 7, 1999

Catalog-1.01 is available.

This is a maintenance release.

  • Various bug fixes. All easy-to-fix bugs have been fixed. Take a look at bugzilla to see what hasn't been fixed.
  • The syntax of the _PATHTEXT_ and _PATHFILE_ tags has been extended to specify a range of path components.
  • Graham Barr added a recursive template feature for a catalog root page. This makes it possible to show sub-categories of the root categories on the root page of a catalog.

Don't hesitate to submit bugs or ideas to bugzilla. Hopefully the next version of Catalog will have a fast full text indexing mechanism and I'll be able to implement new functionalities.

Have fun !

July 13, 1999

The first releases of the URI manipulation C library (uri) and the internet crawler C library (webbase) are available. These two libraries are core components of our search engine. One might say: what? Another internet crawler? We already have dozens! Of course there is a difference with this one: it is able to efficiently crawl millions of URLs. The crawler information is stored in a MySQL database.

July 6, 1999

The whole www.senga.org site has been restructured. The home page now contains general information about Senga. The top level menu on the left gives access to the bug tracking system for all the products (Bug Track) and to a catalog of resources that we use for development (Links). The Products page points to all the products or development projects at Senga. This is where you will find Catalog.

July 3, 1999

Catalog-1.00 is available.

This release includes PHP3 code to display a catalog. The author is Weston Bustraan (weston@infinityteldata.net). The main motivation to jump directly to version 1.00 is to avoid version number problems on CPAN.

July 2, 1999

Catalog-0.19 is available.

This is a minor release. The most noticeable addition is the new search mechanism.

  • Searching: two search modes are now available, AltaVista simple syntax and AltaVista advanced syntax. Both use the Text-Query and Text-Query-SQL perl modules.
  • Dmoz loading is much more fault tolerant. In addition it can handle compressed versions of content.rdf and structure.rdf. The comments are now stored in text fields instead of char(255).
  • The template system was extended with the pre_fill and post_fill parameters.
  • Searching associated to a catalog dumped to static pages is now possible using the 'static' mode.
  • Fixed two security weaknesses in confedit and recursive cgi handling.
  • Many sql queries have been optimized.
  • The configuration was changed a bit to fix bugs and to isolate database dependencies.
  • The tests were updated to isolate database dependencies.
  • Fixed numerous minor bugs, check ChangeLog if you're interested in details.

Many thanks to Tim Bunce for his numerous contributions and ideas. He is the architect of the Text-Query and Text-Query-SQL modules; Eric Bohlman and Loic Dachary did the programming.

Thanks to Eric Bohlman for his help on the Text-Query module. He was very busy but managed to spend the time needed to release it.

There is not yet anything usable for full text indexing, but we keep working on it. The storage management is now handled by the reiserfs file system thanks to Hans Reiser, who is working full time on this. Loic Dachary does his best to get something working; if you're interested, go to http://www.senga.org/mifluz/.

For some mysterious reason CPAN lost track of the Catalog name. To install Catalog you should use perl -MCPAN -e 'install Catalog::db'. Weird but temporary.

Have fun !

May 26, 1999

There currently are four contributors to Catalog. Here they are:

  • Tim Bunce (Tim.Bunce@ig.co.uk) is working on a commercial project involving Catalog. He fixes bugs, changes the programming interface and has ideas on how to do things.
  • Christophe Le Bars (clb@alcove.fr) is packaging Catalog for Debian.
  • David Walker (dwalker@c-wheeler.agelena.net) is adding Postgres support.
  • Weston Bustraan (weston@infinityteldata.net) works on PHP3 code to display the content of Catalog.

Of course I won't be posting this list on the home page every month. If you want to know who's working on what you can bookmark the list of assigned tasks.

May 18, 1999

Catalog-0.10 replaces the Catalog-0.9 version published yesterday because of an installation bug that makes it completely unusable except for people upgrading from Catalog-0.5. Thank you for your patience.

May 17, 1999

Catalog-0.10 is available.

This is a maintenance release. We are happy to announce that Catalog is now available at your nearest CPAN mirror. The bug tracking system installed two weeks ago proved very useful. It allows anyone to enter bug reports, ideas and suggestions about Catalog. If you are in need of commercial support for Catalog, two new companies are entering the business: Alcove and Atrid (for details go to the support page).

  • The Bundle::Catalog module has been changed to include Catalog to simplify the installation process.
  • The installation procedure has been simplified a bit and now includes the possibility to re-use an existing configuration and to specify the installation root of MySQL.
  • The dmoz.org loading process is better documented and the interface now clearly explains the loading steps.
  • The Catalog directory containing the documentation is now created by the installation process.
  • Tim Bunce's bug fixes and enhancements have been integrated.
  • A FreeBSD 3.1 section was added to the installation process. The makefiles no longer depend on GNU Make, except for the documentation makefile. We strongly suggest using GNU Make :-)
  • Contributions guidelines and script have been added (CONTRIBUTIONS file). It provides a framework to easily contribute to the software, using the latest development branch.
  • A memory leak has been found in XML-Parser-2.23; we strongly recommend using XML-Parser-2.22 instead if you manipulate large amounts of data such as dmoz.org.
  • The loading of dmoz.org is now resistant to duplicates in the author section.
  • A bug in the _PATH_ tag handling was fixed. Additional tags have been added to give access to individual path components (_PATH0_, _PATH1_ ...).
  • A first step was made to make the code database independent. There is still some work to be done. If you have experience with Oracle, Informix or Postgres, you could already provide the table definitions and the database configuration procedures.
  • The verbosity of the error messages has been reduced.
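
To give an idea of what per-component path tags look like in practice, here is a minimal Python sketch. It is not Catalog's Perl implementation: it assumes that _PATHn_ is replaced by the n-th component of the category path, counted from zero, and that a tag with no matching component expands to an empty string. The function name expand_path_tags is hypothetical.

```python
import re


def expand_path_tags(template, components):
    """Replace each _PATHn_ tag with the n-th path component.

    Assumptions: numbering starts at 0 and out-of-range tags expand to
    an empty string.
    """
    def replace(match):
        index = int(match.group(1))
        return components[index] if index < len(components) else ""

    return re.sub(r"_PATH(\d+)_", replace, template)


print(expand_path_tags("<b>_PATH0_</b> / _PATH1_", ["Sport", "Events"]))
```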

For more details on bug fixes you can search the bug tracking system (bugzilla). We are working hard on the full text indexing library. There will be more on this subject very soon.

Have fun !

May 2, 1999

The Bugzilla bug tracking system is installed at http://www.senga.org/bugzilla/. It is used not only to report bugs in Catalog but also to suggest enhancements or new features. Anyone can add an entry, go ahead !

April 19, 1999

Catalog-0.5 is available.

The main features added to this version are:

  • XML external representation of a thematic catalog. This allows easy export and import of existing catalogs. The XML format is a custom one and you could argue that we should have used XML/RDF instead. The lack of tools handling XML/RDF prevented this.
  • A new module has been derived from Catalog to display and manage the dmoz (www.dmoz.org) catalog. This effectively allows anyone to run a mirror of dmoz. The database is only 400Mb for 400 000 URLs and 65 000 categories. Response time is really fast provided you've installed Apache + mod_perl.
  • The Makefiles and installation procedures have been rebuilt from scratch for more flexibility and clarity.
  • A Perl bundle was added to automate the installation of dependent modules. This became really necessary since Catalog now depends on 9 external modules found on CPAN.

Although Catalog was added to CPAN last month, the module list has not been re-generated since then and we are waiting for it impatiently.

A mirror of dmoz.org has been loaded to show that Catalog is able to handle a large number of records and categories.

March 16, 1999

Catalog-0.4 is available.

The main features added to this version are:

  • Intuitive browsing : /cgi-bin/Catalog/Sport/Events/Tennis/ will display the expected category content. This is much more readable than the name=catalog&context=cbrowse&id=3 parameters.
  • Static dump : the whole catalog can be dumped in a directory tree that replicates the category structure. The result may be copied and browsed using only static HTML pages. This can be very convenient if your web site is not cgi-bin enabled.
  • Search function : the thematic catalogs may now be searched in full text. Category names and record contents are searched. The search may be limited only to the category names or only to the record contents.
  • A complete example is installed with Catalog. A chapter was added to the documentation to comment the example. It is a step by step guide to configure the catalogs. The example contains a thematic catalog, a chronological catalog and an alphabetical catalog.
  • Option in configuration files for nph scripts.
  • The configuration generated by Makefile.PL is saved and reused in the config.cache file so that repeated installations do not require answering the same questions multiple times.
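
The intuitive browsing and static dump items above both rely on mapping a category path to its components. The following Python sketch illustrates the idea only; it is not Catalog's Perl code. The function names category_path and static_dump_path are hypothetical, and the index.html file name used for dumped pages is an assumption.

```python
import posixpath


def category_path(url_path, script_prefix="/cgi-bin/Catalog"):
    """Split a browsing URL path into its category components."""
    if not url_path.startswith(script_prefix):
        raise ValueError("path does not start with the script prefix")
    rest = url_path[len(script_prefix):]
    # Empty parts come from the leading and trailing slashes.
    return [part for part in rest.split("/") if part]


def static_dump_path(components, page_name="index.html"):
    """Return the relative file that would hold a category in a static dump.

    Assumption: one directory per category level, one page per directory.
    """
    return posixpath.join(*components, page_name)


parts = category_path("/cgi-bin/Catalog/Sport/Events/Tennis/")
print(parts)
print(static_dump_path(parts))
```

Because the dumped directory tree mirrors the URL structure, the same category path can address either the cgi-bin script or the static pages.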

Catalog now depends on the MD5 Perl module. A copy of this module is kept on the www.senga.org download page. We have upgraded the MySQL distribution to 3.22.19 because it is now stable. Some users may have noticed formatting errors in the HTML version of the documentation: they have been fixed.

Two real-world uses of Catalog may be seen at Ghana International Trade Fair (English) and Interbat (French). The example delivered with Catalog is also available on www.senga.org for browsing only: a thematic catalog, a chronological catalog and an alphabetical catalog.

Last but not least, the Catalog name space was approved by Perl maintainers and Catalog should appear at your nearest CPAN site in the following weeks.

February 24, 1999

Catalog-0.3 is available.

The main features added to this version are:

  • A new kind of catalog has been added : the chronological catalog. As expected it shows the entries of a catalog ordered by date. This is what you need to add a What's New section to your existing catalog.
  • The context_allow instruction has been added to the sqledit.conf configuration file to allow only a specific set of actions. You must use this instruction if you want to publish a Catalog, otherwise the users will have the ability to alter the catalog by changing the parameters manually in the URL.
  • Fix a security hole involving eval.
  • The catalog management interface has been improved, allowing editing of category properties, editing of the entries in a category. The display is nicer, graphic buttons are used instead of links.
  • The installation now requires a directory to put the HTML documentation and images used by the catalog management interface. This directory will also be used later on for examples.
  • The tests run when make test is used now cover most of the cases.
  • The documentation has been updated and improved, many typos have been fixed.
  • Some memory leaks have been found/fixed and the processes have a reasonable size when running Apache and mod_perl.
  • The dir file is automatically modified by the installation process if you've chosen to install the info format.
  • New tags are available in all templates : _SCRIPT_ and _HTMLPATH_.
  • A few common errors that may occur when using the catalog management interface in the wrong way now show explicit error messages in an HTML page instead of crashing. This avoids having to look in the HTTP logs to find out what was wrong.
  • A mixture of POST and GET in the catalog management interface confused caches. It has been fixed.
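
The purpose of context_allow can be illustrated with a short Python sketch of an action whitelist. This is not Catalog's Perl code and does not show the sqledit.conf syntax: the function name is_allowed is hypothetical, cbrowse is the context seen in the browsing URLs, cedit is an invented example of an editing context, and rejecting requests with no context parameter is an assumption.

```python
from urllib.parse import parse_qs


def is_allowed(query_string, allowed_contexts):
    """Accept a request only if its single context value is whitelisted."""
    params = parse_qs(query_string)
    contexts = params.get("context", [])
    # Assumption: missing or repeated context parameters are rejected.
    return len(contexts) == 1 and contexts[0] in allowed_contexts


PUBLIC_CONTEXTS = {"cbrowse"}  # read-only browsing

print(is_allowed("name=catalog&context=cbrowse&id=3", PUBLIC_CONTEXTS))
print(is_allowed("name=catalog&context=cedit&id=3", PUBLIC_CONTEXTS))
```

A visitor who edits the URL by hand to request an action outside the whitelist is refused.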

Since a subtle bug was found in mysql-3.22.8-beta, we have switched to the latest version, mysql-3.22.16a-gamma. At the same time we've upgraded the DBI version used and mysql module. Those upgrades are not mandatory.

Catalog now uses the Test module to run tests. This requires perl 5.005. If you are running perl 5.004 (native on RedHat 5.2), you will have to compile perl 5.005. There is no rpm at the moment.

February 10, 1999

Catalog-0.2 is available. It fixes installation problems, the documentation and some bugs.

The installation process has been made simpler by removing the need to set the password and user of the MySQL database after the installation. This was confusing because most people thought it was a fatal error message.

The make test now works with a local invocation of the MySQL daemon to prevent possible corruption of an existing database.

At the request of Lynx users, all images of this site now have alt tags.

 webmaster@senga.org
Copyright (C) 2002 Loic Dachary, 12 bd Magenta, 75010 Paris, France
Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.