January 15,
2001
webbase-5.16 is
available.
- -version option shows
version number
- Fix allocation error when updating the full text index, and
optimize handling of the name server timeout condition.
- Use /etc/my.cnf,
~/.my.cnf and datadir/my.cnf instead of ~/.my.cnf alone
January 14,
2001
unac-1.4.0, Text-Unaccent-1.04
and phpUnac-1.4.0 are available.
- When unac_string finds an illegal sequence while converting, it
replaces it with a space. For instance the ISO-8859-1 character ¼ is
converted to 1 4 (one, space, four) because the fraction slash
produced by its decomposition does not exist in ISO-8859-1.
- The new unac_version function returns the version number.
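The space substitution described above can be sketched as follows. This is a Python illustration, not the unac C code: the function name is hypothetical, and Python's unicodedata module stands in for unac's own decomposition tables.

```python
import unicodedata


def unaccent_with_space_fallback(text: str, charset: str) -> str:
    """Strip accents; anything `charset` cannot encode becomes a space."""
    # Compatibility decomposition: "¼" becomes "1", U+2044 (fraction slash), "4".
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the accent itself
        try:
            ch.encode(charset)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(" ")  # illegal in the target charset: use a space
    return "".join(out)


print(unaccent_with_space_fallback("¼", "iso-8859-1"))  # 1 4
```

The fraction slash is the only character of the decomposition that ISO-8859-1 cannot represent, so it is the only one replaced.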
January 12,
2001
The debian distribution is
now available for unac-1.3, thanks to Rémi Perrot
(remi_perrot@users.sourceforge.net).
January 06,
2001
unac-1.3.0, Text-Unaccent-1.03
and phpUnac-1.3.0 are available.
- Add support for systems that do not have UTF-16BE defined but
only UTF-16 being implicitly big endian. It means that it will work
with both glibc-2.1.3 and glibc-2.1.94.
- Fix occasional allocation bug
- Allocate returned buffer even if an empty string is given in
input.
- Add more regression tests
December 28,
2000
Wonderful day: sourceforge.net is so slow that I spent half an
hour releasing the files for webbase, and sourceforge.net and
slashdot.org cannot be reached. They had a broken T1; I guess that
explains all these troubles.
webbase-5.15 is
available.
- Implement dynamic updating of the fulltext index.
- Fix last modified time update bug.
- Fix mysql-3.23.19a-gamma namespace conflict.
- Fix bug that left start point in virgin state artificially.
December 27,
2000
The sourceforge.net files have been updated, at last. The dmoz.org
mirror had not been updated since December 17 on the main repository. It
was updated today and will be automatically synchronized tonight.
Text-Query-SQL-0.09 is
available.
Basic support for mifluz, at alpha stage, portability changes
for perl-5.6.
December 23,
2000
I've not been able to upload the distributions to
sourceforge.net for two days now. First they were under a DoS
attack and the web was down, and today the publishing process hangs
forever.
webbase-5.14.0 is
available.
- -crawlers option added to run simultaneous crawlers.
- signal handling function for graceful interruption of the
crawlers.
- enable url, url_complete and url_content tables to grow over
4Gb
- The hook library is now dynamically loadable with the -hook
option so that specific full indexing strategies can be implemented
as plugins.
- The -where_url option is taken into account when rebuilding the
full text index with -rebuild
- Extensions and mime types have been added to the list of known
mime types.
- The auth field of the start table was removed because it was
not used.
mifluz-0.21 is available.
- Fix accent handling bug (unac-1.2.0).
- Add wordlist_locale attribute
- Better mifluzsearch error handling
- Improve search traces readability
- Fix various compilation bugs and warnings
- mifluzsearch produces better XML output
uri-2.11 is available.
A boundary bug prevented proper additions of new SCHEMES,
reported by Guillaume Pernot gpernot@free.fr.
phpUnac-1.2.0 is
available.
- Synchronize with unac-1.2.0.
- The CVS tree is properly populated and the README was
updated.
December 22,
2000
The migration of www.senga.org is finished. The dmoz.org mirror is now
updated daily.
unac-1.2.0 is
available.
- Fix endianness problem that shows on RedHat-7.0. UTF-16 strings
are always handled as BigEndian strings using UTF-16BE.
- Fix prototype mistake (int used where size_t required)
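To see why naming the byte order explicitly matters, here is a small Python sketch (Python's codecs stand in for iconv here, which is an assumption of the example, not how unac is written):

```python
text = "é"

big_endian = text.encode("utf-16-be")     # explicit byte order, no BOM
little_endian = text.encode("utf-16-le")  # same code unit, bytes swapped

print(big_endian.hex())     # 00e9
print(little_endian.hex())  # e900

# Decoding with the wrong byte order silently yields another character.
print(hex(ord(big_endian.decode("utf-16-le"))))  # 0xe900
```

A plain "UTF-16" name leaves the byte order to the converter, which is where the behaviour can differ from one system to another.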
Text-Unaccent-1.02 is
available.
- Synchronize with unac-1.2.0 and some code cleanup.
December 12,
2000
Senga was relocated and is now hosted by Lolix. The migration is not
over yet and the search and dmoz demo are broken. I'm working to
rebuild them.
November 22,
2000
uri-2.10 is available.
Fix bugs that prevented adding new schemes in SCHEMES list.
Fix documentation bug about uri_all_path.
October 27,
2000
All these new releases have a RPM source and binary package and
were checked to install properly.
webbase-5.13.0 is
available.
- The crawler manual page was completely reviewed for
correctness.
- Bug fixes in the mifluz interface.
- Implemented the -agent option
- Added the -show option family to display all URLs information
from an exploration starting point.
- Improved configuration script.
- Fixed major leaks and concurrency problems in the langrec
interface.
- Widen the scope of allow/disallow comparison to include cgi
parameters.
- Restore code to use .my.cnf files if any.
mifluz-0.20 is
available.
- Full documentation for the WordType class, including attributes
wordlist_allow_numbers, wordlist_minimum_word_length,
wordlist_maximum_word_length, wordlist_truncate,
wordlist_lowercase, wordlist_valid_punctuation
- Added the mifluzsearch application/cgi-bin
- Fix cache estimation bug that inhibited the cache_max
parameter
- Fix important entry deletion bug
- Minor documentation enhancements
- Minor fixes for FreeBSD-4.1 and redhat-7.0 ports
- rpm package generation scripts
uri-2.9 is available.
In this minor release there are only packaging enhancements.
October 18,
2000
unac debian packages
upgraded.
Rémi Perrot (remi_perrot@users.sourceforge.net) provided
the final touch in the debian packaging. A separate branch was
created and a README.debian, included in the sources, briefly
explains the methodology and the conventions.
unac php3 and php4
interface released.
With help from Andreas Hochsteger
(e9625392@student.tuwien.ac.at), a distribution for unac integrated
to php3 or php4 is now available. It only works if the installed
php allows dynamic loading of extensions. In return for this minor
drawback, the distribution is clean: you do not need to
recompile php or apache.
September
30, 2000
unac debian packages are
available.
Rémi Perrot (remi_perrot@users.sourceforge.net) kindly
provided debian packages for unac. Information to generate them has
been included in the CVS tree and will appear in the next
version.
September
29, 2000
Text-Unaccent-1.01 is
available.
Text-Unaccent is a perl module that provides access to the
functions of the unac library.
September
28, 2000
unac-1.1.0 is available.
A bug in error handling was fixed and the configure script has
better support for iconv. It has been tested under
Solaris-2.6.
September
23, 2000
The creation of the unac project on sourceforge is almost complete.
I'm only waiting for the crontab to be kind enough to create the
root of the CVS directory :-) While killing time I discovered that
senga.net was registered early
this year by a restaurant in Japan. Amazing when you know that Agnes loves Japanese food. I
guess I'll have to take her to Japan some day and try it. Since I
don't read Japanese, maybe someone can help and tell me where it is
located? For those wondering what Agnes has to do with senga.org,
try reverse(agnes). Senga is also a strawberry variety. I never got
a chance to buy some but a friend of mine bought some Senga
Strawberry jam and it looked good. I did not dare to eat the jam,
superstition maybe.
September
22, 2000
unac-1.0.0 is available.
unac is a C library and command that removes accents from a
string. For instance the string été will become ete.
It provides a command line interface that removes accents from an
input stream or a string given as an argument (unaccent command). In the
library function and the command, the charset of the input is
specified as an argument. The input is converted to UTF-16 using
iconv(3), accents are stripped and the result is converted back to
the original charset. The iconv --list command on GNU/Linux will
show all supported charsets.
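The conversion pipeline described above can be sketched in a few lines of Python. This is only an approximation of the idea: Python's decode/encode stands in for iconv(3), Unicode decomposition stands in for the unac tables, and the function name is hypothetical.

```python
import unicodedata


def unaccent_bytes(data: bytes, charset: str) -> bytes:
    """Decode from `charset`, strip accents, encode back to `charset`."""
    text = data.decode(charset)  # stands in for the iconv step to UTF-16
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.encode(charset)  # back to the original charset


print(unaccent_bytes("été".encode("iso-8859-1"), "iso-8859-1"))  # b'ete'
```

Because the charset is an argument, the same function handles any input encoding the converter knows about.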
You will find that the CVS tree and other services that rely on
sourceforge are not yet available. The unac project was submitted
on Wed, Sept 20 and is still sitting in the queue.
September
13, 2000
langrec-1.1.0 is available.
Langrec is a language recognition library based on public domain
dictionaries. Originally written by Marc Leguistin two years ago, it
was packaged and ported to Linux by Benoit Orihuela
(borihuela@idealx.com). Given a string or a file, the langrec
library will return the main language of the text, choosing among
Spanish, German, English, Italian and French.
The langrec library is not maintained anymore. However, the full
CVS tree is available in the webbase CVS repository. The webbase
program has been modified to use langrec, should you choose to
(configure time option).
September
12, 2000
webbase-5.12 is available.
An interface with the langrec library has been added to allow
language recognition of the crawled documents. The --with-langrec
configure flag includes the library. By default the library is not
used. The iso code of the language is stored in the database record
describing the URL.
The documentation has been enriched with many drawings depicting
the internal architecture of the code.
The interface to the full text library mifluz has been upgraded
to mifluz-0.19.
A manual page for consistentc has been added and the man page
of crawler reviewed.
All known leaks have been fixed.
The installation scheme was redesigned to match the GNU/Linux
standards
The configure script was reviewed and enhanced.
Some extensions (.tap, .rar ...) were added to the list of known
mime types.
A cache for hostnames lookup has been added to reduce the DNS
traffic.
A usage option was added to the crawler program to replace the
terse usage.
A -where_url option was added to control the scope of URLs given
to the indexer when rebuilding the full text index (mifluz) from
scratch.
Filtering of URLs during the crawl can now be made using regular
expressions.
August 24,
2000
uri-2.8 is available.
In some places isalnum or isspace were used with char instead of
unsigned char and negative lookups in tables occurred. The type was
changed to prevent this problem.
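The negative lookup is easy to reproduce. The sketch below uses Python's struct module to mimic how C reads the same byte as a signed char (assuming a platform where plain char is signed) and as an unsigned char:

```python
import struct

byte = "é".encode("iso-8859-1")  # one byte: 0xE9

as_signed = struct.unpack("b", byte)[0]    # what a signed char holds
as_unsigned = struct.unpack("B", byte)[0]  # what an unsigned char holds

print(as_signed, as_unsigned)  # -23 233
```

Used as an index into a 256-entry classification table, -23 points before the start of the table, while 233 is a valid slot.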
The WLROOT global variable was permanently replaced by the
uri_set_root and uri_get_root functions.
July 21,
2000
mifluz-0.19 is available.
- Add SWIG friendly defines
- Fix the WordList::Prefix implementation that was buggy
- Add -p option to mifluzdict to dump dictionary entries matching
a given prefix.
- The config.h and db.h headers were missing in
installation.
- The mifluz.h had a reference to the old htconfig.h header that
was replaced by the config.h header
- The configuration process now bombs if zlib was required by the
user and not found.
July 20,
2000
Search-Mifluz-0.08 and
DBD-Mifluz-0.04 are available.
Both have been modified to match the new API of mifluz-0.18.
I've not encountered major problems in the process. The new class
WordDict has an interface in Search-Mifluz but I did not feel the
need to give access to it from DBD-Mifluz. I had to fix two bugs in
mifluz-0.18, therefore it's better to use these modules with the
version from the CVS tree, or to wait for mifluz-0.19.
July 17,
2000
mifluz-0.18 is available.
The internal re-architecture of mifluz is finished. A lot of
testing is yet to be done, but at this point I decided to keep the
architecture of the index for at least one year and focus on
polishing the API, testing and interfacing with foreign
languages.
The index is now self contained (no more extra files to hold
compression information) and uses Berkeley DB sub-databases
facilities to separate the various logical parts. This choice will
make it easy to write additional modules, simply by adding
functions and a new sub-database to the index.
Here is, briefly, the list of modifications since 0.17:
- Upgrade to Berkeley DB 3.1.14
- weakcmpr file integrated in inverted index.
- inverted index now contains many logical files (dictionary,
meta information, inverted index, list of temporary files)
- merge configure.in files from top level source directory and db
directory.
- New class WordDict assigns unique numbers to words and keeps
statistical information.
- New class WordMeta handles serial numbers and locks.
- New class WordDead holds the list of deleted documents for
deferred deletion.
- WordKey format changed, now holds only numbers.
- WordRecord format changed, can hold a single integer or a
string.
June 21,
2000
mifluz-0.17 is available.
Mifluz is being reworked deeply, learning lessons from large
indexing attempts (around 12 Gb). The index is now a single file,
even if compressed. Internally it contains logical files (the index
and the dictionary) as well as some meta information such as the
current values of serial numbers.
A major speedup for bulk loading has been implemented using
temporary files merged in the manner of the sort command. A
typical bulk insertion that took 9h to run now takes around
1h.
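For readers unfamiliar with the technique, here is a minimal Python sketch of sorting through temporary files merged in the manner of sort. It is not the mifluz code: the function name and the tiny run size are made up for the illustration, and entries are assumed to be plain strings without newlines.

```python
import heapq
import os
import tempfile


def bulk_sort(entries, run_size=3):
    """Sort entries via sorted runs on disk, merged in a single pass."""
    run_paths = []
    run = []

    def flush():
        # Write the current run, sorted, to its own temporary file.
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(line + "\n" for line in sorted(run))
        run_paths.append(path)
        run.clear()

    for entry in entries:
        run.append(entry)
        if len(run) >= run_size:
            flush()
    if run:
        flush()

    files = [open(p) for p in run_paths]
    try:
        # heapq.merge streams the already sorted runs, like `sort -m`.
        return [line.rstrip("\n") for line in heapq.merge(*files)]
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

Only one run is held in memory at a time, and the final merge reads each temporary file sequentially, which is what makes the approach attractive for bulk loading.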
The rearchitectural work is not finished and the next step is to
change the key structure from word/number/number to
number/number/number, getting rid of actual words in the inverted
index since they are now stored and assigned a serial number in the
dictionary. This version is an intermediate step before this major change.
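A rough Python sketch of the planned key change follows. The class name is hypothetical, and reading the two trailing numbers as a document number and a position is an assumption made for the example.

```python
class WordDictSketch:
    """Assigns each distinct word a serial number and counts occurrences."""

    def __init__(self):
        self.ids = {}
        self.counts = {}

    def serial(self, word):
        if word not in self.ids:
            self.ids[word] = len(self.ids) + 1  # next unused serial number
        wid = self.ids[word]
        self.counts[wid] = self.counts.get(wid, 0) + 1
        return wid


words = WordDictSketch()
index = set()  # inverted index whose keys contain numbers only

# Each tuple is (word, document, position) in the old word/number/number layout.
for word, doc, pos in [("senga", 1, 0), ("mifluz", 1, 1), ("senga", 2, 5)]:
    index.add((words.serial(word), doc, pos))

print(sorted(index))  # [(1, 1, 0), (1, 2, 5), (2, 1, 1)]
```

The word itself is stored once in the dictionary, and every index key shrinks to three fixed-size numbers.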
The documentation has been proofread by the GNU volunteers and
the reference manual was included in the texinfo document to fit
the GNU conventions. Makeinfo is now used to generate the HTML
documentation.
A synchronization point was done with the Ht://Dig group at the
end of May. It introduced in Ht://Dig the new Berkeley DB 3.0.55. A
new synchronization point will be done when the new architecture is
ready. The synchronization scripts have been modified to only show
a diff; all patches must be applied by hand. This takes longer but
prevents mistakes. Applying patches requires thinking and review and
should not be done blindly.
New utilities to manipulate the index have been added
(mifluzdump, mifluzload, mifluzdict) and the htdb_* utilities
should now be used keeping in mind that sub-databases are used
(-s and -l options typically).
The compression has been re-implemented from scratch using the
same design ideas. It is more robust and much less hairy. The
WordBitCompress class could be used in various contexts with
success. All the half implemented *Vector* classes have been
removed.
As of yesterday the regression tests ran 100% purify clean.
Using quantify made it possible to improve performance by 20% by using the
WordList::Override method instead of WordList::Insert method where
possible.
The interface was changed in a major way to help thread safety.
The WordContext object is now central and all other objects are
created from it. All objects have a pointer to their environment so
that they can access global information. This adds a small space
overhead but completely prevents the use of static variables.
The search example (test/search.cc) was enhanced with an
estimation of the total number of matches and the ability to ask
for the last valid interval of matches, even if the search offset
is way beyond the last possible match.
The structure of a record (WordRecord) was completely changed.
It can now be either nothing (NONE), a single integer value (DATA)
or a string (STR). The previous format was much too restrictive
for general use.
The WordList class is now abstract and the original
implementation may be found in the WordListOne class. The
WordListMulti is not implemented yet and will make it possible to manipulate
many indexes as one.
May 3,
2000
We are glad to announce that mifluz was adopted by the GNU project
this week. It will be referenced as a GNU product in the www.gnu.org pages. The review
process by the FSF took around three weeks. RMS asked to fix some
legal issues and suggested that mifluz provide a C interface as
soon as possible.
mifluz-0.16 is
available.
This is a maintenance release that contains updates of the
documentation and the copyright notices.
A manual page was added for the Configuration class. All the
manual pages are integrated in the texinfo documentation because
RMS asked for it. We keep the manual pages since it's often more
convenient. Both are generated using the ad-hoc man/man_generate
perl script.
The LICENSE file was missing for the Berkeley DB files and a
copyright section was added to the README file.
April 21,
2000
Text-Query-SQL-0.07
is available.
A driver was added for Postgres by Benjamin Drieu
(bdrieu@april.org) and the mifluz support is complete. Some
documentation was added to describe the internal structure of the
syntax tree.
April 20,
2000
Search::Mifluz-0.07 and DBD::Mifluz-0.03 are available.
They both follow the mifluz-0.15 release and integrate
the new version of the search algorithms. A full set of options and
parameters has been added to control every aspect of the query
mechanism. A manual page is also provided for both modules.
Search::Mifluz has been checked with purify and
does not contain memory leaks when running the full test suite.
mifluz-0.15 is
available.
Most of the work was dedicated to the search algorithm
implementation. It is still in the test directory (test/search.cc)
but has evolved to a stable state. It is able to resolve structured
queries (boolean) and simple queries using semantics similar to
the simple AltaVista syntax. When resolving a query, the search
process uses classes that are derived from the WordCursor class and
that have similar semantics. A few changes have been made to the
WordCursor class to support derivation.
A major performance enhancement was done by using the
WordList::Override method instead of the WordList::Insert method.
Berkeley DB does a lot of work when the DB_NOOVERWRITE flag is set.
Always use Override instead of Insert if possible, it saves around
30% of CPU cycles on an insertion intensive process.
Some compilation/linking problems were fixed with the help of
Orion (FreeBSD) and Peter Marelas (Solaris). The official
(www.sleepycat.com) patch for Berkeley DB 3.0.55 was applied.
bert@senga.org found that the combination of RedHat-6.2 +
linux-2.3.51 provides transparent support for large files (>2Gb)
and is reasonably stable. Although recent tests show that failure
occurs when creating an inverted index larger than 4Gb, there is
hope.
Replacement functions (memmove etc..) were added in the new clib
directory and are used only if needed with a simple scheme. The
functions and methodology were taken from Berkeley DB. The
htconfig.h is now included by every source file so that
replacements are activated only when needed.
March 22,
2000
mifluz-0.14 is available.
The main changes in this version are distribution architecture,
index sharing between processes, documentation, benchmarks,
regression tests and bug fixes.
Upgrade to Berkeley DB 3.0.55. The db directory is now a flat
directory. This is no longer a traditional Berkeley DB
distribution. Renamed all external symbols of Berkeley DB to avoid
conflicts with installed versions. This problem was a pain in the
neck for programs linked (statically or dynamically) with an
original Berkeley DB distribution, such as Perl. There is now only
one library to link (-lmifluz instead of -lmifluz and -lhtdb). To
prevent linkage errors with C programs (static and dynamic),
mifluz does not use iostream anymore, only stdio.
The description of the key has been greatly simplified and is
now properly documented. The internal structure of the key has also
been simplified and mifluz-0.14 indexes are not compatible with
mifluz-0.13 indexes. No conversion program is provided.
The zlib library is no longer required to compile mifluz.
The WordSearchDescription was renamed WordCursor for
clarity.
Support was added for resources sharing among multiple
processes. Two distinct processes may use the same inverted index
at the same time and share the same cache. This is especially
important when running cgi-bin for query (either standalone or with
fast-cgi or mod_perl).
The API is now fully documented in manual pages, the entry point
is the mifluz manual page. This page contains an explanation
of all the possible parameters to tune mifluz. They are also
repeated in the manual pages of the classes that implement each
specific feature. The texinfo guide has been reviewed completely
for accuracy, to remove redundancy with manual pages and a chapter
on cache tuning was added.
The monitoring feature has been re-implemented and is now used
at the Berkeley DB level. The benchmarks can use the monitoring
class (make MONITOR=-m dobench) and the output is
automatically fed to the new benchmark-report utility to
build graphical benchmark reports. Four benchmark reports are
distributed with the package in the test/benchmark
directory. We encourage everyone to send their results on various
architectures/machines.
Regression tests were added for the htdb* utilities, the
example, shared index files, readonly index files. They are far
from complete but it's progressing.
A few very tricky bugs in the compression were fixed, refer to
the ChangeLog for details on this subject and other bug fixes.
Last but not least, a CONTRIBUTORS section was added to the
README file to list people who significantly helped mifluz
to progress.
A new release was built for Search-Mifluz and webbase so that they compile with this new
version.
February 24,
2000
Search-Mifluz-0.05 and DBD-Mifluz-0.01 are available.
Search-Mifluz-0.05 matches mifluz-0.13.0. The major addition is
search capabilities (see t/search.t) whose implementation is based
on the example in mifluz-0.13.0 (test/search.cc).
The new DBD-Mifluz-0.01 is a DBI driver for Search-Mifluz,
mainly useful to allow persistent connections using Apache::DBI
when running cgi-bin scripts under mod_perl.
Those two modules do not provide all the functionalities one may
dream of. However, they make it possible to start integration with Catalog and
other cgi-bin as needed. The building blocks are here, at last.
Text-Query-SQL-0.06 is
available.
A driver was added for the mifluz
inverted index. It simply returns the syntax tree built from the
query: the Search::Mifluz
module knows its structure and is able to resolve a query using
it.
February 23,
2000
webbase-5.9 is available.
Maintenance release to keep in sync with mifluz-0.13.0. The
URL arguments given to crawler are now canonicalized so that it's
not possible to forget a trailing slash or provide an ill-formatted
URL.
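As an illustration of what canonicalizing a URL argument involves, here is a minimal Python sketch built on urllib.parse. It is not the uri library's algorithm: the exact rules (lowercasing the host, dropping the fragment) are assumptions chosen for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Return a normalized absolute URL or raise ValueError."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"ill-formatted URL: {url!r}")
    path = parts.path or "/"  # supply the forgotten trailing slash
    # Scheme and host are case-insensitive; the fragment is dropped here.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


print(canonicalize("http://www.senga.org"))  # http://www.senga.org/
```

With this in place, http://www.senga.org and http://www.senga.org/ name the same starting point instead of two different ones.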
February 22,
2000
mifluz-0.13 is available.
An elaborate example of search algorithm taking advantage of
the inverted index structure is now provided in the test/search.cc
file. It uses few resources and even provides relevance
ranking.
The Walk function has now reached a mature form. It has been
split into functions that look like iterator methods (WalkInit,
WalkNext, WalkFinish). The underlying mechanism has been reviewed
thoroughly and partially re-implemented to overcome limitations or
inefficiencies.
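The iterator shape of the new Walk functions can be pictured with the following Python sketch. The class and method names are hypothetical analogues of WalkInit, WalkNext and WalkFinish, not the mifluz API, and a sorted list stands in for the inverted index.

```python
import bisect


class WalkSketch:
    """Iterator-style walk over sorted keys that share a prefix."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.prefix = ()
        self.pos = len(self.keys)

    def walk_init(self, prefix):
        self.prefix = prefix
        # Jump straight to the first key that is >= prefix.
        self.pos = bisect.bisect_left(self.keys, prefix)

    def walk_next(self):
        if self.pos < len(self.keys):
            key = self.keys[self.pos]
            if key[: len(self.prefix)] == self.prefix:
                self.pos += 1
                return key
        return None  # walk exhausted

    def walk_finish(self):
        self.pos = len(self.keys)


walker = WalkSketch([(2, 1, 1), (1, 2, 5), (1, 1, 0)])
walker.walk_init((1,))
print(walker.walk_next())  # (1, 1, 0)
print(walker.walk_next())  # (1, 2, 5)
print(walker.walk_next())  # None
walker.walk_finish()
```

Splitting the walk this way lets the caller stop early or interleave several walks, which a single callback-driven function makes awkward.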
The WordDB class was dramatically simplified by removing the
hideous mixture of return code conventions.
Some important bugs were fixed. When using the
WordList::Override method the reference count was not updated
correctly. ASCII number representations in WordKey were parsed with
atoi instead of strtoul, forbidding values with high bit set. The
compression and WordKey packing were not endian clean.
The monitoring (WordMonitor) is now turned off by default and
not even allocated. This isolates the rest of the code from
potential bugs in the monitoring and allows WordMonitor to evolve in
a more independent way.
The allocation method of the Berkeley DB compression scheme was
clarified. It is no longer allocated by WordDB but by
WordDBCompress which makes a lot more sense.
Some obsolete debugging variables and classes were removed.
The headers are SWIG
friendly. For those who don't know SWIG yet, it's a wonderful tool
that makes it easier to provide scripting language interfaces.
The distribution has been changed completely. Headers are now
hidden in a mifluz subdirectory to prevent pollution. The former
libhtdb and libhtword have been merged in libmifluz. In the
sources, the CVS tree no longer depends on htdig and a set of
scripts was written to keep in sync. This requires more discipline
from the developer's point of view but makes mifluz less dependent
on the current htdig state. As a side effect, the ChangeLog of
mifluz now contains all the information related to modifications
made in the former htlib and htword directories. It's not necessary
to navigate from ChangeLog to ChangeLog.htdig anymore.
January 30,
2000
webbase-5.8 is available.
Lots of warnings and a memory leak were fixed. A nasty
configuration problem related to socklen_t type was fixed with the
implementation of the AC_PROTOTYPE macro.
January 27,
2000
Catalog-1.02 is
available.
The dmoz loading process has been dramatically simplified. It is
now only available as a command. No more fancy web interface that
confuses everyone. In addition the convert_dmoz script now
generates text files that can be directly loaded into Catalog
instead of the intermediate XML file. The whole loading process now
takes from one to two hours depending on your machine. It took
around 10 hours with the previous version.
The -exclude option was added to convert_dmoz to get rid of a
whole branch of the catalog at load time. Typical usage would be
convert_dmoz -exclude '^/Adult' -what content content.rdf.gz.
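The effect of such a pattern can be sketched in Python. The category paths below are invented for the example, and matching the pattern against the category path is an assumption about what convert_dmoz does.

```python
import re

exclude = re.compile(r"^/Adult")  # pattern from the usage line above

categories = [
    "/Adult/Misc",
    "/Arts/Music",
    "/Computers/Software",
    "/Kids/Adult_Supervision",
]

# Keep every category whose path does not start with the excluded branch.
kept = [c for c in categories if not exclude.search(c)]
print(kept)  # ['/Arts/Music', '/Computers/Software', '/Kids/Adult_Supervision']
```

The leading ^ anchors the pattern to the start of the path, so only the whole branch is dropped and not every category that happens to contain the word.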
A lot more sanity checks and repair have been added to deal with
duplicates, category id conflicts and the like.
Hopefully this new method will also be more understandable and
generate less traffic on the mailing list. There is room for
improvements and contributors are welcome.
January 23,
2000
The bug tracking lists are now hosted on SourceForge. Existing
entries have been moved. Since SourceForge also provides a Task
Management database, the bug lists have been split between real
bugs and tasks. Each product has a link to the Task Manager and Bug
Tracker in the left menu.
January 22,
2000
For the benefit of everyone (but mainly because it's painful to
maintain :-) we've moved all the CVS trees to SourceForge. The immediate
benefit is that you can get anonymous access to the CVS tree. You
will also be able to browse it on the web. I can't believe this has
not been done on Senga already. But that's the whole story:
maintaining the technical tools behind Senga is a time consuming
job. During next week we will be moving the bug tracking and
downloads to SourceForge and at last the home page of each product.
A link to the CVS page has been added in the left menu of every
product, feel free to try it.
January 17,
2000
mifluz-0.12, webbase-5.7 and uri-2.7 are available.
This set of versions must be used together. See each product
page for more information on the modifications.
The older news are
available.
January 12,
2000
uri-2.7 is available.
Renamed uri struct member to _uri because some compilers do not
like that and think that's a name clash.
January 3,
2000
mifluz-0.11
Several bug fixes, speedups, and code cleanups. Added
possibility to monitor what's going on inside the indexing.
Preparing for full scale, real-world tests.
December 16,
1999
mifluz-0.10, webbase-5.6 and uri-2.6 are available.
This set of versions must be used together. See each product
page for more information on the modifications. We've fixed memory
leaks, configuration errors and bugs.
December 09,
1999
mifluz-0.9 is available.
A new compression algorithm was implemented. It reduces the
index size by a factor of 8 compared to an uncompressed index. It
works in the same context as the previously implemented compression
(it compresses/uncompresses pages within Berkeley DB when they are
written/read to the db file), but the compression algorithm is
specifically designed for compressing DB pages (the previous
compression used zlib). Since pages are generally full of redundant
data this can achieve good compression ratios.
December 8,
1999
Search-Mifluz-0.01
is available.
This is the pre-release version of the Perl interface to mifluz. It was generated using SWIG. We had to patch SWIG in order
to achieve proper package encapsulation. The patches will be
integrated in the next SWIG version but at present they are
included in the Search-Mifluz distribution.
The release of Search-Mifluz was also the opportunity to use SourceForge as a repository
for the project. SourceForge provides all facilities available on
Senga for OpenSource projects. If we're satisfied with SourceForge
for Search-Mifluz, we will consider moving all the products to
SourceForge. It's much easier to contribute to a shared source
distribution environment than dealing with it on our own :-)
December 7,
1999
webbase-5.5 is available.
In this minor maintenance release we've fixed a few leaks and
a memory overrun. It has been tested on a set of 150 000 URLs, some
of them containing really weird data.
November 29,
1999
webbase-5.4 is available.
The most important thing is that many memory leaks have been
removed. The crawler has been extensively tested (around 2 million
URLs crawled on 150 000 different web sites). The mifluz full text indexing library is now
integrated. It generates very big indexes at present but will
improve dramatically next week thanks to Marcel Bosc. For more information
on this subject refer to the mifluz mailing list and the htdig3-dev
mailing list (on htdig). The
hook to the full text indexing library is located in the new
hooks library.
In order to definitely fix the problems related to long URLs,
the url field is now a text field. To resolve the
indexing issue, a field was added to the url and
start table: url_md5. Following the same idea, the
directory tree that contains the temporary copies of the pages
(WLROOT) now contains cryptic MD5 based file names. It's activated
by default with the version 2.4 of the uri
library.
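The idea behind url_md5 and the MD5 based file names can be sketched in Python with hashlib. The function names and the two-level directory layout are assumptions made for the illustration; the actual layout under WLROOT is not described here.

```python
import hashlib


def url_md5(url: str) -> str:
    """Fixed-length digest (32 hex characters) usable as an index key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def cache_path(root: str, url: str) -> str:
    """Map a URL of any length to a short, filesystem-safe path."""
    digest = url_md5(url)
    # Two-level fan-out keeps directories small; illustrative layout only.
    return f"{root}/{digest[:2]}/{digest[2:4]}/{digest}"


long_url = "http://www.senga.org/" + "a" * 5000
print(len(url_md5(long_url)))  # 32
```

However long the URL is, the digest has a constant size, so it can be indexed where the text field itself cannot.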
The MySQL connection
functions have been upgraded so that they take into account a
~/.my.cnf file. Always using -user, -password etc. is not mandatory
anymore.
The -schema option was added to crawler and
displays the builtin database schema. It's useful if you want to
add fields of your own in the start table.
Thanks to Bertrand
Demiddelaer who fixed a timeout problem. Many other small bugs
were fixed while testing, refer to ChangeLog for detailed
information.
November 05,
1999
mifluz-0.8 is available.
Version 0.7.0 forgot to include examples subdirectory... Some
portability and bug fixes. The docs on the API were extended, some
examples were added to help starting up with mifluz.
The storage key (WordKey) class has evolved a bit: accessors for
getting numerical fields were added. Input operators for streaming
were added to WordKey, WordList, WordReference...
A speed-up for skipping useless sequential walking when using
partially defined search keys was added, as well as tests.
The use of the (important) WordList::Walk method was
simplified.
October 12,
1999
mifluz-0.6 is available.
After two months of maturation and coding, the first working
version of mifluz-0.6 is
finally available. It is in alpha stage but we strongly believe that
the architectural choices are appropriate and will allow mifluz to
reach maturity rapidly. It provides very few functionalities and is
merely an inverted index manipulation library. It knows nothing
about parsing documents or displaying search results.
We worked very closely with the Ht://dig Group and Berkeley DB staff. mifluz-0.6 is used in the 3.2
version of Ht://dig (or mifluz-0.6 is a packaging of
the Ht://dig indexing library, depending on your point of view :-).
We implemented a transparent compression layer in Berkeley DB 2.2.7
that will (maybe) be included in future releases of Berkeley
DB.
A new developer, Marcel Bosc (bosc@senga.org), joined Senga two
days ago. He will eventually take over on mifluz. The work required
is huge and having someone working full time on this subject is
great news. The immediate future is to integrate mifluz with the
crawler and Catalog.
September 7,
1999
Catalog-1.01 is
available.
This is a maintenance
release.
- Various bug fixes. All easy to fix bugs have been fixed. Take a
look at
bugzilla to see what hasn't been fixed.
- The _PATHTEXT_ and
_PATHFILE_ tags syntax has been extended to specify a range of path
components.
- Graham Barr added a
recursive template feature for a catalog root page. This allows
sub-categories of the root categories to be shown in the root page of a
catalog.
Don't hesitate to
submit bugs or ideas to bugzilla. Hopefully the next version of
Catalog will have a fast full text indexing mechanism and I'll be
able to implement new functionalities.
Have fun !
July 13,
1999
The first release of
the URI manipulation C library (uri) and the
internet crawler C library (webbase) are
available. These two libraries are core components of our search
engine. One might say: what? Another internet crawler? We
already have dozens! Of course there is a difference with this one:
it is able to efficiently crawl millions of URLs. The crawler
information is stored in a MySQL database.
July 6,
1999
The whole www.senga.org site has been restructured. It now contains
general information about Senga, at the home page level. The top
level menu on the left gives access to the bug tracking system for
all the products (Bug Track), a catalog of resources that we
use for development (Links). The Products page points to all the products
or development projects at Senga. This is where you will find Catalog.
July 3,
1999
Catalog-1.00 is available.
This release includes
PHP3 code to display a catalog. The author is Weston Bustraan
(weston@infinityteldata.net). The main motivation to jump directly
to version 1.00 is to avoid version number problems on
CPAN.
July 2,
1999
Catalog-0.19 is available.
This is a minor
release. The most noticeable addition is the new search
mechanism.
- Searching : two
search modes are now available. AltaVista simple syntax and
AltaVista advanced syntax. Both use the Text-Query and
Text-Query-SQL perl modules.
- Dmoz loading is much
more fault tolerant. In addition it can handle compressed versions
of content.rdf and structure.rdf. The comments are now stored in
text fields instead of char(255).
- The template system
was extended with the pre_fill and post_fill
parameters.
- Searching associated
to a catalog dumped to static pages is now possible using the
'static' mode.
- Fixed two security
weaknesses in confedit and recursive cgi handling.
- Many sql queries have
been optimized.
- The configuration was
changed a bit to fix bugs and to isolate database
dependencies.
- The tests were
updated to isolate database dependencies.
- Fixed numerous minor
bugs, check ChangeLog if you're interested in details.
Many thanks to Tim
Bunce for his numerous contributions and ideas. He is the architect
of the Text-Query and Text-Query-SQL modules, Eric Bohlman and Loic
Dachary did the programming.
Thanks to Eric Bohlman
for his help on the Text-Query module. He was very busy but managed
to spend the time needed to release it.
There is not yet
anything usable for full text indexing but we keep working on it.
The storage management is now handled by the reiserfs file system
thanks to Hans Reiser who is working full time on this. Loic
Dachary does his best to get something working, if you're
interested go to http://www.senga.org/mifluz/.
For some mysterious
reason CPAN lost track of the Catalog name. In order to install Catalog
you should use perl -MCPAN -e 'install Catalog::db'. Weird but
temporary.
Have fun !
May 26,
1999
There currently are four contributors to Catalog. Here they
are:
- Tim Bunce
(Tim.Bunce@ig.co.uk) is working on a commercial project involving
Catalog. He fixes bugs, changes the programming interface and has
ideas on how to do things.
- Christophe Le Bars
(clb@alcove.fr) is packaging Catalog for Debian.
- David Walker
(dwalker@c-wheeler.agelena.net) is adding Postgres
support.
- Weston Bustraan
(weston@infinityteldata.net) works on PHP3 code to display the
content of Catalog.
Of course I won't be posting this list on the home page every
month. If you want to know who's working on what you can bookmark
the list of assigned
tasks.
May 18,
1999
Catalog-0.10 replaces the Catalog-0.9
version published yesterday because of an installation bug that
makes it completely unusable except for people upgrading from
Catalog-0.5. Thank you for your patience.
May 17,
1999
Catalog-0.10 is available.
This is a maintenance
release. We are happy to announce that Catalog is now available at
your nearest CPAN mirror. The bug tracking system
installed two weeks ago proved very useful. It allows anyone to
enter bug reports, ideas and suggestions about Catalog. If you are
in need of commercial support on Catalog, two new companies are
entering the business : Alcove
and Atrid. (for details go to
the support page).
- The Bundle::Catalog
module has been changed to include Catalog to simplify the
installation process.
- The installation
procedure has been simplified a bit and now includes the
possibility to re-use an existing configuration and to specify the
installation root of MySQL.
- The dmoz.org loading
process is
better documented and the interface now clearly explains the
loading steps.
- The Catalog directory
containing the documentation is now created by the installation
process.
- Tim Bunce's bug fixes
and enhancements have been integrated.
- A FreeBSD 3.1 section
was added to the installation process. The makefiles no longer
depend on GNU Make, except for the documentation makefile. We
strongly suggest using GNU Make :-)
- Contributions
guidelines and script have been added (CONTRIBUTIONS file). It
provides a framework to easily contribute to the software, using
the latest development branch.
- A memory leak has
been found in XML-Parser-2.23, we strongly recommend using
XML-Parser-2.22 instead, if you manipulate large amounts of data such
as dmoz.org.
- The loading of
dmoz.org is now resistant to duplicates in the author
section.
- A bug in the _PATH_
tag handling was fixed. Additional tags have been added to have
access to individual path components (_PATH0_, _PATH1_,
...).
- A first step was made
to make the code database independent. There is still some work to
be done. If you have experience with Oracle, Informix, Postgres, you
could already provide the table definitions and the database
configuration procedures.
- The verbosity of the
error messages has been reduced.
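The _PATH_ tags mentioned in the list above can be pictured with a small substitution sketch. This is written in Python rather than in Catalog's own Perl, and the exact meaning of each tag is an assumption: here _PATH_ expands to the whole category path and _PATHn_ to its n-th component, counting from 0. The function name is hypothetical.

```python
import re


def expand_path_tags(template, path):
    """Replace _PATH_ and _PATHn_ tags using the components of path."""
    components = [part for part in path.split("/") if part]

    def replace(match):
        index = match.group(1)
        if index == "":
            # Plain _PATH_: assumed to stand for the whole path.
            return "/".join(components)
        position = int(index)
        # Expand to nothing when the path is shorter than the index.
        return components[position] if position < len(components) else ""

    return re.sub(r"_PATH(\d*)_", replace, template)


print(expand_path_tags("_PATH_ | _PATH0_ | _PATH2_", "Sport/Events/Tennis"))
# Sport/Events/Tennis | Sport | Tennis
```

Other tags such as _PATHTEXT_ are left untouched by this pattern, since it only accepts digits between _PATH and the closing underscore.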
For more details on
bug fixes you can search the bug tracking system at (bugzilla). We are working
hard on the full text indexing library. There will be more on this
subject very soon.
Have fun !
May 2,
1999
The Bugzilla bug tracking
system is installed in http://www.senga.org/bugzilla/.
It is used not only to report bugs of Catalog but also to suggest
enhancements or new features. Anyone can add an entry, go ahead !
April 19,
1999
Catalog-0.5 is available.
The main features
added to this version are:
- XML external
representation of a thematic catalog. This allows easy export and
import of existing catalogs. The XML format is a custom one and you
could argue that we should have used XML/RDF instead. The lack of
tools handling XML/RDF prevented this.
- A new module has been
derived from Catalog to display and manage dmoz (www.dmoz.org) catalog. This
effectively allows anyone to run a mirror of dmoz. The database is
only 400 MB for 400 000 URLs and 65 000 categories. Response
time is really fast provided you've installed Apache +
mod_perl.
- The Makefiles and
installation procedures have been rebuilt from scratch for more
flexibility and clarity.
- A Perl bundle was
added to automate the installation of dependent modules. This
became really necessary since Catalog now depends on 9 external
modules found on CPAN.
Although Catalog was
added last month to CPAN, the module list has not been re-generated
since then and we impatiently wait for it.
A mirror of dmoz.org has been loaded to show
that Catalog is able to handle a large number of
records and categories.
March 16,
1999
Catalog-0.4 is available.
The main features
added to this version are:
- Intuitive browsing :
/cgi-bin/Catalog/Sport/Events/Tennis/ will display the
expected category content. This is much more readable than the
name=catalog&context=cbrowse&id=3
parameters.
- Static dump : the
whole catalog can be dumped in a directory tree that replicates the
category structure. The result may be copied and browsed using only
static HTML pages. This can be very convenient if your web site is
not cgi-bin enabled.
- Search function : the
thematic catalogs may now be searched in full text. Category names
and record contents are searched. The search may be limited only to
the category names or only to the record contents.
- A complete example is
installed with Catalog. A chapter was added to the documentation to
comment the example.
It is a step by step guide to configure the catalogs. The example
contains a thematic catalog,
a chronological
catalog and an alphabetical
catalog.
- Option in
configuration files for nph scripts.
- The configuration
generated by Makefile.PL is saved and reused in the config.cache
file so that repeated installations do not require answering the
same questions multiple times.
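The path-style browsing described in the list above amounts to walking the category tree one path component at a time. The sketch below shows the idea in Python; it is not Catalog's Perl implementation, and the tree layout, the category ids, the root id and the function name are all assumptions made for the example.

```python
from urllib.parse import unquote

# Hypothetical category tree: name -> (category id, children).
CATEGORIES = {
    "Sport": (1, {
        "Events": (2, {
            "Tennis": (3, {}),
        }),
    }),
}
ROOT_ID = 0  # hypothetical id of the catalog root


def resolve_category(path, prefix="/cgi-bin/Catalog/", tree=CATEGORIES):
    """Return the id of the category named by the URL path, or None."""
    if not path.startswith(prefix):
        return None
    names = [unquote(part) for part in path[len(prefix):].split("/") if part]
    category_id = ROOT_ID
    for name in names:
        if name not in tree:
            return None  # unknown category somewhere along the path
        category_id, tree = tree[name]
    return category_id


print(resolve_category("/cgi-bin/Catalog/Sport/Events/Tennis/"))  # 3
print(resolve_category("/cgi-bin/Catalog/Sport/Golf/"))           # None
```

The same walk could drive a static dump, since each resolved path corresponds to one directory in the dumped tree.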
Catalog now depends on
the MD5 Perl module. A copy of this module is kept on the
www.senga.org download page. We have upgraded the MySQL
distribution to 3.22.19 because it is now stable. Some users may
have noticed formatting errors in the HTML version of the
documentation : it has been fixed.
Two real-world uses
of Catalog may be seen at
Ghana International Trade Fair (english) and Interbat (french). The example
delivered with Catalog is also available on www.senga.org for
browsing only: a thematic
catalog, a chronological
catalog and an alphabetical
catalog.
Last but not least,
the Catalog name space was approved by Perl maintainers and Catalog
should appear at your nearest CPAN site in the following
weeks.
February 24,
1999
Catalog-0.3 is available.
The main features
added to this version are:
- A new kind of catalog
has been added : the chronological catalog. As expected it shows
the entries of a catalog ordered by date. This is what you need to
add a What's New section to your existing
catalog.
- The context_allow
instruction has been added to the sqledit.conf configuration file
to allow only a specific set of actions. You must use this
instruction if you want to publish a Catalog, otherwise the users
will have the ability to alter the catalog by changing the
parameters manually in the URL.
- Fix a security hole
involving eval.
- The catalog
management interface has been improved, allowing editing of
category properties, editing of the entries in a category. The
display is nicer, graphic buttons are used instead of
links.
- The installation now
requires a directory to put the HTML documentation and images used
by the catalog management interface. This directory will also be
used later on for examples.
- The tests run when
make test is used now cover most of the cases.
- The documentation has
been updated and improved, many typos have been fixed.
- Some memory leaks
have been found/fixed and the processes have a reasonable size when
running Apache and mod_perl.
- The dir file is
automatically modified by the installation process if you've chosen
to install the info format.
- New tags are
available in all templates : _SCRIPT_ and _HTMLPATH_.
- A few common errors
that may occur when using the catalog management interface in the
wrong way now show explicit error messages in an HTML page instead
of crashing. This avoids having to look in the HTTP logs to find out
what was wrong.
- A mixture of POST and
GET in the catalog management interface confused caches. It has
been fixed.
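The idea behind context_allow, as described in the list above, is an allow-list: a request is served only if its action is one of the explicitly permitted ones. The Python sketch below shows that check on a query string. It does not reproduce the sqledit.conf syntax, which is not shown here; the function name, the allow-list contents and the cedit value are assumptions, and only the cbrowse context comes from the URLs quoted in these notes.

```python
from urllib.parse import parse_qs

# Hypothetical allow-list for a published, read-only catalog.
ALLOWED_CONTEXTS = {"cbrowse"}


def is_request_allowed(query_string, allowed=ALLOWED_CONTEXTS):
    """Accept the request only if it names exactly one allowed context."""
    params = parse_qs(query_string)
    contexts = params.get("context", [])
    # Reject missing, repeated or unlisted context values.
    return len(contexts) == 1 and contexts[0] in allowed


print(is_request_allowed("name=catalog&context=cbrowse&id=3"))  # True
print(is_request_allowed("name=catalog&context=cedit&id=3"))    # False
```

Listing what is allowed, rather than what is forbidden, means that an action added later stays blocked until someone opts in.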
Since a subtle bug was
found in mysql-3.22.8-beta, we have switched to the latest version,
mysql-3.22.16a-gamma. At the same time we've upgraded the DBI
version used and mysql module. Those upgrades are not
mandatory.
Catalog now uses the
Test module to run tests. This requires perl 5.005. If you were
running perl 5.004 (native on RedHat 5.2), you will have to compile
perl 5.005 yourself. There is no rpm at the moment.
February 10, 1999
Catalog-0.2 is
available. It fixes installation problems, the documentation and
some bugs.
The installation
process has been made simpler by removing the need to set the password
and user of the MySQL database after the installation. This was
confusing because most people thought it was a fatal error
message.
The make test now
works with a local invocation of the MySQL daemon to prevent
possible corruption of an existing database.
At the request of Lynx
users, all images of this site now have alt tags.