solr search server with tika and manifoldcf

I finally managed to get my search server running using solr as main engine and tika for extraction. The setup is competed by a manifoldcf for access to files, emails, wiki, rss and web.

solr

A short overview on the basic file structure of solr is shown below:

filestructure


<solr-home-directory/
solr.xml
core_name1/
core.properties
conf/
solrconfig.xml
managed-schema
data/

And here is my core.properties file without cloud on a single server and very basic as well.

core.properties


Name=collection name
Config=solrconfig.xml
dataDir=collection name/data

schema fields from tika

The following fields are essential for my setup:

  • id – the identifier unique for solr
  • _version_ – also some internal stuff for solr
  • content – the text representation of the extraction results from tika
  • ignored_* – as a catchall for any metadata that is not covered by a field in the index

The solr install is following the instructions given by the project team. As I am using debian the solr.in.sh is barely standard. Here are the settings:


SOLR_PID_DIR="/var/solr"
SOLR_HOME="/var/solr/data"
LOG4J_PROPS="/var/solr/log4j2.xml"
SOLR_LOGS_DIR="/var/solr/logs"
SOLR_PORT="8983"

Solr is started via old init.d style script from the project team. No modifications here.

The specific managed-schema and solrconfig.xml files are not listed here but took the most time to get them running. Some comments:

  • grab some information on the metadata extracted by tika to find the fields that should be worth a second look
  • check for the configuration given in /var/solr/data/conf/
  • especially the solr log at /var/solr/logs/solr.log
  • managed-schema shoud be adjusted for the metadata retrived through tika
  • delete any old collection files by removing /var/solr/data/collection name/collection name/index/
  • solr cell is responsible for importing/indexing files in foreign formats like PDF, Word, etc
  • set stored false as often as possible
  • set indexed false as much as possible
  • remove copyfields as far as possible
  • set indexed false for text_general
  • use catchall field for indexing
  • start JVM in server mode
  • set logging on higher level only
  • integrate everything in tomcat
  • set indexed or docValues to true but not both
  • some field type annottations: Solr Manual 8.11

some interesting commands

  • /bin/solr start
  • /bin/solr stop -all
  • /bin/post -c collection input
  • /bin/solr delete -c collection
  • /bin/solr create -c collection -d configdir
  • velocity setup

    velocity may be used as a search interface for solr but my setup is not completed yet.

    tika

    The tika server version is also installed as described by the project team. I only added a start script for systemd as follows:


    [Unit]
    Description=Apache Tika Server
    After=network.target

    [Service]
    Type=simple
    User=tika
    Environment="TIKA_INCLUDE=/etc/default/tika.in.sh"
    ExecStart=/usr/bin/java -jar /opt/tika/tika-server-standard-2.3.0.jar --port 9998 --config /opt/tika/tika-config.xml
    Restart=always

    [Install]
    WantedBy=multi-user.target

    The tika.in.sh is once again copied from project team suggestion without modifications:


    TIKA_PID_DIR="/var/tika"
    LOG4J_PROPS="/var/tika/log4j.properties"
    TIKA_LOGS_DIR="/var/tika/logs"
    TIKA_PORT="9998"
    TIKA_FORKED_OPTS=""

    The tika-config.xml is quit empty at the moment but I hope to get logging running soon.

    ManifoldCF

    And finally the manifoldcf installation from scratch as the interface to the various information resources.

    and here is my systemd start script:

    [Unit]
    Description=ManifoldCF service
    [Service]
    WorkingDirectory=/opt/manifoldcf/example
    ExecStart=/usr/bin/java -Xms512m -Xmx512m -Dorg.apache.manifoldcf.configfile=./properties.xml -Dorg.apache.manifoldcf.jettyshutdowntoken=secret_token -Djava.security.auth.login.config= -cp .:../lib/mcf-core.jar:../lib/mcf-agents.jar:../lib/mcf-pull-agent.jar:../lib/mcf-ui-core.jar:../lib/mcf-jetty-runner.jar:../lib/jetty-client-9.4.25.v20191220.jar:../lib/jetty-continuation-9.4.25.v20191220.jar:../lib/jetty-http-9.4.25.v20191220.jar:../lib/jetty-io-9.4.25.v20191220.jar:../lib/jetty-jndi-9.4.25.v20191220.jar:../lib/jetty-jsp-9.2.30.v20200428.jar:../lib/jetty-jsp-jdt-2.3.3.jar:../lib/jetty-plus-9.4.25.v20191220.jar:../lib/jetty-schemas-3.1.M0.jar:../lib/jetty-security-9.4.25.v20191220.jar:../lib/jetty-server-9.4.25.v20191220.jar:../lib/jetty-servlet-9.4.25.v20191220.jar:../lib/jetty-util-9.4.25.v20191220.jar:../lib/jetty-webapp-9.4.25.v20191220.jar:../lib/jetty-xml-9.4.25.v20191220.jar:../lib/commons-codec-1.10.jar:../lib/commons-collections-3.2.2.jar:../lib/commons-collections4-4.2.jar:../lib/commons-discovery-0.5.jar:../lib/commons-el-1.0.jar:../lib/commons-exec-1.3.jar:../lib/commons-fileupload-1.3.3.jar:../lib/commons-io-2.5.jar:../lib/commons-lang-2.6.jar:../lib/commons-lang3-3.9.jar:../lib/commons-logging-1.2.jar:../lib/ecj-4.3.1.jar:../lib/gson-2.8.0.jar:../lib/guava-25.1-jre.jar:../lib/httpclient-4.5.8.jar:../lib/httpcore-4.4.10.jar:../lib/jasper-6.0.35.jar:../lib/jasper-el-6.0.35.jar:../lib/javax.servlet-api-3.1.0.jar:../lib/jna-5.3.1.jar:../lib/jna-platform-5.3.1.jar:../lib/json-simple-1.1.1.jar:../lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:../lib/juli-6.0.35.jar:../lib/log4j-1.2-api-2.4.1.jar:../lib/log4j-api-2.4.1.jar:../lib/log4j-core-2.4.1.jar:../lib/mail-1.4.5.jar:../lib/serializer-2.7.1.jar:../lib/slf4j-api-1.7.25.jar:../lib/slf4j-simple-1.7.25.jar:../lib/velocity-1.7.jar:../lib/xalan-2.7.1.jar:../lib/xercesImpl-2.10.0.jar:../lib/xml-apis-1.4.01.jar:../lib/zookeeper-3.4.10.jar:../lib/javax.activation-1.2.0.jar:../lib/javax.activation-api-1.2.0.jar: -jar start.jar
    User=solr
    Type=simple
    SuccessExitStatus=143
    TimeoutStopSec=10
    Restart=on-failure
    RestartSec=10
    [Install]
    WantedBy=multi-user.target

Leave a Reply