I finally managed to get my search server running, using Solr as the main engine and Tika for extraction. The setup is completed by ManifoldCF, which provides access to files, emails, wiki, RSS and web.
solr
A short overview of the basic file structure of Solr is shown below:
filestructure
<solr-home-directory>/
   solr.xml
   core_name1/
      core.properties
      conf/
         solrconfig.xml
         managed-schema
      data/
And here is my core.properties file. It is very basic and meant for a single server without SolrCloud.
core.properties
name=collection name
config=solrconfig.xml
dataDir=collection name/data
schema fields from tika
The following fields are essential for my setup:
- id – the unique identifier of a document in Solr
- _version_ – an internal Solr field, used for optimistic concurrency and the update log
- content – the text representation of the extraction results from tika
- ignored_* – as a catchall for any metadata that is not covered by a field in the index
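Fields like these can also be created through Solr's Schema API instead of editing managed-schema by hand. The following is only a minimal sketch and not the schema I actually use: the field types `text_general` and `ignored`, the attribute values and the helper names `schema_commands` and `post_schema` are assumptions for illustration. The id and _version_ fields are left out because they are assumed to exist already.

```python
import json
import urllib.request


def schema_commands():
    """Build Schema API commands for the content field and the ignored_* catch-all."""
    return {
        "add-field": {
            "name": "content",
            "type": "text_general",  # assumed field type
            "indexed": True,
            "stored": False,
            "multiValued": True,
        },
        "add-dynamic-field": {
            "name": "ignored_*",
            "type": "ignored",  # assumed field type that neither indexes nor stores
        },
    }


def post_schema(base_url, collection, commands):
    """Send the commands to the /schema endpoint of one collection."""
    url = f"{base_url}/solr/{collection}/schema"
    request = urllib.request.Request(
        url,
        data=json.dumps(commands).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call could look like `post_schema("http://localhost:8983", "mycollection", schema_commands())`, where `mycollection` is a placeholder name.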
The Solr installation follows the instructions given by the project team. As I am using Debian, the solr.in.sh is pretty much standard. Here are the settings:
SOLR_PID_DIR="/var/solr"
SOLR_HOME="/var/solr/data"
LOG4J_PROPS="/var/solr/log4j2.xml"
SOLR_LOGS_DIR="/var/solr/logs"
SOLR_PORT="8983"
Solr is started via the old init.d style script from the project team. No modifications here.
The specific managed-schema and solrconfig.xml files are not listed here, but they took the most time to get running. Some comments:
- grab some information on the metadata extracted by Tika to find the fields that are worth a second look
- check the configuration given in /var/solr/data/conf/
- check the Solr log at /var/solr/logs/solr.log in particular
- managed-schema should be adjusted for the metadata retrieved through Tika
- delete any old collection files by removing /var/solr/data/collection name/collection name/index/
- Solr Cell is responsible for importing/indexing files in foreign formats like PDF, Word, etc.
- set stored false as often as possible
- set indexed false as much as possible
- remove copyfields as far as possible
- set indexed false for text_general
- use catchall field for indexing
- start JVM in server mode
- set logging on higher level only
- integrate everything in tomcat
- set indexed or docValues to true but not both
- some field type annotations: Solr Manual 8.11
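For the first point in the list, the Tika server can be asked directly which metadata it extracts from a sample file. The sketch below sends the file to the `/meta` endpoint and lists the keys that have no explicit field in the schema, so they would end up in the ignored_* catchall. The URL `http://localhost:9998` matches the port used further down; the helper names and the case-insensitive comparison are my own assumptions.

```python
import json
import urllib.request


def fetch_tika_metadata(path, tika_url="http://localhost:9998"):
    """Upload a file to the Tika server and return its metadata as a dict."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        f"{tika_url}/meta",
        data=body,
        headers={"Accept": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def unmapped_keys(metadata, schema_fields):
    """Return the sorted metadata keys that have no explicit schema field."""
    known = {name.lower() for name in schema_fields}
    return sorted(key for key in metadata if key.lower() not in known)
```

Running `unmapped_keys(fetch_tika_metadata("sample.pdf"), ["id", "content"])` on a few typical documents gives a first list of candidate fields; `sample.pdf` is a placeholder.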
some interesting commands
- bin/solr start
- bin/solr stop -all
- bin/post -c collection input
- bin/solr delete -c collection
- bin/solr create -c collection -d configdir
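After posting some input, a quick query against the select handler shows whether documents actually arrived in the index. This is a minimal sketch: the host, the collection name `demo` and the helper names are assumptions, and port 8983 is taken from the solr.in.sh above.

```python
import json
import urllib.parse
import urllib.request


def select_url(base_url, collection, query="*:*", rows=5):
    """Build a select URL that returns only the id of the first matching documents."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "fl": "id"})
    return f"{base_url}/solr/{collection}/select?{params}"


def fetch_ids(base_url, collection, query="*:*", rows=5):
    """Run the query and return the list of document ids from the JSON response."""
    with urllib.request.urlopen(select_url(base_url, collection, query, rows)) as response:
        result = json.load(response)
    return [doc["id"] for doc in result["response"]["docs"]]
```

`fetch_ids("http://localhost:8983", "demo")` returns an empty list for a fresh collection and up to five ids once something has been indexed.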
velocity setup
Velocity may be used as a search interface for Solr, but that part of my setup is not finished yet.
tika
The Tika server is also installed as described by the project team. I only added a systemd unit file to start it:
[Unit]
Description=Apache Tika Server
After=network.target
[Service]
Type=simple
User=tika
Environment="TIKA_INCLUDE=/etc/default/tika.in.sh"
ExecStart=/usr/bin/java -jar /opt/tika/tika-server-standard-2.3.0.jar --port 9998 --config /opt/tika/tika-config.xml
Restart=always
[Install]
WantedBy=multi-user.target
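Once the unit is started, a small check can confirm that the server answers on port 9998 and that the running version matches the jar named in ExecStart. The sketch below uses the `/version` endpoint of the Tika server; the helper names and the assumption that the version number is the last word of the reply are mine.

```python
import urllib.request


def tika_version(tika_url="http://localhost:9998"):
    """Return the version banner reported by the running Tika server."""
    with urllib.request.urlopen(f"{tika_url}/version") as response:
        return response.read().decode("utf-8").strip()


def version_matches(banner, expected="2.3.0"):
    """True if the last word of the banner equals the expected version."""
    parts = banner.split()
    return bool(parts) and parts[-1] == expected
```

`version_matches(tika_version())` should be true as long as the jar in the unit file and the running server agree.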
The tika.in.sh is once again copied from the project team's suggestion without modifications:
TIKA_PID_DIR="/var/tika"
LOG4J_PROPS="/var/tika/log4j.properties"
TIKA_LOGS_DIR="/var/tika/logs"
TIKA_PORT="9998"
TIKA_FORKED_OPTS=""
The tika-config.xml is quite empty at the moment, but I hope to get logging running soon.
ManifoldCF
And finally, ManifoldCF is installed from scratch as the interface to the various information sources.
Here is my systemd unit file:
[Unit]
Description=ManifoldCF service
[Service]
WorkingDirectory=/opt/manifoldcf/example
ExecStart=/usr/bin/java -Xms512m -Xmx512m -Dorg.apache.manifoldcf.configfile=./properties.xml -Dorg.apache.manifoldcf.jettyshutdowntoken=secret_token -Djava.security.auth.login.config= -cp .:../lib/mcf-core.jar:../lib/mcf-agents.jar:../lib/mcf-pull-agent.jar:../lib/mcf-ui-core.jar:../lib/mcf-jetty-runner.jar:../lib/jetty-client-9.4.25.v20191220.jar:../lib/jetty-continuation-9.4.25.v20191220.jar:../lib/jetty-http-9.4.25.v20191220.jar:../lib/jetty-io-9.4.25.v20191220.jar:../lib/jetty-jndi-9.4.25.v20191220.jar:../lib/jetty-jsp-9.2.30.v20200428.jar:../lib/jetty-jsp-jdt-2.3.3.jar:../lib/jetty-plus-9.4.25.v20191220.jar:../lib/jetty-schemas-3.1.M0.jar:../lib/jetty-security-9.4.25.v20191220.jar:../lib/jetty-server-9.4.25.v20191220.jar:../lib/jetty-servlet-9.4.25.v20191220.jar:../lib/jetty-util-9.4.25.v20191220.jar:../lib/jetty-webapp-9.4.25.v20191220.jar:../lib/jetty-xml-9.4.25.v20191220.jar:../lib/commons-codec-1.10.jar:../lib/commons-collections-3.2.2.jar:../lib/commons-collections4-4.2.jar:../lib/commons-discovery-0.5.jar:../lib/commons-el-1.0.jar:../lib/commons-exec-1.3.jar:../lib/commons-fileupload-1.3.3.jar:../lib/commons-io-2.5.jar:../lib/commons-lang-2.6.jar:../lib/commons-lang3-3.9.jar:../lib/commons-logging-1.2.jar:../lib/ecj-4.3.1.jar:../lib/gson-2.8.0.jar:../lib/guava-25.1-jre.jar:../lib/httpclient-4.5.8.jar:../lib/httpcore-4.4.10.jar:../lib/jasper-6.0.35.jar:../lib/jasper-el-6.0.35.jar:../lib/javax.servlet-api-3.1.0.jar:../lib/jna-5.3.1.jar:../lib/jna-platform-5.3.1.jar:../lib/json-simple-1.1.1.jar:../lib/jsp-api-2.1-glassfish-2.1.v20091210.jar:../lib/juli-6.0.35.jar:../lib/log4j-1.2-api-2.4.1.jar:../lib/log4j-api-2.4.1.jar:../lib/log4j-core-2.4.1.jar:../lib/mail-1.4.5.jar:../lib/serializer-2.7.1.jar:../lib/slf4j-api-1.7.25.jar:../lib/slf4j-simple-1.7.25.jar:../lib/velocity-1.7.jar:../lib/xalan-2.7.1.jar:../lib/xercesImpl-2.10.0.jar:../lib/xml-apis-1.4.01.jar:../lib/zookeeper-3.4.10.jar:../lib/javax.activation-1.2.0.jar:../lib/javax.activation-api-1.2.0.jar: -jar start.jar
User=solr
Type=simple
SuccessExitStatus=143
TimeoutStopSec=10
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
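The long -cp value in ExecStart lists every jar with its version number, so it has to be touched whenever a library is updated. A small helper could regenerate it from the lib directory. This is only a sketch under the assumption that all jars sit in a single directory next to the working directory; `build_classpath` is a hypothetical helper, and it sorts the jars alphabetically, which differs from the order in my unit file.

```python
from pathlib import Path


def build_classpath(lib_dir, relative_prefix="../lib"):
    """Return a classpath string with '.' first, followed by every jar in lib_dir."""
    jars = sorted(entry.name for entry in Path(lib_dir).glob("*.jar"))
    return ":".join(["."] + [f"{relative_prefix}/{name}" for name in jars])
```

With WorkingDirectory=/opt/manifoldcf/example, the call would be `build_classpath("/opt/manifoldcf/lib")`, and the result can be pasted after -cp.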