{"id":716,"date":"2022-03-06T13:33:17","date_gmt":"2022-03-06T12:33:17","guid":{"rendered":"https:\/\/olkn.myvnc.com\/?p=716"},"modified":"2022-03-06T15:07:43","modified_gmt":"2022-03-06T14:07:43","slug":"solr-search-server-with-tika-and-manifoldcf","status":"publish","type":"post","link":"https:\/\/olkn.myvnc.com\/?p=716","title":{"rendered":"solr search server with tika and manifoldcf"},"content":{"rendered":"<p>I finally managed to get my search server running, using solr as the main engine and tika for extraction. The setup is completed by manifoldcf for access to files, emails, wiki, rss and web.<\/p>\n<h1><span class=\"ez-toc-section\" id=\"solr\"><\/span>solr<span class=\"ez-toc-section-end\"><\/span><\/h1>\n<p>A short overview of the basic file structure of solr is shown below:<\/p>\n<h2><span class=\"ez-toc-section\" id=\"filestructure\"><\/span>filestructure<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><code><br \/>\n&lt;solr-home-directory&gt;\/<br \/>\nsolr.xml<br 
\/>\ncore_name1\/<br \/>\n&nbsp;&nbsp;core.properties<br \/>\n&nbsp;&nbsp;conf\/<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;solrconfig.xml<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;managed-schema<br \/>\n&nbsp;&nbsp;data\/<br \/>\n<\/code><\/p>\n<p>And here is my core.properties file: very basic, for a single server without cloud mode.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"coreproperties\"><\/span>core.properties<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><code><br \/>\nname=collection name<br \/>\nconfig=solrconfig.xml<br \/>\ndataDir=collection name\/data<br \/>\n<\/code><\/p>\n<h2><span class=\"ez-toc-section\" id=\"schema_fields_from_tika\"><\/span>schema fields from tika<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The following fields are essential for my setup:<\/p>\n<ul>\n<li>id &#8211; the unique identifier for solr<\/li>\n<li>_version_ &#8211; also used internally by solr<\/li>\n<li>content &#8211; the text representation of the extraction results from tika<\/li>\n<li>ignored_* &#8211; a catchall for any metadata that is not covered by a field in the index<\/li>\n<\/ul>\n<p>The solr install follows the instructions given by the project team. As I am using debian, the solr.in.sh is pretty much standard. Here are the settings:<\/p>\n<p><code><br \/>\nSOLR_PID_DIR=\"\/var\/solr\"<br \/>\nSOLR_HOME=\"\/var\/solr\/data\"<br \/>\nLOG4J_PROPS=\"\/var\/solr\/log4j2.xml\"<br \/>\nSOLR_LOGS_DIR=\"\/var\/solr\/logs\"<br \/>\nSOLR_PORT=\"8983\"<br \/>\n<\/code><\/p>\n<p>Solr is started via the old init.d style script from the project team, without modifications.<\/p>\n<p>The specific managed-schema and solrconfig.xml files are not listed here, but getting them running took the most time. 
Some comments:<\/p>\n<ul>\n<li>grab some information on the metadata extracted by tika to find the fields that are worth a second look<\/li>\n<li>check the configuration given in \/var\/solr\/data\/conf\/<\/li>\n<li>check the logs, especially the solr log at \/var\/solr\/logs\/solr.log<\/li>\n<li>managed-schema should be adjusted for the metadata retrieved through tika<\/li>\n<li>delete any old collection files by removing \/var\/solr\/data\/collection name\/collection name\/index\/<\/li>\n<li>solr cell is responsible for importing\/indexing files in foreign formats like PDF, Word, etc.<\/li>\n<li>set stored false as often as possible<\/li>\n<li>set indexed false as much as possible<\/li>\n<li>remove copyfields as far as possible<\/li>\n<li>set indexed false for text_general<\/li>\n<li>use catchall field for indexing<\/li>\n<li>start JVM in server mode<\/li>\n<li>set logging to higher levels only<\/li>\n<li>integrate everything in tomcat<\/li>\n<li>set indexed or docValues to true but not both<\/li>\n<li>some field type annotations: <a href=\"https:\/\/solr.apache.org\/guide\/8_11\/field-properties-by-use-case.html\">Solr Manual 8.11<\/a><\/li>\n<\/ul>\n<h2><span class=\"ez-toc-section\" id=\"some_interesting_commands\"><\/span>some interesting commands<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ul>\n<li>bin\/solr start<\/li>\n<li>bin\/solr stop -all<\/li>\n<li>bin\/post -c collection input<\/li>\n<li>bin\/solr delete -c collection<\/li>\n<li>bin\/solr create -c collection -d configdir<\/li>\n<\/ul>\n<h2><span class=\"ez-toc-section\" id=\"velocity_setup\"><\/span>velocity setup<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>velocity may be used as a search interface for solr, but my setup is not complete yet.<\/p>\n<h1><span class=\"ez-toc-section\" id=\"tika\"><\/span>tika<span class=\"ez-toc-section-end\"><\/span><\/h1>\n<p>The tika server version is also installed as described by the project team. 
I only added a start script for systemd as follows:<\/p>\n<p><code><br \/>\n[Unit]<br \/>\nDescription=Apache Tika Server<br \/>\nAfter=network.target<br \/>\n<br \/>\n[Service]<br \/>\nType=simple<br \/>\nUser=tika<br \/>\nEnvironment=\"TIKA_INCLUDE=\/etc\/default\/tika.in.sh\"<br \/>\nExecStart=\/usr\/bin\/java -jar \/opt\/tika\/tika-server-standard-2.3.0.jar --port 9998 --config \/opt\/tika\/tika-config.xml<br \/>\nRestart=always<br \/>\n<br \/>\n[Install]<br \/>\nWantedBy=multi-user.target<br \/>\n<\/code><\/p>\n<p>The tika.in.sh is once again copied from the project team's suggestion without modifications:<\/p>\n<p><code><br \/>\nTIKA_PID_DIR=\"\/var\/tika\"<br \/>\nLOG4J_PROPS=\"\/var\/tika\/log4j.properties\"<br \/>\nTIKA_LOGS_DIR=\"\/var\/tika\/logs\"<br \/>\nTIKA_PORT=\"9998\"<br \/>\nTIKA_FORKED_OPTS=\"\"<br \/>\n<\/code><\/p>\n<p>The tika-config.xml is quite empty at the moment, but I hope to get logging running soon.<\/p>\n<h1><span class=\"ez-toc-section\" id=\"ManifoldCF\"><\/span>ManifoldCF<span class=\"ez-toc-section-end\"><\/span><\/h1>\n<p>And finally, manifoldcf is installed from scratch as the interface to the various information sources.<\/p>\n<p>Here is my systemd start script:<br \/>\n<code><br \/>\n[Unit]<br \/>\nDescription=ManifoldCF service<br \/>\n[Service]<br \/>\nWorkingDirectory=\/opt\/manifoldcf\/example<br \/>\nExecStart=\/usr\/bin\/java -Xms512m -Xmx512m -Dorg.apache.manifoldcf.configfile=.\/properties.xml -Dorg.apache.manifoldcf.jettyshutdowntoken=secret_token -Djava.security.auth.login.config= -cp 
.:..\/lib\/mcf-core.jar:..\/lib\/mcf-agents.jar:..\/lib\/mcf-pull-agent.jar:..\/lib\/mcf-ui-core.jar:..\/lib\/mcf-jetty-runner.jar:..\/lib\/jetty-client-9.4.25.v20191220.jar:..\/lib\/jetty-continuation-9.4.25.v20191220.jar:..\/lib\/jetty-http-9.4.25.v20191220.jar:..\/lib\/jetty-io-9.4.25.v20191220.jar:..\/lib\/jetty-jndi-9.4.25.v20191220.jar:..\/lib\/jetty-jsp-9.2.30.v20200428.jar:..\/lib\/jetty-jsp-jdt-2.3.3.jar:..\/lib\/jetty-plus-9.4.25.v20191220.jar:..\/lib\/jetty-schemas-3.1.M0.jar:..\/lib\/jetty-security-9.4.25.v20191220.jar:..\/lib\/jetty-server-9.4.25.v20191220.jar:..\/lib\/jetty-servlet-9.4.25.v20191220.jar:..\/lib\/jetty-util-9.4.25.v20191220.jar:..\/lib\/jetty-webapp-9.4.25.v20191220.jar:..\/lib\/jetty-xml-9.4.25.v20191220.jar:..\/lib\/commons-codec-1.10.jar:..\/lib\/commons-collections-3.2.2.jar:..\/lib\/commons-collections4-4.2.jar:..\/lib\/commons-discovery-0.5.jar:..\/lib\/commons-el-1.0.jar:..\/lib\/commons-exec-1.3.jar:..\/lib\/commons-fileupload-1.3.3.jar:..\/lib\/commons-io-2.5.jar:..\/lib\/commons-lang-2.6.jar:..\/lib\/commons-lang3-3.9.jar:..\/lib\/commons-logging-1.2.jar:..\/lib\/ecj-4.3.1.jar:..\/lib\/gson-2.8.0.jar:..\/lib\/guava-25.1-jre.jar:..\/lib\/httpclient-4.5.8.jar:..\/lib\/httpcore-4.4.10.jar:..\/lib\/jasper-6.0.35.jar:..\/lib\/jasper-el-6.0.35.jar:..\/lib\/javax.servlet-api-3.1.0.jar:..\/lib\/jna-5.3.1.jar:..\/lib\/jna-platform-5.3.1.jar:..\/lib\/json-simple-1.1.1.jar:..\/lib\/jsp-api-2.1-glassfish-2.1.v20091210.jar:..\/lib\/juli-6.0.35.jar:..\/lib\/log4j-1.2-api-2.4.1.jar:..\/lib\/log4j-api-2.4.1.jar:..\/lib\/log4j-core-2.4.1.jar:..\/lib\/mail-1.4.5.jar:..\/lib\/serializer-2.7.1.jar:..\/lib\/slf4j-api-1.7.25.jar:..\/lib\/slf4j-simple-1.7.25.jar:..\/lib\/velocity-1.7.jar:..\/lib\/xalan-2.7.1.jar:..\/lib\/xercesImpl-2.10.0.jar:..\/lib\/xml-apis-1.4.01.jar:..\/lib\/zookeeper-3.4.10.jar:..\/lib\/javax.activation-1.2.0.jar:..\/lib\/javax.activation-api-1.2.0.jar: -jar start.jar<br \/>\nUser=solr<br \/>\nType=simple<br 
\/>\nSuccessExitStatus=143<br \/>\nTimeoutStopSec=10<br \/>\nRestart=on-failure<br \/>\nRestartSec=10<br \/>\n[Install]<br \/>\nWantedBy=multi-user.target<br \/>\n<\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I finally managed to get my search server running, using solr as the main engine and tika for extraction. The setup is completed by manifoldcf for access to files, emails, wiki, rss and web. solr A short overview of the basic file structure of solr is shown below: filestructure &lt;solr-home-directory&gt;\/ solr.xml core_name1\/ core.properties conf\/ solrconfig.xml &hellip; <a href=\"https:\/\/olkn.myvnc.com\/?p=716\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">solr search server with tika and manifoldcf<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16,18,4,7,11],"tags":[35,46,51,79,196,116,195],"class_list":["post-716","post","type-post","status-publish","format-standard","hentry","category-administration","category-configs","category-private","category-projects","category-software","tag-administration","tag-config","tag-debian","tag-linux","tag-search","tag-server","tag-solr"],"_links":{"self":[{"href":"https:\/\/olkn.myvnc.com\/index.php?rest_route=\/wp\/v2\/posts\/716","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/olkn.myvnc.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/olkn.myvnc.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/olkn.myvnc.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/olkn.myvnc.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=716"}],"version-history":[{"count":8,"href":"https:\/\/olkn.myvnc.com\/index.php?rest_route=\/wp\/v2\/posts\/716\/revisions"}],"predecessor-version":[{"id":736,"href":"https:\/\/o
lkn.myvnc.com\/index.php?rest_route=\/wp\/v2\/posts\/716\/revisions\/736"}],"wp:attachment":[{"href":"https:\/\/olkn.myvnc.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=716"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/olkn.myvnc.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=716"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/olkn.myvnc.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=716"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}